# `ratar` tutorial

### `ratar` - Read-Across the TARgetome

This notebook introduces the current state of the `ratar` package in six steps:

1. Import packages
2. Load binding sites
3. Encode binding sites
4. Explore encoded binding site
5. Compare binding sites
6. How to...

In [None]:
%load_ext autoreload
%autoreload 2

## Import packages

In [None]:
import sys
sys.path.append("/home/dominique/Documents/projects/ratar/ratar/")

In [None]:
# FIXME: replace the star imports below with explicit imports
from encoding import *
from similarity import *
from auxiliary import *
import itertools
import seaborn as sns

### Fix issue

In [None]:
from pathlib import Path
from ratar.auxiliary import MoleculeLoader
from ratar.encoding import BindingSite

In [None]:
!head /Users/dominique/Documents/GitHub/ratar/ratar/tests/data/AAK1_4wsq_altA_chainA.mol2

In [None]:
!head /Users/dominique/Documents/GitHub/ratar/ratar/tests/data/AAK1_4wsq_altA_chainB.mol2

In [None]:
p1 = "/Users/dominique/Documents/GitHub/ratar/ratar/tests/data/AAK1_4wsq_altA_chainA.mol2"
p2 = "/Users/dominique/Documents/GitHub/ratar/ratar/tests/data/AAK1_4wsq_altA_chainB.mol2"
b1 = BindingSite.from_file(p1)
b2 = BindingSite.from_file(p2)
print(b1.representatives.pc.iloc[0, :])
print(b2.representatives.pc.iloc[0, :])

In [None]:
b1 == b2
# Expected: False, because the two binding sites come from different chains
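
If the comparison above returns `True` for two different files, one thing to check is how equality is defined on the encoded objects. The sketch below is not ratar's implementation: `Encoding` is a hypothetical stand-in class that compares the stored values pairwise within a tolerance.

```python
import math
from dataclasses import dataclass


@dataclass(eq=False)
class Encoding:
    """Hypothetical stand-in for an encoded binding site (not ratar's class)."""

    name: str
    moments: list  # flat list of floats

    def __eq__(self, other):
        if not isinstance(other, Encoding):
            return NotImplemented
        # Equal only if both hold the same number of values
        # and every pair of values matches within a tolerance
        return len(self.moments) == len(other.moments) and all(
            math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
            for a, b in zip(self.moments, other.moments)
        )


chain_a = Encoding("chainA", [1.0, 2.5, 0.3])  # made-up values
chain_b = Encoding("chainB", [1.1, 2.4, 0.2])  # made-up values
same_as_a = Encoding("copy", [1.0, 2.5, 0.3])

print(chain_a == chain_b)
print(chain_a == same_as_a)
```

The name is deliberately ignored in this sketch, so only the encoded values decide whether two objects are equal.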

## Load binding sites

### 1. Get list of file paths to be loaded

In [None]:
import glob

input_path = "/home/dominique/Documents/data/klifs/egfr_20190506/structures/HUMAN/EGFR/*/pocket_pp.mol2"
input_path_list = glob.glob(input_path)
# Keep only the first five structures for this tutorial
input_path_list = input_path_list[:5]

**Note:** `glob` is otherwise only available through the star imports from the `ratar` modules, so it is imported explicitly here.

In [None]:
input_path_list[0]
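
`glob.glob` returns paths in an arbitrary, file-system-dependent order, so slicing the first five entries may select different structures on different machines. A minimal sketch of a reproducible selection, using a temporary directory with made-up folder names in place of the KLIFS data:

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a few placeholder structure folders (names are illustrative only)
    for name in ["s3", "s1", "s2"]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "pocket_pp.mol2").touch()

    # Sort before slicing so that the same files are selected on every run
    paths = sorted(glob.glob(str(Path(tmp) / "*" / "pocket_pp.mol2")))
    selected = [Path(p).parent.name for p in paths[:2]]

print(selected)
```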

**Note:** The mol2 files were pre-processed for `biopandas` by removing the 10th column:

In [None]:
%%bash
# Remove the 10th column from each mol2 file because biopandas can only read 9 columns
for i in /home/dominique/Documents/data/klifs/egfr_20190506/structures/HUMAN/EGFR/*/pocket.mol2
do
    # Write the result next to the input file as pocket_pp.mol2
    awk '!($10="")' "$i" > "${i%.mol2}_pp.mol2"
done
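
The `awk` expression `!($10="")` empties the 10th field of every line and then prints the line. An atom record in a mol2 file has nine mandatory fields plus an optional status field, which is the one dropped here. A rough Python equivalent for a single line is sketched below; `drop_tenth_field` is a hypothetical helper, and unlike `awk` it collapses the spacing between fields to single spaces.

```python
def drop_tenth_field(line):
    """Remove the 10th whitespace-separated field of a line, if present."""
    fields = line.split()
    if len(fields) >= 10:
        del fields[9]  # index 9 is the 10th field
    return " ".join(fields)


# Atom line with ten fields; the values are made up for illustration
atom_line = "1 N 10.5 20.1 30.7 N.3 1 GLU1 -0.35 BACKBONE"
cleaned = drop_tenth_field(atom_line)
print(cleaned)
```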

### 2. Load file content with `biopandas`

In [None]:
# Load all files
mol_loader = [MolFileLoader(i) for i in input_path_list]

In [None]:
# Get file content as DataFrame
pmols = [i.pmols for i in mol_loader]

In [None]:
pmols

In [None]:
# Flatten the nested list into a single list of DataFrames
pmols = list(itertools.chain.from_iterable(pmols))

In [None]:
print(f'Number of structures: {len(pmols)}')
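
The flattening step implies that each loader returns a list of molecules rather than a single one. The sketch below shows the same `itertools.chain.from_iterable` call on nested lists, with placeholder strings standing in for the DataFrames:

```python
import itertools

# One inner list per file; the strings stand in for the DataFrames
per_file = [["mol_a"], ["mol_b", "mol_c"], []]

# Concatenate the inner lists in order; empty lists contribute nothing
flat = list(itertools.chain.from_iterable(per_file))
print(flat)  # ['mol_a', 'mol_b', 'mol_c']
```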

## Encode binding sites

In [None]:
BindingSite?

In [None]:
binding_sites = [BindingSite.from_molecule(i) for i in pmols]

In [None]:
# Select example binding site
bs = binding_sites[0]

In [None]:
bs.mol.head()

In [None]:
bs.pdb_id

## Explore encoded binding site

### 1. Representatives

In [None]:
bs.repres.repres_dict.keys()

In [None]:
bs.repres.repres_dict["ca"].head()
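
The key `ca` presumably refers to Cα atoms, i.e. one representative atom per residue. A minimal sketch of that selection on a made-up atom list (the tuple layout is hypothetical, not ratar's data structure):

```python
# (atom_name, residue_id, x, y, z); all values are made up
atoms = [
    ("N", 1, 0.0, 0.0, 0.0),
    ("CA", 1, 1.5, 0.0, 0.0),
    ("C", 1, 2.0, 1.4, 0.0),
    ("N", 2, 3.3, 1.5, 0.0),
    ("CA", 2, 4.0, 2.8, 0.0),
]

# Keep only the C-alpha atom of each residue as its representative
ca_atoms = [atom for atom in atoms if atom[0] == "CA"]
print(ca_atoms)
```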

### 2. Subsets

In [None]:
bs.subset.subsets_indices_dict.keys()

In [None]:
bs.subset.subsets_indices_dict['pc']
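
Keys such as `HBA` further down in this notebook suggest that the `pc` representatives are grouped by pharmacophoric type. How ratar assigns these types is not shown here; the sketch below only illustrates collecting row indices per type label, using made-up labels:

```python
from collections import defaultdict

# Hypothetical type label per representative, in row order
labels = ["HBA", "HBD", "HBA", "AR", "HBD"]

# Map each label to the list of row indices that carry it
subset_indices = defaultdict(list)
for index, label in enumerate(labels):
    subset_indices[label].append(index)

print(dict(subset_indices))
```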

### 3. Spatial and physicochemical properties

In [None]:
bs.coord.coord_dict.keys()

In [None]:
bs.coord.coord_dict['ca'].head()

In [None]:
bs.pcprop.pcprop_dict.keys()

In [None]:
bs.pcprop.pcprop_dict['ca'].keys()

In [None]:
bs.pcprop.pcprop_dict['ca']['z123'].head()

### 4. Set up dimensions for binding site representatives (= points)

In [None]:
bs.points.points_dict.keys()

In [None]:
bs.points.points_dict['ca_z123'].head()
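
The name `ca_z123` suggests that each point combines the three coordinates of a Cα atom with three physicochemical descriptors (z1 to z3), which gives six dimensions. This reading is an assumption; the sketch below concatenates made-up values to show the idea:

```python
# Made-up coordinates (x, y, z) for two residues
coordinates = {"GLU1": (1.5, 0.0, 0.0), "LYS2": (4.0, 2.8, 0.0)}
# Placeholder descriptor values (z1, z2, z3); these are not real z-scales
descriptors = {"GLU1": (0.1, -0.2, 0.3), "LYS2": (-0.4, 0.5, -0.6)}

# One 6-dimensional point per residue: (x, y, z, z1, z2, z3)
points = {res: coordinates[res] + descriptors[res] for res in coordinates}
print(points["GLU1"])
```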

In [None]:
bs.points.points_subsets_dict.keys()

In [None]:
bs.points.points_subsets_dict['pc_z123'].keys()

In [None]:
bs.points.points_subsets_dict['pc_z123']['HBA'].head()

### 5. Get encoding methods

In [None]:
bs.shapes.shapes_dict.keys()

In [None]:
bs.shapes.shapes_dict['ca'].keys()

In [None]:
bs.shapes.shapes_dict['ca_z1'].keys()

In [None]:
bs.shapes.shapes_dict['ca_z123'].keys()

### 6. Get reference points

In [None]:
bs.shapes.shapes_dict['ca_z123']['6dim'].keys()

In [None]:
# Reference points!
bs.shapes.shapes_dict['ca_z123']['6dim']['ref_points']
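
Which reference points ratar selects is not visible from this output. As an illustration only, the sketch below uses the scheme known from Ultrafast Shape Recognition (USR) in three dimensions: the centroid, the point closest to it, the point farthest from it, and the point farthest from that one. The function names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of all points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def usr_reference_points(points):
    """Return four USR-style reference points for a list of coordinates."""
    c1 = centroid(points)
    c2 = min(points, key=lambda p: math.dist(p, c1))  # closest to the centroid
    c3 = max(points, key=lambda p: math.dist(p, c1))  # farthest from the centroid
    c4 = max(points, key=lambda p: math.dist(p, c3))  # farthest from c3
    return c1, c2, c3, c4


# Made-up points on a line, so the result is easy to check by hand
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
reference_points = usr_reference_points(points)
print(reference_points)
```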

### 7. Get distances

In [None]:
# Distances from each reference point to all binding site representatives
bs.shapes.shapes_dict['ca_z123']['6dim']['dist'].head()

In [None]:
bs.shapes.shapes_dict['ca_z123']['6dim']['dist'].shape
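
The distance table appears to hold one row per representative and one column per reference point. A minimal sketch of building such a table, assuming Euclidean distances and using made-up points with two arbitrary reference points:

```python
import math

# Made-up representatives and reference points
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
reference_points = {"c1": (0.0, 0.0, 0.0), "c2": (6.0, 8.0, 0.0)}

# One row per point, one column per reference point
distances = [
    {name: math.dist(point, ref) for name, ref in reference_points.items()}
    for point in points
]
for row in distances:
    print(row)
```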

In [None]:
import pandas as pd

data = bs.shapes.shapes_dict['ca_z123']['6dim']['dist']
# Shorten the column labels to their last two characters (returns a copy, the stored table stays unchanged)
data = data.rename(columns={i: i[-2:] for i in data.columns})
# Long format: one row per (reference point, distance) pair
data_m = pd.melt(data, var_name='Reference points', value_name='Distances')

sns.violinplot(x='Reference points', y='Distances', data=data_m, palette='Blues')

### 8. Get fingerprint (moments)

In [None]:
# The moments form the actual binding site fingerprint
bs.shapes.shapes_dict['ca_z123']['6dim']['moments'].head()
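
In USR-like encodings, the fingerprint summarises each distance distribution with a few statistical moments. Which moments ratar uses is not shown in this notebook; the sketch below assumes the mean, the population standard deviation, and the cube root of the third central moment for one column of made-up distances. `first_three_moments` is a hypothetical helper.

```python
import math
import statistics


def first_three_moments(values):
    """Mean, population standard deviation, cube root of the third central moment."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    third = sum((v - mean) ** 3 for v in values) / len(values)
    # Signed cube root, because the third central moment can be negative
    cbrt = math.copysign(abs(third) ** (1 / 3), third)
    return mean, std, cbrt


# Made-up distances from one reference point
moments = first_three_moments([0.0, 5.0, 10.0])
print(moments)
```

Applying the same function to every reference point column and concatenating the results would give one fixed-length vector per binding site.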

## Compare binding sites



In [None]:
# Save binding sites to disk
for i in binding_sites:
    save_binding_site(i, f'/home/dominique/Tmp/encoding/{i.pdb_id}/ratar_encoding.p')
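
The `.p` extension suggests that the encoded objects are pickled, although `save_binding_site` itself is not shown here. A minimal sketch of such a round trip with the standard `pickle` module, using a plain dictionary as a stand-in for a binding site:

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object with made-up content
encoding = {"pdb_id": "example", "moments": [5.0, 4.1, 0.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example" / "ratar_encoding.p"
    # Create the per-structure folder before writing
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(encoding, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == encoding)
```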

In [None]:
# Get all-against-all matrices for different encoding types
aaa_dict = get_similarity_all_against_all('/home/dominique/Tmp/encoding/HUMAN/*/ratar_encoding.p')
aaa_dict.keys()

In [None]:
data = aaa_dict['ca_z123_6dim']
sns.heatmap(data)
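
An all-against-all comparison yields a square matrix with one row and one column per binding site. The similarity measure behind `get_similarity_all_against_all` is not shown in this notebook; the sketch below assumes the USR-style score, i.e. the inverse of one plus the mean absolute difference between two fingerprints, applied to made-up fingerprints.

```python
def similarity(fp_a, fp_b):
    """USR-style score in (0, 1]; 1 means identical fingerprints."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(fp_a, fp_b)) / len(fp_a)
    return 1.0 / (1.0 + mean_abs_diff)


# Made-up fingerprints of equal length
fingerprints = {
    "site_1": [1.0, 2.0, 3.0],
    "site_2": [1.0, 2.0, 6.0],
    "site_3": [4.0, 5.0, 6.0],
}

names = list(fingerprints)
# Row i, column j holds the similarity between site i and site j
matrix = [
    [similarity(fingerprints[row], fingerprints[col]) for col in names]
    for row in names
]
for name, row in zip(names, matrix):
    print(name, row)
```

Because the score is symmetric, only the upper triangle needs to be computed when the number of binding sites is large.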

## How to...

* How should functions be organised? Class-specific functions within the class, more general functions in a separate file?
* How should functions without a return value be documented in their docstrings?

In [None]:
save_binding_site?