# Statistics on missing CA and CB atoms and side chain orientation angles

## Aims of this notebook

### 1. Missing CA and CB atoms

In our fingerprint, both the exposure and the side chain orientation features depend on CA and CB atoms.
Here, we investigate at which residue positions and how often overall these atoms are missing in the KLIFS data.

1. Get for each KLIFS molecule CA and CB atom coordinates per residue position.
2. Calculate missing atom rate per residue position: CA, CB and CA+CB missing.
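
The two steps above can be sketched with pandas as follows. This is a minimal illustration on made-up toy data, not the `kinsim_structure` implementation: the table layout, the column names (`molecule`, `klifs_id`, `ca_x`, `cb_x`) and the reading of "CA+CB missing" as "both atoms missing" are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per (molecule, residue position).
# NaN stands in for an atom that is missing in the structure.
atoms = pd.DataFrame({
    "molecule": ["m1", "m1", "m2", "m2", "m3", "m3"],
    "klifs_id": [1, 2, 1, 2, 1, 2],
    "ca_x": [1.0, 2.0, np.nan, 2.1, 1.1, np.nan],
    "cb_x": [1.5, np.nan, np.nan, 2.6, 1.6, np.nan],
})

# Flag missing atoms per row.
missing = pd.DataFrame({
    "klifs_id": atoms["klifs_id"],
    "ca": atoms["ca_x"].isna(),
    "cb": atoms["cb_x"].isna(),
})
# Assumption: "CA+CB missing" means both atoms are missing.
missing["ca_cb"] = missing["ca"] & missing["cb"]

# The mean of a boolean column is the fraction of True values,
# i.e. the missing atom rate per residue position.
missing_rate = missing.groupby("klifs_id")[["ca", "cb", "ca_cb"]].mean()
print(missing_rate)
```

Note that glycine has no CB atom, so a CB missing rate computed this way would count glycine residues as missing unless they are handled separately.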

### 2. Side chain orientation (SCO) distribution

The SCO describes, as the name says, the orientation of a side chain. We need to make sure that the SCO is not just another measure of size (which our fingerprint already contains) but truly captures different orientations of a given amino acid.

Small amino acids (with tiny side chains) should show little angle diversity (and smaller angles), whereas larger ones should show more diversity (and larger angles).

1. Calculate for each amino acid the angle distribution.
2. Save molecule and residue code for each angle, in order to trace back interesting angles.
3. Check the diversity of angles per amino acid. If no diversity is observed, side chain orientation might not be a good measure, since it would then depend solely on amino acid type rather than on structural conformation.
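
As a rough sketch of the diversity check and the trace-back described in the steps above, the angles can be grouped by amino acid and summarised by their spread. The toy values, the column names (`molecule`, `residue_name`, `klifs_id`, `sco`) and the 120 degree cutoff are all hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data: one SCO angle (in degrees) per residue,
# stored together with molecule and residue identifiers.
angles = pd.DataFrame({
    "molecule": ["m1", "m2", "m3", "m1", "m2", "m3"],
    "residue_name": ["ALA", "ALA", "ALA", "LYS", "LYS", "LYS"],
    "klifs_id": [5, 5, 5, 17, 17, 17],
    "sco": [10.0, 12.0, 11.0, 40.0, 90.0, 140.0],
})

# Spread of angles per amino acid: a narrow range suggests that the
# angle mostly reflects amino acid type, not conformation.
spread = angles.groupby("residue_name")["sco"].agg(["count", "min", "max", "std"])
spread["range"] = spread["max"] - spread["min"]
print(spread)

# Trace back "interesting" angles to their molecule and residue
# (the 120 degree cutoff is arbitrary).
interesting = angles.loc[angles["sco"] > 120, ["molecule", "residue_name", "klifs_id", "sco"]]
print(interesting)
```

Keeping the identifiers in the same table as the angles is what makes the second step of the list possible without a separate lookup.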

## Imports

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
from pathlib import Path
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from kinsim_structure.auxiliary import KlifsMoleculeLoader
from kinsim_structure.analysis import ResidueConservation, GapRate, SideChainOrientationStatistics

sns.set()
%matplotlib inline

## IO paths

In [4]:
path_to_data = Path('/') / 'home' / 'dominique' / 'Documents' / 'data' / 'kinsim' / '20190724_full'
path_to_kinsim = Path('/') / 'home' / 'dominique' / 'Documents' / 'projects' / 'kinsim_structure'
path_to_results = path_to_kinsim / 'results'

metadata_path = path_to_data / 'preprocessed' / 'klifs_metadata_preprocessed.csv'

## Load KLIFS metadata

In [5]:
klifs_metadata = pd.read_csv(metadata_path)

In [6]:
klifs_metadata.shape

(3920, 23)

In [7]:
klifs_metadata.head()

| Unnamed: 0.1 | Unnamed: 0 | index | kinase | family | groups | pdb_id | chain | alternate_model | species | ligand_orthosteric_name | ... | dfg | ac_helix | rmsd1 | rmsd2 | qualityscore | pocket | resolution | missing_residues | missing_atoms | full_ifp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2886 | AAK1 | NAK | Other | 4wsq | B | A | Human | K-252A | ... | in | in | 0.777 | 2.125 | 8.6 | EVLAEGGFAIVFLCALKRMVCKREIQIMRDLSKNIVGYIDSLILMD... | 1.95 | 0 | 14 | 0000000000000010000001000000000000000000000000... |
| 1 | 1 | 10043 | AAK1 | NAK | Other | 5l4q | A | A | Human | ~{N}-[5-(4-cyanophenyl)-1~{H}-pyrrolo[2,3-b]py... | ... | in | in | 0.78 | 2.137 | 9.7 | EVLAEGGFAIVFLCALKRMVCKREIQIMRDLSKNIVGYIDSLILMD... | 1.97 | 0 | 3 | 0000000000000010000000000000000000000000000000... |
| 2 | 2 | 7046 | AAK1 | NAK | Other | 5te0 | A | - | Human | methyl (3Z)-3-{[(4-{methyl[(4-methylpiperazin-... | ... | in | in | 0.776 | 2.12 | 8.8 | EVLAEGGFAIVFLCALKRMVCKREIQIMRDLSKNIVGYIDSLILMD... | 1.9 | 0 | 12 | 1000101000000010000001000000000000000000000000... |
| 3 | 3 | 843 | ABL1 | Abl | TK | 2f4j | A | - | Human | CYCLOPROPANECARBOXYLIC ACID {4-[4-(4-METHYL-PI... | ... | in | in | 0.779 | 2.128 | 8.0 | HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITE... | 1.91 | 0 | 0 | 0000000000000010000001000000000000000000000000... |
| 4 | 4 | 815 | ABL1 | Abl | TK | 2g1t | A | - | Human | - | ... | in | out | 0.825 | 2.154 | 8.0 | HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITE... | 1.8 | 0 | 0 | |


## Data generation

In [8]:
gap_rate = GapRate(klifs_metadata)

In [11]:
sco_stats = SideChainOrientationStatistics()
sco_stats.from_metadata(klifs_metadata)

1/3920
2/3920
3/3920
4/3920
5/3920
6/3920
7/3920
8/3920
9/3920
10/3920
11/3920
12/3920
13/3920

KeyboardInterrupt: 

In [None]:
with open(path_to_results / 'stats_missing_ca_cb_and_sco.p', 'wb') as f:
    pickle.dump(sco_stats, f)

## Gap rate

In [None]:
gap_rate.data.iloc[40:55]

In [None]:
gap_rate.plot_gap_rate(path_to_results)

## Missing CA and CB atoms

In [None]:
with open(path_to_results / 'stats_missing_ca_cb_and_sco.p', 'rb') as f:
    sco_stats = pickle.load(f)

In [None]:
sco_stats.data

In [None]:
sco_stats.data.shape

In [None]:
sco_stats.get_missing_residues_ca_cb(gap_rate)

In [None]:
sco_stats.plot_missing_residues_ca_cb(path_to_results)

## SCO angle distribution

In [None]:
sco_stats.data

In [None]:
sco_stats.plot_side_chain_orientation_distribution(path_to_results, kind='violin')

In [None]:
sco_stats.plot_side_chain_orientation_distribution(path_to_results, kind='histograms')

Get the mean and median of the side chain orientation angles per amino acid and save them to file.
Use these values for residues with missing CA/CB atoms.

In [None]:
sco_stats.get_mean_median(from_file='../results/stats_missing_ca_cb_and_sco.p')
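
The fallback described above (per-amino-acid summary values for residues without CA/CB atoms) might look like the following in plain pandas. This is a sketch under assumptions, not what `get_mean_median` does internally: the toy data, the column names (`residue_name`, `sco`) and the choice of the median over the mean as fill value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks residues for which no angle could
# be calculated because CA and/or CB atoms are missing.
sco = pd.DataFrame({
    "residue_name": ["ALA", "ALA", "ALA", "ALA", "LYS", "LYS", "LYS"],
    "sco": [10.0, 14.0, 30.0, np.nan, 80.0, 100.0, np.nan],
})

# Mean and median per amino acid; NaN values are skipped by default.
summary = sco.groupby("residue_name")["sco"].agg(["mean", "median"])
print(summary)

# Assumption: fill missing angles with the median of the same amino acid.
fill_values = sco["residue_name"].map(summary["median"])
sco["sco_filled"] = sco["sco"].fillna(fill_values)
print(sco)
```

The median is less sensitive to a few extreme angles than the mean, which is one reason to prefer it as a fill value when the distribution is skewed.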