# Statistics on missing CA and CB atoms and side chain angles

## Aims of this notebook

### 1. Missing CA and CB atoms

In our fingerprint, both the exposure and the side chain angle features depend on CA and CB atoms.
Here, we investigate at which residue positions and how often overall these atoms are missing in the KLIFS data.

1. Get for each KLIFS molecule CA and CB atom coordinates per residue position.
2. Calculate missing atom rate per residue position: CA, CB and CA+CB missing.
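
The two steps above can be sketched with pandas on a toy table. This is only an illustration, not the `kinsim_structure` implementation: the column names (`molecule`, `klifs_id`, `ca`, `cb`) and all values are hypothetical, and "CA+CB missing" is assumed to mean that both atoms are missing for the same residue.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one row per molecule and residue position.
# `ca` / `cb` stand in for atom coordinates; NaN marks a missing atom.
atoms = pd.DataFrame({
    "molecule": ["m1", "m1", "m2", "m2", "m3", "m3", "m4", "m4"],
    "klifs_id": [1, 2, 1, 2, 1, 2, 1, 2],
    "ca": [1.0, 1.0, 1.0, np.nan, 1.0, np.nan, 1.0, 1.0],
    "cb": [1.0, np.nan, 1.0, np.nan, 1.0, 1.0, 1.0, np.nan],
})

# Flag missing atoms per residue.
missing = pd.DataFrame({
    "klifs_id": atoms["klifs_id"],
    "ca": atoms["ca"].isna(),
    "cb": atoms["cb"].isna(),
})
# Assumption: "CA+CB missing" means both atoms are missing at once.
missing["ca_cb"] = missing["ca"] & missing["cb"]

# The mean of a boolean flag per position is the missing atom rate.
missing_rate = missing.groupby("klifs_id").mean()
print(missing_rate)
```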

### 2. Side chain angle (SCA) distribution

A side chain angle (SCA) is the angle defined by a residue's CA atom, CB atom, and side chain centroid (calculated without backbone atoms and hydrogens).
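
As a minimal sketch of this definition, the angle can be computed with NumPy from three points. The helper names `side_chain_centroid` and `side_chain_angle` are hypothetical, and placing the vertex of the angle at CB is an assumption; the text above does not state which atom is the vertex, and the package may compute the angle differently.

```python
import numpy as np


def side_chain_centroid(side_chain_coords):
    """Unweighted mean of side chain atom coordinates (no backbone, no hydrogens)."""
    return np.mean(np.asarray(side_chain_coords, dtype=float), axis=0)


def side_chain_angle(ca, cb, centroid):
    """Angle in degrees between CB->CA and CB->centroid (vertex at CB is assumed)."""
    ca, cb, centroid = (np.asarray(v, dtype=float) for v in (ca, cb, centroid))
    v1 = ca - cb
    v2 = centroid - cb
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against rounding errors slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


# Toy coordinates: the centroid is perpendicular to the CA-CB axis at CB.
centroid = side_chain_centroid([[1.0, 2.0, 0.0], [1.0, 0.0, 0.0]])
angle = side_chain_angle(ca=[0.0, 0.0, 0.0], cb=[1.0, 0.0, 0.0], centroid=centroid)
print(angle)
```

The function is undefined when the centroid coincides with CB (zero-length vector), so such residues would need separate handling.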

Small amino acids (with tiny side chains) should show little angle variation (and smaller angles), whereas larger amino acids should show more variation (and larger angles).

1. Calculate the angle distribution for each amino acid.
2. Save the molecule and residue code for each angle, in order to trace back interesting angles.
3. Check the diversity of angles per amino acid. If no diversity is observed, the side chain angle might not be a good measure, since it would then depend solely on amino acid type and not on structural conformation.
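
One simple way to quantify the diversity mentioned in step 3 is the per-residue spread of the angles. The sketch below uses made-up angle values and only assumes the column names `residue_name` and `sca`, which this notebook uses later on.

```python
import pandas as pd

# Hypothetical toy angles in degrees; not real KLIFS results.
angles = pd.DataFrame({
    "residue_name": ["SER", "SER", "SER", "LYS", "LYS", "LYS"],
    "sca": [120.0, 120.0, 120.0, 100.0, 130.0, 160.0],
})

# Spread per amino acid: a range or standard deviation near zero means
# the angle carries no information beyond the amino acid type.
spread = angles.groupby("residue_name")["sca"].agg(["min", "max", "std"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```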

## Imports

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
from pathlib import Path
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append('../..')
from kinsim_structure.auxiliary import KlifsMoleculeLoader
from kinsim_structure.analysis import GapRate, SideChainAngleGenerator, SideChainAngleAnalyser

sns.set()
%matplotlib inline

In [None]:
warnings.filterwarnings(action='once')

## IO paths

In [None]:
path_to_kinsim = Path('.') / '..' / '..'
path_to_data = path_to_kinsim / 'examples' / 'data'
path_to_results = path_to_kinsim / 'examples' / 'results' / 'features' / 'sca_centroid_wo_backbone' 

## Load KLIFS metadata

In [None]:
klifs_metadata = pd.read_csv(path_to_data / 'postprocessed' / 'klifs_metadata_postprocessed.csv', index_col=0)

In [None]:
klifs_metadata.shape

## Data generation

### Gap rate

In [None]:
gap_rate = GapRate(klifs_metadata)

### Side chain angle

In [None]:
side_chain_angle_generator = SideChainAngleGenerator()
%time side_chain_angle_generator.get_side_chain_angles(klifs_metadata)

In [None]:
side_chain_angle_generator.data.head()

In [None]:
side_chain_angle_generator.save_data(path_to_results / 'side_chain_angles.csv')

## Data analysis

### Gap rate

In [None]:
gap_rate.plot_gap_rate(
    path_to_results
)

### Missing CA and CB atoms

In [None]:
side_chain_angle_analyser = SideChainAngleAnalyser()
side_chain_angle_analyser.load_data(path_to_results / 'side_chain_angles.csv')

In [None]:
side_chain_angle_analyser.data.head()

In [None]:
side_chain_angle_analyser.data.shape

In [None]:
side_chain_angle_analyser.get_missing_residues_ca_cb(gap_rate)

In [None]:
side_chain_angle_analyser.plot_missing_residues_ca_cb(
    path_to_results
)

In [None]:
# How many residues have a missing CB atom but are not GLY?
side_chain_angle_analyser.data[
    (side_chain_angle_analyser.data.cb.isna()) &
    (side_chain_angle_analyser.data.residue_name != 'GLY')
].shape

### SCA distribution

In [None]:
side_chain_angle_analyser.plot_side_chain_angle_distribution(
    path_to_results, 
    kind='violin'
)

In [None]:
side_chain_angle_analyser.plot_side_chain_angle_distribution(
    path_to_results, 
    kind='histograms'
)

### SCA statistics

In [None]:
side_chain_angle_analyser.data[['residue_name', 'sca']].groupby('residue_name').describe()

In [None]:
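# ALA residues whose side chain angle is not 180.0.
# Each comparison needs its own parentheses because `&` binds tighter than `==` and `!=`.
side_chain_angle_analyser.data[
    (side_chain_angle_analyser.data.residue_name == 'ALA') &
    (side_chain_angle_analyser.data.sca != 180.0)
]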

### SCA mean and median

Get the mean and median side chain angle per amino acid and save them to file.
These values can be used for residues with missing CA/CB atoms.
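
The idea can be sketched in plain pandas as follows. This is not the `get_mean_median` implementation: the toy values are made up, and using the median (rather than the mean) as the fill value is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy angles in degrees; NaN marks residues whose angle
# could not be calculated because CA or CB is missing.
angles = pd.DataFrame({
    "residue_name": ["LYS", "LYS", "LYS", "LYS", "SER", "SER", "SER"],
    "sca": [100.0, 120.0, 170.0, np.nan, 110.0, 114.0, np.nan],
})

# Mean and median per amino acid (NaN values are skipped).
stats = angles.groupby("residue_name")["sca"].agg(["mean", "median"])

# Assumption: fill missing angles with the per-amino-acid median.
fill_values = angles["residue_name"].map(stats["median"])
filled = angles["sca"].fillna(fill_values)
print(stats)
print(filled)
```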

In [None]:
side_chain_angle_analyser.get_mean_median(
    from_file=path_to_results / 'stats_missing_ca_cb_and_sca.p'
)