# Non-standard amino acids in KLIFS molecules

Aims of this notebook:
1. For all KLIFS molecules, collect and save all non-standard amino acids.
2. Check how many non-standard amino acids occur. Some `kinsim_structure` features are defined only for standard amino acids, so we need to check how much information we lose in our dataset.

## Imports and functions

In [1]:
from pathlib import Path
import pickle
import sys

import pandas as pd

sys.path.extend(['./..'])
from kinsim_structure.analysis import get_non_standard_amino_acids_in_klifs

## IO paths

In [2]:
path_to_data = Path('/') / 'home' / 'dominique' / 'Documents' / 'data' / 'kinsim' / '20190724_full'
path_to_kinsim = Path('/') / 'home' / 'dominique' / 'Documents' / 'projects' / 'kinsim_structure'

metadata_path = path_to_data / 'preprocessed' / 'klifs_metadata_preprocessed.csv'
output_path = path_to_kinsim / 'results' / '20190724_full' / 'non_standard_aminoacids.p'

## Load KLIFS metadata

In [3]:
klifs_metadata = pd.read_csv(metadata_path)

In [4]:
klifs_metadata.shape

(3920, 23)

## Data generation

In [5]:
# Get non-standard amino acids in KLIFS dataset
non_standard_aminoacids = get_non_standard_amino_acids_in_klifs(klifs_metadata)
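
The internals of `get_non_standard_amino_acids_in_klifs` are not shown in this notebook. As a rough illustration of the idea only, the sketch below flags residue names that are not among the 20 standard three-letter codes. The names `STANDARD_AMINO_ACIDS`, `find_non_standard` and `example_pocket` are hypothetical and are not part of `kinsim_structure`.

```python
# Hypothetical illustration; NOT the kinsim_structure implementation.
STANDARD_AMINO_ACIDS = {
    'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
    'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL',
}


def find_non_standard(residue_names):
    """Return the sorted, unique residue names that are not standard amino acids."""
    return sorted(set(residue_names) - STANDARD_AMINO_ACIDS)


# Made-up residue list for demonstration (not taken from a real structure)
example_pocket = ['LEU', 'GLY', 'PTR', 'VAL', 'MSE', 'LYS', 'MSE']
print(find_non_standard(example_pocket))  # prints ['MSE', 'PTR']
```

Unlike this sketch, the real function returns one list per KLIFS molecule, keyed by the molecule identifier, as the output further down shows.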

In [6]:
with open(output_path, 'wb') as f:
    pickle.dump(non_standard_aminoacids, f)

## Data analysis

In [7]:
with open(output_path, 'rb') as f:
    non_standard_aminoacids = pickle.load(f)

In [8]:
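# Note: len() counts dictionary entries, i.e. KLIFS molecules with at least one non-standard amino acid
print(f'Number of non-standard amino acids in KLIFS dataset: {len(non_standard_aminoacids)}')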

Number of non-standard amino acids in KLIFS dataset: 17
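
To put this number into perspective, it can be related to the 3920 rows of the metadata table loaded above. This is a minimal sketch that assumes each metadata row corresponds to one KLIFS molecule. The variable names are illustrative and the two counts are copied from the outputs above.

```python
# Counts copied from the notebook outputs above
n_structures_total = 3920    # rows in klifs_metadata (assumed: one row per molecule)
n_structures_affected = 17   # entries in non_standard_aminoacids

fraction_affected = n_structures_affected / n_structures_total
print(f'{fraction_affected:.2%} of KLIFS molecules contain non-standard amino acids')
```

This is a share of molecules, not of residues. Since only individual residue positions are affected in each of these molecules, the per-residue share is presumably smaller still.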


In [9]:
pd.Series(non_standard_aminoacids)

HUMAN/ADCK3_5i35_chainA            [MSE]
HUMAN/AurA_4j8m_chainA             [CAF]
HUMAN/CDK2_1oir_chainA             [KCX]
HUMAN/CDK2_2cjm_chainC             [PTR]
HUMAN/CHK1_2ydj_chainA             [CSS]
HUMAN/DNAPK_5luq_chainB            [MSE]
HUMAN/EGFR_2j5e_chainA             [CY0]
HUMAN/IRAK4_2o8y_chainB            [MSE]
HUMAN/JNK3_2r9s_chainB             [OCY]
HUMAN/MAPKAPK2_1nxk_chainA         [MSE]
HUMAN/PIM1_1yhs_chainA             [CME]
HUMAN/PIM1_1yi3_chainA             [CME]
HUMAN/PIM1_1yi4_chainA             [CME]
HUMAN/PIM1_5o12_chainA             [CME]
HUMAN/RET_5fm3_chainA              [PTR]
HUMAN/RIOK1_4otp_chainA       [PHD, MSE]
HUMAN/SRC_1yi6_chainA              [PTR]
dtype: object

In [10]:
# Get set of non-standard amino acids
non_standards = [aa for residues in non_standard_aminoacids.values() for aa in residues]
non_standards_set = set(non_standards)

In [11]:
non_standards_set

{'CAF', 'CME', 'CSS', 'CY0', 'KCX', 'MSE', 'OCY', 'PHD', 'PTR'}
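
Besides the set of distinct residue types, it may be useful to see how often each type occurs. The sketch below counts occurrences with `collections.Counter`. The dictionary is copied by hand from the `pd.Series` output above so that the example is self-contained, and the name `observed` is illustrative. In the notebook, `non_standard_aminoacids` could be used directly.

```python
from collections import Counter

# Copied from the pd.Series output above (molecule -> non-standard residue names)
observed = {
    'HUMAN/ADCK3_5i35_chainA': ['MSE'],
    'HUMAN/AurA_4j8m_chainA': ['CAF'],
    'HUMAN/CDK2_1oir_chainA': ['KCX'],
    'HUMAN/CDK2_2cjm_chainC': ['PTR'],
    'HUMAN/CHK1_2ydj_chainA': ['CSS'],
    'HUMAN/DNAPK_5luq_chainB': ['MSE'],
    'HUMAN/EGFR_2j5e_chainA': ['CY0'],
    'HUMAN/IRAK4_2o8y_chainB': ['MSE'],
    'HUMAN/JNK3_2r9s_chainB': ['OCY'],
    'HUMAN/MAPKAPK2_1nxk_chainA': ['MSE'],
    'HUMAN/PIM1_1yhs_chainA': ['CME'],
    'HUMAN/PIM1_1yi3_chainA': ['CME'],
    'HUMAN/PIM1_1yi4_chainA': ['CME'],
    'HUMAN/PIM1_5o12_chainA': ['CME'],
    'HUMAN/RET_5fm3_chainA': ['PTR'],
    'HUMAN/RIOK1_4otp_chainA': ['PHD', 'MSE'],
    'HUMAN/SRC_1yi6_chainA': ['PTR'],
}

# Count how many list entries mention each residue type
counts = Counter(aa for residues in observed.values() for aa in residues)

for residue_name, count in counts.most_common():
    print(residue_name, count)
```

Each count is the number of list entries for that residue type. Whether a list can hold the same residue name more than once per molecule depends on the generating function and is not verified here.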

In [12]:
# SMILES for non-standard residues (CSO is listed for reference; it is not in the set above)
smiles = {
    'CAF': 'C[As](C)(=O)SC[C@H](N)C(O)=O',
    'CME': 'N[C@@H](CSSCCO)C(O)=O',
    'CSO': 'N[C@@H](CSO)C(O)=O',
    'CSS': 'N[C@@H](CSS)C(O)=O',
    'CY0': 'N[C@@H](CSCCC(=O)Nc1ccc2ncnc(Nc3ccccc3)c2c1)C(O)=O',
    'KCX': 'N[C@@H](CCCCNC(O)=O)C(O)=O',
    'MSE': 'C[Se]CC[C@H](N)C(O)=O',
    'OCY': 'N[C@@H](CSCCO)C(O)=O',
    'PHD': 'N[C@@H](CC(=O)OP(O)(O)=O)C(O)=O',
    'PTR': 'N[C@@H](Cc1ccc(OP(O)(O)=O)cc1)C(O)=O',
}

Check these non-standard residues in the PDB:
* CAF, CME, CSO, CSS, OCY: L-peptide linking (parent CYS)
* KCX: L-peptide linking (parent LYS)
* MSE: L-peptide linking (parent MET) - SELENOMETHIONINE
* PHD: L-peptide linking (parent ASP) - ASPARTYL PHOSPHATE
* PTR: L-peptide linking (parent TYR) - O-PHOSPHOTYROSINE

Additional phosphorylated residues:
* SEP: L-peptide linking (parent SER) - PHOSPHOSERINE
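
One possible way to limit information loss would be to map each non-standard residue to its parent amino acid before computing features. The sketch below encodes the parent assignments listed above. `PARENT_RESIDUE` and `to_parent` are hypothetical names, not part of `kinsim_structure`. CY0 is included as an assumption, inferred from the cysteine thioether in its SMILES string.

```python
# Parent residues as listed above; hypothetical helper, not a kinsim_structure API.
PARENT_RESIDUE = {
    'CAF': 'CYS',
    'CME': 'CYS',
    'CSO': 'CYS',
    'CSS': 'CYS',
    'OCY': 'CYS',
    'CY0': 'CYS',  # assumption: inferred from its SMILES (cysteine thioether)
    'KCX': 'LYS',
    'MSE': 'MET',
    'PHD': 'ASP',
    'PTR': 'TYR',
    'SEP': 'SER',
}


def to_parent(residue_name):
    """Return the parent amino acid if one is known, else the name unchanged."""
    return PARENT_RESIDUE.get(residue_name, residue_name)


print([to_parent(name) for name in ['LEU', 'MSE', 'PTR']])  # prints ['LEU', 'MET', 'TYR']
```

Whether such a substitution is appropriate depends on the feature. Replacing PTR with TYR, for example, discards the phosphate group, which changes the charge and size of the side chain.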

Open question: should phosphorylated amino acids be included?