# Kinase data preprocessing

This notebook performs the following preprocessing tasks:

1. Set path to KLIFS download files
2. Load, merge and filter KLIFS metadata
3. Remove KLIFS metadata entries with missing mol2 files
4. Download PDB files for KLIFS metadata
5. Remove KLIFS metadata entries with missing PDB files 
6. Remove KLIFS metadata entries with unparsable PDB files
7. Remove KLIFS metadata entries with underscored residue IDs in mol2 file
8. Remove structures with KLIFS residue X
9. Filter by resolution and quality score
10. Save final KLIFS dataset (metadata)

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
from pathlib import Path
import sys

from Bio.PDB import PDBList
import numpy as np
import pandas as pd

sys.path.append('../..')
from kinsim_structure.auxiliary import split_klifs_code, get_klifs_regions
from kinsim_structure.preprocessing import get_klifs_metadata_from_files, download_from_pdb
from kinsim_structure.preprocessing import get_species, get_dfg, get_unique_pdbid_per_kinase
from kinsim_structure.preprocessing import drop_missing_mol2s, drop_missing_pdbs, drop_unparsable_pdbs
from kinsim_structure.preprocessing import drop_underscored_residue_ids, drop_residue_x

## Globals

### 1. Set path to KLIFS download files

In [4]:
# Path to data directory
dataset_name = '20190724_full'

path_to_kinsim = Path('.') / '..' / '..'
path_to_data = Path('/') / 'home' / 'dominique' / 'Documents' / 'data' / 'kinsim' / dataset_name

path_to_results = path_to_kinsim / 'results'

### 2. Load, merge and filter KLIFS metadata

#### Load and merge KLIFS download metadata files

In [5]:
klifs_overview_file = path_to_data / 'raw' / 'KLIFS_download' /'overview.csv'
klifs_export_file = path_to_data / 'raw'/ 'KLIFS_export.csv'

In [6]:
klifs_metadata = get_klifs_metadata_from_files(klifs_overview_file, klifs_export_file)

In [7]:
klifs_metadata.shape

(10136, 21)

In [8]:
klifs_metadata.groups.unique()

array(['CMGC', 'TK', 'CAMK', 'AGC', 'Other', 'STE', 'TKL', 'CK1',
       'Atypical'], dtype=object)

In [9]:
klifs_metadata.groupby(by='groups').size()

groups
AGC          677
Atypical     361
CAMK        1079
CK1          378
CMGC        2753
Other        787
STE          596
TK          2705
TKL          800
dtype: int64

#### Filter metadata by species

Keep only human entries.

In [10]:
klifs_metadata.groupby('species').size()

species
Human    9661
Mouse     475
dtype: int64

In [11]:
klifs_metadata_filtered = get_species(klifs_metadata, species='Human')

In [12]:
klifs_metadata_filtered.shape

(9661, 21)

#### Filter metadata by DFG loop position

Keep only structures with DFG-in loops.

In [13]:
klifs_metadata_filtered.groupby('dfg').size()

dfg
in          8449
na           215
out          698
out-like     299
dtype: int64

In [14]:
klifs_metadata_filtered = get_dfg(klifs_metadata_filtered, dfg='in')

In [15]:
klifs_metadata_filtered.shape

(8449, 21)

#### Filter metadata by unique kinase-PDB ID combinations

For each kinase-PDB ID combination, keep only the KLIFS entry with the best quality score.

In [16]:
klifs_metadata_filtered = get_unique_pdbid_per_kinase(klifs_metadata_filtered)

In [17]:
klifs_metadata_filtered.shape

(3935, 22)
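
The selection described above can be sketched with plain pandas. This is a minimal illustration on made-up data, not the implementation of `get_unique_pdbid_per_kinase`; the helper name `keep_best_entry_per_kinase_pdb` and the toy values are assumptions, and only the column names `kinase`, `pdb_id`, `chain` and `qualityscore` are taken from the metadata shown in this notebook.

```python
import pandas as pd


def keep_best_entry_per_kinase_pdb(metadata: pd.DataFrame) -> pd.DataFrame:
    """Keep the highest-scoring row per (kinase, pdb_id) pair (illustrative helper)."""
    return (
        metadata
        # Stable sort: among equal scores, the original row order is preserved
        .sort_values('qualityscore', ascending=False, kind='mergesort')
        # The first row of each pair is now the one with the best score
        .drop_duplicates(subset=['kinase', 'pdb_id'], keep='first')
        .sort_index()
    )


# Made-up toy data: two chains for the pair (KIN_A, aaaa)
toy = pd.DataFrame({
    'kinase': ['KIN_A', 'KIN_A', 'KIN_A', 'KIN_B'],
    'pdb_id': ['aaaa', 'aaaa', 'bbbb', 'cccc'],
    'chain': ['A', 'D', 'B', 'B'],
    'qualityscore': [6.4, 8.0, 8.0, 6.4],
})

best = keep_best_entry_per_kinase_pdb(toy)
```

In this sketch, ties in the quality score are resolved by keeping the entry that appears first in the table; the actual function may break ties differently.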

### 3. Remove KLIFS metadata entries with missing mol2 files

In [18]:
klifs_metadata_filtered = drop_missing_mol2s(klifs_metadata_filtered, path_to_data)

In [19]:
klifs_metadata_filtered.shape

(3922, 22)
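
As a rough illustration of this step, the sketch below keeps only rows whose mol2 file exists on disk. The directory layout (`<root>/<SPECIES>/<kinase>/<pdb_id>_chain<chain>/pocket.mol2`), the helper `has_mol2` and the toy entries are assumptions made for the example; `drop_missing_mol2s` may locate files differently.

```python
import tempfile
from pathlib import Path

import pandas as pd


def has_mol2(row: pd.Series, root: Path) -> bool:
    """Check whether a mol2 file exists for one metadata row (hypothetical layout)."""
    folder = root / row.species.upper() / row.kinase / f'{row.pdb_id}_chain{row.chain}'
    return (folder / 'pocket.mol2').is_file()


# Made-up toy metadata
toy = pd.DataFrame({
    'species': ['Human', 'Human'],
    'kinase': ['KIN_A', 'KIN_B'],
    'pdb_id': ['aaaa', 'bbbb'],
    'chain': ['A', 'B'],
})

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a mol2 file for the first toy entry only
    present = root / 'HUMAN' / 'KIN_A' / 'aaaa_chainA'
    present.mkdir(parents=True)
    (present / 'pocket.mol2').write_text('@<TRIPOS>MOLECULE\n')

    # Boolean mask: True where the file exists
    mask = toy.apply(has_mol2, axis=1, root=root)
    toy_with_mol2 = toy[mask].copy()
```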

In [20]:
print(f'Number of unique PDB IDs in dataset: {klifs_metadata_filtered.pdb_id.unique().size}')

Number of unique PDB IDs in dataset: 3916


In [21]:
# Check whether any PDB ID occurs more than once, i.e. is associated with multiple kinases
grouped = klifs_metadata_filtered.groupby('pdb_id')['kinase'].size()
multiple_pdb_ids = list(grouped[grouped > 1].index)
klifs_metadata_filtered.loc[klifs_metadata_filtered.pdb_id.isin(multiple_pdb_ids)].sort_values('pdb_id')

Unnamed: 0,index,kinase,family,groups,pdb_id,chain,alternate_model,species,ligand_orthosteric_name,ligand_orthosteric_pdb_id,...,dfg,ac_helix,rmsd1,rmsd2,qualityscore,pocket,resolution,missing_residues,missing_atoms,full_ifp
2573,1013,MAPKAPK2,MAPKAPK,CAMK,2onl,C,-,Human,-,-,...,in,in,0.798,2.852,8.0,NAIIDDYKVKVLQFALKMLKARREVELHWRASPHIVRIVDVLIVME...,4.0,0,8,
3809,3825,p38a,MAPK,CMGC,2onl,B,-,Human,-,-,...,in,out-like,0.822,2.133,9.0,SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTH...,4.0,0,10,
314,2363,BRAF,RAF,TKL,4mne,B,B,Human,-,-,...,in,in,0.796,1.986,6.4,QRIG____GTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQ...,2.85,4,0,
2530,2555,MAP2K1,STE7,STE,4mne,H,B,Human,PHOSPHOMETHYLPHOSPHONIC ACID ADENYLATE ESTER,ACP,...,in,out,0.833,2.218,8.0,SELGAGNGGVVFKMARKLIQIIRELQVLHECNPYIVGFYGASICME...,2.85,0,0,0000000000000010000001000000100000010000001001...
1734,1950,Erk2,MAPK,CMGC,4nif,B,B,Human,PHOSPHOAMINOPHOSPHONIC ACID-ADENYLATE ESTER,ANP,...,in,in,0.784,2.095,8.0,SYIGEGAYGMVCSVAIKKIRTLREIKILLRFRENIIGINDIYIVQD...,2.15,0,0,0000000000000010000001000000100000010000000001...
3278,4272,RSK1-b,RSKb,CAMK,4nif,D,A,Human,-,-,...,in,in,0.785,2.114,9.6,ETIGVGSYSECKRYAVKVIDPSEEIEILLRYGPNIITLKDVYLVTE...,2.15,0,4,
1529,678,EGFR,EGFR,TK,4riw,D,-,Human,ADENOSINE-5'-DIPHOSPHATE,ADP,...,in,in,0.787,2.091,8.0,KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQ...,3.1,0,0,0000000000000010000000000000000000000000000000...
1691,1491,ErbB3,EGFR,TK,4riw,C,-,Human,PHOSPHOAMINOPHOSPHONIC ACID-ADENYLATE ESTER,ANP,...,in,out,0.84,2.205,8.0,KVLGSGVFGTVHKVCIKVIAVTDHMLAIGSLDAHIVRLLGLQLVTQ...,3.1,0,0,0000000000000010000001000000100000000000000001...
1530,614,EGFR,EGFR,TK,4rix,B,-,Human,ADENOSINE-5'-DIPHOSPHATE,ADP,...,in,in,0.792,2.087,8.0,KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQ...,3.1,0,0,0000000000000010000000000000000000000000000000...
1692,1473,ErbB3,EGFR,TK,4rix,A,-,Human,PHOSPHOAMINOPHOSPHONIC ACID-ADENYLATE ESTER,ANP,...,in,out,0.845,2.213,8.0,KVLGSGVFGTVHKVCIKVIAVTDHMLAIGSLDAHIVRLLGLQLVTQ...,3.1,0,0,0000000000000010000001000000000000000000000000...


### 4. Download PDB files for KLIFS metadata

In [22]:
# Download CIF files that do not exist locally yet
download_from_pdb(klifs_metadata_filtered, path_to_data)

### 5. Remove KLIFS metadata entries with missing PDB files 

Check whether a CIF file was downloaded for every PDB ID in the KLIFS metadata. Entries whose PDB ID has no corresponding CIF file are removed from the metadata.

In [23]:
# Get PDBs in KLIFS metadata
pdb_ids_metadata = klifs_metadata_filtered.pdb_id.unique()

# Get PDBs for downloaded cif files
pdb_ids_ciffiles = [i.stem for i in (path_to_data / 'raw' / 'PDB_download').glob('*')]

In [24]:
# PDB IDs in the KLIFS metadata without a downloaded CIF file (e.g. obsolete PDB entries)
missing_cifs = set(pdb_ids_metadata) - set(pdb_ids_ciffiles)
print(f'Number of KLIFS metadata PDB IDs with missing CIF file: {len(missing_cifs)}')

Number of KLIFS metadata PDB IDs with missing CIF file: 0


In [25]:
# In case of missing cif files, try to download them again
pdbfile = PDBList()
for i in missing_cifs:
    pdbfile.retrieve_pdb_file(i, pdir=path_to_data / 'raw' / 'PDB_download')

In [26]:
# In case of missing cif files, delete corresponding PDB ID entries in KLIFS metadata
klifs_metadata_filtered = drop_missing_pdbs(klifs_metadata_filtered, path_to_data)

In [27]:
klifs_metadata_filtered.shape

(3922, 22)

### 6. Remove KLIFS metadata entries with unparsable PDB files

Remove entries whose CIF file cannot be parsed with `Bio.PDB.MMCIFParser`.

In [29]:
klifs_metadata_filtered.shape

(3920, 22)
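
The cell that performs this step is not shown here, so the following is only a sketch of how unparsable files could be detected: try to parse each file and collect those that raise an exception. To keep the example self-contained, the parser is passed in as a callable and replaced by a stand-in (`toy_parse`); the names `find_unparsable` and `toy_parse` and the file names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def find_unparsable(paths: Iterable[Path], parse: Callable[[Path], object]) -> List[Path]:
    """Return the paths for which `parse` raises an exception."""
    failed = []
    for path in paths:
        try:
            parse(path)
        except Exception:
            # Any parser error marks the file as unparsable
            failed.append(path)
    return failed


def toy_parse(path: Path) -> str:
    # Stand-in parser (hypothetical): rejects file names containing 'broken'
    if 'broken' in path.name:
        raise ValueError(f'Cannot parse {path}')
    return path.stem


paths = [Path('aaaa.cif'), Path('broken.cif'), Path('bbbb.cif')]
unparsable = find_unparsable(paths, toy_parse)
```

With Biopython installed, `parse` could for instance be `lambda p: MMCIFParser(QUIET=True).get_structure(p.stem, str(p))`; the stems of the returned paths would then be the PDB IDs to drop from the metadata.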

### 7. Remove KLIFS metadata entries with underscored residue IDs in mol2 file

In [31]:
klifs_metadata_filtered.shape

(3918, 22)
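
The cell for this step is not shown either. One possible way to detect such files, assuming that "underscored residue IDs" refers to an underscore in the substructure name column of the mol2 `ATOM` records, is sketched below. The helper `residue_names_with_underscore` and the toy mol2 content are made up for illustration and do not reflect the behaviour of `drop_underscored_residue_ids`.

```python
def residue_names_with_underscore(mol2_text: str) -> list:
    """Collect substructure names containing '_' from the ATOM section of a mol2 string."""
    flagged = []
    in_atom_section = False
    for line in mol2_text.splitlines():
        if line.startswith('@<TRIPOS>'):
            # Track whether we are inside the ATOM record block
            in_atom_section = line.strip() == '@<TRIPOS>ATOM'
            continue
        fields = line.split()
        # ATOM record: atom_id atom_name x y z atom_type subst_id subst_name [charge]
        if in_atom_section and len(fields) >= 8 and '_' in fields[7]:
            if fields[7] not in flagged:
                flagged.append(fields[7])
    return flagged


# Made-up minimal mol2 content with one underscored substructure name
toy_mol2 = """@<TRIPOS>MOLECULE
toy
@<TRIPOS>ATOM
1 N 0.0 0.0 0.0 N.3 1 GLY12 0.0
2 CA 1.0 0.0 0.0 C.3 1 GLY12 0.0
3 N 2.0 0.0 0.0 N.3 2 ALA_1 0.0
@<TRIPOS>BOND
1 1 2 1
"""

flagged = residue_names_with_underscore(toy_mol2)
```

An entry would be dropped in this sketch whenever `flagged` is non-empty for its mol2 file.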

### 8. Remove structures with KLIFS residue X

Some structures contain mutations or modifications in their KLIFS binding site. KLIFS denotes these with an X in the pocket sequence.

We remove all structures containing such a residue in important regions in the binding site.

In [32]:
klifs_metadata_filtered = drop_residue_x(klifs_metadata_filtered)

3218
Drop PDB ID: 4otp


In [33]:
klifs_metadata_filtered.shape

(3917, 22)

In [34]:
klifs_metadata_filtered[klifs_metadata_filtered.pdb_id=='4otp']

Unnamed: 0,index,kinase,family,groups,pdb_id,chain,alternate_model,species,ligand_orthosteric_name,ligand_orthosteric_pdb_id,...,dfg,ac_helix,rmsd1,rmsd2,qualityscore,pocket,resolution,missing_residues,missing_atoms,full_ifp
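
The idea behind this filter can be sketched as a position check on the pocket sequence. The positions used below are placeholders, since the regions that `drop_residue_x` inspects are not listed in this notebook; the helper `has_x_at` and the toy sequences are likewise made up.

```python
def has_x_at(pocket: str, positions) -> bool:
    """True if the pocket sequence has 'X' at any of the given 1-based positions."""
    return any(pocket[pos - 1] == 'X' for pos in positions)


# Hypothetical positions of interest (1-based indices into the pocket sequence)
positions_of_interest = [3, 5]

# Made-up toy pocket sequences keyed by a made-up PDB ID
toy_pockets = {'aaaa': 'KVLGSG', 'bbbb': 'KVXGSG', 'cccc': 'KXLGSG'}

to_drop = [
    pdb_id
    for pdb_id, pocket in toy_pockets.items()
    if has_x_at(pocket, positions_of_interest)
]
```

Here `cccc` is kept although it contains an X, because the X lies outside the positions of interest.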


### 9. Filter by resolution and quality score

In [35]:
# Keep only structures with a resolution of at most 4 Å and a KLIFS quality score of at least 4
klifs_metadata_filtered = klifs_metadata_filtered[
    (klifs_metadata_filtered.resolution <= 4) &
    (klifs_metadata_filtered.qualityscore >= 4)
].copy()

In [36]:
klifs_metadata_filtered.shape

(3880, 22)

### 10. Save final KLIFS dataset (metadata)

In [37]:
klifs_metadata_filtered.shape

(3880, 22)

In [38]:
klifs_metadata_filtered.rename(
    columns={'index': 'metadata_index'}, inplace=True
)

In [39]:
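# Build one code per entry in the form SPECIES/kinase/pdbid[_chainX][_altY];
# chain and alternate model are omitted if KLIFS reports '-'
codes = []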

for index, row in klifs_metadata_filtered.iterrows():
    
    species = row.species.upper()
    kinase = row.kinase
    pdb_id = row.pdb_id
    chain = ''
    alternate_model = ''
    
    if row.chain != '-':
        chain = f'_chain{row.chain}'
    if row.alternate_model != '-':
        alternate_model = f'_alt{row.alternate_model}'
        
    codes.append(f'{species}/{kinase}/{pdb_id}{chain}{alternate_model}')

codes[:10]

['HUMAN/AAK1/4wsq_chainB_altA',
 'HUMAN/AAK1/5l4q_chainA_altA',
 'HUMAN/AAK1/5te0_chainA',
 'HUMAN/ABL1/2f4j_chainA',
 'HUMAN/ABL1/2g1t_chainA',
 'HUMAN/ABL1/2g2i_chainA',
 'HUMAN/ABL1/2gqg_chainA_altB',
 'HUMAN/ABL1/2hz4_chainB',
 'HUMAN/ABL1/2v7a_chainB',
 'HUMAN/ABL1/4twp_chainB']

In [40]:
klifs_metadata_filtered['code'] = codes

In [41]:
klifs_metadata_filtered.shape

(3880, 23)