In [None]:
#| hide
from IPython.display import display

<img alt="Katlas logo" width="500" caption="Katlas logo" src="dataset/images/logo.png" id="logo"/>

# Katlas

> Predict kinase given a substrate sequence

## Install

Install the latest version from GitHub:

In [None]:
!pip install git+https://github.com/sky1ove/katlas.git -Uqq

Or install with pip:

In [None]:
# !pip install katlas -Uqq

## Import

In [None]:
from katlas.core import *

# Quick start

***For a single input sequence***

In [None]:
ref = Data.get_ks_upper()

In [None]:
predict_kinase('AAAAAAAsGGAGSDN',ref)

100%|██████████| 289/289 [00:00<00:00, 8758.02it/s]

calculated string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5S', '6D', '7N']





kinase
PAK6     2.031
ATR      1.991
ULK3     1.960
PRKD1    1.958
TSSK2    1.934
         ...  
FLT3     0.900
KIT      0.890
CSF1R    0.870
FGFR3    0.869
DDR2     0.864
Length: 289, dtype: float64
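
The `calculated string` line above labels each residue with its position relative to the phosphoacceptor at position 0. A minimal sketch of that labeling, assuming the acceptor sits at index `len(seq) // 2` (the helper name `position_tokens` is hypothetical and not part of katlas):

```python
def position_tokens(seq):
    """Label each residue with its offset from the central acceptor."""
    center = len(seq) // 2  # 15-mer -> index 7, 10-mer -> index 5
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


tokens = position_tokens("AAAAAAAsGGAGSDN")
print(tokens[:3], tokens[7], tokens[-1])  # ['-7A', '-6A', '-5A'] 0s 7N
```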

***For many input sequences***

In [None]:
# load a df that contains many phosphorylation sites
df = Data.get_ochoa_site()

In [None]:
df.iloc[:,-2:].head()

| | site_seq | gene_site |
|---|---|---|
| 0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
| 1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
| 2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
| 3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
| 4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |


In [None]:
ref = Data.get_ks_upper()

In [None]:
results = predict_kinase_df(df.head(),ref,'site_seq')

according to the ref 
will calculate position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]


100%|██████████| 289/289 [00:00<00:00, 6934.44it/s]


In [None]:
results

| | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.975307 | 1.088103 | 1.024297 | 1.055557 | 0.991858 | 1.09596 | 0.963895 | 0.974716 | 1.031825 | 1.053151 | ... | 1.209006 | 1.845246 | 1.553544 | 1.691919 | 1.358586 | 1.79327 | 1.332794 | 1.384679 | 1.456727 | 1.974188 |
| 1 | 0.895416 | 0.953961 | 0.948499 | 0.96179 | 0.868367 | 0.928279 | 0.852676 | 0.823955 | 0.921067 | 0.911028 | ... | 1.03703 | 1.427855 | 1.335121 | 1.153418 | 1.641751 | 1.689144 | 1.220036 | 1.197893 | 1.119174 | 1.811465 |
| 2 | 0.849546 | 0.898449 | 0.844726 | 0.879204 | 0.874974 | 0.903264 | 0.845611 | 0.843124 | 0.855262 | 0.874833 | ... | 1.414065 | 2.134812 | 1.445507 | 1.003682 | 1.340236 | 1.435649 | 1.315813 | 1.267839 | 1.231376 | 1.886599 |
| 3 | 0.80639 | 0.841915 | 0.81097 | 0.914746 | 0.874387 | 0.753362 | 0.838206 | 0.803782 | 0.826301 | 0.790217 | ... | 1.091456 | 1.846551 | 1.634287 | 1.490388 | 1.561279 | 1.464307 | 1.457233 | 1.084019 | 1.498325 | 1.74542 |
| 4 | 0.828211 | 0.791854 | 0.783151 | 0.859955 | 0.82716 | 0.74306 | 0.795135 | 0.786305 | 0.799112 | 0.85468 | ... | 1.0561 | 1.279951 | 1.452646 | 1.170946 | 1.503872 | 1.38884 | 1.334052 | 1.063377 | 1.062349 | 1.823619 |
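
Each row of `results` holds one score per kinase for one site. Judging by the sorted output of `predict_kinase` in the single-sequence example, higher scores appear to indicate better matches, so a natural next step is to rank kinases per site. A minimal sketch on a plain dictionary, using a handful of the values displayed for row 0 (the helper `top_kinases` is hypothetical and not part of katlas):

```python
def top_kinases(scores, n=3):
    """Return the n kinase names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# A few of the scores shown for row 0; the full row has 289 kinases.
row0 = {
    "SRC": 0.975307,
    "EPHA3": 1.088103,
    "PKN2": 1.845246,
    "CDK8": 1.79327,
    "GRK1": 1.974188,
}
print(top_kinases(row0))  # ['GRK1', 'PKN2', 'CDK8']
```

On the actual DataFrame, pandas' `results.idxmax(axis=1)` gives the single best-scoring kinase for each row.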


# Dataset

## Reference for scoring

### All capital (recommended)
> For phosphorylation site sequences that are all capital letters

***Option 1: reference derived from the kinase-substrate dataset***

In [None]:
ref = Data.get_ks_upper()

***Option 2: reference from positional scanning peptide array (PSPA)***

In [None]:
ref = Data.get_pspa_upper()

***Option 3: combined***

In [None]:
ref = Data.get_combine_upper()

### With lower case (not recommended if the surrounding positions are all capital)
> For phosphorylation site sequences in which lower case indicates phosphorylation status

Again, there are three options: kinase-substrate, PSPA, and combined.

In [None]:
ref = Data.get_ks()

In [None]:
ref = Data.get_pspa()

In [None]:
ref = Data.get_combine()

Alternatively, use the original normalized PSPA reference, which requires a different scoring function (see `multiply` in the prediction section below):

In [None]:
ref_original = Data.get_pspa_original()

## Phosphorylation sites

***CPTAC pan-cancer phosphoproteomics***

In [None]:
df = Data.get_cptac_ensembl_site()

***Ochoa et al. dataset from [paper](https://www.nature.com/articles/s41587-019-0344-3)***

In [None]:
df = Data.get_ochoa_site()

***PhosphoSitePlus***

In [None]:
df = Data.get_pplus_human_site()

***Your own CSV file***
> Use 'site_seq' as the column name for the site sequence and 'gene_site' for the site ID

In [None]:
# df = pd.read_csv('your_file.csv')
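
Before loading a custom file, it can help to confirm that it has the two expected columns. A small standard-library sketch, assuming a comma-separated file with a header row (`missing_columns` is a hypothetical helper, not a katlas function):

```python
import csv
import io

REQUIRED = ("site_seq", "gene_site")


def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in REQUIRED if col not in header]


example = (
    "site_seq,gene_site\n"
    "VDDEKGDSNDDYDSA,A0A075B6Q4_S24\n"
    "YDSAGLLSDEDCMSV,A0A075B6Q4_S35\n"
)
print(missing_columns(example))  # []
print(missing_columns("sequence,id\nVDDEKGDSNDDYDSA,x\n"))  # ['site_seq', 'gene_site']
```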

```
Because the phosphorylation sites in the first two datasets are all capital, we strongly recommend using an all-capital reference for them.
```
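
The recommendation above can be turned into a quick check on your own sequences. The sketch below is an illustration only: `suggest_reference` is a hypothetical helper, it returns loader names as strings rather than calling katlas, and it arbitrarily picks the kinase-substrate references (the PSPA or combined loaders could be substituted). It assumes the acceptor sits at index `len(seq) // 2`.

```python
def suggest_reference(seqs):
    """Suggest a reference loader name from the letter case of flanking positions."""
    for seq in seqs:
        center = len(seq) // 2  # position 0; its case is ignored here
        flanks = seq[:center] + seq[center + 1:]
        if any(ch.islower() for ch in flanks):
            # Lower case in the flanks marks phosphorylated s/t/y residues.
            return "Data.get_ks"
    # All flanking residues are capital.
    return "Data.get_ks_upper"


print(suggest_reference(["VDDEKGDSNDDYDSA", "TMGLSARyGPQFTLQ"]))  # Data.get_ks_upper
print(suggest_reference(["TLQHVPDyRQNVyIP"]))  # Data.get_ks
```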

### Get unique site

The datasets above contain duplicated phosphorylation site sequences; `get_unique_site` collapses them into unique site sequences.

In [None]:
unique = get_unique_site(df)

In [None]:
unique.sort_values('num_site',ascending=False).head()

| | site_seq | gene_site | num_site | acceptor |
|---|---|---|---|---|
| 185137 | TLQHVPDyRQNVyIP | PCDHGA1_Y890\|PCDHGA10_Y895\|PCDHGA11_Y894\|PCDHG... | 22 | y |
| 185467 | TMGLSARyGPQFTLQ | PCDHGA1_Y878\|PCDHGA10_Y883\|PCDHGA11_Y882\|PCDHG... | 22 | y |
| 126642 | PDyRQNVyIPGSNAT | PCDHGA1_Y895\|PCDHGA10_Y900\|PCDHGA11_Y899\|PCDHG... | 22 | y |
| 67151 | GsKGGCGsCGGsKGG | KRTAP5-1_S101\|KRTAP5-1_S111\|KRTAP5-10_S86\|KRTA... | 19 | s |
| 61524 | GPEVLQDsLDRCYST | NBPF14_S295\|NBPF14_S539\|NBPF19_S364\|NBPF19_S60... | 19 | s |


As shown above, a single site sequence can correspond to multiple sites (`num_site` gives the count).
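
To make the output columns concrete, here is a minimal sketch of this kind of deduplication in plain Python. It is inferred from the table above and is not the actual implementation of `get_unique_site`: it assumes IDs are joined with `|`, that `num_site` counts the joined IDs, and that `acceptor` is the central residue. The helper name `unique_sites` is hypothetical.

```python
def unique_sites(rows):
    """Group (site_seq, gene_site) pairs by sequence, keeping first-seen order."""
    grouped = {}
    for site_seq, gene_site in rows:
        grouped.setdefault(site_seq, []).append(gene_site)
    return [
        {
            "site_seq": seq,
            "gene_site": "|".join(ids),
            "num_site": len(ids),
            "acceptor": seq[len(seq) // 2],  # residue at position 0
        }
        for seq, ids in grouped.items()
    ]


rows = [
    ("TLQHVPDyRQNVyIP", "PCDHGA1_Y890"),
    ("TLQHVPDyRQNVyIP", "PCDHGA10_Y895"),
    ("GPEVLQDsLDRCYST", "NBPF14_S295"),
]
for record in unique_sites(rows):
    print(record)
```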

## Predict kinase based on site sequence

***From site sequence***

To replicate the PhosphoSitePlus kinase library prediction:

In [None]:
ref_original = Data.get_pspa_original()

In [None]:
predict_kinase('AAAAAAAsPGAGSDN',ref_original,multiply)

100%|██████████| 303/303 [00:00<00:00, 9484.28it/s]

calculated string: ['-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1P', '2G', '3A', '4G']





kinase
P38D       6.735
JNK2       6.712
JNK1       6.144
JNK3       6.021
HIPK2      5.940
           ...  
MRCKA     -6.985
MST1      -7.108
ALPHAK3   -7.623
RIPK2     -7.699
TTK       -8.061
Length: 303, dtype: float64
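
Note that only positions -5 to +4 appear in the calculated string, presumably because the original PSPA reference covers just those positions. One common way to score a sequence against such a positional matrix is to multiply the per-position values and report the log2 of the product; whether `multiply` does exactly this is an assumption. A toy sketch with made-up weights (`toy_matrix` and `log2_product_score` are hypothetical, not katlas objects):

```python
import math


def log2_product_score(seq, matrix):
    """Multiply the weights of tokens found in the matrix and return log2 of the product."""
    center = len(seq) // 2  # acceptor index
    tokens = [f"{i - center}{aa}" for i, aa in enumerate(seq)]
    used = [t for t in tokens if t in matrix]  # positions outside the matrix are skipped
    product = math.prod(matrix[t] for t in used)
    return used, math.log2(product)


# Made-up weights for illustration only; a real reference covers every position and residue.
toy_matrix = {"-1A": 2.0, "0s": 1.0, "1P": 4.0}
used, score = log2_product_score("AAAAAAAsPGAGSDN", toy_matrix)
print(used, score)  # ['-1A', '0s', '1P'] 3.0
```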

With our kinase-substrate reference instead:

In [None]:
ref = Data.get_ks_upper()

In [None]:
predict_kinase('AAAAAAAsPGAGSDN',ref)

100%|██████████| 289/289 [00:00<00:00, 8829.53it/s]


calculated string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1P', '2G', '3A', '4G', '5S', '6D', '7N']


kinase
ERK2      2.450
ERK1      2.357
CDK1      2.353
DYRK1A    2.338
CDK9      2.332
          ...  
FRK       0.844
DDR2      0.841
BLK       0.833
CSF1R     0.832
LCK       0.824
Length: 289, dtype: float64

***Predict site sequences from a dataframe***

All capital

In [None]:
df = Data.get_ochoa_site()

unique = get_unique_site(df)

In [None]:
ref = Data.get_ks_upper()

In [None]:
results = predict_kinase_df(unique.head(),ref,seq_col='site_seq',seq_id='gene_site')

according to the ref 
will calculate position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]


100%|██████████| 289/289 [00:00<00:00, 7125.54it/s]


In [None]:
results

| gene_site | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PBX1_S136 | 0.967636 | 0.932705 | 0.978893 | 1.088084 | 0.977932 | 0.95358 | 0.984913 | 0.899555 | 0.98605 | 0.967174 | ... | 1.735 | 1.803454 | 1.820988 | 1.58665 | 1.676599 | 1.726701 | 1.428212 | 1.651039 | 1.549183 | 1.743809 |
| PBX2_S146 | 0.971956 | 0.928551 | 1.004222 | 1.091457 | 0.996244 | 0.91109 | 1.000078 | 0.909698 | 0.983315 | 0.988757 | ... | 1.701254 | 1.743046 | 1.875124 | 1.631292 | 1.750673 | 1.971984 | 1.380593 | 1.641144 | 1.528793 | 1.855968 |
| CLASR_S349 | 0.96469 | 0.9429 | 0.982597 | 1.074349 | 1.01971 | 0.903971 | 0.996951 | 0.915299 | 0.978944 | 0.96435 | ... | 1.598322 | 1.733735 | 1.866852 | 1.52415 | 1.851515 | 1.594087 | 1.428212 | 1.700833 | 1.578802 | 2.030731 |
| TBL1R_S119 | 0.884278 | 0.900332 | 0.925153 | 1.037466 | 0.933632 | 0.861777 | 0.965054 | 0.907505 | 0.922761 | 0.878 | ... | 1.555438 | 1.820845 | 1.856861 | 1.550935 | 1.861953 | 1.83946 | 1.304942 | 1.700161 | 1.429961 | 1.917382 |
| SOX3_S249 | 0.915938 | 0.889072 | 0.938906 | 1.038611 | 0.994051 | 0.870013 | 0.937411 | 0.923446 | 0.928658 | 0.90072 | ... | 1.666336 | 2.098952 | 1.811488 | 1.542007 | 1.8133 | 1.699208 | 1.35274 | 1.70006 | 1.352281 | 1.619684 |


With lower case

In [None]:
df = Data.get_pplus_human_site()

Convert characters to uppercase except for s, t and y

In [None]:
def convert_to_uppercase(sequence):
    # Convert all characters to uppercase except for 's', 't', 'y'
    return ''.join([char.upper() if char not in ['s', 't', 'y'] else char for char in sequence])

In [None]:
df['site_seq2'] = df['site_seq'].apply(convert_to_uppercase)

In [None]:
ref = Data.get_ks()

In [None]:
results = predict_kinase_df(df.head(),ref,seq_col='site_seq2',seq_id='gene_site')

according to the ref 
will calculate position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]


100%|██████████| 289/289 [00:00<00:00, 6674.78it/s]


In [None]:
results

| gene_site | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YWHAB_T2 | 0.448111 | 0.512185 | 0.460677 | 0.495763 | 0.437926 | 0.507986 | 0.4458 | 0.463396 | 0.483484 | 0.450507 | ... | 0.909909 | 0.366295 | 0.781871 | 0.830357 | 0.725926 | 0.838861 | 0.852291 | 0.950844 | 0.650647 | 0.589602 |
| YWHAB_S6 | 0.760687 | 0.775659 | 0.777152 | 0.817897 | 0.812954 | 0.752541 | 0.770987 | 0.744518 | 0.772502 | 0.792198 | ... | 1.023589 | 1.410543 | 1.174738 | 1.331494 | 1.169024 | 1.36072 | 1.098113 | 1.116309 | 1.088421 | 1.76119 |
| YWHAB_Y21 | 1.823324 | 1.860114 | 1.900536 | 1.764631 | 1.787405 | 1.892896 | 1.839061 | 1.858879 | 1.878042 | 1.838722 | ... | 0.96346 | 0.567132 | 0.953375 | 0.987378 | 0.778283 | 0.762847 | 0.928032 | 0.793853 | 1.158406 | 0.935518 |
| YWHAB_T32 | 0.911269 | 0.904384 | 0.908692 | 0.921927 | 0.918522 | 0.935179 | 0.905616 | 0.829514 | 0.91135 | 0.932878 | ... | 1.368271 | 1.020999 | 1.186577 | 1.470578 | 1.032997 | 1.176981 | 1.200719 | 1.362783 | 1.123469 | 1.351465 |
| YWHAB_S39 | 0.974874 | 1.036194 | 0.975992 | 1.018471 | 0.968744 | 1.033577 | 0.988792 | 0.962221 | 0.933072 | 1.045137 | ... | 1.097011 | 1.514735 | 1.254766 | 1.322982 | 1.506397 | 2.076575 | 1.54142 | 1.22026 | 1.18076 | 2.005598 |


## Site format

### Examples

***All capital - 15 length (-7 to +7); position 0 may be lowercase***

- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARyGPQFTLQ

***All capital - 10 length (-5 to +4)***

- SRDPHYQDPH
- LDNPDyQQDF
- AAAAAsGGAG

***With lowercase - 15 length; only s, t and y may be lowercase***

- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ

***With lowercase - 10 length***

- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG

### Length

Either 15 (-7 to +7) or 10 (-5 to +4)

### Acceptor (position 0)

S, T or Y

### Amino acid

Accepts amino acids from "PGACSTVILMFYWHKRQNDEsty"; rare amino acids such as "U" (selenocysteine) or "O" (pyrrolysine) are not accepted.
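
The three rules in this section (length, acceptor, allowed characters) can be checked before running a prediction. A minimal sketch, assuming the acceptor sits at index `len(seq) // 2` and may be lowercase as in the examples above (`check_site` is a hypothetical helper, not a katlas function):

```python
ALLOWED = set("PGACSTVILMFYWHKRQNDEsty")


def check_site(seq):
    """Return a list of format problems; an empty list means the site looks valid."""
    if len(seq) not in (10, 15):
        return [f"length {len(seq)} is not 10 or 15"]
    problems = []
    acceptor = seq[len(seq) // 2]  # position 0
    if acceptor.upper() not in "STY":
        problems.append(f"acceptor {acceptor!r} is not S, T or Y")
    unknown = sorted(set(seq) - ALLOWED)
    if unknown:
        problems.append(f"unsupported characters: {unknown}")
    return problems


print(check_site("QsEEEKLsPsPTTED"))  # []
print(check_site("AAAAAAAUGGAGSDN"))  # two problems: acceptor and unsupported 'U'
```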