# RSC CICAG workshop &mdash; KLIFS: making kinase structures work

## Aim of this notebook

[KLIFS](https://klifs.net/) is a database for kinase-ligand interaction fingerprints and structures. In this notebook, we will use the programmatic access to this database ([KLIFS OpenAPI](https://klifs.net/swagger_v2/)) and the [OpenCADD-KLIFS](https://github.com/volkamerlab/opencadd) package to interact with its rich content. 

We will assess the similarity between a set of kinases (EGFR, ErbB2, SLK, BRAF) based on interaction fingerprints (KLIFS IFP) and subpocket-based structural fingerprints (KiSSim fingerprint).

The notebook proceeds in the following steps:

- Define the kinase set: EGFR, ErbB2, SLK, BRAF
- Fetch data from KLIFS
  - Show how to use the KLIFS Swagger API directly
  - Show how to use OpenCADD-KLIFS to get kinase KLIFS IDs by kinase names
  - Get structures by kinase names and select up to four ligand-bound structures per kinase
  - Get IFPs by structure KLIFS IDs
- Show how to use KiSSim to encode pockets
  - Get KiSSim fingerprints by structure KLIFS IDs
- Calculate IFP- and KiSSim-based kinase distances
  - Show the kinase distance matrices as color-coded tables

## References

The notebook is a mix of the following [TeachOpenCADD](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac267/6582172) notebooks:
- [T012 · Data acquisition from KLIFS](https://projects.volkamerlab.org/teachopencadd/talktorials/T012_query_klifs.html)
- [T025 · Kinase similarity: Kinase pocket (KiSSim fingerprint)](https://projects.volkamerlab.org/teachopencadd/talktorials/T025_kinase_similarity_kissim.html)
- [T026 · Kinase similarity: Interaction fingerprints](https://projects.volkamerlab.org/teachopencadd/talktorials/T026_kinase_similarity_ifp.html)
- [T028 · Kinase similarity: Compare different perspectives](https://projects.volkamerlab.org/teachopencadd/talktorials/T028_kinase_similarity_compare_perspectives.html)

We are using the following open-source resources:
- KLIFS database &mdash; a structural kinase database: [Website](https://klifs.net) and [paper](https://doi.org/10.1093/nar/gkaa895)
- OpenCADD-KLIFS &mdash; a Python module to fetch KLIFS data: [Code](https://github.com/volkamerlab/opencadd) and [paper](https://joss.theoj.org/papers/10.21105/joss.03951)
- KiSSim &mdash; a KLIFS-based kinase structural similarity fingerprint: [Code](https://github.com/volkamerlab/kissim) and [paper](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c00050)

## Installation (Google Colab)

In [1]:
# If the notebook is run on Google Colab
# install condacolab and kissim
try:
    import google.colab
    !pip install condacolab
    import condacolab
    condacolab.install()
    !mamba install -yq kissim
except ModuleNotFoundError:
    pass

## Imports

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

## Define kinase set

In [3]:
kinase_names = ["EGFR", "ErbB2", "SLK", "BRAF"]

## Generate a KLIFS Python client

In [4]:
from bravado.client import SwaggerClient

In [5]:
KLIFS_API_DEFINITIONS = "https://klifs.net/swagger/swagger.json"
KLIFS_CLIENT = SwaggerClient.from_url(KLIFS_API_DEFINITIONS, config={"validate_responses": False})

In [6]:
KLIFS_CLIENT.Information.get_kinase_ID(kinase_name="EGFR", species="Human").response().result

[KinaseInformation(HGNC='EGFR', family='EGFR', full_name='epidermal growth factor receptor', group='TK', iuphar=1797, kinase_ID=406, kinase_class='', name='EGFR', pocket='KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA', species='Human', uniprot='P00533')]

In [7]:
KLIFS_CLIENT.Structures.get_structures_pdb_list(pdb_codes=["3w32", "3poz"]).response().result

[structureDetails(DFG='in', Grich_angle=53.429, Grich_distance=16.7505, Grich_rotation=31.0905, aC_helix='out', allosteric_ligand=0, alt='', back=True, bp_III=False, bp_II_A_in=True, bp_II_B=False, bp_II_B_in=False, bp_II_in=True, bp_II_out=False, bp_IV=False, bp_I_A=True, bp_I_B=True, bp_V=False, chain='A', fp_I=False, fp_II=False, front=True, gate=True, kinase='EGFR', kinase_ID=406, ligand='03P', missing_atoms=0, missing_residues=0, pdb='3poz', pocket='KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA', quality_score=8.0, resolution=1.5, rmsd1=0.815, rmsd2=2.155, species='Human', structure_ID=7308),
 structureDetails(DFG='in', Grich_angle=44.4624, Grich_distance=13.7114, Grich_rotation=41.949, aC_helix='out', allosteric_ligand=0, alt='', back=True, bp_III=False, bp_II_A_in=True, bp_II_B=False, bp_II_B_in=False, bp_II_in=True, bp_II_out=False, bp_IV=False, bp_I_A=True, bp_I_B=True, bp_V=False, chain='A', fp_I=False, fp_II=False, front=True, gate=Tru

## Set up a remote KLIFS session with OpenCADD-KLIFS

![OpenCADD-KLIFS](https://raw.githubusercontent.com/volkamerlab/opencadd/master/paper/opencadd_klifs_toc.png)

In [8]:
from opencadd.databases.klifs import setup_remote

session = setup_remote()
pd.set_option("display.max_columns", None)

## Get kinase KLIFS IDs from kinase names

In [9]:
kinases = session.kinases.by_kinase_name(kinase_names=kinase_names, species="Human")
kinases

| | kinase.klifs_id | kinase.klifs_name | kinase.full_name | kinase.gene_name | kinase.family | kinase.group | kinase.subfamily | species.klifs | kinase.uniprot | kinase.iuphar | kinase.pocket |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 406 | EGFR | epidermal growth factor receptor | EGFR | EGFR | TK | | Human | P00533 | 1797 | KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQ... |
| 1 | 407 | ErbB2 | erb-b2 receptor tyrosine kinase 2 | ERBB2 | EGFR | TK | | Human | P04626 | 2019 | KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQ... |
| 2 | 509 | BRAF | B-Raf proto-oncogene, serine/threonine kinase | BRAF | RAF | TKL | RAF | Human | P15056 | 1943 | QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQ... |
| 3 | 374 | SLK | STE20 like kinase | SLK | STE20 | STE | SLK | Human | Q9H2G2 | 2200 | GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE... |


In [10]:
kinase_klifs_ids = kinases["kinase.klifs_id"].to_list()
kinase_klifs_ids

[406, 407, 509, 374]

## Define structure set

### Fetch structures for kinase set

In [11]:
structures_df = session.structures.by_kinase_name(kinase_names=kinase_names)
structures_df = structures_df.drop("interaction.fingerprint", axis=1)
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())

Number of structures: 731
Kinases: SLK EGFR ErbB2 BRAF


Let’s have a look at what is stored in the structures’ DataFrame:

In [12]:
structures_df.columns

Index(['structure.klifs_id', 'structure.pdb_id', 'structure.alternate_model',
       'structure.chain', 'species.klifs', 'kinase.klifs_id',
       'kinase.klifs_name', 'kinase.names', 'kinase.family', 'kinase.group',
       'structure.pocket', 'ligand.expo_id', 'ligand_allosteric.expo_id',
       'ligand.klifs_id', 'ligand_allosteric.klifs_id', 'ligand.name',
       'ligand_allosteric.name', 'structure.dfg', 'structure.ac_helix',
       'structure.resolution', 'structure.qualityscore',
       'structure.missing_residues', 'structure.missing_atoms',
       'structure.rmsd1', 'structure.rmsd2', 'structure.front',
       'structure.gate', 'structure.back', 'structure.fp_i', 'structure.fp_ii',
       'structure.bp_i_a', 'structure.bp_i_b', 'structure.bp_ii_in',
       'structure.bp_ii_a_in', 'structure.bp_ii_b_in', 'structure.bp_ii_out',
       'structure.bp_ii_b', 'structure.bp_iii', 'structure.bp_iv',
       'structure.bp_v', 'structure.grich_distance', 'structure.grich_angle',
       's

In [13]:
structures_df.head()

Unnamed: 0,structure.klifs_id,structure.pdb_id,structure.alternate_model,structure.chain,species.klifs,kinase.klifs_id,kinase.klifs_name,kinase.names,kinase.family,kinase.group,structure.pocket,ligand.expo_id,ligand_allosteric.expo_id,ligand.klifs_id,ligand_allosteric.klifs_id,ligand.name,ligand_allosteric.name,structure.dfg,structure.ac_helix,structure.resolution,structure.qualityscore,structure.missing_residues,structure.missing_atoms,structure.rmsd1,structure.rmsd2,structure.front,structure.gate,structure.back,structure.fp_i,structure.fp_ii,structure.bp_i_a,structure.bp_i_b,structure.bp_ii_in,structure.bp_ii_a_in,structure.bp_ii_b_in,structure.bp_ii_out,structure.bp_ii_b,structure.bp_iii,structure.bp_iv,structure.bp_v,structure.grich_distance,structure.grich_angle,structure.grich_rotation,structure.filepath,structure.curation_flag
0,1837,2uv2,B,A,Human,374,SLK,,,,GELGD__FGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GVD,-,485,0,,,in,in,2.3,7.8,2,14,0.782,2.011,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,,False
1,10624,6hvd,B,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GUQ,-,3251,0,,,in,in,1.63,8.7,0,13,0.791,2.105,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,16.4506,55.752399,70.360397,,False
2,10625,6hvd,A,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GUQ,-,3251,0,,,in,in,1.63,8.7,0,13,0.791,2.105,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,16.4506,55.752399,70.360397,,False
3,1833,2j51,-,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,DKI,-,50,0,,,in,in,2.1,8.6,0,14,0.78,2.093,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,17.619301,57.9916,66.117302,,False
4,1832,4usf,B,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,6UI,-,490,0,,,out-like,in,1.75,7.2,0,0,1.005,2.367,True,True,True,True,False,False,True,False,False,False,False,False,False,False,False,17.5707,57.901699,14.1689,,False


### Filter structures

We filter the structures by different criteria:

- Species: human
- Conformation: DFG-in (the active kinase conformation)
- Resolution: $\le 3$ Ångström
- Quality score*: $\ge 6$
- Ligand-bound (ligand KLIFS ID cannot be $0$)

\* The KLIFS quality score takes into account the quality of the alignment, as well as the number of missing residues and atoms. A higher score indicates a better structure quality.

In [14]:
structures_df = structures_df[
    (structures_df["species.klifs"] == "Human")
    & (structures_df["structure.dfg"] == "in")
    & (structures_df["structure.resolution"] <= 3)
    & (structures_df["structure.qualityscore"] >= 6)
    & (structures_df["ligand.klifs_id"] != 0)
]
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())

Number of structures: 431
Kinases: SLK EGFR ErbB2 BRAF


In [15]:
structures_df.groupby("kinase.klifs_name").size().sort_values(ascending=False)

kinase.klifs_name
EGFR     363
BRAF      58
SLK        6
ErbB2      4
dtype: int64

In [16]:
structures_df.head()

Unnamed: 0,structure.klifs_id,structure.pdb_id,structure.alternate_model,structure.chain,species.klifs,kinase.klifs_id,kinase.klifs_name,kinase.names,kinase.family,kinase.group,structure.pocket,ligand.expo_id,ligand_allosteric.expo_id,ligand.klifs_id,ligand_allosteric.klifs_id,ligand.name,ligand_allosteric.name,structure.dfg,structure.ac_helix,structure.resolution,structure.qualityscore,structure.missing_residues,structure.missing_atoms,structure.rmsd1,structure.rmsd2,structure.front,structure.gate,structure.back,structure.fp_i,structure.fp_ii,structure.bp_i_a,structure.bp_i_b,structure.bp_ii_in,structure.bp_ii_a_in,structure.bp_ii_b_in,structure.bp_ii_out,structure.bp_ii_b,structure.bp_iii,structure.bp_iv,structure.bp_v,structure.grich_distance,structure.grich_angle,structure.grich_rotation,structure.filepath,structure.curation_flag
0,1837,2uv2,B,A,Human,374,SLK,,,,GELGD__FGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GVD,-,485,0,,,in,in,2.3,7.8,2,14,0.782,2.011,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,,False
1,10624,6hvd,B,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GUQ,-,3251,0,,,in,in,1.63,8.7,0,13,0.791,2.105,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,16.4506,55.752399,70.360397,,False
2,10625,6hvd,A,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GUQ,-,3251,0,,,in,in,1.63,8.7,0,13,0.791,2.105,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,16.4506,55.752399,70.360397,,False
3,1833,2j51,-,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,DKI,-,50,0,,,in,in,2.1,8.6,0,14,0.78,2.093,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,17.619301,57.9916,66.117302,,False
5,1834,2uv2,A,A,Human,374,SLK,,,,GELGD__FGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,GVD,-,485,0,,,in,in,2.3,7.8,2,14,0.782,2.011,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,,False


In [17]:
structures_df = structures_df.sort_values(
    by=["kinase.klifs_name", "structure.resolution", "structure.qualityscore"],
    ascending=[True, True, False]
)
structures_df = structures_df.groupby("kinase.klifs_name").head(4)
structures_df.groupby("kinase.klifs_name").size()

kinase.klifs_name
BRAF     4
EGFR     4
ErbB2    4
SLK      4
dtype: int64

In [18]:
structure_klifs_ids = structures_df["structure.klifs_id"].to_list()
print("Structure KLIFS IDs:", *structure_klifs_ids)

Structure KLIFS IDs: 6940 7063 7059 7060 12845 12827 12832 12828 4816 4819 4815 4820 10624 10625 1833 1838


## Encode structures: KLIFS IFPs

### Get KLIFS IFPs

![KLIFS IFP](https://raw.githubusercontent.com/volkamerlab/teachopencadd/master/teachopencadd/talktorials/T026_kinase_similarity_ifp/images/T026_KLIFS_IFP.png)

In [19]:
ifps_df = session.interactions.by_structure_klifs_id(structure_klifs_ids=structure_klifs_ids)
print(f"Number of IFPs: {len(ifps_df)}")
ifps_df.head()

Number of IFPs: 16


| | structure.klifs_id | interaction.fingerprint |
|---|---|---|
| 0 | 1833 | 0000000000000010001000000000000000000000000000... |
| 1 | 1838 | 0000000000000010001000000000000000000000000000... |
| 2 | 4815 | 0000000000000010000000000000000000010000000000... |
| 3 | 4816 | 0000000000000010000000000000000000000000000000... |
| 4 | 4819 | 0000000000000010000000000000000000000000000000... |


In [20]:
structures_with_ifps_df = ifps_df.merge(structures_df, on="structure.klifs_id", how="inner")
print(f"Number of structures with IFPs: {len(structures_with_ifps_df)}")
structures_with_ifps_df.head()

Number of structures with IFPs: 16


Unnamed: 0,structure.klifs_id,interaction.fingerprint,structure.pdb_id,structure.alternate_model,structure.chain,species.klifs,kinase.klifs_id,kinase.klifs_name,kinase.names,kinase.family,kinase.group,structure.pocket,ligand.expo_id,ligand_allosteric.expo_id,ligand.klifs_id,ligand_allosteric.klifs_id,ligand.name,ligand_allosteric.name,structure.dfg,structure.ac_helix,structure.resolution,structure.qualityscore,structure.missing_residues,structure.missing_atoms,structure.rmsd1,structure.rmsd2,structure.front,structure.gate,structure.back,structure.fp_i,structure.fp_ii,structure.bp_i_a,structure.bp_i_b,structure.bp_ii_in,structure.bp_ii_a_in,structure.bp_ii_b_in,structure.bp_ii_out,structure.bp_ii_b,structure.bp_iii,structure.bp_iv,structure.bp_v,structure.grich_distance,structure.grich_angle,structure.grich_rotation,structure.filepath,structure.curation_flag
0,1833,0000000000000010001000000000000000000000000000...,2j51,-,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,DKI,-,50,0,,,in,in,2.1,8.6,0,14,0.78,2.093,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,17.619301,57.9916,66.117302,,False
1,1838,0000000000000010001000000000000000000000000000...,2jfl,-,A,Human,374,SLK,,,,GELGDGAFGKVYKAAAKVIDYMVEIDILASCDPNIVKLLDAWILIE...,DKI,-,50,0,,,in,in,2.2,9.0,0,10,0.78,2.093,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,17.540001,57.789501,67.502098,,False
2,4815,0000000000000010000000000000000000010000000000...,3pp0,A,A,Human,407,ErbB2,,,,KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQ...,03Q,-,1506,0,,,in,in,2.25,8.0,0,0,0.789,2.181,True,True,True,False,False,True,True,True,True,False,False,False,False,False,False,13.6917,45.711201,65.5812,,False
3,4816,0000000000000010000000000000000000000000000000...,3pp0,A,B,Human,407,ErbB2,,,,KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQ...,03Q,-,1506,0,,,in,in,2.25,8.0,0,0,0.788,2.194,True,True,True,True,False,True,True,True,True,False,False,False,False,False,False,15.7651,50.561501,45.475201,,False
4,4819,0000000000000010000000000000000000000000000000...,3pp0,B,B,Human,407,ErbB2,,,,KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQ...,03Q,-,1506,0,,,in,in,2.25,8.0,0,0,0.788,2.194,True,True,True,True,False,True,True,True,True,False,False,False,False,False,False,15.7651,50.561501,45.475201,,False


## Encode structures: KiSSim fingerprints

### Get KiSSim fingerprints

![KiSSim fingerprint](https://raw.githubusercontent.com/volkamerlab/kissim/main/docs/_static/kissim_toc.png)

In [21]:
from kissim.api import encode



In [22]:
%%time
kissim_fingerprints = encode(structure_klifs_ids, n_cores=1)
kissim_fingerprints

CPU times: user 24.7 ms, sys: 28.8 ms, total: 53.5 ms
Wall time: 36.1 s


<kissim.encoding.fingerprint_generator.FingerprintGenerator at 0x7fda007c6730>

## Compare structures: KLIFS IFPs

We will compare the structures' IFPs pairwise using the Jaccard distance (one minus the Tanimoto/Jaccard similarity) as implemented in `sklearn.metrics.pairwise_distances`, which relies on the `scipy.spatial.distance` module under the hood.

### Prepare IFPs as `numpy` array

KLIFS stores each IFP as a string of 0s and 1s. We have to convert the IFPs to an array of boolean vectors (required by `scipy.spatial.distance` to be able to use the Jaccard distance). Each row in this array refers to one IFP, each column to one of the IFP's features.
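To make the conversion concrete, here is a minimal sketch on a toy bit string in plain Python. It assumes the KLIFS IFP layout of 7 interaction bits for each of the 85 pocket residues (595 bits in total); the toy fingerprint and the helper name `ifp_to_bools` are illustrative and not part of OpenCADD-KLIFS.

```python
N_RESIDUES = 85      # KLIFS pocket residues (assumed layout)
N_INTERACTIONS = 7   # interaction types per residue (assumed layout)


def ifp_to_bools(ifp):
    """Convert an IFP string of '0'/'1' characters to a list of booleans."""
    if set(ifp) - {"0", "1"}:
        raise ValueError("IFP must contain only '0' and '1'")
    return [bit == "1" for bit in ifp]


# Toy fingerprint: all zeros except the first bit of residues 3 and 5 (1-based)
bits = ["0"] * (N_RESIDUES * N_INTERACTIONS)
bits[(3 - 1) * N_INTERACTIONS] = "1"
bits[(5 - 1) * N_INTERACTIONS] = "1"
toy_ifp = "".join(bits)

ifp_bools = ifp_to_bools(toy_ifp)

# Group the flat vector into one 7-bit block per pocket residue
per_residue = [
    ifp_bools[i : i + N_INTERACTIONS]
    for i in range(0, len(ifp_bools), N_INTERACTIONS)
]
# 1-based pocket positions with at least one interaction bit set
interacting_residues = [i + 1 for i, block in enumerate(per_residue) if any(block)]
print(len(ifp_bools), interacting_residues)
```

Grouping the bits per residue is not needed for the distance calculation itself, but it helps when checking which pocket positions contribute to a fingerprint.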

In [23]:
# This is the KLIFS format of the IFP (structure KLIFS ID and kinase name set as index)
ifp_series = structures_with_ifps_df.set_index(["structure.klifs_id", "kinase.klifs_name"])[
    "interaction.fingerprint"
]
ifp_series.head()

structure.klifs_id  kinase.klifs_name
1833                SLK                  0000000000000010001000000000000000000000000000...
1838                SLK                  0000000000000010001000000000000000000000000000...
4815                ErbB2                0000000000000010000000000000000000010000000000...
4816                ErbB2                0000000000000010000000000000000000000000000000...
4819                ErbB2                0000000000000010000000000000000000000000000000...
Name: interaction.fingerprint, dtype: string

In [24]:
# Cast "0" and "1" to boolean False and True
ifp_series = ifp_series.apply(lambda x: [i == "1" for i in x])
ifp_series.head()

structure.klifs_id  kinase.klifs_name
1833                SLK                  [False, False, False, False, False, False, Fal...
1838                SLK                  [False, False, False, False, False, False, Fal...
4815                ErbB2                [False, False, False, False, False, False, Fal...
4816                ErbB2                [False, False, False, False, False, False, Fal...
4819                ErbB2                [False, False, False, False, False, False, Fal...
Name: interaction.fingerprint, dtype: object

In [25]:
# Convert to numpy array
ifps_array = np.array(ifp_series.to_list())
ifps_array

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

### Calculate pairwise Jaccard distances

The Jaccard distance, defined below for the sets $A$ and $B$ of on-bits in two fingerprints, is commonly used for binary fingerprints:

$$
d_J(A,B) = 1 - J(A,B) = \frac{\mid A \cup B \mid - \mid A \cap B \mid}{\mid A \cup B \mid}.
$$
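As a worked example of this formula, the following sketch computes the Jaccard distance for two short toy fingerprints in plain Python. The function name `jaccard_distance` and the handling of two all-zero vectors are choices made here for illustration; the notebook itself relies on `sklearn.metrics.pairwise_distances`.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length boolean vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    union = sum(x or y for x, y in zip(a, b))
    intersection = sum(x and y for x, y in zip(a, b))
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return (union - intersection) / union


fp_a = [bit == "1" for bit in "110100"]
fp_b = [bit == "1" for bit in "100110"]

# Union: 4 on-bits (positions 0, 1, 3, 4); intersection: 2 on-bits (positions 0, 3)
distance = jaccard_distance(fp_a, fp_b)
print(distance)  # 0.5
```

Identical fingerprints give a distance of 0, and fingerprints without any shared on-bit give a distance of 1.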

In [26]:
from sklearn.metrics import pairwise_distances

structure_distance_matrix_array = pairwise_distances(ifps_array, metric="jaccard")

In [27]:
# Create DataFrame with structure KLIFS IDs as index/columns
structure_klifs_ids = ifp_series.index.get_level_values(0)
structure_distance_matrix_df = pd.DataFrame(
    structure_distance_matrix_array, index=structure_klifs_ids, columns=structure_klifs_ids
)
print(f"Structure distance matrix size: {structure_distance_matrix_df.shape}")
print("Show matrix subset:")
structure_distance_matrix_df.iloc[:5, :5]

Structure distance matrix size: (16, 16)
Show matrix subset:


| structure.klifs_id | 1833 | 1838 | 4815 | 4816 | 4819 |
|---|---|---|---|---|---|
| 1833 | 0.0 | 0.055556 | 0.647059 | 0.636364 | 0.636364 |
| 1838 | 0.055556 | 0.0 | 0.617647 | 0.606061 | 0.606061 |
| 4815 | 0.647059 | 0.617647 | 0.0 | 0.16129 | 0.16129 |
| 4816 | 0.636364 | 0.606061 | 0.16129 | 0.0 | 0.0 |
| 4819 | 0.636364 | 0.606061 | 0.16129 | 0.0 | 0.0 |


### Map structure to kinase distance matrix

Note: So far we have compared individual structures, but we want to compare kinases (each of which can be represented by several structures, as shown above).

First, as an intermediate step, we will create a structure distance matrix but &mdash; instead of labeling the data with structure KLIFS IDs &mdash; we use the corresponding kinase name.

In [28]:
# Copy distance matrix to kinase matrix
kinase_distance_matrix_df = structure_distance_matrix_df.copy()
# Replace structure KLIFS IDs with the structures' kinase names
kinase_names = ifp_series.index.get_level_values(1)
kinase_distance_matrix_df.index = kinase_names
kinase_distance_matrix_df.columns = kinase_names
print("Show matrix subset:")
kinase_distance_matrix_df.iloc[:5, :5]

Show matrix subset:


| kinase.klifs_name | SLK | SLK | ErbB2 | ErbB2 | ErbB2 |
|---|---|---|---|---|---|
| SLK | 0.0 | 0.055556 | 0.647059 | 0.636364 | 0.636364 |
| SLK | 0.055556 | 0.0 | 0.617647 | 0.606061 | 0.606061 |
| ErbB2 | 0.647059 | 0.617647 | 0.0 | 0.16129 | 0.16129 |
| ErbB2 | 0.636364 | 0.606061 | 0.16129 | 0.0 | 0.0 |
| ErbB2 | 0.636364 | 0.606061 | 0.16129 | 0.0 | 0.0 |


In this notebook, we will consider per kinase pair the two structures that show the most similar binding mode for their co-crystallized ligands. Hence, we select the structure pair with the minimum IFP distance as representative for a kinase pair.
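The selection of the minimum distance per kinase pair can be sketched in plain Python, independent of pandas. The distances below are toy values loosely based on the matrix subset shown above, and the `(kinase, kinase, distance)` tuple layout is an assumption made for illustration only.

```python
from collections import defaultdict

# Toy structure-level distances:
# (kinase of structure i, kinase of structure j, IFP distance)
structure_pairs = [
    ("SLK", "ErbB2", 0.647),
    ("SLK", "ErbB2", 0.606),
    ("ErbB2", "SLK", 0.636),
    ("SLK", "SLK", 0.056),
    ("SLK", "SLK", 0.0),
    ("ErbB2", "ErbB2", 0.161),
    ("ErbB2", "ErbB2", 0.0),
]

# Keep the smallest distance seen for each kinase pair
kinase_distances = defaultdict(lambda: float("inf"))
for kinase_a, kinase_b, distance in structure_pairs:
    key = tuple(sorted((kinase_a, kinase_b)))  # order-independent kinase pair
    kinase_distances[key] = min(kinase_distances[key], distance)

print(dict(kinase_distances))
```

Taking the minimum means that a kinase pair is judged by its best-matching structure pair; other aggregations such as the mean or median would weight outlier structures differently.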

In [29]:
# We unstack the matrix (each pairwise comparison in a single row)
# We group by kinase pair (level=[0, 1]); the matrix is symmetric,
# so (A, B) and (B, A) yield the same minimum
# We take the minimum value in each kinase pair group
# We unstack the remaining data points
kinase_distance_matrix_df = (
    kinase_distance_matrix_df.unstack().groupby(level=[0, 1]).min().unstack(level=1)
)
kinase_distance_matrix_df.index.name = None
kinase_distance_matrix_df.columns.name = None

In [30]:
print(
    f"Structure matrix of shape {structure_distance_matrix_df.shape} "
    f"reduced to kinase matrix of shape {kinase_distance_matrix_df.shape}."
)

Structure matrix of shape (16, 16) reduced to kinase matrix of shape (4, 4).


In [31]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")

| | BRAF | EGFR | ErbB2 | SLK |
|---|---|---|---|---|
| BRAF | 0.0 | 0.462 | 0.447 | 0.576 |
| EGFR | 0.462 | 0.0 | 0.314 | 0.594 |
| ErbB2 | 0.447 | 0.314 | 0.0 | 0.606 |
| SLK | 0.576 | 0.594 | 0.606 | 0.0 |


Note: Since this is a distance matrix, lighter colors indicate similarity, darker colors dissimilarity.

## Compare structures: KiSSim fingerprint

In [32]:
from kissim.api import compare

In [33]:
%%time
_, fingerprint_distance_generator = compare(kissim_fingerprints)
fingerprint_distance_generator

CPU times: user 99.8 ms, sys: 25.4 ms, total: 125 ms
Wall time: 2.12 s


<kissim.comparison.fingerprint_distance_generator.FingerprintDistanceGenerator at 0x7fda00a5f820>

In [34]:
kinase_distance_matrix_df = fingerprint_distance_generator.kinase_distance_matrix()
kinase_distance_matrix_df.index.name = None
kinase_distance_matrix_df.columns.name = None
kinase_distance_matrix_df

| | BRAF | EGFR | ErbB2 | SLK |
|---|---|---|---|---|
| BRAF | 0.0 | 0.208025 | 0.233484 | 0.22022 |
| EGFR | 0.208025 | 0.0 | 0.133167 | 0.193394 |
| ErbB2 | 0.233484 | 0.133167 | 0.0 | 0.193065 |
| SLK | 0.22022 | 0.193394 | 0.193065 | 0.0 |


In [35]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")

| | BRAF | EGFR | ErbB2 | SLK |
|---|---|---|---|---|
| BRAF | 0.0 | 0.208 | 0.233 | 0.22 |
| EGFR | 0.208 | 0.0 | 0.133 | 0.193 |
| ErbB2 | 0.233 | 0.133 | 0.0 | 0.193 |
| SLK | 0.22 | 0.193 | 0.193 | 0.0 |
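
As a quick cross-check of the two perspectives, the following sketch ranks the six kinase pairs by distance in each matrix. The values are copied by hand from the IFP-based and KiSSim-based kinase matrices printed in this notebook; the dictionary layout and variable names are only illustrative.

```python
from itertools import combinations

kinases = ["BRAF", "EGFR", "ErbB2", "SLK"]

# Upper-triangle values copied from the two kinase distance matrices above
ifp_distances = {
    ("BRAF", "EGFR"): 0.462, ("BRAF", "ErbB2"): 0.447, ("BRAF", "SLK"): 0.576,
    ("EGFR", "ErbB2"): 0.314, ("EGFR", "SLK"): 0.594, ("ErbB2", "SLK"): 0.606,
}
kissim_distances = {
    ("BRAF", "EGFR"): 0.208025, ("BRAF", "ErbB2"): 0.233484, ("BRAF", "SLK"): 0.22022,
    ("EGFR", "ErbB2"): 0.133167, ("EGFR", "SLK"): 0.193394, ("ErbB2", "SLK"): 0.193065,
}

# Sort the kinase pairs from most similar to most dissimilar
pairs = list(combinations(kinases, 2))
ifp_ranking = sorted(pairs, key=ifp_distances.get)
kissim_ranking = sorted(pairs, key=kissim_distances.get)

print("Closest pair (IFP):   ", ifp_ranking[0])
print("Closest pair (KiSSim):", kissim_ranking[0])
print("Most distant (IFP):   ", ifp_ranking[-1])
print("Most distant (KiSSim):", kissim_ranking[-1])
```

With these values, both fingerprints place EGFR and ErbB2 closest together, while they disagree on the most distant pair (ErbB2/SLK for the IFP, BRAF/ErbB2 for KiSSim). The two distance measures are defined differently, so only the rank order is compared here, not the absolute values.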
