# T023 · What is a kinase?

Authors:

- Talia B. Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)

## Aim of this talktorial

In this talktorial, we will talk about kinases: why are they important in life and drug design, what do they look like, and what data resources are available?

### Contents in *Theory*

- Kinases in a nutshell
    - The human kinome
    - Kinase structures and important motifs
- Kinase resources
    - Kinase structures and related information
    - Bioactivity data
- Kinase-similarity: off-target, promiscuous binding
- Kinase dataset compilation

### Contents in *Practical*

- Define the kinases of interest

### References

- Kinases as drug targets: [<i>Nat. Rev. Drug Discov.</i> (2021), <b>20(7)</b>, 551-569](https://doi.org/10.1038/s41573-021-00195-4)
- Sequence-based kinase clustering: Manning et al. [<i>Science</i> (2002), <b>298(5600)</b>, 1912-1934](https://doi.org/10.1126/science.1075762)
- KLIFS
  - KLIFS URL: https://klifs.net/
  - KLIFS database: [<i>Nucleic Acids Res.</i> (2020), <b>49(D1)</b>, D562-D569](https://doi.org/10.1093/nar/gkaa895)
  - KLIFS binding site definition: [<i>J. Med. Chem.</i> (2014), <b>57(2)</b>, 249-277](https://doi.org/10.1021/jm400378w)
- Bioactivity data
  - Karaman et al. dataset: [<i>Nature Biotechnology</i> (2008), <b>26</b>, 127-132](https://doi.org/10.1038/nbt1358)
  - Davis et al. dataset: [<i>Nature Biotechnology</i> (2011), <b>29</b>, 1046-1051](https://doi.org/10.1038/nbt.1990)
  - KIBA dataset: [<i>J. Chem. Inf. Model.</i> (2014), <b>54(3)</b>, 735-743](https://doi.org/10.1021/ci400709d)
  - PKIS dataset: [<i>PLOS ONE</i> (2017), <b>12</b>, 1-20](https://doi.org/10.1371/journal.pone.0181585)
- Kinase dataset: [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629) 

## Theory

### Kinases in a nutshell

Kinases are established drug targets to combat cancer and inflammatory diseases ([<i>Nat. Rev. Drug Discov.</i> (2021), <b>20(7)</b>, 551-569](https://doi.org/10.1038/s41573-021-00195-4)). They are involved in most aspects of cell life by phosphorylating and thereby activating themselves or other proteins, and they are among the most frequently mutated proteins in tumors.

As of Sep 2021, 5782 X-ray structures of human kinases have been resolved (see [KLIFS](https://klifs.net/) database) and 67 FDA-approved small-molecule protein kinase inhibitors are on the market (see [this list](http://www.brimr.org/PKI/PKIs.htm)). Most of the approved drugs bind in the ATP-binding pocket and its immediate surroundings.

Despite decades of kinase research, many challenges remain open:

- A large fraction of the kinome is un-/underexplored
- Many kinase inhibitors are promiscuous binders, causing off-target effects or enabling polypharmacology
- Occurrence of drug resistance due to mutations

#### The human kinome 

The human kinome consists of over 500 protein kinases (the exact number varies depending on the data resource, see [overview on kinodata](https://github.com/openkinome/kinodata/blob/master/human-kinases/human_kinases.ipynb)).

Manning et al. ([<i>Science</i> (2002), <b>298(5600)</b>, 1912-1934](https://doi.org/10.1126/science.1075762)) clustered the human protein kinases based on their sequence into eight major groups (AGC, CAMK, CK1, CMGC, RGC, STE, TK, TKL) and one "Other" group for unassigned kinases. This clustering is visualized as the Manning kinome tree. The kinase resource KinMap enables mapping of kinase data onto that tree, e.g. the number of X-ray structures per kinase as shown in Figure 1.

![Manning tree with number of structures per kinase (KinMap)](images/kinmap_n_structures_per_kinase.png)

*Figure 1:* 
Number of PDB structures per kinase mapped onto the Manning kinome tree using KinMap.
<!---
We are using KLIFS kinase names; some are not recognized by KinMap and were simply dropped!
--->

#### Kinase structures and important motifs

Kinase sequences and structures are highly conserved. Important regions in the kinase pocket include (Figure 2):

- Hinge region: Forms key hydrogen bonds to ligands
- DFG motif: Adopts a DFG-in or DFG-out conformation, in which the aspartate (D) and phenylalanine (F) side chains swap positions; this flip is associated with the switch between the active and inactive state
- αC-helix: In the αC-in conformation, forms a salt bridge between a highly conserved lysine and glutamate
- Glycine-rich (G-rich) loop: Stabilizes ATP binding

![Kinase structure with key motifs](images/T023_kinase_structure.png)

*Figure 2:* 
Kinase structure with important key motifs: Hinge region, DFG motif, αC-helix, and G-rich loop (example: CDK2, PDB ID: 1FIN)

<!---
from opencadd.structure.pocket import PocketKlifs, PocketViewer
pocket = PocketKlifs.from_structure_klifs_id(4367)
viewer = PocketViewer()
viewer.add_pocket(
    pocket,
    ligand_expo_id="ATP",
    show_pocket_center=False
)
viewer.viewer.add_ball_and_stick(selection="ATP")
-->

### Kinase resources

The strong focus on this protein family has led to a plethora of freely available data on compounds, bioactivities, and structures, which are being used for computational drug development (Kooistra and Volkamer, <i>Annu. Rep. Med. Chem.</i> (2017), <b>50</b>, 153-192).

#### Kinase structures and related information: KLIFS

The KLIFS database ([<i>Nucleic Acids Res.</i> (2020), <b>49(D1)</b>, D562-D569](https://doi.org/10.1093/nar/gkaa895), [<i>J. Med. Chem.</i> (2014), <b>57(2)</b>, 249-277](https://doi.org/10.1021/jm400378w)) fetches all kinase structures deposited in the structural database PDB ([<i>Acta Cryst.</i> (2002), <b>D58</b>, 899-907](https://doi.org/10.1107/S0907444902003451), [<i>Structure</i> (2012), <b>20(3)</b>, 391-396](https://doi.org/10.1016/j.str.2012.01.010)) and processes them as follows: All multi-chain structures in the PDB are split into monomers and aligned to each other with a special focus on a pre-defined binding site of 85 residues (Figure 3). This means, for example, that the conserved gatekeeper (GK) residue at KLIFS position 45 can be looked up quickly and easily in any of the over 10,000 monomeric kinase structures in KLIFS.

![KLIFS binding site](https://klifs.net/images/faq/xcolors.png.pagespeed.ic.dprMuoZGzn.webp)

*Figure 3:* 
Kinase binding site residues as defined by KLIFS.
Figure and description taken from: [<i>J. Med. Chem.</i> (2014), <b>57(2)</b>, 249-277](https://doi.org/10.1021/jm400378w).
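
Because every KLIFS pocket is aligned to the same 85 positions, looking up a residue such as the gatekeeper reduces to simple string indexing. The following is a minimal sketch of that idea, not KLIFS or `opencadd` code: the helper `residue_at_klifs_position` is a hypothetical name, and `toy_pocket` is a made-up placeholder sequence, not a real kinase pocket.

```python
KLIFS_POCKET_LENGTH = 85  # number of aligned pocket residues, as stated above
GATEKEEPER_POSITION = 45  # 1-based KLIFS position of the gatekeeper residue


def residue_at_klifs_position(pocket_sequence, position):
    """Return the one-letter residue at a 1-based KLIFS pocket position (hypothetical helper)."""
    if len(pocket_sequence) != KLIFS_POCKET_LENGTH:
        raise ValueError(
            f"Expected {KLIFS_POCKET_LENGTH} pocket residues, got {len(pocket_sequence)}."
        )
    if not 1 <= position <= KLIFS_POCKET_LENGTH:
        raise ValueError(f"KLIFS position must be between 1 and {KLIFS_POCKET_LENGTH}.")
    # KLIFS positions are 1-based, Python strings are 0-based
    return pocket_sequence[position - 1]


# Placeholder pocket (NOT a real kinase): 44 alanines, a threonine at position 45, 40 alanines
toy_pocket = "A" * 44 + "T" + "A" * 40

gatekeeper = residue_at_klifs_position(toy_pocket, GATEKEEPER_POSITION)
print(gatekeeper)  # T
```

The same lookup works for any other aligned pocket position, which is what makes residue-wise comparisons across kinases straightforward.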

Each structure, kinase, and ligand in KLIFS is associated with an identifier:

- Structure KLIFS ID
- Kinase KLIFS ID
- Ligand KLIFS ID

#### Bioactivity data

TODO - short !
* maybe an overview of data points per kinase on Chembl (from kinodata)
* other profiling data (as available in karaman, xxx)

![Manning tree with number of ChEMBL activities per kinase (KinMap)](images/kinmap_n_activities_per_kinase.png)

*Figure 4:* 
Number of ChEMBL 29 bioactivities per kinase mapped onto the Manning kinome tree using KinMap.
<!---
We are using KLIFS kinase names; some are not recognized by KinMap and were simply dropped!
--->

- Karaman et al. dataset: TODO
  - Paper: [<i>Nature Biotechnology</i> (2008), <b>26</b>, 127-132](https://doi.org/10.1038/nbt1358)
  - Data: [KinMap data (JSON)](http://kinhub.org/js/Karaman_profiling.js)
- Davis et al. dataset: TODO
  - Paper: [<i>Nature Biotechnology</i> (2011), <b>29</b>, 1046-1051](https://doi.org/10.1038/nbt.1990)
  - Data: [KinMap data (JSON)](http://kinhub.org/js/Davis_profiling.js)
- KIBA dataset: TODO
  - Paper: [<i>J. Chem. Inf. Model.</i> (2014), <b>54(3)</b>, 735-743](https://doi.org/10.1021/ci400709d)
  - Data: [SI data (XLSX)](https://ndownloader.figstatic.com/files/3950161)
- PKIS dataset: 
  - Paper: [<i>PLOS ONE</i> (2017), <b>12</b>, 1-20](https://doi.org/10.1371/journal.pone.0181585)
  - Data: [SI data (XLSX)](https://doi.org/10.1371/journal.pone.0181585.s004)

### Kinase-similarity: off-target, promiscuous binding

TODO Problem statement. Introduce
* problem of promiscuous binding (e.g. by profiling results for known inhibitors, (kinhub figure), example from molecules paper?)
* different perspectives: sequence, structure and ligand-profiling data
* that's why we investigate in similarity from the different perspectives


### Kinase dataset compilation

In the course of the kinase similarity talktorials (**Talktorials T024-T028**), we will use nine kinases from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), which were selected for the following reasons:

> - Profile 1 combined __EGFR__ and __ErbB2__ as targets and __BRAF__ as a (general) anti-target. 
> - Out of similar considerations, Profile 2 consisted of EGFR and __PI3K__ as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. 
> - Profile 3, comprised of EGFR and __VEGFR2__ as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).
> - To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases __CDK2__, __LCK__, __MET__ and __p38α__ were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.

## Practical

In [1]:
from pathlib import Path

import pandas as pd

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Define the kinases of interest

We have collected information about these nine kinases in the CSV file `kinase_selection.csv`:

- `kinase`: Kinase name as used in [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629)
- `kinase_klifs`: Kinase name as used in the KLIFS database
- `uniprot_id`: Kinase UniProt ID
- `group`: Kinase group as defined by Manning et al. [<i>Science</i> (2002), <b>298(5600)</b>, 1912-1934](https://doi.org/10.1126/science.1075762)
- `full_kinase_name`: Full kinase name as used in [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629)

Note: You can run the kinase similarity **Talktorials T024-T028** with your own set of kinases. Please update the CSV file with your kinases; the only mandatory columns are `kinase_klifs` and `uniprot_id`.
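
If you bring your own kinase set, it can help to check the mandatory columns before running the downstream talktorials. The snippet below is a minimal sketch under stated assumptions: `check_kinase_selection` is a hypothetical helper (not part of the talktorials), and the CSV content is an in-memory example that uses two kinases from the selection above.

```python
import io

import pandas as pd

# Mandatory columns as stated in the note above
MANDATORY_COLUMNS = {"kinase_klifs", "uniprot_id"}


def check_kinase_selection(df):
    """Raise an error if a mandatory column is missing (hypothetical helper)."""
    missing = MANDATORY_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing mandatory column(s): {sorted(missing)}")
    return df


# Minimal custom selection containing only the mandatory columns
custom_csv = io.StringIO("kinase_klifs,uniprot_id\nEGFR,P00533\nCDK2,P24941\n")
custom_df = check_kinase_selection(pd.read_csv(custom_csv))
print(custom_df)
```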

In [3]:
kinase_selection_df = pd.read_csv(DATA/"kinase_selection.csv")
kinase_selection_df

| Unnamed: 0 | kinase | kinase_klifs | uniprot_id | group | full_kinase_name |
|---|---|---|---|---|---|
| 0 | EGFR | EGFR | P00533 | TK | Epidermal growth factor receptor |
| 1 | ErbB2 | ErbB2 | P04626 | TK | Erythroblastic leukemia viral oncogene homolog 2 |
| 2 | PI3K | p110a | P42336 | Atypical | Phosphatidylinositol-3-kinase |
| 3 | VEGFR2 | KDR | P35968 | TK | Vascular endothelial growth factor receptor 2 |
| 4 | BRAF | BRAF | P15056 | TKL | Rapidly accelerated fibrosarcoma isoform B |
| 5 | CDK2 | CDK2 | P24941 | CMGC | Cyclic-dependent kinase 2 |
| 6 | LCK | LCK | P06239 | TK | Lymphocyte-specific protein tyrosine kinase |
| 7 | MET | MET | P08581 | TK | Mesenchymal-epithelial transition factor |
| 8 | p38a | p38a | Q16539 | CMGC | p38 mitogen activated protein kinase alpha |


We will load this dataset in all downstream talktorials to assess kinase similarity from different perspectives:

- **Talktorial T024**: Kinase similarity based on KLIFS pocket sequence
- **Talktorial T025**: Kinase similarity based on KiSSim pocket structure
- **Talktorial T026**: Kinase similarity based on KLIFS interaction fingerprint
- **Talktorial T027**: Kinase similarity based on ligand promiscuity (ChEMBL bioactivity data)
- **Talktorial T028**: Compare kinase similarity measures from **Talktorials T024-T027**

## Appendix

### KinMap data

Several KinMap trees are shown in this notebook. The code below generates the KinMap CSV files to be uploaded to KinMap:
http://www.kinhub.org/kinmap

Note: PNG downloads do not seem to work anymore, thus download as SVG and convert to PNG in your terminal (Linux) via `convert -density 25 xxx.svg xxx.png` (SVG cannot be included in Jupyter notebooks out-of-the-box).

In [4]:
def format_for_kinmap(kinase_names, kinase_values, size_min=10, size_max=50):
    """
    Take kinase names and some associated values and generate a KinMap data file
    that will display values as circles of size [`size_min`, `size_max`].
    
    Parameters
    ----------
    kinase_names : list of str
        Kinase names.
    kinase_values : list of float
        Some associated values.
    size_min : int
        Minimum circle size on KinMap tree (minimum input value will be scaled to `size_min`).
    size_max : int
        Maximum circle size on KinMap tree (maximum input value will be scaled to `size_max`).
    
    Returns
    -------
    pandas.DataFrame
        KinMap data with columns `xName` (kinase name), `size` (circle size for KinMap tree).
    """
    
    data = pd.DataFrame({"xName": kinase_names, "values": kinase_values})
    min_ = data["values"].min()
    max_ = data["values"].max()
    # Min-max scale values to [size_min, size_max]
    data["size"] = data["values"].apply(
        lambda x: ((x - min_) / (max_ - min_) * (size_max - size_min)) + size_min
    )
    return data[["xName", "size"]]
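
The scaling that the docstring describes is a plain min-max normalization onto the interval [`size_min`, `size_max`]. As a sanity check, here is a small pandas-free sketch of that formula; the function name `scale_to_range` is hypothetical, and mapping all-equal inputs to `size_min` is an assumption made here to avoid a division by zero.

```python
def scale_to_range(values, size_min=10, size_max=50):
    """Min-max scale values to [size_min, size_max] (hypothetical helper)."""
    min_, max_ = min(values), max(values)
    if max_ == min_:
        # Assumption: with identical inputs there is no spread, so use the minimum size
        return [float(size_min)] * len(values)
    span = size_max - size_min
    return [size_min + (value - min_) / (max_ - min_) * span for value in values]


sizes = scale_to_range([1, 5, 9])
print(sizes)  # [10.0, 30.0, 50.0]
```

The smallest input maps to `size_min`, the largest to `size_max`, and everything else is placed linearly in between.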

#### Number of PDB structures per kinase

Generate data for number of structures per kinase in the KinMap format to be mapped onto the kinome tree.

In [5]:
from opencadd.databases.klifs import setup_remote

klifs = setup_remote()
structures_df = klifs.structures.all_structures()

# Get number of structures per kinase
n_structures_per_kinase = structures_df.groupby(
    ["structure.pdb_id", "kinase.klifs_name"]
).first().reset_index().groupby("kinase.klifs_name").size()

# Save in KinMap format
kinmap_n_structures_per_kinase = format_for_kinmap(n_structures_per_kinase.index, n_structures_per_kinase.values)
kinmap_n_structures_per_kinase.to_csv(DATA / "kinmap_n_structures_per_kinase.csv", index=False)
# Some kinases will not be resolved in KinMap and will be simply dropped
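
Since KLIFS splits PDB entries into monomers, one PDB ID can appear in several rows; this is why the cell above collapses rows per PDB ID and kinase before counting. The toy example below illustrates the difference between counting rows and counting PDB entries. All values are made up for illustration, and `nunique` is used here as an alternative way to count distinct PDB IDs per kinase.

```python
import pandas as pd

# Toy stand-in for the KLIFS structures table (made-up values, not real KLIFS data)
toy_structures = pd.DataFrame(
    {
        "structure.pdb_id": ["1abc", "1abc", "2xyz", "3def", "3def", "3def"],
        "kinase.klifs_name": ["EGFR", "EGFR", "EGFR", "CDK2", "CDK2", "CDK2"],
        "structure.chain": ["A", "B", "A", "A", "B", "C"],
    }
)

# Counting rows counts monomers (one row per chain), not PDB entries
n_monomers = toy_structures.groupby("kinase.klifs_name").size()

# Counting distinct PDB IDs per kinase gives the number of PDB entries
n_pdb_entries = toy_structures.groupby("kinase.klifs_name")["structure.pdb_id"].nunique()

print(n_monomers.to_dict())  # {'CDK2': 3, 'EGFR': 3}
print(n_pdb_entries.to_dict())  # {'CDK2': 1, 'EGFR': 2}
```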

#### Number of ChEMBL bioactivities per kinase

Generate data for number of ChEMBL bioactivities per kinase in the KinMap format to be mapped onto the kinome tree.

In [6]:
from opencadd.databases.klifs import setup_remote

# Get bioactivity data
path = "https://github.com/openkinome/kinodata/releases/download/v0.3/activities-chembl29_v0.3.zip"
data = pd.read_csv(path, index_col=None)
data = data[data["activities.standard_type"] == "pIC50"]
data = data.dropna()

# Get kinase data
klifs = setup_remote()
kinases_df = klifs.kinases.all_kinases()
kinases_df = kinases_df[kinases_df["kinase.uniprot"] != "0"]
# Some UniProt IDs have several names in KLIFS, keep only first
kinases_df = kinases_df.groupby("kinase.uniprot").first()

# Map UniProt ID > kinase KLIFS name
data = pd.merge(data, kinases_df, left_on="UniprotID", right_on="kinase.uniprot", how="left")

# Get number of activities per kinase
n_activities_per_kinase = data.groupby("kinase.klifs_name").size()

# Save in KinMap format
kinmap_n_activities_per_kinase = format_for_kinmap(n_activities_per_kinase.index, n_activities_per_kinase.values)
kinmap_n_activities_per_kinase.to_csv(DATA / "kinmap_n_activities_per_kinase.csv", index=False)
# Some kinases will not be resolved in KinMap and will be simply dropped
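
One detail of the mapping step above is worth spelling out: with `how="left"`, activities whose UniProt ID has no KLIFS entry are kept with a missing kinase name, and the subsequent `groupby` silently drops them. The toy example below sketches this behavior; the activity values and the UniProt ID `X00000` are made up for illustration.

```python
import pandas as pd

# Toy activities (made-up values); "X00000" is a hypothetical ID without a KLIFS entry
toy_activities = pd.DataFrame(
    {
        "UniprotID": ["P00533", "P00533", "P24941", "X00000"],
        "activities.standard_value": [7.1, 6.3, 8.0, 5.5],
    }
)
toy_kinases = pd.DataFrame(
    {"kinase.uniprot": ["P00533", "P24941"], "kinase.klifs_name": ["EGFR", "CDK2"]}
)

# Left merge keeps all activities; unmapped rows get NaN as kinase name
merged = pd.merge(
    toy_activities, toy_kinases, left_on="UniprotID", right_on="kinase.uniprot", how="left"
)
n_unmapped = int(merged["kinase.klifs_name"].isna().sum())

# groupby ignores NaN keys by default, so unmapped activities are not counted
toy_counts = merged.groupby("kinase.klifs_name").size()

print(n_unmapped)  # 1
print(toy_counts.to_dict())  # {'CDK2': 1, 'EGFR': 2}
```

Checking `n_unmapped` before counting gives a quick estimate of how many activities are lost in the mapping.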