# T027 · Kinase similarity: ligand-profile

Authors:

- Talia B. Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)

## Aim of this talktorial

The aim of this talktorial is to investigate kinase similarity through ligand profiling data. In the context of drug design, the following assumption can be made: if a compound is active against two different kinases, these two kinases are likely to share some degree of similarity. The concept of kinase promiscuity is also covered.

### Contents in *Theory*

* Kinase dataset
* Bioactivity data
* Kinase similarity descriptor: ligand-profile
    * Kinase similarity
    * Kinase promiscuity

### Contents in *Practical*

* Retrieve the data
* Preprocess the data
    * Kinases of interest
    * Hit or non-hit
* Kinase promiscuity
* Kinase similarity
    * Visualize similarity as kinase matrix
    * Save kinase similarity matrix
* Kinase distance matrix
    * Save kinase distance matrix

### References

* Kinase dataset: [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629) 
* [ChEMBL](https://www.ebi.ac.uk/chembl/) database

## Theory

### Kinase dataset

We use the nine kinases investigated in [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629). In that study, diverse kinase similarity measures were analyzed for different combinations of kinase on- and off-targets to explore the limits of multi-kinase screenings.

 

> We aggregated the investigated kinases in “profiles”. Profile 1 combined **EGFR** and **ErbB2** as targets and **BRAF** as a (general) anti-target. Out of similar considerations, Profile 2 consisted of EGFR and **PI3K** as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. Profile 3, comprised of EGFR and **VEGFR2** as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).
> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases **CDK2**, **LCK**, **MET** and **p38α** were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.

 

*Table 1:* 
Kinases used in this notebook, taken from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, kinase groups, and full unabbreviated names.

 

| Kinase                     | Synonyms               | UniProt ID | Group    | Full kinase name                                 |
|----------------------------|------------------------|------------|----------|--------------------------------------------------|
| EGFR                       | ErbB1                  | P00533     | TK       | Epidermal growth factor receptor                 |
| ErbB2                      | Her2                   | P04626     | TK       | Erythroblastic leukemia viral oncogene homolog 2 |
| PI3K                       | PIK3CA, p110a          | P42336     | Atypical | Phosphatidylinositol-3-kinase                    |
| VEGFR2                     | KDR                    | P35968     | TK       | Vascular endothelial growth factor receptor 2    |
| BRAF                       | -                      | P15056     | TKL      | Rapidly accelerated fibrosarcoma isoform B       |
| CDK2                       | -                      | P24941     | CMGC     | Cyclin-dependent kinase 2                        |
| LCK                        | -                      | P06239     | TK       | Lymphocyte-specific protein tyrosine kinase      |
| MET                        | -                      | P08581     | TK       | Mesenchymal-epithelial transition factor         |
| p38a                       | MAPK14                 | Q16539     | CMGC     | p38 mitogen activated protein kinase α           |

### Bioactivity data

To measure kinase similarity through ligand profiling data, we retrieve bioactivity data for human kinases from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. A curated kinase subset of ChEMBL 28 is freely available from the openkinome organization, see https://github.com/openkinome/kinodata.
For more details on querying the ChEMBL database, please refer to [Talktorial T001](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T001_query_chembl/talktorial.ipynb).

In drug design, it is common to binarize the activity of a compound against a target of interest as a "hit" or "non-hit". Practically speaking, this is done using a cutoff value on the measured activity. For pIC50 values, where a higher value means a more potent compound, a compound whose activity is at or above the cutoff is labeled as active (hit), and as inactive (non-hit) otherwise.

### Kinase similarity descriptor: ligand-profile

As a measure of similarity, we use ligand profiling data in this talktorial.

#### Kinase similarity

We use the following metric as similarity between kinases $K_i$ and $K_j$:

$$
\text{similarity}(K_i, K_j) = \frac{\#\text{ of compounds that were tested as actives on } K_i \text{ and } K_j}
{\#\text{ of compounds that were tested on both } K_i \text{ and } K_j}.
$$

Assuming that only one compound was tested on two kinases, and that the compound was tested as active for one and inactive for the other, then the similarity between these two kinases would be zero.
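
As a toy illustration of this metric, the following sketch counts the compounds tested on both kinases and those active on both. The kinases `K1` and `K2`, the compound labels, and the helper name `ligand_profile_similarity` are hypothetical and not part of the talktorial data.

```python
def ligand_profile_similarity(profile_i, profile_j):
    """Fraction of compounds active on both kinases among those tested on both.

    Each profile maps a compound identifier to 1 (active) or 0 (inactive).
    """
    tested_on_both = set(profile_i) & set(profile_j)
    if not tested_on_both:
        # No shared compounds: the similarity is undefined
        return float("nan")
    active_on_both = [
        compound for compound in tested_on_both
        if profile_i[compound] == 1 and profile_j[compound] == 1
    ]
    return len(active_on_both) / len(tested_on_both)


# Hypothetical binarized activities for two kinases
k1 = {"A": 1, "B": 1, "C": 0}
k2 = {"A": 1, "B": 0, "D": 1}

# Compounds A and B were tested on both kinases; only A is active on both
toy_similarity = ligand_profile_similarity(k1, k2)
# One shared compound, active on one kinase and inactive on the other
single_compound = ligand_profile_similarity({"X": 1}, {"X": 0})
print(toy_similarity, single_compound)
```

In this toy case the similarity is 1/2, and the single-compound case yields zero, as described above.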

#### Kinase promiscuity
Computing the similarity between a kinase and itself may be interpreted as kinase promiscuity: in that case, the similarity defined above reduces to the fraction of active compounds over all compounds tested against that kinase.

## Practical

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
from collections import Counter

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Retrieve the data

We retrieve a pre-curated version of a kinase subset of ChEMBL v.28 freely available at openkinome, see https://github.com/openkinome/kinodata/releases/tag/v0.2.

In [3]:
path = "https://github.com/openkinome/kinodata/releases/download/\
v0.2/activities-chembl28_v0.2.zip"
data = pd.read_csv(path, index_col=None)
print(f"Current shape of data: {data.shape}")
data.head()

Current shape of data: (186972, 17)


Unnamed: 0.1,Unnamed: 0,activities.activity_id,assays.chembl_id,target_dictionary.chembl_id,molecule_dictionary.chembl_id,molecule_dictionary.max_phase,activities.standard_type,activities.standard_value,activities.standard_units,compound_structures.canonical_smiles,compound_structures.standard_inchi,component_sequences.sequence,assays.confidence_score,docs.chembl_id,docs.year,docs.authors,UniprotID
0,96251,16291323,CHEMBL3705523,CHEMBL2973,CHEMBL3666724,0,pIC50,14.09691,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C31H33N7O3/c1-2-4-29(40)33-22-6-3-5-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
1,97672,16306943,CHEMBL3705523,CHEMBL2973,CHEMBL1968705,0,pIC50,14.0,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C31H33N7O2/c1-2-6-29(39)33-23-8-5-7-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
2,93879,16264754,CHEMBL3705523,CHEMBL2973,CHEMBL3666728,0,pIC50,14.0,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C34H40N8O3/c1-5-7-32(43)36-24-9-6-8-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
3,100714,16340050,CHEMBL3705523,CHEMBL2973,CHEMBL1997433,0,pIC50,13.958607,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C28H28N6O3/c1-3-5-26(35)30-20-7-4-6-1...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
4,95877,16287186,CHEMBL3705523,CHEMBL2973,CHEMBL3666721,0,pIC50,13.920819,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C32H35N7O2/c1-2-7-30(40)34-24-9-6-8-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116


### Preprocess the data

We look at the type of activity and the associated units.

In [4]:
print(f"Activities: {set(data['activities.standard_type'])}\n"
      f"Units: {set(data['activities.standard_units'])}")

Activities: {'pIC50', 'pKd', 'pKi'}
Units: {'nM'}


Let's keep the entries which have pIC50 values only.

In [5]:
data = data[data["activities.standard_type"] == "pIC50"]

In [6]:
data.columns

Index(['Unnamed: 0', 'activities.activity_id', 'assays.chembl_id',
       'target_dictionary.chembl_id', 'molecule_dictionary.chembl_id',
       'molecule_dictionary.max_phase', 'activities.standard_type',
       'activities.standard_value', 'activities.standard_units',
       'compound_structures.canonical_smiles',
       'compound_structures.standard_inchi', 'component_sequences.sequence',
       'assays.confidence_score', 'docs.chembl_id', 'docs.year',
       'docs.authors', 'UniprotID'],
      dtype='object')

The dataframe contains many columns that are not needed for the rest of the notebook, so we remove them.
Only the relevant information is kept, namely the canonical SMILES of the compound, the measured activity, and the UniProt ID of the kinase. The first two columns are renamed for readability.

In [7]:
data = data[["compound_structures.canonical_smiles",
             "activities.standard_value",
             "UniprotID"]]
data = data.rename(columns={"compound_structures.canonical_smiles": "smiles",
                            "activities.standard_value": "activity_value"})

In [8]:
print(f"Current shape of data: {data.shape}")
data.head()

Current shape of data: (159978, 3)


Unnamed: 0,smiles,activity_value,UniprotID
0,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,14.09691,O75116
1,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,14.0,O75116
2,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,14.0,O75116
3,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,13.958607,O75116
4,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,13.920819,O75116


We also drop NA values.

In [9]:
data = data.dropna()
print(f"Current shape of data: {data.shape}")

Current shape of data: (159823, 3)


#### Kinases of interest

We focus on the kinases of interest and map them to their [UniProt](https://www.uniprot.org/) IDs, using Table 1 from the theory.

In [10]:
name_to_uniprot = {'EGFR': 'P00533',
                   'ErbB2': 'P04626',
                   'BRAF': 'P15056',
                   'CDK2': 'P24941',
                   'LCK': 'P06239',
                   'MET': 'P08581',
                   'p38a': 'Q16539',
                   'KDR': 'P35968',
                   'p110a': 'P42336'}

We keep data for these kinases only:

In [11]:
data = data[data["UniprotID"].isin(name_to_uniprot.values())]
print(f"Current shape of data: {data.shape}")

Current shape of data: (33169, 3)


In [12]:
data.head()

Unnamed: 0,smiles,activity_value,UniprotID
57,Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1,11.522879,P00533
98,CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC,11.221849,P00533
101,CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1,11.221849,P00533
139,Brc1cccc(Nc2ncnc3cc4[nH]cnc4cc23)c1,11.09691,P00533
140,CNc1cc2c(Nc3cccc(Br)c3)ncnc2cn1,11.09691,P00533


Let's look at EGFR data:

In [13]:
EGFR_data = data[data["UniprotID"] == "P00533"]

Some compounds have been tested several times against EGFR, as shown below.

In [14]:
measured_compounds = Counter(EGFR_data["smiles"])
measured_compounds.most_common()[0:5]

[('COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1', 40),
 ('C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1', 24),
 ('C=CC(=O)Nc1cc(Nc2nccc(-c3cn(C)c4ccccc34)n2)c(OC)cc1N(C)CCN(C)C', 12),
 ('C=CC(=O)Nc1cccc(Oc2nc(Nc3ccc(N4CCN(C)CC4)cc3OC)ncc2Cl)c1', 10),
 ('CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n1c3ccccc3c3c4c(c5c6ccccc6n2c5c31)C(=O)NC4',
  7)]

As a simple workaround, for each kinase-compound pair we keep the best measured activity, i.e., the highest pIC50 value.

In [15]:
data = data.groupby(["UniprotID",
                     "smiles"])['activity_value'].max().reset_index()
data.head()

Unnamed: 0,UniprotID,smiles,activity_value
0,P00533,Br.CC(Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(C(...,5.336488
1,P00533,Br.CC(Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc2c...,5.996539
2,P00533,Br.CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1...,8.39794
3,P00533,Br.C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1c...,7.207608
4,P00533,Br.C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1c...,8.420216
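
Taking the maximum is only one possible choice. As a sketch on made-up values (the toy table below is hypothetical, and this comparison is not part of the talktorial workflow), the same `groupby` can report several aggregates side by side, which shows how much the choice can matter for repeatedly measured compounds:

```python
import pandas as pd

# Hypothetical repeated measurements against one kinase
toy = pd.DataFrame({
    "UniprotID": ["P00533"] * 5,
    "smiles": ["CCO", "CCO", "CCO", "CCN", "CCN"],
    "activity_value": [5.0, 6.0, 7.0, 6.5, 6.5],
})

# One row per kinase-compound pair, with three summary statistics
aggregated = (
    toy.groupby(["UniprotID", "smiles"])["activity_value"]
    .agg(["max", "median", "count"])
    .reset_index()
)
print(aggregated)
```

With a pIC50 cut-off of 6.3, the toy compound `CCO` would count as a hit under the maximum (7.0) but as a non-hit under the median (6.0).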


#### Hit or non-hit

Finally, we binarize the pIC50 values to obtain hit or non-hit using a cut-off. We use a pIC50 cut-off of $6.3$, as in [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629).
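
To relate this cut-off to a concentration: assuming the usual definition of pIC50 as the negative decadic logarithm of the IC50 in mol/L, a pIC50 of 6.3 corresponds to an IC50 of roughly 500 nM. The helper name below is hypothetical.

```python
def pic50_to_ic50_nm(pic50):
    """Convert pIC50 (negative log10 of IC50 in mol/L) to IC50 in nM."""
    # 1 mol/L = 1e9 nM, hence the exponent 9 - pIC50
    return 10 ** (9 - pic50)


cutoff_nm = pic50_to_ic50_nm(6.3)
print(f"A pIC50 of 6.3 corresponds to about {cutoff_nm:.0f} nM")
```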

In [16]:
cutoff = 6.3

In [17]:
def binarize_pic50(pic50_value, threshold):
    """
    Binarizes a scalar value given a threshold.

    Parameters
    ----------
    pic50_value : float
        The measurement pIC50 value of a kinase-ligand pair.
    threshold : float
        The cutoff to determine activity.

    Returns
    -------
    1 if the pIC50 value is greater than or equal to the threshold,
    which indicates activity. 0 otherwise.
    """
    if pic50_value >= threshold:
        return 1
    else:
        return 0

In [18]:
data["activity_binary"] = data["activity_value"].apply(binarize_pic50,
                                                       args=(cutoff, ))

In [19]:
print(f"Current shape of data: {data.shape}")
data.head()

Current shape of data: (32688, 4)


Unnamed: 0,UniprotID,smiles,activity_value,activity_binary
0,P00533,Br.CC(Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(C(...,5.336488,0
1,P00533,Br.CC(Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc2c...,5.996539,0
2,P00533,Br.CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1...,8.39794,1
3,P00533,Br.C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1c...,7.207608,1
4,P00533,Br.C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1c...,8.420216,1


### Kinase promiscuity

We now look at the kinase promiscuity.

For a given kinase, three values are computed:

1. the total number of measured compounds against the given kinase,
2. the number of active compounds against the kinase, and
3. the fraction of active compounds, i.e. the ratio of active compounds over the total number of measured compounds.

In [20]:
def kinase_to_activity_numbers(uniprot_id, activity_dataframe):
    """
    Retrieve the three values for a given kinase.

    Parameters
    ----------
    uniprot_id : str
        The uniprot id of the kinase of interest, e.g. "P00533" for "EGFR".

    activity_dataframe :  pd.DataFrame
        The dataframe with activity values for kinases.

    Returns
    -------
    tuple : (int, int, float)
        The three metrics:
        1. The total number of measured compounds against the kinase
        2. The number of active compounds against the kinase.
        3. The fraction of active compounds against the kinase.
    """
    kinase_data = activity_dataframe[activity_dataframe["UniprotID"]
                                     == uniprot_id]
    total_measured_compounds = len(kinase_data)
    active_compounds = len(kinase_data[kinase_data["activity_binary"] == 1])
    if total_measured_compounds > 0:
        fraction = active_compounds/total_measured_compounds
    else:
        print("No compounds were measured for this kinase.")
        fraction = np.nan
    return (total_measured_compounds, active_compounds, fraction)

Let's see what information we get for EGFR.

In [21]:
uniprot_id = name_to_uniprot["EGFR"]
EGFR_metrics = kinase_to_activity_numbers(uniprot_id, data)
print(f"Total number of measured compounds: \t"
      f"{EGFR_metrics[0]} \n"
      f"Number of active compounds: \t\t"
      f"{EGFR_metrics[1]} \n"
      f"Fraction of active compounds: \t\t"
      f"{EGFR_metrics[2]:.2f} \n")

Total number of measured compounds: 	5869 
Number of active compounds: 		3574 
Fraction of active compounds: 		0.61 



Let's create a table from these values for all kinases:

In [22]:
def from_numbers_to_table(activity_dataframe):
    """
    Create a table with all three values for all kinases.

    Parameters
    ----------
    activity_dataframe :  pd.DataFrame
        The dataframe with activity values for kinases.

    Returns
    -------
    promiscuity_table : pd.DataFrame
        A dataframe with the kinases as rows and values as columns.
    """
    promiscuity_table = pd.DataFrame(index=name_to_uniprot.keys(),
                                     columns=["total", "actives", "fraction"])
    for name, uniprot_id in name_to_uniprot.items():
        values = kinase_to_activity_numbers(uniprot_id, activity_dataframe)
        promiscuity_table.loc[name] = values
    return promiscuity_table

In [23]:
kinase_promiscuity_table = from_numbers_to_table(data)
kinase_promiscuity_table

Unnamed: 0,total,actives,fraction
EGFR,5869,3574,0.608962
ErbB2,1694,1025,0.605077
BRAF,3682,2988,0.811515
CDK2,1505,819,0.544186
LCK,1537,930,0.605075
MET,2821,2235,0.792272
p38a,3609,2760,0.764755
KDR,7624,5312,0.696747
p110a,4347,2780,0.639522
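
The same three values can also be obtained in one pass with a `groupby`. This is a sketch on a hypothetical toy table (the IDs `K1` and `K2` are placeholders); it relies on the fact that the mean of a 0/1 column equals the fraction of ones.

```python
import pandas as pd

# Hypothetical binarized activities for two kinases
toy = pd.DataFrame({
    "UniprotID": ["K1", "K1", "K1", "K1", "K2", "K2"],
    "activity_binary": [1, 1, 0, 1, 0, 1],
})

promiscuity = toy.groupby("UniprotID")["activity_binary"].agg(
    total="count",    # number of measured compounds
    actives="sum",    # number of hits
    fraction="mean",  # hits divided by measured compounds
)
print(promiscuity)
```

Unlike the row-by-row construction, this variant yields numeric column types directly.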


Let's beautify the table:

In [24]:
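# The table was filled row by row and therefore has object dtype;
# cast to numeric types so that the Styler reductions work.
kinase_promiscuity_table = kinase_promiscuity_table.astype(
    {"total": int, "actives": int, "fraction": float})
kinase_promiscuity_table.style.\
    format("{:.3f}", subset=["fraction"]).\
    background_gradient(cmap='Purples', subset=["fraction"]).\
    highlight_min(color="yellow", axis=None, subset=["fraction"]).\
    highlight_max(color="red", subset=["fraction"])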

<pandas.io.formats.style.Styler at 0x7fd9e5ed4be0>

From the table, we notice that CDK2 and BRAF are the least and most promiscuous kinases, respectively.

## Kinase similarity

We now investigate how we can use the similarity measure discussed in the theory to compare kinases. 

In [25]:
def similarity_ligand_profile(uniprot_id1, uniprot_id2, activity_dataframe):
    """
    Compute the similarity between two kinases using ligand profiling.

    Parameters
    ----------
    uniprot_id1 : str
        Uniprot id of first kinase of interest.
    uniprot_id2 : str
        Uniprot id of second kinase of interest.

    activity_dataframe :  pd.DataFrame
        The dataframe with activity values for kinases.

    Returns
    -------
    fraction : float
        The metric for kinase similarity,
        i.e. number of active compounds on both kinases
        over number of measured compounds on both kinases.
    """
    if uniprot_id1 == uniprot_id2:
        (total_compounds,
         active_compounds,
         fraction) = kinase_to_activity_numbers(uniprot_id1,
                                                activity_dataframe)
        return fraction
    else:
        # Data for the two kinases only
        reduced_data = activity_dataframe[activity_dataframe
                                          ["UniprotID"].isin([uniprot_id1,
                                                              uniprot_id2])]

        # Look at active compounds only
        active_entries = reduced_data[reduced_data["activity_binary"] == 1]
        # Group by compounds
        compounds = active_entries.groupby("smiles").size()
        # Look at the number of active compounds measured on both kinases
        active_compounds_on_both = compounds[compounds == 2].shape

        # Look at all tested compounds
        compounds = reduced_data.groupby("smiles").size()
        # Look at the number of compounds measured on both kinases
        measured_compounds_on_both = compounds[compounds == 2].shape

        if measured_compounds_on_both[0] > 0:
            fraction = (active_compounds_on_both[0] /
                        measured_compounds_on_both[0])
        else:
            print(f"No compounds were measured on both kinases, "
                  f"namely {uniprot_id1} and {uniprot_id2}.")
            fraction = np.nan
        return fraction

Let's look at the similarity between EGFR and MET:

In [26]:
uniprot_id1 = name_to_uniprot["EGFR"]
uniprot_id2 = name_to_uniprot["MET"]
similarity_EGFR_MET = similarity_ligand_profile(uniprot_id1, uniprot_id2, data)
print(f"Ligand profile similarity between EGFR and MET: "
      f"{similarity_EGFR_MET:.2f}.")

Ligand profile similarity between EGFR and MET: 0.23.


### Visualize similarity as kinase matrix

In [27]:
kinase_similarity_matrix = np.zeros((len(name_to_uniprot),
                                     len(name_to_uniprot)))
for i, kinase_name1 in enumerate(name_to_uniprot):
    for j, kinase_name2 in enumerate(name_to_uniprot):
        uniprot_id1 = name_to_uniprot[kinase_name1]
        uniprot_id2 = name_to_uniprot[kinase_name2]
        kinase_similarity_matrix[i, j] = similarity_ligand_profile(
            uniprot_id1,
            uniprot_id2,
            data)

No compounds were measured on both kinases, namely P04626 and P42336.
No compounds were measured on both kinases, namely P42336 and P04626.
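
The messages above report UniProt IDs rather than kinase names. A reverse lookup, sketched below with the mapping copied from the earlier cell (the name `uniprot_to_name` is an addition for illustration), translates them back:

```python
# Mapping copied from the "Kinases of interest" cell
name_to_uniprot = {'EGFR': 'P00533',
                   'ErbB2': 'P04626',
                   'BRAF': 'P15056',
                   'CDK2': 'P24941',
                   'LCK': 'P06239',
                   'MET': 'P08581',
                   'p38a': 'Q16539',
                   'KDR': 'P35968',
                   'p110a': 'P42336'}

# Invert the dictionary: UniProt ID -> kinase name
uniprot_to_name = {uid: name for name, uid in name_to_uniprot.items()}

missing_pair = [uniprot_to_name[uid] for uid in ("P04626", "P42336")]
print(missing_pair)
```

The pair without shared compounds is therefore ErbB2 and p110a.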


In [28]:
kinase_similarity_matrix = pd.DataFrame(data=kinase_similarity_matrix,
                                        index=name_to_uniprot.keys(),
                                        columns=name_to_uniprot.keys())
kinase_similarity_matrix

Unnamed: 0,EGFR,ErbB2,BRAF,CDK2,LCK,MET,p38a,KDR,p110a
EGFR,0.608962,0.556918,0.419355,0.116279,0.238462,0.225806,0.315789,0.348214,0.071429
ErbB2,0.556918,0.605077,0.25,0.153846,0.176471,0.035714,0.125,0.39779,
BRAF,0.419355,0.25,0.811515,0.066667,0.511628,0.111111,0.644444,0.74812,0.166667
CDK2,0.116279,0.153846,0.066667,0.544186,0.166667,0.086957,0.083333,0.601695,0.266667
LCK,0.238462,0.176471,0.511628,0.166667,0.605075,0.287879,0.489362,0.432692,0.0
MET,0.225806,0.035714,0.111111,0.086957,0.287879,0.792272,0.047619,0.539359,0.0
p38a,0.315789,0.125,0.644444,0.083333,0.489362,0.047619,0.764755,0.5,0.0
KDR,0.348214,0.39779,0.74812,0.601695,0.432692,0.539359,0.5,0.696747,0.180791
p110a,0.071429,,0.166667,0.266667,0.0,0.0,0.0,0.180791,0.639522


In [29]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

Unnamed: 0,EGFR,ErbB2,BRAF,CDK2,LCK,MET,p38a,KDR,p110a
EGFR,0.609,0.557,0.419,0.116,0.238,0.226,0.316,0.348,0.071
ErbB2,0.557,0.605,0.25,0.154,0.176,0.036,0.125,0.398,
BRAF,0.419,0.25,0.812,0.067,0.512,0.111,0.644,0.748,0.167
CDK2,0.116,0.154,0.067,0.544,0.167,0.087,0.083,0.602,0.267
LCK,0.238,0.176,0.512,0.167,0.605,0.288,0.489,0.433,0.0
MET,0.226,0.036,0.111,0.087,0.288,0.792,0.048,0.539,0.0
p38a,0.316,0.125,0.644,0.083,0.489,0.048,0.765,0.5,0.0
KDR,0.348,0.398,0.748,0.602,0.433,0.539,0.5,0.697,0.181
p110a,0.071,,0.167,0.267,0.0,0.0,0.0,0.181,0.64


Note that the diagonal contains the previously discussed promiscuity values.

As mentioned above, no compounds were measured on both ErbB2 and p110a, which results in an `np.nan` entry that can be problematic for downstream algorithms.

As a simple workaround, we fill the NA values with zero, i.e., we treat this untested pair as if it had no similarity.

In [30]:
kinase_similarity_matrix = kinase_similarity_matrix.fillna(0)

kinase_similarity_matrix.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

Unnamed: 0,EGFR,ErbB2,BRAF,CDK2,LCK,MET,p38a,KDR,p110a
EGFR,0.609,0.557,0.419,0.116,0.238,0.226,0.316,0.348,0.071
ErbB2,0.557,0.605,0.25,0.154,0.176,0.036,0.125,0.398,0.0
BRAF,0.419,0.25,0.812,0.067,0.512,0.111,0.644,0.748,0.167
CDK2,0.116,0.154,0.067,0.544,0.167,0.087,0.083,0.602,0.267
LCK,0.238,0.176,0.512,0.167,0.605,0.288,0.489,0.433,0.0
MET,0.226,0.036,0.111,0.087,0.288,0.792,0.048,0.539,0.0
p38a,0.316,0.125,0.644,0.083,0.489,0.048,0.765,0.5,0.0
KDR,0.348,0.398,0.748,0.602,0.433,0.539,0.5,0.697,0.181
p110a,0.071,0.0,0.167,0.267,0.0,0.0,0.0,0.181,0.64


### Save kinase similarity matrix

In [31]:
kinase_similarity_matrix.to_csv(DATA /
                                "kinase_similarity_matrix_ligand_profile.csv")

## Kinase distance matrix

To apply a clustering algorithm to the kinases, we typically need to start from a distance matrix. The similarity matrix above is not a distance matrix: for example, its diagonal elements are not zero.

Since all entries are between $0$ and $1$, the similarity matrix $SM$ can be converted to a distance matrix $DM$ using $$ DM = 1-SM.$$

In [32]:
print(f"The values of the similarity matrix lie between: "
      f"{kinase_similarity_matrix.min().min():.2f}"
      f" and {kinase_similarity_matrix.max().max():.2f}")

The values of the similarity matrix lie between: 0.00 and 0.81


In [33]:
kinase_distance_matrix = 1 - kinase_similarity_matrix

Finally, we set the diagonal values to $0$ and we obtain the kinase distance matrix:

In [34]:
np.fill_diagonal(kinase_distance_matrix.values, 0)

In [35]:
kinase_distance_matrix.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

Unnamed: 0,EGFR,ErbB2,BRAF,CDK2,LCK,MET,p38a,KDR,p110a
EGFR,0.0,0.443,0.581,0.884,0.762,0.774,0.684,0.652,0.929
ErbB2,0.443,0.0,0.75,0.846,0.824,0.964,0.875,0.602,1.0
BRAF,0.581,0.75,0.0,0.933,0.488,0.889,0.356,0.252,0.833
CDK2,0.884,0.846,0.933,0.0,0.833,0.913,0.917,0.398,0.733
LCK,0.762,0.824,0.488,0.833,0.0,0.712,0.511,0.567,1.0
MET,0.774,0.964,0.889,0.913,0.712,0.0,0.952,0.461,1.0
p38a,0.684,0.875,0.356,0.917,0.511,0.952,0.0,0.5,1.0
KDR,0.652,0.602,0.252,0.398,0.567,0.461,0.5,0.0,0.819
p110a,0.929,1.0,0.833,0.733,1.0,1.0,1.0,0.819,0.0
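
As a sanity check, one can verify that a matrix built this way has the basic properties expected of a distance matrix. The sketch below uses a hypothetical 3x3 similarity matrix and a hypothetical helper name; it checks only symmetry, a zero diagonal, and non-negativity, not the triangle inequality.

```python
import numpy as np


def similarity_to_distance(similarity):
    """Convert a similarity matrix with entries in [0, 1] to distances."""
    distance = 1.0 - np.asarray(similarity, dtype=float)
    # A kinase has distance zero to itself
    np.fill_diagonal(distance, 0.0)
    return distance


# Hypothetical symmetric similarity matrix for three kinases
toy_similarity = np.array([
    [0.6, 0.5, 0.0],
    [0.5, 0.7, 0.2],
    [0.0, 0.2, 0.8],
])
toy_distance = similarity_to_distance(toy_similarity)

is_symmetric = np.allclose(toy_distance, toy_distance.T)
has_zero_diagonal = bool(np.all(np.diag(toy_distance) == 0))
is_non_negative = bool(np.all(toy_distance >= 0))
print(is_symmetric, has_zero_diagonal, is_non_negative)
```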


### Save kinase distance matrix

In [36]:
kinase_distance_matrix.to_csv(DATA /
                              "kinase_distance_matrix_ligand_profile.csv")

## Discussion

In this talktorial, we investigate how activity data can be used as a measure of similarity between kinases. The fraction of compounds active on both kinases over the number of compounds measured on both is one way of assessing their similarity. Moreover, using the same rationale, the promiscuity of a single kinase can be quantified as the ratio of active compounds over measured compounds.

The kinase distance matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the ligand profile perspective we have talked about in this talktorial.

## Quiz

1. Is there an optimal way to deal with multiple kinase-ligand measurements?
2. Can promiscuity be fairly compared between two kinases if one has been tested against many compounds whereas the other only against very few? 
3. Using the similarity described in this talktorial, what does it mean that two kinases have a similarity of $0$, as is the case for p110a and LCK?