# T024 · Kinase similarity: sequence

Authors:

- Talia B. Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)

## Aim of this talktorial

In this talktorial, we investigate sequence similarity for kinases of interest. The KLIFS API is used to retrieve the 85-residue pocket sequence for each kinase.

Two similarity measures are implemented:
   1. Sequence identity, i.e., the fraction of positions at which two sequences carry the same character.
   2. Sequence similarity, i.e., a score based on a substitution matrix, which reflects similarities between amino acids.

### Contents in *Theory*

* Kinase dataset
* Kinase similarity descriptor: sequence
    * Identity score
    * Substitution score

### Contents in *Practical*

* Define the kinases of interest
* Retrieve sequences from KLIFS
* Sequence similarity
    * Identity score
    * Substitution score
* Kinase comparison
* Visualize kinase similarity matrix
* Save kinase distance matrix

### References

* Kinase dataset: [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629) 
* KLIFS
  * KLIFS URL: https://klifs.net/
  * KLIFS database: [<i>Nucleic Acids Res.</i> (2020), <b>49(D1)</b>, D562-D569](https://doi.org/10.1093/nar/gkaa895)
* Substitution matrix: [<i>PNAS</i> (1992), <b>89(22)</b>, 10915-10919](https://doi.org/10.1073/pnas.89.22.10915)

## Theory

### Kinase dataset

We will use nine kinases as investigated in [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629). In that study, diverse kinase similarity measures were analyzed for different combinations of kinase on- and off-targets to explore the limits of multi-kinase screenings.
 

> We aggregated the investigated kinases in “profiles”. Profile 1 combined **EGFR** and **ErbB2** as targets and **BRAF** as a (general) anti-target. Out of similar considerations, Profile 2 consisted of EGFR and **PI3K** as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. Profile 3, comprised of EGFR and **VEGFR2** as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).
> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases **CDK2**, **LCK**, **MET** and **p38α** were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.

 

*Table 1:*
Kinases used in this notebook, taken from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, kinase groups, and full unabbreviated names.

 

| Kinase                     | Synonyms               | UniProt ID | Group    | Full kinase name                                 |
|----------------------------|------------------------|------------|----------|--------------------------------------------------|
| EGFR                       | ErbB1                  | P00533     | TK       | Epidermal growth factor receptor                 |
| ErbB2                      | Her2                   | P04626     | TK       | Erythroblastic leukemia viral oncogene homolog 2 |
| PI3K                       | PIK3CA, p110a          | P42336     | Atypical | Phosphatidylinositol-3-kinase                    |
| VEGFR2                     | KDR                    | P35968     | TK       | Vascular endothelial growth factor receptor 2    |
| BRAF                       | -                      | P15056     | TKL      | Rapidly accelerated fibrosarcoma isoform B       |
| CDK2                       | -                      | P24941     | CMGC     | Cyclin-dependent kinase 2                        |
| LCK                        | -                      | P06239     | TK       | Lymphocyte-specific protein tyrosine kinase      |
| MET                        | -                      | P08581     | TK       | Mesenchymal-epithelial transition factor         |
| p38a                       | MAPK14                 | Q16539     | CMGC     | p38 mitogen activated protein kinase α           |

### Kinase similarity descriptor: sequence

In this talktorial, the KLIFS pocket sequence is used for two main reasons:
1. The sequence is of fixed length (it contains 85 residues), which makes the pairwise comparison of two sequences straightforward since no alignment step is needed.
2. The binding pocket is where the action takes place. Why consider the full kinase sequence when an 85-residue sequence contains most of the relevant information?

We now describe two ways to compare pocket sequences.

#### Identity score
A simple way of assessing the similarity between two sequences is to use the so-called identity score.
First, a match vector is created: for each position, it records whether the characters from the two sequences are identical. If they are, the entry is set to $1$, and to $0$ otherwise.

The identity score is computed by summing the elements in the match vector and normalizing the sum by the length of the sequence, which, in the case of the KLIFS pocket sequence, is $85$.

Let's consider the identity matrix $I$ below:

|       | A     | C     | D     | E     | ... |
|-------|-------|-----  |-----  |-----  |-----|
| **A** | **1** | 0     | 0     | 0     | ... |
| **C** | 0     | **1** | 0     | 0     | ... |
| **D** | 0     | 0     | **1** | 0     | ... |
| **E** | 0     | 0     | 0     | **1** | ... |
| ...   | ...   | ...   | ...   | ...   | ... |


and let $M = len(K_i) = len(K_j)$.

We use the following as similarity between kinases $K_i$ and $K_j$:

$$
\text{similarity}(K_i, K_j) = \frac{1}{M} \sum_{n=1}^{M} I (K_i[n], K_j[n]),
$$

where $K_i[n]$ represents the amino acid at position $n$ of kinase $i$.
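
As a small illustration of the formula above, the following sketch computes the identity score for two toy sequences of length four. The function name `identity_similarity` and the toy sequences are assumptions made for this example only; they are not part of the talktorial's code.

```python
def identity_similarity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have the same length.")
    # Match vector: 1 where the characters are identical, 0 otherwise
    match_vector = [1 if a == b else 0 for a, b in zip(seq_a, seq_b)]
    # Normalize the number of matches by the sequence length M
    return sum(match_vector) / len(seq_a)


# Toy sequences (hypothetical): they differ only at the third position
toy_score = identity_similarity("ACDE", "ACEE")
print(toy_score)  # 0.75
```

Three of the four positions match, hence the score of $3/4$. Comparing a sequence with itself gives $1$.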

#### Substitution score
Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.

The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids.
In this talktorial, we use the BLOSUM substitution matrix [<i>PNAS</i> (1992), <b>89(22)</b>, 10915-10919](https://doi.org/10.1073/pnas.89.22.10915), implemented in `biotite`.

An excerpt of the BLOSUM62 substitution matrix $SM$ is shown below (the full matrix will be displayed in the practical part):

|       | A     | C     | D     | E     | ... |
|-------|-------|-----  |-----  |-----  |-----|
| **A** | 4     | 0     | -2    | -1    | ... |
| **C** | 0     | 9     | -3    | -4    | ... |
| **D** | -2    | -3    | 6     | 2     | ... |
| **E** | -1    | -4    | 2     | 5     | ... |
| ...   | ...   | ...   | ...   | ...   | ... |

We use the following as similarity between kinases $K_i$ and $K_j$:

$$
\text{similarity}(K_i, K_j) = \frac{1}{\sum_i SM[i,i]} \sum_{n=1}^{M} SM\big(K_i[n], K_j[n]\big),
$$

where $K_i[n]$ represents the amino acid at position $n$ of kinase $i$ and $M = len(K_i) = len(K_j)$.

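For computational reasons, we will use a normalized version of this matrix, i.e., all entries are rescaled to lie between $0$ and $1$. In the practical part, each row of the matrix is min-max scaled and the summed scores are divided by the sequence length $M$.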
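
To make the normalization concrete, here is a minimal sketch that applies row-wise min-max scaling to the 4x4 excerpt shown in the table above and scores two toy sequences. The helper names (`minmax_rows`, `toy_substitution_similarity`) and the toy sequences are assumptions for illustration. Note that scaling only the excerpt gives different numbers than scaling the full matrix, so the values are not comparable to those in the practical part.

```python
# 4x4 excerpt of the substitution matrix (rows/columns: A, C, D, E)
symbols = ["A", "C", "D", "E"]
excerpt = [
    [4, 0, -2, -1],
    [0, 9, -3, -4],
    [-2, -3, 6, 2],
    [-1, -4, 2, 5],
]


def minmax_rows(matrix):
    """Scale every row independently to the range [0, 1]."""
    scaled = []
    for row in matrix:
        low, high = min(row), max(row)
        scaled.append([(value - low) / (high - low) for value in row])
    return scaled


def toy_substitution_similarity(seq_a, seq_b, scaled, symbols):
    """Average the scaled scores over all positions of two sequences."""
    index = {symbol: i for i, symbol in enumerate(symbols)}
    scores = [scaled[index[a]][index[b]] for a, b in zip(seq_a, seq_b)]
    return sum(scores) / len(scores)


scaled_excerpt = minmax_rows(excerpt)
forward = toy_substitution_similarity("ACDE", "ACEE", scaled_excerpt, symbols)
backward = toy_substitution_similarity("ACEE", "ACDE", scaled_excerpt, symbols)
print(f"{forward:.3f} {backward:.3f}")  # 0.889 0.917
```

Because each row is scaled with its own minimum and maximum, the scaled matrix is in general no longer symmetric: the D-to-E entry becomes $5/9$ while the E-to-D entry becomes $6/9$. The score therefore depends slightly on the order of the two sequences.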

## Practical

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
import requests
import matplotlib.pyplot as plt
import biotite.sequence.align as align
from sklearn import preprocessing

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Define the kinases of interest

We start by listing the kinases of interest.

In [3]:
query_kinases = ['EGFR',
                 'ErbB2',
                 'BRAF',
                 'CDK2',
                 'LCK',
                 'MET',
                 'p38a',
                 'KDR',
                 'p110a']

### Retrieve sequences from KLIFS

We use the KLIFS API to retrieve the 85-residue pocket sequence for each kinase.

In [4]:
def klifs_pocket_sequence(kinase_name):
    """
    Retrieves the pocket sequence from KLIFS using the API.

    Parameters
    ----------
    kinase_name : str
        The name of the kinase of interest.

    Returns
    -------
    str :
        The 85 residues pocket sequence from KLIFS,
        if the kinase name is valid, None otherwise.
    """
    response = requests.get(f"https://klifs.net/api/"
                            f"kinase_ID?kinase_name={kinase_name}"
                            f"&species=HUMAN")

    if response.status_code == 200:
        return response.json()[0]['pocket']
    else:
        print(f'KLIFS failed for kinase {kinase_name}')
        return None

Let's look at these pocket sequences.

In [5]:
kinase_sequences = {}
for kinase in query_kinases:
    kinase_sequences[kinase] = klifs_pocket_sequence(kinase)
kinase_sequences

{'EGFR': 'KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA',
 'ErbB2': 'KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA',
 'BRAF': 'QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA',
 'CDK2': 'EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA',
 'LCK': 'ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA',
 'MET': 'EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA',
 'p38a': 'SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA',
 'KDR': 'KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA',
 'p110a': 'CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF'}
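
Before comparing sequences, it can be useful to check that the retrieved pockets have the expected fixed length and to see where KLIFS reports gaps (`-`). The sketch below does this for two of the sequences printed above, copied here so that the example runs without network access. The helper `check_pocket` and the set of allowed characters are assumptions for this illustration, not part of KLIFS or of the talktorial's code.

```python
POCKET_LENGTH = 85  # fixed length of the KLIFS pocket sequence
# Assumption: 20 standard amino acids plus the KLIFS gap symbol "-"
ALLOWED_CHARACTERS = set("ACDEFGHIKLMNPQRSTVWY-")

# Two pocket sequences copied from the output above
example_sequences = {
    "EGFR": "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA",
    "CDK2": "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA",
}


def check_pocket(sequence):
    """Summarize length, gaps, and unexpected characters of a pocket sequence."""
    return {
        "has_expected_length": len(sequence) == POCKET_LENGTH,
        "gap_count": sequence.count("-"),
        "unexpected_characters": sorted(set(sequence) - ALLOWED_CHARACTERS),
    }


report = {name: check_pocket(seq) for name, seq in example_sequences.items()}
for name, result in report.items():
    print(name, result)
```

Both sequences contain 85 characters. EGFR has no gap, whereas CDK2 has one gap position, which is later replaced by `*` before scoring.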

### Sequence similarity

Given two kinases, we create functions which account for identity or substitution similarity, as described in the theory.

#### Identity score
We first define a function which compares element-wise characters in two sequences.

In [6]:
def identity_score(sequence1, sequence2):
    """
    Computes the element-wise binary similarity between two sequences.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.

    Returns
    -------
    np.array :
        The bool array for each character.
        True if the elements are identical,
        False otherwise.
    """
    # True if the character is the same, False otherwise.
    # np.char.compare_chararrays is used here because NumPy 2.0
    # removed compare_chararrays from the main namespace.
    return np.char.compare_chararrays(sequence1,
                                      sequence2,
                                      cmp="==",
                                      rstrip=True)

#### Substitution score
We now define a function that accounts for similarities between amino acids, using the `biotite` library to retrieve the BLOSUM substitution matrix.

The substitution matrix can be retrieved from `biotite` using the following command:

In [7]:
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   

Check for symmetry:

In [8]:
substitution_matrix.is_symmetric()

True

In [9]:
def substitution_score(sequence1,
                       sequence2,
                       substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()):
    """
    Retrieve the match score given the substitution matrix.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.
    substitution_matrix : biotite.sequence.align.SubstitutionMatrix
        A substitution matrix specific to amino acids.
        The default is align.SubstitutionMatrix.std_protein_matrix()
        from biotite, which represents BLOSUM62.

    Returns
    -------
    np.array :
        The vector of match scores
        using the normalized substitution matrix.
    """
    # Retrieve np.array from substitution matrix
    score_matrix = substitution_matrix.score_matrix()

    # Min-max scale each row (axis=1) so that all values lie in [0, 1]
    normalized_score_matrix = preprocessing.minmax_scale(score_matrix,
                                                         feature_range=(0, 1),
                                                         axis=1)

    # Retrieve the letters (amino acids)
    letter_alphabet = substitution_matrix.get_alphabet1()

    # Map each letter to its row/column index in the matrix
    dict_letters = {}
    for i, letter in enumerate(letter_alphabet.get_symbols()):
        dict_letters[letter] = i

    # Look up the normalized score for each pair of aligned characters
    match_score = np.zeros(len(sequence1))
    for i, (character_seq1, character_seq2) in enumerate(zip(sequence1,
                                                             sequence2)):
        ind1 = dict_letters[character_seq1]
        ind2 = dict_letters[character_seq2]
        match_score[i] = normalized_score_matrix[ind1, ind2]
    return match_score

### Kinase comparison

Given two kinases, we create a function which computes the similarity between their pocket sequences using one of the two measures, identity or substitution.

In [10]:
def sequence_similarity(sequence_1, sequence_2, type_="identity"):
    """
    Compares two sequences using a given metric.

    Parameters
    ----------
    sequence_1, sequence_2 : str
        The two pocket sequences to compare.
    type_ : str
        The type of metric to compute the similarity,
        either `identity` or `substitution`.
        The default is `identity`.

    Returns
    -------
    float :
        The similarity between the two pocket sequences,
        None if the lengths differ or the metric is unknown.
    """
    # Replace possible unavailable residue
    # noted in KLIFS with "-"
    # by the symbol "*" for biotite
    sequence_1 = sequence_1.replace("-", "*")
    sequence_2 = sequence_2.replace("-", "*")

    if len(sequence_1) != len(sequence_2):
        print("Mismatch in sequence lengths.")
        return None
    else:
        seq_array1 = np.array(list(sequence_1))
        seq_array2 = np.array(list(sequence_2))

        if type_ == "identity":
            is_match = identity_score(seq_array1, seq_array2)
            similarity_normed = sum(is_match)/len(sequence_1)
        elif type_ == "substitution":
            match_score = substitution_score(seq_array1, seq_array2)
            similarity_normed = sum(match_score)/len(sequence_1)
        else:
            print(f"Type {type_} not defined.")
            return None

        return similarity_normed

Let's look at the sequence similarity between EGFR and MET:

In [11]:
EGFR_MET_seq_similarity = sequence_similarity(kinase_sequences["EGFR"],
                                              kinase_sequences["MET"],
                                              "identity")
for key in ('EGFR','MET'):
    print(f"{key:5s}:{kinase_sequences[key]}")
print(f"Pocket sequence similarity between EGFR and MET kinases: "
      f"{EGFR_MET_seq_similarity:.2f} using identity.")

EGFR :KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
MET  :EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.


In [12]:
EGFR_MET_seq_similarity = sequence_similarity(kinase_sequences["EGFR"],
                                              kinase_sequences["MET"],
                                              "substitution")
print(f"Pocket sequence similarity between EGFR and MET kinases: "
      f"{EGFR_MET_seq_similarity:.2f} using substitution.")

Pocket sequence similarity between EGFR and MET kinases: 0.72 using substitution.


We can also look at self-similarity:

In [13]:
EGFR_seq_similarity = sequence_similarity(kinase_sequences["EGFR"],
                                          kinase_sequences["EGFR"])
print(f"Pocket sequence similarity of EGFR with itself: "
      f"{EGFR_seq_similarity:.2f} using identity.")

Pocket sequence similarity of EGFR with itself: 1.00 using identity.


In [14]:
EGFR_seq_similarity = sequence_similarity(kinase_sequences["EGFR"],
                                          kinase_sequences["EGFR"],
                                          type_="substitution")
print(f"Pocket sequence similarity of EGFR with itself: "
      f"{EGFR_seq_similarity:.2f} using substitution.")

Pocket sequence similarity of EGFR with itself: 1.00 using substitution.


As expected, the similarity between a kinase and itself yields the highest possible score, namely $1$.

### Visualize kinase similarity matrix

We compute the pairwise similarity matrix for all kinases using the substitution score and visualize it:

In [15]:
similarity_measure = "substitution"

In [16]:
kinase_similarity_matrix = np.zeros((len(query_kinases), len(query_kinases)))
for i, kinase_name1 in enumerate(query_kinases):
    for j, kinase_name2 in enumerate(query_kinases):
        kinase_similarity_matrix[i, j] = sequence_similarity(kinase_sequences[kinase_name1],
                                                             kinase_sequences[kinase_name2],
                                                             type_=similarity_measure)

In [17]:
kinase_similarity_matrix_df = pd.DataFrame(data=kinase_similarity_matrix,
                                           index=query_kinases,
                                           columns=query_kinases)
kinase_similarity_matrix_df

|           | EGFR     | ErbB2    | BRAF     | CDK2     | LCK      | MET      | p38a     | KDR      | p110a    |
|-----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| **EGFR**  | 1.0      | 0.940995 | 0.658852 | 0.650192 | 0.710442 | 0.715203 | 0.644984 | 0.715307 | 0.429135 |
| **ErbB2** | 0.941257 | 1.0      | 0.659334 | 0.632521 | 0.685368 | 0.702937 | 0.63682  | 0.701761 | 0.416966 |
| **BRAF**  | 0.653021 | 0.651452 | 1.0      | 0.646301 | 0.670291 | 0.639468 | 0.637391 | 0.667424 | 0.437786 |
| **CDK2**  | 0.648024 | 0.629135 | 0.649093 | 1.0      | 0.679931 | 0.658547 | 0.724206 | 0.652563 | 0.425303 |
| **LCK**   | 0.713619 | 0.685906 | 0.677195 | 0.684067 | 1.0      | 0.696126 | 0.665214 | 0.689147 | 0.454323 |
| **MET**   | 0.709202 | 0.694562 | 0.638929 | 0.65551  | 0.687189 | 1.0      | 0.628056 | 0.711115 | 0.392443 |
| **p38a**  | 0.645767 | 0.634558 | 0.640101 | 0.723386 | 0.661388 | 0.63262  | 1.0      | 0.65212  | 0.432959 |
| **KDR**   | 0.718258 | 0.703629 | 0.676719 | 0.656019 | 0.687444 | 0.718457 | 0.656647 | 1.0      | 0.42493  |
| **p110a** | 0.427023 | 0.412055 | 0.437192 | 0.424418 | 0.450303 | 0.397434 | 0.43112  | 0.422578 | 1.0      |


In [18]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_df.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

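|           | EGFR  | ErbB2 | BRAF  | CDK2  | LCK   | MET   | p38a  | KDR   | p110a |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| **EGFR**  | 1.0   | 0.941 | 0.659 | 0.65  | 0.71  | 0.715 | 0.645 | 0.715 | 0.429 |
| **ErbB2** | 0.941 | 1.0   | 0.659 | 0.633 | 0.685 | 0.703 | 0.637 | 0.702 | 0.417 |
| **BRAF**  | 0.653 | 0.651 | 1.0   | 0.646 | 0.67  | 0.639 | 0.637 | 0.667 | 0.438 |
| **CDK2**  | 0.648 | 0.629 | 0.649 | 1.0   | 0.68  | 0.659 | 0.724 | 0.653 | 0.425 |
| **LCK**   | 0.714 | 0.686 | 0.677 | 0.684 | 1.0   | 0.696 | 0.665 | 0.689 | 0.454 |
| **MET**   | 0.709 | 0.695 | 0.639 | 0.656 | 0.687 | 1.0   | 0.628 | 0.711 | 0.392 |
| **p38a**  | 0.646 | 0.635 | 0.64  | 0.723 | 0.661 | 0.633 | 1.0   | 0.652 | 0.433 |
| **KDR**   | 0.718 | 0.704 | 0.677 | 0.656 | 0.687 | 0.718 | 0.657 | 1.0   | 0.425 |
| **p110a** | 0.427 | 0.412 | 0.437 | 0.424 | 0.45  | 0.397 | 0.431 | 0.423 | 1.0   |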


### Save kinase similarity matrix

In [19]:
f_name = f"kinase_similarity_{similarity_measure}_matrix_sequence.csv"
kinase_similarity_matrix_df.to_csv(DATA / f_name)

### Kinase distance matrix

Clustering algorithms typically start from a distance matrix. The similarity matrix above is not a distance matrix: for example, its diagonal elements are not zero.

Since all entries are between $0$ and $1$, the kinase similarity matrix $SM$ (not to be confused with the substitution matrix from the theory part) can be converted to a distance matrix $DM$ using

$$
DM = 1 - SM.
$$

In [20]:
print(f"The values of the similarity matrix lie between: "
      f"{kinase_similarity_matrix_df.min().min():.2f}"
      f" and {kinase_similarity_matrix_df.max().max():.2f}")

The values of the similarity matrix lie between: 0.39 and 1.00


In [21]:
kinase_distance_matrix_df = 1 - kinase_similarity_matrix_df
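
The conversion can be checked on a small block of the similarity matrix. The sketch below uses the EGFR/ErbB2 entries reported above and verifies two properties that clustering methods commonly expect from a distance matrix: a zero diagonal and symmetry. Averaging the two off-diagonal entries is shown only as one possible way to symmetrize the matrix; this step is an assumption of the example and is not applied in this talktorial.

```python
labels = ["EGFR", "ErbB2"]
# Values taken from the kinase similarity matrix above
similarity = [
    [1.0, 0.940995],
    [0.941257, 1.0],
]
size = len(labels)

# Element-wise conversion: distance = 1 - similarity
distance = [[1 - value for value in row] for row in similarity]

# Property 1: a kinase has distance zero to itself
zero_diagonal = all(abs(distance[i][i]) < 1e-12 for i in range(size))

# Property 2: symmetry; report the largest deviation between (i, j) and (j, i)
max_asymmetry = max(
    abs(distance[i][j] - distance[j][i]) for i in range(size) for j in range(size)
)

# One possible symmetrization (assumption): average both directions
symmetric_distance = [
    [(distance[i][j] + distance[j][i]) / 2 for j in range(size)]
    for i in range(size)
]

print(zero_diagonal, f"{max_asymmetry:.6f}", f"{symmetric_distance[0][1]:.6f}")
```

The small asymmetry stems from the row-wise scaling of the substitution matrix in `substitution_score`, which makes the normalized score matrix itself asymmetric.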

In [22]:
kinase_distance_matrix_df.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

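|           | EGFR  | ErbB2 | BRAF  | CDK2  | LCK   | MET   | p38a  | KDR   | p110a |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| **EGFR**  | 0.0   | 0.059 | 0.341 | 0.35  | 0.29  | 0.285 | 0.355 | 0.285 | 0.571 |
| **ErbB2** | 0.059 | 0.0   | 0.341 | 0.367 | 0.315 | 0.297 | 0.363 | 0.298 | 0.583 |
| **BRAF**  | 0.347 | 0.349 | 0.0   | 0.354 | 0.33  | 0.361 | 0.363 | 0.333 | 0.562 |
| **CDK2**  | 0.352 | 0.371 | 0.351 | 0.0   | 0.32  | 0.341 | 0.276 | 0.347 | 0.575 |
| **LCK**   | 0.286 | 0.314 | 0.323 | 0.316 | 0.0   | 0.304 | 0.335 | 0.311 | 0.546 |
| **MET**   | 0.291 | 0.305 | 0.361 | 0.344 | 0.313 | 0.0   | 0.372 | 0.289 | 0.608 |
| **p38a**  | 0.354 | 0.365 | 0.36  | 0.277 | 0.339 | 0.367 | 0.0   | 0.348 | 0.567 |
| **KDR**   | 0.282 | 0.296 | 0.323 | 0.344 | 0.313 | 0.282 | 0.343 | 0.0   | 0.575 |
| **p110a** | 0.573 | 0.588 | 0.563 | 0.576 | 0.55  | 0.603 | 0.569 | 0.577 | 0.0   |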


### Save kinase distance matrix

In [23]:
f_name = f"kinase_distance_{similarity_measure}_matrix_sequence.csv"
kinase_distance_matrix_df.to_csv(DATA / f_name)

## Discussion

In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is set on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: 1. the identity, which treats all amino acids uniformly, and 2. the substitution, which takes into account the rate of change of residues over evolutionary time.

The kinase similarity matrix above will be reloaded in **Talktorial T028**, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.

## Quiz

1. Should the full kinase sequence be used instead of the pocket sequence?
2. How does the similarity using identity behave with respect to mutations?
3. How does similarity using identity compare to similarity using substitution?