# T024 · Kinase similarity: sequence

Authors:

- Talia B. Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)

## Aim of this talktorial

In this talktorial, we investigate sequence similarity for kinases of interest. The KLIFS API is used to retrieve the 85-residue pocket sequence for each kinase. Two similarity measures are implemented: (1) the identity score, which counts position-wise character matches, and (2) the substitution score, which weights each pair of residues with amino-acid-specific values.

### Contents in *Theory*

* Kinase dataset
* Kinase similarity descriptor: sequence
    * Identity score
    * Substitution score

### Contents in *Practical*

* Retrieve sequences from KLIFS
* Sequence similarity
    * Identity score
    * Substitution score
* Kinase comparison
* Visualize similarity as kinase matrix
* Save kinase distance matrix

### References

* Kinase dataset: [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629) 
* KLIFS
  * KLIFS URL: https://klifs.net/
  * KLIFS database: [<i>Nucleic Acids Res.</i> (2020), <b>49(D1)</b>, D562-D569](https://doi.org/10.1093/nar/gkaa895)
* Substitution matrix: [<i>PNAS</i> (1992), <b>89(22)</b>, 10915-10919](https://doi.org/10.1073/pnas.89.22.10915)

## Theory

### Kinase dataset

We will use nine kinases from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), which aimed to understand kinase similarities within different combinations of kinase on- and off-targets (also called anti-targets):

 

> We aggregated the investigated kinases in “profiles”. Profile 1 combined **EGFR** and **ErbB2** as targets and **BRAF** as a (general) anti-target. Out of similar considerations, Profile 2 consisted of EGFR and **PI3K** as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. Profile 3, comprised of EGFR and **VEGFR2** as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).
> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases **CDK2**, **LCK**, **MET** and **p38α** were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.

 

*Table 1:*
Kinases used in this notebook, taken from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, kinase groups, and full unabbreviated names.

 

| Kinase                     | Synonyms               | UniProt ID | Group    | Full kinase name                                 |
|----------------------------|------------------------|------------|----------|--------------------------------------------------|
| EGFR                       | ErbB1                  | P00533     | TK       | Epidermal growth factor receptor                 |
| ErbB2                      | Her2                   | P04626     | TK       | Erythroblastic leukemia viral oncogene homolog 2 |
| PI3K                       | PI3KCA, p110a          | P42336     | Atypical | Phosphatidylinositol-3-kinase                    |
| VEGFR2                     | KDR                    | P35968     | TK       | Vascular endothelial growth factor receptor 2    |
| BRAF                       | -                      | P15056     | TKL      | Rapidly accelerated fibrosarcoma isoform B       |
| CDK2                       | -                      | P24941     | CMGC     | Cyclin-dependent kinase 2                        |
| LCK                        | -                      | P06239     | TK       | Lymphocyte-specific protein tyrosine kinase      |
| MET                        | -                      | P08581     | TK       | Mesenchymal-epithelial transition factor         |
| p38a                       | MAPK14                 | Q16539     | CMGC     | p38 mitogen activated protein kinase α           |

### Kinase similarity descriptor: sequence

In this talktorial, the KLIFS pocket sequence is used for two main reasons:
1. The sequence has a fixed length (it contains 85 residues), which makes the pairwise comparison of two sequences straightforward because no additional alignment step is needed.
2. The binding pocket is where the action takes place. Why consider the full kinase sequence when an 85-residue sequence contains most of the relevant information?

We now describe two ways to compare pocket sequences.

#### Identity score
A simple way of assessing the similarity between two sequences is to use the so-called identity score.
First, a match vector is created: for each position, it records whether the characters of the two sequences are identical. If they are, the entry is set to $1$, and to $0$ otherwise.

The identity score is computed by summing the elements of the match vector and normalizing the sum by the length of the sequence, which, in the case of the KLIFS pocket sequence, is $85$.

Let's consider the identity matrix $I$ below:

|       | A     | C     | D     | E     | ... |
|-------|-------|-----  |-----  |-----  |-----|
| **A** | **1** | 0     | 0     | 0     | ... |
| **C** | 0     | **1** | 0     | 0     | ... |
| **D** | 0     | 0     | **1** | 0     | ... |
| **E** | 0     | 0     | 0     | **1** | ... |
| ...   | ...   | ...   | ...   | ...   | ... |


and let $M = len(K_i) = len(K_j)$.

We use the following metric as similarity between kinases $K_i$ and $K_j$:

$$
\text{similarity}(K_i, K_j) = \frac{1}{M} \sum_{n=1}^{M} I\big(K_i[n], K_j[n]\big),
$$

where $K_i[n]$ represents the amino acid at position $n$ of kinase $i$.
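
The identity score above can be sketched in a few lines of plain Python. The function name `identity_similarity` and the two toy sequences are hypothetical and only serve to illustrate the formula; they are not real pocket sequences.

```python
def identity_similarity(seq_i, seq_j):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_i) != len(seq_j):
        raise ValueError("Sequences must have the same length.")
    # Match vector: 1 where the characters are identical, 0 otherwise
    match_vector = [int(a == b) for a, b in zip(seq_i, seq_j)]
    # Normalize the number of matches by the sequence length M
    return sum(match_vector) / len(seq_i)


# Toy sequences (hypothetical, M = 5)
toy_i = "ACDEA"
toy_j = "ACEEC"
toy_identity = identity_similarity(toy_i, toy_j)
print(toy_identity)  # 3 matching positions out of 5 -> 0.6
```

A sequence compared with itself matches at every position, so its identity score is $1$.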

#### Substitution score
Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.

The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids.
In this talktorial, we use the BLOSUM62 substitution matrix [<i>PNAS</i> (1992), <b>89(22)</b>, 10915-10919](https://doi.org/10.1073/pnas.89.22.10915), implemented in `biotite`.

An excerpt of the BLOSUM62 substitution matrix $SM$ is shown below (the full matrix will be displayed in the practical part):

|       | A     | C     | D     | E     | ... |
|-------|-------|-----  |-----  |-----  |-----|
| **A** | 4     | 0     | -2    | -1    | ... |
| **C** | 0     | 9     | -3    | -4    | ... |
| **D** | -2    | -3    | 6     | 2     | ... |
| **E** | -1    | -4    | 2     | 5     | ... |
| ...   | ...   | ...   | ...   | ...   | ... |

We use the following metric as similarity between kinases $K_i$ and $K_j$:

$$
\text{similarity}(K_i, K_j) = \frac{1}{\sum_{a} SM(a, a)} \sum_{n=1}^{M} SM\big(K_i[n], K_j[n]\big),
$$

where $K_i[n]$ represents the amino acid at position $n$ of kinase $i$ and $M = len(K_i) = len(K_j)$.
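
As a rough illustration of this formula, the sketch below uses only the four-letter excerpt of the matrix shown in the table above. The names `SM_excerpt` and `substitution_similarity`, as well as the toy sequences, are hypothetical. The denominator is the sum of the diagonal entries of the excerpt, that is, its trace.

```python
import numpy as np

# Four-letter excerpt of the substitution matrix from the table above
alphabet = ["A", "C", "D", "E"]
SM_excerpt = np.array(
    [
        [4, 0, -2, -1],
        [0, 9, -3, -4],
        [-2, -3, 6, 2],
        [-1, -4, 2, 5],
    ]
)
index = {letter: k for k, letter in enumerate(alphabet)}


def substitution_similarity(seq_i, seq_j):
    """Sum of position-wise substitution scores divided by the matrix trace."""
    raw_score = sum(SM_excerpt[index[a], index[b]] for a, b in zip(seq_i, seq_j))
    return raw_score / np.trace(SM_excerpt)


# Toy sequences (hypothetical): scores are 4 + 9 + 2 + 2 = 17, trace is 24
toy_score = substitution_similarity("ACDE", "ACED")
print(f"{toy_score:.3f}")  # 17 / 24 -> 0.708
```

Note that, with this normalization, the score is not bounded by $1$: in this toy setting, comparing `"CCCC"` with itself gives $36/24 = 1.5$.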

## Practical

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
import requests
import biotite.sequence.align as align
from sklearn import preprocessing

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Retrieve sequences from KLIFS

We start by listing the kinases of interest.

In [3]:
query_kinases = ['EGFR',
                 'ErbB2',
                 'BRAF',
                 'CDK2',
                 'LCK',
                 'MET',
                 'p38a',
                 'KDR',
                 'p110a']

We use the KLIFS API to retrieve the 85-residue pocket sequence for each kinase.

In [4]:
def klifs_pocket_sequence(kinase_name):
    """
    Retrieves the pocket sequence from KLIFS using the API.

    Parameters
    ----------
    kinase_name : str
        The name of the kinase of interest.

    Returns
    -------
    str :
        The 85 residues pocket sequence from KLIFS,
        if the kinase name is valid, None otherwise.
    """
    response = requests.get(f"https://klifs.net/api/"
                            f"kinase_ID?kinase_name={kinase_name}"
                            f"&species=HUMAN")

    if response.status_code == 200:
        return response.json()[0]['pocket']
    else:
        print(f'KLIFS failed for kinase {kinase_name}')
        return None
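
The function above indexes the first element of the decoded response directly. As a minimal sketch, assuming the response body decodes to a list of dictionaries with a `pocket` key (as the indexing above suggests), a small helper could guard against an empty or unexpected payload. The helper name `extract_pocket` and the mock payload are hypothetical and not part of KLIFS.

```python
def extract_pocket(payload):
    """Return the pocket sequence of the first entry, or None if unavailable."""
    # Guard against an empty list or an unexpected payload type
    if not isinstance(payload, list) or not payload:
        return None
    return payload[0].get("pocket")


# Mock payload (hypothetical, shortened sequence) standing in for response.json()
mock_payload = [{"pocket": "KVLGSGAFGTVYK"}]
print(extract_pocket(mock_payload))
print(extract_pocket([]))
```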

Let's look at these pocket sequences.

In [5]:
kinase_sequences = {}
for kinase in query_kinases:
    kinase_sequences[kinase] = klifs_pocket_sequence(kinase)
kinase_sequences

{'EGFR': 'KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA',
 'ErbB2': 'KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA',
 'BRAF': 'QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA',
 'CDK2': 'EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA',
 'LCK': 'ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA',
 'MET': 'EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA',
 'p38a': 'SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA',
 'KDR': 'KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA',
 'p110a': 'CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF'}
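
Before comparing sequences, it can be helpful to check that every pocket has the expected length and to see where residues are missing (noted with `-` in KLIFS). The sketch below does this for two of the sequences printed above, copied by hand; the variable names are chosen for illustration only.

```python
# Two pocket sequences copied from the output above
pockets = {
    "EGFR": "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA",
    "CDK2": "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA",
}

# Length of each pocket sequence
pocket_lengths = {name: len(sequence) for name, sequence in pockets.items()}

# 1-based positions at which a residue is missing
gap_positions = {
    name: [pos for pos, residue in enumerate(sequence, start=1) if residue == "-"]
    for name, sequence in pockets.items()
}

print(pocket_lengths)
print(gap_positions)
```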

### Sequence similarity

Given two kinases, we create functions which account for identity or substitution similarity, as described in the theory.

#### Identity score
We first define a function which compares element-wise characters in two sequences.

In [6]:
def identity_score(sequence1, sequence2):
    """
    Computes the element-wise binary similarity between two sequences.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.

    Returns
    -------
    np.array :
        The boolean array with one entry per position:
        True if the characters are identical,
        False otherwise.
    """
    # True if the character is the same, False otherwise
    return np.compare_chararrays(sequence1,
                                 sequence2,
                                 cmp="==",
                                 rstrip=True)
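
For single-character arrays of equal length, the same match vector can also be obtained with the plain `==` operator. The sketch below is an alternative illustration, not the function used in the rest of this talktorial; it applies the comparison to the EGFR and MET pocket sequences retrieved above.

```python
import numpy as np

# EGFR and MET pocket sequences copied from the output above
egfr = np.array(list(
    "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA"
))
met = np.array(list(
    "EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA"
))

# Boolean match vector: True where both pockets carry the same residue
is_match = egfr == met

# Identity score: number of matches divided by the pocket length
identity = is_match.mean()
print(f"{is_match.sum()} matches, identity score {identity:.2f}")  # 39 matches, 0.46
```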

#### Substitution score
We now define a function that scores each pair of residues with amino-acid-specific values, using the BLOSUM62 substitution matrix provided by the `biotite` library.

The substitution matrix can be retrieved from `biotite` using the following command:

In [7]:
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   

In [8]:
def substitution_score(sequence1,
                       sequence2,
                       substitution_matrix=align.
                       SubstitutionMatrix.std_protein_matrix()):
    """
    Retrieve the match score given the substitution matrix.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.
    substitution_matrix : biotite.sequence.align.SubstitutionMatrix
        A substitution matrix specific to amino acids.
        The default is align.SubstitutionMatrix.std_protein_matrix()
        from biotite, which represents BLOSUM62.

    Returns
    -------
    tuple (np.array, float) :
        The vector of match score given the substitution matrix,
        and the normalizing value.
    """
    # Retrieve np.array from substitution matrix
    score_matrix = substitution_matrix.score_matrix()
    
    # Rescale values to [0, 1]
    # (minmax_scale rescales each column independently by default)
    score_matrix = preprocessing.minmax_scale(score_matrix,
                                              feature_range=(0, 1))
    
    # Sum diagonal values for normalization, called the trace
    normalizing_value = np.trace(score_matrix)
    
    # Retrieve the alphabet of the substitution matrix
    letter_alphabet = substitution_matrix.get_alphabet1()

    # Map letter to index
    dict_letters = {}
    for i, letter in enumerate(letter_alphabet.get_symbols()):
        dict_letters[letter] = i

    match_score = np.zeros(len(sequence1))
    for i, (character_seq1, character_seq2) in enumerate(zip(sequence1,
                                                             sequence2)):
        ind1 = dict_letters[character_seq1]
        ind2 = dict_letters[character_seq2]
        match_score[i] = score_matrix[ind1, ind2]
    return match_score, normalizing_value
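
One detail of the rescaling step is worth a closer look: `preprocessing.minmax_scale` works along `axis=0` by default, so each column of the score matrix is rescaled with its own minimum and maximum. The sketch below reproduces that behavior with plain NumPy on the four-letter excerpt from the theory section and contrasts it with a single global rescaling. The global variant is shown only for comparison and is not what the function above does.

```python
import numpy as np

# Four-letter excerpt (A, C, D, E) of the substitution matrix
excerpt = np.array(
    [
        [4, 0, -2, -1],
        [0, 9, -3, -4],
        [-2, -3, 6, 2],
        [-1, -4, 2, 5],
    ],
    dtype=float,
)

# Column-wise rescaling: every column uses its own minimum and maximum
column_min = excerpt.min(axis=0)
column_max = excerpt.max(axis=0)
columnwise = (excerpt - column_min) / (column_max - column_min)

# Global rescaling: one minimum and one maximum for the whole matrix
globally = (excerpt - excerpt.min()) / (excerpt.max() - excerpt.min())

print(np.allclose(columnwise, columnwise.T))  # False: symmetry is lost
print(np.allclose(globally, globally.T))      # True: symmetry is preserved
```

After column-wise rescaling, the score for the pair (A, C) differs from the score for (C, A), so the result of comparing two sequences can depend on their order.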

### Kinase comparison

Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution.

In [9]:
def sequence_similarity(kinase_name1, kinase_name2, type_="identity"):
    """
    Compares two sequences using a given metric.

    Parameters
    ----------
    kinase_name1, kinase_name2 : str
        The two names of the kinases for comparison.
    type_ : str
        The type of metric to compute the similarity,
        either `identity` or `substitution`.
        The default is `identity`.

    Returns
    -------
    float :
        The similarity between the pocket sequences of the two kinases.
    """
    sequence_1 = klifs_pocket_sequence(kinase_name1)
    sequence_2 = klifs_pocket_sequence(kinase_name2)

    # Replace possible unavailable residue
    # noted in KLIFS with "-"
    # by the symbol "*" for biotite
    sequence_1 = sequence_1.replace("-", "*")
    sequence_2 = sequence_2.replace("-", "*")

    if len(sequence_1) != len(sequence_2):
        print("Mismatch in sequence lengths.")
        return None
    else:
        seq_array1 = np.array(list(sequence_1))
        seq_array2 = np.array(list(sequence_2))

        if type_ == "identity":
            is_match = identity_score(seq_array1, seq_array2)
            similarity_normed = sum(is_match)/len(sequence_1)
            return similarity_normed

        elif type_ == "substitution":
            match_score, normalizing_value = substitution_score(seq_array1, seq_array2)
            similarity_normed = sum(match_score)/normalizing_value
            return similarity_normed

        else:
            print("Type not defined.")
            return None

Let's look at the sequence similarity between EGFR and MET:

In [10]:
EGFR_MET_seq_similarity = sequence_similarity("EGFR",
                                              "MET",
                                              "substitution")
print(f"Pocket sequence similarity between EGFR and MET kinases: "
      f"{EGFR_MET_seq_similarity:.2f} using substitution.")

Pocket sequence similarity between EGFR and MET kinases: 2.54 using substitution.


In [11]:
EGFR_MET_seq_similarity = sequence_similarity("EGFR",
                                              "MET",
                                              "identity")
print(f"Pocket sequence similarity between EGFR and MET kinases: "
      f"{EGFR_MET_seq_similarity:.2f} using identity.")

Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.


As expected, the similarity between a kinase and itself leads to the highest possible score:

In [12]:
EGFR_seq_similarity = sequence_similarity("EGFR", "EGFR")
print(f"Pocket sequence similarity between EGFR and itself: "
      f"{EGFR_seq_similarity:.2f} using identity.")

Pocket sequence similarity between EGFR and itself: 1.00 using identity.


In [13]:
EGFR_seq_similarity = sequence_similarity("EGFR", "EGFR", type_="substitution")
print(f"Pocket sequence similarity between EGFR and itself: "
      f"{EGFR_seq_similarity:.2f} using substitution.")
# TODO: normalize

Pocket sequence similarity between EGFR and itself: 3.58 using substitution.


### Visualize similarity as kinase matrix

In [14]:
kinase_similarity_matrix = np.zeros((len(query_kinases), len(query_kinases)))
for i, kinase_name1 in enumerate(query_kinases):
    for j, kinase_name2 in enumerate(query_kinases):
        kinase_similarity_matrix[i, j] = sequence_similarity(kinase_name1,
                                                             kinase_name2)
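
The loop above queries KLIFS again for every pair of kinases. As an alternative sketch, assuming the pocket sequences are already available locally (for instance in the `kinase_sequences` dictionary built earlier), the identity matrix can be computed in one step with NumPy broadcasting. Three sequences are copied from the output above to keep the example self-contained; the variable names are chosen for illustration.

```python
import numpy as np

# Three pocket sequences copied from the output above
sequences = {
    "EGFR": "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA",
    "MET": "EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA",
    "CDK2": "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA",
}
names = list(sequences)

# Encode every residue as an integer code; shape: (number of kinases, 85)
encoded = np.array([[ord(residue) for residue in sequences[name]] for name in names])

# Compare all pairs position by position, then average over the positions
pairwise_identity = (encoded[:, None, :] == encoded[None, :, :]).mean(axis=2)

print(names)
print(np.round(pairwise_identity, 3))
```

The resulting matrix is symmetric with ones on the diagonal, and the EGFR/MET entry agrees with the identity score of 0.46 reported above.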

In [15]:
kinase_similarity_matrix = pd.DataFrame(data=kinase_similarity_matrix,
                                        index=query_kinases,
                                        columns=query_kinases)
kinase_similarity_matrix

|       | EGFR     | ErbB2    | BRAF     | CDK2     | LCK      | MET      | p38a     | KDR      | p110a    |
|-------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| EGFR  | 1.0      | 0.894118 | 0.376471 | 0.317647 | 0.447059 | 0.458824 | 0.388235 | 0.470588 | 0.117647 |
| ErbB2 | 0.894118 | 1.0      | 0.4      | 0.329412 | 0.423529 | 0.470588 | 0.4      | 0.435294 | 0.117647 |
| BRAF  | 0.376471 | 0.4      | 1.0      | 0.329412 | 0.388235 | 0.376471 | 0.376471 | 0.4      | 0.152941 |
| CDK2  | 0.317647 | 0.329412 | 0.329412 | 1.0      | 0.376471 | 0.364706 | 0.470588 | 0.341176 | 0.105882 |
| LCK   | 0.447059 | 0.423529 | 0.388235 | 0.376471 | 1.0      | 0.4      | 0.388235 | 0.435294 | 0.141176 |
| MET   | 0.458824 | 0.470588 | 0.376471 | 0.364706 | 0.4      | 1.0      | 0.364706 | 0.470588 | 0.105882 |
| p38a  | 0.388235 | 0.4      | 0.376471 | 0.470588 | 0.388235 | 0.364706 | 1.0      | 0.388235 | 0.141176 |
| KDR   | 0.470588 | 0.435294 | 0.4      | 0.341176 | 0.435294 | 0.470588 | 0.388235 | 1.0      | 0.152941 |
| p110a | 0.117647 | 0.117647 | 0.152941 | 0.105882 | 0.141176 | 0.105882 | 0.141176 | 0.152941 | 1.0      |


In [16]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

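|       | EGFR  | ErbB2 | BRAF  | CDK2  | LCK   | MET   | p38a  | KDR   | p110a |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| EGFR  | 1.0   | 0.894 | 0.376 | 0.318 | 0.447 | 0.459 | 0.388 | 0.471 | 0.118 |
| ErbB2 | 0.894 | 1.0   | 0.4   | 0.329 | 0.424 | 0.471 | 0.4   | 0.435 | 0.118 |
| BRAF  | 0.376 | 0.4   | 1.0   | 0.329 | 0.388 | 0.376 | 0.376 | 0.4   | 0.153 |
| CDK2  | 0.318 | 0.329 | 0.329 | 1.0   | 0.376 | 0.365 | 0.471 | 0.341 | 0.106 |
| LCK   | 0.447 | 0.424 | 0.388 | 0.376 | 1.0   | 0.4   | 0.388 | 0.435 | 0.141 |
| MET   | 0.459 | 0.471 | 0.376 | 0.365 | 0.4   | 1.0   | 0.365 | 0.471 | 0.106 |
| p38a  | 0.388 | 0.4   | 0.376 | 0.471 | 0.388 | 0.365 | 1.0   | 0.388 | 0.141 |
| KDR   | 0.471 | 0.435 | 0.4   | 0.341 | 0.435 | 0.471 | 0.388 | 1.0   | 0.153 |
| p110a | 0.118 | 0.118 | 0.153 | 0.106 | 0.141 | 0.106 | 0.141 | 0.153 | 1.0   |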


### Save kinase similarity matrix

In [17]:
kinase_similarity_matrix.to_csv(DATA / "kinase_similarity_matrix_sequence.csv")

### Kinase distance matrix

Clustering algorithms that assess the similarity between kinases typically start from a distance matrix. The similarity matrix above is not a distance matrix: for example, its diagonal elements are $1$ instead of $0$.

Since all entries of the identity-based similarity matrix lie between $0$ and $1$, the similarity matrix $S$ can be converted to a distance matrix $D$ using

$$
D = 1 - S.
$$

In [18]:
print(f"The values of the similarity matrix lie between: "
      f"{kinase_similarity_matrix.min().min():.2f}"
      f" and {kinase_similarity_matrix.max().max():.2f}")

The values of the similarity matrix lie between: 0.11 and 1.00


In [19]:
kinase_distance_matrix = 1 - kinase_similarity_matrix
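
As a quick sanity check, the sketch below applies the same conversion to a three-kinase excerpt of the similarity matrix (values copied from the table above) and verifies basic properties expected of a distance matrix: a zero diagonal, symmetry, and non-negative entries. The array names are chosen for illustration only.

```python
import numpy as np

# Excerpt of the identity-based similarity matrix for EGFR, ErbB2 and p110a
labels = ["EGFR", "ErbB2", "p110a"]
similarity = np.array(
    [
        [1.0, 0.894118, 0.117647],
        [0.894118, 1.0, 0.117647],
        [0.117647, 0.117647, 1.0],
    ]
)

# Convert similarities in [0, 1] to distances
distance = 1 - similarity

# Basic properties expected of a distance matrix
has_zero_diagonal = np.allclose(np.diag(distance), 0.0)
is_symmetric = np.allclose(distance, distance.T)
is_non_negative = bool((distance >= 0).all())

print(np.round(distance, 3))
print(has_zero_diagonal, is_symmetric, is_non_negative)  # True True True
```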

In [20]:
kinase_distance_matrix.style.\
    background_gradient(cmap=cm).\
    format("{:.3f}")

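|       | EGFR  | ErbB2 | BRAF  | CDK2  | LCK   | MET   | p38a  | KDR   | p110a |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| EGFR  | 0.0   | 0.106 | 0.624 | 0.682 | 0.553 | 0.541 | 0.612 | 0.529 | 0.882 |
| ErbB2 | 0.106 | 0.0   | 0.6   | 0.671 | 0.576 | 0.529 | 0.6   | 0.565 | 0.882 |
| BRAF  | 0.624 | 0.6   | 0.0   | 0.671 | 0.612 | 0.624 | 0.624 | 0.6   | 0.847 |
| CDK2  | 0.682 | 0.671 | 0.671 | 0.0   | 0.624 | 0.635 | 0.529 | 0.659 | 0.894 |
| LCK   | 0.553 | 0.576 | 0.612 | 0.624 | 0.0   | 0.6   | 0.612 | 0.565 | 0.859 |
| MET   | 0.541 | 0.529 | 0.624 | 0.635 | 0.6   | 0.0   | 0.635 | 0.529 | 0.894 |
| p38a  | 0.612 | 0.6   | 0.624 | 0.529 | 0.612 | 0.635 | 0.0   | 0.612 | 0.859 |
| KDR   | 0.529 | 0.565 | 0.6   | 0.659 | 0.565 | 0.529 | 0.612 | 0.0   | 0.847 |
| p110a | 0.882 | 0.882 | 0.847 | 0.894 | 0.859 | 0.894 | 0.859 | 0.847 | 0.0   |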


### Save kinase distance matrix

In [21]:
kinase_distance_matrix.to_csv(DATA /
                              "kinase_distance_matrix_sequence.csv")

## Discussion

In this talktorial, we investigated how sequences can be used to measure similarity between kinases. The focus is on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: (1) the identity score, which treats all amino acids uniformly, and (2) the substitution score, which takes into account the rate of change of residues over evolutionary time.

The kinase similarity matrix above will be reloaded in **Talktorial T028**, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.

## Quiz

1. Should the full kinase sequence be used instead of the pocket sequence?
2. How does the similarity using identity behave with respect to mutations?
3. How does similarity using identity compare to similarity using substitution?