# T024 · Kinase similarity: sequence

Authors:

- Talia B. Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)
- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)

## Aim of this talktorial

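In this talktorial, we compare kinases based on the sequence of their binding pockets. We retrieve the 85-residue pocket sequences from KLIFS for a set of nine kinases and score their pairwise similarity using sequence identity and a substitution matrix.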

### Contents in *Theory*

* Kinase dataset
* Kinase similarity descriptor: KLIFS pocket sequence

### Contents in *Practical*

* Retrieve and preprocess data
* Show kinase coverage
* Compare kinases
* Visualize similarity as kinase matrix
* Visualize similarity as phylogenetic tree

### References

* Kinase dataset: [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629) 
* Kinase similarity descriptor: XXX

## Theory

### Kinase dataset

We will use nine kinases from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), which aimed to understand kinase similarities within different combinations of kinase on- and off-targets (also called anti-targets):

 

> We aggregated the investigated kinases in “profiles”. Profile 1 combined **EGFR** and **ErbB2** as targets and **BRAF** as a (general) anti-target. Out of similar considerations, Profile 2 consisted of EGFR and **PI3K** as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. Profile 3, comprised of EGFR and **VEGFR2** as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).
> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases **CDK2**, **LCK**, **MET** and **p38α** were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.

 

*Table 1:* 
Kinases used in this notebook, taken from [<i>Molecules</i> (2021), <b>26(3)</b>, 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, and kinase groups.

 

| Kinase                     | Synonyms               | UniProt ID | Group    | Full kinase name                                 |
|----------------------------|------------------------|------------|----------|--------------------------------------------------|
| EGFR                       | ErbB1                  | P00533     | TK       | Epidermal growth factor receptor                 |
| ErbB2                      | Her2                   | P04626     | TK       | Erythroblastic leukemia viral oncogene homolog 2 |
| PI3K                       | PIK3CA, p110a          | P42336     | Atypical | Phosphatidylinositol-3-kinase                    |
| VEGFR2                     | KDR                    | P35968     | TK       | Vascular endothelial growth factor receptor 2    |
| BRAF                       | -                      | P15056     | TKL      | Rapidly accelerated fibrosarcoma isoform B       |
| CDK2                       | -                      | P24941     | CMGC     | Cyclin-dependent kinase 2                        |
| LCK                        | -                      | P06239     | TK       | Lymphocyte-specific protein tyrosine kinase      |
| MET                        | -                      | P08581     | TK       | Mesenchymal-epithelial transition factor         |
| p38a                       | MAPK14                 | Q16539     | CMGC     | p38 mitogen activated protein kinase α           |

### Kinase similarity descriptor: sequence

Describe the dataset describing kinase similarity and how we use it.

- XXX = KLIFS pocket sequence
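
As a sketch of how two pocket sequences can be turned into a similarity value, the identity measure computed in the practical part can be written as follows. Here $a$ and $b$ are two aligned pocket sequences of length $n$ (85 for KLIFS pockets); the symbols $s_{\text{id}}$, $a$, $b$, and $n$ are notation introduced here for illustration only.

$$
s_{\text{id}}(a, b) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[a_i = b_i]
$$

The substitution measure replaces the indicator $\mathbb{1}[a_i = b_i]$ with the entry of a substitution matrix for the residue pair $(a_i, b_i)$ at each position.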

## Practical

In [1]:
# !pip install flake8 pycodestyle_magic
%load_ext pycodestyle_magic
%pycodestyle_on

In [2]:
from pathlib import Path

import pandas as pd
import numpy as np
import requests
import biotite.sequence.align as align

In [3]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Retrieve sequences from KLIFS

We start by listing the kinases of interest.

In [4]:
query_kinases = ['EGFR',
                 'ErbB2',
                 'BRAF',
                 'CDK2',
                 'LCK',
                 'MET',
                 'p38a',
                 'KDR',
                 'p110a']

We use the KLIFS API to retrieve the 85-residue pocket sequence of each kinase.

In [5]:
def klifs_pocket_sequence(kinase_name):
    """
    Retrieve the pocket sequence of a human kinase from the KLIFS API.

    Parameters
    ----------
    kinase_name : str
        The name of the kinase of interest.

    Returns
    -------
    str or None
        The 85-residue pocket sequence from KLIFS,
        or None if the request fails.
    """
    response = requests.get("https://klifs.net/api/"
                            f"kinase_ID?kinase_name={kinase_name}"
                            "&species=HUMAN")

    if response.status_code == 200:
        # The API returns a list of matching kinases; use the first hit
        return response.json()[0]['pocket']
    else:
        print(f'KLIFS failed for kinase {kinase_name}')
        return None

Let's see what these pocket sequences look like.

In [6]:
kinase_sequences = {}
for kinase in query_kinases:
    kinase_sequences[kinase] = klifs_pocket_sequence(kinase)
kinase_sequences

{'EGFR': 'KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA',
 'ErbB2': 'KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA',
 'BRAF': 'QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA',
 'CDK2': 'EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA',
 'LCK': 'ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA',
 'MET': 'EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA',
 'p38a': 'SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA',
 'KDR': 'KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA',
 'p110a': 'CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF'}
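
Before comparing sequences, it can help to check that every retrieved pocket has the expected length and to see where KLIFS uses the gap character `-`. The following is a minimal sketch on two of the sequences shown above; the helper name `summarize_pocket` is hypothetical and not part of KLIFS or of this talktorial.

```python
# Two pocket sequences copied from the output above
pockets = {
    "EGFR": "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA",
    "CDK2": "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA",
}

POCKET_LENGTH = 85  # number of residues in a KLIFS pocket


def summarize_pocket(sequence, expected_length=POCKET_LENGTH):
    """Report the length check and gap positions (hypothetical helper)."""
    # Positions are 1-based, as is common for pocket residue numbering
    gap_positions = [i for i, residue in enumerate(sequence, start=1)
                     if residue == "-"]
    return {
        "has_expected_length": len(sequence) == expected_length,
        "gap_positions": gap_positions,
    }


for name, sequence in pockets.items():
    print(name, summarize_pocket(sequence))
```

Gap positions matter later on, because the substitution matrix alphabet does not contain the `-` character.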

### Sequence similarity

Given two kinases, we compare their pocket sequences position by position using one of two measures: identity or substitution.

#### Identity score
We first define a function that compares two sequences element-wise: a position counts as a match if both sequences carry the same character there.

In [7]:
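def identity_score(sequence1, sequence2):
    """
    Compare two sequences position by position.

    Parameters
    ----------
    sequence1, sequence2 : np.array of str
        The two sequences as arrays of single characters.

    Returns
    -------
    np.array of bool
        True if the character is the same, False otherwise.
    """
    # The np.char namespace is used since NumPy 2 dropped the top-level alias
    return np.char.compare_chararrays(sequence1,
                                      sequence2,
                                      cmp="==",
                                      rstrip=True)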
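
To make the element-wise comparison concrete, the sketch below counts matching positions between the EGFR and MET pocket sequences listed earlier, using plain Python instead of NumPy. The function name `identity_fraction` is an illustrative assumption and not part of the talktorial's code.

```python
# Pocket sequences copied from the KLIFS output shown earlier
egfr = "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA"
met = "EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA"


def identity_fraction(sequence1, sequence2):
    """Fraction of positions with identical characters (illustrative)."""
    if len(sequence1) != len(sequence2):
        raise ValueError("Sequences must have the same length.")
    matches = sum(a == b for a, b in zip(sequence1, sequence2))
    return matches / len(sequence1)


# 39 of the 85 positions match
print(round(identity_fraction(egfr, met), 4))
```

With this definition, two aligned gap characters also count as a match, since `-` is compared like any other character.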

#### Substitution score
We now define a function that also rewards substitutions between similar amino acids. It scores each pair of residues with a substitution matrix (BLOSUM62 by default), which we retrieve from the `biotite` library.

In [8]:
def substitution_score(sequence1, sequence2, substitution_matrix=None):
    """
    Score two sequences position by position with a substitution matrix.

    Parameters
    ----------
    sequence1, sequence2 : np.array of str
        The two sequences as arrays of single characters.
        Every character must be part of the matrix alphabet.
    substitution_matrix : biotite.sequence.align.SubstitutionMatrix
        Default is align.SubstitutionMatrix.std_protein_matrix()
        from biotite, i.e. BLOSUM62.

    Returns
    -------
    np.array of float
        The substitution score at each position.
    """
    if substitution_matrix is None:
        substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()

    # Retrieve np.array from substitution matrix
    score_matrix = substitution_matrix.score_matrix()

    # Retrieve the letters
    letter_alphabet = substitution_matrix.get_alphabet1()

    # Map letter to index
    dict_letters = {}
    for i, letter in enumerate(letter_alphabet.get_symbols()):
        dict_letters[letter] = i

    match_score = np.zeros(len(sequence1))
    for i, (character_seq1, character_seq2) in enumerate(zip(sequence1,
                                                             sequence2)):
        ind1 = dict_letters[character_seq1]
        ind2 = dict_letters[character_seq2]
        match_score[i] = score_matrix[ind1, ind2]
    return match_score


#### Kinase comparison

We now combine both measures in a single function: given two kinase names, it retrieves their pocket sequences from KLIFS and compares them using either the identity or the substitution score.

In [9]:
def sequence_similarity(kinase_name1, kinase_name2, type_="identity"):
    """
    Compares the pocket sequences of two kinases using a given metric.

    Parameters
    ----------
    kinase_name1, kinase_name2 : str
        The two names of the kinases for comparison.
    type_ : str
        The type of comparison: "identity" (default) or "substitution".

    Returns
    -------
    float or np.array
        For "identity", the fraction of matching positions.
        For "substitution", the substitution score at each position.
    """
    sequence_1 = klifs_pocket_sequence(kinase_name1)
    sequence_2 = klifs_pocket_sequence(kinase_name2)

    if len(sequence_1) != len(sequence_2):
        print("Mismatch in sequence lengths.")
        return None
    else:
        seq_array1 = np.array(list(sequence_1))
        seq_array2 = np.array(list(sequence_2))
        if type_ == "identity":
            is_match = identity_score(seq_array1, seq_array2)
            similarity_normed = sum(is_match)/len(sequence_1)
            return similarity_normed
        elif type_ == "substitution":
            match_score = substitution_score(seq_array1,
                                             seq_array2)
            return match_score
        else:
            print("Type not defined.")
            return None

Let's look at the sequence similarity between EGFR and MET:

In [10]:
a = sequence_similarity("EGFR", "MET", "substitution")
a

array([ 1.,  4.,  2.,  6., -1.,  6., -2.,  6.,  6., -1.,  4.,  7., -1.,
       -1.,  4.,  3.,  5.,  0.,  4.,  2.,  0.,  4., -1.,  5.,  0., -1.,
        3.,  5., -1.,  0., -1.,  0.,  7.,  1.,  4., -1., -1.,  4.,  4.,
        6.,  4., -2.,  1.,  3., -1., -1., -1.,  5., -1., -1.,  6., -3.,
        4., -2.,  1.,  3.,  3.,  5.,  0.,  7.,  4., -1.,  0.,  2.,  2.,
        0.,  4.,  8.,  5.,  6.,  4.,  4.,  4.,  5.,  6., -1.,  2.,  1.,
        3.,  0.,  6.,  6.,  6.,  4.,  4.])

In [11]:
a = sequence_similarity("EGFR", "MET", "identity")
a

0.4588235294117647

As expected, comparing a kinase with itself yields the highest possible score:

In [12]:
sequence_similarity("EGFR", "EGFR")

1.0

In [13]:
sequence_similarity("EGFR", "EGFR", type_="substitution")

array([5., 4., 4., 6., 4., 6., 4., 6., 6., 5., 4., 7., 5., 4., 4., 4., 5.,
       5., 4., 5., 4., 4., 6., 5., 4., 7., 4., 5., 4., 4., 4., 6., 7., 8.,
       4., 9., 5., 4., 4., 6., 4., 5., 4., 4., 5., 5., 4., 5., 7., 6., 6.,
       9., 4., 4., 6., 7., 4., 5., 5., 7., 4., 5., 6., 5., 5., 4., 4., 8.,
       5., 6., 4., 4., 4., 5., 6., 4., 4., 4., 4., 5., 6., 6., 6., 4., 4.])
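
The substitution variant returns one score per position rather than a single value. One possible way to condense it, sketched here as an assumption and not as the method used in this talktorial, is to sum the per-position scores and divide by the mean of the two self-comparison sums, so that a kinase compared with itself scores 1. The function name `normalized_substitution_similarity` is hypothetical. The numbers are the first five pocket positions of EGFR and MET: the EGFR values are taken from the outputs above, and the MET self-scores from the BLOSUM62 diagonal.

```python
def normalized_substitution_similarity(scores_ab, scores_aa, scores_bb):
    """Condense per-position scores into one value (hypothetical helper)."""
    # Mean of the two self-comparison sums serves as the reference score
    self_mean = (sum(scores_aa) + sum(scores_bb)) / 2
    return sum(scores_ab) / self_mean


# First five pocket positions (EGFR: KVLGS, MET: EVIGR)
egfr_vs_met = [1, 4, 2, 6, -1]
egfr_vs_egfr = [5, 4, 4, 6, 4]
met_vs_met = [5, 4, 4, 6, 5]

print(normalized_substitution_similarity(egfr_vs_met,
                                         egfr_vs_egfr,
                                         met_vs_met))
print(normalized_substitution_similarity(egfr_vs_egfr,
                                         egfr_vs_egfr,
                                         egfr_vs_egfr))
```

Unlike the identity score, this value can become negative for very dissimilar sequences, because individual substitution scores can be negative.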

### Visualize similarity as kinase matrix

In [14]:
table = np.zeros((len(query_kinases), len(query_kinases)))
for i, kinase_name1 in enumerate(query_kinases):
    for j, kinase_name2 in enumerate(query_kinases):
        table[i, j] = sequence_similarity(kinase_name1,
                                          kinase_name2)
table

array([[1.        , 0.89411765, 0.37647059, 0.31764706, 0.44705882,
        0.45882353, 0.38823529, 0.47058824, 0.11764706],
       [0.89411765, 1.        , 0.4       , 0.32941176, 0.42352941,
        0.47058824, 0.4       , 0.43529412, 0.11764706],
       [0.37647059, 0.4       , 1.        , 0.32941176, 0.38823529,
        0.37647059, 0.37647059, 0.4       , 0.15294118],
       [0.31764706, 0.32941176, 0.32941176, 1.        , 0.37647059,
        0.36470588, 0.47058824, 0.34117647, 0.10588235],
       [0.44705882, 0.42352941, 0.38823529, 0.37647059, 1.        ,
        0.4       , 0.38823529, 0.43529412, 0.14117647],
       [0.45882353, 0.47058824, 0.37647059, 0.36470588, 0.4       ,
        1.        , 0.36470588, 0.47058824, 0.10588235],
       [0.38823529, 0.4       , 0.37647059, 0.47058824, 0.38823529,
        0.36470588, 1.        , 0.38823529, 0.14117647],
       [0.47058824, 0.43529412, 0.4       , 0.34117647, 0.43529412,
        0.47058824, 0.38823529, 1.        , 0.15294118],


### Substitution

In [15]:
a = np.array(list(kinase_sequences["EGFR"]))
a

array(['K', 'V', 'L', 'G', 'S', 'G', 'A', 'F', 'G', 'T', 'V', 'Y', 'K',
       'V', 'A', 'I', 'K', 'E', 'L', 'E', 'I', 'L', 'D', 'E', 'A', 'Y',
       'V', 'M', 'A', 'S', 'V', 'D', 'P', 'H', 'V', 'C', 'R', 'L', 'L',
       'G', 'I', 'Q', 'L', 'I', 'T', 'Q', 'L', 'M', 'P', 'F', 'G', 'C',
       'L', 'L', 'D', 'Y', 'V', 'R', 'E', 'Y', 'L', 'E', 'D', 'R', 'R',
       'L', 'V', 'H', 'R', 'D', 'L', 'A', 'A', 'R', 'N', 'V', 'L', 'V',
       'I', 'T', 'D', 'F', 'G', 'L', 'A'], dtype='<U1')

In [16]:
b = np.array(list(kinase_sequences["MET"]))
b

array(['E', 'V', 'I', 'G', 'R', 'G', 'H', 'F', 'G', 'C', 'V', 'Y', 'H',
       'C', 'A', 'V', 'K', 'S', 'L', 'Q', 'F', 'L', 'T', 'E', 'G', 'I',
       'I', 'M', 'K', 'D', 'F', 'S', 'P', 'N', 'V', 'L', 'S', 'L', 'L',
       'G', 'I', 'L', 'V', 'V', 'L', 'P', 'Y', 'M', 'K', 'H', 'G', 'D',
       'L', 'R', 'N', 'F', 'I', 'R', 'N', 'Y', 'L', 'A', 'S', 'K', 'K',
       'F', 'V', 'H', 'R', 'D', 'L', 'A', 'A', 'R', 'N', 'C', 'M', 'L',
       'V', 'A', 'D', 'F', 'G', 'L', 'A'], dtype='<U1')

In [17]:
# Obtain BLOSUM62
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   

In [18]:
score_matrix = substitution_matrix.score_matrix()
type(score_matrix), score_matrix.shape

(numpy.ndarray, (24, 24))

In [19]:
letter_alphabet = substitution_matrix.get_alphabet1()

In [20]:
dict_letters = {}
for i, letter in enumerate(letter_alphabet.get_symbols()):
    dict_letters[letter] = i
dict_letters

{'A': 0,
 'C': 1,
 'D': 2,
 'E': 3,
 'F': 4,
 'G': 5,
 'H': 6,
 'I': 7,
 'K': 8,
 'L': 9,
 'M': 10,
 'N': 11,
 'P': 12,
 'Q': 13,
 'R': 14,
 'S': 15,
 'T': 16,
 'V': 17,
 'W': 18,
 'Y': 19,
 'B': 20,
 'Z': 21,
 'X': 22,
 '*': 23}
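
The mapping above has no key for the gap character `-`, which occurs in the KLIFS pocket sequences of CDK2, p38a, and p110a, so looking up those sequences position by position would raise a `KeyError`. The sketch below shows one possible way to handle this, assuming a fixed score for gaps. The function name `gap_aware_scores`, the `GAP_SCORE` value, and the small score table (a handful of BLOSUM62 entries copied from the matrix printed above) are illustrative assumptions.

```python
# A few BLOSUM62 entries, copied from the matrix printed above
SCORES = {
    ("A", "A"): 4,
    ("A", "G"): 0,
    ("E", "E"): 5,
    ("E", "K"): 1,
    ("G", "G"): 6,
    ("K", "K"): 5,
}
GAP_SCORE = -4  # assumed score for any position involving a gap


def gap_aware_scores(sequence1, sequence2,
                     scores=SCORES, gap_score=GAP_SCORE):
    """Per-position scores with a fixed score for gaps (illustrative)."""
    result = []
    for a, b in zip(sequence1, sequence2):
        if a == "-" or b == "-":
            result.append(gap_score)
        else:
            # The table stores each pair once, so try both orientations
            result.append(scores[(a, b)] if (a, b) in scores
                          else scores[(b, a)])
    return result


print(gap_aware_scores("KAG-", "EGGA"))  # [1, 0, 6, -4]
```

Whether a gap should be penalized, ignored, or scored like a match is a modeling decision that this talktorial does not specify.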

In [21]:
match_score = np.zeros(len(a))
for i, (character_seq1, character_seq2) in enumerate(zip(a, b)):
    # Look up the substitution score of each residue pair
    ind1 = dict_letters[character_seq1]
    ind2 = dict_letters[character_seq2]
    match_score[i] = score_matrix[ind1, ind2]

In [22]:
match_score

array([ 1.,  4.,  2.,  6., -1.,  6., -2.,  6.,  6., -1.,  4.,  7., -1.,
       -1.,  4.,  3.,  5.,  0.,  4.,  2.,  0.,  4., -1.,  5.,  0., -1.,
        3.,  5., -1.,  0., -1.,  0.,  7.,  1.,  4., -1., -1.,  4.,  4.,
        6.,  4., -2.,  1.,  3., -1., -1., -1.,  5., -1., -1.,  6., -3.,
        4., -2.,  1.,  3.,  3.,  5.,  0.,  7.,  4., -1.,  0.,  2.,  2.,
        0.,  4.,  8.,  5.,  6.,  4.,  4.,  4.,  5.,  6., -1.,  2.,  1.,
        3.,  0.,  6.,  6.,  6.,  4.,  4.])

In [23]:
substitution_matrix.shape()

(24, 24)

In [24]:
substitution_matrix.is_symmetric()

True

### Show kinase coverage

### Visualize similarity as phylogenetic tree

## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

## Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question