<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# T029 · Compound activity: Proteochemometrics

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

- Marina Gorostiola González, 2022, Computational Drug Discovery, Drug Discovery & Safety Leiden University (The Netherlands)
- Olivier J.M. Béquignon, 2022, Computational Drug Discovery, Drug Discovery & Safety Leiden University (The Netherlands)
- Willem Jespers, 2022, Computational Drug Discovery, Drug Discovery & Safety Leiden University (The Netherlands)

*The examples used in this talktorial template are taken from [__Talktorial T001__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T001_query_chembl/talktorial.ipynb) and [__Talktorial T002__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T002_compound_adme/talktorial.ipynb).*

<div class="alert alert-block alert-info">

<b>Cross-referencing talktorials</b>: If you want to cross-reference to existing talktorials in your notebook, please use the following formatting: <b>Talktorial T000</b>.

</div>

## Aim of this talktorial

While activity data is abundant for some protein targets, many proteins remain underexplored, and the lack of data makes machine learning (ML) for activity prediction very difficult. This issue can be mitigated by leveraging similarities and differences between proteins. In this talktorial, we use proteochemometric (PCM) modelling to enrich our activity models with protein data and to predict the activity of novel compounds against the four adenosine receptor isoforms (A1, A2A, A2B, A3).

### Contents in *Theory*

* Data preparation
    * Papyrus dataset
    * Molecule encoding: molecular descriptors
    * Protein encoding: protein descriptors

* Proteochemometrics (PCM)
    * Machine learning (ML): regression model
    * Applications in drug discovery

### Contents in *Practical*

* Download Papyrus dataset
* Data preparation
    * Filter activity data for targets of interest
    * Align target sequences
    * Calculate protein descriptors
    * Calculate compound descriptors
* Proteochemometrics
    * Helper functions
    * XGBoost regressor

### References

* Papyrus scripts: [GitHub](https://github.com/OlivierBeq/Papyrus-scripts)
* Papyrus dataset preprint: [<i>ChemRxiv</i> (2021)](https://chemrxiv.org/engage/chemrxiv/article-details/617aa2467a002162403d71f0)
* Molecular descriptors (Mordred): [<i>J. Cheminf.</i>, 10, (2018)](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y)
* Proteochemometrics review: [<i>Drug Discov. Today Technol.</i> (2019), <b>32</b>, 89-98](https://www.sciencedirect.com/science/article/pii/S1740674920300111?via%3Dihub)

* Tutorial links
* Other useful resources


## Theory

To successfully apply PCM modelling, we need a large dataset of molecule-protein pairs with known bioactivity values, a way of describing molecules and proteins, and an ML algorithm to train a model. Then, we can make predictions for new molecule-protein pairs.

<b>NOTE:</b> PCM modelling is an extension of the ligand-based ML modelling described in <b>Talktorial T007</b>. See that talktorial to learn more about the basic principles of activity prediction using ML.

<img src='images/PCM_model_text-01.png' width="1000">

*Figure 1:*
Construction of a proteochemometric model from the molecular and protein descriptors of compound-protein pairs with known bioactivity data.
Figure made by Marina Gorostiola González.
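
The construction shown in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline used later in the talktorial: the compound and protein identifiers, the descriptor values, and the helper name `build_pcm_matrix` are all hypothetical. Each training row is simply the compound descriptor followed by the protein descriptor of one pair with a known activity.

```python
# Hypothetical, hand-made descriptor values used only for illustration.
compound_descriptors = {
    "cpd_1": [0.2, 1.0, 3.0],
    "cpd_2": [0.7, 0.0, 5.0],
}
protein_descriptors = {
    "A1": [1.5, -0.3],
    "A2A": [1.1, 0.4],
}

# Known bioactivities for compound-protein pairs (made-up pChEMBL-like values).
activities = [
    ("cpd_1", "A1", 6.5),
    ("cpd_1", "A2A", 7.8),
    ("cpd_2", "A2A", 5.1),
]


def build_pcm_matrix(activities, compound_descriptors, protein_descriptors):
    """Concatenate compound and protein descriptors for every known pair."""
    features, targets = [], []
    for compound_id, protein_id, activity in activities:
        # One row per pair: compound features first, protein features second
        row = compound_descriptors[compound_id] + protein_descriptors[protein_id]
        features.append(row)
        targets.append(activity)
    return features, targets


X, y = build_pcm_matrix(activities, compound_descriptors, protein_descriptors)
```

The same compound appears in several rows, once per protein it was measured against, which is what allows a single model to learn from all targets at once.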

### Data preparation

#### Papyrus dataset

The Papyrus dataset is a highly curated compilation of bioactivity data intended for modelling in drug discovery. Apart from the bioactivity data contained in the ChEMBL database (see also <b>Talktorial T001</b>), the Papyrus dataset contains binary data for classification tasks from ExCAPE-DB and bioactivity data from a number of kinase-specific papers (Figure 2).

The aggregated bioactivity data is standardized, repaired, and normalised to form the Papyrus dataset, which is updated with every new ChEMBL release. The Papyrus dataset contains "high quality" data associated with pChEMBL values for regression tasks and "low quality" data associated with an active/inactive label for classification tasks (read more about ML applications in <b>Talktorial T007</b>).

<img src='images/papyrus_workflow.png' width="1000">

*Figure 2:*
Papyrus dataset generation scheme.
Figure taken from: [<i>ChemRxiv</i> (2021)](https://chemrxiv.org/engage/chemrxiv/article-details/617aa2467a002162403d71f0).
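
The split between "high quality" data for regression and "low quality" data for classification can be illustrated with a small pandas filter. This is only a sketch on a toy table: the column names (`quality`, `pchembl_value`, `activity_class`) and all values are assumptions made for this example and do not reproduce the actual Papyrus schema.

```python
import pandas as pd

# Toy table; column names and values are illustrative assumptions,
# not the real Papyrus columns.
data = pd.DataFrame(
    {
        "compound": ["cpd_1", "cpd_2", "cpd_3", "cpd_4"],
        "target": ["A1", "A2A", "A2A", "A3"],
        "quality": ["high", "high", "low", "low"],
        "pchembl_value": [7.2, 5.9, None, None],
        "activity_class": [None, None, "active", "inactive"],
    }
)

# Regression needs a continuous target: keep high-quality rows with a pChEMBL value
regression_set = data[(data["quality"] == "high") & data["pchembl_value"].notna()]

# Classification needs a label: keep low-quality rows with an active/inactive class
classification_set = data[(data["quality"] == "low") & data["activity_class"].notna()]
```

Since this talktorial focuses on regression, only a subset like `regression_set` would be carried forward.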

#### Molecule encoding: molecular descriptors

For the ML models used in PCM, molecules need to be converted into a list of features. In <b>Talktorial T007</b>, molecular fingerprints were introduced. In this talktorial, we will use a different type of representation that is often used on its own or in combination with fingerprints: molecular descriptors.

<b>Molecular descriptors</b> are the "final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment" ([<i>J. Cheminf.</i>, 10, (2018)](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y)). These descriptors can be, for example, molecular weight, ring count, Eccentric Connectivity Index (calculated from the 2D structure), or Geometrical Index (calculated from the 3D structure).

In this talktorial, we use Mordred as the software engine to calculate molecular descriptors. Mordred calculates more than 1,800 molecular descriptors, including those implemented in RDKit, and applies an automatic preprocessing step that is common to all calculated descriptors. For simplicity, we calculate only four types of descriptors from the long list that Mordred offers and exclude 3D descriptors. The four types are:

* <b>ABC Index</b>: 2 descriptors that represent the atom-bond connectivity index and the Graovac-Ghorbani atom-bond connectivity index (see Mordred <code>ABCIndex</code> [docs](https://mordred-descriptor.github.io/documentation/master/api/mordred.ABCIndex.html))
* <b>Acid-Base</b>: 2 descriptors that count acidic and basic groups, respectively (see Mordred <code>AcidBase</code> [docs](https://mordred-descriptor.github.io/documentation/master/api/mordred.AcidBase.html?highlight=acidbase))
* <b>Atom count</b>: 16 descriptors that represent a count of different types of atoms (see Mordred <code>AtomCount</code> [docs](https://mordred-descriptor.github.io/documentation/master/api/mordred.AtomCount.html?highlight=atomcount))
* <b>Balaban J index</b>: 1 descriptor (included in RDKit), which represents a topological index (see Mordred <code>BalabanJ</code> [docs](https://mordred-descriptor.github.io/documentation/master/api/mordred.BalabanJ.html?highlight=balaban#module-mordred.BalabanJ))
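
To make the idea of a descriptor concrete, the sketch below computes heavy-atom counts directly from a SMILES string, in the spirit of the atom-count descriptors listed above. It is a deliberately simplified stand-in written for this example: it does not call Mordred or RDKit, the function name `heavy_atom_counts` is hypothetical, and the regular expression only handles common SMILES atom tokens. In practice, a cheminformatics toolkit should do this work.

```python
import re
from collections import Counter

# Matches either a bracket atom such as [nH], [NH3+], [13C] (element captured
# in group 1) or an unbracketed organic-subset / aromatic atom (group 2).
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def heavy_atom_counts(smiles):
    """Count heavy atoms per element in a SMILES string (simplified parser)."""
    counts = Counter()
    for bracket_symbol, plain_symbol in ATOM_PATTERN.findall(smiles):
        # Aromatic atoms are lowercase in SMILES; normalise to the element symbol
        symbol = (bracket_symbol or plain_symbol).capitalize()
        if symbol != "H":  # explicit hydrogens are not heavy atoms
            counts[symbol] += 1
    return counts


caffeine = heavy_atom_counts("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
n_heavy_atoms = sum(caffeine.values())
```

Each count is one number per molecule, so a set of such counts already forms a small fixed-length feature vector that can be fed to an ML model.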

#### Protein encoding: protein descriptors

As with molecules, the proteins of interest need to be converted into a list of features, or protein descriptors. Protein descriptors used in PCM applications are commonly based on the protein sequence and represent physicochemical characteristics of the amino acids that make up the sequence (e.g. Z-scales). Other protein descriptors represent topological (e.g. St-scales) or electrostatic properties (e.g. MS-WHIM) of the protein sequence. Moreover, if structural information is available, protein descriptors can be derived from the 3D structure of the protein (e.g. sPairs) or from the ligand-protein interaction in 3D (e.g. interaction fingerprints). Finally, with the widespread use of deep learning, protein embeddings can be obtained by passing the protein sequence through a trained network (e.g. UniRep, AlphaFold embeddings).

For sequence-based protein descriptors, one aspect to take into account is that ML requires the descriptor to have the same length for every protein. However, most proteins differ in sequence length. There are two main approaches to solve this issue:
* <b>Multiple sequence alignment</b>: When the whole protein is to be incorporated into the model, a multiple sequence alignment can be performed. The final descriptor has as many features as the number of features per amino acid multiplied by the number of aligned positions. Note that gaps in the alignment receive zeros in the descriptor.
* <b>Binding pocket selection</b>: To avoid unnecessary features, a binding pocket of the same length can be selected for each protein. Normally, the binding pocket selection is preceded by a multiple sequence alignment and driven by known structural or mutagenesis data.

Other options are available when proteins are not of the same family or do not share a binding pocket (see [<i>Drug Discov. Today Technol.</i> (2019), <b>32</b>, 89-98](https://www.sciencedirect.com/science/article/pii/S1740674920300111?via%3Dihub)).
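
Binding pocket selection on top of a multiple sequence alignment amounts to keeping the same alignment columns for every protein. The sketch below illustrates this under stated assumptions: the aligned sequences are short made-up strings (not real adenosine receptor sequences), the pocket column indices are arbitrary, and `select_pocket` is a hypothetical helper.

```python
# Made-up aligned sequences of equal length; "-" marks an alignment gap.
aligned_sequences = {
    "protein_1": "ACD-EFG",
    "protein_2": "GCDKEYG",
    "protein_3": "A-DKW-G",
}

# Hypothetical 0-based alignment columns that form the binding pocket.
pocket_columns = [0, 2, 4, 5]


def select_pocket(aligned_sequences, columns):
    """Keep only the given alignment columns for every aligned sequence."""
    lengths = {len(sequence) for sequence in aligned_sequences.values()}
    if len(lengths) != 1:
        raise ValueError("All aligned sequences must have the same length.")
    return {
        name: "".join(sequence[column] for column in columns)
        for name, sequence in aligned_sequences.items()
    }


pockets = select_pocket(aligned_sequences, pocket_columns)
```

Because the same columns are used for all proteins, every pocket has the same length, and any residue-level descriptor computed from it has the same number of features.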

In this talktorial, we will focus on physicochemical protein descriptors, mainly <b>Z-scales</b> ([<i>J. Med. Chem.</i>, 30 (1987)](https://pubs.acs.org/doi/10.1021/jm00390a003)). The Z-scales descriptor assigns three pre-determined values (Z<sub>1</sub>, Z<sub>2</sub>, Z<sub>3</sub>) to each amino acid in the sequence. The Z<sub>1</sub>, Z<sub>2</sub>, and Z<sub>3</sub> values are the first three principal components of a principal component analysis (PCA) of 29 different physicochemical variables that characterize the amino acids.
Since we are predicting activity for four proteins with very high sequence similarity (adenosine receptors A1, A2A, A2B, and A3), we will use <b>multiple sequence alignment</b> prior to calculation of the Z-scales.
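
The encoding of an aligned sequence with three values per residue, with zeros for gaps, can be sketched as follows. The numbers in `TOY_SCALES` are placeholders chosen for readability and are not the published Z-scale values, which would need to be taken from the original publication. The function name `encode_aligned_sequence` is likewise hypothetical.

```python
# Placeholder values for a few residues; NOT the published Z-scales.
TOY_SCALES = {
    "A": (1.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0),
    "D": (0.0, 0.0, 1.0),
    "E": (1.0, 1.0, 0.0),
}


def encode_aligned_sequence(sequence, scales, n_values=3):
    """Turn an aligned sequence into a flat descriptor, with zeros for gaps."""
    descriptor = []
    for residue in sequence:
        if residue == "-":
            # Gaps in the alignment contribute zeros for every scale
            descriptor.extend([0.0] * n_values)
        else:
            descriptor.extend(scales[residue])
    return descriptor


descriptor_1 = encode_aligned_sequence("A-D", TOY_SCALES)
descriptor_2 = encode_aligned_sequence("ACE", TOY_SCALES)
```

Both descriptors have 3 values × 3 aligned positions = 9 features, regardless of how many residues each ungapped sequence contains.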

### Proteochemometrics (PCM)

The ML principles for proteochemometric modelling are equivalent to those explained in <b>Talktorial T007</b>. However, in this talktorial we explore the other main type of supervised ML task: <b>regression</b>. Regression requires a continuous target variable, for example pChEMBL values.
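
A pChEMBL value is the negative base-10 logarithm of a molar activity value (such as IC50, Ki, or Kd), which gives a continuous target suited to regression. The sketch below converts a nanomolar measurement to this scale and computes the root-mean-square error (RMSE), a common way to score regression predictions. The function names and the example numbers are illustrative assumptions, not results from this talktorial.

```python
import math


def pchembl_from_nanomolar(value_nm):
    """Convert an activity in nM (e.g. IC50 or Ki) to the -log10(molar) scale."""
    return -math.log10(value_nm * 1e-9)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# A 10 nM compound corresponds to a value of about 8 on this scale
example_value = pchembl_from_nanomolar(10)

# Made-up observed and predicted values for two compound-protein pairs
example_error = rmse([6.0, 8.0], [7.0, 7.0])
```

Because the scale is logarithmic, an RMSE of 1 corresponds to predictions that are off by roughly a factor of ten in concentration.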



#### Applications in drug discovery

* Multi-target activity prediction
* Selectivity


## Practical

Add short summary of what will be done in this practical section.

<div class="alert alert-block alert-info">

<b>Sync section titles with TOC</b>: Please make sure that all section titles in the <i>Practical</i> section are synced with the bullet point list provided in the <i>Aim of this talktorial</i> > <i>Contents in Practical</i> section.

</div>

<div class="alert alert-block alert-info">
    
<b>Beware of section levels</b>: Please check if you are using the correct subsection levels. The section <i>Practical</i> is written in Markdown as <code>## Practical</code>, so every subsection within <i>Practical</i> is <code>###</code> or lower.

</div>

In [1]:
from pathlib import Path
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import matplotlib.patches as mpatches
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw, PandasTools



<div class="alert alert-block alert-info">

<b>Imports</b>: Please add all your imports on top of this section, ordered by standard library / 3rd party packages / our own (<code>teachopencadd.*</code>). 
Read more on imports and import order in the <a href="https://www.python.org/dev/peps/pep-0008/#imports">"PEP 8 -- Style Guide for Python Code"</a>.
    
</div>

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

<div class="alert alert-block alert-info">

<b>Relative paths</b>: Please define all paths relative to this talktorial's path by using the global variable <code>HERE</code>.
If your talktorial has input/output data, please define the global <code>DATA</code>, which points to this talktorial's data folder (check out the default folder structure of each talktorial).
    
</div>

### Connect to ChEMBL database

_Explain what you will do and why here in the Markdown cell. This includes everything that has to do with the talktorial's storytelling._

In [3]:
# Add comments in the code cell if you want to comment on coding decisions

<div class="alert alert-block alert-info">

<b>Functions</b>: 

<ul>
<li>Please add <a href="https://numpydoc.readthedocs.io/en/latest/format.html">numpy docstrings</a> to your functions.</li>
<li>Please expose all variables used within a function in the function's signature (i.e. they must be function parameters), unless they are created within the scope of the function.</li>
<li>Please add comments to the steps performed in the function.</li>
<li>Please use meaningful function and parameter names. This applies also to variable names.</li>
</ul>
    
</div>

In [4]:
def calculate_ro5_properties(smiles):
    """
    Test if input molecule (SMILES) fulfills Lipinski's rule of five.

    Parameters
    ----------
    smiles : str
        SMILES for a molecule.

    Returns
    -------
    pandas.Series
        Molecular weight, number of hydrogen bond acceptors/donor and logP value
        and Lipinski's rule of five compliance for input molecule.
    """
    # RDKit molecule from SMILES
    molecule = Chem.MolFromSmiles(smiles)
    # Calculate Ro5-relevant chemical properties
    molecular_weight = Descriptors.ExactMolWt(molecule)
    n_hba = Descriptors.NumHAcceptors(molecule)
    n_hbd = Descriptors.NumHDonors(molecule)
    logp = Descriptors.MolLogP(molecule)
    # Ro5 conditions fulfilled
    conditions = [molecular_weight <= 500, n_hba <= 10, n_hbd <= 5, logp <= 5]
    ro5_fulfilled = sum(conditions) >= 3
    # Return True if no more than one out of four conditions is violated
    return pd.Series(
        [molecular_weight, n_hba, n_hbd, logp, ro5_fulfilled],
        index=["molecular_weight", "n_hba", "n_hbd", "logp", "ro5_fulfilled"],
    )

### Load and draw molecules

_Explain what you will do and why here in the Markdown cell. This includes everything that has to do with the talktorial's storytelling._

In [5]:
# Add comments in the code cell if you want to comment on coding decisions

## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

## Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question

<div class="alert alert-block alert-info">

<b>Useful checks at the end</b>: 
    
<ul>
<li>Clear output and rerun your complete notebook. Does it finish without errors?</li>
<li>Check if your talktorial's runtime is as expected. If not, try to find out which step(s) take unexpectedly long.</li>
<li>Flag code cells with <code># NBVAL_CHECK_OUTPUT</code> that have deterministic output and should be tested within our Continuous Integration (CI) framework.</li>
</ul>

</div>