# Talktorial 1

# Compound data acquisition (ChEMBL)

#### Developed in the CADD seminars 2017 and 2018, AG Volkamer, Charité/FU Berlin 

Paula Junge, Svetlana Leng, Dominique Sydow

## Aim of this talktorial

We learn how to extract data from ChEMBL:

* Find ligands which were tested on a certain target
* Filter available bioactivity data
* Calculate pIC50 values
* Merge `DataFrame`s and draw extracted molecules

## Learning goals


### Theory

* ChEMBL database
    * ChEMBL web services
    * ChEMBL webresource client
* Compound activity measures
    * IC50
    * pIC50

### Practical
    
Goal: Get a list of compounds with bioactivity data for a given target

* Connect to ChEMBL database
* Get target data (EGFR kinase)
* Get bioactivity data
    * Download and filter bioactivities
* Get compound data
    * Download and filter compounds
* Merge bioactivity and compound data
    * Draw molecules with highest pIC50
    * Write output file

## References

* ChEMBL bioactivity database (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210557/)
* ChEMBL web services: <i>Nucleic Acids Res.</i> (2015), <b>43</b>, W612-W620 (https://academic.oup.com/nar/article/43/W1/W612/2467881)
* ChEMBL webresource client GitHub (https://github.com/chembl/chembl_webresource_client)
* myChEMBL webservices version 2.x (https://github.com/chembl/mychembl/blob/master/ipython_notebooks/09_myChEMBL_web_services.ipynb)
* ChEMBL web-interface (https://www.ebi.ac.uk/chembl/)
* EBI-RDF platform (https://www.ncbi.nlm.nih.gov/pubmed/24413672)
* IC50 and pIC50 (https://en.wikipedia.org/wiki/IC50)
* UniProt website (https://www.uniprot.org/)

## Theory

### ChEMBL database

* Open large-scale bioactivity database
* **Current data content (as of 10.2018):**
    * \>1.8 million distinct compound structures
    * \>15 million activity values from 1 million assays
    * Assays are mapped to ∼12 000 targets
* **Data sources** include scientific literature, PubChem bioassays, Drugs for Neglected Diseases Initiative (DNDi), BindingDB database, ...
* ChEMBL data can be accessed via a [web-interface](https://www.ebi.ac.uk/chembl/), the [EBI-RDF platform](https://www.ncbi.nlm.nih.gov/pubmed/24413672) and the [ChEMBL web services](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4489243/#B5)

#### ChEMBL web services

* RESTful web service
* ChEMBL web service version 2.x resource schema: 

[![ChEMBL web service schema](images/chembl_webservices_schema_diagram.jpg)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4489243/figure/F2/)

*Figure 1:* 
"ChEMBL web service schema diagram. The oval shapes represent ChEMBL web service resources and the line between two resources indicates that they share a common attribute. The arrow direction shows where the primary information about a resource type can be found. A dashed line indicates the relationship between two resources behaves differently. For example, the `Image` resource provides a graphical based representation of a `Molecule`."
Figure and description taken from: [<i>Nucleic Acids Res.</i> (2015), <b>43</b>, W612-W620](https://academic.oup.com/nar/article/43/W1/W612/2467881).

#### ChEMBL webresource client

* Python client library for accessing ChEMBL data
* Handles interaction with the HTTPS protocol
* Lazy evaluation of results -> reduced number of network requests

### Compound activity measures

#### IC50 

* [Half maximal inhibitory concentration](https://en.wikipedia.org/wiki/IC50)
* Indicates how much of a particular drug or other substance is needed to inhibit a given biological process by half

[<img src="https://upload.wikimedia.org/wikipedia/commons/8/81/Example_IC50_curve_demonstrating_visually_how_IC50_is_derived.png" width="450" align="center" >](https://commons.wikimedia.org/wiki/File:Example_IC50_curve_demonstrating_visually_how_IC50_is_derived.png)

*Figure 2:* Visual demonstration of how to derive an IC50 value: 
(i) Arrange data with inhibition on vertical axis and log(concentration) on horizontal axis. (ii) Identify maximum and minimum inhibition. (iii) The IC50 is the concentration at which the curve passes through the 50% inhibition level. Figure ["Example IC50 curve demonstrating visually how IC50 is derived"](https://en.wikipedia.org/wiki/IC50#/media/File:Example_IC50_curve_demonstrating_visually_how_IC50_is_derived.png) by JesseAlanGordon is licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
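
The three steps in the caption can be sketched in a few lines of Python. This is only an illustration on synthetic data: the curve shape (a simple one-site inhibition model), the concentration range and the "true" IC50 are assumptions chosen for the example, and real assay data would usually be fitted with a proper dose-response model rather than interpolated.

```python
import numpy as np

# Synthetic dose-response data (assumed, not measured): a one-site inhibition
# curve with a true IC50 of 1e-7 M, sampled at half-log concentration steps.
true_ic50 = 1e-7
concentrations = np.logspace(-9, -5, 9)  # in M
inhibition = 100 * concentrations / (concentrations + true_ic50)  # in %

# (i) Put the concentrations on a log10 axis.
log_concentrations = np.log10(concentrations)

# (ii) Identify minimum and maximum inhibition and the level halfway between.
half_level = (inhibition.min() + inhibition.max()) / 2

# (iii) Read off the concentration at which the curve crosses that level
# (linear interpolation on the log axis; inhibition increases monotonically).
log_ic50 = np.interp(half_level, inhibition, log_concentrations)
estimated_ic50 = 10 ** log_ic50

print(f"Estimated IC50: {estimated_ic50:.2e} M")
```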

#### pIC50

* To facilitate the comparison of IC50 values, we define pIC50 values on a logarithmic scale, such that <br />
    $ pIC_{50} = -\log_{10}(IC_{50}) $ where $ IC_{50}$ is specified in units of M.
* Higher pIC50 values indicate exponentially greater potency of the drug: an increase of one pIC50 unit corresponds to a tenfold lower IC50
* The conversion assumes molar concentrations (mol/L or M) <br />
    * IC50 values given in other units must be converted to M first
    * For nM: $pIC_{50} = -\log_{10}(IC_{50} \cdot 10^{-9}) = 9 - \log_{10}(IC_{50}) $
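
As a small worked example of the definition above, the following sketch converts a few IC50 values to pIC50. The function name `pic50_from_ic50` and the unit table are illustrative assumptions, not part of ChEMBL or of this talktorial's code.

```python
import math


def pic50_from_ic50(ic50_value, unit="nM"):
    """Convert an IC50 value to pIC50 (hypothetical helper for illustration)."""
    # Factors to convert the given unit to mol/L (M)
    factors = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}
    ic50_molar = ic50_value * factors[unit]
    return -math.log10(ic50_molar)


# A tenfold lower IC50 raises the pIC50 by one unit
for ic50_nm in (1000, 10, 1):
    print(f"IC50 = {ic50_nm} nM -> pIC50 = {pic50_from_ic50(ic50_nm):.1f}")
```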

#### Other activity measures

Besides IC50 and pIC50, other bioactivity measures are used, such as the inhibition constant Ki (an [equilibrium constant](https://en.wikipedia.org/wiki/Equilibrium_constant)) and the half maximal effective concentration [EC50](https://en.wikipedia.org/wiki/EC50).

## Practical

In the following, we want to download all molecules that have been tested against our target of interest, the EGFR kinase.

### Connect to ChEMBL database

First, we import the ChEMBL webresource client as well as other Python libraries.

In [None]:
import math

from chembl_webresource_client.new_client import new_client
import pandas as pd
from rdkit.Chem import PandasTools

Create resource objects for API access.

In [None]:
targets_api = new_client.target
compounds_api = new_client.molecule
bioactivities_api = new_client.activity

In [None]:
type(targets_api)

### Target data

* Get UniProt ID (http://www.uniprot.org/uniprot/P00533) of the target of interest (EGFR kinase) from UniProt website (https://www.uniprot.org/)
* Use UniProt ID to get target information
* Select a different UniProt ID if you are interested in another target

In [None]:
uniprot_id = 'P00533'

#### Fetch target data from ChEMBL

In [None]:
# Get target information from ChEMBL but restrict to specified values only
targets = targets_api.get(
    target_components__accession=uniprot_id
).only(
    'target_chembl_id', 
    'organism', 
    'pref_name', 
    'target_type'
)
print(type(targets))

#### Download target data from ChEMBL

The results of the query are stored in `targets`, a `QuerySet`, i.e. the results are not fetched from ChEMBL until we ask for them (here using `pandas.DataFrame.from_records`).

More information about the `QuerySet` datatype:

> QuerySets are lazy – the act of creating a QuerySet doesn’t involve any database activity. You can stack filters together all day long, and Django won’t actually run the query until the QuerySet is evaluated. 
From: https://docs.djangoproject.com/en/3.0/topics/db/queries/#querysets-are-lazy

In [None]:
targets = pd.DataFrame.from_records(targets)
targets
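
To make the idea of lazy evaluation more concrete, here is a toy sketch. The class `LazyQuery` is a hypothetical stand-in written for this example; it is not the ChEMBL webresource client and does not reflect its internals. It only shows the principle: stacking filters builds up a query description, and the data is touched only when the object is iterated.

```python
class LazyQuery:
    """Toy stand-in for a lazy query object (hypothetical, not the ChEMBL client)."""

    def __init__(self, records, filters=None):
        self._records = records
        self._filters = filters or {}
        self.fetch_count = 0  # counts simulated "network requests"

    def filter(self, **conditions):
        # Stacking filters only builds a new description of the query
        return LazyQuery(self._records, {**self._filters, **conditions})

    def __iter__(self):
        # The data is only touched once the object is iterated
        self.fetch_count += 1
        for record in self._records:
            if all(record.get(key) == value for key, value in self._filters.items()):
                yield record


# Made-up records for illustration
records = [
    {"organism": "Homo sapiens", "target_type": "SINGLE PROTEIN"},
    {"organism": "Homo sapiens", "target_type": "PROTEIN FAMILY"},
    {"organism": "Mus musculus", "target_type": "SINGLE PROTEIN"},
]

query = LazyQuery(records).filter(organism="Homo sapiens").filter(target_type="SINGLE PROTEIN")
fetches_before = query.fetch_count  # nothing has been evaluated yet
results = list(query)               # evaluation happens here
fetches_after = query.fetch_count

print(fetches_before, fetches_after, len(results))
```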

#### Select target (target ChEMBL ID)

After checking the entries, we select the first entry as our target of interest:

`CHEMBL203`: It is a single protein and represents the human Epidermal growth factor receptor (EGFR, also named erbB1) 

In [None]:
target = targets.iloc[0]
target

Save selected ChEMBL ID.

In [None]:
chembl_id = target.target_chembl_id
chembl_id
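
Selecting the first row works here because we inspected the table by hand. If the order of the results cannot be relied on, the selection could instead be made by explicit criteria. The sketch below uses a toy table with the same columns as `targets`; apart from `CHEMBL203`, the rows and labels are made-up placeholders.

```python
import pandas as pd

# Toy table with the same columns as `targets`; the second row is a placeholder
toy_targets = pd.DataFrame(
    {
        "target_chembl_id": ["CHEMBL203", "CHEMBL_PLACEHOLDER"],
        "organism": ["Homo sapiens", "Homo sapiens"],
        "pref_name": ["EGFR (toy label)", "Toy protein family"],
        "target_type": ["SINGLE PROTEIN", "PROTEIN FAMILY"],
    }
)

# Keep only human single-protein targets
is_human_single_protein = (toy_targets["organism"] == "Homo sapiens") & (
    toy_targets["target_type"] == "SINGLE PROTEIN"
)
selected = toy_targets[is_human_single_protein]

# Fail loudly if the criteria do not identify exactly one target
if len(selected) != 1:
    raise ValueError(f"Expected exactly one matching target, found {len(selected)}")

selected_chembl_id = selected.iloc[0]["target_chembl_id"]
print(selected_chembl_id)
```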

### Bioactivity data

Now, we want to query bioactivity data for the target of interest.

#### Fetch bioactivity data for the target from ChEMBL

In this step, we download and filter the bioactivity data and only consider

* human proteins (ensured by the selected target ChEMBL ID, which refers to the human EGFR),
* bioactivity type IC50, 
* exact measurements (relation `'='`), and
* binding data (assay type `'B'`).

In [None]:
bioactivities = bioactivities_api.filter(
    target_chembl_id=chembl_id, type='IC50', relation='=', assay_type='B'
).only(
    'activity_id',
    'assay_chembl_id', 
    'assay_description', 
    'assay_type', 
    'molecule_chembl_id', 
    'type', 
    'standard_units',
    'relation', 
    'standard_value',
    'target_chembl_id', 
    'target_organism'
)

print(f'Length and type of bioactivities object: {len(bioactivities)}, {type(bioactivities)}')
print(f'Length and type of first element: {len(bioactivities[0])}, {type(bioactivities[0])}')

In [None]:
bioactivities[0]

Are you having difficulties querying bioactivities from ChEMBL?

<details>
    
<summary>Click here.</summary>
    
If you experience difficulties querying the ChEMBL database, we provide a file containing the results of the query in the previous cell (11 April 2019). We created it with the Python package pickle, which serializes Python objects so that they can be saved to a file and loaded into a program again later on.
Learn more about object serialization on [DataCamp](https://www.datacamp.com/community/tutorials/pickle-python-tutorial).

You can load the "pickled" bioactivities by running the following code:
  
<code>import pickle</code> 

<code>bioactivities = pickle.load(open("../data/T1/EGFR_compounds_from_chembl_query_20190411.p", "rb"))</code> 

</details>
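
For readers unfamiliar with pickle, the following sketch shows a complete save-and-load round trip. The records and the file name are made up for this example, and a temporary directory is used so that no project files are touched. Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Toy records standing in for downloaded bioactivities (values are made up)
toy_bioactivities = [
    {"molecule_chembl_id": "CHEMBL_TOY_1", "standard_value": "41.0", "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_TOY_2", "standard_value": "170.0", "standard_units": "nM"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "toy_bioactivities.p")

    # Serialize the object to a file ...
    with open(pickle_path, "wb") as handle:
        pickle.dump(toy_bioactivities, handle)

    # ... and load it again; the context manager closes the file for us
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_bioactivities)
```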

#### Download bioactivity data from ChEMBL

Again, we download the `QuerySet` in the form of a `pandas` `DataFrame`. **This may take some time.**

In [None]:
bioactivities_df = pd.DataFrame.from_records(bioactivities)
print(f'DataFrame shape: {bioactivities_df.shape}')
bioactivities_df.head()

Note that we have columns for `standard_units`/`units` and `standard_value`/`value`. In the following, we will use the standardized columns (standardization by ChEMBL) and therefore drop the other two columns.

In [None]:
bioactivities_df.drop(['units', 'value'], axis=1, inplace=True)
bioactivities_df.head()

#### Preprocess and filter bioactivity data

1. Convert `standard_value`'s datatype from `object` to `float`.
2. Delete entries with missing values.
3. Keep only entries with `standard_units == nM`.
4. Delete duplicate molecules.
5. Reset `DataFrame` index.
6. Rename columns.

**1. Convert `standard_value`'s datatype from `object` to `float`.**

The field `standard_value` holds standardized (here IC50) values. In order to make these values usable in calculations later on, we convert them to floats.

In [None]:
bioactivities_df.dtypes

In [None]:
bioactivities_df = bioactivities_df.astype({'standard_value': 'float64'})
bioactivities_df.dtypes

**2. Delete entries with missing values.**

Use the parameter `inplace=True` to drop values in the current `DataFrame` directly.

In [None]:
bioactivities_df.dropna(
    axis=0, 
    how='any', 
    inplace=True
)
print(f'DataFrame shape: {bioactivities_df.shape}')

**3. Keep only entries with `standard_units == nM`.**

We only want to keep bioactivity entries in `nM`, thus we remove all entries with other units.

In [None]:
print(f'Units in downloaded data: {bioactivities_df.standard_units.unique()}')
print(f'Number of non-nM entries: {bioactivities_df[bioactivities_df.standard_units != "nM"].shape[0]}')

In [None]:
bioactivities_df = bioactivities_df[bioactivities_df.standard_units == 'nM']
print(f'Units after filtering: {bioactivities_df.standard_units.unique()}')

In [None]:
print(f'DataFrame shape: {bioactivities_df.shape}')

**4. Delete duplicate molecules.**

Sometimes the same molecule (`molecule_chembl_id`) has been tested more than once. In this case, we only keep the first entry.

In [None]:
bioactivities_df.drop_duplicates(
    'molecule_chembl_id', 
    keep='first', 
    inplace=True
)
print(f'DataFrame shape: {bioactivities_df.shape}')
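
Keeping the first entry is a simple choice, but the result then depends on the order of the rows. As a possible alternative (an assumption of this sketch, not what the talktorial does), repeated measurements could be aggregated, for example by taking the median value per molecule. The toy IDs and values below are made up.

```python
import pandas as pd

toy_measurements_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_TOY_1", "CHEMBL_TOY_1", "CHEMBL_TOY_1", "CHEMBL_TOY_2"],
        "standard_value": [10.0, 20.0, 90.0, 5.0],  # toy IC50 values in nM
    }
)

# Approach used in this talktorial: keep the first entry per molecule
first_kept = toy_measurements_df.drop_duplicates("molecule_chembl_id", keep="first")

# Alternative sketch: one median value per molecule
median_kept = toy_measurements_df.groupby("molecule_chembl_id", as_index=False)[
    "standard_value"
].median()

print(first_kept)
print(median_kept)
```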

**5. Reset `DataFrame` index.**

Since we deleted some rows, but we want to iterate over the index later, we reset the index to be continuous.

In [None]:
bioactivities_df.reset_index(drop=True, inplace=True) 
bioactivities_df.head()

**6. Rename columns.**

In [None]:
bioactivities_df.rename(
    columns={
        'standard_value': 'IC50', 
        'standard_units': 'units'
    },
    inplace=True
)
bioactivities_df.head()
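
The six steps above can also be written as a single method chain that returns a new `DataFrame` instead of modifying one in place. The sketch below runs on a made-up toy table; it uses `pd.to_numeric(..., errors="coerce")` for step 1, which turns unparsable entries into missing values instead of raising an error. This is a design alternative, not the approach used in the cells above.

```python
import pandas as pd

# Toy table (IDs and values are made up)
toy_bioactivities_df = pd.DataFrame(
    {
        "molecule_chembl_id": [
            "CHEMBL_TOY_1", "CHEMBL_TOY_1", "CHEMBL_TOY_2", "CHEMBL_TOY_3", "CHEMBL_TOY_4"
        ],
        "standard_value": ["41.0", "50.0", None, "0.3", "170.0"],
        "standard_units": ["nM", "nM", "nM", "ug.mL-1", "nM"],
    }
)

cleaned_df = (
    toy_bioactivities_df
    # 1. Convert standard_value to float (unparsable entries become NaN)
    .assign(standard_value=lambda df: pd.to_numeric(df["standard_value"], errors="coerce"))
    # 2. Delete entries with missing values
    .dropna(axis=0, how="any")
    # 3. Keep only entries in nM
    .query("standard_units == 'nM'")
    # 4. Delete duplicate molecules
    .drop_duplicates("molecule_chembl_id", keep="first")
    # 5. Reset the index
    .reset_index(drop=True)
    # 6. Rename columns
    .rename(columns={"standard_value": "IC50", "standard_units": "units"})
)

print(cleaned_df)
```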

### Compound data

We have a `DataFrame` containing all molecules tested against EGFR (with the respective measured bioactivity). 

Now, we want to get the molecular structures of the compounds that are linked to the respective bioactivity entries via their ChEMBL IDs.

#### Fetch compound data from ChEMBL

Let's have a look at the compounds from ChEMBL which we have defined bioactivity data for: We fetch compound ChEMBL IDs and structures for the compounds linked to our filtered bioactivity data.

In [None]:
compounds = compounds_api.filter(
    molecule_chembl_id__in=list(bioactivities_df['molecule_chembl_id'])
).only(
    'molecule_chembl_id',
    'molecule_structures'
)

#### Download compound data from ChEMBL

Again, we download the `QuerySet` in the form of a `pandas` `DataFrame`. **This may take some time.**

In [None]:
compounds_df = pd.DataFrame.from_records(compounds)
print(f'DataFrame shape: {compounds_df.shape}')

In [None]:
compounds_df.head()

#### Preprocess and filter compound data

1. Remove entries with missing molecule structure.
2. Delete duplicate molecules.
3. Get molecules with canonical SMILES.

**1. Remove entries with missing molecule structure.**

In [None]:
compounds_df.dropna(
    axis=0, 
    how='any', 
    inplace=True
)
print(f'DataFrame shape: {compounds_df.shape}')

**2. Delete duplicate molecules.**

In [None]:
compounds_df.drop_duplicates(
    'molecule_chembl_id', 
    keep='first', 
    inplace=True
)
print(f'DataFrame shape: {compounds_df.shape}')

**3. Get molecules with canonical SMILES.**

So far, we have multiple different molecular structure representations. We only want to keep the canonical SMILES.

In [None]:
compounds_df.iloc[0].molecule_structures.keys()

In [None]:
canonical_smiles = []

# Use a distinct loop variable so that the `compounds` QuerySet is not overwritten
for _, compound in compounds_df.iterrows():
    try:
        canonical_smiles.append(compound['molecule_structures']['canonical_smiles'])
    except KeyError:
        canonical_smiles.append(None)

compounds_df['smiles'] = canonical_smiles
compounds_df.drop('molecule_structures', axis=1, inplace=True)

Remove all molecules without a canonical SMILES string.

In [None]:
compounds_df.dropna(
    axis=0, 
    how='any', 
    inplace=True
)
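
The extraction of the SMILES strings can also be written without an explicit loop. The following sketch uses a made-up toy table and a hypothetical helper `get_canonical_smiles`; it additionally tolerates entries where no structure dictionary is present at all.

```python
import pandas as pd

# Toy table (IDs and structure entries are made up)
toy_compounds_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_TOY_1", "CHEMBL_TOY_2", "CHEMBL_TOY_3"],
        "molecule_structures": [
            {"canonical_smiles": "CCO", "standard_inchi_key": "TOY-KEY-1"},
            {"standard_inchi_key": "TOY-KEY-2"},  # no SMILES entry
            None,  # no structure information at all
        ],
    }
)


def get_canonical_smiles(structures):
    """Return the canonical SMILES if available, else None (hypothetical helper)."""
    if isinstance(structures, dict):
        return structures.get("canonical_smiles")
    return None


toy_compounds_df["smiles"] = toy_compounds_df["molecule_structures"].apply(get_canonical_smiles)

# Drop the structure dictionaries and all molecules without a SMILES string
toy_compounds_df = toy_compounds_df.drop("molecule_structures", axis=1).dropna(subset=["smiles"])

print(toy_compounds_df)
```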

### Output (bioactivity-compound) data

#### Summary of compound and bioactivity data

In [None]:
print(f'Bioactivities filtered: {bioactivities_df.shape[0]}')
bioactivities_df.columns

In [None]:
print(f'Compounds filtered: {compounds_df.shape[0]}')
compounds_df.columns

#### Merge both datasets

Merge values of interest from `bioactivities_df` and `compounds_df` into `output_df` based on the compounds' ChEMBL IDs, keeping the following columns:
* ChEMBL IDs: `molecule_chembl_id`
* SMILES: `smiles`
* units: `units`
* IC50: `IC50`

In [None]:
# Merge DataFrames
output_df = pd.merge(
    bioactivities_df[['molecule_chembl_id', 'IC50', 'units']], 
    compounds_df, 
    on='molecule_chembl_id'
)

# Reset row indices
output_df.reset_index(drop=True, inplace=True)

print(f'Dataset with {output_df.shape[0]} entries.')
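
By default, `pd.merge` performs an inner join, so molecules that appear in only one of the two tables are silently dropped. The toy sketch below (made-up IDs and values) shows how this could be made explicit: `validate="one_to_one"` raises an error if an ID occurs more than once on either side, and an outer join with `indicator=True` lists the entries that would be lost.

```python
import pandas as pd

# Toy tables (IDs and values are made up)
toy_bioactivities_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_TOY_1", "CHEMBL_TOY_2", "CHEMBL_TOY_3"],
        "IC50": [41.0, 170.0, 9300.0],
        "units": ["nM", "nM", "nM"],
    }
)
toy_compounds_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_TOY_1", "CHEMBL_TOY_3", "CHEMBL_TOY_4"],
        "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    }
)

# Inner join: only molecules present in both tables are kept
toy_output_df = pd.merge(
    toy_bioactivities_df,
    toy_compounds_df,
    on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Outer join with indicator: shows which entries have no partner
audit_df = pd.merge(
    toy_bioactivities_df,
    toy_compounds_df,
    on="molecule_chembl_id",
    how="outer",
    indicator=True,
)
unmatched = audit_df[audit_df["_merge"] != "both"]

print(f"Merged entries: {toy_output_df.shape[0]}, unmatched entries: {unmatched.shape[0]}")
```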

In [None]:
output_df.dtypes

In [None]:
output_df.head(10)

#### Add pIC50 values

As you can see, the IC50 values are difficult to compare because they span several orders of magnitude, which is why we convert the IC50 values to pIC50.

In [None]:
def convert_ic50_to_pic50(IC50_value):
    # IC50 is given in nM: pIC50 = -log10(IC50 * 1e-9) = 9 - log10(IC50)
    pIC50_value = 9 - math.log10(IC50_value)
    return pIC50_value

In [None]:
# Apply conversion to each row of the output DataFrame
output_df['pIC50'] = output_df.apply(lambda x: convert_ic50_to_pic50(x.IC50), axis=1)
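
For larger tables, the same conversion can be applied to the whole column at once. The sketch below uses NumPy on a toy column of made-up IC50 values in nM; it is an alternative to the row-wise `apply`, not the approach used above. One difference to be aware of: `math.log10` raises an error for an IC50 of 0, whereas `np.log10` returns `-inf` with a warning, so such values are best filtered out beforehand.

```python
import numpy as np
import pandas as pd

toy_ic50_df = pd.DataFrame({"IC50": [0.5, 41.0, 1000.0]})  # toy IC50 values in nM

# Vectorized conversion: pIC50 = 9 - log10(IC50 in nM)
toy_ic50_df["pIC50"] = 9 - np.log10(toy_ic50_df["IC50"])

print(toy_ic50_df)
```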

In [None]:
output_df.head()

#### Draw compound data

Let's have a look at our collected data set.

In the next steps, we add a column for RDKit molecule objects to our `DataFrame` and look at the structures of the molecules with the highest pIC50 values. 

In [None]:
# Add molecule column
PandasTools.AddMoleculeColumnToFrame(output_df, smilesCol='smiles')

In [None]:
# Sort molecules by pIC50
output_df.sort_values(
    by="pIC50", 
    ascending=False, 
    inplace=True
)

# Reset index
output_df.reset_index(
    drop=True, 
    inplace=True
)

Show the most active molecules, i.e. molecules with the highest pIC50 values.

In [None]:
output_df.drop("smiles", axis=1).head()

#### Write output data to file

We want to use this bioactivity-compound dataset in the following talktorials, thus we save the data as a CSV file.
Note that it is advisable to drop the molecule column (which contains RDKit molecule objects that are only rendered as images in the notebook) when saving the data.

In [None]:
output_df.drop("ROMol", axis=1).to_csv("../data/T1/EGFR_compounds.csv")

In [None]:
print(f'DataFrame shape: {output_df.shape}')
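
When the file is read again later, it helps to know that `to_csv` writes the `DataFrame` index as the first, unnamed column by default. The sketch below shows a round trip on a made-up toy table in a temporary directory; passing `index_col=0` to `pd.read_csv` restores the index instead of creating an extra column.

```python
import os
import tempfile

import pandas as pd

# Toy table (IDs and values are made up)
toy_output_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_TOY_1", "CHEMBL_TOY_2"],
        "IC50": [1.0, 1000.0],
        "units": ["nM", "nM"],
        "smiles": ["CCO", "c1ccccc1"],
        "pIC50": [9.0, 6.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_compounds.csv")

    # The index is written as the first, unnamed column
    toy_output_df.to_csv(csv_path)

    # Use the first column as index when reading the file again
    reloaded_df = pd.read_csv(csv_path, index_col=0)

print(reloaded_df.shape)
```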

## Discussion

In this tutorial, we collected bioactivity data for our target of interest from the ChEMBL database.
We filtered the data set so that it only contains molecules with measured IC50 values, which we then converted to pIC50 values.

Be aware that ChEMBL data originates from various sources. Compound data has been generated in different labs by different people all over the world. Therefore, we have to be cautious with the predictions we make using this data set. It is always important to consider the source of the data and consistency of data production assays when interpreting the results and determining how much confidence we have in our predictions.

In the next tutorials, we will filter our acquired data by Lipinski's rule of five and by unwanted substructures. Another important step would be to clean the data and remove duplicates. As this is not shown in any of our talktorials (yet), we would like to refer to the  [`standardiser` library](https://github.com/flatkinson/standardiser) or [MolVS](https://molvs.readthedocs.io/en/latest/) as useful tools for this task.

## Quiz

* In this talktorial, we have downloaded molecules and bioactivity data from ChEMBL. What else is the ChEMBL database useful for?
* What is the difference between IC50 and EC50?
* What can we use the data extracted from ChEMBL for?