<a href="https://colab.research.google.com/github/sofia-sunny/Introductory_Tutorials/blob/main/Extracting__ChEMBL_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Extracting Data**
In this tutorial, we will extract data from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. ChEMBL is a large, open-access bioactivity database that contains curated information on the bioactivity of drug-like small molecules, their chemical structures, biological targets (such as proteins and enzymes), and the results of pharmacological assays. ChEMBL is widely used in cheminformatics and drug discovery research for QSAR modeling, virtual screening, target prediction, and more.

### **Accessing ChEMBL via the Web Resource Client**
To access data from the ChEMBL database programmatically, we use **chembl_webresource_client**, a Python interface to the ChEMBL RESTful web services. The client wraps the underlying API calls, so users can retrieve structured data on chemical compounds, biological targets, assays, and bioactivities directly from Python.


In [None]:
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.1.1-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.1.1-py3-none-any.whl (69 kB)

The **new_client** object is the main entry point for accessing various types of data from the ChEMBL database, including:

* **new_client.molecule** – for retrieving information about chemical structures (e.g., SMILES, InChI, molecular properties).

* **new_client.target** – for accessing biological targets such as proteins, enzymes, and receptors.

* **new_client.activity** – for querying experimental bioactivity data, such as IC₅₀, EC₅₀, and Ki values.

* **new_client.assay** – for retrieving assay descriptions and metadata.

In [None]:
from chembl_webresource_client.new_client import new_client
import pandas as pd
import numpy as np

To begin retrieving bioactivity data, we need to identify the **specific biological target** of interest within the ChEMBL database. In this case, we are interested in **EGFR** (Epidermal Growth Factor Receptor), a protein commonly studied in cancer research.

The call **new_client.target.search("EGFR")** performs a keyword-based search in ChEMBL’s target database and returns all entries that match the term “**EGFR**”. Each entry contains relevant metadata, including the ChEMBL Target ID, organism, target type, and preferred name.

In [None]:
targets = new_client.target.search("EGFR")
targets_df = pd.DataFrame(targets)
targets_df.head()

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | `[]` | Mus musculus | Epidermal growth factor receptor erbB1 | 16.0 | False | CHEMBL3608 | `[{'accession': 'Q01279', 'component_descriptio...` | SINGLE PROTEIN | 10090 |
| 1 | `[]` | Homo sapiens | EGFR/PPP1CA | 16.0 | False | CHEMBL4523747 | `[{'accession': 'P00533', 'component_descriptio...` | PROTEIN-PROTEIN INTERACTION | 9606 |
| 2 | `[]` | Homo sapiens | VHL/EGFR | 16.0 | False | CHEMBL4523998 | `[{'accession': 'P00533', 'component_descriptio...` | PROTEIN-PROTEIN INTERACTION | 9606 |
| 3 | `[]` | Homo sapiens | CCN2-EGFR | 16.0 | False | CHEMBL5465557 | `[{'accession': 'P00533', 'component_descriptio...` | PROTEIN-PROTEIN INTERACTION | 9606 |
| 4 | `[]` | Homo sapiens | Epidermal growth factor receptor erbB1 | 12.0 | False | CHEMBL203 | `[{'accession': 'P00533', 'component_descriptio...` | SINGLE PROTEIN | 9606 |


Searching for “EGFR” in the ChEMBL database returns multiple targets across different species and target types. From these, we select **CHEMBL203** as an example, which is the single-protein entry for EGFR in Homo sapiens.
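The selection above was made by reading the table, but the same choice can be expressed in code. The sketch below is a minimal illustration that uses plain dictionaries copied from the search output (only the columns needed here); the helper name `select_targets` is hypothetical and not part of the ChEMBL client.

```python
# Rows copied from the search output above, reduced to the relevant columns.
search_results = [
    {"organism": "Mus musculus", "pref_name": "Epidermal growth factor receptor erbB1",
     "target_chembl_id": "CHEMBL3608", "target_type": "SINGLE PROTEIN"},
    {"organism": "Homo sapiens", "pref_name": "EGFR/PPP1CA",
     "target_chembl_id": "CHEMBL4523747", "target_type": "PROTEIN-PROTEIN INTERACTION"},
    {"organism": "Homo sapiens", "pref_name": "VHL/EGFR",
     "target_chembl_id": "CHEMBL4523998", "target_type": "PROTEIN-PROTEIN INTERACTION"},
    {"organism": "Homo sapiens", "pref_name": "CCN2-EGFR",
     "target_chembl_id": "CHEMBL5465557", "target_type": "PROTEIN-PROTEIN INTERACTION"},
    {"organism": "Homo sapiens", "pref_name": "Epidermal growth factor receptor erbB1",
     "target_chembl_id": "CHEMBL203", "target_type": "SINGLE PROTEIN"},
]


def select_targets(rows, organism, target_type):
    """Return the ChEMBL IDs of rows matching both organism and target type."""
    return [
        row["target_chembl_id"]
        for row in rows
        if row["organism"] == organism and row["target_type"] == target_type
    ]


human_single_protein = select_targets(search_results, "Homo sapiens", "SINGLE PROTEIN")
print(human_single_protein)  # ['CHEMBL203']
```

Filtering on both organism and target type matters here because the highest-scoring hits are the mouse protein and several protein-protein interaction entries, not the human single protein.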

**new_client.activity** gives access to the activity endpoint of the ChEMBL web services. This resource contains curated experimental results describing how compounds interact with biological targets, including key assay measurements such as IC50, Ki, and EC50.

The query below keeps only **IC50 values** (the measurement type we are interested in) reported in nanomolar (nM) units, and requests three fields for each record:

* **molecule_chembl_id:** The ChEMBL identifier of the tested compound

* **canonical_smiles:** The compound's structure as a SMILES string

* **standard_value:** The numeric result for IC50 measurement


In [None]:
target_id = 'CHEMBL203'
activities = new_client.activity.filter(
    target_chembl_id=target_id,
    standard_type="IC50",
    standard_units="nM"
).only([
    'molecule_chembl_id', 'canonical_smiles', 'standard_value'
])

**Next step:**

Each record in **activities** is checked for the three required fields: standard_value, canonical_smiles, and molecule_chembl_id.

The **standard_value**, initially returned as a string, is converted to a floating-point number to enable the calculation of pIC50.
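One subtlety of this check is that a field can be present in a record while its value is empty. The following is a minimal sketch of a stricter validation step, under the assumption that missing values arrive as `None` or an empty string; the function `parse_record` and the sample records are hypothetical and are not real ChEMBL data.

```python
REQUIRED_FIELDS = ("standard_value", "canonical_smiles", "molecule_chembl_id")


def parse_record(entry):
    """Return (chembl_id, smiles, value_nM), or None if the record is unusable."""
    # A key can exist while holding None or "", so test the values, not the keys.
    if any(entry.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return None
    try:
        value = float(entry["standard_value"])  # arrives as a string, e.g. "41.0"
    except (TypeError, ValueError):
        return None
    return entry["molecule_chembl_id"], entry["canonical_smiles"], value


# Hypothetical records for illustration only.
records = [
    {"molecule_chembl_id": "MOL_A", "canonical_smiles": "CCO", "standard_value": "41.0"},
    {"molecule_chembl_id": "MOL_B", "canonical_smiles": None, "standard_value": "12.5"},
    {"molecule_chembl_id": "MOL_C", "canonical_smiles": "CCN", "standard_value": "n/a"},
]
parsed = [parse_record(record) for record in records]
print(parsed)  # [('MOL_A', 'CCO', 41.0), None, None]
```

The second record is rejected because its SMILES is empty, and the third because its value cannot be converted to a number.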

### **Let's see what we have in activities**

The following code processes up to 1000 activity records, keeping only entries that contain the required fields (standard_value, canonical_smiles, and molecule_chembl_id). For each valid record, it converts the standard value (in nM) to pIC50, the negative base-10 logarithm of the IC50 expressed in molar units. The conversion is applied only when the value is positive, because the logarithm of zero or a negative number is undefined. Exact duplicate rows are then dropped before the DataFrame is displayed.
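To make the conversion concrete, here is a small standalone sketch of the nM to pIC50 step using only the standard library. The function name `nm_to_pic50` is an illustrative choice, not something provided by ChEMBL or pandas.

```python
import math


def nm_to_pic50(value_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if value_nm <= 0:
        raise ValueError("IC50 must be positive to take its logarithm")
    return -math.log10(value_nm * 1e-9)  # 1 nM = 1e-9 mol/L


for ic50_nm in (1, 1000, 10000):
    print(ic50_nm, round(nm_to_pic50(ic50_nm), 2))
# 1 9.0
# 1000 6.0
# 10000 5.0
```

A lower IC50 means a more potent compound, so it maps to a higher pIC50: each tenfold drop in IC50 raises pIC50 by one unit.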

In [None]:
batch = []
max_records = 1000  # To speed up processing

for index, entry in enumerate(activities):
    if index >= max_records:
        break
    # Check that all required fields are present
    if all(key in entry for key in ['standard_value', 'canonical_smiles', 'molecule_chembl_id']):

        try:
            val = float(entry['standard_value'])
            if val > 0:
                batch.append({
                    'ChEMBL_ID': entry['molecule_chembl_id'],
                    'SMILES': entry['canonical_smiles'],
                    'pIC50': -np.log10(val * 1e-9)
                })
        except (TypeError, ValueError):
            # Skip records whose standard_value is missing (None) or not numeric
            continue

# Create DataFrame
df = pd.DataFrame(batch).drop_duplicates().reset_index(drop=True)
df


| | ChEMBL_ID | SMILES | pIC50 |
|---|---|---|---|
| 0 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 7.387216 |
| 1 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 6.522879 |
| 2 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 5.106793 |
| 3 | CHEMBL69960 | `Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...` | 6.769551 |
| 4 | CHEMBL69960 | `Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...` | 7.397940 |
| ... | ... | ... | ... |
| 956 | CHEMBL168921 | `CC(C)(C)OC(=O)N1CCC(n2cc(-c3cccc(O)c3)c3c(N)nc...` | 5.657577 |
| 957 | CHEMBL168921 | `CC(C)(C)OC(=O)N1CCC(n2cc(-c3cccc(O)c3)c3c(N)nc...` | 5.000000 |
| 958 | CHEMBL169065 | `COC(=O)CN1CCC(n2cc(-c3cccc(OC)c3)c3c(N)ncnc32)CC1` | 5.832683 |
| 959 | CHEMBL10 | `C[S+]([O-])c1ccc(-c2nc(-c3ccc(F)cc3)c(-c3ccncc...` | 4.017729 |
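Note that `drop_duplicates()` in the code above removes only rows that are identical in every column, so a compound measured several times (such as CHEMBL68920) still appears once per distinct pIC50 value. If one value per compound is needed, the repeated measurements have to be combined. The sketch below takes the median, which is one possible choice and an assumption of this example rather than a step in the original workflow. It uses the first five rows of the output above.

```python
from collections import defaultdict
from statistics import median

# (ChEMBL_ID, pIC50) pairs copied from the first rows of the output above.
rows = [
    ("CHEMBL68920", 7.387216),
    ("CHEMBL68920", 6.522879),
    ("CHEMBL68920", 5.106793),
    ("CHEMBL69960", 6.769551),
    ("CHEMBL69960", 7.397940),
]

# Group all pIC50 values by compound.
values_by_compound = defaultdict(list)
for chembl_id, pic50 in rows:
    values_by_compound[chembl_id].append(pic50)

# Collapse each group to a single representative value.
median_pic50 = {cid: median(vals) for cid, vals in values_by_compound.items()}
print(median_pic50["CHEMBL68920"])  # 6.522879
```

The median is less sensitive than the mean to a single outlying assay result, which is why it is used here; the mean or the most potent value would be equally easy to substitute.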


Save the data to a CSV file:

In [None]:
df.to_csv('ChemBL_data.csv', index=False)

ChEMBL provides a variety of bioactivity measurement types that capture how compounds interact with biological targets under different experimental conditions. While **IC50** is commonly used to describe how well a compound inhibits a target, other types such as **Ki** (inhibition constant), **Kd** (dissociation constant), **EC50** (half-maximal effective concentration), and **MIC** (minimum inhibitory concentration) are also frequently reported.

In this tutorial, we have focused on IC50 values, but students are encouraged to explore other measurement types by modifying the **standard_type** parameter in the activity query. This allows for flexibility in analyzing different types of biological responses depending on the target and assay type of interest.
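As a starting point for that exploration, the sketch below collects the query parameters in one place so that the measurement type can be swapped without rewriting the query. The helper `build_activity_filter` is hypothetical; it only builds a dictionary, and the actual call to the client is shown as a comment because it needs network access.

```python
def build_activity_filter(target_id, standard_type="IC50", standard_units="nM"):
    """Assemble keyword arguments for new_client.activity.filter(...)."""
    return {
        "target_chembl_id": target_id,
        "standard_type": standard_type,
        "standard_units": standard_units,
    }


ki_filter = build_activity_filter("CHEMBL203", standard_type="Ki")
print(ki_filter)

# With new_client imported as earlier in this tutorial, the query would be:
# activities = new_client.activity.filter(**ki_filter).only(
#     ['molecule_chembl_id', 'canonical_smiles', 'standard_value'])
```

The units are kept as a parameter because not every measurement type is reported in nM; MIC values, for example, may be recorded in other units, so `standard_units` may need to change along with `standard_type`.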