# **Computational Drug Discovery [Part 1]: Download + Preprocess Bioactivity Data**

Based on tutorial by Chanin Nantasenamat, [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a machine learning model using the ChEMBL bioactivity data.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

Example: go to the ChEMBL database and search for 'coronavirus' under 'Target'. The results list the proteins or organisms that a drug could act on; drug discovery then looks for compounds that modulate (activate or inhibit) that target.
The results include 'SARS coronavirus 3C-like proteinase' and 'Replicase polyprotein 1ab', each classified as a SINGLE PROTEIN. This notebook applies the same kind of search to acetylcholinesterase.

## **Installing libraries**


Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [10]:
! pip install chembl_webresource_client






## **Importing libraries**

In [11]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client
# new_client is the entry point for querying the ChEMBL web services;
# it exposes resources such as 'target', 'molecule', and 'activity'.
# More details:
#   https://deepwiki.com/chembl/chembl_webresource_client/2.1-new_client-interface
#   https://hub.2i2c.mybinder.org/user/chembl-chembl_webresource_client-njuf4mhz/notebooks/demo_wrc.ipynb


## **Search for Target protein**

### **Target search for acetylcholinesterase**

In [12]:
# Target search 
target = new_client.target
target_query = target.search('acetylcholinesterase') 
targets = pd.DataFrame.from_dict(target_query)  # convert the records returned by the query into a DataFrame
targets  # display

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
1,[],Homo sapiens,Acetylcholinesterase,16.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Torpedo californica,Acetylcholinesterase,16.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
3,[],Mus musculus,Acetylcholinesterase,16.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
4,[],Rattus norvegicus,Acetylcholinesterase,16.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
5,[],Electrophorus electricus,Acetylcholinesterase,16.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
6,[],Bos taurus,Acetylcholinesterase,16.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913
7,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
8,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
9,[],Nephotettix cincticeps,Ace-orthologous acetylcholinesterase,16.0,False,CHEMBL2366514,"[{'accession': 'Q9NJH6', 'component_descriptio...",SINGLE PROTEIN,94400


### **Select and retrieve bioactivity data for Desired Target Protein**

We will assign the second entry (index 1), *Homo sapiens Acetylcholinesterase [CHEMBL220]*, to the selected_target variable.

In [13]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL220'
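
Selecting by position depends on the order in which the search returns its hits, and that order may differ between runs or database releases. As a minimal sketch, the same target can be picked by its fields instead. The frame below copies three rows from the search output above, and `pick_target` is a hypothetical helper written for this illustration, not part of the ChEMBL client.

```python
import pandas as pd

# Three rows copied from the search output above, reduced to the columns used here.
targets_demo = pd.DataFrame({
    "organism": ["Drosophila melanogaster", "Homo sapiens", "Mus musculus"],
    "pref_name": ["Acetylcholinesterase"] * 3,
    "target_type": ["SINGLE PROTEIN"] * 3,
    "target_chembl_id": ["CHEMBL2242744", "CHEMBL220", "CHEMBL3198"],
})

def pick_target(df, organism, pref_name, target_type="SINGLE PROTEIN"):
    """Hypothetical helper: return the single ChEMBL ID matching all three fields."""
    mask = (
        (df["organism"] == organism)
        & (df["pref_name"] == pref_name)
        & (df["target_type"] == target_type)
    )
    ids = df.loc[mask, "target_chembl_id"].tolist()
    if len(ids) != 1:
        # Fail loudly rather than silently picking an ambiguous target.
        raise ValueError(f"expected exactly one match, found {len(ids)}")
    return ids[0]

selected_demo = pick_target(targets_demo, "Homo sapiens", "Acetylcholinesterase")
print(selected_demo)
```

Raising an error on zero or multiple matches is a design choice for this sketch: it makes an unexpected search result visible before any bioactivity data is downloaded.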

Here, we will retrieve bioactivity data for *selected_target* only, keeping the records reported as IC50 values (the concentration of a compound that inhibits the measured activity by 50%) in nM (nanomolar) units.

**standard_type** = the kind of activity measurement, for example IC50, EC50, or percent activity
**standard_value** = potency of the drug compound; a lower number means a smaller dose is required for the same pharmacological effect
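
In the retrieved table, the reported **value** column is in uM (for example 0.75), while **standard_value** for the same record is 750.0, which is consistent with a conversion to nM. The arithmetic can be sketched as follows; `to_nanomolar` and the `TO_NM` table are illustrative names, not ChEMBL's own standardization code.

```python
# Multipliers that convert a concentration in the given unit to nanomolar (nM).
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}

def to_nanomolar(value, unit):
    """Illustrative helper: convert a molar concentration to nM."""
    return value * TO_NM[unit]

# The first three reported values (in uM) from the table in this notebook.
converted = [to_nanomolar(v, "uM") for v in (0.75, 0.1, 50.0)]
print(converted)
```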

In [14]:
activity = new_client.activity # new query based on activity (as opposed to 'target')
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [15]:
df = pd.DataFrame.from_dict(res)

In [16]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9410,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724873,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,46.0
9411,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724874,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,38.31
9412,"{'action_type': 'INHIBITOR', 'description': 'N...",,25733694,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,1.71
9413,,,25733695,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,10.0


In [17]:
df.standard_type.unique() # only IC50 is present because of the filter; other types include EC50 and percent activity

array(['IC50'], dtype=object)
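
The same kind of check can be applied to the units, since the filter restricts only the measurement type. The sketch below uses toy records and assumes the units are stored in a column named `standard_units`; that name is an assumption here, because the displayed table elides its middle columns.

```python
import pandas as pd

# Toy records; the `standard_units` column name is an assumption, since the
# displayed table elides the middle columns.
demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_value": ["750.0", "12.5", "40.0"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
})

# Count how many records use each unit.
print(demo["standard_units"].value_counts())

# Keep only the rows whose unit is nanomolar.
nm_only = demo[demo["standard_units"] == "nM"].reset_index(drop=True)
print(nm_only)
```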

Next, we will save the resulting bioactivity data to a CSV file **bioactivity_data.csv**.

In [18]:
df.to_csv('bioactivity_data.csv', index=False) # don't write the index numbers to the CSV

Let's see the CSV files that we have so far.

In [19]:
ls

 Volume in drive C is Local Disk
 Volume Serial Number is 9674-1826

 Directory of c:\Users\liv_u\Desktop\GitHub\DrugDiscovery\Acetylcholinesterase_tutorial

2025-08-29  18:32    <DIR>          .
2025-08-29  18:16    <DIR>          ..
2025-08-29  18:32         5,502,573 bioactivity_data.csv
2025-08-29  18:28           507,983 CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
2025-08-29  17:44           394,972 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
2025-08-29  17:44           213,374 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
               4 File(s)      6,618,902 bytes
               2 Dir(s)   2,447,994,880 bytes free


Taking a glimpse of the **bioactivity_data.csv** file that we've just created.

In [20]:
pd.read_csv('bioactivity_data.csv').head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
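
The `index=False` argument used when saving affects what comes back when the file is read. A small sketch of the difference, using an in-memory buffer instead of a file and two rows of placeholder data:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398"],
    "standard_value": [750.0, 100.0],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

When the index is written, `read_csv` brings it back as an extra column with an automatically generated name, which then has to be dropped by hand.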


## **Handling missing data**
If a compound has a missing value in the **standard_value** column, drop it.

In [21]:
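df2 = df[df.standard_value.notna()] # keep only rows where standard_value is present
# Dropping rows leaves gaps in the index; reset it so that the class labels
# concatenated later line up with the rows (the old labels go into an 'index' column)
df2 = df2.reset_index()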
df2


Unnamed: 0,index,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8125,9410,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724873,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,46.0
8126,9411,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724874,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,38.31
8127,9412,"{'action_type': 'INHIBITOR', 'description': 'N...",,25733694,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,1.71
8128,9413,,,25733695,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,10.0
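
The extra **index** column in the output above comes from calling `reset_index()` with its defaults, which keeps the old row labels as a column. A minimal sketch on toy data (placeholder IDs) compares this with `reset_index(drop=True)`, which discards them:

```python
import pandas as pd

demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],   # placeholder IDs
    "standard_value": ["750.0", None, "100.0", None],
})

kept = demo[demo["standard_value"].notna()]   # row labels keep their gaps
default_reset = kept.reset_index()            # old labels kept in an 'index' column
clean_reset = kept.reset_index(drop=True)     # old labels discarded

print(list(kept.index))
print(list(default_reset.columns))
print(list(clean_reset.columns))
```

Either form gives a gap-free index; `drop=True` is only needed if the old row labels are not wanted downstream.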


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The **standard_value** column holds IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values in between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [22]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
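
The loop above can also be written in vectorized form. The sketch below swaps the loop for NumPy's `np.select` with the same thresholds and the same order of checks; the input values are toy data, given as strings to mirror the `float()` cast in the loop.

```python
import numpy as np
import pandas as pd

# Toy IC50 values in nM, given as strings to mirror the float() cast in the loop.
values = pd.Series(["750.0", "1000", "1710.0", "10000.0", "50000.0"])
ic50_nm = pd.to_numeric(values)

# Conditions are checked in order, like the if/elif chain.
conditions = [ic50_nm >= 10000, ic50_nm <= 1000]
labels = ["inactive", "active"]
classes = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    name="bioactivity_class",
)
print(classes.tolist())
```

Because the result is already a named Series, it could be concatenated without a separate rename step.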

In [23]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0
...,...,...,...
8125,CHEMBL5398421,COc1cc(O)c2c(c1)C(=O)c1cc(O)c(O)cc1CCN2,46000.0
8126,CHEMBL11298,N[C@@H](CO)C(=O)O,38310.0
8127,CHEMBL5395312,CN1CCN(c2ccc(C(=O)Nc3cc(-c4nc5ccccc5[nH]4)n[nH...,1710.0
8128,CHEMBL5399112,O=C(Nc1cc(-c2nc3ccccc3[nH]2)n[nH]1)c1ccc(N2CCN...,10000.0


In [24]:
df3 = pd.concat([df3,pd.Series(bioactivity_class)], axis=1) # bioactivity_class is a plain list, so wrap it in a Series before concatenating
list(df3)
df3 = df3.rename(columns={0:"bioactivity_class"})
list(df3)

['molecule_chembl_id',
 'canonical_smiles',
 'standard_value',
 'bioactivity_class']
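
Column-wise `pd.concat` matches rows by index label, not by position, which is why the index was reset after dropping rows. A small sketch on toy data shows what happens when the reset is skipped:

```python
import pandas as pd

df_demo = pd.DataFrame({"standard_value": [750.0, None, 50000.0]})
kept = df_demo[df_demo.standard_value.notna()]   # row labels are 0 and 2

# A fresh Series gets labels 0 and 1, regardless of the frame's labels.
labels = pd.Series(["active", "inactive"], name="bioactivity_class")

# Without a reset, label 1 and label 2 have no partner and are padded with NaN.
misaligned = pd.concat([kept, labels], axis=1)

# With a reset, both objects share labels 0 and 1 and line up row by row.
aligned = pd.concat([kept.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```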

Save the dataframe to a CSV file.

In [25]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0,active
...,...,...,...,...
8125,CHEMBL5398421,COc1cc(O)c2c(c1)C(=O)c1cc(O)c(O)cc1CCN2,46000.0,inactive
8126,CHEMBL11298,N[C@@H](CO)C(=O)O,38310.0,inactive
8127,CHEMBL5395312,CN1CCN(c2ccc(C(=O)Nc3cc(-c4nc5ccccc5[nH]4)n[nH...,1710.0,intermediate
8128,CHEMBL5399112,O=C(Nc1cc(-c2nc3ccccc3[nH]2)n[nH]1)c1ccc(N2CCN...,10000.0,inactive


In [26]:
ls 

 Volume in drive C is Local Disk
 Volume Serial Number is 9674-1826

 Directory of c:\Users\liv_u\Desktop\GitHub\DrugDiscovery\Acetylcholinesterase_tutorial

2025-08-29  18:32    <DIR>          .
2025-08-29  18:16    <DIR>          ..
2025-08-29  18:32         5,502,573 bioactivity_data.csv
2025-08-29  18:32           660,389 bioactivity_preprocessed_data.csv
2025-08-29  18:28           507,983 CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
2025-08-29  17:44           394,972 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
2025-08-29  17:44           213,374 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
               5 File(s)      7,279,291 bytes
               2 Dir(s)   2,447,327,232 bytes free


---