<a href="https://colab.research.google.com/github/smruti1571/Python/blob/main/CDD_ML_Part_1_Bioactivity_Data_Concised_snri_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Smruti Rekha Behera


In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **Importing libraries**

In [23]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for SNRI**

In [57]:
# Search ChEMBL for targets matching the keyword 'serotonin'
target = new_client.target
target_query = target.search('serotonin')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q29495', 'xref_name': None, 'xre...",Ovis aries,Serotonin N-acetyltransferase,13.0,False,CHEMBL5452,"[{'accession': 'Q29495', 'component_descriptio...",SINGLE PROTEIN,9940
1,"[{'xref_id': 'Q64666', 'xref_name': None, 'xre...",Rattus norvegicus,Serotonin N-acetyltransferase,13.0,False,CHEMBL1075242,"[{'accession': 'Q64666', 'component_descriptio...",SINGLE PROTEIN,10116
2,"[{'xref_id': 'P31645', 'xref_name': None, 'xre...",Homo sapiens,Serotonin transporter,12.0,False,CHEMBL228,"[{'accession': 'P31645', 'component_descriptio...",SINGLE PROTEIN,9606
3,"[{'xref_id': 'P31652', 'xref_name': None, 'xre...",Rattus norvegicus,Serotonin transporter,12.0,False,CHEMBL313,"[{'accession': 'P31652', 'component_descriptio...",SINGLE PROTEIN,10116
4,"[{'xref_id': 'NBK23655', 'xref_name': 'Seroton...",Mus musculus,Serotonin transporter,12.0,False,CHEMBL4642,"[{'accession': 'Q60857', 'component_descriptio...",SINGLE PROTEIN,10090
...,...,...,...,...,...,...,...,...,...
90,[],Mus musculus,Serotonin 2c (5-HT2c) receptor,7.0,False,CHEMBL3006,"[{'accession': 'P34968', 'component_descriptio...",SINGLE PROTEIN,10090
91,"[{'xref_id': 'P08909', 'xref_name': None, 'xre...",Rattus norvegicus,Serotonin 2c (5-HT2c) receptor,7.0,False,CHEMBL324,"[{'accession': 'P08909', 'component_descriptio...",SINGLE PROTEIN,10116
92,[],Homo sapiens,Dopamine D2 receptor and serotonin 1a receptor,7.0,False,CHEMBL2111460,"[{'accession': 'P14416', 'component_descriptio...",SELECTIVITY GROUP,9606
93,[],Homo sapiens,Monoamine transporter,7.0,False,CHEMBL2363064,"[{'accession': 'P31645', 'component_descriptio...",PROTEIN FAMILY,9606


### **Select and retrieve bioactivity data for the *Human Serotonin transporter* (third entry)**

We will assign the third entry (index 2, which corresponds to the target protein, the *Human Serotonin transporter*) to the ***selected_target*** variable.

In [58]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL228'
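
Selecting by position depends on the order of the search results, which may differ between ChEMBL releases. A minimal sketch of selecting by attribute instead is shown here. It assumes a small hand-built stand-in for the `targets` table (rows copied from the output above), and `find_target` is a hypothetical helper name, not part of the ChEMBL client.

```python
import pandas as pd

# Stand-in for the search results; values copied from the table above.
targets_demo = pd.DataFrame({
    "organism": ["Ovis aries", "Rattus norvegicus", "Homo sapiens", "Rattus norvegicus"],
    "pref_name": ["Serotonin N-acetyltransferase", "Serotonin N-acetyltransferase",
                  "Serotonin transporter", "Serotonin transporter"],
    "target_type": ["SINGLE PROTEIN"] * 4,
    "target_chembl_id": ["CHEMBL5452", "CHEMBL1075242", "CHEMBL228", "CHEMBL313"],
})

def find_target(df, organism, pref_name):
    """Return the one ChEMBL ID matching organism and preferred name (hypothetical helper)."""
    mask = (df["organism"] == organism) & (df["pref_name"] == pref_name)
    matches = df.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected_target_demo = find_target(targets_demo, "Homo sapiens", "Serotonin transporter")
print(selected_target_demo)
```

On the real `targets` DataFrame the same helper could be called in place of the positional lookup, provided the organism and preferred name match exactly one row.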

Here, we will retrieve only the bioactivity data for the *Human Serotonin transporter* (CHEMBL228) that are reported as IC50 values.

In [66]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [67]:
df = pd.DataFrame.from_dict(res)

In [69]:
df.head(10)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,105351,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,5160.0
1,,105826,[],CHEMBL808141,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,1.15
2,,106988,[],CHEMBL873260,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,4.2
3,,108742,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,33.4
4,,111537,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,27.6
5,,112956,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,6.5
6,,118016,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,180.0
7,,122882,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,1960.0
8,,122884,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,2.5
9,,124744,[],CHEMBL808141,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,0.32


Finally, we will save the resulting bioactivity data to a CSV file, **serotonin_01_bioactivity_data_raw.csv**.

In [71]:
df.to_csv('serotonin_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [72]:
# Keep only rows that have both a standard_value and a canonical SMILES string
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,105351,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,5160.0
1,,105826,[],CHEMBL808141,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,1.15
2,,106988,[],CHEMBL873260,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,4.2
3,,108742,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,33.4
4,,111537,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,27.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4462,,23305408,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,43.0
4463,,23305409,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,135.0
4464,,23305410,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,881.0
4465,Active,23349824,[],CHEMBL4882629,Serotonin-uptake assay,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,13.0
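
The same filtering can be expressed in one step with `DataFrame.dropna` and its `subset` argument. The sketch below uses a made-up four-row table (the IDs and SMILES are illustrative, not ChEMBL records) to show which rows survive.

```python
import pandas as pd

# Toy data: rows B and C each lack one of the two required fields.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": [12.0, 5.0, None, 300.0],
})

# Drop any row with a missing value in either required column.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(clean)
```

Only rows A and D remain, because each of the other rows is missing one of the two required values.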


In [73]:
len(df2.canonical_smiles.unique())

3001

In [74]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,105351,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,5160.0
1,,105826,[],CHEMBL808141,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,1.15
2,,106988,[],CHEMBL873260,Binding affinity towards serotonin transporter...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,4.2
3,,108742,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,33.4
4,,111537,[],CHEMBL806861,Inhibition of [3H]citalopram binding to seroto...,B,,,BAO_0000190,BAO_0000221,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,27.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4461,,23305407,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,15.0
4462,,23305408,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,43.0
4463,,23305409,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,135.0
4464,,23305410,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4842938,Inhibition of human wild type SERT expressed i...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serotonin transporter,9606,,,IC50,nM,UO_0000065,,881.0
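
Note that `drop_duplicates` keeps the first row it encounters for each SMILES string, so when a compound has several IC50 measurements the retained value depends on row order. The sketch below contrasts that behavior with one possible alternative, aggregating by the median. The median is an assumption made for illustration, not the approach used in this notebook, and the toy values are invented.

```python
import pandas as pd

# Toy data: the compound "CCO" was measured twice.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [10.0, 30.0, 5.0],
})

# Behavior used in this notebook: keep the first occurrence of each SMILES.
first_kept = toy.drop_duplicates(["canonical_smiles"])

# Illustrative alternative: one row per SMILES holding the median value.
median_per_compound = (
    toy.groupby("canonical_smiles", as_index=False)["standard_value"].median()
)

print(first_kept)
print(median_per_compound)
```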


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [75]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL322923,COC(=O)C1=C(c2ccc(Cl)c(Cl)c2)CC2CCC1C2,5160.0
1,CHEMBL435341,C/C=C\c1ccc([C@H]2CC3CCC([C@H]2C(=O)OC)N3C)cc1,1.15
2,CHEMBL311347,COC(=O)[C@@H]1C2CC[C@H](C[C@@H]1c1ccc(I)cc1)N2C,4.2
3,CHEMBL100941,COC(=O)C1C2CCC(C2)CC1c1ccc(Cl)c(Cl)c1,33.4
4,CHEMBL87031,COC(=O)C1C(c2ccc(Cl)c(Cl)c2)CC2CCC1N2C,27.6
...,...,...,...
4461,CHEMBL4857489,COC(=O)C1CCC2CC(O)C1N2c1ccccc1,15.0
4462,CHEMBL4847426,Clc1ccc(OC2CC3CCC(C2)N3)cc1,43.0
4463,CHEMBL4851558,Clc1cccc(OC2CC3CCC(C2)N3)c1,135.0
4464,CHEMBL4873377,Clc1ccccc1OC1CC2CCC(C1)N2,881.0


Save the DataFrame to a CSV file.

In [77]:
df3.to_csv('serotonin_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, reported here in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [78]:
df4 = pd.read_csv('serotonin_02_bioactivity_data_preprocessed.csv')

In [80]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [81]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL322923,COC(=O)C1=C(c2ccc(Cl)c(Cl)c2)CC2CCC1C2,5160.00,intermediate
1,CHEMBL435341,C/C=C\c1ccc([C@H]2CC3CCC([C@H]2C(=O)OC)N3C)cc1,1.15,active
2,CHEMBL311347,COC(=O)[C@@H]1C2CC[C@H](C[C@@H]1c1ccc(I)cc1)N2C,4.20,active
3,CHEMBL100941,COC(=O)C1C2CCC(C2)CC1c1ccc(Cl)c(Cl)c1,33.40,active
4,CHEMBL87031,COC(=O)C1C(c2ccc(Cl)c(Cl)c2)CC2CCC1N2C,27.60,active
...,...,...,...,...
2996,CHEMBL4857489,COC(=O)C1CCC2CC(O)C1N2c1ccccc1,15.00,active
2997,CHEMBL4847426,Clc1ccc(OC2CC3CCC(C2)N3)cc1,43.00,active
2998,CHEMBL4851558,Clc1cccc(OC2CC3CCC(C2)N3)c1,135.00,active
2999,CHEMBL4873377,Clc1ccccc1OC1CC2CCC(C1)N2,881.00,active
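
The labeling loop above can also be written without an explicit Python loop. The following is a minimal sketch using `numpy.select` with the same thresholds; `label_bioactivity` and its parameter names are hypothetical, and the boundary values in the example are chosen to show that 1,000 nM counts as active and 10,000 nM as inactive.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (nM) as active, intermediate, or inactive (hypothetical helper)."""
    v = pd.to_numeric(values).to_numpy(dtype=float)
    labels = np.select(
        [v >= inactive_min, v <= active_max],  # conditions are checked in order
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=values.index, name="class")

ic50 = pd.Series([5160.0, 1.15, 1000.0, 10000.0, 20000.0], name="standard_value")
classes = label_bioactivity(ic50)
print(pd.concat([ic50, classes], axis=1))
```

Because the returned Series shares the index of its input, it can be concatenated column-wise with the source DataFrame in the same way as `bioactivity_class` above.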


Save the DataFrame to a CSV file.

In [82]:
df5.to_csv('serotonin_03_bioactivity_data_curated.csv', index=False)

In [86]:
! zip serotonin.zip *.csv

updating: serotonin_01_bioactivity_data_raw.csv (deflated 91%)
updating: serotonin_02_bioactivity_data_preprocessed.csv (deflated 81%)
updating: serotonin_03_bioactivity_data_curated.csv (deflated 82%)


In [87]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [88]:
! ls -l

total 3152
drwx------ 5 root root    4096 Jan  7 13:46 drive
drwxr-xr-x 1 root root    4096 Jan  5 14:34 sample_data
-rw-r--r-- 1 root root 2299786 Jan  7 11:12 serotonin_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  195575 Jan  7 11:44 serotonin_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  220676 Jan  7 11:45 serotonin_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root  499124 Jan  7 13:46 serotonin.zip


---