# **Computational Drug Discovery**
## Download Bioactivity Data from ChEMBL Database

In this Jupyter notebook, I am building a machine learning model using the ChEMBL bioactivity data.

*Tsaniyah Nur Kholilah*

Inspired by: *Data Professor*


In [None]:
#Install the ChEMBL web service package
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search Target Protein
Protein target = NAMPT,
Organism = Human,
Bioactivity = pIC50,
Assay type = Single protein
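
The bioactivity of interest is listed as pIC50, whereas the data retrieved below are IC50 values. A minimal sketch of the conversion, assuming the IC50 is expressed in nM; the helper name `ic50_nm_to_pic50` is hypothetical and not part of this notebook:

```python
import numpy as np

def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = np.asarray(ic50_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ic50_molar)

# Example inputs: the two class thresholds used later in this notebook
pic50 = ic50_nm_to_pic50([1000.0, 10000.0])
print(pic50)
```

Under this convention, 1,000 nM corresponds to a pIC50 of 6 and 10,000 nM to a pIC50 of 5.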


In [None]:
# Target search for NAMPT
target = new_client.target
target_query = target.search('NAMPT')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Nicotinamide phosphoribosyltransferase,19.0,False,CHEMBL3259474,"[{'accession': 'Q99KQ4', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Rattus norvegicus,Nicotinamide phosphoribosyltransferase,19.0,False,CHEMBL3259475,"[{'accession': 'Q80Z29', 'component_descriptio...",SINGLE PROTEIN,10116
2,[{'xref_id': 'Nicotinamide_phosphoribosyltrans...,Homo sapiens,Nicotinamide phosphoribosyltransferase,18.0,False,CHEMBL1744525,"[{'accession': 'P43490', 'component_descriptio...",SINGLE PROTEIN,9606


In [None]:
# Select and retrieve bioactivity data
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL1744525'
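
Picking the target by position (`[2]`) depends on the order of the search results. A minimal sketch of selecting by organism and target type instead, using a toy stand-in for the `targets` table; `targets_demo` and `selected_demo` are hypothetical names:

```python
import pandas as pd

# Toy stand-in for the `targets` table above (only the columns needed here)
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Rattus norvegicus", "Homo sapiens"],
    "target_type": ["SINGLE PROTEIN"] * 3,
    "target_chembl_id": ["CHEMBL3259474", "CHEMBL3259475", "CHEMBL1744525"],
})

# Keep human single-protein targets, then take the first match
is_human = targets_demo["organism"] == "Homo sapiens"
is_single = targets_demo["target_type"] == "SINGLE PROTEIN"
selected_demo = targets_demo.loc[is_human & is_single, "target_chembl_id"].iloc[0]
print(selected_demo)
```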

Here, we retrieve only the bioactivity data for CHEMBL1744525 that are reported as IC50 values in nM (nanomolar) units.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
# Preview the first rows of the data
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,6152799,[],CHEMBL1763209,Inhibition of human NAmPRTase by spectrophotom...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,14.8
1,,6152800,[],CHEMBL1763209,Inhibition of human NAmPRTase by spectrophotom...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,20.3
2,,6274579,[],CHEMBL1805766,Inhibition of nicotinamide phosphoribosyltrans...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,0.0019


In [None]:
df.standard_type.unique()

array(['IC50'], dtype=object)
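
The preview above shows the reported `units` column as uM, while later cells treat `standard_value` as nM. A minimal sketch of a unit check on toy data, assuming the activity table also carries ChEMBL's `standard_units` column (it is hidden behind the `...` in the preview, so its presence here is an assumption); `activities_demo` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical records; the third uses a unit that is not nM
activities_demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": [14800.0, 1.9, 3.2],
})

# Count how often each standardized unit occurs
unit_counts = activities_demo["standard_units"].value_counts()
print(unit_counts)

# Keep only the rows whose standardized unit is nM
nm_only = activities_demo[activities_demo["standard_units"] == "nM"]
```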

In [None]:
# Save data to csv
df.to_csv('bioactivity_data.csv', index=False)

## Mounting files to Google Drive

In [None]:
# Mount Google Drive so that files can be copied to it
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

MessageError: ignored

**Next**, we create a **data** folder in our **Colab Notebooks** folder on Google Drive.

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"
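
The same create-folder-and-copy steps can also be sketched in plain Python. This version uses a temporary directory as a stand-in for the mounted Drive path, so it runs outside Colab; `drive_root` and the placeholder file contents are assumptions for illustration:

```python
import shutil
import tempfile
from pathlib import Path

# A temporary directory stands in for the mounted Drive path (assumption)
drive_root = Path(tempfile.mkdtemp())
data_dir = drive_root / "Colab Notebooks" / "data"

# parents=True creates intermediate folders; exist_ok=True makes reruns safe
data_dir.mkdir(parents=True, exist_ok=True)

# Write a small placeholder file, then copy it into the data folder
src = drive_root / "bioactivity_data.csv"
src.write_text("molecule_chembl_id,standard_value\nCHEMBL566757,14800.0\n")
copied = Path(shutil.copy(src, data_dir))
print(copied.name)
```

With `exist_ok=True`, rerunning the cell does not fail when the folder already exists.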

In [None]:
! ls

In [None]:
! head bioactivity_data.csv

## Handling missing data
If a compound has a missing value in the `standard_value` column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,6152799,[],CHEMBL1763209,Inhibition of human NAmPRTase by spectrophotom...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,14.8
1,,6152800,[],CHEMBL1763209,Inhibition of human NAmPRTase by spectrophotom...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,20.3
2,,6274579,[],CHEMBL1805766,Inhibition of nicotinamide phosphoribosyltrans...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,0.0019
3,,6274580,[],CHEMBL1805766,Inhibition of nicotinamide phosphoribosyltrans...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,0.0013
4,,6274581,[],CHEMBL1805766,Inhibition of nicotinamide phosphoribosyltrans...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,0.0042
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2682,,19403641,[],CHEMBL4425876,Inhibition of recombinant NAMPT (unknown origi...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,195.0
2683,,19403642,[],CHEMBL4425876,Inhibition of recombinant NAMPT (unknown origi...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,25.3
2684,,19403758,[],CHEMBL4425882,Inhibition of NAMPT in human MCF7 cells assess...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,0.155
2685,,20646204,[],CHEMBL4614220,Inhibition of NAMPT (unknown origin) using NAM...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,7.31


At first I did not think the missing data in this dataset needed further processing.
Later analysis, however, kept failing with the same error.
On checking, I found that the row at index 2019 had no SMILES string, so I removed that row and repeated the analysis.
Alternatively, rows with a missing SMILES string can be dropped directly:


```
df2 = df[df.canonical_smiles.notna()]
df2
```
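
Both filters (missing `standard_value` and missing `canonical_smiles`) can be applied in one step. A minimal sketch on toy data, where `raw_demo` and its molecule IDs are hypothetical:

```python
import numpy as np
import pandas as pd

raw_demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [14800.0, 1.9, np.nan, 4.2],
})

# Drop rows missing either field, then renumber the index from 0
clean_demo = (
    raw_demo
    .dropna(subset=["standard_value", "canonical_smiles"])
    .reset_index(drop=True)
)
print(clean_demo)
```

Resetting the index keeps later position-based steps from tripping over the gaps left by dropped rows.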



In [None]:
import pandas as pd

In [None]:
df2 = pd.read_csv('/content/bioactivity_data.csv')
df2.tail()

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
2681,,19403641,[],CHEMBL4425876,Inhibition of recombinant NAMPT (unknown origi...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,195.0
2682,,19403642,[],CHEMBL4425876,Inhibition of recombinant NAMPT (unknown origi...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,25.3
2683,,19403758,[],CHEMBL4425882,Inhibition of NAMPT in human MCF7 cells assess...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,nM,UO_0000065,,0.155
2684,,20646204,[],CHEMBL4614220,Inhibition of NAMPT (unknown origin) using NAM...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,7.31
2685,,20646205,[],CHEMBL4614220,Inhibition of NAMPT (unknown origin) using NAM...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Nicotinamide phosphoribosyltransferase,9606,,,IC50,uM,UO_0000065,,2.15


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
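
The same thresholds can be applied without an explicit loop. A minimal sketch using `numpy.select` on toy values; `values_demo` and `classes_demo` are hypothetical names, and the boundary handling mirrors the `>=` / `<=` comparisons in the loop above:

```python
import numpy as np
import pandas as pd

values_demo = pd.Series([14800.0, 1.9, 7310.0, 1000.0, 10000.0])

# Conditions are checked in order; anything unmatched gets the default label
conditions = [values_demo >= 10000, values_demo <= 1000]
labels = ["inactive", "active"]
classes_demo = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=values_demo.index,
    name="bioactivity_class",
)
print(classes_demo.tolist())
```

Because the result keeps the index of the input series, it can be attached to the source dataframe without alignment surprises.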

### **Collect *molecule_chembl_id* into a list**

In [None]:
# Convert the column to a plain Python list
mol_cid = df2.molecule_chembl_id.tolist()

### **Collect *canonical_smiles* into a list**

In [None]:
# Convert the column to a plain Python list
canonical_smiles = df2.canonical_smiles.tolist()

### **Collect *standard_value* into a list**

In [None]:
# Convert the column to a plain Python list
standard_value = df2.standard_value.tolist()

### **Combine the 4 lists into a dataframe**

In [None]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL566757,O=C(/C=C/c1cccnc1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,inactive,14800.000
1,CHEMBL1762233,O=C(/C=C/c1ccc[nH]1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,inactive,20300.000
2,CHEMBL1801562,CC(C)=CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,active,1.900
3,CHEMBL1801561,O=C(NCc1cccnc1)c1ccc(N(Cc2ccccc2Cl)CC2CC2)cc1,active,1.300
4,CHEMBL1801560,C#CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,active,4.200
...,...,...,...,...
2681,CHEMBL4544855,CC(C)N(CCc1ccsc1)Cc1ccc(CCCCCNC(=O)/C=C/c2cccn...,active,195.000
2682,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,active,25.300
2683,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,active,0.155
2684,CHEMBL4635425,CCOC(=O)c1ccc(NC(=O)NCc2ccnc(-n3ccnc3C)c2)cc1,intermediate,7310.000


### Alternative dataframe

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL566757,O=C(/C=C/c1cccnc1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,14800.0
1,CHEMBL1762233,O=C(/C=C/c1ccc[nH]1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,20300.0
2,CHEMBL1801562,CC(C)=CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,1.9
3,CHEMBL1801561,O=C(NCc1cccnc1)c1ccc(N(Cc2ccccc2Cl)CC2CC2)cc1,1.3
4,CHEMBL1801560,C#CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,4.2
...,...,...,...
2682,CHEMBL4544855,CC(C)N(CCc1ccsc1)Cc1ccc(CCCCCNC(=O)/C=C/c2cccn...,195.0
2683,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,25.3
2684,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,0.155
2685,CHEMBL4635425,CCOC(=O)c1ccc(NC(=O)NCc2ccnc(-n3ccnc3C)c2)cc1,7310.0


In [None]:
pd.concat([df3,pd.Series(bioactivity_class)], axis=1)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,0
0,CHEMBL566757,O=C(/C=C/c1cccnc1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,14800.0,inactive
1,CHEMBL1762233,O=C(/C=C/c1ccc[nH]1)NCCCCC1CCN(C(=O)c2ccccc2)CC1,20300.0,inactive
2,CHEMBL1801562,CC(C)=CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,1.9,active
3,CHEMBL1801561,O=C(NCc1cccnc1)c1ccc(N(Cc2ccccc2Cl)CC2CC2)cc1,1.3,active
4,CHEMBL1801560,C#CCN(Cc1ccccc1Cl)c1ccc(C(=O)NCc2cccnc2)cc1,4.2,active
...,...,...,...,...
2682,CHEMBL4544855,CC(C)N(CCc1ccsc1)Cc1ccc(CCCCCNC(=O)/C=C/c2cccn...,195.0,intermediate
2683,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,25.3,
2684,CHEMBL4544378,CC(C)N(CCc1c[nH]c2ccccc12)Cc1ccc(NCCCCCNC(=O)/...,0.155,
2685,CHEMBL4635425,CCOC(=O)c1ccc(NC(=O)NCc2ccnc(-n3ccnc3C)c2)cc1,7310.0,
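
The unnamed column `0` and the `NaN` labels in the last rows above are consistent with the label list and the dataframe not sharing the same index or length, since `pd.concat(..., axis=1)` aligns on the index. A minimal sketch of the effect and one way around it, on toy data with hypothetical names (`frame_demo`, `labels_demo`):

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after dropping a row
frame_demo = pd.DataFrame(
    {"standard_value": [14800.0, 1.9, 7310.0]},
    index=[0, 1, 3],
)
labels_demo = ["inactive", "active", "intermediate"]

# Index-aligned concat: the default 0..2 index does not match 0, 1, 3
misaligned = pd.concat([frame_demo, pd.Series(labels_demo)], axis=1)

# Positional alignment: reset the index and give the new column a name
aligned = pd.concat(
    [frame_demo.reset_index(drop=True),
     pd.Series(labels_demo, name="bioactivity_class")],
    axis=1,
)
print(aligned)
```

This only works if the labels were computed from the same rows, in the same order, as the dataframe they are attached to.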


### Save dataframe to CSV

In [None]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

Download to local

In [None]:
! ls -l

total 2996
-rw-r--r-- 1 root root 2866966 Apr 19 11:43 bioactivity_data.csv
-rw-r--r-- 1 root root  189098 Apr 19 12:01 bioactivity_preprocessed_data.csv
drwx------ 5 root root    4096 Apr 19 11:47 gdrive
drwxr-xr-x 1 root root    4096 Apr  8 13:32 sample_data


Copy to Google Drive

In [None]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

'/content/gdrive/My Drive/Colab Notebooks/data'
