# **QSAR modeling for topoisomerase II inhibitors using machine learning**
[Part 1]

Creator: Mansi Patel


In **Part 1**, we collect bioactivity data from the ChEMBL Database and pre-process it.

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting attrs<22.0,>=21.2
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Installing collected packages: url-normalize, itsdangerous, attrs, requests-cache, chembl-webresource-client
  Attempting uninstall: itsdangerous
    Found existing installation: itsdangerous 1.1.0
    Uninstalling itsdangerous-1.1.0:
      Successfully uninstalled itsdangerous-1.1.0
  Attempting unin

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for topoisomerase**

In [None]:
# Target search for topoisomerase
target = new_client.target
target_query = target.search('topoisomerase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Topoisomerase I/II,15.0,False,CHEMBL4106174,"[{'accession': 'P11387', 'component_descriptio...",PROTEIN FAMILY,9606
1,"[{'xref_id': 'P11387', 'xref_name': None, 'xre...",Homo sapiens,DNA topoisomerase I,14.0,False,CHEMBL1781,"[{'accession': 'P11387', 'component_descriptio...",SINGLE PROTEIN,9606
2,"[{'xref_id': 'P0AFI2', 'xref_name': None, 'xre...",Escherichia coli K-12,Topoisomerase IV subunit A,14.0,False,CHEMBL1895,"[{'accession': 'P0AFI2', 'component_descriptio...",SINGLE PROTEIN,83333
3,"[{'xref_id': 'P96583', 'xref_name': None, 'xre...",Bacillus subtilis (strain 168),DNA topoisomerase III,14.0,False,CHEMBL4320,"[{'accession': 'P96583', 'component_descriptio...",SINGLE PROTEIN,224308
4,"[{'xref_id': 'A7E2X7', 'xref_name': None, 'xre...",Homo sapiens,Topoisomerase (DNA) II binding protein 1,14.0,False,CHEMBL3175,"[{'accession': 'Q92547', 'component_descriptio...",SINGLE PROTEIN,9606
5,"[{'xref_id': 'Q04750', 'xref_name': None, 'xre...",Mus musculus,DNA topoisomerase I,14.0,False,CHEMBL2814,"[{'accession': 'Q04750', 'component_descriptio...",SINGLE PROTEIN,10090
6,"[{'xref_id': 'Q02880', 'xref_name': None, 'xre...",Homo sapiens,DNA topoisomerase II beta,14.0,False,CHEMBL3396,"[{'accession': 'Q02880', 'component_descriptio...",SINGLE PROTEIN,9606
7,"[{'xref_id': 'P15348', 'xref_name': None, 'xre...",Drosophila melanogaster,DNA topoisomerase II,14.0,False,CHEMBL2671,"[{'accession': 'P15348', 'component_descriptio...",SINGLE PROTEIN,7227
8,"[{'xref_id': 'Q64511', 'xref_name': None, 'xre...",Mus musculus,DNA topoisomerase II beta,14.0,False,CHEMBL5564,"[{'accession': 'Q64511', 'component_descriptio...",SINGLE PROTEIN,10090
9,"[{'xref_id': 'Q931S2', 'xref_name': None, 'xre...",Staphylococcus aureus subsp. aureus Mu50,Topoisomerase IV subunit A,14.0,False,CHEMBL4836,"[{'accession': 'Q931S2', 'component_descriptio...",SINGLE PROTEIN,158878


We will assign the entry at index 15 (which corresponds to the target protein, *topoisomerase II*) to the **selected_target** variable. Index 15 lies beyond the rows shown in the table above.

In [None]:
selected_target = targets.target_chembl_id[15]
selected_target

'CHEMBL2094255'
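Selecting a target by position depends on the order in which the search returns its rows. As a minimal sketch, the same selection can be made by name and organism instead. The `targets_demo` table and the `pick_target` helper below are hypothetical stand-ins written for illustration; only the column names and the three example rows are taken from this notebook's outputs.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` table returned by the search;
# the real table has more rows and columns.
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
    "pref_name": ["DNA topoisomerase I", "DNA topoisomerase II",
                  "DNA topoisomerase II beta"],
    "target_chembl_id": ["CHEMBL1781", "CHEMBL2094255", "CHEMBL5564"],
})

def pick_target(table, pref_name, organism):
    """Return the ChEMBL ID of the single row matching name and organism."""
    match = table[(table["pref_name"] == pref_name)
                  & (table["organism"] == organism)]
    if len(match) != 1:
        # Fail loudly instead of silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(match)}")
    return match["target_chembl_id"].iloc[0]

selected_demo = pick_target(targets_demo, "DNA topoisomerase II", "Homo sapiens")
print(selected_demo)
```

Raising an error when zero or several rows match makes a change in the search results visible immediately, rather than leaving a different target in `selected_target`.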

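Here, we will retrieve only the bioactivity data for *topoisomerase II* (CHEMBL2094255) that are reported as IC$_{50}$ values. The filter below selects on the measurement type only; the rest of this notebook treats **standard_value** as a concentration in nM (nanomolar).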

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33049,[],CHEMBL663875,Inhibitory activity against topo II-mediated d...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,15.1
1,,36756,[],CHEMBL663874,Inhibitory activity against DNA topoisomerase II,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,40.5
2,,45103,[],CHEMBL663875,Inhibitory activity against topo II-mediated d...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,17.3
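Because the query filters on `standard_type` only, it may be worth checking the units before treating every `standard_value` as nM. In the outputs of this notebook, the records whose `units` column reads `ug ml-1` show the same number (60.0) in both `value` and `standard_value`, which suggests that not every record was converted. The sketch below uses a hypothetical three-row table; the molecule names and the `ug.mL-1` unit string are made up for illustration, and it assumes the activity records carry a `standard_units` column.

```python
import pandas as pd

# Hypothetical records mimicking a few columns of the activity table.
activity_demo = pd.DataFrame({
    "molecule_chembl_id": ["mol_a", "mol_b", "mol_c"],
    "standard_value": ["15100.0", "60.0", "730.0"],
    "standard_units": ["nM", "ug.mL-1", "nM"],
})

# Inspect which units are present before filtering.
print(activity_demo["standard_units"].value_counts())

# Keep only the rows whose standardized unit is nanomolar.
nm_only = activity_demo[activity_demo["standard_units"] == "nM"].copy()

# Values may arrive as strings (the labeling loop in this notebook calls
# float() on them), so convert before any numeric comparison.
nm_only["standard_value"] = pd.to_numeric(nm_only["standard_value"],
                                          errors="coerce")
print(nm_only)
```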


Finally, we will save the resulting bioactivity data to the CSV file **bioactivity_data_raw.csv**.

In [None]:
df.to_csv('bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33049,[],CHEMBL663875,Inhibitory activity against topo II-mediated d...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,15.1
1,,36756,[],CHEMBL663874,Inhibitory activity against DNA topoisomerase II,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,40.5
2,,45103,[],CHEMBL663875,Inhibitory activity against topo II-mediated d...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,17.3
3,,78097,[],CHEMBL662164,Antibacterial activity against human DNA Topoi...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,ug ml-1,UO_0000274,,60.0
4,,87669,[],CHEMBL662164,Antibacterial activity against human DNA Topoi...,B,,,BAO_0000190,BAO_0000224,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,ug ml-1,UO_0000274,,60.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256,,19240592,[],CHEMBL4399552,Inhibition of DNA topoisomerase 2 in human HeL...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,4.95
257,,19240593,[],CHEMBL4399552,Inhibition of DNA topoisomerase 2 in human HeL...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,0.73
258,,19240601,[],CHEMBL4399556,Inhibition of DNA topoisomerase 2 in human MCF...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,0.23
259,,19240602,[],CHEMBL4399556,Inhibition of DNA topoisomerase 2 in human MCF...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,DNA topoisomerase II,9606,,,IC50,uM,UO_0000065,,0.44
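As an optional variation, `DataFrame.dropna` with the `subset` argument removes rows that lack a value in any of several required columns, and counting the missing values first shows how much data is lost. The table below is a hypothetical example; also requiring `canonical_smiles` is an assumption made here because that column is selected later in this notebook.

```python
import pandas as pd

# Hypothetical table with gaps in two of the columns used later on.
missing_demo = pd.DataFrame({
    "molecule_chembl_id": ["mol_a", "mol_b", "mol_c", "mol_d"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": ["15100.0", "60.0", None, "730.0"],
})

# Count missing values per column before dropping anything.
print(missing_demo.isna().sum())

# Drop rows that lack either a potency value or a structure.
cleaned = missing_demo.dropna(subset=["standard_value", "canonical_smiles"])
print(len(missing_demo) - len(cleaned), "rows dropped")
```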


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, read here in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
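The same thresholds can also be applied without an explicit loop. The following is a sketch of one vectorized alternative using `numpy.select`, not the method used in this notebook; the input values are hypothetical and include the two boundary cases (1000 and 10000) to show how they are classified.

```python
import numpy as np
import pandas as pd

# Hypothetical potency values in nM, given as strings.
values = pd.Series(["15100.0", "60.0", "4950.0", "1000", "10000"])
numeric = pd.to_numeric(values)

# Conditions are checked in order, mirroring the if / elif / else loop.
conditions = [numeric >= 10000, numeric <= 1000]
labels = ["inactive", "active"]

classes = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=values.index,          # keep the labels tied to the original rows
    name="bioactivity_class",
)
print(classes.tolist())
```

Building the result on `values.index` keeps each label attached to the row it was computed from, even if rows were filtered out earlier.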

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL115665,O=C1C(Nc2ccc(Br)cc2)=C(Cl)C(=O)c2ncncc21,15100.0
1,CHEMBL115302,Cc1ccc(/N=C2/C(=O)c3cncnc3C(O)=C2Cl)c(Br)c1,40500.0
2,CHEMBL325088,O=C1C(Nc2ccccc2Br)=C(Cl)C(=O)c2ncncc21,17300.0
3,CHEMBL157769,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,60.0
4,CHEMBL157831,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,60.0
...,...,...,...
256,CHEMBL4456370,OC(c1cccs1)c1cc2ccccc2nc1Cl,4950.0
257,CHEMBL84,CC[C@@]1(O)C(=O)OCc2c1cc1n(c2=O)Cc2cc3c(CN(C)C...,730.0
258,CHEMBL4469453,Fc1ccc(COC(c2cccs2)c2cc3ccccc3nc2Cl)cc1,230.0
259,CHEMBL4470431,Clc1ccc(COC(c2cccs2)c2cc3ccccc3nc2Cl)cc1,440.0


In [None]:
# Build the labels on df3's own index: rows were dropped earlier, so a default
# 0..n-1 index would no longer line up with df3 when concatenating.
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL115665,O=C1C(Nc2ccc(Br)cc2)=C(Cl)C(=O)c2ncncc21,15100.0,inactive
1,CHEMBL115302,Cc1ccc(/N=C2/C(=O)c3cncnc3C(O)=C2Cl)c(Br)c1,40500.0,inactive
2,CHEMBL325088,O=C1C(Nc2ccccc2Br)=C(Cl)C(=O)c2ncncc21,17300.0,inactive
3,CHEMBL157769,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,60.0,active
4,CHEMBL157831,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,60.0,active
...,...,...,...,...
256,CHEMBL4456370,OC(c1cccs1)c1cc2ccccc2nc1Cl,4950.0,intermediate
257,CHEMBL84,CC[C@@]1(O)C(=O)OCc2c1cc1n(c2=O)Cc2cc3c(CN(C)C...,730.0,active
258,CHEMBL4469453,Fc1ccc(COC(c2cccs2)c2cc3ccccc3nc2Cl)cc1,230.0,active
259,CHEMBL4470431,Clc1ccc(COC(c2cccs2)c2cc3ccccc3nc2Cl)cc1,440.0,active
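`pd.concat(..., axis=1)` aligns its inputs on index labels rather than on row position, so attaching a freshly built Series to a DataFrame that has had rows filtered out can shift labels or leave them empty. The sketch below uses made-up values to show the effect and one way to avoid it.

```python
import pandas as pd

# A filtered table: the row with index label 1 was dropped earlier.
filtered = pd.DataFrame({"standard_value": [15100.0, 60.0, 730.0]},
                        index=[0, 2, 3])
labels = ["inactive", "active", "active"]

# A new Series gets the default index 0, 1, 2, which does not match 0, 2, 3.
misaligned = pd.concat(
    [filtered, pd.Series(labels, name="bioactivity_class")], axis=1
)
print(misaligned)

# Assigning the list directly attaches the labels by position instead.
aligned = filtered.assign(bioactivity_class=labels)
print(aligned)
```

In the misaligned result an extra row appears for label 1 and the last row has no class, while the positional assignment keeps one label per original row.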


Save the DataFrame to a CSV file.

In [None]:
df4.to_csv('bioactivity_data_preprocessed.csv', index=False)

In [None]:
! ls -l

total 164
-rw-r--r-- 1 root root  22520 Sep  1 16:33 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 135499 Sep  1 16:28 bioactivity_data_raw.csv
drwxr-xr-x 1 root root   4096 Aug 15 13:44 sample_data
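Beyond listing the directory, reading the file back is a quick way to confirm that the rows and columns survived the round trip. The sketch below writes a small hypothetical table to a temporary directory so that it does not touch the files created above; the file name `bioactivity_demo.csv` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical pre-processed table with the same columns as df4.
demo = pd.DataFrame({
    "molecule_chembl_id": ["mol_a", "mol_b"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "standard_value": [15100.0, 60.0],
    "bioactivity_class": ["inactive", "active"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bioactivity_demo.csv")
    demo.to_csv(path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# Compare shape and column names of the reloaded table with the original.
print(reloaded.shape == demo.shape)
print(list(reloaded.columns) == list(demo.columns))
```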


---