# **QSAR modeling for topoisomerase II inhibitors using machine learning**
[Part 3]

Creator : Mansi Patel


In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then combine them with the bioactivity values into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-09-03 14:58:52--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-09-03 14:58:53--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-09-03 14:58:54 (246 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-09-03 14:58:54--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **pIC50 data.csv** file, which contains the pIC50 values that will serve as the target variable for the regression model.
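
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so larger values indicate more potent compounds. The sketch below illustrates that conversion. The helper name `ic50_nm_to_pic50` and the assumption that IC50 is given in nM are illustrative and are not taken from the earlier parts.

```python
import math

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

# 1000 nM is 1 micromolar, i.e. 1e-6 M
print(round(ic50_nm_to_pic50(1000.0), 3))  # 6.0
```

A tenfold drop in IC50 therefore raises pIC50 by one unit.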

In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('pIC50 data.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL115665,O=C1C(Nc2ccc(Br)cc2)=C(Cl)C(=O)c2ncncc21,inactive,364.586,3.18060,1.0,5.0,4.821023
1,CHEMBL115302,Cc1ccc(/N=C2/C(=O)c3cncnc3C(O)=C2Cl)c(Br)c1,inactive,378.613,3.98192,1.0,5.0,4.392545
2,CHEMBL325088,O=C1C(Nc2ccccc2Br)=C(Cl)C(=O)c2ncncc21,inactive,364.586,3.18060,1.0,5.0,4.761954
3,CHEMBL157769,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,active,639.506,5.05580,2.0,13.0,7.221849
4,CHEMBL157831,COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2cc...,active,636.506,5.75962,2.0,13.0,7.221849
...,...,...,...,...,...,...,...,...
221,CHEMBL4593714,O=C1c2ccccc2C(=O)c2c(O)c(C(c3ccc(OC(F)(F)F)cc3...,inactive,554.348,5.64080,2.0,6.0,4.458421
222,CHEMBL17594,O=C1c2ccccc2C(=O)c2c(O)ccc(O)c21,active,240.214,1.87320,2.0,4.0,3.698970
223,CHEMBL9470,CC(C)=CC[C@@H](O)C1=CC(=O)c2c(O)ccc(O)c2C1=O,active,288.299,2.12040,3.0,5.0,5.107905
224,CHEMBL53463,COc1cccc2c1C(=O)c1c(O)c3c(c(O)c1C2=O)C[C@@](O)...,active,543.525,0.00130,6.0,12.0,5.420216


In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

O=C1C(Nc2ccc(Br)cc2)=C(Cl)C(=O)c2ncncc21	CHEMBL115665
Cc1ccc(/N=C2/C(=O)c3cncnc3C(O)=C2Cl)c(Br)c1	CHEMBL115302
O=C1C(Nc2ccccc2Br)=C(Cl)C(=O)c2ncncc21	CHEMBL325088
COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2ccc([N+](=O)[O-])cc2)cs1)C1=NOCCO1	CHEMBL157769
COC(=O)c1c(Br)c(OC)cc(O)c1CSC[C@H](Nc1nc(-c2ccc([N+](=O)[O-])cc2)cs1)c1nc(C)no1	CHEMBL157831


In [None]:
! cat molecule.smi | wc -l

226
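
Before running the descriptor calculation, it may help to confirm that every line of the `.smi` file holds exactly one SMILES string and one identifier separated by a tab. The following is a minimal sketch of such a check; `check_smi` is a hypothetical helper, and the demo writes a throwaway file rather than touching `molecule.smi`.

```python
import os
import tempfile

def check_smi(path: str) -> int:
    """Return the number of records after verifying each line is 'SMILES<TAB>ID'."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"line {line_no}: expected 2 non-empty tab-separated fields")
            count += 1
    return count

# Demo on a temporary file built from two molecules shown in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.smi")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("O=C1C(Nc2ccccc2Br)=C(Cl)C(=O)c2ncncc21\tCHEMBL325088\n")
        handle.write("O=C1c2ccccc2C(=O)c2c(O)ccc(O)c21\tCHEMBL17594\n")
    n_records = check_smi(demo_path)

print(n_records)
```

On the real data, `check_smi('molecule.smi')` should return the same count as `wc -l`.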


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
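
To make the one-line command in `padel.sh` easier to read, the sketch below rebuilds it as a Python argument list with one comment per option. The comments reflect my understanding of the PaDEL-Descriptor command-line options and should be checked against the tool's own documentation; the variable name `padel_cmd` is illustrative.

```python
import shlex

padel_cmd = [
    "java",
    "-Xms1G", "-Xmx1G",                  # initial and maximum Java heap size
    "-Djava.awt.headless=true",          # run without a graphical display
    "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    "-removesalt",                       # strip salts / counter-ions
    "-standardizenitro",                 # standardize nitro groups
    "-fingerprints",                     # calculate fingerprints
    "-descriptortypes", "./PaDEL-Descriptor/PubchemFingerprinter.xml",  # which fingerprint type
    "-dir", "./",                        # directory holding the structure files
    "-file", "descriptors_output.csv",   # output CSV
]

# Joining the list reproduces the command stored in padel.sh.
print(shlex.join(padel_cmd))
```

The same list could be passed to `subprocess.run(padel_cmd, check=True)` in place of `! bash padel.sh`, assuming Java and the unzipped PaDEL-Descriptor folder are available.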


In [None]:
! bash padel.sh

Processing CHEMBL115302 in molecule.smi (2/226). 
Processing CHEMBL115665 in molecule.smi (1/226). 
Processing CHEMBL157769 in molecule.smi (4/226). Average speed: 1.27 s/mol.
Processing CHEMBL325088 in molecule.smi (3/226). Average speed: 2.42 s/mol.
Processing CHEMBL157831 in molecule.smi (5/226). Average speed: 1.08 s/mol.
Processing CHEMBL156813 in molecule.smi (6/226). Average speed: 1.24 s/mol.
Processing CHEMBL95777 in molecule.smi (7/226). Average speed: 1.18 s/mol.
Processing CHEMBL36506 in molecule.smi (8/226). Average speed: 1.15 s/mol.
Processing CHEMBL442194 in molecule.smi (9/226). Average speed: 1.25 s/mol.
Processing CHEMBL330372 in molecule.smi (10/226). Average speed: 1.07 s/mol.
Processing CHEMBL95741 in molecule.smi (11/226). Average speed: 1.03 s/mol.
Processing CHEMBL97620 in molecule.smi (12/226). Average speed: 0.95 s/mol.
Processing CHEMBL95778 in molecule.smi (13/226). Average speed: 1.01 s/mol.
Processing CHEMBL335387 in molecule.smi (14/226). Average speed: 

In [None]:
! ls -l

total 25636
-rw-r--r-- 1 root root   412985 Sep  3 15:10  descriptors_output.csv
drwxr-xr-x 3 root root     4096 Sep  3 14:59  __MACOSX
-rw-r--r-- 1 root root    16167 Sep  3 15:08  molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020  PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Sep  3 14:58  padel.sh
-rw-r--r-- 1 root root 25768637 Sep  3 14:58  padel.zip
-rw-r--r-- 1 root root    32007 Sep  3 15:08 'pIC50 data.csv'
drwxr-xr-x 1 root root     4096 Aug 31 13:47  sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL115665,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115302,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL325088,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL157769,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL157831,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,CHEMBL17594,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
222,CHEMBL4593714,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
223,CHEMBL9470,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
224,CHEMBL53463,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
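
The identifiers in the `Name` column above do not all appear in the same order as `molecule_chembl_id` in `df3` (compare rows 221 and 222 in the two tables). Because the X and Y data are later combined by row position, it is worth checking the order while `Name` is still available. The sketch below shows one way to list the disagreeing rows; `order_mismatches` is a hypothetical helper and the demo uses toy identifiers taken from those rows.

```python
import pandas as pd

def order_mismatches(descriptor_names: pd.Series, input_ids: pd.Series) -> pd.DataFrame:
    """Return the row positions where the two identifier columns disagree."""
    names = descriptor_names.reset_index(drop=True)
    ids = input_ids.reset_index(drop=True)
    if len(names) != len(ids):
        raise ValueError("the two tables have different numbers of rows")
    differs = names != ids
    return pd.DataFrame({"descriptor_row": names[differs], "input_row": ids[differs]})

# Toy identifiers mirroring the last rows displayed in this notebook.
toy_input = pd.Series(["CHEMBL4593714", "CHEMBL17594", "CHEMBL9470"])
toy_descriptors = pd.Series(["CHEMBL17594", "CHEMBL4593714", "CHEMBL9470"])

mismatches = order_mismatches(toy_descriptors, toy_input)
print(mismatches)
```

On the real tables the call would be `order_mismatches(df3_X['Name'], df3['molecule_chembl_id'])`; an empty result means the rows already line up.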


In [None]:
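# The descriptor rows are not in the same order as df3 (compare rows 221 and 222 above with df3),
# so re-align them to df3 by identifier; this also removes the 'Name' column.
df3_X = df3_X.set_index('Name').reindex(df3['molecule_chembl_id']).reset_index(drop=True)
df3_X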

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
222,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
223,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
224,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the pIC50 column**

In [None]:
df3_Y = df3['pIC50']
df3_Y

0      4.821023
1      4.392545
2      4.761954
3      7.221849
4      7.221849
         ...   
221    4.458421
222    3.698970
223    5.107905
224    5.420216
225    3.933674
Name: pIC50, Length: 226, dtype: float64

## **Combining the X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.821023
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.392545
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.761954
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.221849
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.221849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.458421
222,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.698970
223,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.107905
224,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.420216
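
`pd.concat(..., axis=1)` joins the two objects on their index labels, so any mismatch in length or index shows up as missing values in the result. A quick check of the combined table before saving can catch that. The sketch below is one possible check under the assumptions that the table holds 881 PubChem fingerprint bits (`PubchemFP0` to `PubchemFP880`) plus a `pIC50` column; `validate_dataset` is a hypothetical helper and the demo uses a toy table with 3 bits.

```python
import pandas as pd

def validate_dataset(dataset: pd.DataFrame, target: str = "pIC50", n_bits: int = 881) -> None:
    """Raise ValueError if the combined table looks malformed."""
    if target not in dataset.columns:
        raise ValueError(f"missing target column {target!r}")
    feature_cols = [col for col in dataset.columns if col != target]
    if len(feature_cols) != n_bits:
        raise ValueError(f"expected {n_bits} fingerprint columns, found {len(feature_cols)}")
    if dataset.isna().any().any():
        raise ValueError("dataset contains missing values")
    if not dataset[feature_cols].isin([0, 1]).all().all():
        raise ValueError("fingerprint columns must contain only 0 or 1")

# Toy table with 3 fingerprint bits instead of 881.
toy = pd.DataFrame({
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
    "PubchemFP2": [0, 0],
    "pIC50": [4.82, 7.22],
})
validate_dataset(toy, n_bits=3)
print("toy dataset passed")
```

On the real data, `validate_dataset(dataset3)` would run the same checks with the default of 881 bits.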


In [None]:
dataset3.to_csv('topoisomerase_bioactivity_data_pIC50_pubchem_fp.csv', index=False)

# **Download the CSV file to your local computer for Part 4 (Model Building).**