# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will be calculating molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will combine these descriptors with the bioactivity values to prepare a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [None]:
! unzip padel.zip

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use for building a regression model.

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
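
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so a higher pIC50 means a more potent compound. Below is a minimal sketch of that conversion, assuming the IC50 is given in nM. The function name `ic50_nm_to_pic50` is hypothetical and is not taken from the Part 1 and 2 notebooks.

```python
import math

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)

# A 1000 nM (1 uM) compound has a pIC50 of about 6; a 1 nM compound about 9.
print(round(ic50_nm_to_pic50(1000.0), 3))
print(round(ic50_nm_to_pic50(1.0), 3))
```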

In [1]:
import pandas as pd

In [2]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [3]:
df3

| Unnamed: 0.1 | Unnamed: 0 | molecule_chembl_id | canonical_smiles | class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CHEMBL1678 | COc1cc2c(cc1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2.Cl | active | 379.500 | 4.3611 | 0.0 | 4.0 | 8.244125 |
| 1 | 1 | CHEMBL552871 | COc1cc2c(c(OC)c1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1... | active | 409.526 | 4.3697 | 0.0 | 5.0 | 7.886057 |
| 2 | 2 | CHEMBL145446 | COc1cc2c(cc1OC)C(=O)C(CCCC1CCN(Cc3ccccc3)CC1)C2 | active | 407.554 | 5.1413 | 0.0 | 4.0 | 8.823909 |
| 3 | 3 | CHEMBL538050 | Cl.O=C1c2ccccc2CCC1CC1CCN(Cc2ccccc2)CC1 | intermediate | 333.475 | 4.7340 | 0.0 | 2.0 | 5.677781 |
| 4 | 4 | CHEMBL543923 | COc1cc2c(cc1OC)CC(CC1CCN(Cc3ccccc3)CC1)=C2.Cl | intermediate | 363.501 | 4.9456 | 0.0 | 3.0 | 5.356547 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 442 | 442 | CHEMBL4543223 | COc1ccc(-c2coc3cc(OCCOCCN4CCCCC4)ccc3c2=O)cc1 | inactive | 423.509 | 4.3499 | 0.0 | 6.0 | 4.301030 |
| 443 | 443 | CHEMBL4516496 | CCN(C)CCCOc1ccc2c(=O)c(-c3ccc(OC)cc3)coc2c1 | inactive | 367.445 | 4.1892 | 0.0 | 5.0 | 4.645892 |
| 444 | 444 | CHEMBL4586884 | O=c1c(-c2ccc(OCCN3CCCCC3)cc2)coc2cc(OCCN3CCCC3... | active | 462.590 | 4.7993 | 0.0 | 6.0 | 6.148742 |
| 445 | 445 | CHEMBL4554976 | CCN(C)CCCCOc1ccc2c(=O)c(-c3ccc(OC)cc3)coc2c1 | intermediate | 381.472 | 4.5793 | 0.0 | 5.0 | 5.023650 |


In [4]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
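
The cell above writes one molecule per line, with the SMILES string first and the ChEMBL ID second, separated by a tab and with no header row. Here is a small sketch of that layout on a two-row toy DataFrame. The IDs `MOL_A` and `MOL_B` are made-up placeholders, not rows from the dataset, and the output goes to an in-memory buffer instead of `molecule.smi`.

```python
import io
import pandas as pd

# Toy stand-in for df3; these rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'molecule_chembl_id': ['MOL_A', 'MOL_B'],
    'canonical_smiles': ['CCO', 'c1ccccc1'],
})

# Same column order and to_csv options as the notebook cell,
# but written to a string buffer so the result can be inspected.
buffer = io.StringIO()
toy[['canonical_smiles', 'molecule_chembl_id']].to_csv(
    buffer, sep='\t', index=False, header=False
)
smi_lines = buffer.getvalue().splitlines()

# Each line is "<SMILES><TAB><identifier>".
for line in smi_lines:
    print(repr(line))
```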

In [5]:
! cat molecule.smi | head -5

COc1cc2c(cc1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2.Cl	CHEMBL1678
COc1cc2c(c(OC)c1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2.Cl	CHEMBL552871
COc1cc2c(cc1OC)C(=O)C(CCCC1CCN(Cc3ccccc3)CC1)C2	CHEMBL145446
Cl.O=C1c2ccccc2CCC1CC1CCN(Cc2ccccc2)CC1	CHEMBL538050
COc1cc2c(cc1OC)CC(CC1CCN(Cc3ccccc3)CC1)=C2.Cl	CHEMBL543923


In [6]:
! cat molecule.smi | wc -l

     447


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [7]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
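
As the flag names suggest, this command asks PaDEL-Descriptor to strip salts, standardize nitro groups, and compute the fingerprints listed in `PubchemFingerprinter.xml` for the structure files in the current directory, writing the result to `descriptors_output.csv`. If you would rather drive the same call from Python, the sketch below rebuilds the identical argument list. The helper name `build_padel_command` is hypothetical, and actually running the command assumes Java is installed and `padel.zip` has been unzipped.

```python
def build_padel_command(
    jar='./PaDEL-Descriptor/PaDEL-Descriptor.jar',
    descriptor_xml='./PaDEL-Descriptor/PubchemFingerprinter.xml',
    input_dir='./',
    output_csv='descriptors_output.csv',
):
    """Return an argument list equivalent to the command in padel.sh."""
    return [
        'java', '-Xms1G', '-Xmx1G', '-Djava.awt.headless=true',
        '-jar', jar,
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', descriptor_xml,
        '-dir', input_dir,
        '-file', output_csv,
    ]

command = build_padel_command()
print(' '.join(command))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess
# subprocess.run(command, check=True)
```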


In [8]:
! bash padel.sh

Processing CHEMBL1678 in molecule.smi (1/447). 
Processing CHEMBL552871 in molecule.smi (2/447). 
Processing CHEMBL145446 in molecule.smi (3/447). 
Processing CHEMBL538050 in molecule.smi (4/447). 
Processing CHEMBL544629 in molecule.smi (6/447). Average speed: 2.78 s/mol.
Processing CHEMBL543923 in molecule.smi (5/447). Average speed: 2.74 s/mol.
Processing CHEMBL545317 in molecule.smi (7/447). Average speed: 1.41 s/mol.
Processing CHEMBL554887 in molecule.smi (8/447). Average speed: 0.95 s/mol.
Processing CHEMBL542253 in molecule.smi (9/447). Average speed: 0.82 s/mol.
Processing CHEMBL555107 in molecule.smi (10/447). Average speed: 0.67 s/mol.
Processing CHEMBL542744 in molecule.smi (11/447). Average speed: 0.59 s/mol.
Processing CHEMBL545560 in molecule.smi (12/447). Average speed: 0.51 s/mol.
Processing CHEMBL144477 in molecule.smi (13/447). Average speed: 0.50 s/mol.
Processing CHEMBL555358 in molecule.smi (14/447). Average speed: 0.45 s/mol.
Processing CHEMBL356333 in molecule.s

Processing CHEMBL138442 in molecule.smi (110/447). Average speed: 0.15 s/mol.
Processing CHEMBL141515 in molecule.smi (111/447). Average speed: 0.14 s/mol.
Processing CHEMBL105874 in molecule.smi (112/447). Average speed: 0.14 s/mol.
Processing CHEMBL141352 in molecule.smi (113/447). Average speed: 0.14 s/mol.
Processing CHEMBL140770 in molecule.smi (114/447). Average speed: 0.14 s/mol.
Processing CHEMBL342413 in molecule.smi (115/447). Average speed: 0.14 s/mol.
Processing CHEMBL141276 in molecule.smi (116/447). Average speed: 0.14 s/mol.
Processing CHEMBL140936 in molecule.smi (117/447). Average speed: 0.14 s/mol.
Processing CHEMBL141042 in molecule.smi (118/447). Average speed: 0.14 s/mol.
Processing CHEMBL140328 in molecule.smi (119/447). Average speed: 0.14 s/mol.
Processing CHEMBL139353 in molecule.smi (120/447). Average speed: 0.14 s/mol.
Processing CHEMBL138552 in molecule.smi (121/447). Average speed: 0.14 s/mol.
Processing CHEMBL140990 in molecule.smi (122/447). Average speed

Processing CHEMBL2159421 in molecule.smi (215/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159420 in molecule.smi (216/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159419 in molecule.smi (217/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159418 in molecule.smi (218/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159417 in molecule.smi (219/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159428 in molecule.smi (220/447). Average speed: 0.13 s/mol.
Processing CHEMBL2159427 in molecule.smi (221/447). Average speed: 0.14 s/mol.
Processing CHEMBL448799 in molecule.smi (224/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159426 in molecule.smi (222/447). Average speed: 0.14 s/mol.
Processing CHEMBL2159425 in molecule.smi (223/447). Average speed: 0.13 s/mol.
Processing CHEMBL345124 in molecule.smi (225/447). Average speed: 0.13 s/mol.
Processing CHEMBL3087804 in molecule.smi (226/447). Average speed: 0.13 s/mol.
Processing CHEMBL3087803 in molecule.smi (227/447). Av

Processing CHEMBL3891772 in molecule.smi (319/447). Average speed: 0.11 s/mol.
Processing CHEMBL3960783 in molecule.smi (320/447). Average speed: 0.11 s/mol.
Processing CHEMBL3918754 in molecule.smi (321/447). Average speed: 0.11 s/mol.
Processing CHEMBL3957522 in molecule.smi (323/447). Average speed: 0.11 s/mol.
Processing CHEMBL502 in molecule.smi (322/447). Average speed: 0.11 s/mol.
Processing CHEMBL3929969 in molecule.smi (324/447). Average speed: 0.11 s/mol.
Processing CHEMBL3986296 in molecule.smi (325/447). Average speed: 0.11 s/mol.
Processing CHEMBL3954226 in molecule.smi (326/447). Average speed: 0.11 s/mol.
Processing CHEMBL3893350 in molecule.smi (327/447). Average speed: 0.11 s/mol.
Processing CHEMBL3897047 in molecule.smi (328/447). Average speed: 0.11 s/mol.
Processing CHEMBL3985617 in molecule.smi (329/447). Average speed: 0.11 s/mol.
Processing CHEMBL3946681 in molecule.smi (330/447). Average speed: 0.11 s/mol.
Processing CHEMBL3914759 in molecule.smi (331/447). Aver

Processing CHEMBL4535870 in molecule.smi (423/447). Average speed: 0.11 s/mol.
Processing CHEMBL4592519 in molecule.smi (424/447). Average speed: 0.11 s/mol.
Processing CHEMBL4526265 in molecule.smi (425/447). Average speed: 0.11 s/mol.
Processing CHEMBL4458241 in molecule.smi (426/447). Average speed: 0.11 s/mol.
Processing CHEMBL4547392 in molecule.smi (427/447). Average speed: 0.11 s/mol.
Processing CHEMBL4458007 in molecule.smi (431/447). Average speed: 0.11 s/mol.
Processing CHEMBL4450711 in molecule.smi (430/447). Average speed: 0.11 s/mol.
Processing CHEMBL4550487 in molecule.smi (428/447). Average speed: 0.11 s/mol.
Processing CHEMBL4461381 in molecule.smi (429/447). Average speed: 0.11 s/mol.
Processing CHEMBL4568293 in molecule.smi (435/447). Average speed: 0.11 s/mol.
Processing CHEMBL4568961 in molecule.smi (432/447). Average speed: 0.11 s/mol.
Processing CHEMBL4559958 in molecule.smi (433/447). Average speed: 0.11 s/mol.
Processing CHEMBL4554072 in molecule.smi (434/447). 

In [None]:
! ls -l

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X
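
Once the `Name` column is dropped, every remaining column should be a fingerprint bit holding only 0 or 1. The following sketch shows one way to check that before modeling. It runs on toy data; the column names `PubchemFP0`, `PubchemFP1`, and `PubchemFP2` are assumed for illustration, and the helper `non_binary_columns` is hypothetical.

```python
import pandas as pd

def non_binary_columns(fingerprints: pd.DataFrame) -> list:
    """Return the names of columns holding anything other than 0 or 1."""
    is_binary = fingerprints.isin([0, 1]).all(axis=0)
    return list(is_binary.index[~is_binary])

# Toy fingerprint matrix with two deliberately broken columns.
toy_X = pd.DataFrame({
    'PubchemFP0': [1, 0, 1],
    'PubchemFP1': [0, 0, 2],     # 2 is not a valid bit
    'PubchemFP2': [1, 1, None],  # missing value
})

print(non_binary_columns(toy_X))
```

An empty list means the matrix is purely binary; on the real data you would call `non_binary_columns(df3_X)`.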

### **Y variable**

The pIC50 values are already present in the loaded file, so we simply select that column as the Y variable.

In [None]:
df3_Y = df3['pIC50']
df3_Y

## **Combining X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3
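
Note that `pd.concat` with `axis=1` pairs rows by position, so it assumes the rows of `descriptors_output.csv` are in the same order as `df3`. The PaDEL log above reports a few molecules out of sequence (for example 6/447 before 5/447), so it is worth checking that assumption. Here is a sketch of an alternative that aligns rows by identifier instead. It assumes the `Name` column in the descriptor file holds the ChEMBL IDs written to `molecule.smi`, and it would need to run before `Name` is dropped. The data below are toy placeholders.

```python
import pandas as pd

# Toy stand-ins: descriptor rows arrive in a different order than the bioactivity rows.
descriptors = pd.DataFrame({
    'Name': ['MOL_B', 'MOL_A', 'MOL_C'],
    'PubchemFP0': [0, 1, 1],
    'PubchemFP1': [1, 0, 1],
})
bioactivity = pd.DataFrame({
    'molecule_chembl_id': ['MOL_A', 'MOL_B', 'MOL_C'],
    'pIC50': [8.2, 5.7, 4.3],
})

# Join on the identifier rather than on row position.
# validate='one_to_one' raises an error if an ID appears more than once on either side.
merged = descriptors.merge(
    bioactivity,
    left_on='Name',
    right_on='molecule_chembl_id',
    how='inner',
    validate='one_to_one',
)

# Keep only the fingerprint bits and the target column.
aligned = merged.drop(columns=['Name', 'molecule_chembl_id'])
print(aligned)
```

If the real data contain repeated ChEMBL IDs, the `validate` check will fail and the duplicates would need to be resolved first.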

In [None]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**