# Computational Drug Discovery - Part 3 - Descriptor Calculation & Dataset Preparation

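In this part, we compute PubChem fingerprints for all the compounds associated with our target protein. PubChem fingerprints encode the presence or absence of predefined substructures as a binary vector (here 881 bits, PubchemFP0 to PubchemFP880), a representation widely used in medicinal chemistry to relate chemical structure to biological activity.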

## Download PaDEL-Descriptor

PaDEL-Descriptor and its padel.sh script are already installed, so all that remains is to prepare the input data and run the script from the terminal.

## Load the Bioactivity data 

We simply load the data from part 2. The important column is canonical_smiles, since the SMILES strings are the input from which the fingerprints are calculated.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('pfHT1_biological_data_pIC50.csv')

In [3]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL2028051,O=C(NCC(c1ccsc1)N1CCCCCC1)c1ccc(C(F)(F)F)cc1,inactive,396.478,5.11400,1.0,3.0,4.920819
1,CHEMBL1459149,CCN1CCCC1CNc1[nH]cnc2c3cc(Cl)ccc3nc1-2,inactive,329.835,3.61230,2.0,4.0,4.920819
2,CHEMBL2028052,COc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)n...,inactive,492.910,5.80790,1.0,6.0,4.920819
3,CHEMBL2028053,Cc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nc...,inactive,476.911,6.10772,1.0,5.0,4.945578
4,CHEMBL2028054,Cc1cc(NC(=O)c2cccc(C(F)(F)F)c2)n(-c2nc(-c3ccc4...,inactive,472.448,5.30402,1.0,7.0,4.920819
...,...,...,...,...,...,...,...,...
787,CHEMBL2028046,Cc1sc(NC(=O)c2ccccc2)c(C(c2cccs2)N2CCN(c3ccccc...,inactive,487.694,6.59034,1.0,5.0,4.920819
788,CHEMBL2028047,Cc1sc(NC(=O)c2ccco2)c(C(c2cccnc2)N2CCC(Cc3cccc...,inactive,485.653,6.64934,1.0,5.0,4.920819
789,CHEMBL2028048,CCOC(=O)c1c(C)n(-c2ccccc2)c2ccc(OC(=O)c3cc(OC)...,inactive,489.524,5.36062,0.0,8.0,4.920819
790,CHEMBL2028049,COc1cc(C(=O)Nc2nc3c(cc4c5c(cccc53)CC4)s2)cc(OC...,inactive,420.490,4.82620,1.0,6.0,4.920819


In [4]:
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [5]:
cat molecule.smi | head -3

O=C(NCC(c1ccsc1)N1CCCCCC1)c1ccc(C(F)(F)F)cc1	CHEMBL2028051
CCN1CCCC1CNc1[nH]cnc2c3cc(Cl)ccc3nc1-2	CHEMBL1459149
COc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nccn3n2)cc1	CHEMBL2028052


In [6]:
cat molecule.smi | wc -l

792
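
The same check can be done from Python. The sketch below is only an illustration on a two-molecule toy table (the file name toy_molecule.smi and the variable names are made up for this example): it writes the SMILES file in the same layout as above (SMILES first, ID second, tab-separated, no header) and reads it back to confirm that no field is missing and no ID is repeated.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the bioactivity table loaded from part 2.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL2028051", "CHEMBL1459149"],
    "canonical_smiles": [
        "O=C(NCC(c1ccsc1)N1CCCCCC1)c1ccc(C(F)(F)F)cc1",
        "CCN1CCCC1CNc1[nH]cnc2c3cc(Cl)ccc3nc1-2",
    ],
})

# Write to a temporary folder so the example does not touch the real files.
path = os.path.join(tempfile.mkdtemp(), "toy_molecule.smi")

# SMILES first, ID second, tab-separated, no header and no index column.
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    path, sep="\t", index=False, header=False
)

# Read the file back and run a few basic checks.
smi = pd.read_csv(path, sep="\t", header=None, names=["smiles", "name"])
n_missing = int(smi.isna().sum().sum())              # empty fields
n_duplicate_ids = int(smi["name"].duplicated().sum())  # repeated IDs
print(len(smi), n_missing, n_duplicate_ids)
```

On the real data, the number of rows read back should equal the line count reported by wc -l.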


## Calculate Fingerprint Descriptors

In [17]:
cat padel.sh

#!/bin/bash
java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
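
If you would rather launch PaDEL from Python than from a shell script, one possible approach is to build the same command as a list and hand it to subprocess. This is a sketch, not part of the original workflow: the helper names build_padel_command and run_padel are hypothetical, and the flags are copied unchanged from padel.sh above.

```python
import subprocess
from pathlib import Path


def build_padel_command(padel_dir="./PaDEL-Descriptor", input_dir="./",
                        output_csv="descriptors_output.csv", heap="1G"):
    """Return the java command from padel.sh as a list of arguments."""
    padel_dir = Path(padel_dir)
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", str(padel_dir / "PaDEL-Descriptor.jar"),
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", str(padel_dir / "PubchemFingerprinter.xml"),
        "-dir", input_dir,
        "-file", output_csv,
    ]


def run_padel(cmd):
    """Run the command; needs Java and the PaDEL files to be present."""
    subprocess.run(cmd, check=True)


cmd = build_padel_command()
print(" ".join(cmd))
# run_padel(cmd)  # uncomment to start the calculation
```

Passing the arguments as a list avoids shell quoting problems, and check=True raises an error if Java exits with a non-zero status.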


In [25]:
#%%bash
#bash padel.sh
# I ran the script from the terminal instead, for faster execution

For the 792 molecules in my file, PaDEL reported:
Descriptor calculation completed in 4 mins 15.244 secs. Average speed: 0.32 s/mol.
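
As a quick sanity check, the reported average speed follows from the total run time and the 792 lines counted in molecule.smi earlier (variable names are just for illustration):

```python
n_molecules = 792              # line count of molecule.smi
elapsed_s = 4 * 60 + 15.244    # 4 mins 15.244 secs, in seconds
avg_s_per_mol = round(elapsed_s / n_molecules, 2)
print(avg_s_per_mol)  # 0.32
```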


The output is saved as descriptors_output.csv, as specified by the -file option in padel.sh.

In [26]:
ls

 bioactivity_data.csv
 boxplot_LogP.pdf
 boxplot_MW.pdf
 boxplot_Num_H_Acceptors.pdf
 boxplot_Num_H_Donors.pdf
 boxplt_pIC50.pdf
 CDD_Malaria_Bioactivity_Data.ipynb
 CDD_Malaria_Exploratory_Data_Analysis.ipynb
 CDD_Malaria_PaDEL_Descriptor.ipynb
 descriptors_output.csv
 [0m[01;34mExploratory_analysis[0m/
'Frequency plot.bioactivity class.pdf'
 mannwhitneyuLogP.csv
 mannwhitneyuMW.csv
 mannwhitneyuNumHAcceptors.csv
 mannwhitneyuNumHDonors.csv
 mannwhitneyupIC50.csv
 molecule.smi
 PaDEL-Descriptor/
 padel.sh*
 pfHT1_biological_data_pIC50.csv
 pfHT1_Preprocessed_biological_data.csv
 plot_MW_vs_LogP.pdf
 results


## Prepare the X and Y data matrices

For X, we keep only the PubChem fingerprint columns and drop the Name column.

In [27]:
df1 = pd.read_csv('descriptors_output.csv')
df1

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1459149,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL2028051,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL2028053,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL2028052,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL1622128,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,CHEMBL2028046,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
788,CHEMBL2028047,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
789,CHEMBL2028048,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
790,CHEMBL2028049,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
df_X = df1.drop(columns=['Name'])
df_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
788,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
789,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
790,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
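
Before using the fingerprints, it can help to confirm that the matrix contains only fingerprint columns and only 0/1 values. The sketch below shows one way to do that on a toy table with three bits (the toy_ names are made up for this example); on the real file the same filter should return 881 columns, PubchemFP0 to PubchemFP880.

```python
import pandas as pd

# Toy stand-in for descriptors_output.csv with only three fingerprint bits.
toy_desc = pd.DataFrame({
    "Name": ["CHEMBL1459149", "CHEMBL2028051"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [1, 0],
    "PubchemFP2": [0, 0],
})

# Keep only columns named PubchemFP<number>; this also drops 'Name'.
toy_X = toy_desc.filter(regex=r"^PubchemFP\d+$")

# Every entry should be 0 or 1 (a missing value would fail this check).
is_binary = bool(toy_X.isin([0, 1]).all().all())
print(toy_X.shape, is_binary)
```

Selecting columns by name pattern is a little more defensive than dropping Name, because it still works if the descriptor file ever gains extra non-fingerprint columns.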


In [31]:
df_Y = df.pIC50
df_Y

0      4.920819
1      4.920819
2      4.920819
3      4.945578
4      4.920819
         ...   
787    4.920819
788    4.920819
789    4.920819
790    4.920819
791    5.311580
Name: pIC50, Length: 792, dtype: float64

## Combine the 2 dataframes

In [32]:
dataset = pd.concat([df_X,df_Y], axis=1)
dataset

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.945578
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
788,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
789,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
790,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.920819
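
One caveat: pd.concat with axis=1 pairs rows by index position, and the Name column of descriptors_output.csv shown earlier is not in the same order as the bioactivity table (the descriptor file starts with CHEMBL1459149, the bioactivity table with CHEMBL2028051). Concatenating by position can therefore attach a pIC50 value to the wrong fingerprint. A safer alternative is to join the two tables on the compound ID. The sketch below illustrates this on toy data; the toy_ names and the pIC50 values 5.0 and 6.0 are invented for the example.

```python
import pandas as pd

# Toy fingerprints in the order PaDEL wrote them (differs from the input order).
toy_fp = pd.DataFrame({
    "Name": ["CHEMBL1459149", "CHEMBL2028051"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
})

# Toy bioactivity table in its original order, with made-up pIC50 values.
toy_bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL2028051", "CHEMBL1459149"],
    "pIC50": [5.0, 6.0],
})

# Join on the compound ID instead of relying on row position.
# validate="one_to_one" raises an error if an ID is repeated on either side.
aligned = toy_fp.merge(
    toy_bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns so only fingerprint bits and the target remain.
toy_dataset = aligned.drop(columns=["Name", "molecule_chembl_id"])
print(toy_dataset)
```

With how="inner", a compound that is missing from either table is dropped, so comparing len(toy_dataset) with the size of both inputs shows whether any molecule was lost.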


### Save the new dataset into a .csv file

In [34]:
dataset.to_csv('pfHT1_bioactivity_data_pIC50_pubchem_fingerprints.csv', index=False)
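
A short round trip is a cheap way to confirm the saved file can be split back into X and Y. This sketch uses a toy table and a temporary file (all names here are illustrative), not the real dataset:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the combined dataset.
toy_dataset = pd.DataFrame({
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
    "pIC50": [6.0, 5.0],
})

# index=False keeps the row index out of the file, so no extra
# 'Unnamed: 0' column appears when the file is read back.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.csv")
toy_dataset.to_csv(path, index=False)

# Reload and split into the feature matrix and the target vector.
reloaded = pd.read_csv(path)
X = reloaded.drop(columns=["pIC50"])
y = reloaded["pIC50"]
print(X.shape, y.shape)
```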