# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Based on tutorial by Chanin Nantasenamat, [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset, and assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
%%bash
wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip --directory-prefix ./PaDEL/

--2025-09-04 02:52:53--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2025-09-04 02:52:54--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘./PaDEL/fingerprints_xml.zip’

     0K ..........                                            100%  617K=0.02s

2025-09-04 02:52:55 (617 KB/s) - ‘./PaDEL/fingerprints_xml.zip’ saved [10871/10871]



In [1]:
! pip install padelpy
# -Force overwrites files left over from a previous extraction
! powershell Expand-Archive -Force -Path ./PaDEL/fingerprints_xml.zip -DestinationPath ./PaDEL/fingerprints_xml
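
The `powershell Expand-Archive` call is Windows-only. As a rough cross-platform alternative, the archive could be unpacked with Python's standard `zipfile` module, which overwrites files that already exist in the destination. This is only a sketch: the helper name `extract_archive` is hypothetical, and the paths are the ones used in this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unpack zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # existing files are overwritten
        return sorted(archive.namelist())


if __name__ == "__main__":
    zip_file = Path("./PaDEL/fingerprints_xml.zip")
    # Only attempt extraction if the download step has already been run.
    if zip_file.exists():
        names = extract_archive(zip_file, "./PaDEL/fingerprints_xml")
        print(f"Extracted {len(names)} files")
```

Because extraction overwrites in place, the cell can be re-run safely after a partial or repeated download.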

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **acetylcholinesterase_bioactivity_3class_data.csv** file, which contains the pIC50 values that will serve as the target for the regression model.

In [6]:
import pandas as pd

In [7]:
df3 = pd.read_csv('./data/acetylcholinesterase_bioactivity_3class_data.csv')

In [4]:
df3

Unnamed: 0,molecule_chembl_id,bioactivity_class,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL1834807,intermediate,CCCCC/C=C\C/C=C\CCCCCCCC(=O)OCCCc1ccc2oc(-c3cc...,558.759,10.11780,0.0,5.0,5.568636
1,CHEMBL5188500,active,CC(=O)N1N=C(c2ccc(-c3ccccc3)cc2)CC1c1ccc2c(c1)...,384.435,4.77990,0.0,4.0,6.005243
2,CHEMBL491358,active,CCOC(=O)C1=C(C)Nc2nc3c(c(N)c2C1c1ccc(OC)c(OC)c...,423.513,3.95430,2.0,7.0,7.346787
3,CHEMBL5199361,active,CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1,326.400,2.77360,0.0,5.0,6.610834
4,CHEMBL2158994,intermediate,CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1,295.382,3.52310,0.0,3.0,5.329754
...,...,...,...,...,...,...,...,...
6608,CHEMBL310918,active,O=C(CCC1CCN(Cc2cccc([N+](=O)[O-])c2)CC1)c1ccc2...,393.487,4.43790,1.0,5.0,7.193820
6609,CHEMBL539571,inactive,C#CCNC1CCc2ccc(OC(=O)N(CC)CCCC)cc21,314.429,3.51750,1.0,3.0,4.140261
6610,CHEMBL130738,inactive,Cc1[nH]c(C)c(/C=C2\CN(Cc3ccccc3)CCC2=O)c1C=O,322.408,3.30244,1.0,3.0,4.522879
6611,CHEMBL4453051,intermediate,c1ccc(CNC2CCN(Cc3ccccc3)CC2)cc1,280.415,3.44080,1.0,2.0,5.769551


In [5]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('./PaDEL/molecule.smi', sep='\t', index=False, header=False)

In [6]:
%%bash

head -n 5 ./PaDEL/molecule.smi

CCCCC/C=C\C/C=C\CCCCCCCC(=O)OCCCc1ccc2oc(-c3ccc4c(c3)OCO4)cc2c1	CHEMBL1834807
CC(=O)N1N=C(c2ccc(-c3ccccc3)cc2)CC1c1ccc2c(c1)OCO2	CHEMBL5188500
CCOC(=O)C1=C(C)Nc2nc3c(c(N)c2C1c1ccc(OC)c(OC)c1)CCCC3	CHEMBL491358
CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1	CHEMBL5199361
CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1	CHEMBL2158994


In [7]:
%%bash

wc -l < ./PaDEL/molecule.smi

6613
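
The `molecule.smi` file holds one molecule per line: a SMILES string, a tab, then the ChEMBL ID. Beyond counting lines, a quick check in Python can confirm that every line really has two non-empty fields. This is a minimal sketch; `check_smi_lines` is a hypothetical helper, and the first two sample rows are copied from the output above.

```python
def check_smi_lines(lines):
    """Return the 1-based numbers of lines that are not 'SMILES<TAB>ID'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            bad.append(number)
    return bad


sample = [
    "CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1\tCHEMBL5199361\n",
    "CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1\tCHEMBL2158994\n",
    "CCO\n",  # deliberately malformed: the identifier is missing
]
print(check_smi_lines(sample))  # [3]
```

To run it on the real file, pass an open file handle, for example `check_smi_lines(open('./PaDEL/molecule.smi'))`; an empty list means every line is well formed.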


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**


padeldescriptor(mol_dir='molecule.smi', d_file='fp_descriptor_output.csv', descriptortypes=fp_descriptortype, 
        standardizenitro=True, threads=2, removesalt=True, fingerprints=True, detectaromaticity=True, standardizetautomers=True)

mol_dir: \
the `molecule.smi` file created above, containing the canonical_smiles column followed by the molecule_chembl_id column.

d_file: \
the output CSV file that receives the calculated descriptors.

removesalt: \
remove salts and other small fragments from each structure. This was already done earlier in this series (df_clean_smiles), so here it acts as a safeguard.

standardizenitro: \
standardize nitro groups to N(:O):O.

threads: \
the maximum number of threads to use. By default, PaDEL uses as many threads as there are CPU cores.

fingerprints: \
compute molecular fingerprints.

descriptortypes: \
the XML file that selects the fingerprint type. Here, PubchemFingerprinter.xml sets the PubChem fingerprint to true and every other fingerprint option to false.

detectaromaticity: \
discard any existing aromaticity information and detect aromaticity automatically before the descriptors are calculated.

standardizetautomers: \
standardize tautomers. This removes any 3D information from the molecules.
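
To make the `descriptortypes` idea concrete, the sketch below parses a small inline XML sample and lists the entries that are switched on. The layout shown (`Root` > `Group` > `Descriptor` with `name` and `value` attributes) is an assumption about how these files are structured and should be checked against the actual `PubchemFingerprinter.xml`; `enabled_descriptors` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed layout of a PaDEL descriptor-types file (shortened sample;
# verify against the real PubchemFingerprinter.xml before relying on it).
SAMPLE_XML = """
<Root>
    <Group name="Fingerprint">
        <Descriptor name="Fingerprinter" value="false"/>
        <Descriptor name="MACCSFingerprinter" value="false"/>
        <Descriptor name="PubchemFingerprinter" value="true"/>
    </Group>
</Root>
"""


def enabled_descriptors(xml_text):
    """Return names of Descriptor elements whose value attribute is 'true'."""
    root = ET.fromstring(xml_text.strip())
    return [
        node.get("name")
        for node in root.iter("Descriptor")
        if node.get("value", "").lower() == "true"
    ]


print(enabled_descriptors(SAMPLE_XML))  # ['PubchemFingerprinter']
```

Running the same helper on the contents of a different file from `fingerprints_xml` would show which fingerprint that file selects, provided the layout assumption holds.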



In [18]:
# import os

# os.environ["JAVA_HOME"] = r"C:\Program Files\Eclipse Adoptium\jdk-17"
# os.environ["PATH"] += os.pathsep + os.path.join(os.environ["JAVA_HOME"], "bin")

! java -version


openjdk version "17.0.16" 2025-07-15
OpenJDK Runtime Environment Temurin-17.0.16+8 (build 17.0.16+8)
OpenJDK 64-Bit Server VM Temurin-17.0.16+8 (build 17.0.16+8, mixed mode, sharing)
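
PaDEL-Descriptor is a Java program, so `padeldescriptor` needs a `java` executable on the `PATH`. A small pre-flight check along these lines can give a clearer message than a failed descriptor run. It is only a sketch, and `java_available` is a hypothetical helper.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if the executable is found on PATH and runs successfully."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not java_available():
    print("Java was not found on PATH; install a JDK or set JAVA_HOME first.")
```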


In [None]:
from padelpy import padeldescriptor

# XML file that switches on the PubChem fingerprint only
fp_descriptortype = './PaDEL/fingerprints_xml/PubchemFingerprinter.xml'

# Compute fingerprints for every molecule in molecule.smi and write them to a CSV file
padeldescriptor(mol_dir='./PaDEL/molecule.smi', d_file='./PaDEL/fp_descriptor_output.csv', descriptortypes=fp_descriptortype,
        standardizenitro=True, threads=2, removesalt=True, fingerprints=True, detectaromaticity=True, standardizetautomers=True)

In [1]:
%%bash

ls -l

total 6160
-rwxrwxrwx 1 txx99 txx99  720731 Sep  4 02:34 CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
-rwxrwxrwx 1 txx99 txx99  562304 Sep  4 15:10 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
-rwxrwxrwx 1 txx99 txx99 3556317 Sep  2 01:27 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rwxrwxrwx 1 txx99 txx99 1411819 Sep  4 03:42 CDD_ML_Part_4_Acetylcholinesterase_ML_Models.ipynb
drwxrwxrwx 1 txx99 txx99    4096 Sep  4 02:54 PaDEL
-rwxrwxrwx 1 txx99 txx99    1152 Sep  2 01:39 README.md
drwxrwxrwx 1 txx99 txx99    4096 Sep  2 01:21 data
-rwxrwxrwx 1 txx99 txx99   42092 Sep  3 14:31 regression_model_scatter_plot.pdf


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [2]:
import pandas as pd 
df3_X = pd.read_csv('./PaDEL/fp_descriptor_output.csv')

In [3]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1834807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL5188500,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL491358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL5199361,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL2158994,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6608,CHEMBL310918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6609,CHEMBL539571,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6610,CHEMBL130738,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6611,CHEMBL4453051,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
df3_X = df3_X.drop(columns=['Name'])  # drop the molecule ID column so only fingerprint bits remain
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6609,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6610,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6611,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
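
The PubChem fingerprint has 881 bits (`PubchemFP0` to `PubchemFP880`), and each bit should be 0 or 1. Before modelling, it can be worth confirming that the X matrix has that width and contains no missing or non-binary values. The sketch below uses a small synthetic frame rather than the real data, and `summarize_fingerprints` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def summarize_fingerprints(df):
    """Report column count, missing values, and whether all values are 0/1."""
    values = df.to_numpy(dtype=float)
    present = values[~np.isnan(values)]
    return {
        "n_bits": df.shape[1],
        "n_missing": int(np.isnan(values).sum()),
        "is_binary": bool(np.isin(present, [0.0, 1.0]).all()),
    }


# Synthetic stand-in for df3_X: 4 molecules x 881 random bits.
rng = np.random.default_rng(0)
columns = [f"PubchemFP{i}" for i in range(881)]
toy_X = pd.DataFrame(
    rng.integers(0, 2, size=(4, 881)).astype(float), columns=columns
)

print(summarize_fingerprints(toy_X))
```

On the real data the call would be `summarize_fingerprints(df3_X)`; a non-zero `n_missing` would point to rows that need attention before building a model.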


### **Y variable**

In [8]:
# here, pIC50 as Y variable
df3_Y = df3['pIC50']
df3_Y

0       5.568636
1       6.005243
2       7.346787
3       6.610834
4       5.329754
          ...   
6608    7.193820
6609    4.140261
6610    4.522879
6611    5.769551
6612    5.173925
Name: pIC50, Length: 6613, dtype: float64

## **Combine the X and Y Variables into One DataFrame**

In [9]:
# concat pIC50 column to fingerprint df
dataset3 = pd.concat([df3_X,df3_Y], axis=1) 
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.568636
1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.005243
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.346787
3,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.610834
4,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.329754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.193820
6609,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.140261
6610,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.522879
6611,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.769551
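
`pd.concat(..., axis=1)` pairs rows by index label, which here means by row position, so it is only correct if the fingerprint file lists the molecules in the same order as `df3`. That holds for the output shown earlier, but the `Name` column is dropped before the join, so nothing enforces it. A more defensive option, sketched here on made-up toy data, is to keep `Name` and merge on the ChEMBL ID instead.

```python
import pandas as pd

# Toy stand-ins: fingerprints arrive in a different order than the activities.
fingerprints = pd.DataFrame(
    {
        "Name": ["CHEMBL2", "CHEMBL1", "CHEMBL3"],
        "PubchemFP0": [1.0, 0.0, 1.0],
        "PubchemFP1": [0.0, 0.0, 1.0],
    }
)
activities = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL3"],
        "pIC50": [5.1, 6.2, 7.3],
    }
)

# Join on the identifier instead of relying on row order.
merged = fingerprints.merge(
    activities,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",  # raises MergeError if an ID repeats on either side
)
dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the real data contains repeated ChEMBL IDs, the `validate` check will raise, and the duplicates would need to be handled before merging.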


In [10]:
dataset3.to_csv('./data/acetylcholinesterase_bioactivity_3class_pubchem_fp.csv', index=False)