# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)


In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset, and assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
%%bash
wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip --directory-prefix ./PaDEL/

-bash: line 1: pip: command not found
--2025-09-02 01:26:14--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2025-09-02 01:26:14--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘./PaDEL/fingerprints_xml.zip.1’

     0K ..........                                            100% 1.63M=0.006s

2025-09-02 01:26:14 (1.63 MB/s) - ‘./PaDEL/fingerprints

In [23]:
! pip install padelpy
! powershell Expand-Archive -Path ./PaDEL/fingerprints_xml.zip -DestinationPath ./PaDEL/fingerprints_xml




[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip
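The `powershell Expand-Archive` call above only works where PowerShell is available. As a portable alternative, the archive can be unpacked with Python's standard `zipfile` module. This is a minimal sketch, not part of the original notebook: the helper name `extract_archive` is hypothetical, and the paths are the ones used in the cells above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Paths taken from the cells above; the step is skipped if the download has not run yet.
archive_path = Path("./PaDEL/fingerprints_xml.zip")
if archive_path.exists():
    print(extract_archive(archive_path, "./PaDEL/fingerprints_xml"))
```

Because the helper uses only the standard library, the same cell should run unchanged on Windows, Linux, and macOS.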


## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **acetylcholinesterase_bioactivity_3class_data.csv** file, which contains the pIC50 values used to build the regression model.

In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('./data/acetylcholinesterase_bioactivity_3class_data.csv')

In [5]:
df3

Unnamed: 0,molecule_chembl_id,bioactivity_class,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL133897,active,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,312.325,2.8032,0.0,6.0,6.124939
1,CHEMBL336398,active,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,376.913,4.5546,0.0,5.0,7.000000
2,CHEMBL131588,inactive,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,426.851,5.3574,0.0,5.0,4.301030
3,CHEMBL130628,active,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,404.845,4.7069,0.0,5.0,6.522879
4,CHEMBL130478,active,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,346.334,3.0953,0.0,6.0,6.096910
...,...,...,...,...,...,...,...,...
8122,CHEMBL5398421,inactive,COc1cc(O)c2c(c1)C(=O)c1cc(O)c(O)cc1CCN2,301.298,2.0110,4.0,6.0,4.337242
8123,CHEMBL11298,inactive,N[C@@H](CO)C(=O)O,105.093,-1.6094,3.0,3.0,4.416688
8124,CHEMBL5395312,intermediate,CN1CCN(c2ccc(C(=O)Nc3cc(-c4nc5ccccc5[nH]4)n[nH...,401.474,2.9571,3.0,5.0,5.767004
8125,CHEMBL5399112,inactive,O=C(Nc1cc(-c2nc3ccccc3[nH]2)n[nH]1)c1ccc(N2CCN...,387.447,2.6149,4.0,5.0,5.000000


In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('./PaDEL/molecule.smi', sep='\t', index=False, header=False)

In [7]:
%%bash

cat ./PaDEL/molecule.smi | head -5


In [8]:
%%bash

cat ./PaDEL/molecule.smi | wc -l
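The same check can be done in Python, which avoids depending on a shell and on the notebook's working directory. The sketch below is illustrative only: it writes a two-row toy table (rows copied from the `df3` output above) to a temporary `.smi` file with the same `to_csv` options, then reads back the first lines and the line count. The helper name `preview_smi` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df3[selection]; two rows copied from the table above.
toy = pd.DataFrame({
    "canonical_smiles": ["CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1", "N[C@@H](CO)C(=O)O"],
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL11298"],
})


def preview_smi(path, n=5):
    """Return (first n lines, total line count) of a text file."""
    lines = Path(path).read_text().splitlines()
    return lines[:n], len(lines)


with tempfile.TemporaryDirectory() as tmp:
    smi_path = Path(tmp) / "molecule.smi"
    # Same options as the notebook: tab-separated, no index, no header row.
    toy.to_csv(smi_path, sep="\t", index=False, header=False)
    head, n_lines = preview_smi(smi_path)

print(head)
print(n_lines)
```

On the real data, `preview_smi('./PaDEL/molecule.smi')` should report one line per row of `df3`.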


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**


padeldescriptor(mol_dir='molecule.smi', d_file='fp_descriptor_output.csv', descriptortypes=fp_descriptortype, 
        standardizenitro=True, threads=2, removesalt=True, fingerprints=True, detectaromaticity=True, standardizetautomers=True)

mol_dir: \
the molecule.smi file created earlier; each line holds a canonical_smiles value followed by its molecule_chembl_id

d_file: \
path of the output file that the descriptors are written to

removesalt: \
remove salts/small organic molecules from the chemical structure; this was already done upstream with df_clean_smiles

standardizenitro: \
standardize nitro groups to N(:O):O

threads: \
maximum number of threads to use; by default, as many threads as there are CPU cores are used

fingerprints: \
set to True to compute molecular fingerprints

descriptortypes: \
path to the XML file that selects the fingerprint type; here, PubchemFingerprinter.xml sets 'pubchem' to True and all other fingerprint options to False

detectaromaticity: \
remove existing aromaticity information and automatically detect aromaticity in the molecule before the descriptors are calculated

standardizetautomers: \
standardize tautomers; this removes any 3D information from the molecules
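Since `descriptortypes` takes the path to one XML file, it can help to list the files that were unpacked into `./PaDEL/fingerprints_xml` before picking one. The following is a small sketch under that assumption; the helper name `list_fingerprint_xml` is hypothetical, and the only file name confirmed by this notebook is `PubchemFingerprinter.xml`.

```python
from pathlib import Path


def list_fingerprint_xml(xml_dir):
    """Map each descriptor-type file's stem (e.g. 'PubchemFingerprinter') to its path."""
    return {p.stem: str(p) for p in sorted(Path(xml_dir).glob("*.xml"))}


# Directory used by the unzip step above; an empty dict means nothing was found there.
fp_files = list_fingerprint_xml("./PaDEL/fingerprints_xml")
for name, path in fp_files.items():
    print(name, "->", path)
```

A value from this dictionary, for example `fp_files['PubchemFingerprinter']`, could then be passed as `descriptortypes`.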



In [9]:
# import os

# os.environ["JAVA_HOME"] = r"C:\Program Files\Eclipse Adoptium\jdk-17"
# os.environ["PATH"] += os.pathsep + os.path.join(os.environ["JAVA_HOME"], "bin")

! java -version


openjdk version "17.0.16" 2025-07-15
OpenJDK Runtime Environment Temurin-17.0.16+8 (build 17.0.16+8)
OpenJDK 64-Bit Server VM Temurin-17.0.16+8 (build 17.0.16+8, mixed mode, sharing)
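`padelpy` runs the PaDEL-Descriptor Java program, so a `java` executable has to be reachable from the notebook process. The check above can also be done from Python before the descriptor calculation starts. This is a minimal sketch using only the standard library; the helper name `java_available` is hypothetical.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if java_available():
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
else:
    print("java not found on PATH; set JAVA_HOME and PATH before calling padeldescriptor")
```

If the check fails, the commented-out `os.environ` lines in the cell above show one way to point the session at a local JDK.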


In [None]:
from padelpy import padeldescriptor
fp_descriptortype = './PaDEL/fingerprints_xml/PubchemFingerprinter.xml'

padeldescriptor(mol_dir='./PaDEL/molecule.smi', d_file='./PaDEL/fp_descriptor_output.csv', descriptortypes=fp_descriptortype, 
        standardizenitro=True, threads=2, removesalt=True, fingerprints=True, detectaromaticity=True, standardizetautomers=True)

In [11]:
%%bash

ls -l

total 5712
-rwxrwxrwx 1 txx99 txx99  465491 Sep  2 01:14 CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
-rwxrwxrwx 1 txx99 txx99  566429 Sep  2 01:22 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
-rwxrwxrwx 1 txx99 txx99 3556801 Sep  2 01:00 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rwxrwxrwx 1 txx99 txx99 1247105 Sep  2 01:00 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
drwxrwxrwx 1 txx99 txx99    4096 Sep  2 01:23 PaDEL
-rwxrwxrwx 1 txx99 txx99    1124 Sep  2 00:56 README.md
drwxrwxrwx 1 txx99 txx99    4096 Sep  2 01:21 data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
import pandas as pd 
df3_X = pd.read_csv('./PaDEL/fp_descriptor_output.csv')

In [13]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL133897,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL336398,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL131588,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL130628,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL130478,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8122,CHEMBL5398421,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8123,CHEMBL11298,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8124,CHEMBL5395312,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8125,CHEMBL5399112,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
df3_X = df3_X.drop(columns=['Name'])  # drop the molecule ID column; the fingerprint columns remain
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8122,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8123,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8124,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8125,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
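Before using the matrix for modelling, it is worth confirming that it has the expected shape and content. PubChem fingerprints have 881 bits, which matches the columns `PubchemFP0` to `PubchemFP880` shown above, and every entry should be 0 or 1. The sketch below runs such a check on a toy frame; the helper name `check_fingerprint_matrix` and its parameters are hypothetical.

```python
import pandas as pd


def check_fingerprint_matrix(X, prefix="PubchemFP", n_bits=881):
    """Summarise whether X looks like a complete binary fingerprint matrix."""
    fp_cols = [c for c in X.columns if c.startswith(prefix)]
    return {
        "n_fp_columns": len(fp_cols),
        "has_all_bits": len(fp_cols) == n_bits,
        "is_binary": bool(X[fp_cols].isin([0, 1]).all().all()),
        # Any column that is not a fingerprint bit (e.g. a leftover ID or index column).
        "extra_columns": [c for c in X.columns if c not in fp_cols],
    }


# Toy matrix with three bits; on the real data, keep the default n_bits=881.
toy_X = pd.DataFrame({"PubchemFP0": [1, 1], "PubchemFP1": [1, 0], "PubchemFP2": [1, 0]})
report = check_fingerprint_matrix(toy_X, n_bits=3)
print(report)
```

On the real data the call would be `check_fingerprint_matrix(df3_X)`; a non-empty `extra_columns` list points to columns that should probably be dropped before model building.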


### **Y variable**

In [15]:
# use pIC50 as the Y variable
df3_Y = df3['pIC50']
df3_Y

0       6.124939
1       7.000000
2       4.301030
3       6.522879
4       6.096910
          ...   
8122    4.337242
8123    4.416688
8124    5.767004
8125    5.000000
8126    5.000000
Name: pIC50, Length: 8127, dtype: float64

### **Combining the X and Y variables**

In [16]:
# concat pIC50 column to fingerprint df
dataset3 = pd.concat([df3_X,df3_Y], axis=1) 
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.124939
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.522879
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.096910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8122,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.337242
8123,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.416688
8124,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.767004
8125,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
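`pd.concat(..., axis=1)` pairs rows by index, so it is only correct when the fingerprint table has exactly the same compounds in the same order as `df3`. The `df3_Y` output above reports 8127 values, so comparing that number with the row count of the fingerprint table is a sensible first check. A more defensive alternative is to join on the compound ID before dropping the `Name` column. The sketch below shows the idea on toy data; it is an assumption-laden illustration, not the notebook's method.

```python
import pandas as pd

# Toy fingerprint table, deliberately in a different row order and missing one compound.
fp = pd.DataFrame({
    "Name": ["CHEMBL336398", "CHEMBL133897"],
    "PubchemFP0": [1, 1],
    "PubchemFP2": [1, 1],
})

# Toy bioactivity table; values copied from the df3 output above.
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398", "CHEMBL131588"],
    "pIC50": [6.124939, 7.000000, 4.301030],
})

# Join on the ID so each fingerprint row gets the pIC50 of the same compound.
merged = fp.merge(bio, left_on="Name", right_on="molecule_chembl_id", how="inner")
merged = merged.drop(columns=["Name", "molecule_chembl_id"])

print(merged)
print(len(fp), len(bio), len(merged))
```

With an inner join, compounds missing from either table are dropped rather than silently paired with the wrong value. If an ID occurs more than once in either table, the merge produces one row per pairing, so duplicates should be checked first.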


In [17]:
dataset3.to_csv('./data/acetylcholinesterase_bioactivity_3class_pubchem_fp.csv', index=False)
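For the model building in Part 4, the saved file can be read back and split into the X matrix and the Y vector again. The sketch below shows the round trip on a tiny in-memory CSV with the same layout (fingerprint columns followed by `pIC50`); the two-column toy data is an assumption made for brevity.

```python
import io

import pandas as pd

# Tiny stand-in for the saved CSV: fingerprint columns first, pIC50 last.
csv_text = "PubchemFP0,PubchemFP1,pIC50\n1,1,6.124939\n1,0,4.416688\n"

dataset = pd.read_csv(io.StringIO(csv_text))
X = dataset.drop(columns=["pIC50"])  # every column except the target
Y = dataset["pIC50"]                 # the regression target

print(X.shape, Y.shape)
```

On the real file, `io.StringIO(csv_text)` would be replaced by the path passed to `to_csv` above.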