# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [15]:
pip install padelpy

Note: you may need to restart the kernel to use updated packages.


In [18]:
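from padelpy import from_smiles
import os

# Set JAVA_HOME if needed. PaDEL runs on Java; this path is specific to one
# macOS machine, so change it to match your own installation.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/openjdk-17.jdk/Contents/Home"

# Example: compute descriptors for propane ("CCC") and also write them to output.csv
descriptors = from_smiles("CCC", output_csv="output.csv")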

In [1]:
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
! unzip fingerprints_xml.zip

--2025-10-29 13:30:01--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2025-10-29 13:30:01--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘fingerprints_xml.zip’


2025-10-29 13:30:01 (16.0 MB/s) - ‘fingerprints_xml.zip’ saved [10871/10871]

Archive:  fingerprints_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFin

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [4]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2025-10-29 13:30:54--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.5’


2025-10-29 13:30:54 (8.14 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.5’ saved [655414/655414]



In [5]:
import pandas as pd

In [6]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [7]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,active,312.325,2.80320,0.0,6.0,6.124939
1,1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,active,376.913,4.55460,0.0,5.0,7.000000
2,2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,inactive,426.851,5.35740,0.0,5.0,4.301030
3,3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,active,404.845,4.70690,0.0,5.0,6.522879
4,4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,active,346.334,3.09530,0.0,6.0,6.096910
...,...,...,...,...,...,...,...,...,...
4690,4690,CHEMBL4293155,CC(C)(C)c1cc(/C=C/C(=O)NCCC2CCN(Cc3ccccc3Cl)CC...,intermediate,511.150,7.07230,2.0,3.0,5.612610
4691,4691,CHEMBL4282558,CC(C)(C)c1cc(/C=C/C(=O)NCCC2CCN(Cc3cccc(Cl)c3)...,intermediate,511.150,7.07230,2.0,3.0,5.595166
4692,4692,CHEMBL4281727,CC(C)(C)c1cc(/C=C/C(=O)NCCC2CCN(Cc3ccc(Br)cc3)...,intermediate,555.601,7.18140,2.0,3.0,5.419075
4693,4693,CHEMBL4292349,CC(C)(C)c1cc(/C=C/C(=O)NCCC2CCN(Cc3cccc([N+](=...,intermediate,521.702,6.32710,2.0,5.0,5.460924


In [14]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [15]:
! head -5 molecule.smi

CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1	CHEMBL133897
O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1	CHEMBL336398
CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1	CHEMBL131588
O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F	CHEMBL130628
CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C	CHEMBL130478


In [16]:
! cat molecule.smi | wc -l

    4695
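
The same write-then-inspect step can be sketched in pure Python, for environments where `cat`, `head`, and `wc` are not available. The two-row DataFrame and the temporary file path below are illustrative stand-ins, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Illustrative stand-in for df3 (two rows copied from the table above)
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398"],
    "canonical_smiles": [
        "CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1",
        "O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1",
    ],
})

# Write SMILES first, then the ID, tab-separated, with no header or index
smi_path = os.path.join(tempfile.mkdtemp(), "molecule_demo.smi")
demo[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    smi_path, sep="\t", index=False, header=False
)

# Read the file back: preview the first lines and count them
with open(smi_path) as handle:
    smi_lines = handle.read().splitlines()

print(smi_lines[:5])
print(len(smi_lines))
```

On the real `molecule.smi`, the line count should match the number of rows in `df3`.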


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [17]:
! cat padel.sh

cat: padel.sh: No such file or directory


In [28]:
#! bash padel.sh

bash: padel.sh: No such file or directory


In [29]:
#! ls -l

total 718808
-rw-r--r--  1 valeriaramosprado  staff    124747 Oct  1 13:55 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 valeriaramosprado  staff    404298 Oct 15 13:26 CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb
-rw-r--r--  1 valeriaramosprado  staff    129593 Oct 22 12:56 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rw-r--r--  1 valeriaramosprado  staff    100076 Oct  1 13:55 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-r--r--  1 valeriaramosprado  staff    230778 Oct  1 13:55 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb
-rwxr-xr-x  1 valeriaramosprado  staff  85055499 Feb 13  2025 Miniconda3-py37_4.8.2-Linux-x86_64.sh
-rw-r--r--  1 valeriaramosprado  staff  85055499 Feb 13  2025 Miniconda3-py37_4.8.2-Linux-x86_64.sh.1
-rw-r--r--  1 valeriaramosprado  staff  85055499 Feb 13  2025 Miniconda3-py37_4.8.2-Linux-x86_64.sh.2
-rw-r--r--  1 valeriaramosprad

In [15]:
from rdkit import Chem
from mordred import Calculator, descriptors
import pandas as pd

# Example SMILES only. For the real dataset, pass every structure instead,
# e.g. smiles_list = df3['canonical_smiles'].tolist()
smiles_list = ["CCC", "CCO", "CCN"]

# Convert to RDKit molecule objects (MolFromSmiles returns None for invalid SMILES)
mols = [Chem.MolFromSmiles(sm) for sm in smiles_list]

# Set up the Mordred calculator with 2D descriptors only
calc = Calculator(descriptors, ignore_3D=True)

# Calculate descriptors: one row per molecule, one column per descriptor
df = calc.pandas(mols)

# Save to CSV (this creates the file)
df.to_csv('descriptors_output.csv', index=False)

# Read it back to confirm the file was written
df3_X = pd.read_csv('descriptors_output.csv')
print(df3_X.head())


                                                 ABC  \
0  module 'numpy' has no attribute 'float'.\n`np....   
1  module 'numpy' has no attribute 'float'.\n`np....   
2  module 'numpy' has no attribute 'float'.\n`np....   

                                               ABCGG  nAcid  nBase   SpAbs_A  \
0  module 'numpy' has no attribute 'float'.\n`np....      0      0  2.828427   
1  module 'numpy' has no attribute 'float'.\n`np....      0      0  2.828427   
2  module 'numpy' has no attribute 'float'.\n`np....      0      1  2.828427   

    SpMax_A  SpDiam_A    SpAD_A   SpMAD_A   LogEE_A  ...     SRW10     TSRW10  \
0  1.414214  2.828427  2.828427  0.942809  1.849457  ...  4.174387  17.310771   
1  1.414214  2.828427  2.828427  0.942809  1.849457  ...  4.174387  17.310771   
2  1.414214  2.828427  2.828427  0.942809  1.849457  ...  4.174387  17.310771   

          MW       AMW  WPath  WPol  Zagreb1  Zagreb2  mZagreb1  mZagreb2  
0  44.062600  4.005691      4     0      6.0      4.0
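
The `module 'numpy' has no attribute 'float'` text in the `ABC` and `ABCGG` columns is an error message stored in place of a value. The `np.float` alias was removed in NumPy 1.24, so code that still refers to it fails on newer versions. One commonly used workaround, sketched here on the assumption that the installed Mordred still calls `np.float`, is to restore the alias before calculating descriptors. Pinning an older NumPy or switching to a maintained fork are alternatives.

```python
import numpy as np

# np.float was only ever an alias for the built-in float.
# Restore it if this NumPy version no longer provides it.
if not hasattr(np, "float"):
    np.float = float

# The alias now behaves like the built-in type
print(np.float("2.5"))
```

Run this before the descriptor calculation cell, then recompute the descriptors so the affected columns contain numbers.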

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [17]:
df3_X = pd.read_csv('descriptors_output.csv')

In [18]:
df3_X

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,44.0626,4.005691,4,0,6.0,4.0,2.25,1.0
1,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,46.041865,5.115763,4,0,6.0,4.0,2.25,1.0
2,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,45.057849,4.505785,4,0,6.0,4.0,2.25,1.0


In [19]:
# Only do this if 'Name' exists
if 'Name' in df3_X.columns:
    df3_X = df3_X.drop(columns=['Name'])

df3_X


Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,44.0626,4.005691,4,0,6.0,4.0,2.25,1.0
1,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,46.041865,5.115763,4,0,6.0,4.0,2.25,1.0
2,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,4.174387,17.310771,45.057849,4.505785,4,0,6.0,4.0,2.25,1.0
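
Because some descriptor columns hold error text rather than numbers, the X matrix may need a cleaning pass before modelling. The sketch below shows one possible approach: coerce every column to numeric and drop columns in which nothing could be parsed. The small `raw` frame is an illustrative stand-in for `df3_X`, with `MW` and `nBase` values copied from the output above.

```python
import pandas as pd

# Illustrative stand-in for df3_X: one column holds error text instead of numbers
raw = pd.DataFrame({
    "ABC": ["module 'numpy' has no attribute 'float'."] * 3,
    "nBase": [0, 0, 1],
    "MW": [44.0626, 46.041865, 45.057849],
})

# Convert every column to numbers; anything unparseable becomes NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Columns that are entirely NaN carried no usable values
failed_columns = numeric.columns[numeric.isna().all()].tolist()
clean = numeric.drop(columns=failed_columns)

print(failed_columns)
print(clean.shape)
```

Listing `failed_columns` before dropping them makes it clear which descriptors were lost.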


## **Y variable**

### **Convert IC50 to pIC50**

In [20]:
df3_Y = df3['pIC50']
df3_Y

0       6.124939
1       7.000000
2       4.301030
3       6.522879
4       6.096910
          ...   
4690    5.612610
4691    5.595166
4692    5.419075
4693    5.460924
4694    5.555955
Name: pIC50, Length: 4695, dtype: float64
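
The cell above only selects the `pIC50` column that is already present in the pre-processed file. For reference, pIC50 is defined as the negative base-10 logarithm of the IC50 expressed in molar units. A minimal sketch of that conversion follows; the function name `pic50_from_nm` and the input values are illustrative, and the input is assumed to be in nanomolar.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    # IC50[M] = IC50[nM] * 1e-9, so -log10(IC50[M]) = 9 - log10(IC50[nM])
    return 9.0 - math.log10(ic50_nm)


# Illustrative inputs: each tenfold increase in IC50 lowers pIC50 by one unit
for value_nm in (100.0, 1000.0, 10000.0):
    print(value_nm, pic50_from_nm(value_nm))
```

Higher pIC50 values therefore correspond to more potent compounds.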

## **Combining X and Y variables**

In [21]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2,pIC50
0,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0.0,0.0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,17.310771,44.062600,4.005691,4.0,0.0,6.0,4.0,2.25,1.0,6.124939
1,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0.0,0.0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,17.310771,46.041865,5.115763,4.0,0.0,6.0,4.0,2.25,1.0,7.000000
2,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0.0,1.0,2.828427,1.414214,2.828427,2.828427,0.942809,1.849457,...,17.310771,45.057849,4.505785,4.0,0.0,6.0,4.0,2.25,1.0,4.301030
3,,,,,,,,,,,...,,,,,,,,,,6.522879
4,,,,,,,,,,,...,,,,,,,,,,6.096910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,,,,,,,,,,,...,,,,,,,,,,5.612610
4691,,,,,,,,,,,...,,,,,,,,,,5.595166
4692,,,,,,,,,,,...,,,,,,,,,,5.419075
4693,,,,,,,,,,,...,,,,,,,,,,5.460924
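
`pd.concat(..., axis=1)` aligns rows by index label, so combining a 3-row descriptor table with a 4,695-value pIC50 series leaves every descriptor blank from row 3 onward, as in the output above. The sketch below reproduces that behaviour on toy data and adds a length check that would catch the mismatch. The helper `check_same_length` and the toy frames are illustrative.

```python
import pandas as pd

# Toy X with 3 rows and toy y with 5 rows (values copied from the outputs above)
X_demo = pd.DataFrame({"MW": [44.0626, 46.041865, 45.057849]})
y_demo = pd.Series([6.124939, 7.0, 4.30103, 6.522879, 6.09691], name="pIC50")

# Rows 3 and 4 have no match in X_demo, so their MW becomes NaN
combined = pd.concat([X_demo, y_demo], axis=1)
print(combined)


def check_same_length(X, y):
    """Raise a ValueError if X and y do not have the same number of rows."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} rows but y has {len(y)} rows")


try:
    check_same_length(X_demo, y_demo)
except ValueError as err:
    print(err)
```

Computing descriptors for every molecule in `df3`, rather than for the three example SMILES, would make the two lengths agree.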


In [22]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**