# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then prepare these as a dataset for model building in Part 4.
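To make the idea of a descriptor concrete, here is a toy sketch of a binary fingerprint: each position records whether a given text pattern occurs in a molecule's SMILES string. The pattern list and the `toy_fingerprint` helper are illustrative assumptions. Real PubChem fingerprints rely on proper substructure matching rather than plain substring checks.

```python
# Toy illustration only: substring checks on SMILES are NOT real substructure matching.
TOY_PATTERNS = ["Cl", "P", "N", "=O"]  # hypothetical feature list


def toy_fingerprint(smiles):
    """Return a 0/1 list: 1 if the pattern text occurs in the SMILES string."""
    return [1 if pattern in smiles else 0 for pattern in TOY_PATTERNS]


# SMILES taken from the molecule.smi listing shown in this notebook.
smiles = "CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl"
print(toy_fingerprint(smiles))  # [1, 1, 0, 0]
```

Each molecule becomes a fixed-length row of 0s and 1s, which is the shape a machine learning model expects as input.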

---

## **Download PaDEL-Descriptor**

In [50]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2025-12-17 22:35:48--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-12-17 22:35:49 ERROR 404: Not Found.

--2025-12-17 22:35:49--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-12-17 22:35:49 ERROR 404: Not Found.



In [51]:
! unzip padel.zip

unzip:  cannot find or open padel.zip, padel.zip.zip or padel.zip.ZIP.
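
Because both downloads above returned 404, the `unzip` step has nothing to work on. A small guard like the following sketch can make such a failure explicit before later cells run. The `missing_files` helper is a hypothetical name and is not part of the original notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


required = ["padel.zip", "padel.sh"]  # files the download cell was expected to fetch
absent = missing_files(required)
if absent:
    print("Missing files, later PaDEL steps will fail:", absent)
else:
    print("All required files are present.")
```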


## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [52]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2025-12-17 22:35:50--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8003::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.7’


2025-12-17 22:35:50 (2.77 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.7’ saved [655414/655414]



In [45]:
import pandas as pd

In [46]:
# 1) Load the fingerprint dataset (named df3 because the next cell references df3)
df3 = pd.read_csv(
    'acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv'
)

In [47]:
# 2) Split into X and y
df3_X = df3.drop(columns=["pIC50"])
df3_y = df3["pIC50"]

# 3) Combine back if the notebook wants a single dataframe
dataset3 = pd.concat([df3_X, df3_y], axis=1)

# 4) Save (use a NEW filename so you don't overwrite your input)
dataset3.to_csv("acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp_ready.csv", index=False)

In [53]:
! cat molecule.smi | head -5

CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl	CHEMBL463210
CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252723
CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252722
CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252721
CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252851
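
The listing above shows the layout of `molecule.smi`: one molecule per line, with a SMILES string followed by a ChEMBL ID. A file in that layout could be written from a DataFrame as in the sketch below. The column names `canonical_smiles` and `molecule_chembl_id`, the tab separator, and the `write_smi` helper are assumptions; the two rows are copied from the output above.

```python
import pandas as pd

# Two example rows copied from the listing above; the column names are assumed.
df = pd.DataFrame({
    "canonical_smiles": [
        "CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl",
        "CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O",
    ],
    "molecule_chembl_id": ["CHEMBL463210", "CHEMBL2252723"],
})


def write_smi(frame, path):
    """Write SMILES and ID as tab-separated lines with no header and no index."""
    frame[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        path, sep="\t", index=False, header=False
    )


# A separate filename is used so an existing molecule.smi is not overwritten.
write_smi(df, "molecule_example.smi")
```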


In [54]:
! cat molecule.smi | wc -l

      18


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [55]:
! cat padel.sh

cat: padel.sh: No such file or directory


In [56]:
! bash padel.sh

bash: padel.sh: No such file or directory


In [57]:
! ls -l

total 637856
-rw-r--r--  1 Sophia  staff     134764 Dec 17 22:19 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 Sophia  staff     271025 Dec 17 22:25 CDD_ML_Part_2_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 Sophia  staff      48618 Dec 17 22:32 CDD_ML_Part_3_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 Sophia  staff      34309 Dec 17 22:09 CDD_ML_Part_4_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 Sophia  staff      27040 Dec 17 22:09 CDD_ML_Part_5_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rwxr-xr-x  1 Sophia  staff   85055499 Feb 13  2025 Miniconda3-py37_4.8.2-Linux-x86_64.sh
-rw-r--r--  1 Sophia  staff         29 Oct  1 13:11 README.md
-rw-r--r--  1 Sophia  staff     642100 Dec 17 22:19 acetylcholinesterase.zip
-rw-r--r--  1 Sophia  staff       9828 Dec 17 22:19 acetylcholinesterase_01_bioactivity_data_raw.csv
-rw-r--r--  1 Sophia  staff       1093 Dec 17 22:19

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [58]:
df3_X = pd.read_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv')

In [59]:
df3_X

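|  | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | PubchemFP9 | ... | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 | pIC50.1 | pIC50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.737549 | 5.737549 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.947999 | 3.947999 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.425969 | 4.425969 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.346787 | 5.346787 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.735182 | 5.735182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4690 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4691 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4692 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4693 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |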


In [60]:
df3_X = df3_X.drop(columns=['pIC50'])
df3_X

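|  | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | PubchemFP9 | ... | PubchemFP872 | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 | pIC50.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.737549 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.947999 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.425969 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.346787 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.735182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4690 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| 4691 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| 4692 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| 4693 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |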
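
Note that the output above still contains a `pIC50.1` column, which appears to duplicate the target values. Left in X, it would leak the answer into the features. One possible cleanup, sketched here on a tiny stand-in DataFrame, keeps only the fingerprint columns. The `PubchemFP` prefix filter and the `fingerprint_columns` helper are assumptions for illustration, and the demo values are made up.

```python
import pandas as pd


def fingerprint_columns(frame, prefix="PubchemFP"):
    """Return only the columns whose names start with the fingerprint prefix."""
    keep = [col for col in frame.columns if col.startswith(prefix)]
    return frame[keep]


# Tiny stand-in for the real table (values are made up for the example).
demo = pd.DataFrame({
    "PubchemFP0": [1, 1],
    "PubchemFP1": [1, 0],
    "pIC50.1": [5.7, 3.9],
    "pIC50": [5.7, 3.9],
})
X = fingerprint_columns(demo)
print(list(X.columns))  # ['PubchemFP0', 'PubchemFP1']
```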


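### **Y variable**

The IC50 values were already converted to pIC50 in the earlier parts of this series, so here we only extract the `pIC50` column.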
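
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. A minimal sketch, assuming the IC50 is given in nanomolar (the `pic50_from_nm` name is hypothetical):

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in M) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# 1000 nM equals 1 micromolar, which corresponds to a pIC50 of about 6.
print(pic50_from_nm(1000.0))
```

A higher pIC50 means a lower IC50, that is, a more potent compound.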

In [61]:
df3_Y = df3['pIC50']
df3_Y

0       5.737549
1       3.947999
2       4.425969
3       5.346787
4       5.735182
          ...   
4690         NaN
4691         NaN
4692         NaN
4693         NaN
4694         NaN
Name: pIC50, Length: 4695, dtype: float64

## **Combining the X and Y variables**

In [62]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

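|  | PubchemFP0 | PubchemFP1 | PubchemFP2 | PubchemFP3 | PubchemFP4 | PubchemFP5 | PubchemFP6 | PubchemFP7 | PubchemFP8 | PubchemFP9 | ... | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 | pIC50.1 | pIC50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.737549 | 5.737549 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.947999 | 3.947999 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.425969 | 4.425969 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.346787 | 5.346787 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.735182 | 5.735182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4690 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4691 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4692 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4693 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |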
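
The last rows shown above have no pIC50 value. A regression model cannot learn from a row with a missing target, so such rows are usually removed before model building. A sketch on a small stand-in DataFrame, assuming the target column is named `pIC50`:

```python
import pandas as pd

# Small stand-in table; None becomes NaN, mimicking the missing targets above.
demo = pd.DataFrame({
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [1, 0, 1],
    "pIC50": [5.737549, 3.947999, None],
})

# Keep only rows that have a target value, then renumber the index.
clean = demo.dropna(subset=["pIC50"]).reset_index(drop=True)
print(len(demo), "->", len(clean))  # 3 -> 2
```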


In [63]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
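
The cell above writes to the same filename that was read earlier, so the input file is replaced. Where that is not intended, a sketch like the following writes to a separate file and reloads it to confirm the table survived the round trip. The output filename is a placeholder and the demo values are only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"PubchemFP0": [1, 0], "pIC50": [5.737549, 3.947999]})

out_path = "dataset_example_ready.csv"  # placeholder name, distinct from the input file
demo.to_csv(out_path, index=False)      # index=False avoids an extra unnamed column on reload

reloaded = pd.read_csv(out_path)
assert reloaded.shape == demo.shape     # same number of rows and columns after the round trip
```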

# **Let's download the CSV file to your local computer for model building in Part 4.**