# Phase 3: Descriptor Calculation and Dataset Preparation
Shreya Das

In Phase 3, we calculate molecular descriptors, which serve as a quantitative description of each drug in the dataset. We then prepare this dataset for the next phase.

NOTE: The structure and layout of this phase and of the project are inspired by Data Professor on YouTube. The findings for RET molecules and drugs are original and were investigated by the author (Shreya Das).

## 1. Download PaDEL-Descriptor
PaDEL-Descriptor is software used to calculate molecular descriptors and fingerprints. **Molecular fingerprints** are calculated values that encode each molecule's structural features, and with them its associated chemical properties, in a fixed-length numerical form. We will use PaDEL-Descriptor to calculate fingerprints for each of the RET inhibitors in our previous datasets, which will provide the training data for our ML model in the next phase.

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-09-03 14:28:03--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-09-03 14:28:03--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-09-03 14:28:03 (141 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-09-03 14:28:03--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/._PaDEL-Descriptor  
replace PaDEL-Descriptor/MACCSFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
replace PaDEL-Descriptor/AtomPairs2DFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
replace PaDEL-Descriptor/EStateFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml? [y]es, [n

## 1.1 Load bioactivity data
We will download the bioactivity data from the previous phase from GitHub. Note that the URL uses `raw` rather than `blob`: a `blob` URL returns GitHub's HTML page for the file instead of the CSV itself.

In [None]:
!wget https://github.com/Shreya-Das-uoft/Chembl-Database-Drug-Discovery-for-RET-tyrosine-kinase-receptor-in-neuroblastoma/raw/main/RET_03_bioactivity_data_pIC50.csv
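A quick check can confirm that the downloaded file is real CSV data. A GitHub `blob` URL serves the HTML page that displays a file rather than the file contents, and pandas would then fail or misparse it. The following is a minimal sketch of such a check; the helper name `looks_like_html` is hypothetical and the heuristic (inspecting the first bytes of the file) is an assumption, not part of the original workflow.

```python
from pathlib import Path


def looks_like_html(path, n_bytes=512):
    """Return True if the file starts like an HTML page rather than CSV data."""
    head = Path(path).read_bytes()[:n_bytes].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Only inspect the download if it is present in the working directory.
csv_path = Path("RET_03_bioactivity_data_pIC50.csv")
if csv_path.exists():
    if looks_like_html(csv_path):
        print("Downloaded an HTML page; use the raw file URL instead.")
    else:
        print("File does not look like HTML; safe to try pd.read_csv.")
```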



We will use pandas to read the downloaded CSV file. Then we will create a dataframe containing only the canonical SMILES (a text string that encodes each molecule's structure) and the ChEMBL ID. This selection contains 896 molecules.

In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('RET_03_bioactivity_data_pIC50.csv')

In [None]:
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1	CHEMBL115220
O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23	CHEMBL6246
CO[C@@H](C(=O)N1Cc2[nH]nc(NC(=O)c3ccc(N4CCN(C)CC4)cc3)c2C1)c1ccccc1	CHEMBL402548
CNc1ncnc(-c2cccnc2Oc2ccc(F)c(C(=O)Nc3cc(C(F)(F)F)ccc3N(C)CCCN(C)C)c2)n1	CHEMBL373882
Cc1ccc(F)c(NC(=O)Nc2ccc(-c3cccc4[nH]nc(N)c34)cc2)c1	CHEMBL223360


In [None]:
! cat molecule.smi | wc -l

896
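The `.smi` file written above is plain text with one molecule per line: the SMILES string, a tab, and the molecule ID, with no header row. The sketch below reproduces that layout on a two-row toy frame and confirms that the number of lines equals the number of rows. The SMILES strings and IDs are made-up placeholders, not rows from the RET dataset, and the helper name `to_smi_text` is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for df3; values are placeholders, not real bioactivity data.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "pIC50": [5.0, 6.0],
})


def to_smi_text(df):
    """Serialize the SMILES and ID columns as tab-separated lines, no header."""
    buffer = io.StringIO()
    df[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        buffer, sep="\t", index=False, header=False
    )
    return buffer.getvalue()


smi_text = to_smi_text(toy)
print(smi_text)
# One line per molecule, so the line count should match the row count.
print(len(smi_text.splitlines()) == len(toy))
```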


## 1.2 Calculate fingerprint descriptors

### Calculate PaDEL descriptors


In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
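For readers who prefer to stay in Python, the same command can be launched with `subprocess` instead of `bash padel.sh`. The sketch below only rebuilds the argument list shown in `padel.sh`. The function names `build_padel_command` and `run_padel` and their parameters are hypothetical, and running it assumes Java and the unzipped `PaDEL-Descriptor` folder are available in the working directory.

```python
import subprocess


def build_padel_command(
    fingerprint_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
):
    """Return the argument list equivalent to the command in padel.sh."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", fingerprint_xml,  # which fingerprint definition to use
        "-dir", input_dir,                    # folder containing molecule.smi
        "-file", output_csv,                  # where the results are written
    ]


def run_padel(**kwargs):
    """Run PaDEL-Descriptor; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_padel_command(**kwargs), check=True)


# Print the command only; call run_padel() to actually execute it.
print(" ".join(build_padel_command()))
```

Passing a different `fingerprint_xml`, for example one of the other XML files unpacked from `padel.zip` such as `MACCSFingerprinter.xml`, would select another fingerprint type under the same assumptions.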


In [None]:
! bash padel.sh

Processing CHEMBL115220 in molecule.smi (1/896). 
Processing CHEMBL6246 in molecule.smi (2/896). 
Processing CHEMBL402548 in molecule.smi (3/896). Average speed: 5.31 s/mol.
Processing CHEMBL373882 in molecule.smi (4/896). Average speed: 2.69 s/mol.
Processing CHEMBL223360 in molecule.smi (5/896). Average speed: 2.59 s/mol.
Processing CHEMBL236928 in molecule.smi (6/896). Average speed: 2.09 s/mol.
Processing CHEMBL399208 in molecule.smi (7/896). Average speed: 1.98 s/mol.
Processing CHEMBL237347 in molecule.smi (9/896). Average speed: 1.83 s/mol.
Processing CHEMBL237557 in molecule.smi (8/896). Average speed: 1.74 s/mol.
Processing CHEMBL395359 in molecule.smi (10/896). Average speed: 1.47 s/mol.
Processing CHEMBL237346 in molecule.smi (11/896). Average speed: 1.32 s/mol.
Processing CHEMBL237345 in molecule.smi (12/896). Average speed: 1.27 s/mol.
Processing CHEMBL237344 in molecule.smi (13/896). Average speed: 1.16 s/mol.
Processing CHEMBL395358 in molecule.smi (14/896). Average spee

In [None]:
! ls -l

total 26940
-rw-r--r-- 1 root root  1604233 Sep  3 14:38 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Sep  3 14:28 __MACOSX
-rw-r--r-- 1 root root    61725 Sep  3 14:34 molecule.smi
drwxrwxr-x 4 root root     4096 Sep  3 14:28 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Sep  3 14:28 padel.sh
-rw-r--r-- 1 root root 25768637 Sep  3 14:28 padel.zip
-rw-r--r-- 1 root root   125352 Sep  3 14:33 RET_03_bioactivity_data_pIC50.csv
drwxr-xr-x 1 root root     4096 Aug 29 13:22 sample_data


## 2. Preparing the X and Y Data Matrices

## 2.1. X Data Matrix
The X matrix will contain only the calculated molecular descriptors, so we remove the `Name` column, which holds the ChEMBL IDs.

In [None]:
import pandas as pd

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL6246,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL402548,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL373882,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL223360,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,CHEMBL5289571,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
892,CHEMBL5268831,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
893,CHEMBL5284144,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
894,CHEMBL4080062,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
892,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
893,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
894,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
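Before modeling, it may be worth confirming that the descriptor matrix has no missing values and contains only 0s and 1s. The sketch below runs such a check on a toy frame. The helper `check_fingerprint_matrix` is hypothetical, and the expectation that every column is binary is an assumption based on the PubChem fingerprint output displayed above.

```python
import pandas as pd


def check_fingerprint_matrix(X):
    """Summarize shape, missing values, and whether all entries are 0 or 1."""
    n_missing = int(X.isna().sum().sum())
    is_binary = bool(X.isin([0, 1]).all().all())
    return {"shape": X.shape, "n_missing": n_missing, "is_binary": is_binary}


# Toy stand-in for df3_X; the real matrix has columns PubchemFP0..PubchemFP880.
toy_X = pd.DataFrame({
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [1, 0, 1],
    "PubchemFP2": [0, 0, 1],
})
print(check_fingerprint_matrix(toy_X))
```

On the real data the call would be `check_fingerprint_matrix(df3_X)`; a non-zero `n_missing` would indicate rows that need attention before training.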


## 2.2. Y Data Matrix
The Y matrix will contain only the pIC50 column. These values are the target that the ML regression model will be trained to predict and evaluated against.

In [None]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,5.000000
1,4.397940
2,7.508638
3,5.080922
4,5.721246
...,...
891,6.866461
892,7.950782
893,7.920819
894,7.522879
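As a reminder of what the target variable means, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so higher values indicate more potent inhibitors. The sketch below assumes IC50 values are given in nM. The helper name `pic50_from_nm` is hypothetical, and the actual conversion performed in the previous phase is not shown in this notebook.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


# An IC50 of 10,000 nM (10 uM) corresponds to a pIC50 of 5.
print(round(pic50_from_nm(10_000), 6))
```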


## 2.3. Combining the X and Y Variables
We will create a new data frame that combines the molecular descriptors and the pIC50 values. This data frame will be exported as a .csv file and used in the next phase.

In [None]:
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.397940
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.508638
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.080922
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.721246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.866461
892,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.950782
893,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.920819
894,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.522879
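`pd.concat(..., axis=1)` pairs rows by their index labels, which here are simply row positions, so the result is only correct if `descriptors_output.csv` lists molecules in the same order as `df3`. The PaDEL log reports molecules 9 and 8 out of sequence, which suggests parallel processing, so the output order may be worth verifying. A more defensive sketch, shown here on toy data, joins on the molecule ID instead. It assumes the `Name` column of the descriptor file holds the ChEMBL ID, as in the table displayed earlier, and that each ID appears once in both tables. The function name `combine_by_id` is hypothetical.

```python
import pandas as pd


def combine_by_id(descriptors, bioactivity):
    """Join fingerprints to pIC50 by molecule ID rather than by row position."""
    merged = descriptors.merge(
        bioactivity[["molecule_chembl_id", "pIC50"]],
        left_on="Name",
        right_on="molecule_chembl_id",
        how="inner",
        validate="one_to_one",  # raises MergeError if an ID is duplicated
    )
    # Drop the ID columns so only descriptors and the target remain.
    return merged.drop(columns=["Name", "molecule_chembl_id"])


# Toy data with deliberately different row orders; IDs are placeholders.
toy_descriptors = pd.DataFrame({
    "Name": ["MOL_B", "MOL_A"],
    "PubchemFP0": [0, 1],
    "PubchemFP1": [1, 1],
})
toy_bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "pIC50": [5.0, 7.5],
})
toy_dataset = combine_by_id(toy_descriptors, toy_bioactivity)
print(toy_dataset)
```

With the real data this would be applied to the descriptor frame before its `Name` column is dropped. Comparing `len(toy_dataset)` with the input row counts shows whether any molecule failed to match.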


In [None]:
dataset3.to_csv('RET_04_bioactivity_data_pIC50_pubchem_fp.csv', index=False)