### Calculating molecular descriptors: quantitative descriptions of the compounds in the dataset

#### First, we need to download PaDEL-Descriptor

PaDEL-Descriptor is an open-source software tool used for calculating molecular descriptors and fingerprints from molecular structures, such as those represented in the Simplified Molecular Input Line Entry System (SMILES) format. It's widely used in cheminformatics, drug discovery, and other areas of computational chemistry to extract numerical representations of molecules that can be used in machine learning models, quantitative structure-activity relationship (QSAR) modeling, and other data-driven analyses.

In [4]:
import pandas as pd

### **Download PaDEL-Descriptor**

'''In Terminal'''
wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
unzip padel.zip


In [6]:
df = pd.read_csv('bioactivity_preprocessed_data_with_descriptors.csv')
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,inactive,421.19,2.6605,0.0,4.0,4.869666
1,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,inactive,293.347,3.6308,0.0,3.0,4.882397
2,CHEMBL365134,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21,active,372.243,4.3933,0.0,3.0,6.008774
3,CHEMBL190743,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(I)cc21,active,419.243,4.2354,0.0,3.0,6.022276
4,CHEMBL365469,O=C1C(=O)N(Cc2cc3ccccc3s2)c2cccc(Cl)c21,inactive,327.792,4.2842,0.0,3.0,4.950782


In [8]:
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

df_selection.head()

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,CHEMBL185698
1,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,CHEMBL426082
2,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21,CHEMBL365134
3,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(I)cc21,CHEMBL190743
4,O=C1C(=O)N(Cc2cc3ccccc3s2)c2cccc(Cl)c21,CHEMBL365469


'''In Terminal'''
cat molecule.smi | head -5
cat molecule.smi | wc -l
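
The same check can be done from Python without leaving the notebook. The sketch below is a minimal example, not part of the original workflow: the helper name `summarize_smi` is hypothetical, and the toy file reuses two rows from the table above so the block runs on its own.

```python
import os
import tempfile

import pandas as pd


def summarize_smi(path, n=5):
    """Return the first n rows and the total row count of a tab-separated .smi file.

    Hypothetical helper: assumes two columns (SMILES, molecule ID) and no header,
    which is how molecule.smi was written above.
    """
    smi = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["canonical_smiles", "molecule_chembl_id"],
    )
    return smi.head(n), len(smi)


# Toy data taken from the first two rows shown above.
toy = pd.DataFrame(
    {
        "canonical_smiles": [
            "O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21",
            "O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21",
        ],
        "molecule_chembl_id": ["CHEMBL185698", "CHEMBL426082"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "molecule.smi")
    # Same write call as in the notebook cell above.
    toy.to_csv(toy_path, sep="\t", index=False, header=False)
    preview, n_rows = summarize_smi(toy_path)

print(preview)
print(n_rows)
```

In the notebook itself, `summarize_smi('molecule.smi')` would play the role of the `head` and `wc -l` commands.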

### **Calculate fingerprint descriptors**

#### **Calculate PaDEL descriptors**


'''In Terminal'''

cat padel.sh

bash padel.sh

**This creates the descriptor output file (`descriptors_output.csv`), which we use for model building**

### **Preparing the X and Y Data Matrices**

In [9]:
df_X = pd.read_csv('descriptors_output.csv')

In [10]:
df_X.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL379727,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL148483,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL209287,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL210525,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL348660,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
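
The `Name` values in this preview (CHEMBL379727, CHEMBL148483, ...) differ from the first IDs in `df` (CHEMBL185698, CHEMBL426082, ...), which suggests the descriptor file may list the molecules in a different order than the bioactivity table. If so, joining X and Y by row position would pair fingerprints with the wrong pIC50 values. The sketch below shows one way to reorder the descriptor rows by molecule ID before the `Name` column is dropped. The helper `align_descriptors` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def align_descriptors(df_desc, ids, id_col="Name"):
    """Reorder descriptor rows so they follow the order of `ids`.

    Hypothetical helper: raises if an ID is duplicated in the descriptor
    table or missing from it, instead of silently producing misaligned rows.
    """
    indexed = df_desc.set_index(id_col)
    if indexed.index.has_duplicates:
        raise ValueError("duplicate molecule IDs in descriptor table")
    missing = [i for i in ids if i not in indexed.index]
    if missing:
        raise KeyError(f"IDs missing from descriptor table: {missing}")
    # Select rows in the requested order, then restore the ID column.
    return indexed.loc[list(ids)].reset_index()


# Toy descriptor table whose rows are in the "wrong" order.
toy_desc = pd.DataFrame(
    {
        "Name": ["CHEMBL426082", "CHEMBL185698"],
        "PubchemFP0": [0, 1],
        "PubchemFP1": [1, 1],
    }
)
toy_ids = ["CHEMBL185698", "CHEMBL426082"]

aligned = align_descriptors(toy_desc, toy_ids)
print(aligned)
```

In this notebook the equivalent call would be `align_descriptors(df_X, df['molecule_chembl_id'])`, assuming every molecule in `df` received a descriptor row.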


In [11]:
df_X = df_X.drop(columns=['Name'])
df_X.head()

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [12]:
df_X.shape

(118, 881)

**For Y, use the `pIC50` column (IC50 was already converted to pIC50 in the preprocessed data)**

In [13]:
df_Y = df['pIC50']
df_Y.head()

0    4.869666
1    4.882397
2    6.008774
3    6.022276
4    4.950782
Name: pIC50, dtype: float64
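
For reference, pIC50 is the negative base-10 logarithm of IC50 expressed in molar units. The sketch below assumes IC50 is given in nanomolar (nM); the source file does not state the unit, so treat that as an assumption. The function name `pic50_from_ic50_nm` is hypothetical, and this is not necessarily the exact preprocessing used to build the `pIC50` column.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 value in nM (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in M) = 9 - log10(IC50 in nM)
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# An IC50 of 13500 nM corresponds to a pIC50 of about 4.8697,
# the same magnitude as the first value printed above.
example = pic50_from_ic50_nm(13500.0)
print(round(example, 4))
```

Higher pIC50 means a more potent compound: each unit increase corresponds to a tenfold lower IC50.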

In [14]:
df_Y.shape

(118,)

In [16]:
dataset = pd.concat([df_X,df_Y], axis=1)
dataset.head()

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.869666
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.882397
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.008774
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.022276
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.950782
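
`pd.concat(..., axis=1)` joins on the index, so X and Y with different lengths or different index labels would produce rows padded with `NaN` rather than an error. A minimal sketch of a guarded version follows. The helper `combine_xy` is hypothetical, and it pairs rows by position, so it assumes X and Y are already in the same molecule order.

```python
import pandas as pd


def combine_xy(X, y):
    """Concatenate features and target column-wise after basic sanity checks.

    Hypothetical helper: pairs rows by position and assumes X and y already
    list the molecules in the same order.
    """
    if len(X) != len(y):
        raise ValueError(f"row count mismatch: X has {len(X)}, y has {len(y)}")
    combined = pd.concat(
        [X.reset_index(drop=True), y.reset_index(drop=True)], axis=1
    )
    if combined.isna().any().any():
        raise ValueError("combined dataset contains missing values")
    return combined


# Toy data: two fingerprint bits and two pIC50 values from the table above.
toy_X = pd.DataFrame({"PubchemFP0": [1, 1], "PubchemFP1": [1, 0]})
toy_y = pd.Series([4.869666, 4.882397], name="pIC50")

toy_dataset = combine_xy(toy_X, toy_y)
print(toy_dataset.shape)
```

With the real data, `combine_xy(df_X, df_Y)` should return a frame of shape `(118, 882)`: 881 fingerprint columns plus `pIC50`.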


In [18]:
dataset.to_csv('final_dataset_for_model_building.csv', index=False)