# **Computational Drug Discovery - Descriptor Calculation and Dataset Preparation**

In **Part 3**, I will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. I will then assemble them into a dataset for model building in **Part 4**.

---

## **Load bioactivity data**

In [1]:
import pandas as pd

In [2]:
df3 = pd.read_csv('../data/SGLT2_04_bioactivity_data_3class_pIC50.csv')
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL183937,Cc1cc(O)c(C(=O)CCc2ccc3occc3c2)c(O[C@@H]2O[C@H...,active,458.463,1.44102,5.0,9.0,7.958607
1,1,CHEMBL371448,Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]...,active,454.479,1.58142,5.0,8.0,6.308919
2,2,CHEMBL382302,Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]...,active,456.495,0.77012,5.0,8.0,6.339135
3,3,CHEMBL382319,CCc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...,active,470.522,1.02410,5.0,8.0,6.274088
4,4,CHEMBL200608,Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]...,active,470.522,0.78052,4.0,9.0,6.122629
...,...,...,...,...,...,...,...,...,...
1293,1293,CHEMBL5177502,OC[C@H]1O[C@@H](c2ccc(Cl)c(Cc3ccc(OC4CCOC4)cc3...,active,450.915,1.61340,4.0,7.0,8.508638
1294,1294,CHEMBL5202047,CCOc1ccc(Cc2cc([C@H]3O[C@@H](SC)[C@H](O)[C@@H]...,active,424.946,3.17260,3.0,6.0,8.744727
1295,1295,CHEMBL5182632,COc1ccc(Cc2cc([C@H]3O[C@@H](SC)[C@H](O)[C@@H](...,active,410.919,2.78250,3.0,6.0,8.677781
1296,1296,CHEMBL5205876,COc1ccc(Cc2cc3c(c([C@@H]4O[C@H](CO)[C@@H](O)[C...,active,400.471,1.28960,4.0,6.0,8.128427


In [3]:
# Write SMILES and ChEMBL ID as a tab-separated .smi file (no header, no index)
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [4]:
! head -5 molecule.smi

Cc1cc(O)c(C(=O)CCc2ccc3occc3c2)c(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O)c1	CHEMBL183937
Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O)c2c(CCc3ccc4occc4c3)n[nH]c2c1	CHEMBL371448
Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O)c2c(CCc3ccc4c(c3)CCO4)n[nH]c2c1	CHEMBL382302
CCc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O)c2c(CCc3ccc4c(c3)CCO4)n[nH]c2c1	CHEMBL382319
Cc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O)c2c(CCc3ccc4c(c3)CCO4)nn(C)c2c1	CHEMBL200608


Check that every row was written to `molecule.smi` (the line count should match the 1,298 rows in `df3`):

In [5]:
! cat molecule.smi | wc -l

    1298
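
The same check can be done without leaving Python. The sketch below is a minimal, self-contained illustration: it uses a tiny made-up DataFrame and a temporary file rather than the real `df3` and `molecule.smi`, and the helper name `count_lines` is hypothetical.

```python
import os
import tempfile

import pandas as pd


def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


# Toy stand-in for df3[['canonical_smiles', 'molecule_chembl_id']] (illustrative values only)
toy = pd.DataFrame({
    'canonical_smiles': ['CCO', 'c1ccccc1', 'CC(=O)O'],
    'molecule_chembl_id': ['ID_A', 'ID_B', 'ID_C'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    smi_path = os.path.join(tmp_dir, 'toy.smi')
    toy.to_csv(smi_path, sep='\t', index=False, header=False)
    n_lines = count_lines(smi_path)

# One line per molecule is expected when no header is written
rows_match = (n_lines == len(toy))
print(n_lines, rows_match)
```

With the real data, comparing `count_lines('molecule.smi')` against `len(df3)` would play the same role as the `wc -l` call.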


## **Preparing the X and Y Data Matrices**

*Note: `descriptors_output.csv` contains PubChem fingerprint descriptors: one row per molecule and 881 binary columns, `PubchemFP0` to `PubchemFP880`.*

### **X data matrix**

In [6]:
df3_X = pd.read_csv('../data/descriptors_output.csv')

In [7]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL336398,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL133897,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL130628,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL131588,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL130478,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,CHEMBL4293155,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4691,CHEMBL4282558,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4692,CHEMBL4281727,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4693,CHEMBL4292349,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4691,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4692,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4693,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
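
Before using the fingerprint matrix, it can be worth confirming that it really is binary and complete. The sketch below shows one way to do that on a toy table; the IDs and bit values are made up, and only three fingerprint columns are used for brevity.

```python
import pandas as pd

# Toy stand-in for the descriptor table (made-up IDs, three bit columns)
toy_X = pd.DataFrame({
    'Name': ['ID_A', 'ID_B', 'ID_C'],
    'PubchemFP0': [1, 1, 0],
    'PubchemFP1': [0, 1, 0],
    'PubchemFP2': [0, 0, 0],
})

# Drop the identifier column so only fingerprint bits remain
bits = toy_X.drop(columns=['Name'])

# Every entry should be 0 or 1, with no missing values
is_binary = bool(bits.isin([0, 1]).all().all())
has_missing = bool(bits.isna().any().any())

print(bits.shape, is_binary, has_missing)
```

The same three lines could be applied to `df3_X` after the `Name` column has been dropped.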


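### **Y variable**

The input file already stores pIC50 values, so the `pIC50` column is selected directly; no IC50 conversion is needed in this part.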

In [9]:
df3_Y = df3['pIC50']
df3_Y

0       7.958607
1       6.308919
2       6.339135
3       6.274088
4       6.122629
          ...   
1293    8.508638
1294    8.744727
1295    8.677781
1296    8.128427
1297    8.841638
Name: pIC50, Length: 1298, dtype: float64
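
For reference, pIC50 is the negative base-10 logarithm of IC50 expressed in molar units. The sketch below assumes IC50 values reported in nanomolar (an assumption; the units are not stated in this part) and uses a hypothetical helper, `ic50_nm_to_pic50`.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError('IC50 must be positive')
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)


# 1 nM corresponds to pIC50 of about 9; 1000 nM (1 uM) to about 6
for value in (1.0, 1000.0):
    print(value, round(ic50_nm_to_pic50(value), 3))
```

A higher pIC50 therefore means a more potent compound, since it corresponds to a lower IC50.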

## **Combining the X and Y variables**

In [10]:
# Join fingerprints and pIC50 column-wise (rows are paired by index label)
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.958607
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.308919
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.339135
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.274088
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.122629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
4691,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
4692,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
4693,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
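
Note that `pd.concat(..., axis=1)` pairs rows by index label, not by molecule. In the output above, the descriptor table has 4,694 rows while the pIC50 series has 1,298, so the trailing rows have no pIC50 value, and the first ChEMBL IDs shown in the descriptor table differ from those in the bioactivity table. One way to guard against such mismatches, sketched below on toy data, is to join on the molecule ID instead. The toy values and the choice of an inner join are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy descriptor table keyed by 'Name' (made-up IDs, two bit columns)
toy_X = pd.DataFrame({
    'Name': ['ID_A', 'ID_B', 'ID_C', 'ID_D'],
    'PubchemFP0': [1, 1, 0, 1],
    'PubchemFP1': [0, 1, 0, 0],
})

# Toy bioactivity table in a different order and with fewer molecules
toy_Y = pd.DataFrame({
    'molecule_chembl_id': ['ID_C', 'ID_A'],
    'pIC50': [6.5, 8.0],
})

# An inner join on the ID keeps only molecules present in both tables;
# validate raises an error if either side contains duplicate IDs
merged = toy_X.merge(
    toy_Y,
    left_on='Name',
    right_on='molecule_chembl_id',
    how='inner',
    validate='one_to_one',
)

# Keep only the model inputs and the target
toy_dataset = merged.drop(columns=['Name', 'molecule_chembl_id'])
print(toy_dataset)
```

Applied to the real tables, this would mean keeping the `Name` column in the descriptor table until after the join, so that each fingerprint row is matched to the pIC50 of the same compound.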


In [11]:
dataset3.to_csv('SGLT2_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)