<a href="https://colab.research.google.com/github/sara-then/HGF-drugdiscovery-project/blob/main/project_drugdiscovery_pt3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computational Drug Discovery Project (Part 3)
Part 3: Molecular Descriptor Calculation and Dataset Preparation

Lipinski descriptors provide a quick overview of a compound's druggability. They give a global, macro-level view of the molecule's features: its molecular weight, lipophilicity (LogP), and number of hydrogen-bond donors and acceptors. Whether a compound passes Lipinski's Rule of 5 indicates its potential as an orally active drug.
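
As a rough illustration of the rule, the sketch below counts Rule of 5 violations from the four descriptors stored in this project's dataframe (MW, LogP, NumHDonors, NumHAcceptors). The function names and the "at most one violation" pass criterion are assumptions made for this example (a commonly used convention), not something taken from the original notebook.

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a compound violates."""
    checks = [
        mw > 500,          # molecular weight above 500 Da
        logp > 5,          # LogP (lipophilicity) above 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mw, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed convention: a compound passes with at most one violation."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= max_violations


# Descriptor values of CHEMBL352308 and CHEMBL115220, taken from the
# bioactivity dataframe loaded later in this notebook.
print(lipinski_violations(501.627, 6.0375, 3, 7))   # MW and LogP are both too high
print(lipinski_violations(291.354, 3.6215, 2, 2))
```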

For drug discovery and design, we need a more microscopic view of the molecule. Molecular descriptors and fingerprints describe the molecule's "building blocks": its structure, bonds and connectivity, functional groups, and other distinguishing molecular properties. The essence of drug discovery and design is to rearrange these building blocks so that the molecule achieves the highest potency toward the target protein while remaining safe (lowest toxicity).

## Downloading PaDEL-Descriptor 
PaDEL-Descriptor is open-source software for calculating molecular descriptors and fingerprints.
More information can be found here: https://pubmed.ncbi.nlm.nih.gov/21425294/


In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-06-19 03:55:19--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-06-19 03:55:19--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-06-19 03:55:20 (36.8 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-06-19 03:55:20--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (git

In [None]:
%%capture
! unzip padel.zip

## Loading preprocessed bioactivity dataframe
Load the preprocessed dataframe containing the Lipinski descriptors and pIC50 values.

In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('bioactivity_data_3class_pIC50.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL352308,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2c...,inactive,501.627,6.03750,3.0,7.0,5.000000
1,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,inactive,291.354,3.62150,2.0,2.0,5.000000
2,CHEMBL101683,O=C(Nc1ccc(Cl)cc1)c1ccccc1NCc1ccncc1,inactive,337.810,4.59940,2.0,3.0,5.000000
3,CHEMBL101253,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,inactive,346.821,5.01260,1.0,4.0,5.000000
4,CHEMBL281957,CCN(CC)C/C=C/c1nc(O)c2c(ccc3nc(Nc4c(Cl)cccc4Cl...,inactive,484.431,6.54092,2.0,6.0,4.000000
...,...,...,...,...,...,...,...,...
4825,CHEMBL4799551,COc1cc2ncnc(Oc3ccc(Nc4nccc5c4c(=O)c(-c4ccc(F)c...,active,567.552,6.37500,1.0,9.0,7.081445
4826,CHEMBL4593677,CC1=C(C#N)C(c2ccc3[nH]nc(C)c3c2)C(C#N)=C(C)N1,active,289.342,3.15328,2.0,4.0,9.000000
4827,CHEMBL4593677,CC1=C(C#N)C(c2ccc3[nH]nc(C)c3c2)C(C#N)=C(C)N1,active,289.342,3.15328,2.0,4.0,8.537602
4828,CHEMBL4522773,CC1=C(C#N)[C@@H](c2ccc3[nH]nc(C)c3c2)C(C#N)=C(...,active,385.858,4.94398,2.0,4.0,7.958607


Prepare the input for PaDEL-Descriptor by writing only the 'canonical_smiles' and 'molecule_chembl_id' columns to a separate tab-delimited .smi file.

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2cc1OCCNCCO	CHEMBL352308
O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1	CHEMBL115220
O=C(Nc1ccc(Cl)cc1)c1ccccc1NCc1ccncc1	CHEMBL101683
Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1	CHEMBL101253
CCN(CC)C/C=C/c1nc(O)c2c(ccc3nc(Nc4c(Cl)cccc4Cl)n(C)c32)c1C	CHEMBL281957


In [None]:
! cat molecule.smi | wc -l

4830
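
Before handing the file to PaDEL-Descriptor, it can be worth confirming that every line has exactly two tab-separated fields, with the SMILES string first and the ID second. The following is a minimal sketch in plain Python; `parse_smi_lines` is a hypothetical helper, and the sample lines are copied from the `head` output above.

```python
def parse_smi_lines(lines):
    """Split tab-separated .smi lines into (smiles, molecule_id) pairs."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        # Reject lines with a missing or extra field, or an empty SMILES/ID
        if len(fields) != 2 or not all(fields):
            raise ValueError(
                f"line {line_number}: expected 'SMILES<TAB>ID', got {line!r}"
            )
        records.append((fields[0], fields[1]))
    return records


sample = [
    "O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1\tCHEMBL115220\n",
    "O=C(Nc1ccc(Cl)cc1)c1ccccc1NCc1ccncc1\tCHEMBL101683\n",
]
records = parse_smi_lines(sample)
print(len(records), records[0][1])
```

On the real file, `parse_smi_lines(open('molecule.smi'))` should return as many records as the `wc -l` count, provided no line is blank.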


## Calculating fingerprint descriptors 
Calculate PubChem fingerprints with PaDEL-Descriptor by running the downloaded `padel.sh` script.

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
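
The same invocation can be assembled from Python, which makes the individual options easier to see and change. The sketch below only builds the argument list. The flags and paths are copied from `padel.sh`, while the function name `build_padel_command` and its parameters are hypothetical. The call to `subprocess.run` is left commented out because it requires Java and the unzipped PaDEL-Descriptor directory.

```python
import subprocess  # only needed if the command is actually executed


def build_padel_command(
    input_dir="./",
    output_file="descriptors_output.csv",
    heap="1G",
    descriptor_types="./PaDEL-Descriptor/PubchemFingerprinter.xml",
):
    """Return the PaDEL-Descriptor command from padel.sh as an argument list."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt",                          # strip salts before calculation
        "-standardizenitro",                    # standardize nitro groups
        "-fingerprints",                        # calculate fingerprints
        "-descriptortypes", descriptor_types,   # which fingerprint type to compute
        "-dir", input_dir,                      # directory holding molecule.smi
        "-file", output_file,                   # CSV file to write
    ]


command = build_padel_command()
print(" ".join(command))
# subprocess.run(command, check=True)  # uncomment to run PaDEL-Descriptor
```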


In [None]:
# running padel.sh to calculate fingerprint descriptors
! bash padel.sh

Processing CHEMBL352308 in molecule.smi (1/4830). 
Processing CHEMBL115220 in molecule.smi (2/4830). 
Processing CHEMBL101683 in molecule.smi (3/4830). Average speed: 13.19 s/mol.
Processing CHEMBL101253 in molecule.smi (4/4830). Average speed: 7.21 s/mol.
Processing CHEMBL281957 in molecule.smi (5/4830). Average speed: 4.88 s/mol.
Processing CHEMBL2111784 in molecule.smi (6/4830). Average speed: 4.01 s/mol.
Processing CHEMBL419409 in molecule.smi (7/4830). Average speed: 3.25 s/mol.
Processing CHEMBL120185 in molecule.smi (8/4830). Average speed: 2.94 s/mol.
Processing CHEMBL121405 in molecule.smi (9/4830). Average speed: 2.55 s/mol.
Processing CHEMBL47203 in molecule.smi (10/4830). Average speed: 2.33 s/mol.
Processing CHEMBL118258 in molecule.smi (11/4830). Average speed: 2.11 s/mol.
Processing CHEMBL178455 in molecule.smi (12/4830). Average speed: 1.99 s/mol.
Processing CHEMBL412367 in molecule.smi (13/4830). Average speed: 1.81 s/mol.
Processing CHEMBL401930 in molecule.smi (14/48

In [None]:
! ls -l

total 26208
-rw-r--r-- 1 root root   684985 Jun 19 03:55 bioactivity_data_3class_pIC50.csv
drwxr-xr-x 3 root root     4096 Jun 19 03:55 __MACOSX
-rw-r--r-- 1 root root   357280 Jun 19 03:56 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Jun 19 03:55 padel.sh
-rw-r--r-- 1 root root 25768637 Jun 19 03:55 padel.zip
drwxr-xr-x 1 root root     4096 Jun 15 13:42 sample_data


## Preparing the X and Y Data Matrices

X data matrix

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL352308,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL101683,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL101253,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL281957,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4825,CHEMBL4593677,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4826,CHEMBL4593677,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4827,CHEMBL4799551,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4828,CHEMBL4097778,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
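
In the output above, the `Name` column does not follow the order of `molecule.smi`: the first two IDs are swapped. Because the X and Y matrices are later combined by row position, it may be worth checking the order before `Name` is dropped. The helper below is a hypothetical sketch in plain Python. With the dataframes from this notebook, the two lists would come from `df3['molecule_chembl_id'].tolist()` and `df3_X['Name'].tolist()`.

```python
def first_order_mismatch(expected_ids, actual_ids):
    """Return (position, expected, actual) for the first differing row, or None."""
    if len(expected_ids) != len(actual_ids):
        raise ValueError(
            f"length mismatch: {len(expected_ids)} vs {len(actual_ids)}"
        )
    for position, (expected, actual) in enumerate(zip(expected_ids, actual_ids)):
        if expected != actual:
            return position, expected, actual
    return None


# First five IDs as written to molecule.smi and as displayed in the
# descriptor output above.
smi_order = ["CHEMBL352308", "CHEMBL115220", "CHEMBL101683",
             "CHEMBL101253", "CHEMBL281957"]
padel_order = ["CHEMBL115220", "CHEMBL352308", "CHEMBL101683",
               "CHEMBL101253", "CHEMBL281957"]
print(first_order_mismatch(smi_order, padel_order))
```

If a mismatch is reported, the descriptor rows would need to be put back into the input order before they are combined with pIC50. Some ChEMBL IDs occur more than once in this dataset (for example CHEMBL4593677), so a join on the ID alone would need care with duplicates.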


In [None]:
# dropping column 'Name'
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4825,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4826,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4827,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4828,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
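
The remaining columns are the 881 bits of the PubChem fingerprint (`PubchemFP0` through `PubchemFP880`), each expected to be 0 or 1. A small sanity check along these lines can catch a truncated or malformed descriptor file. The names `check_fingerprint_header`, `is_binary_row`, and `bits_set` are hypothetical, and the code works on plain Python lists rather than the dataframe.

```python
N_BITS = 881  # PubchemFP0 ... PubchemFP880
EXPECTED_COLUMNS = [f"PubchemFP{i}" for i in range(N_BITS)]


def check_fingerprint_header(columns):
    """True if the columns are exactly PubchemFP0..PubchemFP880, in order."""
    return list(columns) == EXPECTED_COLUMNS


def is_binary_row(values):
    """True if every fingerprint value in the row is 0 or 1."""
    return all(value in (0, 1) for value in values)


def bits_set(values):
    """Number of substructure bits that are switched on in the row."""
    return sum(values)


# First ten bits of row 0 as displayed above (illustrative only)
toy_row = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(len(EXPECTED_COLUMNS), is_binary_row(toy_row), bits_set(toy_row))
```

With the notebook's data, `check_fingerprint_header(df3_X.columns)` would be the corresponding call.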


Y variable

In [None]:
# select the pIC50 column as the Y variable
df3_Y = df3['pIC50']
df3_Y

0       5.000000
1       5.000000
2       5.000000
3       5.000000
4       4.000000
          ...   
4825    7.081445
4826    9.000000
4827    8.537602
4828    7.958607
4829    6.000000
Name: pIC50, Length: 4830, dtype: float64
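
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so larger values mean higher potency. The sketch below assumes IC50 values given in nanomolar. The function name and the nM input unit are assumptions for this example; the values in `df3` were computed before this notebook.

```python
import math


def pic50_from_nanomolar(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)


# An IC50 of 1 nM gives a pIC50 of about 9; 10,000 nM (10 µM) gives about 5.
for ic50_nm in (1.0, 10_000.0):
    print(ic50_nm, round(pic50_from_nanomolar(ic50_nm), 3))
```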

## Combining X and Y variables into dataset for model building

In [None]:
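# pd.concat(axis=1) aligns rows on the index, so this assumes the rows of
# df3_X are in the same order as the rows of df3 (and therefore df3_Y)
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3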

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4825,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.081445
4826,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
4827,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.537602
4828,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.958607


In [None]:
# writing dataframe to csv file
dataset3.to_csv('bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)