# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation (Modified Version for Predicting Unknown SMILES)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

Modified by quantaosun@gmail.com


In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will prepare these as a dataset for the subsequent model prediction in notebook 5.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2021-11-26 01:58:43--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2021-11-26 01:58:44--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2021-11-26 01:58:44 (140 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2021-11-26 01:58:44--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load the unknown SMILES data from "example.txt"**

In the original Part 3, this step downloads the curated ChEMBL bioactivity data pre-processed in Parts 1 and 2 of this Bioinformatics Project series (**bioactivity_data_3class_pIC50.csv**), which contains the pIC50 values used for building a regression model. In this modified version we instead load **example.txt**, a text file of SMILES strings whose activity is unknown and will be predicted later.

In [3]:
import pandas as pd

# We will convert a raw text file containing two SMILES strings into a standard DataFrame via an intermediate CSV file

In [4]:
df3 = pd.read_csv('/content/example.txt')

In [5]:
df3

Unnamed: 0,smiles
0,CN(N=C1)C=C1C2=CC=C(OC(F)=C3C4=CN(CC5=CC=CC=C5...
1,CN(N=C1)C=C1C2=CC=C(OC(C)=C3C4=CN(CC5=CC=CC=C5...
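The `read_csv` call above produces a `smiles` column, which suggests that `example.txt` starts with a header line reading `smiles`. If your own file holds only SMILES strings with no header, the first molecule would be consumed as the column name. The sketch below shows one way to guard against that. The in-memory text, the placeholder SMILES (ethanol and benzene) and the variable names are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Stand-in for a header-less text file with one SMILES string per line.
# (Hypothetical content; in practice pass the path to your own file instead.)
raw_text = io.StringIO("CCO\nc1ccccc1\n")

# header=None stops pandas from using the first SMILES as the column name;
# names=[...] supplies the 'smiles' label that the rest of the notebook expects.
df_smiles = pd.read_csv(raw_text, header=None, names=["smiles"])

# Strip stray whitespace and drop empty entries before descriptor calculation
df_smiles["smiles"] = df_smiles["smiles"].str.strip()
df_smiles = df_smiles[df_smiles["smiles"] != ""].reset_index(drop=True)

print(df_smiles)
```

If the file does have a header line, the plain `pd.read_csv(path)` call used in this notebook is sufficient.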


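Next, the data will be saved as a CSV file and then imported back again to tidy up the structure. Note that this round trip writes the index to the file, so an extra `Unnamed: 0` column appears on reloading; only the `smiles` column is selected afterwards.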

In [6]:
df3.to_csv('example.csv')

In [7]:
df4=pd.read_csv('example.csv')

In [8]:
df4

Unnamed: 0.1,Unnamed: 0,smiles
0,0,CN(N=C1)C=C1C2=CC=C(OC(F)=C3C4=CN(CC5=CC=CC=C5...
1,1,CN(N=C1)C=C1C2=CC=C(OC(C)=C3C4=CN(CC5=CC=CC=C5...


In [9]:
selection=['smiles']
df4[selection]

Unnamed: 0,smiles
0,CN(N=C1)C=C1C2=CC=C(OC(F)=C3C4=CN(CC5=CC=CC=C5...
1,CN(N=C1)C=C1C2=CC=C(OC(C)=C3C4=CN(CC5=CC=CC=C5...


In [10]:
df4_selection = df4[selection]
df4_selection

Unnamed: 0,smiles
0,CN(N=C1)C=C1C2=CC=C(OC(F)=C3C4=CN(CC5=CC=CC=C5...
1,CN(N=C1)C=C1C2=CC=C(OC(C)=C3C4=CN(CC5=CC=CC=C5...
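The save-and-reload step is optional: selecting the `smiles` column from the freshly loaded DataFrame should give the same table. The sketch below compares both routes on placeholder SMILES (ethanol and benzene, not the molecules in this notebook), using an in-memory buffer in place of `example.csv`.

```python
import io
import pandas as pd

# Placeholder data standing in for the DataFrame loaded from example.txt
df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"]})

# Route 1: select the column directly
direct = df[["smiles"]]

# Route 2: write to CSV (index included, as in the notebook), read back, then select
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)        # gains an extra 'Unnamed: 0' column
round_trip = reloaded[["smiles"]]

# Both routes end with the same single-column table
same = direct["smiles"].tolist() == round_trip["smiles"].tolist()
print(same)
```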


In [11]:
df4_selection.to_csv('molecule.smi', sep= '\t', index=False, header=False)

In [12]:
! cat molecule.smi | head -5

CN(N=C1)C=C1C2=CC=C(OC(F)=C3C4=CN(CC5=CC=CC=C5)N=C4)C3=N2
CN(N=C1)C=C1C2=CC=C(OC(C)=C3C4=CN(CC5=CC=CC=C5)N=C4)C3=N2


In [13]:
! cat molecule.smi | wc -l

2


## **Calculate fingerprint descriptors for the unknown SMILES**


### **Calculate PaDEL descriptors**

In [14]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [15]:
! bash padel.sh

Processing AUTOGEN_molecule_1 in molecule.smi (1/2). 
Processing AUTOGEN_molecule_2 in molecule.smi (2/2). 
Descriptor calculation completed in 2.570 secs . Average speed: 1.29 s/mol.
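Because `molecule.smi` contains only SMILES strings, PaDEL labels the molecules `AUTOGEN_molecule_1`, `AUTOGEN_molecule_2`, and so on. In the original Part 3 notebook, a second tab-separated column (the ChEMBL ID) was written to the `.smi` file and was used as the molecule name. The sketch below writes such a two-column file; the `mol_id` column, its values and the temporary output path are hypothetical, and the naming behaviour is worth confirming on your own PaDEL run.

```python
import os
import tempfile
import pandas as pd

# Hypothetical identifiers; replace with whatever labels suit your compounds
df_named = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1"],
    "mol_id": ["query_1", "query_2"],
})

# Write SMILES first, then the identifier, separated by a tab and without a header
out_path = os.path.join(tempfile.mkdtemp(), "molecule.smi")
df_named[["smiles", "mol_id"]].to_csv(out_path, sep="\t", index=False, header=False)

# Read the file back to check its layout
with open(out_path) as fh:
    smi_lines = fh.read().splitlines()
print(smi_lines)
```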


In [16]:
! ls -l

total 25212
-rw-r--r-- 1 root root    14914 Nov 26 01:59 descriptors_output.csv
-rw-r--r-- 1 root root      128 Nov 26 01:59 example.csv
-rw-r--r-- 1 root root      123 Nov 26 01:57 example.txt
drwxr-xr-x 3 root root     4096 Nov 26 01:58 __MACOSX
-rw-r--r-- 1 root root      116 Nov 26 01:59 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Nov 26 01:58 padel.sh
-rw-r--r-- 1 root root 25768637 Nov 26 01:58 padel.zip
drwxr-xr-x 1 root root     4096 Nov 18 14:36 sample_data


## **Preparing the X data matrix (there is no Y data matrix, since Y is what we will predict later)**

### **X data matrix**

In [17]:
df3_X = pd.read_csv('descriptors_output.csv')

In [18]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,...,PubchemFP841,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,AUTOGEN_molecule_1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# You may have noticed that we now have a long set of 881 fingerprint descriptors (PubchemFP0 to PubchemFP880), whereas the model built earlier used far fewer descriptors. This mismatch will be addressed later.
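How the mismatch is resolved in the later notebook is not shown here. One common approach, assuming you saved the list of descriptor columns the model was trained on, is to select exactly those columns, in the same order, from the new descriptor table. The sketch below uses a toy four-bit table, and `train_columns` is a hypothetical list.

```python
import pandas as pd

# Toy stand-in for the PaDEL output of the unknown molecules
df_new = pd.DataFrame({
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
    "PubchemFP2": [1, 0],
    "PubchemFP3": [0, 0],
})

# Hypothetical: column names kept during model building, in training order
train_columns = ["PubchemFP2", "PubchemFP0"]

# Fail loudly if the new table lacks a descriptor the model expects
missing = [col for col in train_columns if col not in df_new.columns]
if missing:
    raise ValueError(f"Descriptors missing from new data: {missing}")

# Keep only the training descriptors, in the order the model saw them
X_new = df_new[train_columns]
print(X_new)
```

Selecting by name rather than by position keeps the features aligned even if the column order of the two tables differs.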

In [19]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,PubchemFP39,...,PubchemFP841,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
# Save the descriptor matrix (df3_X); df3 holds only the raw SMILES strings
df3_X.to_csv('example_smile_with_descriptors_fp.csv')
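The output file name suggests a table holding both the SMILES strings and their descriptors. A minimal sketch of building such a table follows, assuming the descriptor rows come back in the same order as the input SMILES, which is worth verifying on your own output. The toy data and variable names are placeholders.

```python
import pandas as pd

# Toy stand-ins for the SMILES table and the PaDEL descriptor output
smiles_df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"]})
fp_df = pd.DataFrame({
    "Name": ["AUTOGEN_molecule_1", "AUTOGEN_molecule_2"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
})

# A row-count mismatch would mean some molecules were not processed
if len(smiles_df) != len(fp_df):
    raise ValueError("Number of SMILES and descriptor rows differ")

# Place the SMILES column next to the fingerprint bits (assumes identical row order)
combined = pd.concat(
    [smiles_df.reset_index(drop=True),
     fp_df.drop(columns=["Name"]).reset_index(drop=True)],
    axis=1,
)
print(combined)
```

When passing data to a model, drop the `smiles` column again so that only the numeric descriptors remain.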

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**