# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then combine them with the bioactivity values into a dataset for subsequent model building in Part 4.

---

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Download PaDEL-Descriptor**

In [None]:
# ! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
# ! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [None]:
# ! unzip padel.zip

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use for building a regression model.
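
As a reminder of what the pIC50 column holds, here is a minimal sketch of the conversion, assuming IC50 values reported in nanomolar. The function name `ic50_nm_to_pic50` and the column name `standard_value_nM` are hypothetical and are not taken from the earlier parts of this series.

```python
import numpy as np
import pandas as pd

def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 values given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = np.asarray(ic50_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ic50_molar)

# Toy values only (hypothetical column name); not data from this project
toy = pd.DataFrame({"standard_value_nM": [100000.0, 1000.0, 10.0]})
toy["pIC50"] = ic50_nm_to_pic50(toy["standard_value_nM"])
print(toy)
```

A higher pIC50 corresponds to a lower IC50, that is, a more potent compound.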

In [None]:
# ! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

In [2]:
import pandas as pd

In [3]:
df3 = pd.read_csv('/content/drive/MyDrive/Project_Msc/Project/pIC50_data/Leishmania_04_bioactivity_data_3class_pIC50.csv')

In [4]:
df3 = df3.drop(columns="Unnamed: 0")  # drop() returns a new DataFrame, so keep the result
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL164580,C[N+]1(CCOP(=O)([O-])OCCOC2CCCCC2)CCOCC1,inactive,351.380,1.31410,0.0,6.0,4.000000
1,CHEMBL166314,C[N+](C)(C)CCOP(=O)([O-])OCCCCCC12CC3CC(CC(C3)...,intermediate,387.501,3.97100,0.0,4.0,5.395774
2,CHEMBL165097,C[N+](C)(C)CCOP(=O)([O-])OCCCCC=C1CCCCC1,inactive,333.409,3.25500,0.0,4.0,4.000000
3,CHEMBL166092,C[N+](C)(C)CCOP(=O)([O-])OCCCCCCCCCCC=C1C2CC3C...,intermediate,469.647,6.08760,0.0,4.0,5.500313
4,CHEMBL349670,C[N+](C)(C)CCOP(=O)([O-])OCCOc1ccc2ccccc2c1,inactive,353.355,2.42640,0.0,5.0,4.000000
...,...,...,...,...,...,...,...,...
6486,CHEMBL4873840,N=C(N)Nc1ccc2[nH]c(-c3csc4ccccc34)nc2c1,active,307.382,3.74997,4.0,3.0,7.232102
6487,CHEMBL4876357,N=C(N)Nc1ccc2[nH]c(-c3ccc(-c4cccs4)s3)nc2c1,active,339.449,4.32527,4.0,4.0,7.423659
6488,CHEMBL4862955,N=C(N)Nc1ccc2[nH]c(-c3ccc(-c4ccccc4)s3)nc2c1,active,333.420,4.26377,4.0,3.0,7.356547
6489,CHEMBL4849996,N=C(N)Nc1ccc2[nH]c(-c3cc(-c4ccc(Cl)cc4)on3)nc2c1,active,352.785,3.84367,4.0,4.0,8.113509


In [5]:
y = df3.pIC50  # or df3['class'] for a classification target
y

0       4.000000
1       5.395774
2       4.000000
3       5.500313
4       4.000000
          ...   
6486    7.232102
6487    7.423659
6488    7.356547
6489    8.113509
6490    7.044793
Name: pIC50, Length: 6491, dtype: float64

In [None]:
_Y = df3.pIC50  # same series as y; concatenating both later yields a duplicate pIC50 column

In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [7]:
! head -5 molecule.smi

C[N+]1(CCOP(=O)([O-])OCCOC2CCCCC2)CCOCC1	CHEMBL164580
C[N+](C)(C)CCOP(=O)([O-])OCCCCCC12CC3CC(CC(C3)C1)C2	CHEMBL166314
C[N+](C)(C)CCOP(=O)([O-])OCCCCC=C1CCCCC1	CHEMBL165097
C[N+](C)(C)CCOP(=O)([O-])OCCCCCCCCCCC=C1C2CC3CC(C2)CC1C3	CHEMBL166092
C[N+](C)(C)CCOP(=O)([O-])OCCOc1ccc2ccccc2c1	CHEMBL349670


In [8]:
! wc -l < molecule.smi

6491
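
Beyond counting lines, the exported file can be read back and compared with the source table. The sketch below does this on a toy table written to a temporary file; the SMILES strings and IDs are made up for illustration, and the checks (same row count, no missing values, unique IDs) are one possible choice rather than a required step.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3_selection (hypothetical IDs)
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "molecule_chembl_id": ["TOY1", "TOY2", "TOY3"],
})

with tempfile.TemporaryDirectory() as tmp:
    smi_path = os.path.join(tmp, "toy.smi")
    # Same export settings as above: tab-separated, no index, no header
    toy.to_csv(smi_path, sep="\t", index=False, header=False)
    # Read it back with matching settings so the first row is not taken as a header
    back = pd.read_csv(smi_path, sep="\t", header=None,
                       names=["canonical_smiles", "molecule_chembl_id"])

same_length = len(back) == len(toy)
no_missing = not back.isna().any().any()
unique_ids = back["molecule_chembl_id"].is_unique
print(same_length, no_missing, unique_ids)
```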


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
# ! cat padel.sh

In [None]:
# ! bash padel.sh

In [None]:
# !cat /content/drive/MyDrive/Project_Msc/Project/fingerprints_xml.zip.zip

In [None]:
! unzip /content/drive/MyDrive/Project_Msc/Project/fingerprints_xml.zip

Archive:  /content/drive/MyDrive/Project_Msc/Project/fingerprints_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFingerprinter.xml  
  inflating: EStateFingerprinter.xml  
  inflating: ExtendedFingerprinter.xml  
  inflating: Fingerprinter.xml       
  inflating: GraphOnlyFingerprinter.xml  
  inflating: KlekotaRothFingerprintCount.xml  
  inflating: KlekotaRothFingerprinter.xml  
  inflating: MACCSFingerprinter.xml  
  inflating: PubchemFingerprinter.xml  
  inflating: SubstructureFingerprintCount.xml  
  inflating: SubstructureFingerprinter.xml  


In [None]:
import glob
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

In [11]:
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']


In [None]:
fp = dict(zip(FP_list, xml_files))
fp

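{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}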
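
Because `zip` pairs the two lists purely by position and silently stops at the shorter one, a mismatch in length or order produces a wrong label-to-file mapping without any error. The sketch below adds a length check before building the dictionary; the helper name `pair_fingerprints` is hypothetical, and the two lists are copied from the cells above.

```python
FP_list = ['AtomPairs2DCount', 'AtomPairs2D', 'EState', 'CDKextended',
           'CDK', 'CDKgraphonly', 'KlekotaRothCount', 'KlekotaRoth',
           'MACCS', 'PubChem', 'SubstructureCount', 'Substructure']

xml_files = sorted([
    'AtomPairs2DFingerprintCount.xml', 'AtomPairs2DFingerprinter.xml',
    'EStateFingerprinter.xml', 'ExtendedFingerprinter.xml',
    'Fingerprinter.xml', 'GraphOnlyFingerprinter.xml',
    'KlekotaRothFingerprintCount.xml', 'KlekotaRothFingerprinter.xml',
    'MACCSFingerprinter.xml', 'PubchemFingerprinter.xml',
    'SubstructureFingerprintCount.xml', 'SubstructureFingerprinter.xml',
])

def pair_fingerprints(labels, files):
    """Pair labels with XML files by position, refusing lists of different length."""
    if len(labels) != len(files):
        raise ValueError(
            f"{len(labels)} labels but {len(files)} XML files; the mapping would be truncated"
        )
    return dict(zip(labels, files))

fp = pair_fingerprints(FP_list, xml_files)
for label, xml_name in fp.items():
    print(f"{label:18s} -> {xml_name}")
```

Printing the pairs makes it easy to confirm by eye that each label sits next to the intended XML file, since the pairing still depends on the sort order of the file names.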

In [None]:
# molecule.smi is tab-separated and has no header row
mol = pd.read_csv('molecule.smi', sep='\t', header=None, names=['canonical_smiles', 'molecule_chembl_id'])
mol

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,C[N+]1(CCOP(=O)([O-])OCCOC2CCCCC2)CCOCC1,CHEMBL164580
1,C[N+](C)(C)CCOP(=O)([O-])OCCCCCC12CC3CC(CC(C3)...,CHEMBL166314
2,C[N+](C)(C)CCOP(=O)([O-])OCCCCC=C1CCCCC1,CHEMBL165097
3,C[N+](C)(C)CCOP(=O)([O-])OCCCCCCCCCCC=C1C2CC3C...,CHEMBL166092
4,C[N+](C)(C)CCOP(=O)([O-])OCCOc1ccc2ccccc2c1,CHEMBL349670
...,...,...
6486,N=C(N)Nc1ccc2[nH]c(-c3csc4ccccc34)nc2c1,CHEMBL4873840
6487,N=C(N)Nc1ccc2[nH]c(-c3ccc(-c4cccs4)s3)nc2c1,CHEMBL4876357
6488,N=C(N)Nc1ccc2[nH]c(-c3ccc(-c4ccccc4)s3)nc2c1,CHEMBL4862955
6489,N=C(N)Nc1ccc2[nH]c(-c3cc(-c4ccc(Cl)cc4)on3)nc2c1,CHEMBL4849996


In [None]:
!pip install padelpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting padelpy
  Downloading padelpy-0.1.12-py2.py3-none-any.whl (20.9 MB)
Installing collected packages: padelpy
Successfully installed padelpy-0.1.12


In [None]:
from padelpy import padeldescriptor

for fingerprint in FP_list:
  # Output file for this fingerprint type, e.g. 'PubChem.csv'
  fingerprint_output_file = fingerprint + '.csv'
  # XML file that tells PaDEL which fingerprint to compute
  fingerprint_descriptortypes = fp[fingerprint]

  padeldescriptor(mol_dir='molecule.smi',
                  d_file=fingerprint_output_file,
                  descriptortypes=fingerprint_descriptortypes,
                  detectaromaticity=True,
                  standardizenitro=True,
                  standardizetautomers=True,
                  threads=10,
                  removesalt=True,
                  log=True,
                  fingerprints=True)

  # Load the computed fingerprints and drop the molecule name column
  descriptors = pd.read_csv(fingerprint_output_file)
  X = descriptors.drop('Name', axis=1)

  # Append the target. y and _Y are both pIC50, so the saved file
  # contains the column twice (read back as 'pIC50' and 'pIC50.1').
  model_dataset = pd.concat([X, y, _Y], axis=1)
  model_dataset.to_csv('/content/drive/MyDrive/Project_Msc/Project/pIC50_data/Fingerprints_descriptors/model_dataset_' + fingerprint_output_file, index=False)

In [None]:
# Re-run the calculation for a single fingerprint type
fingerprint = "Substructure"  # user input

fingerprint_output_file = fingerprint + '.csv'  # e.g. 'Substructure.csv'
fingerprint_descriptortypes = fp[fingerprint]   # XML file for this fingerprint
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=10,
                removesalt=True,
                log=True,
                fingerprints=True)

# Load the computed fingerprints and drop the molecule name column
descriptors = pd.read_csv(fingerprint_output_file)
X = descriptors.drop('Name', axis=1)

# Append the target. y and _Y are both pIC50, so the saved file contains the column twice.
model_dataset = pd.concat([X, y, _Y], axis=1)
model_dataset.to_csv('/content/drive/MyDrive/Project_Msc/Project/pIC50_data/Fingerprints_descriptors/model_dataset_' + fingerprint_output_file, index=False)
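
The cells above join the fingerprints and the pIC50 values by row position. That is only safe if the descriptor file lists the molecules in the same order as `molecule.smi`, which is worth checking on your own output, especially when several threads are used. A more defensive alternative is to join on the molecule identifier. The sketch below shows the idea on toy data; it assumes the `Name` column holds the ID written as the second column of `molecule.smi`, and the IDs and values are made up.

```python
import pandas as pd

# Toy descriptor table whose rows are NOT in input order (hypothetical IDs)
descriptors = pd.DataFrame({
    "Name": ["TOY3", "TOY1", "TOY2"],
    "FP1": [1, 0, 1],
    "FP2": [0, 0, 1],
})

# Toy activity table in the original input order
activity = pd.DataFrame({
    "molecule_chembl_id": ["TOY1", "TOY2", "TOY3"],
    "pIC50": [4.0, 5.4, 7.2],
})

# Join on the identifier; validate raises if an ID appears more than once on either side
merged = descriptors.merge(
    activity,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Keep only the fingerprint bits and the target
model_dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(model_dataset)
```

If the real table contains repeated ChEMBL IDs, `validate="one_to_one"` will raise an error, which is a prompt to deduplicate before merging.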

In [None]:
X

In [None]:
! ls -l

In [9]:
# df3_Y = df3['pIC50']
df3_Y=df3['class']

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [15]:
for fingerprint in FP_list:
  fingerprint_output_file = fingerprint + '.csv'

  # Reload the descriptor table saved for this fingerprint type
  df3_X = pd.read_csv('/content/drive/MyDrive/Project_Msc/Project/pIC50_data/Fingerprints_descriptors/model_dataset_' + fingerprint_output_file)
  # Append the bioactivity class and save one dataset per fingerprint type
  dataset3 = pd.concat([df3_X, df3_Y], axis=1)
  dataset3.to_csv('/content/drive/MyDrive/Project_Msc/Project/pIC50_data/Leishmania_06_bioactivity_data_3class_' + fingerprint + '_fp.csv', index=False)

In [16]:
df3_X

Unnamed: 0,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,FP10,...,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024,pIC50,pIC50.1
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,4.000000,4.000000
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,5.395774,5.395774
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,4.000000,4.000000
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.500313,5.500313
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4.000000,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6486,0,1,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,7.232102,7.232102
6487,0,1,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,7.423659,7.423659
6488,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,7.356547,7.356547
6489,1,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,8.113509,8.113509


In [18]:
# df3_X = df3_X.drop(columns=['Name'])
df3_X = df3_X.drop(columns=['pIC50.1'])
df3_X

Unnamed: 0,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,FP10,...,FP1015,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024
0,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6486,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
6487,0,1,1,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6488,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
6489,1,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the bioactivity class**

In [19]:
# df3_Y = df3['pIC50']
df3_Y = df3['class']
df3_Y

0           inactive
1       intermediate
2           inactive
3       intermediate
4           inactive
            ...     
6486          active
6487          active
6488          active
6489          active
6490          active
Name: class, Length: 6491, dtype: object

## **Combining X and Y variable**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

In [None]:
dataset3.to_csv('Leishmania_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**