<a href="https://colab.research.google.com/github/stawiskm/QSAR_Modelbuilding_amesTest/blob/main/AMES_Test-Part-3-Descriptor-Dataset-Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Ames test [Part 3] Descriptor Calculation and Dataset Preparation**

Marc Jermann

inspired by [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we build a real-life **data science project**: a machine learning model trained on ChEMBL bioactivity data.

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then combine them with the class labels into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-05-18 14:56:54--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-05-18 14:56:55--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-05-18 14:56:56 (204 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-05-18 14:56:56--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **QSAR_ames_lipinskydata.csv** file, which contains the canonical SMILES, the ChEMBL ID, and the `class` label of each compound. The `class` column is the target used for model building.

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/stawiskm/QSAR_Modelbuilding_amesTest/main/data/QSAR_ames_lipinskydata.csv')

In [5]:
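# Write a SMILES file for PaDEL: one molecule per line, SMILES string first, then its ID
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
# Tab-separated, without index or header row
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)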

In [6]:
! cat molecule.smi | head -5

CC(C)(N)C(=O)N[C@H](COCc1ccccc1)c1nnnn1CCOC(=O)NCCCCO	CHEMBL398372
C=C1C(=O)O[C@@H]2C[C@@]3(C)CCCC(=C)[C@@H]3C[C@H]12	CHEMBL137803
C=C1CCC[C@]2(C)C[C@H]3OC(=O)[C@@H](C)[C@H]3C[C@@H]12	CHEMBL486423
O=c1ccc2ccccc2o1	CHEMBL6466
COc1c2ccoc2cc2oc(=O)ccc12	CHEMBL24171


In [7]:
! cat molecule.smi | wc -l

655
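
Before handing the file to PaDEL, it can help to check the table it was written from. The following is a minimal sketch, not part of the original notebook: `summarize_smiles_table` is a hypothetical helper, and the toy table stands in for `df_selection` (the ID `CHEMBL0` is a made-up placeholder). It counts rows, missing SMILES strings, and repeated IDs.

```python
import pandas as pd

def summarize_smiles_table(table: pd.DataFrame) -> dict:
    """Count rows, missing SMILES, and repeated IDs in a SMILES/ID table."""
    return {
        "rows": len(table),
        "missing_smiles": int(table["canonical_smiles"].isna().sum()),
        "duplicate_ids": int(table["molecule_chembl_id"].duplicated().sum()),
    }

# Toy stand-in for df_selection: two rows from the preview above, one row
# with a missing SMILES (placeholder ID), and one repeated ID.
toy = pd.DataFrame({
    "canonical_smiles": [
        "O=c1ccc2ccccc2o1",
        "COc1c2ccoc2cc2oc(=O)ccc12",
        None,
        "O=c1ccc2ccccc2o1",
    ],
    "molecule_chembl_id": ["CHEMBL6466", "CHEMBL24171", "CHEMBL0", "CHEMBL6466"],
})

summary = summarize_smiles_table(toy)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_smiles_table(df_selection)`; a non-zero count in either of the last two fields would be worth resolving before computing fingerprints.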


## **Calculate fingerprint descriptors**


In [8]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
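
The script hard-codes the PubChem fingerprint definition (`PubchemFingerprinter.xml`) and the output name. As a hedged sketch, the same command can be assembled in Python so that another fingerprint file from the unpacked archive, for example `MACCSFingerprinter.xml`, can be swapped in. `build_padel_command` and the output name `descriptors_maccs.csv` are hypothetical; every flag is copied unchanged from `padel.sh`.

```python
import shlex

def build_padel_command(fingerprint_xml: str, output_csv: str) -> list:
    """Assemble the PaDEL-Descriptor call from padel.sh as an argument list."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        # Fingerprint definition file shipped in the PaDEL-Descriptor folder
        "-descriptortypes", f"./PaDEL-Descriptor/{fingerprint_xml}",
        # Input directory (where molecule.smi lives) and output CSV
        "-dir", "./",
        "-file", output_csv,
    ]

# Hypothetical output file name; only the command line is printed here.
cmd = build_padel_command("MACCSFingerprinter.xml", "descriptors_maccs.csv")
print(shlex.join(cmd))
```

The list is only printed, not executed. Passing it to `subprocess.run(cmd, check=True)` would run it, assuming Java and the unpacked `PaDEL-Descriptor` folder are available in the working directory.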


In [9]:
! bash padel.sh

Processing CHEMBL137803 in molecule.smi (2/655). 
Processing CHEMBL398372 in molecule.smi (1/655). 
Processing CHEMBL486423 in molecule.smi (3/655). Average speed: 4.67 s/mol.
Processing CHEMBL6466 in molecule.smi (4/655). Average speed: 2.44 s/mol.
Processing CHEMBL24171 in molecule.smi (5/655). Average speed: 1.88 s/mol.
Processing CHEMBL416 in molecule.smi (6/655). Average speed: 1.52 s/mol.
Processing CHEMBL52229 in molecule.smi (7/655). Average speed: 1.39 s/mol.
Processing CHEMBL453805 in molecule.smi (8/655). Average speed: 1.25 s/mol.
Processing CHEMBL164660 in molecule.smi (9/655). Average speed: 1.17 s/mol.
Processing CHEMBL51628 in molecule.smi (10/655). Average speed: 1.14 s/mol.
Processing CHEMBL447467 in molecule.smi (11/655). Average speed: 1.07 s/mol.
Processing CHEMBL451631 in molecule.smi (13/655). Average speed: 0.90 s/mol.
Processing CHEMBL450641 in molecule.smi (12/655). Average speed: 0.97 s/mol.
Processing CHEMBL485168 in molecule.smi (14/655). Average speed: 0.8
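
The log shows that molecules are not processed strictly in input order (2 before 1, 13 before 12), so the rows of `descriptors_output.csv` are best matched to the input by name rather than by position. A minimal sketch of such a check follows; `missing_from_output` is a hypothetical helper, and the toy frame only imitates the output layout (a `Name` column plus fingerprint bit columns, whose names here are illustrative).

```python
import pandas as pd

def missing_from_output(input_ids, features: pd.DataFrame, name_col: str = "Name") -> list:
    """Return the input IDs that have no row in the descriptor table, in input order."""
    produced = set(features[name_col])
    return [mol_id for mol_id in input_ids if mol_id not in produced]

# Toy data: three input IDs, but only two rows came back, in a different order.
toy_input_ids = ["CHEMBL398372", "CHEMBL137803", "CHEMBL486423"]
toy_features = pd.DataFrame({
    "Name": ["CHEMBL137803", "CHEMBL398372"],
    "PubchemFP0": [1, 0],  # illustrative bit columns
    "PubchemFP1": [0, 0],
})

missing = missing_from_output(toy_input_ids, toy_features)
print(missing)
```

With the real files, the input IDs would come from `df_selection['molecule_chembl_id']` and the features from `pd.read_csv('descriptors_output.csv')`.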

## **Combine features with class**

In [16]:
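# Load the PaDEL output and rename its "Name" column so it matches the ID column in df
df_features = pd.read_csv('descriptors_output.csv')
df_features = df_features.rename({"Name":"molecule_chembl_id"},axis=1)

# Keep only the class label and the ID needed to join it to the fingerprints
target = ['class','molecule_chembl_id']
df_target = df[target]

# Inner join on the ID (the pd.merge default): only molecules present in both tables are kept
df_merge = pd.merge(df_features,df_target,on="molecule_chembl_id")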

In [17]:
df_merge.to_csv('QSAR_ames_padeldata.csv', index=False)
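
Because the merge is an inner join, a molecule that is missing from either table disappears silently. The sketch below shows one way to audit this, using the `indicator` and `validate` options of `pd.merge` on toy tables. The class values and fingerprint column are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Toy stand-ins for df_features and df_target
toy_features = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL6466", "CHEMBL24171"],
    "PubchemFP0": [1, 0],
})
toy_target = pd.DataFrame({
    "class": [1, 0, 1],
    "molecule_chembl_id": ["CHEMBL6466", "CHEMBL24171", "CHEMBL416"],
})

# Outer join keeps every row; "_merge" records which table(s) it came from.
# validate="one_to_one" raises pandas.errors.MergeError if an ID is repeated.
audit = pd.merge(
    toy_features,
    toy_target,
    on="molecule_chembl_id",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Rows that an inner join would have dropped
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["molecule_chembl_id", "_merge"]])
```

If `unmatched` is empty for the real tables, the inner join used above loses nothing; otherwise the listed IDs show which molecules lack either fingerprints or a class label.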