<a href="https://colab.research.google.com/github/samservo09/bioinformatics-bipolar-drug-discovery/blob/main/3-descriptor-calculation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bioinformatics: Drug discovery on CaM-kinase kinase beta protein

## Download PaDEL-Descriptor

**PaDEL-Descriptor** is a software tool for calculating molecular descriptors and fingerprints from chemical structures.

These descriptors and fingerprints serve as the input features for **quantitative structure-activity relationship (QSAR)** models, which are used in drug discovery to predict the biological activity of molecules from their structure.

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-10-12 08:03:15--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-10-12 08:03:15--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-10-12 08:03:16 (361 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-10-12 08:03:16--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf
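
The `unzip` output above interleaves the PaDEL files with macOS metadata (`__MACOSX/...` entries and `.DS_Store`). As a minimal sketch, assuming you would rather not extract that clutter, the standard-library `zipfile` module can filter it out. The helper name `extract_without_macos_metadata` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import PurePosixPath


def extract_without_macos_metadata(archive_path, dest_dir="."):
    """Extract a zip archive, skipping macOS metadata entries.

    Returns the list of member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            member = PurePosixPath(name)
            # Skip the __MACOSX resource-fork tree and Finder's .DS_Store files.
            if "__MACOSX" in member.parts or member.name == ".DS_Store":
                continue
            archive.extract(name, path=dest_dir)
            extracted.append(name)
    return extracted
```

Calling `extract_without_macos_metadata("padel.zip")` in place of `! unzip padel.zip` should leave the same `PaDEL-Descriptor/` directory without the metadata files.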

## Load bioactivity data

In [3]:
! wget https://raw.githubusercontent.com/samservo09/bioinformatics-bipolar-drug-discovery/refs/heads/main/data/CaMKK2_bioactivity_data_3class_pIC50.csv

--2024-10-12 08:21:11--  https://raw.githubusercontent.com/samservo09/bioinformatics-bipolar-drug-discovery/refs/heads/main/data/CaMKK2_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17617 (17K) [text/plain]
Saving to: ‘CaMKK2_bioactivity_data_3class_pIC50.csv’


2024-10-12 08:21:11 (98.1 MB/s) - ‘CaMKK2_bioactivity_data_3class_pIC50.csv’ saved [17617/17617]



In [4]:
import pandas as pd

In [5]:
df3 = pd.read_csv('CaMKK2_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

| Unnamed: 0 | molecule_chembl_id | canonical_smiles | bioactivity_class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL319620 | O=C(O)c1cc(NCc2cc(O)ccc2O)ccc1O | active | 275.260 | 2.11370 | 5.0 | 5.0 | 6.698970 |
| 1 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | active | 374.352 | 3.38100 | 2.0 | 5.0 | 10.397940 |
| 2 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | active | 374.352 | 3.38100 | 2.0 | 5.0 | 8.000000 |
| 3 | CHEMBL1234833 | CC(C)c1cnn2c(NCc3ccccc3)cc(N[C@@H](CO)[C@H](O)... | intermediate | 385.468 | 1.59080 | 5.0 | 8.0 | 5.610834 |
| 4 | CHEMBL2205766 | CC(C)(C)NS(=O)(=O)c1cncc(-c2ccn3nc(N)nc3c2)c1 | inactive | 346.416 | 1.45030 | 2.0 | 7.0 | 5.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | inactive | 349.773 | 5.51340 | 1.0 | 3.0 | 5.000000 |
| 129 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | intermediate | 397.474 | 6.82602 | 1.0 | 3.0 | 5.795880 |
| 130 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | inactive | 349.773 | 5.51340 | 1.0 | 3.0 | 4.568636 |
| 131 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | active | 397.474 | 6.82602 | 1.0 | 3.0 | 7.522879 |


In [7]:
# Keep only the SMILES string and the ChEMBL ID, in that order,
# and write them as a tab-separated .smi file with no header or index
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
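
The cell above writes one molecule per line as `SMILES<TAB>ID`. As a hedged sketch, the file can be read back and checked against that layout before it is handed to a descriptor tool. The function name `read_smi` is hypothetical, and the strict two-field check is an assumption about what counts as a well-formed line here, not a rule taken from the original notebook.

```python
def read_smi(path):
    """Read a tab-separated .smi file into a list of (smiles, molecule_id) pairs.

    Raises ValueError on any line that is not exactly 'SMILES<TAB>ID'.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            # Expect exactly two non-empty fields per line.
            if len(fields) != 2 or not fields[0] or not fields[1]:
                raise ValueError(
                    f"line {line_number}: expected 'SMILES<TAB>ID', got {fields!r}"
                )
            records.append((fields[0], fields[1]))
    return records
```

For example, `read_smi('molecule.smi')[0]` should return the first SMILES string paired with its ChEMBL ID.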

In [8]:
# Preview the first five lines of the .smi file from the shell
! head -5 molecule.smi

O=C(O)c1cc(NCc2cc(O)ccc2O)ccc1O	CHEMBL319620
CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21	CHEMBL265470
CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21	CHEMBL265470
CC(C)c1cnn2c(NCc3ccccc3)cc(N[C@@H](CO)[C@H](O)CO)nc12	CHEMBL1234833
CC(C)(C)NS(=O)(=O)c1cncc(-c2ccn3nc(N)nc3c2)c1	CHEMBL2205766


In [10]:
# Count the lines (one molecule per line) in the .smi file
! cat molecule.smi | wc -l

133
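
The five-line preview earlier lists CHEMBL265470 twice, so the line count is not necessarily the number of distinct compounds. A minimal sketch of how the two could be compared in Python follows. The helper `summarize_ids` is hypothetical, and the sample lines are copied from the preview output rather than read from the full file.

```python
from collections import Counter


def summarize_ids(lines):
    """Return (record count, unique ID count, {ID: count} for repeated IDs)."""
    # The identifier is the second tab-separated field; blank lines are ignored.
    ids = [line.rstrip("\n").split("\t")[1] for line in lines if line.strip()]
    counts = Counter(ids)
    repeated = {mol_id: n for mol_id, n in counts.items() if n > 1}
    return len(ids), len(counts), repeated


# The five lines shown by the preview cell.
sample = [
    "O=C(O)c1cc(NCc2cc(O)ccc2O)ccc1O\tCHEMBL319620\n",
    "CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21\tCHEMBL265470\n",
    "CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21\tCHEMBL265470\n",
    "CC(C)c1cnn2c(NCc3ccccc3)cc(N[C@@H](CO)[C@H](O)CO)nc12\tCHEMBL1234833\n",
    "CC(C)(C)NS(=O)(=O)c1cncc(-c2ccn3nc(N)nc3c2)c1\tCHEMBL2205766\n",
]

total, unique, repeated = summarize_ids(sample)
print(total, unique, repeated)  # 5 4 {'CHEMBL265470': 2}
```

To run it on the whole file, pass an open file handle instead of `sample`, for example `summarize_ids(open('molecule.smi'))`.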


## Calculate fingerprint descriptors

## Preparing the X and Y Data Matrices

## Y Variable

## Combining X and Y variables