<a href="https://colab.research.google.com/github/samservo09/bioinformatics-bipolar-drug-discovery/blob/main/2-eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bioinformatics: Drug discovery on CaM-kinase kinase beta protein

## Install necessary packages/libraries

**RDKit** is a collection of open-source cheminformatics and machine-learning software written in C++ and Python.

In [4]:
# install Miniconda (Python 3.7) into /usr/local, then RDKit from the rdkit conda channel
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
# make the conda site-packages directory visible to this notebook's Python
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')


## Load bioactivity data

In [5]:
import pandas as pd

In [7]:
df = pd.read_csv('/content/CaMKK2_preprocessed_data.csv')

## Calculate Lipinski descriptors

**Lipinski descriptors** - the "Rule of Five" <br>
- a set of rules for evaluating the drug-likeness of a compound <br>
- drug-likeness is based on ADME (Absorption, Distribution, Metabolism, and Excretion), also known as the "pharmacokinetic profile" <br>

**Rule of Five** <br>
*   Molecular weight < 500 Dalton
*   Octanol-water partition coefficient (LogP) ≤ 5
*   Hydrogen bond donors ≤ 5
*   Hydrogen bond acceptors ≤ 10

*Note: compounds that violate more than 1 of these rules are likely to have poor absorption, making them less suitable for oral administration as drugs.*

Further explanation for each rule: <br>
*   **Molecules with higher molecular weights** tend to have **difficulty passing through** cell membranes, which is crucial for absorption and distribution in the body. <br>
*   **LogP measures whether a molecule prefers to dissolve in fat (octanol) or in water.** A high LogP means the molecule strongly prefers fat and dissolves poorly in water, which makes it harder for the body to absorb. <br>
*   Hydrogen bond donors are groups such as OH or NH, which bind water strongly. This **strong binding makes it harder for the molecule to get through the protective barriers** (membranes) of cells.<br>
*   Hydrogen bond acceptors, such as oxygen and nitrogen atoms, also attract water. Too many of them make it **difficult for the molecule to pass through the cell's protective barrier** (membrane).
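
The rules above can be sketched as a small check. This is only an illustration, not part of the original notebook: the function names `count_ro5_violations` and `is_drug_like` are hypothetical, and the thresholds assume the commonly cited form of the rule (molecular weight below 500 Da, LogP no greater than 5, no more than 5 donors, no more than 10 acceptors). The example values are copied from rows 0 and 129 of the descriptor table shown further down.

```python
def count_ro5_violations(mw, logp, h_donors, h_acceptors):
    """Return how many of the four Rule of Five criteria a compound violates."""
    checks = [
        mw >= 500,         # molecular weight should be below 500 Da
        logp > 5,          # LogP should not exceed 5
        h_donors > 5,      # no more than 5 hydrogen bond donors
        h_acceptors > 10,  # no more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def is_drug_like(mw, logp, h_donors, h_acceptors, max_violations=1):
    """A compound passes if it violates at most `max_violations` rules."""
    return count_ro5_violations(mw, logp, h_donors, h_acceptors) <= max_violations


# Descriptor values taken from this notebook's descriptor table
print(count_ro5_violations(275.260, 2.11370, 5, 5))  # row 0   -> 0
print(count_ro5_violations(397.474, 6.82602, 1, 3))  # row 129 -> 1 (LogP > 5)
```

Under these assumptions, row 129 breaks only the LogP rule, so it still passes the "at most one violation" test from the note above.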

In [9]:
! pip install -q rdkit


In [10]:
# import libraries
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [11]:
# calculate descriptors
# Inspired by: https://codeocean.com/explore/capsules?query=tag:data-curation

def lipinski(smiles, verbose=False):
    """Return a DataFrame of Lipinski descriptors, one row per SMILES string."""
    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]
    rows = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)
        if mol is None:
            # RDKit returns None for SMILES it cannot parse; keep a NaN row
            # so the output stays aligned with the input
            if verbose:
                print(f"Could not parse SMILES: {elem}")
            rows.append([np.nan] * len(columnNames))
            continue
        rows.append([Descriptors.MolWt(mol),
                     Descriptors.MolLogP(mol),
                     Lipinski.NumHDonors(mol),
                     Lipinski.NumHAcceptors(mol)])

    # dtype=float keeps all four columns as floats, as in the original output
    descriptors = pd.DataFrame(data=rows, columns=columnNames, dtype=float)

    return descriptors

In [12]:
# compute Lipinski descriptors for every SMILES string in the canonical_smiles column
df_lipinski = lipinski(df.canonical_smiles)

In [13]:
# lipinski dataframe
df_lipinski

| | MW | LogP | NumHDonors | NumHAcceptors |
|---|---|---|---|---|
| 0 | 275.260 | 2.11370 | 5.0 | 5.0 |
| 1 | 374.352 | 3.38100 | 2.0 | 5.0 |
| 2 | 374.352 | 3.38100 | 2.0 | 5.0 |
| 3 | 385.468 | 1.59080 | 5.0 | 8.0 |
| 4 | 346.416 | 1.45030 | 2.0 | 7.0 |
| ... | ... | ... | ... | ... |
| 128 | 349.773 | 5.51340 | 1.0 | 3.0 |
| 129 | 397.474 | 6.82602 | 1.0 | 3.0 |
| 130 | 349.773 | 5.51340 | 1.0 | 3.0 |
| 131 | 397.474 | 6.82602 | 1.0 | 3.0 |


In [15]:
# original dataframe
df

| | molecule_chembl_id | canonical_smiles | standard_value | bioactivity_class |
|---|---|---|---|---|
| 0 | CHEMBL319620 | O=C(O)c1cc(NCc2cc(O)ccc2O)ccc1O | 200.00 | active |
| 1 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | 0.04 | active |
| 2 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | 10.00 | active |
| 3 | CHEMBL1234833 | CC(C)c1cnn2c(NCc3ccccc3)cc(N[C@@H](CO)[C@H](O)... | 2450.00 | intermediate |
| 4 | CHEMBL2205766 | CC(C)(C)NS(=O)(=O)c1cncc(-c2ccn3nc(N)nc3c2)c1 | 10000.00 | inactive |
| ... | ... | ... | ... | ... |
| 128 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | 10000.00 | inactive |
| 129 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | 1600.00 | intermediate |
| 130 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | 27000.00 | inactive |
| 131 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | 30.00 | active |


In [16]:
# join the bioactivity data and the descriptors column-wise, so each row has
# the standard value, bioactivity class, and Lipinski descriptors together
df_combined = pd.concat([df,df_lipinski], axis=1)
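
One detail worth keeping in mind: `pd.concat(..., axis=1)` pairs rows by index label, not by position. Here both frames carry the default 0 to 131 index, so the rows line up. The sketch below uses hypothetical toy frames (`left` and `right` are illustrative stand-ins, not the notebook's data) to show what would happen if the bioactivity frame had been filtered without resetting its index.

```python
import pandas as pd

# Hypothetical toy frames standing in for df and df_lipinski
left = pd.DataFrame({"molecule_chembl_id": ["A", "B", "C"]})
left = left[left["molecule_chembl_id"] != "A"]   # filtering leaves index [1, 2]
right = pd.DataFrame({"MW": [100.0, 200.0]})     # fresh index [0, 1]

# Index labels differ, so rows are mismatched and padded with NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # (3, 2)

# Resetting the index first makes the rows pair up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape)  # (2, 2)
```

If an earlier preprocessing step drops rows, calling `reset_index(drop=True)` before the concat avoids this kind of silent misalignment.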

In [17]:
df_combined

| | molecule_chembl_id | canonical_smiles | standard_value | bioactivity_class | MW | LogP | NumHDonors | NumHAcceptors |
|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL319620 | O=C(O)c1cc(NCc2cc(O)ccc2O)ccc1O | 200.00 | active | 275.260 | 2.11370 | 5.0 | 5.0 |
| 1 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | 0.04 | active | 374.352 | 3.38100 | 2.0 | 5.0 |
| 2 | CHEMBL265470 | CC(=O)O.O=C(O)c1ccc2c3c1cccc3c(=O)n1c3ccccc3nc21 | 10.00 | active | 374.352 | 3.38100 | 2.0 | 5.0 |
| 3 | CHEMBL1234833 | CC(C)c1cnn2c(NCc3ccccc3)cc(N[C@@H](CO)[C@H](O)... | 2450.00 | intermediate | 385.468 | 1.59080 | 5.0 | 8.0 |
| 4 | CHEMBL2205766 | CC(C)(C)NS(=O)(=O)c1cncc(-c2ccn3nc(N)nc3c2)c1 | 10000.00 | inactive | 346.416 | 1.45030 | 2.0 | 7.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | 10000.00 | inactive | 349.773 | 5.51340 | 1.0 | 3.0 |
| 129 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | 1600.00 | intermediate | 397.474 | 6.82602 | 1.0 | 3.0 |
| 130 | CHEMBL4787282 | O=C(O)c1ccc(-c2coc3ncc(-c4ccccc4)cc23)cc1Cl | 27000.00 | inactive | 349.773 | 5.51340 | 1.0 | 3.0 |
| 131 | CHEMBL4745471 | Cc1cccc(-c2cnc3occ(-c4ccc(C(=O)O)c(C5CCCC5)c4)... | 30.00 | active | 397.474 | 6.82602 | 1.0 | 3.0 |
