<a href="https://colab.research.google.com/github/sara-then/HGF-drugdiscovery-project/blob/main/project_drugdiscovery_pt1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computational Drug Discovery Project (Part 1)



## Project Description
The goal of this project is to build a machine learning model that predicts whether a compound or molecule is a good drug candidate for a given target protein. Specifically, the objective is to deploy a Streamlit web app where users can enter information about a compound of interest and find out whether it is likely to be a good candidate for inhibiting the target protein, Hepatocyte Growth Factor receptor (HGFR).

### **Hepatocyte Growth Factor receptor (HGFR)**
The HGFR protein is encoded by the MET gene. HGFR is known to promote proliferation, migration, invasion, survival, and therapeutic resistance of cancer cells. While there has been significant progress in developing therapeutics that target HGFR/MET, many drug candidates fail to translate to the clinical setting because of concomitant therapeutic resistance. Novel approaches for suppressing or inhibiting HGFR/MET signaling are still needed.


---

sources:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406710/

https://www.frontiersin.org/articles/10.3389/fcell.2020.00152/full

## Downloading Bioactivity Data from ChEMBL Database
ChEMBL is a curated database of bioactive molecules with drug-like properties. Its data has been used to develop compound screening libraries for lead identification during drug discovery.


---
source:
https://www.ebi.ac.uk/chembl/



In [None]:
# installing ChEMBL web service package to retrieve bioactivity data 
!pip install chembl_webresource_client

Importing libraries 

In [None]:
# importing necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

### Target search for Hepatocyte Growth Factor receptor (HGFR)


In [None]:
# search for the target protein of interest: HGF receptor
target = new_client.target
target_query = target.search('hepatocyte growth factor receptor')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P14210', 'xref_name': None, 'xre...",Homo sapiens,Hepatocyte growth factor,34.0,False,CHEMBL5479,"[{'accession': 'P14210', 'component_descriptio...",SINGLE PROTEIN,9606.0
1,[],Homo sapiens,Hepatocyte growth factor activator,33.0,False,CHEMBL3351190,"[{'accession': 'Q04756', 'component_descriptio...",SINGLE PROTEIN,9606.0
2,"[{'xref_id': 'P26927', 'xref_name': None, 'xre...",Homo sapiens,Hepatocyte growth factor-like protein,31.0,False,CHEMBL6042,"[{'accession': 'P26927', 'component_descriptio...",SINGLE PROTEIN,9606.0
3,"[{'xref_id': 'P08581', 'xref_name': None, 'xre...",Homo sapiens,Hepatocyte growth factor receptor,29.0,False,CHEMBL3717,"[{'accession': 'P08581', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,"[{'xref_id': 'P16056', 'xref_name': None, 'xre...",Mus musculus,Hepatocyte growth factor receptor,29.0,False,CHEMBL5585,"[{'accession': 'P16056', 'component_descriptio...",SINGLE PROTEIN,10090.0
...,...,...,...,...,...,...,...,...,...
2838,[],Homo sapiens,Interferon-induced helicase C domain-containin...,1.0,False,CHEMBL4739862,"[{'accession': 'Q9BYX4', 'component_descriptio...",SINGLE PROTEIN,9606.0
2839,[],Homo sapiens,Mitochondrial complex I (NADH dehydrogenase),0.0,False,CHEMBL2363065,"[{'accession': 'P03923', 'component_descriptio...",PROTEIN COMPLEX,9606.0
2840,[],Homo sapiens,Cyclin-dependent kinase,0.0,False,CHEMBL3559691,"[{'accession': 'P06493', 'component_descriptio...",PROTEIN FAMILY,9606.0
2841,[],Homo sapiens,Caspase,0.0,False,CHEMBL3831289,"[{'accession': 'P49662', 'component_descriptio...",PROTEIN FAMILY,9606.0
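
The search returns thousands of loosely related hits, and the same `pref_name` can appear for more than one organism (rows 3 and 4 above). One way to pick the intended entry is to match on `pref_name` and `organism` together. The sketch below is only an illustration: `targets_demo` is a small stand-in built from the rows shown above, and `find_target_id` is a hypothetical helper, not part of the ChEMBL client.

```python
import pandas as pd

# Stand-in for the `targets` dataframe, copied from the first rows of the search output.
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Homo sapiens", "Homo sapiens", "Mus musculus"],
    "pref_name": [
        "Hepatocyte growth factor",
        "Hepatocyte growth factor activator",
        "Hepatocyte growth factor-like protein",
        "Hepatocyte growth factor receptor",
        "Hepatocyte growth factor receptor",
    ],
    "target_chembl_id": ["CHEMBL5479", "CHEMBL3351190", "CHEMBL6042", "CHEMBL3717", "CHEMBL5585"],
})


def find_target_id(targets, pref_name, organism):
    """Return the one ChEMBL ID whose name and organism match exactly (hypothetical helper)."""
    mask = (targets["pref_name"] == pref_name) & (targets["organism"] == organism)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong protein.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


human_hgfr_id = find_target_id(targets_demo, "Hepatocyte growth factor receptor", "Homo sapiens")
print(human_hgfr_id)  # CHEMBL3717
```

Matching on values rather than on row position keeps the selection stable if the ranking of search results changes.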


### Select bioactivity data for human HGFR (CHEMBL3717)

In [None]:
# assigning selected target (4th entry in the targets dataframe: human HGFR, CHEMBL3717)
selected_target = targets.target_chembl_id[3] 

# retrieving bioactivity data for selected target with IC50 values as standard_type
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

df = pd.DataFrame.from_dict(res)
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,653063,[],CHEMBL820578,Inhibition of c-Met autophosphorylation of in ...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,UO_0000065,,10000.0
1,,750649,[],CHEMBL880758,Inhibition of MET kinase activity,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,UO_0000065,,10000.0
2,,1068014,[],CHEMBL711149,Inhibition of Met proto-oncogene tyrosine kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,10.0
3,,1070529,[],CHEMBL711149,Inhibition of Met proto-oncogene tyrosine kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,10.0
4,,1173958,[],CHEMBL697832,Inhibition of Hepatocyte growth factor receptor,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4988,Not Determined,22987197,[],CHEMBL4774499,Inhibition of MET in human MKN-45 cells assess...,B,,,BAO_0000179,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,,,,
4989,Active,23060645,[],CHEMBL4508613,Homogeneous Time Resolved Fluorescence Assay,B,,,BAO_0000179,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,,,1.0
4990,Active,23060646,[],CHEMBL4508614,Cellular mechanistic assay using NKM-45 cells ...,B,,,BAO_0000179,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,,,2.9
4991,Not Active,23060649,[],CHEMBL4508613,Homogeneous Time Resolved Fluorescence Assay,B,,,BAO_0000179,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,µM,,,11.0


`standard_value` holds the IC50: the concentration of an inhibitory substance (drug) needed to inhibit a given biological process or component (here, the target protein) by 50%. It is a measure of potency: the higher the IC50, the less potent the compound, because more of it is needed to inhibit the target protein.
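
IC50 values can only be compared once they share a unit, and the `units` column in the output above mixes nM, uM and µM. As a rough illustration of the arithmetic, the sketch below converts a reported concentration to nM. The helper `to_nanomolar` and the `NM_PER_UNIT` table are assumptions made for this example; they are not part of ChEMBL or of this notebook's pipeline.

```python
# Hypothetical lookup table: how many nM correspond to one of each unit.
NM_PER_UNIT = {"nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, units):
    """Convert a concentration to nM, raising on units the table does not know."""
    try:
        factor = NM_PER_UNIT[units]
    except KeyError:
        raise ValueError(f"unsupported unit: {units!r}") from None
    return float(value) * factor


# 10 uM and 10000 nM describe the same concentration.
print(to_nanomolar(10.0, "uM") == to_nanomolar(10000.0, "nM"))  # True

# A compound with IC50 = 100 uM is less potent than one with IC50 = 10 uM.
print(to_nanomolar(100.0, "uM") > to_nanomolar(10.0, "uM"))  # True
```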

In [None]:
# save resulting bioactivity data to csv file bioactivity_data.csv
df.to_csv('bioactivity_data.csv', index=False)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [None]:
# create a data folder in Colab Notebooks folder on Google Drive (-p: no error if it already exists)
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"
! cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv
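
The same create-folder-then-save step can also be written in plain Python, which avoids shell quoting and is safe to re-run. This is a minimal sketch under stated assumptions: `save_csv` is a hypothetical helper, and a temporary directory stands in for the Google Drive path used above.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv(df, directory, filename):
    """Create `directory` if needed, write `df` there as CSV, and return the file path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # no error when the folder already exists
    path = directory / filename
    df.to_csv(path, index=False)
    return path


# One row taken from the bioactivity output above, used as demo data.
demo = pd.DataFrame({"molecule_chembl_id": ["CHEMBL352308"], "standard_value": [10000.0]})

with tempfile.TemporaryDirectory() as tmp:
    first = save_csv(demo, Path(tmp) / "data", "bioactivity_data.csv")
    second = save_csv(demo, Path(tmp) / "data", "bioactivity_data.csv")  # re-running is harmless
    saved_ok = first.exists() and first == second

print(saved_ok)  # True
```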


In [None]:
# check csv file
! head bioactivity_data.csv

activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,653063,[],CHEMBL820578,Inhibition of c-Met autophosphorylation of in intact cells,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2cc1OCCNCCO,,,CHEMBL1146677,Bioorg. Med. Chem. Lett.,2004,,CHEMBL352308,,CHEMBL352308,,0,http://www.openphacts.org/units/Nanomolar,336059,>,1,1,>,,IC

## Data Cleaning and Pre-processing

### Handling missing data

In [None]:
# removing compounds in bioactivity dataframe with missing values in standard_value column
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,653063,[],CHEMBL820578,Inhibition of c-Met autophosphorylation of in ...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,UO_0000065,,10000.0
1,,750649,[],CHEMBL880758,Inhibition of MET kinase activity,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,UO_0000065,,10000.0
2,,1068014,[],CHEMBL711149,Inhibition of Met proto-oncogene tyrosine kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,10.0
3,,1070529,[],CHEMBL711149,Inhibition of Met proto-oncogene tyrosine kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,10.0
4,,1173958,[],CHEMBL697832,Inhibition of Hepatocyte growth factor receptor,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4985,,22987194,[],CHEMBL4774499,Inhibition of MET in human MKN-45 cells assess...,B,,,BAO_0000179,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,,,82.9
4989,Active,23060645,[],CHEMBL4508613,Homogeneous Time Resolved Fluorescence Assay,B,,,BAO_0000179,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,,,1.0
4990,Active,23060646,[],CHEMBL4508614,Cellular mechanistic assay using NKM-45 cells ...,B,,,BAO_0000179,BAO_0000219,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,nM,,,2.9
4991,Not Active,23060649,[],CHEMBL4508613,Homogeneous Time Resolved Fluorescence Assay,B,,,BAO_0000179,BAO_0000357,...,Homo sapiens,Hepatocyte growth factor receptor,9606,,,IC50,µM,,,11.0
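
Later steps rely on both `standard_value` and `canonical_smiles`, so it may be worth dropping rows that lack either one in a single call. The sketch below uses a made-up four-row frame (the IDs `A` to `D` are placeholders, not real ChEMBL records) and also converts the values to numbers, on the assumption that they can arrive as strings.

```python
import pandas as pd

# Placeholder data: row B has no SMILES, row C has no IC50 value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": ["10000.0", "5.0", None, "82.9"],
})

# Keep only rows where every required column is present.
required = ["standard_value", "canonical_smiles"]
cleaned = demo.dropna(subset=required).reset_index(drop=True)

# Convert the IC50 strings to floats so they can be compared numerically.
cleaned["standard_value"] = pd.to_numeric(cleaned["standard_value"])

print(cleaned)
```

Only rows `A` and `D` survive, because each of the other two is missing one of the required columns.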


### Labeling compounds into 3 classes: active, inactive, or intermediate
The bioactivity values (`standard_value`) are IC50 measurements in nM. Compounds with an IC50 of 1,000 nM or less are considered **active**, those with an IC50 of 10,000 nM or more are considered **inactive**, and values strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000: 
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
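
The same thresholds can be applied to a whole column at once. The sketch below is one possible vectorized version using `numpy.select`; `classify_ic50` is a hypothetical helper name, and the input values are examples chosen to hit each branch, including both boundaries.

```python
import numpy as np
import pandas as pd


def classify_ic50(values_nm):
    """Label IC50 values (in nM) using the same cut-offs as the loop above."""
    values = pd.to_numeric(pd.Series(values_nm), errors="raise").astype(float)
    conditions = [values >= 10000, values <= 1000]
    labels = ["inactive", "active"]
    # Anything matching neither condition lies strictly between the two thresholds.
    classes = np.select(conditions, labels, default="intermediate")
    return pd.Series(classes, index=values.index)


demo_classes = classify_ic50(["10000.0", "1000", "5000", "82.9", "100000.0"])
print(demo_classes.tolist())
# ['inactive', 'active', 'intermediate', 'active', 'inactive']
```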

### Selecting columns of interest (*molecule_chembl_id*, *canonical_smiles*, *standard_value*)


In [None]:
# collect the columns of interest as plain Python lists
mol_id = df2.molecule_chembl_id.tolist()
canon_smiles = df2.canonical_smiles.tolist()
st_val = df2.standard_value.tolist()
  

### Creating new dataframe of preprocessed data


In [None]:
features = list(zip(mol_id, canon_smiles, bioactivity_class, st_val))
df3 = pd.DataFrame(features, columns=['molecule_chembl_id','canonical_smiles','bioactivity_class','standard_value'])

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL352308,COc1cc2c(Oc3ccc(Nc4ccc(C(C)(C)C)cc4)cc3)ccnc2c...,inactive,10000.0
1,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,inactive,10000.0
2,CHEMBL101683,O=C(Nc1ccc(Cl)cc1)c1ccccc1NCc1ccncc1,inactive,10000.0
3,CHEMBL101253,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,inactive,10000.0
4,CHEMBL281957,CCN(CC)C/C=C/c1nc(O)c2c(ccc3nc(Nc4c(Cl)cccc4Cl...,inactive,100000.0
...,...,...,...,...
4825,CHEMBL4799551,COc1cc2ncnc(Oc3ccc(Nc4nccc5c4c(=O)c(-c4ccc(F)c...,active,82.9
4826,CHEMBL4593677,CC1=C(C#N)C(c2ccc3[nH]nc(C)c3c2)C(C#N)=C(C)N1,active,1.0
4827,CHEMBL4593677,CC1=C(C#N)C(c2ccc3[nH]nc(C)c3c2)C(C#N)=C(C)N1,active,2.9
4828,CHEMBL4522773,CC1=C(C#N)[C@@H](c2ccc3[nH]nc(C)c3c2)C(C#N)=C(...,active,11.0
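
A dataframe with this layout can also be built by selecting the columns directly and inserting the class labels, without going through intermediate lists. This is a sketch of that alternative, not the notebook's own method: `demo` holds two rows copied from the output above plus one extra column, and `bioactivity_class_demo` stands in for the labels computed earlier.

```python
import pandas as pd

# Two rows from the table above; `assay_type` represents the columns that get dropped.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL115220", "CHEMBL4593677"],
    "assay_type": ["B", "B"],
    "canonical_smiles": [
        "O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1",
        "CC1=C(C#N)C(c2ccc3[nH]nc(C)c3c2)C(C#N)=C(C)N1",
    ],
    "standard_value": [10000.0, 1.0],
})
bioactivity_class_demo = ["inactive", "active"]  # stand-in for the computed labels

# Select the columns of interest, then place the labels before standard_value.
columns = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
selected = demo[columns].reset_index(drop=True)
selected.insert(2, "bioactivity_class", bioactivity_class_demo)

print(selected.columns.tolist())
# ['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value']
```

Resetting the index before inserting the list keeps the labels aligned with the rows by position.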


Save the dataframe to a CSV file and copy it to Google Drive

In [None]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

# check that the csv file was saved
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv  bioactivity_preprocessed_data.csv
