# Phase 1: Data Collection and Pre-Processing
**Shreya Das**

In this phase, we retrieve biological activity data for RET-targeting molecules from the ChEMBL database. This data will later be used to build a machine learning model for drug discovery.

NOTE: The structure and layout of this phase and of the project are inspired by **The Data Professor** on YouTube. The findings for RET molecules and drugs are original and were investigated by the author (Shreya Das).

## Installing Libraries

We need to install the ChEMBL web resource client library (chembl_webresource_client) in order to retrieve bioactivity data.

In [4]:
! pip install chembl_webresource_client

OSError: "/usr/local/bin/bash" shell not found

## Importing Libraries

We use the pandas package to manipulate the data obtained from ChEMBL, and we import new_client from chembl_webresource_client to run a targeted search.

In [5]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target Protein: RET

Here, we use the **target** object to run a targeted search for RET in the ChEMBL database.

We pass the name of the target protein to the search() function as a string parameter.

Finally, we organize the targets found in the database into a pandas dataframe, with each row representing a different target and each column holding a different piece of information about that target.

NOTE: The number of targets seen here should match the number of targets found on the ChEMBL website under the "Targets" tab. As of 2024-08-04, the number of targets for RET is 10.

In [6]:
target = new_client.target
target_query = target.search('RET')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Proto-oncogene tyrosine-protein kinase recepto...,19.0,False,CHEMBL2034799,"[{'accession': 'P35546', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Rattus norvegicus,Proto-oncogene tyrosine-protein kinase recepto...,19.0,False,CHEMBL4295641,"[{'accession': 'G3V9H8', 'component_descriptio...",SINGLE PROTEIN,10116
2,[],Mus musculus,Proto-oncogene tyrosine-protein kinase recepto...,19.0,False,CHEMBL5291969,"[{'accession': 'P35546', 'component_descriptio...",CHIMERIC PROTEIN,10090
3,"[{'xref_id': 'P07949', 'xref_name': None, 'xre...",Homo sapiens,Tyrosine-protein kinase receptor RET,18.0,False,CHEMBL2041,"[{'accession': 'P07949', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,Centrosomal protein 43/RET,18.0,False,CHEMBL4523602,"[{'accession': 'P07949', 'component_descriptio...",CHIMERIC PROTEIN,9606
5,[],Homo sapiens,Proto-oncogene tyrosine-protein kinase recepto...,17.0,False,CHEMBL3430877,"[{'accession': 'P07949', 'component_descriptio...",CHIMERIC PROTEIN,9606
6,[],Homo sapiens,Kinesin-1 heavy chain/ Tyrosine-protein kinase...,17.0,False,CHEMBL3430888,"[{'accession': 'P07949', 'component_descriptio...",CHIMERIC PROTEIN,9606
7,[],Homo sapiens,Coiled-coil domain-containing protein 6/Tyrosi...,17.0,False,CHEMBL3430904,"[{'accession': 'P07949', 'component_descriptio...",CHIMERIC PROTEIN,9606
8,[],Homo sapiens,GDNF family receptor alpha-1,10.0,False,CHEMBL3833481,"[{'accession': 'P56159', 'component_descriptio...",SINGLE PROTEIN,9606
9,[],Homo sapiens,E3 ubiquitin-protein ligase TRIM33,9.0,False,CHEMBL2176772,"[{'accession': 'Q9UPN9', 'component_descriptio...",SINGLE PROTEIN,9606


## Select and retrieve bioactivity data for *Human RET* (fourth entry)

We assign the fourth entry (corresponding to *Human RET*) to the selected_target variable. Verify this against the ChEMBL website, because the position of the entry may change over time. As of 2024-08-04, CHEMBL2041 is the fourth entry.

NOTE: The fourth entry has index 3, because the dataframe index starts at 0.

In [7]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL2041'
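
Because the position of the human RET entry can shift between ChEMBL releases, selecting it by its attributes may be more robust than selecting it by position. The following is a minimal sketch of that idea, not part of the original workflow. `targets_demo` is a hypothetical stand-in that reproduces a few rows and columns of the search result shown above (truncated names are kept exactly as displayed), and `selected_target_demo` is an illustrative variable name.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` dataframe returned by the search.
# Values are copied from the output above; truncated names are left as shown.
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens", "Homo sapiens"],
    "pref_name": [
        "Proto-oncogene tyrosine-protein kinase recepto...",
        "Tyrosine-protein kinase receptor RET",
        "Centrosomal protein 43/RET",
        "GDNF family receptor alpha-1",
    ],
    "target_chembl_id": ["CHEMBL2034799", "CHEMBL2041", "CHEMBL4523602", "CHEMBL3833481"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "CHIMERIC PROTEIN", "SINGLE PROTEIN"],
})

# Keep only the human, single-protein entry whose preferred name is RET.
mask = (
    (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["pref_name"] == "Tyrosine-protein kinase receptor RET")
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the search result no longer contains exactly one such entry.
assert len(matches) == 1, "expected exactly one human RET single-protein target"
selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The same mask could be applied to the real `targets` dataframe; the assertion guards against silently picking the wrong row if the search results change.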

We first create a handle to the ChEMBL activity resource and save it under a variable called **activity**. We then filter the bioactivity data to the entries that match the ChEMBL ID of our target protein (CHEMBL2041), and filter again to keep only the entries with IC50 under the column "standard_type". Here, we are interested in IC50 as the measure of bioactivity; it will later be reported on a logarithmic scale as -log(IC50).

In [8]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type='IC50')

We are then going to take this filtered data and format it into a dataframe using the pandas package.

In [9]:
df = pd.DataFrame.from_dict(res)

In [10]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,750656,[],CHEMBL770568,Inhibition of RET kinase activity,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,10000.0
1,,,1701846,[],CHEMBL861018,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,40.0
2,,,1781988,[],CHEMBL911838,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,0.031
3,,,1815199,[],CHEMBL908511,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,8300.0
4,,,1826436,[],CHEMBL919913,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,1900.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1213,"{'action_type': 'INHIBITOR', 'description': 'N...",,25061413,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5250521,Inhibition of wild type rearranged during tran...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,12.0
1214,"{'action_type': 'INHIBITOR', 'description': 'N...",,25070302,[],CHEMBL5252761,Inhibition of RET (unknown origin) autophospho...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,30.0
1215,"{'action_type': 'INHIBITOR', 'description': 'N...",,25070303,[],CHEMBL5252761,Inhibition of RET (unknown origin) autophospho...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,30.0
1216,"{'action_type': 'INHIBITOR', 'description': 'N...",,25073450,[],CHEMBL5253679,Inhibition of RET (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,169.0


Next, save the filtered data to a .csv file named **RET_03_bioactivity_data_raw.csv**.

In [11]:
df.to_csv("RET_03_bioactivity_data_raw.csv", index = False)
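
The -log(IC50) scale mentioned earlier is commonly called pIC50. As a rough sketch of the conversion, assuming standard_value is expressed in nM (as the activity thresholds later in this notebook suggest) and has already been converted to a numeric type, the values can be transformed as follows. The series `ic50_nM` is a small hypothetical sample, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of IC50 values in nM (assumed unit of standard_value).
ic50_nM = pd.Series([10000.0, 40000.0, 31.0], name="standard_value")

# pIC50 = -log10(IC50 in mol/L); 1 nM = 1e-9 mol/L.
pIC50 = -np.log10(ic50_nM * 1e-9)
pIC50.name = "pIC50"
print(pIC50.round(2))
```

On this scale a larger value means a more potent compound: 10,000 nM corresponds to a pIC50 of 5 and 1,000 nM to a pIC50 of 6.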

## Handling Missing Data

If any compound in the dataset from the previous step is missing a value for standard_value or canonical_smiles, we remove it. Canonical SMILES (Simplified Molecular Input Line Entry System) is a line notation that encodes the structure of a molecule as a text string, using element symbols for atoms plus characters for bonds, branches and rings.

In [12]:
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 (not df) so its index matches the frame being filtered
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,750656,[],CHEMBL770568,Inhibition of RET kinase activity,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,10000.0
1,,,1701846,[],CHEMBL861018,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,40.0
2,,,1781988,[],CHEMBL911838,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,0.031
3,,,1815199,[],CHEMBL908511,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,8300.0
4,,,1826436,[],CHEMBL919913,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,1900.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1213,"{'action_type': 'INHIBITOR', 'description': 'N...",,25061413,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5250521,Inhibition of wild type rearranged during tran...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,12.0
1214,"{'action_type': 'INHIBITOR', 'description': 'N...",,25070302,[],CHEMBL5252761,Inhibition of RET (unknown origin) autophospho...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,30.0
1215,"{'action_type': 'INHIBITOR', 'description': 'N...",,25070303,[],CHEMBL5252761,Inhibition of RET (unknown origin) autophospho...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,30.0
1216,"{'action_type': 'INHIBITOR', 'description': 'N...",,25073450,[],CHEMBL5253679,Inhibition of RET (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,169.0


NOTE: The number of rows has been reduced here from 1218 to 1208 (2024-08-04).
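
The same filtering can also be written with a single `dropna` call, which makes it easy to report how many rows were removed. This is a minimal sketch on a hypothetical four-row dataframe (`demo` and its IDs are made up for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature of the bioactivity table with two incomplete rows.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [10.0, 20.0, None, 30.0],
})

# Drop a row if either of the two required columns is missing.
cleaned = demo.dropna(subset=["standard_value", "canonical_smiles"])

n_dropped = len(demo) - len(cleaned)
print(f"dropped {n_dropped} of {len(demo)} rows")
```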

We can check how many unique values are in the canonical_smiles column. Because this value encodes the structure of the molecule, we only want entries that each represent a different molecule. In other words, we don't want duplicate entries of the same molecule in this curated dataset.

In [13]:
len(df2.canonical_smiles.unique())

896

In [14]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,750656,[],CHEMBL770568,Inhibition of RET kinase activity,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,10000.0
1,,,1701846,[],CHEMBL861018,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,40.0
2,,,1781988,[],CHEMBL911838,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,uM,UO_0000065,,0.031
3,,,1815199,[],CHEMBL908511,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,8300.0
4,,,1826436,[],CHEMBL919913,Inhibition of RET,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,1900.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,"{'action_type': 'INHIBITOR', 'description': 'N...",,25061411,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5250521,Inhibition of wild type rearranged during tran...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,136.0
1212,"{'action_type': 'INHIBITOR', 'description': 'N...",,25061412,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5250521,Inhibition of wild type rearranged during tran...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,11.2
1213,"{'action_type': 'INHIBITOR', 'description': 'N...",,25061413,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5250521,Inhibition of wild type rearranged during tran...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,12.0
1214,"{'action_type': 'INHIBITOR', 'description': 'N...",,25070302,[],CHEMBL5252761,Inhibition of RET (unknown origin) autophospho...,B,,,BAO_0000190,...,Homo sapiens,Tyrosine-protein kinase receptor RET,9606,,,IC50,nM,UO_0000065,,30.0


Now we have a curated dataset in which each molecule appears only once, identified by its unique canonical SMILES string.
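
Note that `drop_duplicates` keeps the first row it encounters for each SMILES string, so when a molecule has several IC50 measurements only the first one survives. The sketch below contrasts that behavior with one possible alternative, taking the median per molecule. The alternative is an illustration rather than the method used in this notebook, and it assumes standard_value is numeric; `demo` is a hypothetical three-row table.

```python
import pandas as pd

# Hypothetical table: molecule M1 was measured twice.
demo = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M1", "M2"],
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [100.0, 300.0, 50.0],
})

# Behavior used in the notebook: keep the first row per SMILES string.
first_kept = demo.drop_duplicates(["canonical_smiles"])

# Alternative (assumption, not the notebook's method): median IC50 per molecule.
median_per_molecule = demo.groupby(
    ["molecule_chembl_id", "canonical_smiles"], as_index=False
)["standard_value"].median()

print(first_kept)
print(median_per_molecule)
```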

# Data Pre-Processing of the bioactivity data

## Combining 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame 

In [15]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0
1,CHEMBL6246,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,40000.0
2,CHEMBL402548,CO[C@@H](C(=O)N1Cc2[nH]nc(NC(=O)c3ccc(N4CCN(C)...,31.0
3,CHEMBL373882,CNc1ncnc(-c2cccnc2Oc2ccc(F)c(C(=O)Nc3cc(C(F)(F...,8300.0
4,CHEMBL223360,Cc1ccc(F)c(NC(=O)Nc2ccc(-c3cccc4[nH]nc(N)c34)c...,1900.0
...,...,...,...
1211,CHEMBL5289571,COc1cc2nccc(Oc3ccc(Nc4nn(C)cc4C(=O)NC45CC6CC(C...,136.0
1212,CHEMBL5268831,Cn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(Nc2ccc(Oc3c...,11.2
1213,CHEMBL5284144,Cn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(Nc2ccc(Oc3c...,12.0
1214,CHEMBL4080062,O=C(Nc1cccc(C(F)(F)F)c1)c1cccc2cc(Oc3cc(CO)ncn...,30.0


Save the DataFrame into a csv file.

In [16]:
df3.to_csv("RET_03_bioactivity_data_preprocessed.csv", index=False)

## Labeling compounds as active, inactive or intermediate

Here, we use the IC50 value as the measure of a molecule's bioactivity.

- **Active**: Values less than or equal to 1,000 nM
- **Inactive**: Values more than or equal to 10,000 nM
- **Intermediate**: Values between 1,000 and 10,000 nM

We are going to read the pre-processed csv file from the previous step.

In [17]:
df4 = pd.read_csv("RET_03_bioactivity_data_preprocessed.csv")

Then we create a list named bioactivity_threshold that holds the class label for each molecule (row) in the preprocessed data.

In [18]:
bioactivity_threshold = []

for i in df4.standard_value:
    if i <= 1000:
        bioactivity_threshold.append('active')
    elif i >= 10000:
        bioactivity_threshold.append('inactive')
    else:
        bioactivity_threshold.append('intermediate')

We then convert the list into a pandas Series (an object that behaves like a single column of a table) and name it **bioactivity_class**. We then concatenate this column with the pre-processed dataframe (df4) to form a new dataframe named 'df5'.

In [19]:
bioactivity_class = pd.Series(bioactivity_threshold, name='bioactivity_class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0,inactive
1,CHEMBL6246,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,40000.0,inactive
2,CHEMBL402548,CO[C@@H](C(=O)N1Cc2[nH]nc(NC(=O)c3ccc(N4CCN(C)...,31.0,active
3,CHEMBL373882,CNc1ncnc(-c2cccnc2Oc2ccc(F)c(C(=O)Nc3cc(C(F)(F...,8300.0,intermediate
4,CHEMBL223360,Cc1ccc(F)c(NC(=O)Nc2ccc(-c3cccc4[nH]nc(N)c34)c...,1900.0,intermediate
...,...,...,...,...
891,CHEMBL5289571,COc1cc2nccc(Oc3ccc(Nc4nn(C)cc4C(=O)NC45CC6CC(C...,136.0,active
892,CHEMBL5268831,Cn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(Nc2ccc(Oc3c...,11.2,active
893,CHEMBL5284144,Cn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(Nc2ccc(Oc3c...,12.0,active
894,CHEMBL4080062,O=C(Nc1cccc(C(F)(F)F)c1)c1cccc2cc(Oc3cc(CO)ncn...,30.0,active


NOTE: you should have 4 columns here.
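
The loop and concatenation above can also be expressed as a single vectorized step. The following is a hedged sketch using `numpy.select` with the same thresholds; `demo` is a hypothetical dataframe with a few IC50 values in nM, including the two boundary values.

```python
import numpy as np
import pandas as pd

# Hypothetical IC50 values in nM, including both threshold boundaries.
demo = pd.DataFrame({"standard_value": [31.0, 1000.0, 1900.0, 10000.0, 40000.0]})

# Conditions are checked in order; rows matching neither get the default label.
conditions = [
    demo["standard_value"] <= 1000,   # active
    demo["standard_value"] >= 10000,  # inactive
]
labels = ["active", "inactive"]
demo["bioactivity_class"] = np.select(conditions, labels, default="intermediate")
print(demo)
```

This assigns the labels directly as a new column, so no separate list or `pd.concat` call is needed.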

Save the new dataframe into a new CSV file called 'RET_03_bioactivity_data_curated.csv'

In [20]:
df5.to_csv("RET_03_bioactivity_data_curated.csv", index=False)

Save all the csv files from this notebook into a zip file called "RET.zip".

In [21]:
! zip RET.zip *.csv

OSError: "/usr/local/bin/bash" shell not found

In [22]:
! ls -l

OSError: "/usr/local/bin/bash" shell not found
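
The shell commands above fail in this environment because no bash shell is available. As a possible workaround, the archive can be built with Python's standard library instead. This is a sketch under that assumption; `zip_csv_files` is a hypothetical helper name, not part of any library used in this notebook.

```python
import zipfile
from pathlib import Path


def zip_csv_files(folder=".", archive_name="RET.zip"):
    """Add every .csv file in `folder` to `archive_name` and return their names."""
    folder_path = Path(folder)
    csv_files = sorted(folder_path.glob("*.csv"))
    # ZIP_DEFLATED compresses the files instead of only storing them.
    with zipfile.ZipFile(folder_path / archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for csv_file in csv_files:
            # arcname keeps only the file name, without the folder path.
            zf.write(csv_file, arcname=csv_file.name)
    return [csv_file.name for csv_file in csv_files]
```

Calling `zip_csv_files()` in the notebook's working directory would create RET.zip there and return the list of csv files it contains, which also stands in for the directory listing.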