# **Computational Drug Discovery Project: Data Preprocessing**
By Mathew Kuruvilla

Based on the Drug Discovery Project taught by Chanin Nantasenamat [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this project, I will be building a machine learning model using bioactivity data from ChEMBL for coronavirus replicase polyprotein 1ab inhibitors.
This Jupyter notebook covers the data preprocessing steps and compiles the bioactivity class data.

---

## **Installing libraries**

In [1]:
! pip install chembl_webresource_client

## **Importing libraries**

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for coronavirus**

In [3]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],Feline coronavirus,Feline coronavirus,14.0,False,CHEMBL612744,[],ORGANISM,12663
2,[],Murine coronavirus,Murine coronavirus,14.0,False,CHEMBL5209664,[],ORGANISM,694005
3,[],Canine coronavirus,Canine coronavirus,14.0,False,CHEMBL5291668,[],ORGANISM,11153
4,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
5,[],Human coronavirus OC43,Human coronavirus OC43,13.0,False,CHEMBL5209665,[],ORGANISM,31631
6,[],Severe acute respiratory syndrome-related coro...,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,694009
7,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
8,[],Severe acute respiratory syndrome-related coro...,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,694009
9,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


### **Select and retrieve bioactivity data for *replicase polyprotein 1ab* (tenth entry)**

In [4]:
selected_target = targets.target_chembl_id[9]
selected_target

'CHEMBL4523582'
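
Selecting the target by position (`[9]`) depends on the order in which ChEMBL returns the search results, which may not stay the same between database releases. Below is a minimal sketch of selecting by name and taxonomy ID instead. The `toy_targets` frame copies three rows from the table above, and the `pick_target` helper is a hypothetical name, not part of the notebook or the ChEMBL client. It assumes `tax_id` is stored as an integer.

```python
import pandas as pd

# Three rows copied from the search results above (only the columns needed here).
toy_targets = pd.DataFrame({
    "pref_name": [
        "SARS coronavirus 3C-like proteinase",
        "Replicase polyprotein 1ab",
        "Replicase polyprotein 1ab",
    ],
    "target_chembl_id": ["CHEMBL3927", "CHEMBL5118", "CHEMBL4523582"],
    "tax_id": [694009, 694009, 2697049],
})


def pick_target(targets, pref_name, tax_id):
    """Return the single ChEMBL ID that matches a target name and taxonomy ID."""
    mask = (targets["pref_name"] == pref_name) & (targets["tax_id"] == tax_id)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly rather than silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


selected = pick_target(toy_targets, "Replicase polyprotein 1ab", 2697049)
print(selected)
```

With the real `targets` dataframe, the same call would replace `targets.target_chembl_id[9]`, provided the name and taxonomy ID still identify exactly one row.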

Here, we will retrieve only the bioactivity data for *Replicase polyprotein 1ab* (CHEMBL4523582) that are reported as IC$_{50}$. <br>
IC$_{50}$ is the concentration of a compound needed to inhibit the enzyme's activity by 50%, recorded here in nM (nanomolar). We are filtering specifically for experiments that measured how much of a drug is needed to inhibit 50% of the protease’s activity. <br>
The replicase polyprotein 1ab was chosen for this study because it is a large polyprotein that coronaviruses make inside infected cells. Its main job is to copy (replicate) and control the virus’s RNA genome. Inhibitors targeting the replicase polyprotein 1ab would block the virus’s ability to copy its RNA, stopping it from making new viruses inside the host. Without a working replication complex, the virus cannot produce new copies of itself and the infection cannot spread.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(type="IC50")
res

[{'action_type': None, 'activity_comment': 'Dtt Insensitive', 'activity_id': 19964199, 'activity_properties': [], 'assay_chembl_id': 'CHEMBL4495583', 'assay_description': 'SARS-CoV-2 3CL-Pro protease inhibition IC50 determined by FRET kind of response from peptide substrate', 'assay_type': 'F', 'assay_variant_accession': None, 'assay_variant_mutation': None, 'bao_endpoint': 'BAO_0000190', 'bao_format': 'BAO_0000019', 'bao_label': 'assay format', 'canonical_smiles': 'Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1', 'data_validity_comment': None, 'data_validity_description': None, 'document_chembl_id': 'CHEMBL4495564', 'document_journal': None, 'document_year': 2020, 'ligand_efficiency': None, 'molecule_chembl_id': 'CHEMBL480', 'molecule_pref_name': 'LANSOPRAZOLE', 'parent_molecule_chembl_id': 'CHEMBL480', 'pchembl_value': '6.41', 'potential_duplicate': 0, 'qudt_units': 'http://www.openphacts.org/units/Nanomolar', 'record_id': 3341963, 'relation': '=', 'src_id': 52, 'standard_flag': 1,

In [6]:
df = pd.DataFrame.from_dict(res)
df.shape

(3660, 46)

In [7]:
with pd.option_context('display.max_columns', None): 
    display(df.head())

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,,,CHEMBL4495564,,2020,,CHEMBL480,LANSOPRAZOLE,CHEMBL480,6.41,0,http://www.openphacts.org/units/Nanomolar,3341963,=,52,1,=,,IC50,nM,,390.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(-c2cnccn2)ssc1=S,,,CHEMBL4495564,,2020,,CHEMBL178459,OLTIPRAZ,CHEMBL178459,6.68,0,http://www.openphacts.org/units/Nanomolar,3341991,=,52,1,=,,IC50,nM,,210.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,,,CHEMBL4495564,,2020,,CHEMBL3545157,TIDEGLUSIB,CHEMBL3545157,7.1,0,http://www.openphacts.org/units/Nanomolar,3342067,=,52,1,=,,IC50,nM,,80.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,,,CHEMBL4495564,,2020,,CHEMBL297453,EPIGALOCATECHIN GALLATE,CHEMBL297453,5.8,0,http://www.openphacts.org/units/Nanomolar,3342156,=,52,1,=,,IC50,nM,,1580.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C1C=Cc2cc(Br)ccc2C1=O,,,CHEMBL4495564,,2020,,CHEMBL4303595,,CHEMBL4303595,7.4,0,http://www.openphacts.org/units/Nanomolar,3342307,=,52,1,=,,IC50,nM,,40.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04


The **standard_value** column holds the IC$_{50}$ of each compound in nM (see the **standard_units** column). The lower the value, the less compound is needed to reach 50% inhibition, so the more potent the compound is at inhibiting the activity of the protease.
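
As a cross-check on the units, the **pchembl_value** column in the rows above appears to agree with the negative base-10 logarithm of the IC$_{50}$ expressed in molar. Here is a small sketch of that conversion. The helper name `pic50_from_nm` is illustrative and not part of the notebook.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in M)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


# standard_value (nM) for the first five rows shown above
for ic50_nm in [390.0, 210.0, 80.0, 1580.0, 40.0]:
    print(ic50_nm, round(pic50_from_nm(ic50_nm), 2))
```

For these five rows the rounded results match the listed **pchembl_value** entries (6.41, 6.68, 7.1, 5.8, 7.4), which supports reading **standard_value** as nM.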

In [8]:
df.standard_type.unique()

array(['IC50'], dtype=object)

Next, we save the resulting bioactivity data to a CSV file, **bioactivity_data.csv**.

In [20]:
df.to_csv('bioactivity_data.csv', index=False)

## **Handling missing data**
Drop any compound with a missing value in the **standard_value** or **canonical_smiles** column, then remove compounds with a duplicate **canonical_smiles** entry.

In [9]:
df2_1 = df[df.standard_value.notna()].reset_index()
dropped_rows_sv = df[df.standard_value.isna()]
print(len(dropped_rows_sv))
print(len(df2_1))
df2_1

104
3556


Unnamed: 0,index,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3551,3655,,,25739541,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.375
3552,3656,,,25739542,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.067
3553,3657,,,25739543,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,18.97
3554,3658,,,25739544,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.621


In [10]:
df2_2 = df2_1[df2_1.canonical_smiles.notna()].reset_index()
dropped_rows_cs = df2_1[df2_1.canonical_smiles.isna()]
print(len(dropped_rows_cs))
print(len(df2_2))
df2_2

10
3546


Unnamed: 0,level_0,index,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,0,0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,1,1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,2,2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,3,3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,4,4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3541,3551,3655,,,25739541,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.375
3542,3552,3656,,,25739542,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.067
3543,3553,3657,,,25739543,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,18.97
3544,3554,3658,,,25739544,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.621


In [11]:
df2_3 = df2_2.drop_duplicates(subset='canonical_smiles').reset_index(drop=True)
dropped_rows_dupe = len(df2_2) - len(df2_3)
print(dropped_rows_dupe)
print(len(df2_3))
df2_3

1145
2401


Unnamed: 0,level_0,index,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,0,0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,1,1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,2,2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,3,3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,4,4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2396,3529,3633,,Falls Outside The Dose Series,25739519,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,99.5
2397,3530,3634,,,25739520,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.349
2398,3538,3642,,Falls Outside The Dose Series,25739528,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,99.5
2399,3541,3645,,,25739531,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.336


In [12]:
df3 = df2_3[df2_3.standard_type == 'IC50'].reset_index(drop=True)
dropped_rows_ic50 = len(df2_3) - len(df3)
print(dropped_rows_ic50)
print(len(df3))
df3

0
2401


Unnamed: 0,level_0,index,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,0,0,,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,1,1,,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,2,2,,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,3,3,,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,4,4,,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2396,3529,3633,,Falls Outside The Dose Series,25739519,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,99.5
2397,3530,3634,,,25739520,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.349
2398,3538,3642,,Falls Outside The Dose Series,25739528,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,99.5
2399,3541,3645,,,25739531,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5441457,Functional biochemical assay to identify treat...,F,,...,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.336


For this dataset, 104 rows were missing **standard_value**, 10 were missing **canonical_smiles**, 1145 had a duplicate **canonical_smiles**, and 0 were non-IC$_{50}$ measurements, leaving 2401 compounds.
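
The filtering steps above can also be sketched as one function. This is not the notebook's own code: `clean_bioactivity` is a hypothetical helper and the toy frame uses made-up values. It combines the two missing-value filters into one `dropna` call and passes `drop=True` to `reset_index`, which avoids the extra `index` and `level_0` columns that appear in the outputs above.

```python
import pandas as pd


def clean_bioactivity(df):
    """Drop rows lacking a potency or a structure, then de-duplicate by SMILES."""
    return (
        df.dropna(subset=["standard_value", "canonical_smiles"])
          .drop_duplicates(subset="canonical_smiles")  # keeps the first occurrence
          .reset_index(drop=True)                      # no leftover index columns
    )


# Toy frame (made-up values): one duplicated SMILES, one missing SMILES,
# and one missing potency.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", None, "c1ccccc1", "CCN"],
    "standard_value": ["390.0", "210.0", "80.0", None, "40.0"],
})

cleaned = clean_bioactivity(toy)
print(len(toy) - len(cleaned), "rows dropped;", len(cleaned), "kept")
print(list(cleaned.columns))
```

Note that `drop_duplicates` keeps only the first measurement for each structure, as in the notebook. Whether to keep the first value or aggregate repeated measurements is a modelling choice this sketch does not settle.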

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC$_{50}$ values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Compounds with values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [13]:
bioactivity_class = []
for i in df3.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

bc_set = set(bioactivity_class)
print(bc_set)

{'intermediate', 'active', 'inactive'}
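
The set printed above only confirms which classes occur. Here is a short sketch that wraps the same thresholds in a function and counts the labels, which also makes the boundary behaviour explicit. The name `classify_ic50` is illustrative. The sample list reuses the five **standard_value** entries shown earlier plus the two boundary values, written as strings on the assumption (suggested by the `float()` cast in the loop) that the values may arrive as text.

```python
from collections import Counter


def classify_ic50(ic50_nm):
    """Label a compound using the notebook's thresholds (values in nM)."""
    value = float(ic50_nm)  # accept strings as well as numbers
    if value >= 10000:
        return "inactive"
    if value <= 1000:
        return "active"
    return "intermediate"


# Five values from the table above, then the two boundary values.
sample = ["390.0", "210.0", "80.0", "1580.0", "40.0", "1000", "10000"]
labels = [classify_ic50(v) for v in sample]
counts = Counter(labels)
print(counts)
```

On the full dataset, `Counter(bioactivity_class)` would give the class balance, which is worth checking before training a model.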


### **Compiling bioactivity data into one dataframe**

In [14]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df4 = df3[selection]
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.0
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.0
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.0
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.0
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.0
...,...,...,...
2396,CHEMBL5441534,CC(C)[C@H](Nc1ncncc1-c1nc[nH]n1)c1ccc2c(c1)S(=...,99500.0
2397,CHEMBL5441398,COc1ccc([C@H](Cc2nnn[nH]2)NC(=O)c2ncnc3[nH]ccc...,349.0
2398,CHEMBL5442105,CC(C)[C@@H](Oc1ncnc2[nH]ccc12)c1ccc2c(c1)OCCO2,99500.0
2399,CHEMBL5441739,CC(C)[C@@H](Nc1ncnc2[nH]c(Br)cc12)c1ccc2c(c1)S...,336.0


In [15]:
bc_pd = pd.DataFrame(bioactivity_class)
bc_pd.columns= ['bioactivity_class']
bc_pd

Unnamed: 0,bioactivity_class
0,active
1,active
2,active
3,intermediate
4,active
...,...
2396,inactive
2397,active
2398,inactive
2399,active


In [16]:
df5 = pd.concat([df4,bc_pd], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.0,active
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.0,active
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.0,active
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.0,intermediate
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.0,active
...,...,...,...,...
2396,CHEMBL5441534,CC(C)[C@H](Nc1ncncc1-c1nc[nH]n1)c1ccc2c(c1)S(=...,99500.0,inactive
2397,CHEMBL5441398,COc1ccc([C@H](Cc2nnn[nH]2)NC(=O)c2ncnc3[nH]ccc...,349.0,active
2398,CHEMBL5442105,CC(C)[C@@H](Oc1ncnc2[nH]ccc12)c1ccc2c(c1)OCCO2,99500.0,inactive
2399,CHEMBL5441739,CC(C)[C@@H](Nc1ncnc2[nH]c(Br)cc12)c1ccc2c(c1)S...,336.0,active
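
The column-wise `pd.concat` above pairs rows by index label, so it works here because both frames carry a fresh 0..n-1 index. A small sketch with made-up frames (all names below are illustrative) shows what would happen if one index had gaps, and how resetting it first restores a row-by-row pairing.

```python
import pandas as pd

# A frame whose index has gaps, as it would after filtering without a reset.
values = pd.DataFrame({"standard_value": [390.0, 1580.0, 99500.0]}, index=[0, 3, 7])
labels = pd.DataFrame({"bioactivity_class": ["active", "intermediate", "inactive"]})

# Mismatched indexes: concat takes the union of index labels and fills gaps with NaN.
misaligned = pd.concat([values, labels], axis=1)

# Resetting the index first pairs the rows one to one.
aligned = pd.concat([values.reset_index(drop=True), labels], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking that the combined frame has the expected number of rows and no unexpected missing values is a cheap safeguard after this kind of concatenation.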


Save the dataframe to a CSV file.

In [17]:
df5.to_csv('bioactivity_preprocessed_data.csv', index=False)