# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Note for this Concise Version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for coronavirus**

In [3]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('sars-cov-2')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,33.0,False,CHEMBL4303835,[],ORGANISM,2697049.0
1,[],Severe acute respiratory syndrome-related coro...,SARS-CoV,32.0,False,CHEMBL4303836,[],ORGANISM,694009.0
2,[],SARS coronavirus,SARS coronavirus,15.0,False,CHEMBL612575,[],ORGANISM,227859.0
3,[],Homo sapiens,"Serine--tRNA ligase, cytoplasmic",14.0,False,CHEMBL4523232,"[{'accession': 'P49591', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,11.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859.0
...,...,...,...,...,...,...,...,...,...
2484,[],Rattus norvegicus,Voltage-gated sodium channel,0.0,False,CHEMBL3988641,"[{'accession': 'O88457', 'component_descriptio...",PROTEIN FAMILY,10116.0
2485,[],Homo sapiens,von Hippel-Lindau disease tumor suppressor/Elo...,0.0,False,CHEMBL4296117,"[{'accession': 'Q16665', 'component_descriptio...",PROTEIN COMPLEX,9606.0
2486,[],Homo sapiens,UDP-glucuronosyltransferases (UGTs),0.0,False,CHEMBL4523985,"[{'accession': 'P22310', 'component_descriptio...",PROTEIN FAMILY,9606.0
2487,[],Mus musculus,I-kappa-B kinase,0.0,False,CHEMBL4524000,"[{'accession': 'Q60680', 'component_descriptio...",PROTEIN COMPLEX,10090.0


### **Select and retrieve bioactivity data for *SARS-CoV-2* (first entry)**

We will assign the first entry (index 0, which corresponds to the *SARS-CoV-2* organism target, CHEMBL4303835) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL4303835'
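
Selecting by position depends on the order of the search results, which may differ between ChEMBL releases. A minimal sketch of selecting by name instead, using a small hand-built stand-in for the `targets` table (the three rows are copied from the search output above; `targets_demo` and `pick_target` are hypothetical names for illustration, not part of the ChEMBL client):

```python
import pandas as pd

# Stand-in for the `targets` DataFrame; rows copied from the search output.
targets_demo = pd.DataFrame({
    "pref_name": ["SARS-CoV-2", "SARS-CoV", "SARS coronavirus 3C-like proteinase"],
    "target_chembl_id": ["CHEMBL4303835", "CHEMBL4303836", "CHEMBL3927"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN"],
})

def pick_target(df, pref_name):
    """Return the ChEMBL ID of the single row whose pref_name matches exactly."""
    matches = df.loc[df["pref_name"] == pref_name, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {pref_name!r}, found {len(matches)}"
        )
    return matches.iloc[0]

selected_demo = pick_target(targets_demo, "SARS-CoV-2")
print(selected_demo)
```

Raising on zero or multiple matches makes a silent wrong-target selection less likely than with a bare positional index.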

Here, we will retrieve only bioactivity data for *SARS-CoV-2* (CHEMBL4303835) that are reported as IC$_{50}$ values; the **standard_value** column gives them in nM (nanomolar).

In [10]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [11]:
df = pd.DataFrame.from_dict(res)

In [14]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18827175,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,,,CHEMBL4303084,,2020,,CHEMBL3301610,ABEMACICLIB,CHEMBL3301610,5.18,False,http://www.openphacts.org/units/Nanomolar,3133933,=,52,True,=,,IC50,nM,,6620.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,6.62
1,,18827176,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,,,CHEMBL4303084,,2020,,CHEMBL682,AMODIAQUINE,CHEMBL682,5.29,False,http://www.openphacts.org/units/Nanomolar,3133934,=,52,True,=,,IC50,nM,,5150.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,5.15
2,,18827177,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,,,CHEMBL4303084,,2020,,CHEMBL264241,ANIDULAFUNGIN,CHEMBL264241,5.33,False,http://www.openphacts.org/units/Nanomolar,3133935,=,52,True,=,,IC50,nM,,4640.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,4.64
3,,18827178,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCN3CCCCCC3)cc2)c2...,,,CHEMBL4303084,,2020,,CHEMBL46740,BAZEDOXIFENE,CHEMBL46740,5.46,False,http://www.openphacts.org/units/Nanomolar,3133936,=,52,True,=,,IC50,nM,,3440.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,3.44
4,,18827179,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,COc1cc2c3cc1Oc1c(OC)c(OC)cc4c1[C@@H](Cc1ccc(O)...,,,CHEMBL4303084,,2020,,CHEMBL504323,BERBAMINE,CHEMBL504323,5.10,False,http://www.openphacts.org/units/Nanomolar,3133937,=,52,True,=,,IC50,nM,,7870.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,7.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
365,,20154347,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCCCn1c(NC(=O)c2cccs2)c(C#N)c2nc3ccccc3nc21,,,CHEMBL4495565,,2020,,CHEMBL3967196,,CHEMBL3967196,4.71,False,http://www.openphacts.org/units/Nanomolar,3362297,=,52,True,=,,IC50,nM,,19650.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,19.65
366,,20154348,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,Cc1c(C(=O)c2cccc3ccccc23)c2cccc3c2n1[C@H](CN1C...,,,CHEMBL4495565,,2020,,CHEMBL188,WIN-552122,CHEMBL188,,False,http://www.openphacts.org/units/Nanomolar,3361526,>,52,True,>,,IC50,nM,,33000.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,33.0
367,,20154349,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CNC(=O)Nc1ccc(-c2nc(N3CC4CCC(C3)O4)c3cnn(C4CCC...,,,CHEMBL4495565,,2020,,CHEMBL601661,,CHEMBL601661,4.67,False,http://www.openphacts.org/units/Nanomolar,3363375,=,52,True,=,,IC50,nM,,21620.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,21.62
368,,20154350,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CC(C)N1CCN(c2ccc(C(=O)c3c(-c4ccc(O)cc4)sc4cc(O...,,,CHEMBL4495565,,2020,,CHEMBL178334,,CHEMBL178334,4.53,False,http://www.openphacts.org/units/Nanomolar,3368236,=,52,True,=,,IC50,nM,,29870.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,29.87
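
The query filters on `standard_type` only, so it can be worth confirming that every row is reported in nM before relying on `standard_value`. A minimal sketch on a hand-built stand-in for `df` (values copied from the rows above; `df_demo` is a hypothetical name, and storing `standard_value` as strings is an assumption about how the API returns it):

```python
import pandas as pd

# Stand-in for three rows of `df`, reduced to the relevant columns.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL3301610", "CHEMBL682", "CHEMBL188"],
    "standard_units": ["nM", "nM", "nM"],
    "standard_value": ["6620.0", "5150.0", "33000.0"],
})

# Unique units found in the data; anything other than {"nM"} needs attention.
units = set(df_demo["standard_units"].dropna())
all_nm = units == {"nM"}

# Convert to float; unparseable entries become NaN instead of raising.
values_nm = pd.to_numeric(df_demo["standard_value"], errors="coerce")
print(all_nm, values_nm.tolist())
```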


In [None]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18827175,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,,,CHEMBL4303084,,2020,,CHEMBL3301610,ABEMACICLIB,CHEMBL3301610,5.18,False,http://www.openphacts.org/units/Nanomolar,3133933,=,52,True,=,,IC50,nM,,6620.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,6.62
1,,18827176,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,,,CHEMBL4303084,,2020,,CHEMBL682,AMODIAQUINE,CHEMBL682,5.29,False,http://www.openphacts.org/units/Nanomolar,3133934,=,52,True,=,,IC50,nM,,5150.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,5.15
2,,18827177,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,,,CHEMBL4303084,,2020,,CHEMBL264241,ANIDULAFUNGIN,CHEMBL264241,5.33,False,http://www.openphacts.org/units/Nanomolar,3133935,=,52,True,=,,IC50,nM,,4640.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,4.64


Finally, we will save the resulting bioactivity data to a CSV file, **sars-cov-2_bioactivity_data_raw.csv**.

In [12]:
df.to_csv('sars-cov-2_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **pchembl_value** column, drop it. Row 366 of `df`, which is reported with a `>` relation, is one such record.

In [15]:
df2 = df[df.pchembl_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18827175,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,,,CHEMBL4303084,,2020,,CHEMBL3301610,ABEMACICLIB,CHEMBL3301610,5.18,False,http://www.openphacts.org/units/Nanomolar,3133933,=,52,True,=,,IC50,nM,,6620.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,6.62
1,,18827176,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,,,CHEMBL4303084,,2020,,CHEMBL682,AMODIAQUINE,CHEMBL682,5.29,False,http://www.openphacts.org/units/Nanomolar,3133934,=,52,True,=,,IC50,nM,,5150.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,5.15
2,,18827177,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,,,CHEMBL4303084,,2020,,CHEMBL264241,ANIDULAFUNGIN,CHEMBL264241,5.33,False,http://www.openphacts.org/units/Nanomolar,3133935,=,52,True,=,,IC50,nM,,4640.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,4.64
3,,18827178,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCN3CCCCCC3)cc2)c2...,,,CHEMBL4303084,,2020,,CHEMBL46740,BAZEDOXIFENE,CHEMBL46740,5.46,False,http://www.openphacts.org/units/Nanomolar,3133936,=,52,True,=,,IC50,nM,,3440.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,3.44
4,,18827179,[],CHEMBL4303812,Antiviral activity against SARS-CoV-2 (viral t...,F,,,BAO_0000190,BAO_0000218,organism-based format,COc1cc2c3cc1Oc1c(OC)c(OC)cc4c1[C@@H](Cc1ccc(O)...,,,CHEMBL4303084,,2020,,CHEMBL504323,BERBAMINE,CHEMBL504323,5.10,False,http://www.openphacts.org/units/Nanomolar,3133937,=,52,True,=,,IC50,nM,,7870.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,7.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
364,,20154346,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,c1cncc(CN2CCC(n3ncc4c(N5CCOCC5)nc(-c5ccc6[nH]c...,,,CHEMBL4495565,,2020,,CHEMBL583194,,CHEMBL583194,5.03,False,http://www.openphacts.org/units/Nanomolar,3367857,=,52,True,=,,IC50,nM,,9430.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,9.43
365,,20154347,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CCCCn1c(NC(=O)c2cccs2)c(C#N)c2nc3ccccc3nc21,,,CHEMBL4495565,,2020,,CHEMBL3967196,,CHEMBL3967196,4.71,False,http://www.openphacts.org/units/Nanomolar,3362297,=,52,True,=,,IC50,nM,,19650.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,19.65
367,,20154349,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CNC(=O)Nc1ccc(-c2nc(N3CC4CCC(C3)O4)c3cnn(C4CCC...,,,CHEMBL4495565,,2020,,CHEMBL601661,,CHEMBL601661,4.67,False,http://www.openphacts.org/units/Nanomolar,3363375,=,52,True,=,,IC50,nM,,21620.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,21.62
368,,20154350,[],CHEMBL4513083,Determination of IC50 values for inhibition of...,F,,,BAO_0000190,BAO_0000218,organism-based format,CC(C)N1CCN(c2ccc(C(=O)c3c(-c4ccc(O)cc4)sc4cc(O...,,,CHEMBL4495565,,2020,,CHEMBL178334,,CHEMBL178334,4.53,False,http://www.openphacts.org/units/Nanomolar,3368236,=,52,True,=,,IC50,nM,,29870.0,CHEMBL4303835,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,2697049,,,IC50,uM,UO_0000065,,29.87
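
The same filter can be written with `dropna`, which makes the column being checked explicit. A small sketch on stand-in data (rows 365 to 367 of `df`, reduced to three columns; `demo` and `kept` are hypothetical names). Note that the surviving rows keep their original index labels, which leaves a gap at 366:

```python
import numpy as np
import pandas as pd

# Stand-in for rows 365-367 of `df`; row 366 has no pChEMBL value.
demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL3967196", "CHEMBL188", "CHEMBL601661"],
        "pchembl_value": [4.71, np.nan, 4.67],
        "standard_value": [19650.0, 33000.0, 21620.0],
    },
    index=[365, 366, 367],
)

# Equivalent to demo[demo.pchembl_value.notna()]
kept = demo.dropna(subset=["pchembl_value"])
print(kept.index.tolist())
```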


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC$_{50}$ values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [16]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
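
The loop can also be expressed as a vectorized function. The sketch below uses `numpy.select` with the same thresholds and the same order of checks; `label_bioactivity` and the demo values are hypothetical and only illustrate the rule:

```python
import numpy as np
import pandas as pd

def label_bioactivity(values_nm):
    """Label IC50 values (nM): <= 1000 active, >= 10000 inactive, else intermediate."""
    v = pd.to_numeric(values_nm).astype(float)
    labels = np.select(
        [v >= 10000, v <= 1000],       # conditions, checked in this order
        ["inactive", "active"],        # label for each condition
        default="intermediate",        # everything in between
    )
    # Reuse the input index so the labels stay attached to their rows.
    return pd.Series(labels, index=v.index, name="bioactivity_class")

demo_values = pd.Series([6620.0, 19650.0, 1000.0, 10000.0, 110.0])
demo_labels = label_bioactivity(demo_values)
print(demo_labels.tolist())
```

Because the returned Series carries the index of its input, it aligns with the source rows by label rather than by position.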

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [17]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL3301610,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,6620.0
1,CHEMBL682,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,5150.0
2,CHEMBL264241,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,4640.0
3,CHEMBL46740,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCN3CCCCCC3)cc2)c2...,3440.0
4,CHEMBL504323,COc1cc2c3cc1Oc1c(OC)c(OC)cc4c1[C@@H](Cc1ccc(O)...,7870.0
...,...,...,...
364,CHEMBL583194,c1cncc(CN2CCC(n3ncc4c(N5CCOCC5)nc(-c5ccc6[nH]c...,9430.0
365,CHEMBL3967196,CCCCn1c(NC(=O)c2cccs2)c(C#N)c2nc3ccccc3nc21,19650.0
367,CHEMBL601661,CNC(=O)Nc1ccc(-c2nc(N3CC4CCC(C3)O4)c3cnn(C4CCC...,21620.0
368,CHEMBL178334,CC(C)N1CCN(c2ccc(C(=O)c3c(-c4ccc(O)cc4)sc4cc(O...,29870.0


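Dropping rows leaves gaps in the index (for example, 366 is missing above), so the index no longer matches the position-based `bioactivity_class` list. To get rid of this mismatch in numbering, we write the data to a temporary file and read it back, which renumbers the rows from 0.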

In [18]:
df3.to_csv('temp.csv',index=False)

In [19]:
df_temp = pd.read_csv('temp.csv')

In [20]:
df_temp

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL3301610,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,6620.0
1,CHEMBL682,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,5150.0
2,CHEMBL264241,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,4640.0
3,CHEMBL46740,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCN3CCCCCC3)cc2)c2...,3440.0
4,CHEMBL504323,COc1cc2c3cc1Oc1c(OC)c(OC)cc4c1[C@@H](Cc1ccc(O)...,7870.0
...,...,...,...
281,CHEMBL583194,c1cncc(CN2CCC(n3ncc4c(N5CCOCC5)nc(-c5ccc6[nH]c...,9430.0
282,CHEMBL3967196,CCCCn1c(NC(=O)c2cccs2)c(C#N)c2nc3ccccc3nc21,19650.0
283,CHEMBL601661,CNC(=O)Nc1ccc(-c2nc(N3CC4CCC(C3)O4)c3cnn(C4CCC...,21620.0
284,CHEMBL178334,CC(C)N1CCN(c2ccc(C(=O)c3c(-c4ccc(O)cc4)sc4cc(O...,29870.0


In [21]:
df3 = df_temp
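
The round trip through `temp.csv` works because reading the file back assigns a fresh 0-based index. A minimal sketch of the same effect with `reset_index(drop=True)`, on stand-in data (`filtered` and `classes` are hypothetical names). It also shows why the gap matters: `pd.concat(..., axis=1)` aligns on index labels, not on position.

```python
import pandas as pd

# Stand-in for df3 after filtering: index label 1 was dropped.
filtered = pd.DataFrame({"standard_value": [6620.0, 19650.0]}, index=[0, 2])
classes = pd.Series(["intermediate", "inactive"], name="bioactivity_class")

# Without a reset, the labels 0, 1, 2 are unioned and the values misalign.
misaligned = pd.concat([filtered, classes], axis=1)

# With a reset, both objects share the index 0..n-1 and line up row by row.
aligned = pd.concat([filtered.reset_index(drop=True), classes], axis=1)
print(len(misaligned), len(aligned))
```

Unlike the CSV round trip, `reset_index` does not write to disk and leaves column dtypes unchanged.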

In [22]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL3301610,CCN1CCN(Cc2ccc(Nc3ncc(F)c(-c4cc(F)c5nc(C)n(C(C...,6620.0,intermediate
1,CHEMBL682,CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O,5150.0,intermediate
2,CHEMBL264241,CCCCCOc1ccc(-c2ccc(-c3ccc(C(=O)N[C@H]4C[C@@H](...,4640.0,intermediate
3,CHEMBL46740,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCN3CCCCCC3)cc2)c2...,3440.0,intermediate
4,CHEMBL504323,COc1cc2c3cc1Oc1c(OC)c(OC)cc4c1[C@@H](Cc1ccc(O)...,7870.0,intermediate
...,...,...,...,...
281,CHEMBL583194,c1cncc(CN2CCC(n3ncc4c(N5CCOCC5)nc(-c5ccc6[nH]c...,9430.0,intermediate
282,CHEMBL3967196,CCCCn1c(NC(=O)c2cccs2)c(C#N)c2nc3ccccc3nc21,19650.0,inactive
283,CHEMBL601661,CNC(=O)Nc1ccc(-c2nc(N3CC4CCC(C3)O4)c3cnn(C4CCC...,21620.0,inactive
284,CHEMBL178334,CC(C)N1CCN(c2ccc(C(=O)c3c(-c4ccc(O)cc4)sc4cc(O...,29870.0,inactive


Save the DataFrame to a CSV file.

In [23]:
df4.to_csv('sars-cov-2_bioactivity_data_preprocessed.csv', index=False)

In [24]:
! ls -l

total 264
drwxr-xr-x 1 root root   4096 Dec 23 14:32 sample_data
-rw-r--r-- 1 root root  26606 Jan  9 11:01 sars-cov-2_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 209411 Jan  9 10:58 sars-cov-2_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  23530 Jan  9 11:01 temp.csv


---