# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for DYRK1A**

In [3]:
# Target search for DYRK1A
target = new_client.target
target_query = target.search('DYRK1A')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q61214', 'xref_name': None, 'xre...",Mus musculus,Dual-specificity tyrosine-phosphorylation regu...,13.0,False,CHEMBL4750,"[{'accession': 'Q61214', 'component_descriptio...",SINGLE PROTEIN,10090
1,"[{'xref_id': 'Q63470', 'xref_name': None, 'xre...",Rattus norvegicus,Dual specificity tyrosine-phosphorylation-regu...,13.0,False,CHEMBL5508,"[{'accession': 'Q63470', 'component_descriptio...",SINGLE PROTEIN,10116
2,"[{'xref_id': 'Q13627', 'xref_name': None, 'xre...",Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,12.0,False,CHEMBL2292,"[{'accession': 'Q13627', 'component_descriptio...",SINGLE PROTEIN,9606


### **Select and retrieve bioactivity data for human *DYRK1A* (third entry)**

We will assign the third entry (index 2, which corresponds to the human target protein, *dual-specificity tyrosine-phosphorylation-regulated kinase 1A*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL2292'
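Selecting a row by position is fragile, because the order of the search results may differ between ChEMBL releases. As a minimal sketch, the target can instead be picked by filtering on the columns shown in the output above. The `targets_demo` frame and the `selected_target_demo` name are hypothetical stand-ins built from the three rows displayed; with the real `targets` DataFrame the same mask should apply, assuming those columns are present.

```python
import pandas as pd

# Hypothetical stand-in for `targets`, using values from the output above.
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Rattus norvegicus", "Homo sapiens"],
    "target_type": ["SINGLE PROTEIN"] * 3,
    "target_chembl_id": ["CHEMBL4750", "CHEMBL5508", "CHEMBL2292"],
})

# Keep only human single-protein targets.
mask = (
    (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the filter is ambiguous or matches nothing.
if len(matches) != 1:
    raise ValueError(f"Expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

If the filter ever matches zero or several rows, the explicit check raises an error instead of silently picking the wrong target.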

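Here, we will retrieve only the bioactivity data for human *DYRK1A* (CHEMBL2292) that are reported as IC$_{50}$ values. The **standard_value** column gives these values in the unit named in **standard_units**, which is nM (nanomolar) for the records shown below, apart from one flagged entry in µM.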

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [12]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1476106,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cnccc21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '30.22', 'le': '0.56', 'lle': '5.88', ...",CHEMBL189657,,CHEMBL189657,6.96,False,http://www.openphacts.org/units/Nanomolar,382064,=,1,True,=,,IC50,nM,,110.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,110.0
1,,1476649,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '15.51', 'le': '0.29', 'lle': '4.39', ...",CHEMBL188434,,CHEMBL188434,5.31,False,http://www.openphacts.org/units/Nanomolar,382011,=,1,True,=,,IC50,nM,,4900.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,4900.0
2,,1701840,[],CHEMBL861010,Inhibition of human recombinant DYRK1a,B,,,BAO_0000190,BAO_0000357,single protein format,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,,,CHEMBL1146093,J. Med. Chem.,2006,,CHEMBL6246,ELLAGIC ACID,CHEMBL6246,,False,http://www.openphacts.org/units/Nanomolar,448549,>,1,True,>,,IC50,nM,,40000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,40.0
3,,1751439,[],CHEMBL869157,Inhibition of DYRK1a,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)c1nnc2ccc(-c3ocnc3-c3ccc(F)cc3Cl)cn12,,,CHEMBL1145312,Bioorg. Med. Chem. Lett.,2006,,CHEMBL215652,,CHEMBL215652,,False,http://www.openphacts.org/units/Nanomolar,528643,>,1,True,>,,IC50,nM,,10000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,10.0
4,,1751440,[],CHEMBL869157,Inhibition of DYRK1a,B,,,BAO_0000190,BAO_0000357,single protein format,CC(C)c1nnc2ccc(-c3ocnc3-c3cc(F)c(F)cc3F)cn12,,,CHEMBL1145312,Bioorg. Med. Chem. Lett.,2006,,CHEMBL213423,,CHEMBL213423,,False,http://www.openphacts.org/units/Nanomolar,528646,>,1,True,>,,IC50,nM,,10000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1156,Slightly Active,20739000,[],CHEMBL4511049,In vitro kinase assay (DYRK1A),B,,,BAO_0000190,BAO_0000357,single protein format,CN1CCN(C(=O)C(C)(C)c2ccc(C(=O)Nc3cn4cc(-c5ccnc...,,,CHEMBL4507307,,2021,,CHEMBL4531334,T3-CLK,CHEMBL4531334,,False,http://www.openphacts.org/units/Nanomolar,3359743,,54,True,,,IC50,nM,,260.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,260.0
1157,Slightly Active,20739005,[],CHEMBL4511054,NanoBRET (SGC Frankfurt) (DYRK1A),B,,,BAO_0000190,BAO_0000357,single protein format,CN1CCN(C(=O)C(C)(C)c2ccc(C(=O)Nc3cn4cc(-c5ccnc...,,,CHEMBL4507307,,2021,,CHEMBL4531334,T3-CLK,CHEMBL4531334,,False,http://www.openphacts.org/units/Nanomolar,3359743,,54,True,,,IC50,nM,,32.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,32.0
1158,Not Active,20739013,[],CHEMBL4511054,NanoBRET (SGC Frankfurt) (DYRK1A),B,,,BAO_0000190,BAO_0000357,single protein format,CN1CCN(C(=O)C(C)(C)c2ccc(C(=O)Nc3cn4cc(-c5cc(C...,Non standard unit for type,Units for this activity type are unusual and m...,CHEMBL4507307,,2021,,CHEMBL4576555,T3-CLK-N,CHEMBL4576555,,False,,3359744,>,54,False,>,,IC50,µM,,10.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,µM,,,10.0
1159,,20764795,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4512209,DYRK1A(DY1ALGP1) Takeda global kinase panel,B,,,BAO_0000190,BAO_0000357,single protein format,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,,,CHEMBL4507326,,2021,,CHEMBL4549667,TP-030-2,CHEMBL4549667,,False,http://www.openphacts.org/units/Nanomolar,3359780,,54,True,,,IC50,nM,,1000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,pIC50,,UO_0000065,,6.0


In [7]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1476106,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cnccc21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '30.22', 'le': '0.56', 'lle': '5.88', ...",CHEMBL189657,,CHEMBL189657,6.96,False,http://www.openphacts.org/units/Nanomolar,382064,=,1,True,=,,IC50,nM,,110.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,110.0
1,,1476649,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '15.51', 'le': '0.29', 'lle': '4.39', ...",CHEMBL188434,,CHEMBL188434,5.31,False,http://www.openphacts.org/units/Nanomolar,382011,=,1,True,=,,IC50,nM,,4900.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,4900.0
2,,1701840,[],CHEMBL861010,Inhibition of human recombinant DYRK1a,B,,,BAO_0000190,BAO_0000357,single protein format,O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23,,,CHEMBL1146093,J. Med. Chem.,2006,,CHEMBL6246,ELLAGIC ACID,CHEMBL6246,,False,http://www.openphacts.org/units/Nanomolar,448549,>,1,True,>,,IC50,nM,,40000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,40.0


Finally, we will save the resulting bioactivity data to a CSV file, **Dyrk1a_bioactivity_data_raw.csv**.

In [8]:
df.to_csv('Dyrk1a_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
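If a record has a missing value in the **pchembl_value** column, then drop it. In the output above, records with a censored relation (such as `>`) have no pChEMBL value, so this filter removes them as well.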

In [9]:
df2 = df[df.pchembl_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1476106,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cnccc21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '30.22', 'le': '0.56', 'lle': '5.88', ...",CHEMBL189657,,CHEMBL189657,6.96,False,http://www.openphacts.org/units/Nanomolar,382064,=,1,True,=,,IC50,nM,,110.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,110.0
1,,1476649,[],CHEMBL828004,Inhibitory concentration against selected kina...,B,,,BAO_0000190,BAO_0000357,single protein format,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,,,CHEMBL1139578,Bioorg. Med. Chem. Lett.,2005,"{'bei': '15.51', 'le': '0.29', 'lle': '4.39', ...",CHEMBL188434,,CHEMBL188434,5.31,False,http://www.openphacts.org/units/Nanomolar,382011,=,1,True,=,,IC50,nM,,4900.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,4900.0
10,,1837933,[],CHEMBL920482,Inhibition of DYRK1A,B,,,BAO_0000190,BAO_0000357,single protein format,CN(C)c1nc2c(Br)c(Br)c(Br)c(Br)c2[nH]1,,,CHEMBL1149250,J. Med. Chem.,2004,"{'bei': '14.52', 'le': '0.59', 'lle': '2.24', ...",CHEMBL376505,,CHEMBL376505,6.92,False,http://www.openphacts.org/units/Nanomolar,629634,=,1,True,=,,IC50,nM,,120.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,0.12
11,,2137228,[],CHEMBL936939,Inhibition of DYRK1a,B,,,BAO_0000190,BAO_0000357,single protein format,NS(=O)(=O)c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1,,,CHEMBL1145498,Proc. Natl. Acad. Sci. U.S.A.,2007,"{'bei': '15.02', 'le': '0.29', 'lle': '3.35', ...",CHEMBL319467,,CHEMBL319467,6.05,True,http://www.openphacts.org/units/Nanomolar,705222,=,1,True,=,,IC50,nM,,900.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,900.0
14,,2211669,[],CHEMBL980537,Inhibition of human recombinant DYRK1A,B,,,BAO_0000190,BAO_0000357,single protein format,CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,,,CHEMBL1149828,J. Nat. Prod.,2008,"{'bei': '13.39', 'le': '0.25', 'lle': '1.54', ...",CHEMBL14762,SELICICLIB,CHEMBL14762,4.75,False,http://www.openphacts.org/units/Nanomolar,742340,=,1,True,=,,IC50,nM,,18000.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,uM,UO_0000065,,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1150,,20680054,[],CHEMBL4622607,Inhibition of recombinant human N-terminal Hex...,B,,,BAO_0000190,BAO_0000019,assay format,COc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,,,CHEMBL4619818,ACS Med Chem Lett,2020,"{'bei': '24.88', 'le': '0.45', 'lle': '4.98', ...",CHEMBL187081,PYRAZOLOPYRIDAZINE 1,CHEMBL187081,7.92,False,http://www.openphacts.org/units/Nanomolar,3482667,=,1,True,=,,IC50,nM,,12.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,12.0
1151,,20680055,[],CHEMBL4622607,Inhibition of recombinant human N-terminal Hex...,B,,,BAO_0000190,BAO_0000019,assay format,N#Cc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,,,CHEMBL4619818,ACS Med Chem Lett,2020,"{'bei': '26.49', 'le': '0.47', 'lle': '5.50', ...",CHEMBL495696,GW778894X,CHEMBL495696,8.30,False,http://www.openphacts.org/units/Nanomolar,3482668,=,1,True,=,,IC50,nM,,5.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,5.0
1152,,20680056,[],CHEMBL4622607,Inhibition of recombinant human N-terminal Hex...,B,,,BAO_0000190,BAO_0000019,assay format,N#Cc1ccc(Nc2nccc(-c3cnn4ncccc34)n2)cc1,,,CHEMBL4619818,ACS Med Chem Lett,2020,"{'bei': '22.30', 'le': '0.40', 'lle': '4.19', ...",CHEMBL359794,,CHEMBL359794,6.99,False,http://www.openphacts.org/units/Nanomolar,3482669,=,1,True,=,,IC50,nM,,103.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,103.0
1153,,20680057,[],CHEMBL4622607,Inhibition of recombinant human N-terminal Hex...,B,,,BAO_0000190,BAO_0000019,assay format,Cc1cccc(N(C)c2nccc(-c3cnn4ncccc34)n2)c1,,,CHEMBL4619818,ACS Med Chem Lett,2020,"{'bei': '22.50', 'le': '0.41', 'lle': '3.86', ...",CHEMBL4637319,,CHEMBL4637319,7.12,False,http://www.openphacts.org/units/Nanomolar,3482670,=,1,True,=,,IC50,nM,,76.0,CHEMBL2292,Homo sapiens,Dual-specificity tyrosine-phosphorylation regu...,9606,,,IC50,nM,UO_0000065,,76.0


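For this dataset, the filter removes the records that lack a pChEMBL value: the row labels of the full table run from 0 to 1159, and 841 records remain after filtering. The same code cell can be used for the bioactivity data of other target proteins.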
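The boolean-mask filter above can also be written with `DataFrame.dropna`, which makes the intent explicit and extends easily to several columns. This is a minimal sketch on a hypothetical three-row frame (`demo`) whose values are taken from the output above; it is not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; values copied from the output above.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL189657", "CHEMBL6246", "CHEMBL376505"],
    "standard_value": [110.0, 40000.0, 120.0],
    "pchembl_value": [6.96, np.nan, 6.92],
})

# Two equivalent ways to drop rows without a pChEMBL value.
kept_mask = demo[demo["pchembl_value"].notna()]
kept_dropna = demo.dropna(subset=["pchembl_value"])

# Report how many rows were removed, so the filter never goes unnoticed.
n_dropped = len(demo) - len(kept_dropna)
print(f"Dropped {n_dropped} of {len(demo)} rows")
```

Printing the number of dropped rows is a cheap way to confirm whether a dataset really has no missing values.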

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC$_{50}$ values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [10]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
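The same labeling can be done without an explicit loop. The sketch below uses `numpy.select` with the same cut-offs as the loop above; the helper name `label_bioactivity` and the sample values are illustrative assumptions, and the function assumes there are no missing values in the input.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values):
    """Label IC50 values (nM) using the same cut-offs as the loop above."""
    # Accept strings or numbers; fail on anything that is not numeric.
    ic50 = pd.to_numeric(pd.Series(values), errors="raise")
    conditions = [
        (ic50 >= 10000).to_numpy(),  # inactive
        (ic50 <= 1000).to_numpy(),   # active
    ]
    choices = ["inactive", "active"]
    # Anything matching neither condition is labeled intermediate.
    labels = np.select(conditions, choices, default="intermediate")
    return pd.Series(labels, index=ic50.index, name="bioactivity_class")

# Sample values, including both boundary cases (1000 and 10000).
demo_values = ["110.0", 4900.0, 18000.0, 1000.0, 10000.0]
demo_labels = label_bioactivity(demo_values)
print(demo_labels.tolist())
```

Because the result is a Series that shares the index of its input, it can be assigned directly as a new column of the DataFrame it was computed from.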

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [11]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL189657,CCn1c(-c2nonc2N)nc2cnccc21,110.0
1,CHEMBL188434,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,4900.0
10,CHEMBL376505,CN(C)c1nc2c(Br)c(Br)c(Br)c(Br)c2[nH]1,120.0
11,CHEMBL319467,NS(=O)(=O)c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1,900.0
14,CHEMBL14762,CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,18000.0
...,...,...,...
1150,CHEMBL187081,COc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,12.0
1151,CHEMBL495696,N#Cc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,5.0
1152,CHEMBL359794,N#Cc1ccc(Nc2nccc(-c3cnn4ncccc34)n2)cc1,103.0
1153,CHEMBL4637319,Cc1cccc(N(C)c2nccc(-c3cnn4ncccc34)n2)c1,76.0


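After filtering, the DataFrame keeps its original row labels (0, 1, 10, 11, ...), which do not line up with the 0-based **bioactivity_class** list. To get rid of this mismatch in numbering, we write the data to a temporary file and read it back, which resets the row labels.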

In [13]:
df3.to_csv('temp.csv',index=False)

In [14]:
df_temp = pd.read_csv('temp.csv')

In [15]:
df_temp

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL189657,CCn1c(-c2nonc2N)nc2cnccc21,110.0
1,CHEMBL188434,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,4900.0
2,CHEMBL376505,CN(C)c1nc2c(Br)c(Br)c(Br)c(Br)c2[nH]1,120.0
3,CHEMBL319467,NS(=O)(=O)c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1,900.0
4,CHEMBL14762,CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,18000.0
...,...,...,...
837,CHEMBL187081,COc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,12.0
838,CHEMBL495696,N#Cc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,5.0
839,CHEMBL359794,N#Cc1ccc(Nc2nccc(-c3cnn4ncccc34)n2)cc1,103.0
840,CHEMBL4637319,Cc1cccc(N(C)c2nccc(-c3cnn4ncccc34)n2)c1,76.0


In [16]:
df3 = df_temp
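The round trip through a temporary file works because reading a CSV creates fresh row labels. A more direct alternative is `reset_index(drop=True)`. The sketch below uses a hypothetical three-row frame (`filtered`) with the gapped labels 0, 1, 10 to show what goes wrong without the reset and how the reset fixes it.

```python
import pandas as pd

# Hypothetical filtered frame that kept its original row labels.
filtered = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL189657", "CHEMBL188434", "CHEMBL376505"],
        "standard_value": [110.0, 4900.0, 120.0],
    },
    index=[0, 1, 10],
)
labels = pd.Series(["active", "intermediate", "active"], name="bioactivity_class")

# Without a reset, concat aligns on labels and introduces NaN rows.
misaligned = pd.concat([filtered, labels], axis=1)

# With a reset, both objects share the labels 0, 1, 2.
aligned = pd.concat([filtered.reset_index(drop=True), labels], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

With this approach no temporary file is left behind in the working directory.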

In [17]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL189657,CCn1c(-c2nonc2N)nc2cnccc21,110.0,active
1,CHEMBL188434,CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21,4900.0,intermediate
2,CHEMBL376505,CN(C)c1nc2c(Br)c(Br)c(Br)c(Br)c2[nH]1,120.0,active
3,CHEMBL319467,NS(=O)(=O)c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1,900.0,active
4,CHEMBL14762,CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,18000.0,inactive
...,...,...,...,...
837,CHEMBL187081,COc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,12.0,active
838,CHEMBL495696,N#Cc1cccc(Nc2nccc(-c3cnn4ncccc34)n2)c1,5.0,active
839,CHEMBL359794,N#Cc1ccc(Nc2nccc(-c3cnn4ncccc34)n2)cc1,103.0,active
840,CHEMBL4637319,Cc1cccc(N(C)c2nccc(-c3cnn4ncccc34)n2)c1,76.0,active


Save the DataFrame to a CSV file.

In [18]:
df4.to_csv('Dyrk1a_bioactivity_data_preprocessed.csv', index=False)

In [19]:
! ls -l

total 804
-rw-r--r-- 1 root root  60892 Nov 26 05:02 Dyrk1a_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 699557 Nov 26 04:58 Dyrk1a_bioactivity_data_raw.csv
drwxr-xr-x 1 root root   4096 Nov 18 14:36 sample_data
-rw-r--r-- 1 root root  53434 Nov 26 05:00 temp.csv
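Beyond listing the files, it can be worth reading the saved CSV back and checking its shape and class balance before moving on. This is a minimal sketch on a hypothetical three-row frame written to a temporary directory; the file name `demo_preprocessed.csv` is an assumption used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with the same columns as the preprocessed data.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL189657", "CHEMBL188434", "CHEMBL14762"],
    "canonical_smiles": [
        "CCn1c(-c2nonc2N)nc2cnccc21",
        "CCn1c(-c2nonc2N)nc2cncc(CNC3CCNCC3)c21",
        "CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1",
    ],
    "standard_value": [110.0, 4900.0, 18000.0],
    "bioactivity_class": ["active", "intermediate", "inactive"],
})

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_preprocessed.csv")
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Count how many compounds fall into each class.
class_counts = reloaded["bioactivity_class"].value_counts()
print(reloaded.shape)
print(class_counts)
```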


---