<a href="https://colab.research.google.com/github/tasneem94/Bioactivity-Predicition-Thesis/blob/main/Thesis_01_Plasmodium_falciparum_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2.3 million compounds, compiled from more than 85,000 documents and 1.5 million assays. The data span 15,000 targets, 2,000 cells, and 43,000 indications.
[Data as of September 29, 2022; ChEMBL version 31].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Plasmodium falciparum**

In [101]:
# Target search for Plasmodium falciparum
target = new_client.target
target_query = target.search('Plasmodium falciparum')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Plasmodium falciparum,Plasmodium falciparum,28.0,False,CHEMBL364,[],ORGANISM,5833
1,[],Plasmodium falciparum 3D7,Plasmodium falciparum 3D7,24.0,False,CHEMBL2366922,[],ORGANISM,36329
2,[],Plasmodium falciparum D6,Plasmodium falciparum D6,24.0,False,CHEMBL2367107,[],ORGANISM,478860
3,[],Plasmodium falciparum NF54,Plasmodium falciparum NF54,24.0,False,CHEMBL2367131,[],ORGANISM,5843
4,[],Plasmodium falciparum FcB1/Columbia,Plasmodium falciparum (isolate FcB1 / Columbia),19.0,False,CHEMBL612608,[],ORGANISM,186763
...,...,...,...,...,...,...,...,...,...
71,"[{'xref_id': 'Q27744', 'xref_name': None, 'xre...",Plasmodium falciparum,Aldolase,8.0,False,CHEMBL4156,"[{'accession': 'Q27744', 'component_descriptio...",SINGLE PROTEIN,5833
72,[],Plasmodium falciparum,HAP protein (Putative aspartic proteinase),8.0,False,CHEMBL6075,"[{'accession': 'Q9Y006', 'component_descriptio...",SINGLE PROTEIN,5833
73,[],Plasmodium falciparum,Glutathione S-transferase,8.0,False,CHEMBL1697656,"[{'accession': 'Q8ILQ7', 'component_descriptio...",SINGLE PROTEIN,5833
74,[],Plasmodium falciparum,Calcium-dependent protein kinase 1,8.0,False,CHEMBL1908387,"[{'accession': 'P62344', 'component_descriptio...",SINGLE PROTEIN,5833


### **Select and retrieve bioactivity data for *Plasmodium falciparum* (first entry)**

In [105]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL364'
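
Taking row 0 works here because the whole-organism entry is returned first, but it relies on the ordering of the search results. As a minimal sketch, the same ID can be picked by name and target type instead. The `toy_targets` frame below is a hand-built stand-in that copies a few rows from the search output above; it is not fetched from ChEMBL, and the variable names are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for `targets`, copying a few rows from the search output above.
toy_targets = pd.DataFrame({
    "organism": ["Plasmodium falciparum", "Plasmodium falciparum 3D7", "Plasmodium falciparum"],
    "pref_name": ["Plasmodium falciparum", "Plasmodium falciparum 3D7", "Aldolase"],
    "target_chembl_id": ["CHEMBL364", "CHEMBL2366922", "CHEMBL4156"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN"],
})

# Select the whole-organism entry by name and type rather than by row position.
mask = (
    (toy_targets["pref_name"] == "Plasmodium falciparum")
    & (toy_targets["target_type"] == "ORGANISM")
)
selected_by_name = toy_targets.loc[mask, "target_chembl_id"].iloc[0]
print(selected_by_name)
```

Applied to the real `targets` DataFrame, the same mask should give the ID shown above, provided the search still returns that entry.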

Here, we retrieve only the bioactivity data for *Plasmodium falciparum* (CHEMBL364) that are reported as IC50 values.

In [106]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [107]:
df = pd.DataFrame.from_dict(res)

In [108]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,32325,[],CHEMBL764090,Growth inhibition of chloroquine-resistant Pla...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,73.5
1,,32480,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,4.25
2,,33480,[],CHEMBL764090,Growth inhibition of chloroquine-resistant Pla...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,22.5
3,,34739,[],CHEMBL764090,Growth inhibition of chloroquine-resistant Pla...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,200.0
4,,34877,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,M,UO_0000065,,0.00000566
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43319,,23681266,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,25.0
43320,,23681267,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,0.416
43321,,23681268,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,0.444
43322,,23681269,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,1.32


Finally, we save the resulting bioactivity data to the CSV file **Plasmodium_falciparum_01_bioactivity_data_raw.csv**.

In [110]:
df.to_csv('Plasmodium_falciparum_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
Drop any compound that has a missing value in the **standard_value** or **canonical_smiles** column.

In [109]:
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
1,,32480,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,4.25
4,,34877,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,M,UO_0000065,,0.00000566
5,,34878,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,M,UO_0000065,,0.0000176
11,,37328,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,9.55
12,,37329,[],CHEMBL760653,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,16.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43319,,23681266,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,25.0
43320,,23681267,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,0.416
43321,,23681268,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,0.444
43322,,23681269,[],CHEMBL4888493,Re-testing in dose-response curve in 3D7 pLDH ...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,1.32
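
The two `notna()` filters can also be expressed as a single `dropna` call with a `subset` argument. The sketch below uses a small made-up frame (the IDs and SMILES are placeholders, not ChEMBL records) to show which rows survive.

```python
import numpy as np
import pandas as pd

# Made-up rows; IDs and SMILES are placeholders for illustration only.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [4.25, 5660.0, np.nan, 18.0],
})

# Keep only rows where both columns are present.
toy_clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(toy_clean["molecule_chembl_id"].tolist())
```

Rows "B" (missing SMILES) and "C" (missing value) are dropped, and the original index labels are kept, as in the output above.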


In [112]:
len(df2.canonical_smiles.unique())

20664

In [113]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
1,,32480,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,4.25
4,,34877,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,M,UO_0000065,,0.00000566
5,,34878,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,M,UO_0000065,,0.0000176
11,,37328,[],CHEMBL760652,In vitro antimalarial activity against Plasmod...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,9.55
13,,38263,[],CHEMBL762998,Antimalarial activity against Plasmodium falci...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,nM,UO_0000065,,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41809,,23679273,[],CHEMBL4888488,Hit confirmation in dose-response curve in NF5...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,1.86
41810,,23679274,[],CHEMBL4888488,Hit confirmation in dose-response curve in NF5...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,10.0
41811,,23679275,[],CHEMBL4888488,Hit confirmation in dose-response curve in NF5...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,1.66
41812,,23679276,[],CHEMBL4888488,Hit confirmation in dose-response curve in NF5...,F,,,BAO_0000190,BAO_0000218,...,Plasmodium falciparum,Plasmodium falciparum,5833,,,IC50,uM,UO_0000065,,3.98
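
Note that `drop_duplicates` keeps the first row found for each SMILES string and discards any later measurements of the same compound. As an alternative that is not used in this notebook, repeated measurements could be aggregated instead, for example by taking the median per compound. The sketch below compares both on made-up values.

```python
import pandas as pd

# Made-up measurements: "CCO" appears twice with different values.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [10.0, 30.0, 5.0],
})

# Approach used in this notebook: keep the first occurrence of each SMILES.
first_only = toy.drop_duplicates(["canonical_smiles"])

# Alternative (assumption, not the notebook's method): median value per SMILES.
median_per_smiles = (
    toy.groupby("canonical_smiles", as_index=False)["standard_value"].median()
)

print(first_only)
print(median_per_smiles)
```

With the first approach "CCO" keeps the value 10.0; with the median it becomes 20.0. Which one is appropriate depends on how comparable the underlying assays are.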


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [114]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
1,CHEMBL77052,C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](OCCC3ON3C(=O)c...,4.25
4,CHEMBL307145,Oc1cccc(O)c1O,5660.0
5,CHEMBL16300,O=C(NO)c1ccccc1,17600.0
11,CHEMBL307153,C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](OCCCOc3coc(CO)...,9.55
13,CHEMBL339049,CC(C)(C)NCc1cc(Nc2ccnc3cc(Cl)ccc23)cc(-c2ccc(C...,18.0
...,...,...,...
41809,CHEMBL5024998,Fc1cnc(OC2Cc3ccccc3C2)nc1,1860.0
41810,CHEMBL4953418,CCn1cc(C#N)c(=O)c2cc(F)c(N3CCN(Cc4ccccc4)CC3)cc21,10000.0
41811,CHEMBL4964239,CCC(=O)N(Cc1ccc(-c2ccc(Cl)cc2)o1)[C@H]1CCS(=O)...,1660.0
41812,CHEMBL4939535,CC(C)N1C(=O)NC(=O)C2(Cc3ccccc3N3CCN(Cc4ccccc4)...,3980.0


Save the DataFrame to a CSV file.

In [115]:
df3.to_csv('Plasmodium_falciparum_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either active or inactive**
The **standard_value** column holds IC50 values, treated here as nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, and those above 1,000 nM are labeled **inactive**. The code below applies only this single threshold, so no separate **intermediate** class is assigned.

In [116]:
df4 = pd.read_csv('Plasmodium_falciparum_02_bioactivity_data_preprocessed.csv')

In [122]:
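# Assign a class label to each compound from its IC50 (standard_value)
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) > 1000:
    # Above 1000: labeled inactive
    bioactivity_threshold.append("inactive")
  else:
    # 1000 or below: labeled active
    bioactivity_threshold.append("active")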

In [123]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL77052,C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](OCCC3ON3C(=O)c...,4.25,active
1,CHEMBL307145,Oc1cccc(O)c1O,5660.00,inactive
2,CHEMBL16300,O=C(NO)c1ccccc1,17600.00,inactive
3,CHEMBL307153,C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](OCCCOc3coc(CO)...,9.55,active
4,CHEMBL339049,CC(C)(C)NCc1cc(Nc2ccnc3cc(Cl)ccc23)cc(-c2ccc(C...,18.00,active
...,...,...,...,...
20659,CHEMBL5024998,Fc1cnc(OC2Cc3ccccc3C2)nc1,1860.00,inactive
20660,CHEMBL4953418,CCn1cc(C#N)c(=O)c2cc(F)c(N3CCN(Cc4ccccc4)CC3)cc21,10000.00,inactive
20661,CHEMBL4964239,CCC(=O)N(Cc1ccc(-c2ccc(Cl)cc2)o1)[C@H]1CCS(=O)...,1660.00,inactive
20662,CHEMBL4939535,CC(C)N1C(=O)NC(=O)C2(Cc3ccccc3N3CCN(Cc4ccccc4)...,3980.00,inactive
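
If a third, **intermediate** class is wanted for values between 1,000 and 10,000 nM, the labeling can be written with `numpy.select`. This is a sketch on made-up values, not the code that produced the table above. It assumes that the boundary values 1,000 and 10,000 fall into the intermediate band.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values in nM, including both boundary values.
toy = pd.DataFrame({"standard_value": [4.25, 1000.0, 5660.0, 10000.0, 17600.0]})

conditions = [
    toy["standard_value"] < 1000,    # below 1,000 nM
    toy["standard_value"] > 10000,   # above 10,000 nM
]
# Anything matching neither condition falls back to "intermediate".
toy["class"] = np.select(conditions, ["active", "inactive"], default="intermediate")
print(toy["class"].tolist())
```

Under these thresholds, a compound such as CHEMBL307145 (5,660 nM) would be labeled intermediate rather than inactive.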


Save the DataFrame to a CSV file.

In [124]:
df5.to_csv('Plasmodium_falciparum_03_bioactivity_data_curated.csv', index=False)

In [125]:
! zip Plasmodium_falciparum.zip *.csv

  adding: Plasmodium_falciparum_01_bioactivity_data_raw.csv (deflated 93%)
  adding: Plasmodium_falciparum_02_bioactivity_data_preprocessed.csv (deflated 81%)
  adding: Plasmodium_falciparum_03_bioactivity_data_curated.csv (deflated 82%)
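
The `! zip` shell command is available in Colab. Outside such an environment, the same bundling can be sketched with Python's standard `zipfile` module. The helper name `zip_csv_files` and the demo file names are hypothetical.

```python
import glob
import os
import tempfile
import zipfile

def zip_csv_files(directory, archive_name):
    """Bundle every .csv file in `directory` into `archive_name`; return the names added."""
    archive_path = os.path.join(directory, archive_name)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
            # Store files by base name so the archive has no directory prefix.
            zf.write(path, arcname=os.path.basename(path))
            added.append(os.path.basename(path))
    return added

# Demo in a temporary directory with two small placeholder CSV files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("x\n1\n")
    demo_added = zip_csv_files(tmp, "demo.zip")
print(demo_added)
```

Calling `zip_csv_files(".", "Plasmodium_falciparum.zip")` in the notebook's working directory would collect the three CSV files written above.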


In [126]:
! ls -l

total 28224
drwx------ 5 root root     4096 Sep 29 13:23 drive
-rw-r--r-- 1 root root 23290672 Sep 29 13:29 Plasmodium_falciparum_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  1629272 Sep 29 13:34 Plasmodium_falciparum_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  1793201 Sep 29 13:39 Plasmodium_falciparum_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root  2171858 Sep 29 13:39 Plasmodium_falciparum.zip
drwxr-xr-x 1 root root     4096 Sep 26 13:45 sample_data
