# Computational Drug Discovery [CDD] [Part 1] - Download Bioactivity Data

# Target Disease - Malaria

"Malaria is a life-threatening disease caused by parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes. It is preventable and curable.
In 2019, there were an estimated 229 million cases of malaria worldwide.
The estimated number of malaria deaths stood at 409 000 in 2019.
Children aged under 5 years are the most vulnerable group affected by malaria; in 2019, they accounted for 67% (274 000) of all malaria deaths worldwide.
The WHO African Region carries a disproportionately high share of the global malaria burden. In 2019, the region was home to 94% of malaria cases and deaths."

(From a WHO article dated 1st April 2021)

## Install Libraries

In [1]:
pip install chembl_webresource_client

Collecting chembl_webresource_client
  Using cached chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Using cached requests_cache-0.7.4-py3-none-any.whl (38 kB)
Collecting easydict
  Using cached easydict-1.9.tar.gz (6.4 kB)
Collecting urllib3
  Using cached urllib3-1.26.6-py2.py3-none-any.whl (138 kB)
Collecting requests>=2.18.4
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.2-py3-none-any.whl (59 kB)
Collecting charset-normalizer~=2.0.0
  Using cached charset_normalizer-2.0.4-py3-none-any.whl (36 kB)
Collecting url-normalize<2.0,>=1.4
  Using cached url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting itsdangerous>=2.0.1
  Using cached itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Building wheels for collected p

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Import Libraries

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Searching for the Target Protein
### Target search for 'Plasmodium falciparum'

In [43]:
target = new_client.target
target_query = target.search("Plasmodium falciparum")
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Plasmodium falciparum,Plasmodium falciparum,28.0,False,CHEMBL364,[],ORGANISM,5833
1,[],Plasmodium falciparum 3D7,Plasmodium falciparum 3D7,24.0,False,CHEMBL2366922,[],ORGANISM,36329
2,[],Plasmodium falciparum D6,Plasmodium falciparum D6,24.0,False,CHEMBL2367107,[],ORGANISM,478860
3,[],Plasmodium falciparum NF54,Plasmodium falciparum NF54,24.0,False,CHEMBL2367131,[],ORGANISM,5843
4,[],Plasmodium falciparum FcB1/Columbia,Plasmodium falciparum (isolate FcB1 / Columbia),19.0,False,CHEMBL612608,[],ORGANISM,186763
...,...,...,...,...,...,...,...,...,...
70,[],Plasmodium falciparum,Apical membrane antigen 1,8.0,False,CHEMBL4295920,"[{'accession': 'Q94661', 'component_descriptio...",SINGLE PROTEIN,5833
71,[],Plasmodium falciparum,Heat-shock protein,8.0,False,CHEMBL4296306,"[{'accession': 'Q25869', 'component_descriptio...",SINGLE PROTEIN,5833
72,[],Plasmodium falciparum,Adenosine deaminase,8.0,False,CHEMBL4523370,"[{'accession': 'Q86GS5', 'component_descriptio...",SINGLE PROTEIN,5833
73,[],Plasmodium falciparum,Glutamine amidotransferase,8.0,False,CHEMBL4523484,"[{'accession': 'Q9U775', 'component_descriptio...",SINGLE PROTEIN,5833


In [45]:
#targets.head(3)

### Choosing the target protein/organism of interest

For this bioinformatics project, the P. falciparum hexose transporter (PfHT1) will be our target.

The blood stages of malaria parasites rely almost entirely on glycolysis for energy production. Because they have no energy stores, they depend on the constant uptake of glucose. Plasmodium falciparum is considered the most dangerous human malaria parasite, and its hexose transporter has been identified as its major glucose transporter.

In [48]:
# ChEMBL ID of the chosen target; assigned directly so the targets table is left unmodified
selected_target = 'CHEMBL4697'
selected_target

'CHEMBL4697'
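
Rather than typing the ID by hand, it can also be looked up from the search results by name. The following is a minimal sketch, not the notebook's original method: `targets_demo` is a small stand-in table built from values shown in this notebook, and the `_demo` names are hypothetical. With the real `targets` DataFrame the same `str.contains` filter should apply, assuming the target appears in the search results.

```python
import pandas as pd

# Illustrative stand-in for the `targets` table returned by the search above.
# Only two columns are kept; the values are copied from outputs in this notebook.
targets_demo = pd.DataFrame({
    "pref_name": ["Plasmodium falciparum", "Hexose transporter 1", "Adenosine deaminase"],
    "target_chembl_id": ["CHEMBL364", "CHEMBL4697", "CHEMBL4523370"],
})

# Case-insensitive substring match on the preferred name.
mask = targets_demo["pref_name"].str.contains("hexose transporter", case=False, na=False)
matches = targets_demo[mask]

# Read the ChEMBL ID of the first match without modifying the table.
selected_target_demo = matches["target_chembl_id"].iloc[0]
print(selected_target_demo)
```

Reading the ID from a filtered view keeps the `target_chembl_id` column intact, which matters if the table is reused later.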

Let's now retrieve the bioactivity data for the target Hexose transporter 1, keeping only the records reported as IC50 values (the output below shows them in nM).

In [54]:
activity = new_client.activity
result = activity.filter(target_chembl_id=selected_target).filter(standard_type='IC50')

In [55]:
df = pd.DataFrame.from_dict(result)

In [56]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,15331355,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
1,,15331356,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
2,,15331357,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
3,,15331358,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,11335.0
4,,15331359,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,,15365839,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
788,,15365840,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
789,,15365841,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
790,,15365842,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0


In [57]:
df.standard_type.unique()

array(['IC50'], dtype=object)
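
The query above filters on `standard_type` only, so the units are worth checking in the same way. Below is a minimal sketch on made-up records; `standard_units` and `standard_value` are ChEMBL activity field names, but the toy values and the `_demo` names are assumptions for illustration. Whether any non-nM records exist for this target is not shown in the notebook.

```python
import pandas as pd

# Toy activity records; the values are invented for illustration only.
activity_demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["12000.0", "11335.0", "3.5"],
})

# Count how many records use each unit.
print(activity_demo["standard_units"].value_counts())

# Keep only the records whose standardized unit is nM.
nm_only = activity_demo[activity_demo["standard_units"] == "nM"]
print(len(nm_only))
```

Mixing units would make the activity thresholds applied later meaningless, so a check like this is cheap insurance.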

### Let's save the bioactivity data to a CSV file

In [58]:
df.to_csv("bioactivity_data.csv",index=False)

In [59]:
ls

bioactivity_data.csv  CDD_Malaria_Bioactivity_Data.ipynb


## Handling records with missing data

### If a compound has missing values for the standard_value column then drop it

In [60]:
# The compound list for this ChEMBL target is long, so eyeballing the standard_value column is not practical
#df.standard_value

In [61]:
df2 = df[df.standard_value.notna()]

In [62]:
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,15331355,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
1,,15331356,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
2,,15331357,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
3,,15331358,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,11335.0
4,,15331359,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,,15365839,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
788,,15365840,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
789,,15365841,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
790,,15365842,[],CHEMBL3436039,ST_JUDE_LEISH: Cytotoxicity against transgenic...,F,,,BAO_0000190,BAO_0000019,...,Plasmodium falciparum,Hexose transporter 1,5833,,,IC50,nM,UO_0000065,,12000.0
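
An equivalent way to drop incomplete records, sketched here on a toy table, is `dropna` with a `subset`, followed by `reset_index` so that row labels run from 0 again. The molecule IDs below are hypothetical placeholders, and the conversion with `pd.to_numeric` assumes the values may arrive as strings.

```python
import pandas as pd

# Toy table with one missing standard_value; the IDs are hypothetical placeholders.
raw_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "standard_value": ["12000.0", None, "850.0"],
})

# Drop rows lacking a standard_value, then renumber the index from 0.
clean_demo = raw_demo.dropna(subset=["standard_value"]).reset_index(drop=True)

# Convert the remaining values to numbers for later comparisons.
clean_demo["standard_value"] = pd.to_numeric(clean_demo["standard_value"])
print(clean_demo)
```

Resetting the index is optional, but it avoids surprises when the table is later combined with objects that carry a default 0..n-1 index.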


## Data preprocessing of the bioactivity data

### Label the drugs as either Active, Inactive or Intermediate

Our bioactivity data consists of IC50 values in nM. Compounds with an IC50 of 1000 nM or less are considered active, while compounds with an IC50 of 10000 nM or more are considered inactive. Those with values strictly between 1000 and 10000 nM are considered intermediate.

In [63]:
bioactivity_class = []
for i in df2.standard_value:
    if float(i) <= 1000:
        bioactivity_class.append('active')
    elif float(i) >= 10000:
        bioactivity_class.append('inactive')
    else:
        bioactivity_class.append('intermediate')
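
The same labelling rule can be written without an explicit loop. This is a sketch using `numpy.select` on made-up IC50 values, with the thresholds taken from the cell above; the `ic50_nm` and `bioactivity_demo` names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values in nM, chosen to hit each class and both boundaries.
ic50_nm = pd.Series([500.0, 1000.0, 5000.0, 10000.0, 12000.0])

# Conditions are checked in order; anything matching neither is 'intermediate'.
conditions = [ic50_nm <= 1000, ic50_nm >= 10000]
labels = ["active", "inactive"]
bioactivity_demo = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=ic50_nm.index,  # keep the labels aligned with the values
    name="bioactivity_class",
)
print(bioactivity_demo.tolist())
# ['active', 'active', 'intermediate', 'inactive', 'inactive']
```

Because the result reuses the index of the input Series, it can be attached to the source table without alignment problems.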

In [64]:
#bioactivity_class

### Let's filter the above dataframe and select specific columns that are useful for further processing

In [65]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL2028051,O=C(NCC(c1ccsc1)N1CCCCCC1)c1ccc(C(F)(F)F)cc1,12000.0
1,CHEMBL1459149,CCN1CCCC1CNc1[nH]cnc2c3cc(Cl)ccc3nc1-2,12000.0
2,CHEMBL2028052,COc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)n...,12000.0
3,CHEMBL2028053,Cc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nc...,11335.0
4,CHEMBL2028054,Cc1cc(NC(=O)c2cccc(C(F)(F)F)c2)n(-c2nc(-c3ccc4...,12000.0
...,...,...,...
787,CHEMBL2028046,Cc1sc(NC(=O)c2ccccc2)c(C(c2cccs2)N2CCN(c3ccccc...,12000.0
788,CHEMBL2028047,Cc1sc(NC(=O)c2ccco2)c(C(c2cccnc2)N2CCC(Cc3cccc...,12000.0
789,CHEMBL2028048,CCOC(=O)c1c(C)n(-c2ccccc2)c2ccc(OC(=O)c3cc(OC)...,12000.0
790,CHEMBL2028049,COc1cc(C(=O)Nc2nc3c(cc4c5c(cccc53)CC4)s2)cc(OC...,12000.0


Let's now add a column for the bioactivity_class

In [66]:
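# Reuse df3's index so the labels line up row by row even if rows were dropped earlier
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)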
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL2028051,O=C(NCC(c1ccsc1)N1CCCCCC1)c1ccc(C(F)(F)F)cc1,12000.0,inactive
1,CHEMBL1459149,CCN1CCCC1CNc1[nH]cnc2c3cc(Cl)ccc3nc1-2,12000.0,inactive
2,CHEMBL2028052,COc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)n...,12000.0,inactive
3,CHEMBL2028053,Cc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nc...,11335.0,inactive
4,CHEMBL2028054,Cc1cc(NC(=O)c2cccc(C(F)(F)F)c2)n(-c2nc(-c3ccc4...,12000.0,inactive
...,...,...,...,...
787,CHEMBL2028046,Cc1sc(NC(=O)c2ccccc2)c(C(c2cccs2)N2CCN(c3ccccc...,12000.0,inactive
788,CHEMBL2028047,Cc1sc(NC(=O)c2ccco2)c(C(c2cccnc2)N2CCC(Cc3cccc...,12000.0,inactive
789,CHEMBL2028048,CCOC(=O)c1c(C)n(-c2ccccc2)c2ccc(OC(=O)c3cc(OC)...,12000.0,inactive
790,CHEMBL2028049,COc1cc(C(=O)Nc2nc3c(cc4c5c(cccc53)CC4)s2)cc(OC...,12000.0,inactive
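
One detail worth knowing here: `pd.concat(..., axis=1)` aligns on index labels, not on row position. In this dataset the index is contiguous, so the result is as expected, but if rows had been dropped the labels could shift. The sketch below uses a toy frame (all names and values are illustrative) to show the difference, with `DataFrame.assign` as a positional alternative.

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after rows are filtered out.
frame_demo = pd.DataFrame({"standard_value": [12000.0, 850.0]}, index=[0, 2])
labels_demo = ["inactive", "active"]

# A Series with a default 0..n-1 index is aligned by label, creating an extra row.
misaligned = pd.concat(
    [frame_demo, pd.Series(labels_demo, name="bioactivity_class")], axis=1
)
print(misaligned)

# Assigning a plain list attaches the labels by position instead.
aligned = frame_demo.assign(bioactivity_class=labels_demo)
print(aligned)
```

The misaligned result has three rows with missing values in both columns, while the aligned one keeps exactly the two original rows.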


In [69]:
# Save the preprocessed data to a CSV file
df4.to_csv('pfHT1_Preprocessed_biological_data.csv', index=False)

In [68]:
ls

bioactivity_data.csv                pfHT1_Preprocessed_biological_data
CDD_Malaria_Bioactivity_Data.ipynb
