# Python for Data Science Practice Session 4: Biochemistry

## Processing bioactivity data for ML

This notebook focuses on the Epidermal Growth Factor Receptor (**EGFR**).

EGFR is the target of the anticancer targeted therapy drug Tagrisso (Osimertinib). It was the seventh best-selling drug of 2020 with $3.19 billion in sales, and is used to treat non-small cell lung cancers where there is a specific mutation in the **EGFR** protein.

Sources:

https://www.fiercepharma.com/special-report/tagrisso-top-10-drugs-by-sales-increase-2020

https://en.wikipedia.org/wiki/Osimertinib

We will first import pandas and the ChEMBL web service package.

In [9]:
# Import pandas and the ChEMBL web resource client
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [11]:
# Search ChEMBL for targets matching 'egfr' (a growth factor receptor)
# The search returns every matching target; we collect the results in a pandas DataFrame
target = new_client.target
target_query = target.search('egfr')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q01279', 'xref_name': None, 'xre...",Mus musculus,Epidermal growth factor receptor erbB1,16.0,False,CHEMBL3608,"[{'accession': 'Q01279', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Homo sapiens,EGFR/PPP1CA,16.0,False,CHEMBL4523747,"[{'accession': 'P00533', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
2,[],Homo sapiens,VHL/EGFR,16.0,False,CHEMBL4523998,"[{'accession': 'P00533', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,"[{'xref_id': 'P00533', 'xref_name': None, 'xre...",Homo sapiens,Epidermal growth factor receptor erbB1,12.0,False,CHEMBL203,"[{'accession': 'P00533', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,Protein cereblon/Epidermal growth factor receptor,11.0,False,CHEMBL4523680,"[{'accession': 'P00533', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
5,[],Homo sapiens,MER intracellular domain/EGFR extracellular do...,9.0,False,CHEMBL3137284,"[{'accession': 'P00533', 'component_descriptio...",CHIMERIC PROTEIN,9606
6,[],Homo sapiens,Epidermal growth factor receptor and ErbB2 (HE...,7.0,False,CHEMBL2111431,"[{'accession': 'P04626', 'component_descriptio...",PROTEIN FAMILY,9606
7,[],Homo sapiens,Epidermal growth factor receptor,4.0,False,CHEMBL2363049,"[{'accession': 'P04626', 'component_descriptio...",PROTEIN FAMILY,9606


In [73]:
# Here we use CHEMBL3608 (the Mus musculus entry), which sits at index 0
select_target = targets.target_chembl_id[0]
select_target

'CHEMBL3608'

https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL3608/
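
Selecting a target by row position works only while the search ranking stays the same. As a minimal sketch, the target could instead be picked by its properties. The frame `targets_demo` below is a hypothetical three-row miniature of the search results above, not the real query output.

```python
import pandas as pd

# Hypothetical miniature of the search results (only the columns needed here).
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": [
        "Epidermal growth factor receptor erbB1",
        "EGFR/PPP1CA",
        "Epidermal growth factor receptor erbB1",
    ],
    "target_chembl_id": ["CHEMBL3608", "CHEMBL4523747", "CHEMBL203"],
    "target_type": ["SINGLE PROTEIN", "PROTEIN-PROTEIN INTERACTION", "SINGLE PROTEIN"],
})

# Select by content rather than by row position.
mask = (
    (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["organism"] == "Mus musculus")
)
selected_id = targets_demo.loc[mask, "target_chembl_id"].iloc[0]
print(selected_id)
```

The same mask applied to the real `targets` frame should return the same ID as `targets.target_chembl_id[0]`, provided the search results match those shown above.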

In [74]:
# Retrieve the IC50 bioactivity records for the selected target
activity = new_client.activity
res = activity.filter(target_chembl_id = select_target).filter(standard_type="IC50")

In [75]:
df = pd.DataFrame.from_dict(res)

In [76]:
df.tail(5)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
66,Not Determined,18923658,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,,,,
67,,18923668,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,1100.0
68,,18923669,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,1300.0
69,,18923670,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,700.0
70,Not Determined,18923671,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,,,,


In [146]:
#Useful reference of all the column names
for col in df.columns:
    print(col)

activity_comment
activity_id
activity_properties
assay_chembl_id
assay_description
assay_type
assay_variant_accession
assay_variant_mutation
bao_endpoint
bao_format
bao_label
canonical_smiles
data_validity_comment
data_validity_description
document_chembl_id
document_journal
document_year
ligand_efficiency
molecule_chembl_id
molecule_pref_name
parent_molecule_chembl_id
pchembl_value
potential_duplicate
qudt_units
record_id
relation
src_id
standard_flag
standard_relation
standard_text_value
standard_type
standard_units
standard_upper_value
standard_value
target_chembl_id
target_organism
target_pref_name
target_tax_id
text_value
toid
type
units
uo_units
upper_value
value


In [94]:
missing_val = df.loc[df['activity_comment'] == 'Not Determined']
missing_val_clean = missing_val[['activity_comment', 'molecule_chembl_id']]
missing_val_clean

Unnamed: 0,activity_comment,molecule_chembl_id
47,Not Determined,CHEMBL184685
66,Not Determined,CHEMBL4591327
70,Not Determined,CHEMBL4444231


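Looking at the last few columns in the bottom 5 rows of `df`, we can see that rows **66** and **70** have **no IC50** values. In place of a value, **'None'** is displayed, and `activity_comment` reads 'Not Determined'. Filtering on that comment shows that row **47** is affected as well.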
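
Missing values can also be located directly, without relying on the comment text. The sketch below uses a hypothetical three-row frame, `activity_demo`, built from molecule IDs that appear in this notebook; it assumes that missing entries arrive as `None`.

```python
import pandas as pd

# Hypothetical miniature of the activity table; None marks a missing IC50.
activity_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL4591327", "CHEMBL4562138", "CHEMBL4444231"],
    "activity_comment": ["Not Determined", None, "Not Determined"],
    "value": [None, "1100.0", None],
})

# isna() flags None/NaN entries, whatever the comment column says.
missing_mask = activity_demo["value"].isna()
missing_ids = activity_demo.loc[missing_mask, "molecule_chembl_id"].tolist()
print(missing_ids)
```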

In [72]:
#Determining what units are used in our current dataset.
df.units.unique()

array(['uM', 'nM', 'ug ml-1', None], dtype=object)
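
Because the raw `value` column mixes units, values are only comparable once they share a unit. In this notebook the `standard_value` column appears to hold the values already converted to nM, so the helper below is only an illustration of the arithmetic. The function name `to_nanomolar` and the `mol_weight` parameter (in g/mol) are assumptions made for this sketch.

```python
def to_nanomolar(value, units, mol_weight=None):
    """Convert a concentration to nM; return None when it cannot be converted."""
    if value is None or units is None:
        return None
    if units == "nM":
        return float(value)
    if units == "uM":
        # 1 uM = 1000 nM
        return float(value) * 1_000
    if units == "ug ml-1":
        # Mass concentration needs the molecular weight (g/mol):
        # nM = (ug/mL) * 1e6 / MW
        if mol_weight is None:
            return None
        return float(value) * 1_000_000 / mol_weight
    raise ValueError(f"Unhandled unit: {units!r}")


print(to_nanomolar(100, "uM"))
print(to_nanomolar(1, "ug ml-1", mol_weight=500.0))
```

The molecular weight of 500 g/mol in the usage line is an arbitrary illustrative number, not a value taken from the dataset.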

## Missing Values

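There are many ways of dealing with missing values. In this case, 2 of the 3 compounds with missing IC50 values in `df` have IC50 values recorded under a separate target, `CHEMBL203`, which is the 4th result (index 3) in our initial target search. Note that `CHEMBL203` is the Homo sapiens entry, whereas `df` was built from the Mus musculus target `CHEMBL3608`.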

In [98]:
#Selecting the CHEMBL203 target - index 3
select_target_2 = targets.target_chembl_id[3]
select_target_2

'CHEMBL203'

CHEMBL4324121



In [134]:
res_2 = activity.filter(target_chembl_id = select_target_2).filter(molecule_chembl_id="CHEMBL4444231",assay_variant_accession='P00533')

In [135]:
df_2 = pd.DataFrame.from_dict(res_2)

In [136]:
df_2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18923698,[],CHEMBL4324121,Inhibition of GST-fusion tagged EGFR L858R/T79...,B,P00533,"L858R,T790M,C797S",BAO_0000190,BAO_0000219,...,Homo sapiens,Epidermal growth factor receptor erbB1,9606,,,IC50,nM,UO_0000065,,2700.0


In [138]:
# Repeat the lookup for the second compound, CHEMBL4591327
res_3 = activity.filter(target_chembl_id = select_target_2).filter(molecule_chembl_id="CHEMBL4591327", assay_variant_accession ='P00533')

In [139]:
df_3 = pd.DataFrame.from_dict(res_3)

In [140]:
df_3

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18923697,[],CHEMBL4324121,Inhibition of GST-fusion tagged EGFR L858R/T79...,B,P00533,"L858R,T790M,C797S",BAO_0000190,BAO_0000219,...,Homo sapiens,Epidermal growth factor receptor erbB1,9606,,,IC50,nM,UO_0000065,,2100.0


The same cannot be done with `CHEMBL184685` as the activity data is not present in this dataset either.

We will next remove the compounds missing bioactivity data from `df`.

In [155]:
df_1 = df.drop(index=[47,66,70])

We can double-check that this has been done correctly...

In [156]:
missing_val_2 = df_1.loc[df_1['activity_comment'] == 'Not Determined']
missing_val_clean_2 = missing_val_2[['activity_comment', 'molecule_chembl_id']]
missing_val_clean_2

Unnamed: 0,activity_comment,molecule_chembl_id
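
Hard-coded index labels such as `[47, 66, 70]` stop being valid if the query returns rows in a different order. A minimal sketch of two condition-based alternatives follows, using a hypothetical three-row frame named `demo`.

```python
import pandas as pd

# Hypothetical miniature of the activity table.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL184685", "CHEMBL4562138", "CHEMBL4444231"],
    "activity_comment": ["Not Determined", None, "Not Determined"],
    "standard_value": [None, "1100.0", None],
})

# Option 1: keep every row whose comment is not 'Not Determined'.
cleaned_by_comment = demo[demo["activity_comment"] != "Not Determined"]

# Option 2: drop every row that has no standard_value.
cleaned_by_value = demo.dropna(subset=["standard_value"])

print(cleaned_by_comment["molecule_chembl_id"].tolist())
print(cleaned_by_value["molecule_chembl_id"].tolist())
```

The two options agree here, but they can differ on real data: a row may lack a value without carrying that exact comment.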


## Combining Datasets

We would like to combine `df_1`, `df_2`, and `df_3` to complete our dataset.

In [157]:
print(df_1.shape, df_2.shape, df_3.shape)

(68, 45) (1, 45) (1, 45)


All three dataframes have the same number of columns (45) and come from the same ChEMBL activity query, so they can be combined using only **concatenation**.

In [228]:
frames = [df_1,df_2, df_3]
df_full = pd.concat(frames)
df_full

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,110221,[],CHEMBL675511,Inhibition of epidermal growth factor receptor...,B,,,BAO_0000190,BAO_0000357,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,uM,UO_0000065,,100.0
1,,113118,[],CHEMBL675511,Inhibition of epidermal growth factor receptor...,B,,,BAO_0000190,BAO_0000357,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,uM,UO_0000065,,100.0
2,,119387,[],CHEMBL675511,Inhibition of epidermal growth factor receptor...,B,,,BAO_0000190,BAO_0000357,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,uM,UO_0000065,,100.0
3,,133319,[],CHEMBL675511,Inhibition of epidermal growth factor receptor...,B,,,BAO_0000190,BAO_0000357,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,uM,UO_0000065,,25.0
4,,193384,[],CHEMBL675513,Inhibition of epidermal growth factor receptor...,B,,,BAO_0000190,BAO_0000357,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,uM,UO_0000065,,0.07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,,18923668,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,1100.0
68,,18923669,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,1300.0
69,,18923670,[],CHEMBL4324124,Inhibition of wild type EGFR in mouse BAF3 cel...,B,,,BAO_0000190,BAO_0000219,...,Mus musculus,Epidermal growth factor receptor erbB1,10090,,,IC50,nM,UO_0000065,,700.0
0,,18923698,[],CHEMBL4324121,Inhibition of GST-fusion tagged EGFR L858R/T79...,B,P00533,"L858R,T790M,C797S",BAO_0000190,BAO_0000219,...,Homo sapiens,Epidermal growth factor receptor erbB1,9606,,,IC50,nM,UO_0000065,,2700.0
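
Concatenation keeps each frame's original index, which is why the label `0` appears twice in the output above. As a sketch, the column match can be checked explicitly and the index rebuilt in the same step with `ignore_index=True`. The frames `part_a` and `part_b` are hypothetical two-column stand-ins for the real 45-column frames.

```python
import pandas as pd

# Hypothetical stand-ins for the frames being combined.
part_a = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL4562138", "CHEMBL4519157"],
    "value": ["1100.0", "1300.0"],
})
part_b = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL4444231"],
    "value": ["2700.0"],
})

# Check that the column names match, not just the column count.
same_columns = list(part_a.columns) == list(part_b.columns)

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(same_columns, combined.shape)
```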


In [213]:

selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df_complete = df_full[selection]
df_complete = df_complete.reset_index(drop=True)
df_complete

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL292323,COc1cccc2c(C(=O)Nc3ccccc3)c(SSc3c(C(=O)Nc4cccc...,100000.0
1,CHEMBL304414,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3ccccc3n2C)c(C(=O)N...,100000.0
2,CHEMBL62176,CN1C(=S)C(C(=O)Nc2ccccc2)c2ccccc21,100000.0
3,CHEMBL62701,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3cccnc3n2C)c(C(=O)N...,25000.0
4,CHEMBL137617,C/N=N/Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1,70.0
...,...,...,...
65,CHEMBL4562138,Cc1cc2cc(n1)-c1cnn(C)c1OCCC[C@@H](C)CN1/C(=N/C...,1100.0
66,CHEMBL4519157,O=C1/N=C2\Nc3ccccc3N2CCCCCOc2ccccc2-c2cc1ccn2,1300.0
67,CHEMBL4532034,COc1ccncc1-c1cc(C(=O)/N=c2\[nH]c3ccccc3n2CC(C)...,700.0
68,CHEMBL4444231,CC(C)(O)Cn1/c(=N/C(=O)c2cccc(-c3cccnc3)c2)[nH]...,2700.0


In [220]:
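# Classify each compound by its IC50 (standard_value, here in nM)
bioactivity_class = []
for i in df_complete.standard_value:
    if float(i) >= 10000:
        bioactivity_class.append("inactive")
    elif 100 <= float(i) <= 1000:
        # Both boundaries (100 and 1000) count as "active"
        bioactivity_class.append("active")
    elif float(i) < 100:
        bioactivity_class.append("highly active")
    else:
        # Remaining values lie strictly between 1000 and 10000
        bioactivity_class.append("intermediate")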

In [227]:
df_complete_activity = pd.concat([df_complete, pd.Series(bioactivity_class, name='bioactivity_class')], axis=1)
df_complete_activity

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL292323,COc1cccc2c(C(=O)Nc3ccccc3)c(SSc3c(C(=O)Nc4cccc...,100000.0,inactive
1,CHEMBL304414,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3ccccc3n2C)c(C(=O)N...,100000.0,inactive
2,CHEMBL62176,CN1C(=S)C(C(=O)Nc2ccccc2)c2ccccc21,100000.0,inactive
3,CHEMBL62701,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3cccnc3n2C)c(C(=O)N...,25000.0,inactive
4,CHEMBL137617,C/N=N/Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1,70.0,highly active
...,...,...,...,...
65,CHEMBL4562138,Cc1cc2cc(n1)-c1cnn(C)c1OCCC[C@@H](C)CN1/C(=N/C...,1100.0,intermediate
66,CHEMBL4519157,O=C1/N=C2\Nc3ccccc3N2CCCCCOc2ccccc2-c2cc1ccn2,1300.0,intermediate
67,CHEMBL4532034,COc1ccncc1-c1cc(C(=O)/N=c2\[nH]c3ccccc3n2CC(C)...,700.0,active
68,CHEMBL4444231,CC(C)(O)Cn1/c(=N/C(=O)c2cccc(-c3cccnc3)c2)[nH]...,2700.0,intermediate
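
The classification can also be written as a small function and applied with `Series.map`, which avoids building a separate list and gives the new column a name directly. This is a sketch that reuses the thresholds from the loop above; the function name `classify_ic50` and the frame `demo` are hypothetical.

```python
import pandas as pd


def classify_ic50(value_nm):
    """Return the activity class for an IC50 given in nM."""
    value_nm = float(value_nm)
    if value_nm >= 10000:
        return "inactive"
    if value_nm > 1000:
        return "intermediate"
    if value_nm >= 100:
        return "active"
    return "highly active"


# Hypothetical miniature of df_complete, with values taken from the table above.
demo = pd.DataFrame({"standard_value": ["100000.0", "70.0", "700.0", "1100.0"]})
demo["bioactivity_class"] = demo["standard_value"].map(classify_ic50)
print(demo)
```

One difference from the loop: a NaN value would fall through to "highly active" here, so rows with missing values should be removed first, as was done earlier in this notebook.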


## Reshaping Data

## Encoding Variables