# **Computational Drug Discovery [Part 1]: Download + Preprocess Bioactivity Data**

Based on tutorial by Chanin Nantasenamat, [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will download and preprocess ChEMBL bioactivity data, the first step toward building a machine learning model.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data spans 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].


## **Install the ChEMBL web service package**

The `chembl_webresource_client` package allows retrieval of bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client






## **Importing libraries**

In [31]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client
# new_client is the entry point for querying the ChEMBL database
# it can be queried via 'target', 'molecule', 'activity', and other resources
# more details at
#   https://deepwiki.com/chembl/chembl_webresource_client/2.1-new_client-interface
# or
#   https://hub.2i2c.mybinder.org/user/chembl-chembl_webresource_client-njuf4mhz/notebooks/demo_wrc.ipynb


## **Search for Target protein**

### **Target search for acetylcholinesterase**

In [32]:
# Target search 
target = new_client.target
target_query = target.search('acetylcholinesterase') 
targets = pd.DataFrame.from_dict(target_query) # query results --> dictionary format --> into df  
targets # display 

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
1,[],Homo sapiens,Acetylcholinesterase,16.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Torpedo californica,Acetylcholinesterase,16.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
3,[],Mus musculus,Acetylcholinesterase,16.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
4,[],Rattus norvegicus,Acetylcholinesterase,16.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
5,[],Electrophorus electricus,Acetylcholinesterase,16.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
6,[],Bos taurus,Acetylcholinesterase,16.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913
7,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
8,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
9,[],Nephotettix cincticeps,Ace-orthologous acetylcholinesterase,16.0,False,CHEMBL2366514,"[{'accession': 'Q9NJH6', 'component_descriptio...",SINGLE PROTEIN,94400


### **Select Desired Target Protein**

We will assign the second entry (index 1), *Homo sapiens Acetylcholinesterase [CHEMBL220]*, to the `selected_target` variable and retrieve its bioactivity data.

In [33]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL220'
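
Selecting by position works as long as the search results keep the same order. As a hedged alternative, the sketch below picks the target by its column values instead. The helper name `pick_target` and the small `targets_demo` frame are illustrative assumptions; the rows are copied from the search output above.

```python
import pandas as pd

# Small stand-in for the `targets` DataFrame, using rows from the output above.
targets_demo = pd.DataFrame({
    "organism": ["Drosophila melanogaster", "Homo sapiens", "Mus musculus"],
    "pref_name": ["Acetylcholinesterase"] * 3,
    "target_chembl_id": ["CHEMBL2242744", "CHEMBL220", "CHEMBL3198"],
    "target_type": ["SINGLE PROTEIN"] * 3,
})

def pick_target(targets, organism="Homo sapiens",
                pref_name="Acetylcholinesterase",
                target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single row matching all three criteria."""
    mask = (
        (targets["organism"] == organism)
        & (targets["pref_name"] == pref_name)
        & (targets["target_type"] == target_type)
    )
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly rather than silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected_target_demo = pick_target(targets_demo)
print(selected_target_demo)
```

Raising an error on zero or multiple matches is a design choice: it surfaces a changed search result immediately instead of letting a different organism's data flow into later steps.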

Here, we will retrieve only the bioactivity data for *selected_target* that is reported as IC50 values (the inhibitor concentration that reduces bioactivity by 50%). The `standard_value` column is expected to hold these values in nM (nanomolar) units.

- `standard_type`: the measure of activity, e.g. IC50, EC50, or percent activity
- `standard_value`: the potency of the drug compound; a lower number means a smaller dose is required for the same pharmacological effect
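
To make "lower number = more potent" concrete, potency is commonly expressed on a negative log scale, where a larger number means a more potent compound. The sketch below assumes the input is an IC50 in nM; the helper name `nm_to_pic50` is hypothetical and not part of the ChEMBL client.

```python
import math

def nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)

# A lower IC50 gives a higher pIC50.
for ic50 in (4.1, 1000.0, 10000.0):
    print(ic50, "nM ->", round(nm_to_pic50(ic50), 2))
```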

In [34]:
# new query based on activity (as opposed to 'target')
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

df = pd.DataFrame.from_dict(res)


In [35]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9410,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724873,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,46.0
9411,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724874,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,38.31
9412,"{'action_type': 'INHIBITOR', 'description': 'N...",,25733694,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,1.71
9413,,,25733695,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,10.0


In [36]:
df.standard_type.unique() # ours only has IC50 due to filtering, but could also be EC50 or % activity

array(['IC50'], dtype=object)
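
The same kind of check can be applied to units. The query above filters on `standard_type` only, so it may be worth confirming that `standard_units` is nM before relying on nM thresholds later. The sketch below runs on a toy frame; the molecule IDs are placeholders, and the presence of non-nM rows in the real data is an assumption to verify, not a given.

```python
import pandas as pd

# Toy stand-in for the activity DataFrame; the IDs are placeholders.
activities_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "ug.mL-1", None],
    "standard_value": ["750.0", "12.0", None],
})

# Inspect which units are present, including missing ones.
print(activities_demo["standard_units"].value_counts(dropna=False))

# Keep only rows reported in nM and make the values numeric.
nm_only = activities_demo[activities_demo["standard_units"] == "nM"].copy()
nm_only["standard_value"] = pd.to_numeric(nm_only["standard_value"])
print(nm_only)
```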

Next, we will save the resulting bioactivity data to a CSV file, **bioactivity_data.csv**.

In [None]:
%%bash
mkdir -p data

In [39]:
df.to_csv('./data/bioactivity_data.csv', index=False) # don't save index numbers into csv
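
Because the download is the slow step, it can help to reuse the saved file on later runs. The sketch below is one way to do that, under stated assumptions: `load_or_fetch` is a hypothetical helper, and `fetch` is any zero-argument function that returns a DataFrame. It also creates the output folder from Python, which avoids depending on a shell.

```python
from pathlib import Path

import pandas as pd

def load_or_fetch(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch(), save, and return the result."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path)
    frame = fetch()
    # Create the parent folder if needed; no error if it already exists.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(csv_path, index=False)
    return frame

# Hypothetical usage with the query from this notebook:
# df = load_or_fetch('./data/bioactivity_data.csv',
#                    lambda: pd.DataFrame.from_dict(res))
```

Note that a CSV round trip stores list and dictionary columns as plain strings, so a frame loaded from the file is not identical in type to the freshly queried one.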

Verify directory contents:

In [40]:
ls

 Volume in drive C is Local Disk
 Volume Serial Number is 9674-1826

 Directory of c:\Users\liv_u\Desktop\GitHub\DrugDiscovery\Acetylcholinesterase_tutorial

2025-09-03  14:11    <DIR>          .
2025-09-03  14:35    <DIR>          ..
2025-09-04  01:21           386,461 CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
2025-09-02  01:22           566,429 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
2025-09-02  01:27         3,556,317 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
2025-09-03  14:34         1,411,810 CDD_ML_Part_4_Acetylcholinesterase_ML_Models.ipynb
2025-09-02  01:21    <DIR>          data
2025-09-02  01:28    <DIR>          PaDEL
2025-09-02  01:39             1,152 README.md
2025-09-03  14:31            42,092 regression_model_scatter_plot.pdf
               6 File(s)      5,964,261 bytes
               4 Dir(s)  16,393,908,224 bytes free


Let's take a glimpse at the **bioactivity_data.csv** file that we've just created.

In [129]:
df=pd.read_csv('./data/bioactivity_data.csv')
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8


## **Data pre-processing**


### Remove Duplicate SMILES

Different papers can report the same molecule with slightly different SMILES strings, which makes duplicates hard to catch by comparing SMILES directly.
We can instead generate an InChIKey for each molecule. The key stays constant across varying SMILES entries for the same molecule, so it lets us catch duplicated molecules.
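
As a small illustration of this idea, the sketch below writes ethanol as three different SMILES strings and shows that they map to one InChIKey. It assumes an RDKit build with InChI support; the helper name `smiles_to_inchikey` is hypothetical.

```python
from rdkit import Chem

def smiles_to_inchikey(smiles):
    """Return the InChIKey for a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol) if mol is not None else None

# Three ways of writing ethanol.
key_a = smiles_to_inchikey("CCO")
key_b = smiles_to_inchikey("OCC")
key_c = smiles_to_inchikey("C(C)O")

print(key_a)
print(key_a == key_b == key_c)
```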

In [132]:
# remove rows where SMILES == NA
df=df[df['canonical_smiles'].notna()]
df.shape

(9379, 46)

In [133]:
# duplicate molecules can have slightly different SMILES --> use the InChIKey
from rdkit import Chem

def id_duplicates(df):
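    def calculate_inchi_key(smiles):
        try:
            mol = Chem.MolFromSmiles(smiles) # returns None if the SMILES cannot be parsed
            if mol is not None:
                return Chem.MolToInchiKey(mol) # create its InChIKey
        except Exception: # avoid a bare except, which would also swallow KeyboardInterrupt
            pass
        return None # unparsable SMILES or failed key generation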

    df_analysis = df.copy()
    # calculate inchi keys 
    df_analysis['inchi_key'] = df_analysis['canonical_smiles'].apply(calculate_inchi_key)
    # group duplicates
    df_analysis['duplicate_group'] = df_analysis.groupby('inchi_key').ngroup()
    # count occurrences/group 
    duplicate_count = df_analysis['inchi_key'].value_counts()
    df_analysis['occurrence_count'] = df_analysis['inchi_key'].map(duplicate_count)

    return df_analysis



def process_duplicates(df, inchi_key_col='inchi_key', pchem_col='pchembl_value'):
    df1 = df.copy()

    # create new stats df:
    # per inchi_key group, calculate pchem stats
    group_stats = df1.groupby(inchi_key_col).agg({
        pchem_col: ['count', 'mean', 'std']
    }).round(4) # round to 4 decimals 
    # rename cols
    group_stats.columns = ['count', 'mean', 'std']
    group_stats = group_stats.reset_index()

    processed_rows = []
    duped_compounds = []
    # process each group
    for inchi_key in group_stats[inchi_key_col]:
        # df of all group members
        group_data = df1[df1[inchi_key_col]==inchi_key].copy()
        # collect stats row of group (use inchi_key_col so a custom column name also works)
        group_stats_row = group_stats[group_stats[inchi_key_col]==inchi_key].iloc[0]

        if len(group_data) > 1: #multiple entries
            # add stats to df of all group members 
            group_data['group_mean'] = group_stats_row['mean']
            group_data['group_std'] = group_stats_row['std']
            duped_compounds.append(group_data)

            if group_stats_row['std'] == 0: # identical pchem values across entries
                # keep only first entry
                processed_row = group_data.iloc[[0]].copy() #first entry
                processed_row['group_type'] = 'zero_sd'
            else: # values differ (sd != 0), or sd is NaN because fewer than two entries have a pchem value
                # apply mean pchem of the group to the first entry 
                processed_row = group_data.iloc[[0]].copy() 
                processed_row[pchem_col] = group_stats_row['mean']
                processed_row['group_type'] = 'nonzero_sd'

            processed_rows.append(processed_row) # list of single-row DataFrames

        else: # unique group
            group_data['group_type'] = 'single_entry'
            processed_rows.append(group_data)
            
    # combine all processed rows
    final_df = pd.concat(processed_rows)
    duped_compounds_df = pd.concat(duped_compounds) if duped_compounds else pd.DataFrame()

    # sort final_df by InChIKey
    final_df = final_df.sort_values(by=[inchi_key_col])

    return final_df, duped_compounds_df
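
The core of the loop above (keep the first row per key and use the group mean as its value) can also be written without an explicit loop. The sketch below is an alternative shown on toy data, not the notebook's method: `collapse_duplicates` is a hypothetical name, it does not record `group_type`, and it always writes the group mean, which equals the first value whenever all values in a group are identical.

```python
import pandas as pd

def collapse_duplicates(df, key_col="inchi_key", value_col="pchembl_value"):
    """Keep the first row per key and overwrite value_col with the group mean."""
    # First occurrence of each key, indexed by the key for alignment.
    first_rows = df.drop_duplicates(subset=key_col, keep="first").set_index(key_col)
    # Mean value per key; missing values are ignored by mean().
    group_means = df.groupby(key_col)[value_col].mean()
    # Assignment aligns on the shared key index.
    first_rows[value_col] = group_means
    return first_rows.reset_index()

toy = pd.DataFrame({
    "inchi_key": ["AAA", "BBB", "AAA", "CCC", "BBB"],
    "pchembl_value": [6.0, 7.5, 8.0, 5.0, 7.5],
})
collapsed = collapse_duplicates(toy)
print(collapsed)
```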


In [134]:
df_analysed = id_duplicates(df)

In [135]:
df_analysed

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,text_value,toid,type,units,uo_units,upper_value,value,inchi_key,duplicate_group,occurrence_count
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,0.75,GUZKBNUSIOHJIR-UHFFFAOYSA-N,1838,1
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,0.10,XWIIHZHUZCWNLW-UHFFFAOYSA-N,7127,1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,50.00,LOFGEQLLLVJCDY-UHFFFAOYSA-N,3347,1
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,0.30,GTBUYLWBDILEPY-UHFFFAOYSA-N,1814,1
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,0.80,FBOHEFQKBNWFPO-UHFFFAOYSA-N,1311,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9410,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724873,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,46.00,QXPFWUCSRCSJRI-UHFFFAOYSA-N,5102,1
9411,"{'action_type': 'INHIBITOR', 'description': 'N...",,25724874,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5391657,Inhibition of Acetylcholinesterase (unknown or...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,38.31,MTCFGRXMJLQNBG-REOHCLBHSA-N,3742,1
9412,"{'action_type': 'INHIBITOR', 'description': 'N...",,25733694,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,1.71,OYUAMDGBOAEJOY-UHFFFAOYSA-N,4451,1
9413,,,25733695,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5393547,Inhibition of recombinant human AChE expressed...,B,,,BAO_0000190,...,,,IC50,uM,UO_0000065,,10.00,XIXBDRAYBSRWGU-UHFFFAOYSA-N,6955,1


In [136]:
final_df, duped_compounds_df = process_duplicates(df_analysed)
final_df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,units,uo_units,upper_value,value,inchi_key,duplicate_group,occurrence_count,group_type,group_mean,group_std
3074,,,6387742,[],CHEMBL1840095,Inhibition of human recombinant AChE using ace...,B,,,BAO_0000190,...,uM,UO_0000065,,2.700,AABHSZUTAUSICN-HZJYTTRNSA-N,0,1,single_entry,,
8532,"{'action_type': 'INHIBITOR', 'description': 'N...",,24703075,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5107917,Inhibition of human acetylcholinesterase using...,B,,,BAO_0000190,...,umol/L,UO_0000065,,0.988,AACVKVMUDUPMQI-UHFFFAOYSA-N,1,1,single_entry,,
1984,,,2529957,[],CHEMBL1003711,Inhibition of human serum AChE by Ellman's method,B,,,BAO_0000190,...,nM,UO_0000065,,45.000,AAEYHJGXSUBKQJ-UHFFFAOYSA-N,2,1,single_entry,,
8733,"{'action_type': 'INHIBITOR', 'description': 'N...",,24890340,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5158290,Inhibition of C-terminal 6His-tagged human rec...,B,,,BAO_0000190,...,uM,UO_0000065,,0.245,AAIVPEBVZWQHCI-UHFFFAOYSA-N,3,1,single_entry,,
4996,,,15067962,[],CHEMBL3390418,Inhibition of AChE (unknown origin) using acet...,B,,,BAO_0000190,...,uM,UO_0000065,,4.680,AAJCCATVSUSRCY-MDWZMJQESA-N,4,1,single_entry,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1099,,,1255727,[],CHEMBL640423,In vitro inhibitory concentration against puri...,B,,,BAO_0000190,...,uM,UO_0000065,,72.400,ZZVNBOINQNKSOL-UHFFFAOYSA-N,7797,1,single_entry,,
335,,,415600,[],CHEMBL875241,Inhibitory concentration required against acet...,B,,,BAO_0000190,...,uM,UO_0000065,,30.000,ZZVUSNFNVDNFCQ-LICLKQGHSA-N,7798,1,single_entry,,
3384,,Not Active (inhibition < 50% @ 10 uM and thus ...,7679246,[],CHEMBL1909212,DRUGMATRIX: Acetylcholinesterase enzyme inhibi...,B,,,BAO_0000190,...,,,,,ZZVUWRFHKOJYTH-UHFFFAOYSA-N,7799,1,single_entry,,
7410,,,19449423,[],CHEMBL4433401,Inhibition of human erythrocyte acetylcholines...,B,,,BAO_0000190,...,uM,UO_0000065,,1.700,ZZWIYXVNVZCOSL-UHFFFAOYSA-N,7800,1,single_entry,,


In [137]:
final_df['group_type'].unique()

array(['single_entry', 'nonzero_sd', 'zero_sd'], dtype=object)

In [138]:
duped_compounds_df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,type,units,uo_units,upper_value,value,inchi_key,duplicate_group,occurrence_count,group_mean,group_std
5449,,,15736939,[],CHEMBL3626446,Inhibition of recombinant human AChE after 10 ...,B,,,BAO_0000190,...,IC50,nM,UO_0000065,,6.0500,ACIBUVSZJSBTER-UHFFFAOYSA-N,30,2,8.14,0.1131
5479,,,15736969,[],CHEMBL3626447,Inhibition of recombinant human AChE after 60 ...,B,,,BAO_0000190,...,IC50,nM,UO_0000065,,8.6900,ACIBUVSZJSBTER-UHFFFAOYSA-N,30,2,8.14,0.1131
1382,,,1509044,[],CHEMBL832710,Inhibitory concentration against human acetylc...,B,,,BAO_0000190,...,IC50,nM,UO_0000065,,4.1000,ACKJXXOVSOCBPX-UHFFFAOYSA-N,32,3,8.39,0.0000
1822,,,2077381,[],CHEMBL945890,Inhibition of human AchE,B,,,BAO_0000190,...,IC50,nM,UO_0000065,,4.1000,ACKJXXOVSOCBPX-UHFFFAOYSA-N,32,3,8.39,0.0000
4586,,,13834481,[],CHEMBL3090660,Inhibition of human brain AChE using acetylthi...,B,,,BAO_0000190,...,IC50,nM,UO_0000065,,4.1000,ACKJXXOVSOCBPX-UHFFFAOYSA-N,32,3,8.39,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
911,,,1132704,[],CHEMBL884109,Inhibition of Acetylcholinesterase activity fr...,B,,,BAO_0000190,...,IC50,M,UO_0000065,,0.0005,ZYQHJUNKLPQIOL-UHFFFAOYSA-M,7781,2,4.30,
500,,,759433,[],CHEMBL644101,Inhibitory activity against human erythrocyte ...,B,,,BAO_0000190,...,IC50,uM,UO_0000065,,8.1000,ZZKQNBNYCRCFCW-UHFFFAOYSA-N,7787,4,5.09,0.0000
501,,,759434,[],CHEMBL638445,Experimental pIC50 values for the inhibition o...,B,,,BAO_0000190,...,Log IC50,,UO_0000065,,-5.0900,ZZKQNBNYCRCFCW-UHFFFAOYSA-N,7787,4,5.09,0.0000
1243,,,1463821,[],CHEMBL829154,Inhibitory activity against human erythrocyte ...,B,,,BAO_0000190,...,Log IC50,,UO_0000065,,-5.0900,ZZKQNBNYCRCFCW-UHFFFAOYSA-N,7787,4,5.09,0.0000


### Feature Selection

In [151]:
# feature selection for columns we care about
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df2 = final_df[selection]
df2


Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
3074,CHEMBL1834807,CCCCC/C=C\C/C=C\CCCCCCCC(=O)OCCCc1ccc2oc(-c3cc...,2700.0
8532,CHEMBL5188500,CC(=O)N1N=C(c2ccc(-c3ccccc3)cc2)CC1c1ccc2c(c1)...,988.0
1984,CHEMBL491358,CCOC(=O)C1=C(C)Nc2nc3c(c(N)c2C1c1ccc(OC)c(OC)c...,45.0
8733,CHEMBL5199361,CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1,245.0
4996,CHEMBL2158994,CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1,4680.0
...,...,...,...
1099,CHEMBL539571,C#CCNC1CCc2ccc(OC(=O)N(CC)CCCC)cc21.Cl,72400.0
335,CHEMBL130738,Cc1[nH]c(C)c(/C=C2\CN(Cc3ccccc3)CCC2=O)c1C=O,30000.0
3384,CHEMBL657,CN(C)CCOC(c1ccccc1)c1ccccc1,
7410,CHEMBL4453051,c1ccc(CNC2CCN(Cc3ccccc3)CC2)cc1,1700.0


### Remove Rows with Missing values

In [152]:
# remove rows with any NAs
df2 = df2.dropna(how='any').reset_index(drop=True)
df2


Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL1834807,CCCCC/C=C\C/C=C\CCCCCCCC(=O)OCCCc1ccc2oc(-c3cc...,2700.0
1,CHEMBL5188500,CC(=O)N1N=C(c2ccc(-c3ccccc3)cc2)CC1c1ccc2c(c1)...,988.0
2,CHEMBL491358,CCOC(=O)C1=C(C)Nc2nc3c(c(N)c2C1c1ccc(OC)c(OC)c...,45.0
3,CHEMBL5199361,CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1,245.0
4,CHEMBL2158994,CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1,4680.0
...,...,...,...
6609,CHEMBL310918,O=C(CCC1CCN(Cc2cccc([N+](=O)[O-])c2)CC1)c1ccc2...,64.0
6610,CHEMBL539571,C#CCNC1CCc2ccc(OC(=O)N(CC)CCCC)cc21.Cl,72400.0
6611,CHEMBL130738,Cc1[nH]c(C)c(/C=C2\CN(Cc3ccccc3)CCC2=O)c1C=O,30000.0
6612,CHEMBL4453051,c1ccc(CNC2CCN(Cc3ccccc3)CC2)cc1,1700.0


### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data consists of IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values in between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [153]:
# list data content for bioactivity_class
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
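
The same thresholds can be applied to the whole column at once. The sketch below is one possible vectorized version, assuming NumPy is available; `label_bioactivity` and its parameter names are hypothetical. Unlike the loop, it labels a missing value as "intermediate", so it should only be used after rows with missing values have been removed.

```python
import numpy as np
import pandas as pd

def label_bioactivity(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label a Series of IC50 values (nM) as active, inactive, or intermediate."""
    values = pd.to_numeric(ic50_nm, errors="coerce")
    labels = np.select(
        [values <= active_max, values >= inactive_min],  # conditions, checked in order
        ["active", "inactive"],                          # label for each condition
        default="intermediate",                          # everything in between
    )
    return pd.Series(labels, index=values.index, name="bioactivity_class")

demo_values = pd.Series([45.0, 1000.0, 2700.0, 10000.0, 72400.0])
demo_labels = label_bioactivity(demo_values)
print(demo_labels.tolist())
```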

In [154]:
# concatenate new column into df
# bioactivity_class is a list, so it must be wrapped in a Series (or df) before concatenating
df2 = pd.concat([df2, pd.Series(bioactivity_class, name="bioactivity_class")], axis=1)
list(df2)

['molecule_chembl_id',
 'canonical_smiles',
 'standard_value',
 'bioactivity_class']

In [155]:
df2.to_csv('./data/bioactivity_preprocessed_data.csv', index=False)
df2

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL1834807,CCCCC/C=C\C/C=C\CCCCCCCC(=O)OCCCc1ccc2oc(-c3cc...,2700.0,intermediate
1,CHEMBL5188500,CC(=O)N1N=C(c2ccc(-c3ccccc3)cc2)CC1c1ccc2c(c1)...,988.0,active
2,CHEMBL491358,CCOC(=O)C1=C(C)Nc2nc3c(c(N)c2C1c1ccc(OC)c(OC)c...,45.0,active
3,CHEMBL5199361,CN1CCN(c2c3c(nc4ccc([N+](=O)[O-])cc24)CCCC3)CC1,245.0,active
4,CHEMBL2158994,CN(C)CCOc1ccc(C(=O)/C=C/c2ccccc2)cc1,4680.0,intermediate
...,...,...,...,...
6609,CHEMBL310918,O=C(CCC1CCN(Cc2cccc([N+](=O)[O-])c2)CC1)c1ccc2...,64.0,active
6610,CHEMBL539571,C#CCNC1CCc2ccc(OC(=O)N(CC)CCCC)cc21.Cl,72400.0,inactive
6611,CHEMBL130738,Cc1[nH]c(C)c(/C=C2\CN(Cc3ccccc3)CCC2=O)c1C=O,30000.0,inactive
6612,CHEMBL4453051,c1ccc(CNC2CCN(Cc3ccccc3)CC2)cc1,1700.0,intermediate


In [156]:
%%bash
ls 

CDD_ML_Part_1_Bioactivity_Preprocessing.ipynb
CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
CDD_ML_Part_4_Acetylcholinesterase_ML_Models.ipynb
PaDEL
README.md
data
regression_model_scatter_plot.pdf


---