# **predictBioactivity - Bioactivity Prediction given Drug Structures**

Yuzhu Duan

This Jupyter notebook shows the code and processes of obtaining the training data used for building the drug bioactivity prediction model. The bioactivity data towards CDKs (cyclin-dependent kinases) is obtained from ChEMBL. The data will be cleaned and preprocessed by using `Pandas` library, the compound fingerprint descriptors are calculated by using `PaDEL` tool.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that contains curated bioactivity data of more than 2 million compounds. It is compiled from more than 76,000 documents, 1.2 million assays and the data spans 13,000 targets and 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for CDKs**

In [41]:
target = new_client.target
target_query = target.search('CDK')
targets = pd.DataFrame.from_dict(target_query)
targets = targets[(targets['target_type'] == 'SINGLE PROTEIN') & (targets['organism'] == 'Homo sapiens')]
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
3,"[{'xref_id': 'P38936', 'xref_name': None, 'xre...",Homo sapiens,CDK-interacting protein 1,13.0,False,CHEMBL5021,"[{'accession': 'P38936', 'component_descriptio...",SINGLE PROTEIN,9606
4,"[{'xref_id': 'P50613', 'xref_name': None, 'xre...",Homo sapiens,Cyclin-dependent kinase 7,9.0,False,CHEMBL3055,"[{'accession': 'P50613', 'component_descriptio...",SINGLE PROTEIN,9606
6,[],Homo sapiens,Cyclin-dependent kinase 20,9.0,False,CHEMBL3559690,"[{'accession': 'Q8IZL9', 'component_descriptio...",SINGLE PROTEIN,9606


### **Select and retrieve bioactivity data for *Human Cellular tumor antigen p53* (first entry)**

In [42]:
selected_target = targets.target_chembl_id.values
selected_target

array(['CHEMBL5021', 'CHEMBL3055', 'CHEMBL3559690'], dtype=object)

Here, we will retrieve only bioactivity data for *Human Cellular tumor antigen p53* (CHEMBL4096) that are reported as pChEMBL values.

In [43]:
activity = new_client.activity
df = pd.DataFrame()
for t in selected_target:
    res = activity.filter(target_chembl_id=t).filter(standard_type="IC50")
    tmp = pd.DataFrame.from_dict(res)
    df = pd.concat([df, tmp])
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1418556,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
1,,1418557,[],CHEMBL832970,Inhibitory concentration against human p21 pro...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
2,,1418559,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
3,,1418560,[],CHEMBL832970,Inhibitory concentration against human p21 pro...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
4,,1418759,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,2.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,,23297777,[],CHEMBL4841258,Inhibition of CDK7 (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,IC50,nM,UO_0000065,,1000.0
308,,23297778,[],CHEMBL4841258,Inhibition of CDK7 (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,IC50,nM,UO_0000065,,1000.0
309,,23372694,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4884659,CDK7(CDK7CGS1) Takeda global kinase panel,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,pIC50,,UO_0000065,,6.0
310,,23373103,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4884950,CDK7(CDK7CGS1) Takeda global kinase panel,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,pIC50,,UO_0000065,,6.0


Finally we will save the resulting bioactivity data to a CSV file **bioactivity_data.csv**.

In [48]:
df.to_csv('data/cdk_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If any compounds has missing value for the **standard_value** and **canonical_smiles** column then drop it.

In [55]:
df.columns.values

array(['activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation',
       'bao_endpoint', 'bao_format', 'bao_label', 'canonical_smiles',
       'data_validity_comment', 'data_validity_description',
       'document_chembl_id', 'document_journal', 'document_year',
       'ligand_efficiency', 'molecule_chembl_id', 'molecule_pref_name',
       'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation',
       'src_id', 'standard_flag', 'standard_relation',
       'standard_text_value', 'standard_type', 'standard_units',
       'standard_upper_value', 'standard_value', 'target_chembl_id',
       'target_organism', 'target_pref_name', 'target_tax_id',
       'text_value', 'toid', 'type', 'units', 'uo_units', 'upper_value',
       'value'], dtype=object)

In [53]:
df['standard_value'].dropna(inplace=True)
df['canonical_smiles'].dropna(inplace=True)
len(df.canonical_smiles.unique())

311

In [54]:
df2 = df.drop_duplicates(['canonical_smiles'])
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1418556,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
2,,1418559,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
4,,1418759,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,2.3
6,,1418798,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,20.0
8,,1418801,[],CHEMBL833769,Inhibitory concentration against p21 deficient...,F,,,BAO_0000190,BAO_0000019,...,Homo sapiens,CDK-interacting protein 1,9606,,,IC50,uM,UO_0000065,,2.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,,23297777,[],CHEMBL4841258,Inhibition of CDK7 (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,IC50,nM,UO_0000065,,1000.0
308,,23297778,[],CHEMBL4841258,Inhibition of CDK7 (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,IC50,nM,UO_0000065,,1000.0
309,,23372694,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4884659,CDK7(CDK7CGS1) Takeda global kinase panel,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,pIC50,,UO_0000065,,6.0
310,,23373103,"[{'comments': None, 'relation': None, 'result_...",CHEMBL4884950,CDK7(CDK7CGS1) Takeda global kinase panel,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cyclin-dependent kinase 7,9606,,,pIC50,,UO_0000065,,6.0


## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [56]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0
2,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,20000.0
4,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,2300.0
6,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0
8,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,2400.0
...,...,...,...
307,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,1000.0
308,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,1000.0
309,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,1000.0
310,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,1000.0


Saves dataframe to CSV file

In [65]:
df3.to_csv('data/cdk_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data is in the IC50 unit. Compounds having values of less than 1000 nM will be considered to be **active** while those greater than 10,000 nM will be considered to be **inactive**. As for those values in between 1,000 and 10,000 nM will be referred to as **intermediate**. 

In [66]:
df4 = pd.read_csv('data/cdk_bioactivity_data_preprocessed.csv')

In [59]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [60]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,20000.0,inactive
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,2300.0,intermediate
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,2400.0,intermediate
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,1000.0,active
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,1000.0,active
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,1000.0,active
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,1000.0,active


In [61]:
df5['class'].value_counts()

active          129
intermediate     95
inactive         87
Name: class, dtype: int64

Saves dataframe to CSV file

In [67]:
df5.to_csv('data/cdk_bioactivity_data_curated.csv', index=False)

# Value Normalization and pIC50 Calculation #
---

In [19]:
df = pd.read_csv('data/cdk_bioactivity_data_curated.csv')
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,20000.0,inactive
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,2300.0,intermediate
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,2400.0,intermediate
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,1000.0,active
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,1000.0,active
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,1000.0,active
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,1000.0,active


In [20]:
# clean the canonical smiles by selecting the longest one
smiles = []

for i in df.canonical_smiles.tolist():
  cpd = str(i).split('.')
  cpd_longest = max(cpd, key = len)
  smiles.append(cpd_longest)

smiles = pd.Series(smiles, name = 'canonical_smiles')
df['canonical_smiles'] = smiles
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,20000.0,inactive
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,2300.0,intermediate
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,20000.0,inactive
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,2400.0,intermediate
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,1000.0,active
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,1000.0,active
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,1000.0,active
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,1000.0,active


### **Convert IC50 to pIC50**
To allow **IC50** data to be more uniformly distributed, we will convert **IC50** to the negative logarithmic scale which is essentially **-log10(IC50)**.

This custom function pIC50() will accept a DataFrame as input and will:
* Take the IC50 values from the ``standard_value`` column and converts it from nM to M by multiplying the value by 10$^{-9}$
* Take the molar value and apply -log10
* Delete the ``standard_value`` column and create a new ``pIC50`` column

In [21]:
def norm_value(input):
    norm = []

    for i in input['standard_value']:
        if i > 100000000:
            i = 100000000
        norm.append(i)

    input['standard_value_norm'] = norm
    x = input.drop(columns='standard_value')
        
    return x

def pIC50(input):
    pIC50 = []

    for i in input['standard_value_norm']:
        molar = i*(10**-9) # Converts nM to M
        pIC50.append(-np.log10(molar))

    input['pIC50'] = pIC50
    x = input.drop(columns='standard_value_norm')
        
    return x

**Note:** Values greater than 100,000,000 will be fixed at 100,000,000 otherwise the negative logarithmic value will become negative.

In [22]:
df.standard_value.describe()

count       294.000000
mean       6080.422041
std       10888.917447
min           0.130000
25%         313.250000
50%        1696.000000
75%       10000.000000
max      100000.000000
Name: standard_value, dtype: float64

We will first apply the `norm_value()` function so that the values in the standard_value column is normalized.

In [23]:
df_norm = norm_value(df)
df_norm

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,standard_value_norm
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,20000.0
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,inactive,20000.0
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,intermediate,2300.0
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,20000.0
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,intermediate,2400.0
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,active,1000.0
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,active,1000.0
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,active,1000.0
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,active,1000.0


In [24]:
df_norm.standard_value_norm.describe()

count       294.000000
mean       6080.422041
std       10888.917447
min           0.130000
25%         313.250000
50%        1696.000000
75%       10000.000000
max      100000.000000
Name: standard_value_norm, dtype: float64

In [25]:
df_final = pIC50(df_norm)
df_final

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,pIC50
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,inactive,4.698970
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,intermediate,5.638272
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,intermediate,5.619789
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,active,6.000000
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,active,6.000000
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,active,6.000000
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,active,6.000000


In [26]:
df_final.pIC50.describe()

count    294.000000
mean       5.887888
std        0.991392
min        4.000000
25%        5.000000
50%        5.770575
75%        6.504119
max        9.886057
Name: pIC50, dtype: float64

In [27]:
df_final.dropna(inplace=True)
df_final

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,pIC50
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,inactive,4.698970
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,intermediate,5.638272
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,intermediate,5.619789
...,...,...,...,...
306,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,active,6.000000
307,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,active,6.000000
308,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,active,6.000000
309,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,active,6.000000


Let's write this to CSV file.

In [28]:
df_final.to_csv('data/cdk_bioactivity_data_3class_pIC50.csv', index=False)

# **Descriptor Calculation and Dataset Preparation**

---

In [29]:
df = pd.read_csv('data/cdk_bioactivity_data_3class_pIC50.csv')
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,pIC50
0,CHEMBL368648,COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
1,CHEMBL178663,Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3,inactive,4.698970
2,CHEMBL179730,COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC,intermediate,5.638272
3,CHEMBL179692,Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1,inactive,4.698970
4,CHEMBL179110,COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1,intermediate,5.619789
...,...,...,...,...
289,CHEMBL4876497,Cc1n[nH]c2ccc(-c3cc(N[C@@H](CO)c4ccccc4)cnc3Cl...,active,6.000000
290,CHEMBL4786559,COC[C@H](C)N[C@H]1CC[C@H](Nc2cc(-c3cccc(NCC4(C...,active,6.000000
291,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,active,6.000000
292,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,active,6.000000


In [30]:
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [31]:
! cat molecule.smi | head -5

COC(=O)c1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1	CHEMBL368648
Oc1nc(-c2ccc(-c3ccccc3)cc2)nc2sc3c(c12)CCCC3	CHEMBL178663
COc1cc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc(OC)c1OC	CHEMBL179730
Cc1ccc(-c2nc(O)c3c4c(sc3n2)CCCC4)cc1	CHEMBL179692
COc1ccc(-c2nc(O)c3c4c(sc3n2)CCC4)cc1	CHEMBL179110


In [32]:
! cat molecule.smi | wc -l

     294


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [33]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [34]:
! bash padel.sh

Processing CHEMBL368648 in molecule.smi (1/294). 
Processing CHEMBL178663 in molecule.smi (2/294). 
Processing CHEMBL179730 in molecule.smi (3/294). 
Processing CHEMBL179692 in molecule.smi (4/294). 
Processing CHEMBL179110 in molecule.smi (5/294). 
Processing CHEMBL361956 in molecule.smi (6/294). 
Processing CHEMBL179581 in molecule.smi (7/294). 
Processing CHEMBL178310 in molecule.smi (8/294). 
Processing CHEMBL179178 in molecule.smi (9/294). 
Processing CHEMBL368233 in molecule.smi (10/294). 
Processing CHEMBL360687 in molecule.smi (11/294). 
Processing CHEMBL360323 in molecule.smi (12/294). 
Processing CHEMBL425866 in molecule.smi (14/294). Average speed: 0.93 s/mol.
Processing CHEMBL178149 in molecule.smi (13/294). Average speed: 1.68 s/mol.
Processing CHEMBL180572 in molecule.smi (16/294). Average speed: 0.63 s/mol.
Processing CHEMBL359861 in molecule.smi (15/294). Average speed: 0.63 s/mol.
Processing CHEMBL179694 in molecule.smi (18/294). Average speed: 0.38 s/mol.
Processing C

Processing CHEMBL3612494 in molecule.smi (112/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612492 in molecule.smi (111/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612496 in molecule.smi (114/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612495 in molecule.smi (113/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612498 in molecule.smi (116/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612497 in molecule.smi (115/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612503 in molecule.smi (117/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612504 in molecule.smi (118/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612505 in molecule.smi (119/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612506 in molecule.smi (120/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612507 in molecule.smi (121/294). Average speed: 0.05 s/mol.
Processing CHEMBL3612513 in molecule.smi (122/294). Average speed: 0.05 s/mol.
Processing CHEMBL3623375 in molecule.smi (124/294). 

Processing CHEMBL4763202 in molecule.smi (216/294). Average speed: 0.05 s/mol.
Processing CHEMBL4753350 in molecule.smi (217/294). Average speed: 0.05 s/mol.
Processing CHEMBL4794338 in molecule.smi (218/294). Average speed: 0.05 s/mol.
Processing CHEMBL4764719 in molecule.smi (219/294). Average speed: 0.05 s/mol.
Processing CHEMBL4781334 in molecule.smi (220/294). Average speed: 0.05 s/mol.
Processing CHEMBL4756935 in molecule.smi (221/294). Average speed: 0.05 s/mol.
Processing CHEMBL4786324 in molecule.smi (222/294). Average speed: 0.05 s/mol.
Processing CHEMBL4789890 in molecule.smi (223/294). Average speed: 0.05 s/mol.
Processing CHEMBL4789917 in molecule.smi (224/294). Average speed: 0.05 s/mol.
Processing CHEMBL4744627 in molecule.smi (225/294). Average speed: 0.05 s/mol.
Processing CHEMBL4752238 in molecule.smi (226/294). Average speed: 0.05 s/mol.
Processing CHEMBL4643754 in molecule.smi (227/294). Average speed: 0.05 s/mol.
Processing CHEMBL4797288 in molecule.smi (228/294). 

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [35]:
df3_X = pd.read_csv('descriptors_output.csv')

In [36]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL360323,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL179110,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL360687,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL178310,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL179178,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,CHEMBL4876497,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
290,CHEMBL4097778,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
291,CHEMBL4088216,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
292,CHEMBL4786559,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
290,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
291,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
292,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Convert IC50 to pIC50**

In [38]:
df3_Y = df['pIC50']
df3_Y

0      4.698970
1      4.698970
2      5.638272
3      4.698970
4      5.619789
         ...   
289    6.000000
290    6.000000
291    6.000000
292    6.000000
293    6.000000
Name: pIC50, Length: 294, dtype: float64

## **Combining X and Y variable**

In [39]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.638272
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.619789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.000000
290,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.000000
291,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.000000
292,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.000000


In [40]:
dataset3.to_csv('data/cdk_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

---