<a href="https://colab.research.google.com/github/yr2387/E4511-2021-Rong/blob/main/dtc_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Predicting Solubility Using AMPL</h1>

The ATOM Modeling PipeLine (AMPL; https://github.com/ATOMconsortium/AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.

**Warning: This is an experimental notebook**

# Goal: Predict solubility using the ATOM Modeling Pipeline (AMPL) on a public dataset

In this notebook, we describe the following steps using AMPL:

1.   Read a public dataset containing chemical structures and some properties
2.   Curate the dataset
3.   Fit a simple model
4.   Predict solubility for withheld compounds


## Set up
We first import the AMPL modules for use in this notebook.

The relevant AMPL modules for this example are listed below:

|Module|Description|
|-|-|
|`atomsci.ddm.pipeline.model_pipeline`|The model pipeline module is used to fit models and load models for prediction.|
|`atomsci.ddm.pipeline.parameter_parser`|The parameter parser reads through pipeline options for the model pipeline.|
|`atomsci.ddm.utils.curate_data`|The curate data module is used for data loading and pre-processing.|
|`atomsci.ddm.utils.struct_utils`|The structure utilities module is used to process loaded structures.|
|`atomsci.ddm.pipeline.perf_plots`|Perf plots contains a variety of plotting functions.|

## Install AMPL

In [None]:
%tensorflow_version 1.x

# get the Anaconda file 
! wget -c https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
! chmod +x Anaconda3-2019.10-Linux-x86_64.sh
! bash ./Anaconda3-2019.10-Linux-x86_64.sh -b -f -p /usr/local

! time conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0

import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import deepchem as dc

# install mordred, bravado and molvs
! time conda install -c conda-forge -y mordred bravado molvs

# get the install_AMPL_GPU_test.sh script
!wget https://raw.githubusercontent.com/ravichas/AMPL-Tutorial/master/config/install_AMPL_GPU_test.sh

# run the script to install AMPL
! chmod u+x install_AMPL_GPU_test.sh
! ./install_AMPL_GPU_test.sh

TensorFlow 1.x selected.
--2021-04-02 14:25:48--  https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 530308481 (506M) [application/x-sh]
Saving to: ‘Anaconda3-2019.10-Linux-x86_64.sh’


2021-04-02 14:25:50 (195 MB/s) - ‘Anaconda3-2019.10-Linux-x86_64.sh’ saved [530308481/530308481]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | / - \ | / - done
Solving environment: | / - \ | / - \ | / - \ | / - \ | / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _ipyw_jlab_nb_ext_conf==0.1.0=py37_0
    - _libgcc_mutex==0.1=main
    - alabaster==0.7.12=py37_0
    - anaconda-client==1.7.2=py37



The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - failed

InvalidVersionSpec: Invalid version '4.19.112+': empty version component


real	0m8.303s
user	0m7.078s
sys	0m1.485s
--2021-04-02 14:32:38--  https://raw.githubusercontent.com/ravichas/AMPL-Tutorial/master/config/install_AMPL_GPU_test.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|

In [None]:
# Load AMPL in this notebook

site_packages_path = '/content/AMPL/lib/python3.7/site-packages'
if site_packages_path not in sys.path:
  sys.path.insert(1, site_packages_path)
sys.path

['/tensorflow-1.15.2/python3.7',
 '/content/AMPL/lib/python3.7/site-packages',
 '',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython',
 '/usr/local/lib/python3.7/site-packages/']

In [None]:
# There is a problem with the previously imported cffi, so delete it and 
# load it with AMPL instead
if 'cffi' in sys.modules:
  del sys.modules['cffi']

In [None]:
!pip install molvs

Collecting molvs
  Downloading https://files.pythonhosted.org/packages/08/dc/d948e83b97f2c420cb6c7e2143ae349560d3b5b061945f1b2a4eefb0231c/MolVS-0.1.1.tar.gz (61kB)
Building wheels for collected packages: molvs
  Building wheel for molvs (setup.py) ... done
  Created wheel for molvs: filename=MolVS-0.1.1-cp37-none-any.whl size=32376 sha256=9b957f7a651d2a85c94793af29cf806b120c941a382b1dda03438c003f190ff3
  Stored in directory: /root/.cache/pip/wheels/30/37/a8/8ac8147605c9de6b45ffd66d1cc19761d41467db12b34a0de8
Successfully built molvs
Installing collected packages: molvs
Successfully installed molvs-0.1.1


In [None]:
!pip install umap

Collecting umap
  Downloading https://files.pythonhosted.org/packages/4b/46/08ab68936625400fe690684428d4db4764f49b406782cc133df1d0299d06/umap-0.1.1.tar.gz
Building wheels for collected packages: umap
  Building wheel for umap (setup.py) ... done
  Created wheel for umap: filename=umap-0.1.1-cp37-none-any.whl size=3568 sha256=6a006cfc26691576bc045737ee62c0b69d3b71478fe00194979b88899068d5d4
  Stored in directory: /root/.cache/pip/wheels/7b/29/33/b4d917dc95f69c0a060e2ab012d95e15db9ed4cc0b94ccac26
Successfully built umap
Installing collected packages: umap
Successfully installed umap-0.1.1


In [None]:
# We temporarily disable warnings for demonstration.
# FutureWarnings and DeprecationWarnings are present from some of the AMPL 
# dependency modules.
import warnings
warnings.filterwarnings('ignore')

import json
import numpy as np
import pandas as pd
import os
import requests
import sys

#import atomsci.ddm.pipeline.model_pipeline as mp
import atomsci.ddm.pipeline.parameter_parser as parse
import atomsci.ddm.utils.curate_data as curate_data
import atomsci.ddm.utils.struct_utils as struct_utils
#from atomsci.ddm.pipeline import perf_plots as pp


In [None]:
!pip install bravado

Collecting bravado
  Downloading https://files.pythonhosted.org/packages/21/ed/03b0c36b5bcafbe2938ed222f9a164a6c0367ce99a9d2d502e462853571d/bravado-11.0.3-py2.py3-none-any.whl
Collecting monotonic (from bravado)
  Downloading https://files.pythonhosted.org/packages/ac/aa/063eca6a416f397bd99552c534c6d11d57f58f2e94c14780f3bbf818c4cf/monotonic-1.5-py2.py3-none-any.whl
Collecting bravado-core>=5.16.1 (from bravado)
  Downloading https://files.pythonhosted.org/packages/76/11/18e9d28a156c33f2d5f15a5e155dc7130250acb0a569255a2b6b307b596d/bravado_core-5.17.0-py2.py3-none-any.whl (67kB)
Collecting simplejson (from bravado)
  Downloading https://files.pythonhosted.org/packages/a8/04/377418ac1e530ce2a196b54c6552c018fdf1fe776718053efb1f216bffcd/simplejson-3.17.2-cp37-cp37m-manylinux2010_x86_64.whl (128kB)
Collecting jsonref (from bravado-core>=5.16.1->bravado)
  Downloadin

In [None]:
import atomsci.ddm.pipeline.chem_diversity as cd

## Data curation

We then download the dataset and apply a few simple curation steps.

First we set the working directory where files will be saved; then we download the dataset.

In [None]:
working_dir = '/content'

In [None]:
import io
url = 'https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/SLC6A2_DTC_SMILES.csv'
#url = 'https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/delaney-processed.csv'
download = requests.get(url).content

In [None]:
# Reading the downloaded content and turning it into a pandas dataframe
raw_df = pd.read_csv(io.StringIO(download.decode('utf-8')), sep=',', header=0 )
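The decode-and-parse step can be wrapped in a small function. The sketch below is only an illustration: `frame_from_csv_bytes` is a hypothetical helper (not part of AMPL), and the sample bytes stand in for `requests.get(url).content`. It assumes the first CSV column is a saved row index, as the `Unnamed: 0` column in this dataset suggests, and reads it as the index via `index_col=0`.

```python
import io

import pandas as pd


def frame_from_csv_bytes(content: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Hypothetical helper: parse downloaded CSV bytes into a dataframe.

    The first column is treated as the row index (index_col=0), so it does
    not show up as an 'Unnamed: 0' data column.
    """
    text = content.decode(encoding)
    return pd.read_csv(io.StringIO(text), sep=",", header=0, index_col=0)


# Toy bytes standing in for the downloaded file content
sample = b",Compound_ID,pDTC_Value\n0,CHEMBL11,74.0\n1,CHEMBL108,\n"
demo_df = frame_from_csv_bytes(sample)
print(demo_df.columns.tolist())
```

With a real download, calling `raise_for_status()` on the `requests` response before parsing would surface HTTP errors early instead of producing a confusing parse failure.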

In [None]:
# copy of processed delaney data from AMPL
# r = requests.get('https://raw.githubusercontent.com/ravichas/AMPL-Tutorial/master/datasets/delaney-processed_curated_external.csv', verify=True)

Next, we load the downloaded dataset, and process the compound structures:

In [None]:
# Drop the first column (the saved row index, 'Unnamed: 0')
orig_df = raw_df.drop(columns=raw_df.columns[0])
orig_df

Unnamed: 0,Compound_ID,Uniprot_ID,Compound_Name,Standard_inchi_key,Max_Phase,Target_Pref_Name,Gene_Names,Target_Class,Wild_type_or_mutant,Mutation_information,PubMed_ID,End_Point_Standard_Type,End_Point_Standard_Relation,End_Point_Standard_Value,End_Point_Standard_Units,Endpoint_Mode_of_Action,Assay_Format,Assay_Type,Assay_Sub_Type,Inhibitor_Type,Detection_Technology,Compound_concentration_value,Compound_concentration_value_units,Substrate_type,Substrate_Type_Standard_Relation,Substrate_Type_Standard_Value,Substrate_Type_Standard_Units,Assay_cell_line,Assay_Description,Activity_Comments,Title,Journal,Year,Volume,Issue,Authors,Annotation_Comments,Assay_ID,DTC_Tid,DTC_Activity_ID,DTC_Molregno,Record_ID,DTC_Document_ID,pDTC_Value,SMILES,base_rdkit_smiles
0,CHEMBL104700,P23975,,SCDKHPSUXHBJDJ-UHFFFAOYSA-N,0,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,22607684.0,INHIBITION,<,50.0,%,,,,,,,,,,,,,,Inhibition of NET at 10 uM,,7-Azabicyclo[2.2.1]heptane as a scaffold for t...,Bioorg. Med. Chem. Lett.,2012.0,22.0,12.0,"Banister SD, Rendina LM, Kassiou M",,818156.0,DTCT0023180,10478144,DTCC00272114,931177,48255,50.0,C1CC2CCC1N2CC3=CN=CC=C3,c1cncc(CN2C3CCC2CC3)c1
1,CHEMBL1079079,P23975,,WGIPGQAPFNVWIX-XXFZXMJFSA-N,0,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,19767206.0,INHIBITION,<,50.0,%,,,,,,,,,,,,,,Displacement of [3H]Nisoxetine from NET at 10 ...,,Synthesis and in vitro autoradiographic evalua...,Bioorg. Med. Chem. Lett.,2009.0,19.0,21.0,"Donohue SR, Varnäs K, Jia Z, Gulyás B, Pike VW...",,619820.0,DTCT0023180,3267827,DTCC00632772,857661,37697,50.0,COC1=CC=C(C=C1)C2=C(C(=NN2C3=CC=CC=C3[125I])C(...,COc1ccc(-c2c(C#N)c(C(=O)NN3CCCCC3)nn2-c2ccccc2...
2,CHEMBL108,P23975,CARBAMAZEPINE,FFGPTBGBLSHEPO-UHFFFAOYSA-N,4,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,IC50,,,,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,Not Active (inhibition < 50% @ 10 uM and thus ...,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7262068,DTCC00144764,249939,46191,,C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N,NC(=O)N1c2ccccc2C=Cc2ccccc21
3,CHEMBL108,P23975,CARBAMAZEPINE,FFGPTBGBLSHEPO-UHFFFAOYSA-N,4,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,KI,,,,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,Not Active (inhibition < 50% @ 10 uM and thus ...,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7262069,DTCC00144764,249939,46191,,C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N,NC(=O)N1c2ccccc2C=Cc2ccccc21
4,CHEMBL1082723,P23975,NITD609,CKLPLPZSUQEDRT-WPCRTTGESA-N,0,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,20813948.0,IC50,>,10000.0,NM,,,,,,,,,,,,,,Binding affinity to human recombinant norepine...,,"Spiroindolones, a potent compound class for th...",Science,2010.0,329.0,5996.0,"Rottmann M, McNamara C, Yeung BK, Lee MC, Zou ...",,658576.0,DTCT0023180,3467853,DTCC00031143,1686650,40818,10000.0,C[C@H]1CC2=C([C@]3(N1)C4=C(C=CC(=C4)Cl)NC3=O)N...,C[C@H]1Cc2c([nH]c3cc(Cl)c(F)cc23)[C@@]2(N1)C(=...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397,CHEMBL828,P23975,PHENOTHIAZINE,WJFKNYWRSNBZNX-UHFFFAOYSA-N,0,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,KI,=,457.0,NM,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7389434,DTCC00179549,1734060,46191,457.0,C1=CC=C2C(=C1)NC3=CC=CC=C3S2,c1ccc2c(c1)Nc1ccccc1S2
398,CHEMBL85,P23975,RISPERIDONE,RAPZEAPATHNIPO-UHFFFAOYSA-N,4,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,IC50,,,,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,Not Active (inhibition < 50% @ 10 uM and thus ...,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7464362,DTCC00138594,1383782,46191,,CC1=C(C(=O)N2CCCCC2=N1)CCN3CCC(CC3)C4=NOC5=C4C...,Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1)CCCC2
399,CHEMBL85,P23975,RISPERIDONE,RAPZEAPATHNIPO-UHFFFAOYSA-N,4,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,KI,,,,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,Not Active (inhibition < 50% @ 10 uM and thus ...,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7464363,DTCC00138594,1383782,46191,,CC1=C(C(=O)N2CCCCC2=N1)CCN3CCC(CC3)C4=NOC5=C4C...,Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1)CCCC2
400,CHEMBL86304,P23975,MOCLOBEMIDE,YHXISWVBGDMDLQ-UHFFFAOYSA-N,3,NOREPINEPHRINE TRANSPORTER,SLC6A2,Transporter,,,,IC50,,,,,,,,,,,,,,,,MDCK,DRUGMATRIX: Norepinephrine Transporter radioli...,Not Active (inhibition < 50% @ 10 uM and thus ...,DrugMatrix in vitro pharmacology data,,,,,"Scott S. Auerbach, DrugMatrix¨ and ToxFX¨ Coor...",,774705.0,DTCT0023180,7407118,DTCC00246764,833162,46191,,C1COCCN1CCNC(=O)C2=CC=C(C=C2)Cl,O=C(NCCN1CCOCC1)c1ccc(Cl)cc1


In [None]:
orig_df.columns

Index(['Compound_ID', 'Uniprot_ID', 'Compound_Name', 'Standard_inchi_key',
       'Max_Phase', 'Target_Pref_Name', 'Gene_Names', 'Target_Class',
       'Wild_type_or_mutant', 'Mutation_information', 'PubMed_ID',
       'End_Point_Standard_Type', 'End_Point_Standard_Relation',
       'End_Point_Standard_Value', 'End_Point_Standard_Units',
       'Endpoint_Mode_of_Action', 'Assay_Format', 'Assay_Type',
       'Assay_Sub_Type', 'Inhibitor_Type', 'Detection_Technology',
       'Compound_concentration_value', 'Compound_concentration_value_units',
       'Substrate_type', 'Substrate_Type_Standard_Relation',
       'Substrate_Type_Standard_Value', 'Substrate_Type_Standard_Units',
       'Assay_cell_line', 'Assay_Description', 'Activity_Comments', 'Title',
       'Journal', 'Year', 'Volume', 'Issue', 'Authors', 'Annotation_Comments',
       'Assay_ID', 'DTC_Tid', 'DTC_Activity_ID', 'DTC_Molregno', 'Record_ID',
       'DTC_Document_ID', 'pDTC_Value', 'SMILES', 'base_rdkit_smiles'],
      dtype=

In [None]:
df = orig_df[['Compound_ID','Standard_inchi_key','End_Point_Standard_Type','End_Point_Standard_Relation',
       'End_Point_Standard_Value', 'End_Point_Standard_Units','pDTC_Value', 'SMILES', 'base_rdkit_smiles']]
df

Unnamed: 0,Compound_ID,Standard_inchi_key,End_Point_Standard_Type,End_Point_Standard_Relation,End_Point_Standard_Value,End_Point_Standard_Units,pDTC_Value,SMILES,base_rdkit_smiles
0,CHEMBL104700,SCDKHPSUXHBJDJ-UHFFFAOYSA-N,INHIBITION,<,50.0,%,50.0,C1CC2CCC1N2CC3=CN=CC=C3,c1cncc(CN2C3CCC2CC3)c1
1,CHEMBL1079079,WGIPGQAPFNVWIX-XXFZXMJFSA-N,INHIBITION,<,50.0,%,50.0,COC1=CC=C(C=C1)C2=C(C(=NN2C3=CC=CC=C3[125I])C(...,COc1ccc(-c2c(C#N)c(C(=O)NN3CCCCC3)nn2-c2ccccc2...
2,CHEMBL108,FFGPTBGBLSHEPO-UHFFFAOYSA-N,IC50,,,,,C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N,NC(=O)N1c2ccccc2C=Cc2ccccc21
3,CHEMBL108,FFGPTBGBLSHEPO-UHFFFAOYSA-N,KI,,,,,C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N,NC(=O)N1c2ccccc2C=Cc2ccccc21
4,CHEMBL1082723,CKLPLPZSUQEDRT-WPCRTTGESA-N,IC50,>,10000.0,NM,10000.0,C[C@H]1CC2=C([C@]3(N1)C4=C(C=CC(=C4)Cl)NC3=O)N...,C[C@H]1Cc2c([nH]c3cc(Cl)c(F)cc23)[C@@]2(N1)C(=...
...,...,...,...,...,...,...,...,...,...
397,CHEMBL828,WJFKNYWRSNBZNX-UHFFFAOYSA-N,KI,=,457.0,NM,457.0,C1=CC=C2C(=C1)NC3=CC=CC=C3S2,c1ccc2c(c1)Nc1ccccc1S2
398,CHEMBL85,RAPZEAPATHNIPO-UHFFFAOYSA-N,IC50,,,,,CC1=C(C(=O)N2CCCCC2=N1)CCN3CCC(CC3)C4=NOC5=C4C...,Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1)CCCC2
399,CHEMBL85,RAPZEAPATHNIPO-UHFFFAOYSA-N,KI,,,,,CC1=C(C(=O)N2CCCCC2=N1)CCN3CCC(CC3)C4=NOC5=C4C...,Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1)CCCC2
400,CHEMBL86304,YHXISWVBGDMDLQ-UHFFFAOYSA-N,IC50,,,,,C1COCCN1CCNC(=O)C2=CC=C(C=C2)Cl,O=C(NCCN1CCOCC1)c1ccc(Cl)cc1


In [None]:
df.End_Point_Standard_Type.value_counts()

KI            171
INHIBITION    132
IC50           78
ACTIVITY       16
KD              2
EFFICACY        2
EC50            1
Name: End_Point_Standard_Type, dtype: int64

In [None]:
print(sum(df['pDTC_Value'].isna()) )
print(len(df) - sum(df['pDTC_Value'].isna()))

len(orig_df)

61
341


402
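The counts above cover a single column. A per-column summary can be built the same way; the sketch below uses a small made-up dataframe (the column names are borrowed from this dataset, the values are not) to show the pattern.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "Compound_ID": ["A", "B", "C", "D"],
    "pDTC_Value": [50.0, np.nan, 74.0, np.nan],
    "End_Point_Standard_Units": ["%", None, "NM", None],
})

# One row per column: how many values are missing and how many are present
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "n_present": toy.notna().sum(),
})
print(missing_summary)
```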

In [None]:
dset_df = df[~df.Standard_inchi_key.isna() &
             (df.End_Point_Standard_Units == 'NM') &
             ~df.End_Point_Standard_Value.isna() &
             ~df.Compound_ID.isna() &
             (df.End_Point_Standard_Relation == '=') &
             ~df.End_Point_Standard_Relation.isna()]

In [None]:
# Build the mask from dset_df itself so the indexer and the frame share one index
dset_df = dset_df.loc[dset_df.End_Point_Standard_Type.isin(['IC50', 'EC50'])]
dset_df

Unnamed: 0,Compound_ID,Standard_inchi_key,End_Point_Standard_Type,End_Point_Standard_Relation,End_Point_Standard_Value,End_Point_Standard_Units,pDTC_Value,SMILES,base_rdkit_smiles
11,CHEMBL11,BCGWQEUPMDMJNV-UHFFFAOYSA-N,IC50,=,74.0,NM,74.0,CN(C)CCCN1C2=CC=CC=C2CCC3=CC=CC=C31,CN(C)CCCN1c2ccccc2CCc2ccccc21
63,CHEMBL180101,CBQGYUDMJHNJBX-OALUTQOASA-N,IC50,=,3.0,NM,3.0,CCOC1=CC=CC=C1O[C@H]([C@@H]2CNCCO2)C3=CC=CC=C3,CCOc1ccccc1O[C@@H](c1ccccc1)[C@@H]1CNCCO1
65,CHEMBL180101,CBQGYUDMJHNJBX-OALUTQOASA-N,IC50,=,3.1,NM,3.1,CCOC1=CC=CC=C1O[C@H]([C@@H]2CNCCO2)C3=CC=CC=C3,CCOc1ccccc1O[C@@H](c1ccccc1)[C@@H]1CNCCO1
67,CHEMBL180101,CBQGYUDMJHNJBX-OALUTQOASA-N,IC50,=,2.0,NM,2.0,CCOC1=CC=CC=C1O[C@H]([C@@H]2CNCCO2)C3=CC=CC=C3,CCOc1ccccc1O[C@@H](c1ccccc1)[C@@H]1CNCCO1
73,CHEMBL1829335,XUKROCVZGZNGSI-CQSZACIVSA-N,IC50,=,17500.0,NM,17500.0,C[C@@H]1CCCN1CCCOC2=CC=C(C=C2)C3=NNC(=O)C=C3,C[C@@H]1CCCN1CCCOc1ccc(-c2ccc(=O)[nH]n2)cc1
78,CHEMBL19215,WZHJKEUHNJHDLS-QTGUNEKASA-N,IC50,=,366.0,NM,366.0,CN1C[C@@H](C[C@H]2[C@H]1CC3=CN(C4=CC=CC2=C34)C...,CN1C[C@H](CNC(=O)OCc2ccccc2)C[C@@H]2c3cccc4c3c...
81,CHEMBL1949930,IKXSPLZKOZXPFB-RQBPZYBGSA-N,IC50,=,4400.0,NM,4400.0,CC(C)OC1=CC(=CC(=C1)OC)S(=O)(=O)C2=CC3=C(C=C2)...,COc1cc(OC(C)C)cc(S(=O)(=O)c2ccc3c(c2)O[C@H]2CN...
105,CHEMBL2047561,NBIWTPBSAFCSRX-UHFFFAOYSA-N,IC50,=,327.0,NM,327.0,C1CN(CC=C1CC2=CC=CC=C2)CCC(C3=CC4=CC=CC=C4S3)O,OC(CCN1CC=C(Cc2ccccc2)CC1)c1cc2ccccc2s1
106,CHEMBL2047570,QXCOEFGWAUGVTR-UHFFFAOYSA-N,IC50,=,385.0,NM,385.0,C1CN(CC=C1CC2=CC=CC=C2)CCC(C3=CSC4=CC=CC=C43)O,OC(CCN1CC=C(Cc2ccccc2)CC1)c1csc2ccccc12
107,CHEMBL2047571,JYIMWQCCVNSNJW-UHFFFAOYSA-N,IC50,=,420.0,NM,420.0,COC1=C(C2=C(C=C1)C=C(C=C2)C(CCN3CCC(=CC3)CC4=C...,COc1ccc2cc(C(O)CCN3CC=C(Cc4ccccc4)CC3)ccc2c1Cl
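The two filtering cells above can be combined into one function so that the mask is always built from the dataframe being filtered. This is a sketch on toy data; `select_exact_measurements` is a hypothetical name, and the defaults simply mirror the choices made in this notebook.

```python
import numpy as np
import pandas as pd


def select_exact_measurements(df, endpoint_types=("IC50", "EC50"), units="NM"):
    """Hypothetical helper: keep rows with an exact ('=') measurement in the
    given units, a known compound ID and InChI key, and a wanted endpoint type."""
    mask = (
        df["Standard_inchi_key"].notna()
        & df["Compound_ID"].notna()
        & df["End_Point_Standard_Value"].notna()
        & (df["End_Point_Standard_Units"] == units)
        & (df["End_Point_Standard_Relation"] == "=")
        & df["End_Point_Standard_Type"].isin(endpoint_types)
    )
    return df.loc[mask].copy()


# Toy rows for illustration only
toy = pd.DataFrame({
    "Compound_ID": ["C1", "C2", "C3", "C4"],
    "Standard_inchi_key": ["K1", "K2", "K3", None],
    "End_Point_Standard_Type": ["IC50", "KI", "IC50", "EC50"],
    "End_Point_Standard_Relation": ["=", "=", ">", "="],
    "End_Point_Standard_Value": [74.0, 457.0, 10000.0, 3.0],
    "End_Point_Standard_Units": ["NM", "NM", "NM", "NM"],
})
selected = select_exact_measurements(toy)
print(selected["Compound_ID"].tolist())
```

Returning a copy avoids pandas chained-assignment warnings if columns are later added to or renamed in the filtered frame.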


In [None]:
sum(dset_df.base_rdkit_smiles.isna())

0

In [None]:
dset_df = dset_df.rename(columns={"pDTC_Value": "pXC50"})
dset_df.columns

Index(['Compound_ID', 'Standard_inchi_key', 'End_Point_Standard_Type',
       'End_Point_Standard_Relation', 'End_Point_Standard_Value',
       'End_Point_Standard_Units', 'pXC50', 'SMILES', 'base_rdkit_smiles'],
      dtype='object')
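In the filtered table shown earlier, `pDTC_Value` carries the same numbers as `End_Point_Standard_Value` in NM (for example 74.0 for CHEMBL11), so the renamed `pXC50` column appears to hold raw concentrations rather than negative-log values. If a log-scale response is wanted, one common convention is pXC50 = -log10(concentration in mol/L), which for nM inputs is 9 - log10(value). The sketch below assumes that convention; `nm_to_pxc50` and the `pXC50_log` column are hypothetical names, not AMPL functions or dataset columns.

```python
import numpy as np
import pandas as pd


def nm_to_pxc50(values_nm):
    """Convert concentrations in nM to -log10(molar concentration).

    Assumes strictly positive inputs; 1 nM = 1e-9 M, hence the offset of 9.
    """
    values_nm = pd.Series(values_nm, dtype="float64")
    return 9.0 - np.log10(values_nm)


# Toy values for illustration only
toy = pd.DataFrame({"End_Point_Standard_Value": [1.0, 74.0, 1000.0]})
toy["pXC50_log"] = nm_to_pxc50(toy["End_Point_Standard_Value"])
print(toy)
```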

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dset_df.to_csv('drive/MyDrive/Columbia_E4511/DTC_DROP.csv')
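By default `to_csv` writes the row index as an unnamed first column, which then reappears as `Unnamed: 0` when the file is read back. The sketch below shows the difference on a toy dataframe, writing to in-memory buffers instead of Google Drive.

```python
import io

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"Compound_ID": ["CHEMBL11"], "pXC50": [74.0]}, index=[11])

with_index = io.StringIO()
toy.to_csv(with_index)                   # default: index written as first column

without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # index omitted

reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))
print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

Whether to keep the index depends on whether the original row numbers are needed later, for example to trace a curated record back to the raw table.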

In [None]:
feat_type = 'ECFP'
dist_metric = 'tanimoto'
smiles_lst1 = dset_df['base_rdkit_smiles'].tolist()
calc_type = 'nearest'
dist_sample = cd.calc_dist_smiles(feat_type, dist_metric, smiles_lst1, None, calc_type)

print(len(dist_sample))
print(len(smiles_lst1))

58
58
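The call above returns one distance per compound (58 distances for 58 SMILES), which is consistent with `calc_type = 'nearest'` giving each compound's distance to its closest neighbor. The following plain-Python sketch illustrates that idea on toy bit sets. It is not AMPL's or RDKit's implementation, and the helper names are hypothetical; real ECFP fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def nearest_neighbor_distances(fingerprints):
    """For each fingerprint, the smallest distance to any other fingerprint.

    Requires at least two fingerprints.
    """
    result = []
    for i, fp in enumerate(fingerprints):
        others = (tanimoto_distance(fp, other)
                  for j, other in enumerate(fingerprints) if j != i)
        result.append(min(others))
    return result


# Toy fingerprints: the first two share most bits, the third shares none
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}]
nn_dists = nearest_neighbor_distances(toy_fps)
print(nn_dists)
```

A distribution of nearest-neighbor distances close to 0 suggests many near-duplicate structures, while values near 1 indicate compounds with no close analog in the set.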


In [None]:
import os

# From our dataframe, we are working with the pXC50 column
data = dset_df

column = 'pXC50'

# tolerance: Percentage by which an individual response value may differ from
# the average and still be included in averaging
tolerance = 10

# list_bad_duplicates: Print structures with bad duplicates
list_bad_duplicates = 'Yes'

# max_std: Maximum allowed standard deviation for computed average response value
# NOTE: A very high value would effectively disable this check; here it is set to 1
max_std = 1

# compound_id: Compound ID column
compound_id = 'Compound_ID'

# smiles_col: SMILES column
smiles_col = 'base_rdkit_smiles'

# Here we are creating a new dataframe, called check_df
check_df = curate_data.average_and_remove_duplicates(column, tolerance, 
                                                       list_bad_duplicates, 
                                                       data, max_std, 
                                                       compound_id=compound_id, 
                                                       smiles_col=smiles_col)


Bad duplicates removed from dataset
Dataframe size (25, 13)
List of 'bad' duplicates removed
     Compound_ID    pXC50  VALUE_NUM_mean    Perc_Var  VALUE_NUM_std
15      CHEMBL41  5200.00        2689.870   93.317893    2148.532378
16      CHEMBL41  5200.00        2689.870   93.317893    2148.532378
17      CHEMBL41  2000.00        2689.870   25.646964    2148.532378
18      CHEMBL41  2000.00        2689.870   25.646964    2148.532378
19      CHEMBL41   563.00        2689.870   79.069620    2148.532378
20      CHEMBL41  1020.00        2689.870   62.079952    2148.532378
21      CHEMBL41  1020.00        2689.870   62.079952    2148.532378
22      CHEMBL41   563.00        2689.870   79.069620    2148.532378
23      CHEMBL41  6309.57        2689.870  134.567842    2148.532378
24      CHEMBL41  4410.00        2689.870   63.948444    2148.532378
25      CHEMBL41  1303.00        2689.870   51.558997    2148.532378
27   CHEMBL43048   405.00         951.315   57.427351     750.358037
28   CHEMB
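The `Perc_Var` column in the output above matches 100 * |value - mean| / mean; for example, 5200 against a mean of 2689.87 gives about 93.32. The sketch below applies that formula with a pandas `groupby` to flag replicates on toy data. It is an approximation for illustration, not AMPL's exact algorithm: `summarize_replicates` and its output column names are hypothetical, and the flagging rule (percent deviation above `tolerance` or group standard deviation above `max_std`) is an assumption.

```python
import pandas as pd


def summarize_replicates(df, value_col, id_col, tolerance, max_std):
    """Hypothetical helper: per-compound mean, std, and percent deviation,
    plus a 'bad' flag for replicates outside the allowed limits."""
    out = df.copy()
    grouped = out.groupby(id_col)[value_col]
    out["mean"] = grouped.transform("mean")
    # std is NaN for single measurements; treat those as zero spread
    out["std"] = grouped.transform("std").fillna(0.0)
    out["perc_var"] = 100.0 * (out[value_col] - out["mean"]).abs() / out["mean"]
    out["bad"] = (out["perc_var"] > tolerance) | (out["std"] > max_std)
    return out


# Toy replicates for illustration only
toy = pd.DataFrame({
    "Compound_ID": ["A", "A", "B", "B", "C"],
    "pXC50": [100.0, 102.0, 500.0, 5000.0, 10.0],
})
summary = summarize_replicates(toy, "pXC50", "Compound_ID", tolerance=10, max_std=5)
print(summary)
```

Note that a standard-deviation limit is scale dependent: a limit of 1 is strict for values in the hundreds or thousands, which may explain why so many replicates are reported as bad above.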

In [None]:
old_compound_id='base_rdkit_smiles'
new_compound_id='base_rdkit_smiles'

# Takes all the compounds that aren't part of the curated data frame and prints them
reject=data[~data[old_compound_id].isin(check_df[new_compound_id])]
reject

Unnamed: 0,Compound_ID,Standard_inchi_key,End_Point_Standard_Type,End_Point_Standard_Relation,End_Point_Standard_Value,End_Point_Standard_Units,pXC50,SMILES,base_rdkit_smiles
247,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,5200.0,NM,5200.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
248,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,5200.0,NM,5200.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
250,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,2000.0,NM,2000.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
251,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,2000.0,NM,2000.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
252,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,563.0,NM,563.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
253,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,1020.0,NM,1020.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
255,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,1020.0,NM,1020.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
257,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,563.0,NM,563.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
259,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,6309.57,NM,6309.57,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1
262,CHEMBL41,RTHCYVBBDHJXIQ-UHFFFAOYSA-N,IC50,=,4410.0,NM,4410.0,CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1


In [None]:
column = 'pXC50'  # 'standard_value'
list_bad_duplicates = 'Yes'

# Aggregates replicate measurements for the specified value column and names the
# columns that identify each record (compound ID, SMILES, and relation)
temp_df=curate_data.aggregate_assay_data(data, 
                                         value_col=column, 
                                         output_value_col=None,
                                         label_actives=True,
                                         active_thresh=6,
                                         id_col='Compound_ID', 
                                         smiles_col='base_rdkit_smiles', 
                                         relation_col='End_Point_Standard_Relation')

# Drop any rows that contain infinite values; the result is the final curated dataframe
curated_df = temp_df[~temp_df.isin([np.inf]).any(axis=1)]

0 entries in input table are missing SMILES strings
27 unique SMILES strings are reduced to 27 unique base SMILES strings


In [None]:
curated_df

Unnamed: 0,compound_id,base_rdkit_smiles,relation,pXC50,active
0,CHEMBL30713,CC(N)Cc1c[nH]c2ccccc12,,3715.35,1
1,CHEMBL471035,COCC(Oc1ccc2ccccc2c1)C1CCNCC1,,827.0,1
2,CHEMBL479,CSc1ccc2c(c1)N(CCC1CCCCN1C)c1ccccc1S2,,1551.0,1
3,CHEMBL3334797,CN(C)CCC(c1ccc(Cl)c(Cl)c1)N1CCOCC1,,477.5,1
4,CHEMBL2047561,OC(CCN1CC=C(Cc2ccccc2)CC1)c1cc2ccccc2s1,,327.0,1
5,CHEMBL2047571,COc1ccc2cc(C(O)CCN3CC=C(Cc4ccccc4)CC3)ccc2c1Cl,,420.0,1
6,CHEMBL42,CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1,,1470.0,1
7,CHEMBL19215,CN1C[C@H](CNC(=O)OCc2ccccc2)C[C@@H]2c3cccc4c3c...,,366.0,1
8,CHEMBL2047570,OC(CCN1CC=C(Cc2ccccc2)CC1)c1csc2ccccc12,,385.0,1
9,CHEMBL549,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,,2196.0,1
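In the output above every compound is labelled `active` = 1. That is what a threshold of 6 produces if the value column holds concentrations in nM, since all of the listed values are far above 6. The sketch below contrasts thresholding raw nM values with thresholding -log10(molar) values. It assumes that "active" means a value at or above the threshold and that pXC50 = 9 - log10(nM); neither assumption is confirmed here against AMPL's implementation.

```python
import numpy as np
import pandas as pd

ACTIVE_THRESH = 6  # same threshold value as used in this notebook

# Three of the aggregated values shown above, interpreted as nM
values_nm = pd.Series([327.0, 2196.0, 3715.35])

# Thresholding the raw nM values: every value is >= 6
active_raw = (values_nm >= ACTIVE_THRESH).astype(int)

# Thresholding -log10(molar) values instead (assumed convention)
p_values = 9.0 - np.log10(values_nm)
active_log = (p_values >= ACTIVE_THRESH).astype(int)

print(active_raw.tolist())
print(active_log.tolist())
```

Under these assumptions, only compounds more potent than 1000 nM would be labelled active on the log scale, which gives a label column with both classes present.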


In [None]:
curated_df.to_csv('drive/MyDrive/Columbia_E4511/DTC_Curated.csv')