<a href="https://colab.research.google.com/github/yr2387/E4511-2021-Rong/blob/main/Merge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Predicting Solubility Using AMPL</h1>

The ATOM Modeling PipeLine (AMPL; https://github.com/ATOMconsortium/AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.

**Warning: This is an experimental notebook**

# Goal: Predict solubility using the ATOM Modeling Pipeline (AMPL) on a public dataset

In this notebook, we describe the following steps using AMPL:

1.   Read a public dataset containing chemical structures and some properties
2.   Curate the dataset
3.   Fit a simple model
4.   Predict solubility for withheld compounds


## Set up
We first import the AMPL modules for use in this notebook.

The relevant AMPL modules for this example are listed below:

|Module|Description|
|-|-|
|`atomsci.ddm.pipeline.model_pipeline`|The model pipeline module is used to fit models and load models for prediction.|
|`atomsci.ddm.pipeline.parameter_parser`|The parameter parser reads through pipeline options for the model pipeline.|
|`atomsci.ddm.utils.curate_data`|The curate data module is used for data loading and pre-processing.|
|`atomsci.ddm.utils.struct_utils`|The structure utilities module is used to process loaded structures.|
|`atomsci.ddm.pipeline.perf_plots`|The perf plots module contains a variety of plotting functions for assessing model performance.|

## Install AMPL

In [None]:
%tensorflow_version 1.x

# get the Anaconda file 
! wget -c https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
! chmod +x Anaconda3-2019.10-Linux-x86_64.sh
! bash ./Anaconda3-2019.10-Linux-x86_64.sh -b -f -p /usr/local

! time conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0

import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import deepchem as dc

# install mordred, bravado and molvs
! time conda install -c conda-forge -y mordred bravado molvs

# get the install_AMPL_GPU_test.sh script
!wget https://raw.githubusercontent.com/ravichas/AMPL-Tutorial/master/config/install_AMPL_GPU_test.sh

# run the script to install AMPL
! chmod u+x install_AMPL_GPU_test.sh
! ./install_AMPL_GPU_test.sh

TensorFlow 1.x selected.
--2021-04-02 14:41:55--  https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

PREFIX=/usr/local
./Anaconda3-2019.10-Linux-x86_64.sh: line 346: /usr/local/conda.exe: Text file busy
Unpacking payload ...
Collecting package metadata (current_repodata.json): done
Solving environment:



The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Collecting package metadata (current_repodata.json): failed

InvalidVersionSpec: Invalid version '4.19.112+': empty version component


real	0m6.614s
user	0m5.625s
sys	0m1.108s
--2021-04-02 14:44:41--  https://raw.githubusercontent.com/ravichas/AMPL-Tutorial/master/config/install_AMPL_GPU_test.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connecte

In [None]:
# Load AMPL in this notebook

site_packages_path = '/content/AMPL/lib/python3.7/site-packages'
if site_packages_path not in sys.path:
  sys.path.insert(1, site_packages_path)
sys.path

['/tensorflow-1.15.2/python3.7',
 '/content/AMPL/lib/python3.7/site-packages',
 '',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython',
 '/usr/local/lib/python3.7/site-packages/']

In [None]:
# There is a problem with the previously imported cffi, so delete it and 
# load it with AMPL instead
if 'cffi' in sys.modules:
  del sys.modules['cffi']
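
The same idea can be generalized: removing a package and its submodules from `sys.modules` forces the next `import` to resolve it again against the current `sys.path`. The following is a minimal sketch, not part of AMPL; `purge_module` is a hypothetical helper name, and the demonstration uses the standard-library `json` package instead of `cffi` so that it runs anywhere.

```python
import importlib
import sys


def purge_module(name):
    """Remove `name` and its submodules from the import cache (hypothetical helper)."""
    doomed = [m for m in sys.modules if m == name or m.startswith(name + ".")]
    for m in doomed:
        del sys.modules[m]
    return doomed


# Demonstrated on a standard-library package rather than cffi.
import json

removed = purge_module("json")
# A fresh import is now resolved again through the current sys.path.
json = importlib.import_module("json")
print(removed)
```

Objects that were imported before the purge keep pointing at the old module, so this is best done before any dependent package has been loaded.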

In [None]:
# The UMAP library is published on PyPI as umap-learn; the package named "umap" is unrelated
! pip install umap-learn



In [None]:
# We temporarily disable warnings for demonstration.
# FutureWarnings and DeprecationWarnings are present from some of the AMPL 
# dependency modules.
import warnings
warnings.filterwarnings('ignore')

import json
import numpy as np
import pandas as pd
import os
import requests
import sys

#import atomsci.ddm.pipeline.model_pipeline as mp
import atomsci.ddm.pipeline.parameter_parser as parse
import atomsci.ddm.utils.curate_data as curate_data
import atomsci.ddm.utils.struct_utils as struct_utils
from atomsci.ddm.pipeline import perf_plots as pp


## Data curation

We then download the datasets and apply some very simple curation.

The ChEMBL and ExCAPE files are downloaded into the current working directory, and the DTC file is read from a mounted Google Drive folder.

In [None]:
! wget https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/CHEMBL_Curated.csv
#! wget https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/DTC_Curated.csv
! wget https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/SLC6A2_Excape_SMILES.csv

--2021-04-02 14:47:52--  https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/CHEMBL_Curated.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130672 (128K) [text/plain]
Saving to: ‘CHEMBL_Curated.csv.9’


2021-04-02 14:47:52 (4.88 MB/s) - ‘CHEMBL_Curated.csv.9’ saved [130672/130672]

--2021-04-02 14:47:53--  https://raw.githubusercontent.com/yr2387/E4511-2021-Rong/main/Data/SLC6A2_Excape_SMILES.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 504793 (493K) [text/plain]
Saving to: ‘SLC6A2_Excape_SMILES.csv.9’
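
The `.9` suffix in the log shows that rerunning the cell saves a new copy each time, while `pd.read_csv` later reads the unsuffixed file. One option is `wget -nc` (no-clobber). Another is a small Python helper that skips the download when the file already exists, sketched below. `download_once` and the `example.org` URL are hypothetical, and the demonstration uses a stand-in fetcher so that it runs offline.

```python
import os
import tempfile
import urllib.request


def download_once(url, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Download `url` into `dest_dir` unless the file already exists (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(path):
        fetch(url, path)
    return path


# Offline demonstration with a stand-in fetcher; omit `fetch` for real downloads.
calls = []


def fake_fetch(url, path):
    calls.append(url)
    with open(path, "w") as fh:
        fh.write("compound_id,base_rdkit_smiles\n")


with tempfile.TemporaryDirectory() as tmp:
    first = download_once("https://example.org/data/toy.csv", tmp, fetch=fake_fetch)
    second = download_once("https://example.org/data/toy.csv", tmp, fetch=fake_fetch)

print(len(calls))  # the second call finds the file and does not fetch again
```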




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
dtc = pd.read_csv('drive/MyDrive/Columbia_E4511/DTC_Curated.csv', header=0, index_col=0)

In [None]:
#dtc = pd.read_csv('DTC_Curated.csv', header = 0,index_col=0)
chembl = pd.read_csv('CHEMBL_Curated.csv', header=0, index_col=0)
excape = pd.read_csv('SLC6A2_Excape_SMILES.csv', header=0, index_col=0)

In [None]:
dtc

Unnamed: 0,compound_id,base_rdkit_smiles,relation,pXC50,active
0,CHEMBL30713,CC(N)Cc1c[nH]c2ccccc12,,3715.35,1
1,CHEMBL471035,COCC(Oc1ccc2ccccc2c1)C1CCNCC1,,827.0,1
2,CHEMBL479,CSc1ccc2c(c1)N(CCC1CCCCN1C)c1ccccc1S2,,1551.0,1
3,CHEMBL3334797,CN(C)CCC(c1ccc(Cl)c(Cl)c1)N1CCOCC1,,477.5,1
4,CHEMBL2047561,OC(CCN1CC=C(Cc2ccccc2)CC1)c1cc2ccccc2s1,,327.0,1
5,CHEMBL2047571,COc1ccc2cc(C(O)CCN3CC=C(Cc4ccccc4)CC3)ccc2c1Cl,,420.0,1
6,CHEMBL42,CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1,,1470.0,1
7,CHEMBL19215,CN1C[C@H](CNC(=O)OCc2ccccc2)C[C@@H]2c3cccc4c3c...,,366.0,1
8,CHEMBL2047570,OC(CCN1CC=C(Cc2ccccc2)CC1)c1csc2ccccc12,,385.0,1
9,CHEMBL549,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,,2196.0,1


In [None]:
dtc.drop('relation',axis=1,inplace=True)
dtc.rename( columns={"pXC50" : "PXC50"}, inplace = True)
dtc

Unnamed: 0,compound_id,base_rdkit_smiles,PXC50,active
0,CHEMBL30713,CC(N)Cc1c[nH]c2ccccc12,3715.35,1
1,CHEMBL471035,COCC(Oc1ccc2ccccc2c1)C1CCNCC1,827.0,1
2,CHEMBL479,CSc1ccc2c(c1)N(CCC1CCCCN1C)c1ccccc1S2,1551.0,1
3,CHEMBL3334797,CN(C)CCC(c1ccc(Cl)c(Cl)c1)N1CCOCC1,477.5,1
4,CHEMBL2047561,OC(CCN1CC=C(Cc2ccccc2)CC1)c1cc2ccccc2s1,327.0,1
5,CHEMBL2047571,COc1ccc2cc(C(O)CCN3CC=C(Cc4ccccc4)CC3)ccc2c1Cl,420.0,1
6,CHEMBL42,CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1,1470.0,1
7,CHEMBL19215,CN1C[C@H](CNC(=O)OCc2ccccc2)C[C@@H]2c3cccc4c3c...,366.0,1
8,CHEMBL2047570,OC(CCN1CC=C(Cc2ccccc2)CC1)c1csc2ccccc12,385.0,1
9,CHEMBL549,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,2196.0,1


In [None]:
chembl.drop('relation',axis=1,inplace=True)
chembl

Unnamed: 0,compound_id,base_rdkit_smiles,PXC50,active
0,CHEMBL512967,CCC(=O)N(Cc1ccc(Cl)cc1Cl)[C@H]1CCNC1,7.22,1
1,CHEMBL4248596,COc1ccccc1N1CCN(CCCNC(=O)c2ccc(-c3ccccc3)cc2)CC1,5.30,0
2,CHEMBL828,c1ccc2c(c1)Nc1ccccc1S2,6.34,1
3,CHEMBL67203,c1ccc(CCCN2CCCC(CNCCOC(c3ccccc3)c3ccccc3)C2)cc1,7.01,1
4,CHEMBL497479,CNC[C@@H]1COc2ccccc2[C@@H]1Oc1ccccc1Cl,7.51,1
...,...,...,...,...
1902,CHEMBL3673152,[C-]#[N+]c1cccc(-c2ccc3c(c2)CN2CCC3(c3ccc(Cl)c...,6.45,1
1903,CHEMBL599846,CC(NC1CCCC1)C(=O)c1cccc(Br)c1,5.60,0
1904,CHEMBL3673149,O=c1ccccn1-c1ccc2c(c1)CN1CCC2(c2ccc(Cl)cc2)CC1,5.27,0
1905,CHEMBL3317702,CC1NCCCN(c2ccc3ccccc3c2)C1=O,7.85,1


In [None]:
excape.rename(columns={"pXC50": "PXC50", "Original_Entry_ID": "compound_id", "Activity_Flag": "active"}, inplace=True)
# Copy the selected columns so the assignment below works on an independent frame
excape_ = excape.loc[:, ['compound_id', 'base_rdkit_smiles', 'PXC50', 'active']].copy()
excape_['active'] = excape_['active'].map({'A': 1, 'N': 0})
excape_

Unnamed: 0,compound_id,base_rdkit_smiles,PXC50,active
0,CHEMBL1289,Clc1cc(Cl)c(OCC#CI)cc1Cl,5.56000,1
1,16494915,NCC1(c2cccs2)CCCCC1,5.00056,1
2,CHEMBL195437,CCCCCCCCc1ccc(O)cc1,4.21000,0
3,CHEMBL526,CC(C)c1cccc(C(C)C)c1O,5.03000,1
4,CHEMBL6731,CC(N)Cc1ccc2c(c1)OCO2,6.58000,1
...,...,...,...,...
2771,CHEMBL595767,CN1C2CCC1[C@@H](C(=O)NCc1ccc(CNC(=O)[C@H]3C4CC...,6.21000,1
2772,CHEMBL611963,CN1C2CCC1[C@@H](C(=O)Nc1ccc(CNC(=O)[C@H]3C4CCC...,5.57000,1
2773,CHEMBL2371923,C[C@@H](O)[C@H](NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O...,4.60000,0
2774,CHEMBL1200633,CC[C@H](C)[C@H]1O[C@]2(CC[C@@H]1C)C[C@@H]1C[C@...,5.41000,1
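
The per-source renaming, column selection, and label mapping above can be folded into one reusable function. This is a minimal sketch on toy data, not part of the original notebook. `to_common_schema` and `COMMON_COLUMNS` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

COMMON_COLUMNS = ["compound_id", "base_rdkit_smiles", "PXC50", "active"]


def to_common_schema(df, rename=None, active_map=None):
    """Return a new frame restricted to the common columns (hypothetical helper)."""
    out = df.rename(columns=rename or {})
    out = out.loc[:, COMMON_COLUMNS].copy()
    if active_map is not None:
        out["active"] = out["active"].map(active_map)
    return out


# Toy stand-in for the ExCAPE frame; values are illustrative only.
toy_excape = pd.DataFrame({
    "Original_Entry_ID": ["X1", "X2"],
    "base_rdkit_smiles": ["CCO", "CCN"],
    "pXC50": [5.5, 4.2],
    "Activity_Flag": ["A", "N"],
    "extra": [0, 1],
})

harmonized = to_common_schema(
    toy_excape,
    rename={"pXC50": "PXC50", "Original_Entry_ID": "compound_id", "Activity_Flag": "active"},
    active_map={"A": 1, "N": 0},
)
print(harmonized)
```

Returning a new frame instead of using `inplace=True` leaves the loaded data untouched, so the cell can be rerun safely.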


In [None]:
comb_df =  pd.concat([dtc,chembl,excape_])
comb_df

Unnamed: 0,compound_id,base_rdkit_smiles,PXC50,active
0,CHEMBL30713,CC(N)Cc1c[nH]c2ccccc12,3715.35,1
1,CHEMBL471035,COCC(Oc1ccc2ccccc2c1)C1CCNCC1,827.00,1
2,CHEMBL479,CSc1ccc2c(c1)N(CCC1CCCCN1C)c1ccccc1S2,1551.00,1
3,CHEMBL3334797,CN(C)CCC(c1ccc(Cl)c(Cl)c1)N1CCOCC1,477.50,1
4,CHEMBL2047561,OC(CCN1CC=C(Cc2ccccc2)CC1)c1cc2ccccc2s1,327.00,1
...,...,...,...,...
2771,CHEMBL595767,CN1C2CCC1[C@@H](C(=O)NCc1ccc(CNC(=O)[C@H]3C4CC...,6.21,1
2772,CHEMBL611963,CN1C2CCC1[C@@H](C(=O)Nc1ccc(CNC(=O)[C@H]3C4CCC...,5.57,1
2773,CHEMBL2371923,C[C@@H](O)[C@H](NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O...,4.60,0
2774,CHEMBL1200633,CC[C@H](C)[C@H]1O[C@]2(CC[C@@H]1C)C[C@@H]1C[C@...,5.41,1
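
In the combined frame above the row labels restart for each source (0, 1, 2, ... and then 2771, 2772, ...), so index labels are not unique. The sketch below shows one way to keep track of where each row came from and to get a clean index. It uses toy frames, and the `source` column is an addition that the original notebook does not have.

```python
import pandas as pd

# Toy stand-ins for the per-source frames; values are illustrative only.
dtc_toy = pd.DataFrame({"compound_id": ["A", "B"], "PXC50": [6.0, 7.0]})
chembl_toy = pd.DataFrame({"compound_id": ["B", "C"], "PXC50": [7.2, 5.0]})

frames = {"dtc": dtc_toy, "chembl": chembl_toy}

# Tag each frame with its origin, then stack with a fresh 0..n-1 index.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)
print(combined)
```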


In [None]:
column = 'PXC50'

# tolerance: Percentage by which individual response values may differ
# from the average and still be included in averaging
tolerance = 10

# list_bad_duplicates: Print structures with bad duplicates
list_bad_duplicates = 'Yes'

# max_std: Maximum allowed standard deviation for computed average response value
# NOTE: A very large value would effectively disable this filter; here it is set to 1
max_std = 1

# compound_id: Compound ID column
compound_id = 'compound_id'

# smiles_col: SMILES column
smiles_col = 'base_rdkit_smiles'

curated_df = curate_data.average_and_remove_duplicates(column, tolerance, 
                                                       list_bad_duplicates, 
                                                       comb_df, max_std, 
                                                       compound_id=compound_id, 
                                                       smiles_col=smiles_col)

Bad duplicates removed from dataset
Dataframe size (4623, 8)
List of 'bad' duplicates removed
     compound_id      PXC50  VALUE_NUM_mean    Perc_Var  VALUE_NUM_std
2273    11622909    9.10018        8.140090   11.794587       1.357772
2110    24691160    5.74473      603.831577   99.048620    1035.912242
2321    44592242    6.08249      279.720830   97.825514     473.957664
2584        4528    8.30103        7.250515   14.488833       1.485653
3455    66572393    6.41454      132.611513   95.162909     218.574841
...          ...        ...             ...         ...            ...
4232   CHEMBL726    5.71000      657.806667   99.131964    1129.464558
1073   CHEMBL726    5.71000      657.806667   99.131964    1129.464558
19     CHEMBL828  461.00000      157.893333  191.969262     262.498073
1940   CHEMBL828    6.34000      157.893333   95.984631     262.498073
29     CHEMBL828    6.34000      157.893333   95.984631     262.498073

[87 rows x 5 columns]

Dataset de-duplicated
Datafram
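
The statistics columns in the output can be reproduced with plain pandas, which helps when interpreting `tolerance` and `max_std`. The sketch below is not AMPL's implementation. Grouping by `compound_id` and the exact flagging rule are assumptions. The three `CHEMBL828` values are the ones printed above, and `TOY1` is a made-up compound.

```python
import pandas as pd

toy = pd.DataFrame({
    "compound_id": ["CHEMBL828", "CHEMBL828", "CHEMBL828", "TOY1", "TOY1"],
    "PXC50": [461.0, 6.34, 6.34, 7.0, 7.2],
})

tolerance = 10  # percent deviation from the group mean
max_std = 1     # maximum allowed standard deviation within a group

by_compound = toy.groupby("compound_id")["PXC50"]
toy["VALUE_NUM_mean"] = by_compound.transform("mean")
toy["VALUE_NUM_std"] = by_compound.transform("std")  # sample standard deviation (ddof=1)
toy["Perc_Var"] = (
    (toy["PXC50"] - toy["VALUE_NUM_mean"]).abs() / toy["VALUE_NUM_mean"] * 100
)

# Assumed rule: a measurement is a bad duplicate if it strays too far from
# its group mean, or if the group as a whole is too spread out.
toy["bad_duplicate"] = (toy["Perc_Var"] > tolerance) | (toy["VALUE_NUM_std"] > max_std)
kept = toy.loc[~toy["bad_duplicate"]]
print(toy)
```

For `CHEMBL828` this gives a mean near 157.89 and a standard deviation near 262.5, in line with the printed table. A compound with a single measurement has an undefined (NaN) standard deviation, and the comparison with `max_std` is then false, so such rows are kept under this rule.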

In [None]:
curated_df

Unnamed: 0,compound_id,base_rdkit_smiles,PXC50,active,VALUE_NUM_mean,VALUE_NUM_std,Perc_Var,Remove_BadDuplicate
0,CHEMBL512967,CCC(=O)N(Cc1ccc(Cl)cc1Cl)[C@H]1CCNC1,7.22000,1,7.220925,0.001308,0.012810,0
1,CHEMBL4248596,COc1ccccc1N1CCN(CCCNC(=O)c2ccc(-c3ccccc3)cc2)CC1,5.30000,0,5.300000,,0.000000,0
2,CHEMBL67203,c1ccc(CCCN2CCCC(CNCCOC(c3ccccc3)c3ccccc3)C2)cc1,7.01000,1,7.325000,0.445477,4.300341,0
3,CHEMBL497479,CNC[C@@H]1COc2ccccc2[C@@H]1Oc1ccccc1Cl,7.51000,1,7.550000,0.056569,0.529801,0
4,CHEMBL4226362,CN(C)C[C@]1(c2ccc(Cl)c(Cl)c2)CC[C@@](C)(O)CC1,8.20500,1,8.205000,,0.000000,0
...,...,...,...,...,...,...,...,...
4616,CHEMBL607547,CN1C2CCC1[C@@H](C(=O)NCc1ccc(CCNC(=O)[C@H]3C4C...,5.89000,1,5.890000,,0.000000,0
4617,46226549,CN1C2CCC1[C@@H](C(=O)NCCCCCCCCCCNC(=O)[C@H]1C3...,5.82391,1,5.823910,,0.000000,0
4618,CHEMBL595767,CN1C2CCC1[C@@H](C(=O)NCc1ccc(CNC(=O)[C@H]3C4CC...,6.21000,1,6.210000,,0.000000,0
4619,CHEMBL611963,CN1C2CCC1[C@@H](C(=O)Nc1ccc(CNC(=O)[C@H]3C4CCC...,5.57000,1,5.570000,,0.000000,0


In [None]:
curated_df = curated_df.drop_duplicates(subset='compound_id', keep="first")

In [None]:
len(curated_df.compound_id.unique())

3120

In [None]:
len(curated_df.base_rdkit_smiles.unique())

3120
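
The two counts above can be gathered into one check that also compares them with the number of rows. This is a minimal sketch on toy data. `summarize_uniqueness` is a hypothetical helper, and its default column names follow this notebook.

```python
import pandas as pd


def summarize_uniqueness(df, id_col="compound_id", smiles_col="base_rdkit_smiles"):
    """Report row count and uniqueness of IDs and SMILES (hypothetical helper)."""
    return {
        "rows": len(df),
        "unique_ids": df[id_col].nunique(),
        "unique_smiles": df[smiles_col].nunique(),
        "one_row_per_id": df[id_col].is_unique,
        "one_row_per_smiles": df[smiles_col].is_unique,
    }


# Toy frame in which two different IDs share the same structure.
toy = pd.DataFrame({
    "compound_id": ["A", "B", "C"],
    "base_rdkit_smiles": ["CCO", "CCO", "CCN"],
})
summary = summarize_uniqueness(toy)
print(summary)
```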

In [None]:
curated_df.to_csv('drive/MyDrive/Columbia_E4511/merge.csv')
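
`to_csv` writes the row index as an unnamed first column by default, which is why the files above are read back with `index_col=0`. If the index carries no information, it can be left out when saving. The sketch below shows the round trip on a toy frame written to a temporary directory. The file name `merge_toy.csv` is made up.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"compound_id": ["A", "B"], "PXC50": [6.0, 7.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merge_toy.csv")
    toy.to_csv(path, index=False)  # omit the row index from the file
    reloaded = pd.read_csv(path)   # no index_col needed on the way back

print(list(reloaded.columns))
```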