# Credit
This code is adapted from the code examples in Chapter 3 of [this book](https://www.oreilly.com/library/view/deep-learning-for/9781492039822/): Ramsundar, Bharath; Eastman, Peter; Walters, Patrick; Pande, Vijay. Deep Learning for the Life Sciences.



![Book cover: Deep Learning for the Life Sciences](https://www.safaribooksonline.com/library/cover/9781492039822/360h/)

# ToolKit Description
We will use DeepChem in this example. DeepChem is an open-source Python library built on top of Google's TensorFlow for deep learning in drug discovery, materials science, quantum chemistry, and biology. 
You can learn more about DeepChem [here](https://deepchem.io/about.html).

![DeepChem logo](https://avatars1.githubusercontent.com/u/17170641?s=400&v=4)

# Installing DeepChem

Do not worry if you do not understand everything here. 
All you need to know is that this section installs 
DeepChem (and RDKit, which it depends on) on Colab.

In [2]:
# Installing RDKit
!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

--2019-08-28 20:06:36--  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c94f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): done
Solving environment:

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import sys
import os
# Append the conda site-packages directory (where RDKit was installed)
# to the current Python module search path.
sys.path.append('/usr/local/lib/python3.7/site-packages/')
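
The hard-coded `python3.7` directory only works if Miniconda happened to install that Python version. As a hedged alternative, the sketch below searches the install prefix for any `python3.*/site-packages` directory instead. The helper names `find_site_packages` and `add_to_sys_path` are made up for this illustration and are not part of DeepChem or Colab.

```python
import glob
import os
import sys

def find_site_packages(prefix="/usr/local"):
    """Return the site-packages directories found under prefix/lib/python3.*/."""
    pattern = os.path.join(prefix, "lib", "python3.*", "site-packages")
    return sorted(glob.glob(pattern))

def add_to_sys_path(paths):
    """Append each path to sys.path at most once; return the paths that were added."""
    added = []
    for path in paths:
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added

# Only list the candidates here; call add_to_sys_path(...) to actually use them.
print(find_site_packages())
```

Calling `add_to_sys_path(find_site_packages())` would then replace the single `sys.path.append(...)` line. Note that packages built for a different Python version than the running interpreter may still fail to import.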

In [4]:
# Install DeepChem 
!pip install deepchem

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/05/03/ccdd048c61c070dca8aa572010c7ae39a46caad162ca7a3ecc62881b5124/deepchem-2.2.1.dev54.tar.gz (3.9MB)
     |████████████████████████████████| 3.9MB 2.8MB/s 
Building wheels for collected packages: deepchem
  Building wheel for deepchem (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/c7/49/0f/0b4235337998b7eadd19f137bf648515da501ad09fd63d4ba0
Successfully built deepchem
Installing collected packages: deepchem
Successfully installed deepchem-2.2.1.dev54


# Importing Libraries

In [0]:
import deepchem as dc
import numpy as np

# Loading Data

The goal of this task is to predict, from its chemical structure, whether a compound is active or inactive in each of 12 pathway assays. The Tox21 library contains about 10K compounds; the training, validation, and test sets loaded below total 7,831 molecules.

This example is based on a recent program started by NIH and EPA (see more [here](https://tripod.nih.gov/tox21/challenge/about.jsp)):

*The Toxicology in the 21st Century (Tox21) program, a federal collaboration involving NIH, the Environmental Protection Agency, and the Food and Drug Administration, is aimed at developing better toxicity assessment methods. The goal is to quickly and efficiently test whether certain chemical compounds have the potential to disrupt processes in the human body that may lead to adverse health effects.*

This dataset can be found in the [MoleculeNet](http://moleculenet.ai/) repository. 


In [7]:
# Load and process Tox21 toxicity dataset
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21()

Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/tox21.csv.gz
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
TIMING: featurizing shard 0 took 17.025 s
TIMING: dataset construction took 17.405 s
Loading dataset from disk.
TIMING: dataset construction took 0.459 s
Loading dataset from disk.
TIMING: dataset construction took 0.404 s
Loading dataset from disk.
TIMING: dataset construction took 0.214 s
Loading dataset from disk.
TIMING: dataset construction took 0.214 s
Loading dataset from disk.


In [8]:
# Each task corresponds to a particular experiment,
# e.g. an enzymatic assay that measures whether the molecules in Tox21 bind to a specific biological target.
# NR-AR, NR-AhR, ... are the targets.
print('Targets: ', tox21_tasks)
print('Number of Tasks: ', len(tox21_tasks))

Targets:  ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53']
Number of Tasks:  12


### More Information

The 12 pathways are:
1. androgen receptor, full length (NR-AR)
2. androgen receptor, ligand-binding domain (NR-AR-LBD)
3. aryl hydrocarbon receptor (NR-AhR)
4. aromatase (NR-Aromatase)
5. estrogen receptor alpha, full length (NR-ER)
6. estrogen receptor alpha, ligand-binding domain (NR-ER-LBD)
7. peroxisome proliferator-activated receptor gamma (NR-PPAR-gamma)
8. nuclear factor (erythroid-derived 2)-like 2/antioxidant responsive element (SR-ARE)
9. ATPase family AAA domain-containing protein 5 (SR-ATAD5)
10. heat shock factor response element (SR-HSE)
11. mitochondrial membrane potential (SR-MMP)
12. p53 (SR-p53)




In [9]:
# the three datasets represent the training, validation, and test sets
# the 12 labels correspond to the 12 tasks
train_dataset, valid_dataset, test_dataset = tox21_datasets
print('train_dataset X size (samples, features)= ', train_dataset.X.shape)
print('train_dataset y size (samples, labels)= ', train_dataset.y.shape, '\n')

print('valid_dataset X size (samples, features)= ', valid_dataset.X.shape)
print('valid_dataset y size (samples, labels)= ', valid_dataset.y.shape, '\n')

print('test_dataset X size (samples, features)= ', test_dataset.X.shape)
print('test_dataset y size (samples, labels)= ', test_dataset.y.shape)

train_dataset X size (samples, features)=  (6264, 1024)
train_dataset y size (samples, labels)=  (6264, 12) 

valid_dataset X size (samples, features)=  (783, 1024)
valid_dataset y size (samples, labels)=  (783, 12) 

test_dataset X size (samples, features)=  (784, 1024)
test_dataset y size (samples, labels)=  (784, 12)
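
The three shapes above are consistent with a split of roughly 80% / 10% / 10%. The short check below simply redoes that arithmetic from the printed sizes; it does not call DeepChem, and the numbers are copied by hand from the output.

```python
# Split sizes copied from the shapes printed above.
split_sizes = {"train": 6264, "valid": 783, "test": 784}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print("total molecules:", total)
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```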


In [10]:
# Tox21 did not test every molecule in every task, so some of the 12 labels are meaningless placeholders for some molecules.
# In such cases, the corresponding weight w is zero, marking a missing experiment.
print('train_dataset w size (samples, weights)= ', train_dataset.w.shape)
print('Number of non-zero weights in the training set: ', np.count_nonzero(train_dataset.w))
print('Number of zero weights (missing experiments): ', np.count_nonzero(train_dataset.w == 0))

train_dataset w size (samples, weights)=  (6264, 12)
Number of non-zero weights in the training set:  62166
Number of zero weights (missing experiments):  13002
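
To make the role of `w` concrete, here is a minimal NumPy sketch of a weighted binary cross-entropy in which entries with `w == 0` contribute nothing to the loss. It illustrates the masking idea only: the function name, the toy arrays, and the normalization by `sum(w)` are assumptions for this example, and DeepChem's internal loss may be computed differently.

```python
import numpy as np

def weighted_binary_cross_entropy(y_true, y_prob, w, eps=1e-7):
    """Binary cross-entropy where each entry is scaled by its weight w."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_entry = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.sum(w * per_entry) / np.sum(w))

# Toy data: 3 molecules, 2 tasks. The label at [1, 1] is a placeholder (w = 0).
y = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
p = np.array([[0.9, 0.2], [0.1, 0.99], [0.8, 0.7]])
w = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

loss_masked = weighted_binary_cross_entropy(y, p, w)
loss_unmasked = weighted_binary_cross_entropy(y, p, np.ones_like(w))
print("masked loss:  ", round(loss_masked, 4))
print("unmasked loss:", round(loss_unmasked, 4))

# Fraction of missing measurements, from the counts printed above.
missing_fraction = 13002 / (62166 + 13002)
print("missing fraction:", round(missing_fraction, 3))
```

Without the mask, the confident but meaningless prediction at the placeholder entry would dominate the loss. From the counts above, roughly 17% of the training label entries are missing.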


In [14]:
# The BalancingTransformer reweights individual data points so that, for each task, the active and inactive classes carry equal total weight (Tox21 is heavily imbalanced: most measured labels are inactive).
transformers

[<deepchem.trans.transformers.BalancingTransformer at 0x7fba82a3c198>]
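
As a rough illustration of what balancing the weights means, the sketch below rescales the weights per task so that actives and inactives carry the same total weight, while missing entries stay at zero. This is an assumption about the general technique, written in plain NumPy; `balance_weights` is a hypothetical helper and not DeepChem's implementation.

```python
import numpy as np

def balance_weights(y, w):
    """Rescale w per task so positives and negatives carry equal total weight.

    Entries with w == 0 (missing measurements) stay at zero.
    """
    w_balanced = np.zeros_like(w, dtype=float)
    for task in range(y.shape[1]):
        measured = w[:, task] != 0
        pos = measured & (y[:, task] == 1)
        neg = measured & (y[:, task] == 0)
        n_pos, n_neg = pos.sum(), neg.sum()
        if n_pos == 0 or n_neg == 0:
            # Only one class present: nothing to balance, keep the original weights.
            w_balanced[:, task] = w[:, task]
            continue
        w_balanced[neg, task] = 1.0
        w_balanced[pos, task] = n_neg / n_pos
    return w_balanced

# Toy data: 5 molecules, 2 tasks; molecule 3 was not measured in task 1.
y = np.array([[1, 0], [0, 0], [0, 1], [0, 0], [0, 0]])
w = np.array([[1, 1], [1, 1], [1, 1], [1, 0], [1, 1]], dtype=float)

w_bal = balance_weights(y, w)
print(w_bal)
```

In task 0 the single active gets weight 4 to match the four inactives; in task 1 it gets weight 3 because only three inactives were actually measured.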

# Training the Model

In [0]:
# define a fully connected network with 12 output nodes and a single hidden layer with 1,000 nodes
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])
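
For intuition, the architecture described in the comment above (1,024 inputs, one hidden layer of 1,000 nodes, 12 outputs) can be sketched as a forward pass in plain NumPy. This is a simplified view with randomly initialized parameters, a ReLU hidden layer, and one sigmoid output per task. Those choices are assumptions for illustration; details of the real `MultitaskClassifier`, such as its output layer and regularization, are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_tasks = 1024, 1000, 12

# Randomly initialized parameters; a trained model would learn these.
W1 = rng.normal(scale=0.02, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.02, size=(n_hidden, n_tasks))
b2 = np.zeros(n_tasks)

def forward(X):
    """Map a batch of feature vectors to one probability per task."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU hidden layer
    logits = hidden @ W2 + b2               # one logit per task
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid: probability of "active"

# A toy batch of 4 random binary feature vectors.
X_batch = rng.integers(0, 2, size=(4, n_features)).astype(float)
probs = forward(X_batch)
print(probs.shape)  # (4, 12)

n_params = W1.size + b1.size + W2.size + b2.size
print("parameters in this sketch:", n_params)
```

Even this small network has about a million parameters, which is worth keeping in mind when comparing training and test scores on a training set of 6,264 molecules.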

In [16]:
# fit to data
model.fit(train_dataset, nb_epoch=10)

W0828 20:48:51.576301 140439255508864 deprecation_wrapper.py:119] From /usr/local/lib/python3.7/site-packages/deepchem/models/tensorgraph/tensor_graph.py:715: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0828 20:48:51.587562 140439255508864 deprecation_wrapper.py:119] From /usr/local/lib/python3.7/site-packages/deepchem/models/tensorgraph/layers.py:2464: The name tf.FIFOQueue is deprecated. Please use tf.queue.FIFOQueue instead.

W0828 20:48:51.600764 140439255508864 deprecation_wrapper.py:119] From /usr/local/lib/python3.7/site-packages/deepchem/models/tensorgraph/layers.py:1216: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0828 20:48:51.751803 140439255508864 deprecation_wrapper.py:119] From /usr/local/lib/python3.7/site-packages/deepchem/models/tensorgraph/tensor_graph.py:728: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

W0828 20:48:51.788164 14

809.6898889935206

# Evaluating the Model

In [17]:
# evaluate the model with ROC AUC, averaged over the 12 tasks
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)

computed_metrics: [0.9912407414588746, 0.9961745343609505, 0.9590777287652527, 0.98010221828099, 0.9047563637975553, 0.9826244234555003, 0.9901381586563183, 0.9043029298787376, 0.9862992664407062, 0.9689020349391946, 0.9444702011880124, 0.97462885271578]
computed_metrics: [0.7723298284449363, 0.863822851683246, 0.8993922593026975, 0.8125915846765752, 0.7106472907906489, 0.8043376710043377, 0.7350215160542866, 0.7142704201226109, 0.8608110862033829, 0.7101598549769281, 0.872194047223146, 0.8011363636363638]


In [0]:
print('train_scores:', train_scores)
print('test_scores: ', test_scores)

train_scores: {'mean-roc_auc_score': 0.9646639567598659}
test_scores:  {'mean-roc_auc_score': 0.7926642120599715}
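
The metric used above is the ROC AUC computed per task and then averaged with `np.mean`. The sketch below shows that idea in plain NumPy on toy data: the AUC is computed as the fraction of (active, inactive) pairs that the scores rank correctly, and entries with zero weight are skipped. The function names and toy arrays are hypothetical, and the handling of missing entries is an assumption about how such an evaluation can be done, not a copy of DeepChem's code.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive with every negative; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size))

def mean_roc_auc(y, scores, w):
    """Average the per-task AUC, using only entries with non-zero weight."""
    per_task = []
    for task in range(y.shape[1]):
        measured = w[:, task] != 0
        per_task.append(roc_auc(y[measured, task], scores[measured, task]))
    return float(np.mean(per_task)), per_task

# Toy data: 4 molecules, 2 tasks; molecule 3 was not measured in task 1.
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
scores = np.array([[0.9, 0.4], [0.2, 0.8], [0.6, 0.3], [0.7, 0.5]])
w = np.array([[1, 1], [1, 1], [1, 1], [1, 0]], dtype=float)

mean_auc, per_task_auc = mean_roc_auc(y, scores, w)
print("per-task AUC:", per_task_auc)
print("mean AUC:", mean_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the gap between the training score (about 0.96) and the test score (about 0.79) above suggests the model fits the training set considerably better than it generalizes.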
