# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of the more advanced features of Snorkel, so we assume you have followed the Intro tutorial.

Let's start by reloading from the last notebook.

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession

session = SnorkelSession()

In [4]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

train = session.query(ChemicalDisease).filter(ChemicalDisease.split == 0).all()
dev = session.query(ChemicalDisease).filter(ChemicalDisease.split == 1).all()
test = session.query(ChemicalDisease).filter(ChemicalDisease.split == 2).all()

print 'Training set:\t{0} candidates'.format(len(train))
print 'Dev set:\t{0} candidates'.format(len(dev))
print 'Test set:\t{0} candidates'.format(len(test))    

Training set:	8272 candidates
Dev set:	888 candidates
Test set:	4620 candidates


In [5]:
from snorkel.annotations import load_marginals
train_marginals = load_marginals(session, split=0)
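
The marginals loaded here are probabilities rather than hard labels, and the models below are trained directly against them. As a rough illustration of what training on probabilistic labels can look like, the sketch below computes a cross-entropy loss in which each candidate contributes a positive and a negative term, weighted by its marginal. This is an assumption made for illustration, not a description of Snorkel's internals; `noise_aware_loss` and its arguments are hypothetical names.

```python
import math

def noise_aware_loss(marginals, predictions, eps=1e-12):
    """Expected log loss of predictions under probabilistic labels.

    marginals:   P(y = 1) for each candidate (the soft training labels).
    predictions: the classifier's predicted P(y = 1) for each candidate.
    Both names are hypothetical and only serve this illustration.
    """
    total = 0.0
    for p_true, p_pred in zip(marginals, predictions):
        # Clamp predictions away from 0 and 1 so the logarithms stay finite.
        p_pred = min(max(p_pred, eps), 1.0 - eps)
        total -= p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred)
    return total / len(marginals)

# Same predictions, scored against confident and against uncertain labels.
confident = noise_aware_loss([1.0, 0.0], [0.9, 0.1])
uncertain = noise_aware_loss([0.5, 0.5], [0.9, 0.1])
print('confident: {0:.3f}, uncertain: {1:.3f}'.format(confident, uncertain))
```

With hard labels (marginals of exactly 0 or 1) this reduces to the usual log loss; with a marginal of 0.5, a confident prediction in either direction is penalized.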

Load gold labels for validation and scoring.

In [6]:
from load_external_annotations import load_external_labels
from snorkel.annotations import load_gold_labels
load_external_labels(session, ChemicalDisease, split=1, annotator='gold')
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)
load_external_labels(session, ChemicalDisease, split=0, annotator='gold')
L_gold_train = load_gold_labels(session, annotator_name='gold', split=0)
load_external_labels(session, ChemicalDisease, split=2, annotator='gold')
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)


AnnotatorLabels created: 0
AnnotatorLabels created: 0
AnnotatorLabels created: 0


Create Features

In [7]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()
F_train = featurizer.apply(split=0)
F_dev  = featurizer.apply_existing(split=1)
F_test = featurizer.apply_existing(split=2)

Clearing existing...
Running UDF...

Clearing existing...
Running UDF...

Clearing existing...
Running UDF...



If the features have already been computed, we can load them directly instead:

In [None]:
F_train = featurizer.load_matrix(session, split=0)
F_dev   = featurizer.load_matrix(session, split=1)
F_test  = featurizer.load_matrix(session, split=2)

# Part V: Training an extraction model

In the intro tutorial, we automatically featurized the candidates and trained a linear model over these features. Here, we'll first train two models over the feature matrices computed above (a sparse logistic regression model and an XGBoost model, each with a random hyperparameter search), and then a more complex model for relation extraction: an LSTM network. You can read more about LSTMs [here](https://en.wikipedia.org/wiki/Long_short-term_memory) or [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). An LSTM is a type of recurrent neural network that automatically generates a numerical representation of each candidate from the sentence text, so it needs no explicit featurization. LSTMs take longer to train, and Snorkel doesn't currently support hyperparameter searches for them, so we'll train a single LSTM here, but feel free to try out other parameter sets. Just make sure to use the development set, and not the test set, for model selection.

### Train a Sparse Logistic Regression Model

In [19]:
from snorkel.learning import SparseLogisticRegression
SLR_model = SparseLogisticRegression()

In [20]:
from snorkel.learning.utils import MentionScorer
from snorkel.learning import RandomSearch, ListParameter, RangeParameter

# Search over the learning rate and the L1/L2 penalties, each on a log10 scale from 1e-6 to 1e-2
rate_param = RangeParameter('lr', 1e-6, 1e-2, step=1, log_base=10)
l1_param  = RangeParameter('l1_penalty', 1e-6, 1e-2, step=1, log_base=10)
l2_param  = RangeParameter('l2_penalty', 1e-6, 1e-2, step=1, log_base=10)

SLR_searcher = RandomSearch(session, SLR_model, F_train, train_marginals, [rate_param, l1_param, l2_param], n=20)

Initialized RandomSearch search of size 20. Search space size = 125.
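
The reported search space size can be checked by hand. The sketch below counts grid points under the assumption that, when `log_base` is set, `step` applies to the exponent, so each of the three parameters takes the values 1e-6, 1e-5, 1e-4, 1e-3 and 1e-2. `range_size` is a hypothetical helper written for this illustration, not part of Snorkel; under that assumption it reproduces the 125 printed above.

```python
import math

def range_size(v1, v2, step=1, log_base=None):
    """Number of grid points between v1 and v2, inclusive.

    Hypothetical helper: with log_base set, the step is assumed to apply
    to the exponent rather than to the value itself.
    """
    lo, hi = min(v1, v2), max(v1, v2)
    if log_base:
        lo, hi = math.log(lo, log_base), math.log(hi, log_base)
    # Round before counting so floating-point drift does not drop a point.
    return int(round((hi - lo) / step)) + 1

param_names = ['lr', 'l1_penalty', 'l2_penalty']
sizes = [range_size(1e-6, 1e-2, step=1, log_base=10) for _ in param_names]

space_size = 1
for size in sizes:
    space_size *= size

print('values per parameter: {0}, search space size: {1}'.format(sizes, space_size))
```

With `n=20`, the random search therefore evaluates at most 20 of the 125 possible configurations.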


In [21]:
import numpy as np
np.random.seed(1701)
SLR_searcher.fit(F_dev, L_gold_dev, n_epochs=50, rebalance=0.5, print_freq=25)

[1] Testing lr = 1.00e-02, l1_penalty = 1.00e-03, l2_penalty = 1.00e-04
[SparseLR] lr=0.01 l1=0.001 l2=0.0001
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=6714  #epochs=50  batch size=100
[SparseLR] Epoch 0 (4.78s)	Avg. loss=0.689868	NNZ=122840
[SparseLR] Epoch 25 (70.02s)	Avg. loss=0.637209	NNZ=122840
[SparseLR] Epoch 49 (130.82s)	Avg. loss=0.640750	NNZ=122840
[SparseLR] Training done (130.82s)
[SparseLR] Model saved. To load, use name
		SparseLR_0
[2] Testing lr = 1.00e-04, l1_penalty = 1.00e-06, l2_penalty = 1.00e-03
[SparseLR] lr=0.0001 l1=1e-06 l2=0.001
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=6714  #epochs=50  batch size=100
[SparseLR] Epoch 0 (2.73s)	Avg. loss=0.769628	NNZ=122840
[SparseLR] Epoch 25 (66.21s)	Avg. loss=0.591674	NNZ=122840
[SparseLR] Epoch 49 (127.43s)	Avg. loss=0.548004	NNZ=122840
[SparseLR] Training done (127.43s)
[3] Testing lr = 1.00e-03, l1_penalty = 1.00e-05, l2_penalty = 1.00e-05
[SparseLR] lr=0.00

| | lr | l1_penalty | l2_penalty | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|
| 2 | 0.001 | 1e-05 | 1e-05 | 0.45614 | 0.702703 | 0.553191 |
| 3 | 0.001 | 1e-06 | 0.001 | 0.461187 | 0.682432 | 0.550409 |
| 13 | 0.001 | 1e-05 | 0.0001 | 0.455782 | 0.679054 | 0.545455 |
| 0 | 0.01 | 0.001 | 0.0001 | 0.458234 | 0.648649 | 0.537063 |
| 11 | 0.001 | 1e-06 | 0.001 | 0.447552 | 0.648649 | 0.529655 |
| 14 | 0.0001 | 0.01 | 1e-05 | 0.454988 | 0.631757 | 0.528996 |
| 12 | 0.0001 | 0.0001 | 1e-05 | 0.435841 | 0.665541 | 0.526738 |
| 7 | 0.01 | 1e-05 | 0.01 | 0.488235 | 0.560811 | 0.522013 |
| 1 | 0.0001 | 1e-06 | 0.001 | 0.438073 | 0.64527 | 0.521858 |
| 8 | 0.0001 | 0.01 | 1e-06 | 0.446602 | 0.621622 | 0.519774 |
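
The results above are ranked by F1 on the development set, which is the quantity model selection should be based on. As a minimal sketch of that selection step, the snippet below picks the best configuration from three rows copied from the results; the tuple layout and the `best_run` helper are hypothetical and only serve this example.

```python
# (run index, lr, l1_penalty, l2_penalty, dev F1), copied from the results above
runs = [
    (0, 0.01, 0.001, 0.0001, 0.537063),
    (2, 0.001, 1e-05, 1e-05, 0.553191),
    (3, 0.001, 1e-06, 0.001, 0.550409),
]

def best_run(rows):
    """Return the row with the highest dev F1 (the last field of each row)."""
    return max(rows, key=lambda row: row[-1])

best = best_run(runs)
print('best run: {0} (dev F1 = {1:.3f})'.format(best[0], best[-1]))
```

The test set plays no part in this choice; it is only used once, for the final scores.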


### Train an XGBoost Model

In [25]:
from snorkel.learning.XGBoost_NoiseAware import XGBoost_NoiseAware
from snorkel.learning import RandomSearch, ListParameter, RangeParameter
import numpy as np
xgb_model = XGBoost_NoiseAware()
xgb_model2 = XGBoost_NoiseAware()

Hyperparameter search.<br>
Other hyperparameters that are not searched over below:<br>
'subsample'<br>
'colsample_bytree'<br>
'lambda_val'<br>
'alpha_val'<br>
See https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/ for details on XGBoost hyperparameter tuning.

In [37]:
eta_param = RangeParameter('eta', .2, .3, step=.05)
max_depth_param  = RangeParameter('max_depth', 4, 6, step=1)
min_child_weight_param  = RangeParameter('min_child_weight', 1, 6, step=1)
gamma_param = RangeParameter('gamma', 0, .5, step = .1)
num_rounds_param = RangeParameter('num_rounds', 140, 300, 20)

xgb_searcher = RandomSearch(session, xgb_model, F_train, train_marginals, [eta_param, max_depth_param, min_child_weight_param, gamma_param, num_rounds_param], n=100)

np.random.seed(1701)
xgb_searcher.fit(F_dev, L_gold_dev)


Initialized RandomSearch search of size 100. Search space size = 2916.
[1] Testing eta = 3.00e-01, max_depth = 6.00e+00, min_child_weight = 3.00e+00, gamma = 2.00e-01, num_rounds = 2.80e+02
[2] Testing eta = 2.00e-01, max_depth = 4.00e+00, min_child_weight = 4.00e+00, gamma = 4.00e-01, num_rounds = 1.40e+02
[3] Testing eta = 3.00e-01, max_depth = 5.00e+00, min_child_weight = 1.00e+00, gamma = 4.00e-01, num_rounds = 2.80e+02
[4] Testing eta = 3.00e-01, max_depth = 4.00e+00, min_child_weight = 4.00e+00, gamma = 5.00e-01, num_rounds = 2.40e+02
[5] Testing eta = 3.00e-01, max_depth = 4.00e+00, min_child_weight = 1.00e+00, gamma = 2.00e-01, num_rounds = 2.60e+02
[6] Testing eta = 2.50e-01, max_depth = 6.00e+00, min_child_weight = 5.00e+00, gamma = 5.00e-01, num_rounds = 2.00e+02
[7] Testing eta = 2.00e-01, max_depth = 4.00e+00, min_child_weight = 2.00e+00, gamma = 4.00e-01, num_rounds = 2.00e+02
[8] Testing eta = 2.00e-01, max_depth = 6.00e+00, min_child_weight = 2.00e+00, gamma = 1.00e-01,

| | eta | max_depth | min_child_weight | gamma | num_rounds | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.30 | 6 | 3 | 0.2 | 280 | 0.472813 | 0.675676 | 0.556328 |
| 7 | 0.20 | 6 | 2 | 0.1 | 180 | 0.470449 | 0.672297 | 0.553547 |
| 17 | 0.30 | 5 | 6 | 0.0 | 240 | 0.474328 | 0.655405 | 0.550355 |
| 76 | 0.25 | 6 | 5 | 0.1 | 280 | 0.472155 | 0.658784 | 0.550071 |
| 11 | 0.30 | 4 | 5 | 0.3 | 240 | 0.467933 | 0.665541 | 0.549512 |
| 62 | 0.30 | 5 | 3 | 0.2 | 300 | 0.468900 | 0.662162 | 0.549020 |
| 97 | 0.20 | 4 | 6 | 0.1 | 160 | 0.463700 | 0.668919 | 0.547718 |
| 12 | 0.30 | 6 | 2 | 0.0 | 140 | 0.457014 | 0.682432 | 0.547425 |
| 59 | 0.30 | 6 | 4 | 0.3 | 220 | 0.464623 | 0.665541 | 0.547222 |
| 68 | 0.30 | 4 | 2 | 0.3 | 280 | 0.464623 | 0.665541 | 0.547222 |
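
The XGBoost search above evaluates 100 configurations out of a grid of 2916. The sketch below rebuilds that grid from the ranges in the search cell and draws random configurations from it. It is not Snorkel's `RandomSearch`: the `grid` dictionary and `sample_config` are hypothetical, and the sketch assumes each parameter is drawn independently, which means the same configuration can come up more than once.

```python
import random

# Candidate values per hyperparameter, written out to match the ranges above.
grid = {
    'eta': [0.2, 0.25, 0.3],
    'max_depth': [4, 5, 6],
    'min_child_weight': [1, 2, 3, 4, 5, 6],
    'gamma': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    'num_rounds': list(range(140, 301, 20)),
}

# The search space size is the product of the number of values per parameter.
space_size = 1
for values in grid.values():
    space_size *= len(values)

def sample_config(rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return dict((name, rng.choice(values)) for name, values in sorted(grid.items()))

rng = random.Random(1701)
configs = [sample_config(rng) for _ in range(100)]

print('search space size: {0}, sampled: {1}'.format(space_size, len(configs)))
```

Sampling 100 of 2916 configurations covers roughly 3.4% of the grid, which is why a random search is so much cheaper than an exhaustive one here.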


Alternatively, train without the hyperparameter search:
```
xgb_model.train(F_train, train_marginals, num_rounds=250)
```

### Train an LSTM Model

In [None]:
from snorkel.contrib.rnn import reRNN

train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   50,
    'dropout':    0.5,
    'rebalance':  0.25,
    'print_freq': 5
}

lstm = reRNN(seed=1701, n_threads=None)
dev_labels = (np.ravel(L_gold_dev.todense()) + 1) / 2
lstm.train(train, train_marginals, dev_candidates=dev, dev_labels=dev_labels, **train_kwargs)
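
The `dev_labels` line above rescales the gold labels before passing them to the LSTM. Assuming the gold labels are stored as -1 (negative) and +1 (positive), `(label + 1) / 2` maps them to 0 and 1. The sketch below shows the same mapping on a plain list; `to_binary` is a hypothetical helper, and it uses explicit integer division so that it behaves the same on Python 2 and Python 3.

```python
def to_binary(labels):
    """Map gold labels in {-1, +1} to {0, 1} via (label + 1) // 2."""
    return [(label + 1) // 2 for label in labels]

gold = [1, -1, -1, 1]
binary = to_binary(gold)
print(binary)
```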

# Scoring on the test set

Finally, we'll evaluate our performance on the blind test set of 500 documents. The gold test labels were loaded above, in the same way as the development labels, so we can use the `score` function of each extraction model to see how we did.

In [22]:
_, _, _, _ = SLR_model.score(session, F_test, L_gold_test)

Scores (Un-adjusted)
Pos. class accuracy: 0.704
Neg. class accuracy: 0.588
Precision            0.452
Recall               0.704
F1                   0.551
----------------------------------------
TP: 1060 | FP: 1283 | TN: 1832 | FN: 445



In [38]:
_, _, _, _ = xgb_model.score(session, F_test, L_gold_test)

Scores (Un-adjusted)
Pos. class accuracy: 0.71
Neg. class accuracy: 0.601
Precision            0.462
Recall               0.71
F1                   0.56
----------------------------------------
TP: 1068 | FP: 1244 | TN: 1871 | FN: 437
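
The summary scores printed above follow directly from the TP/FP/TN/FN counts. The sketch below recomputes them for both models; `binary_scores` is a hypothetical helper written for this check, not part of Snorkel. Note that the positive-class accuracy is the same quantity as recall.

```python
def binary_scores(tp, fp, tn, fn):
    """Precision, recall, F1 and negative-class accuracy from raw counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)  # identical to the positive-class accuracy
    neg_accuracy = tn / float(tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        'precision': precision,
        'recall': recall,
        'neg_accuracy': neg_accuracy,
        'f1': f1,
    }

# Counts copied from the two score reports above.
slr = binary_scores(tp=1060, fp=1283, tn=1832, fn=445)
xgb = binary_scores(tp=1068, fp=1244, tn=1871, fn=437)

for name, scores in [('SparseLR', slr), ('XGBoost', xgb)]:
    print('{0}: P={1:.3f} R={2:.3f} F1={3:.3f}'.format(
        name, scores['precision'], scores['recall'], scores['f1']))
```

In both reports the four counts sum to 4620, the number of test candidates, so every candidate is accounted for.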



In [None]:
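# The LSTM was trained on candidates rather than on the feature matrix, so score it on the test candidates
_, _, _, _ = lstm.score(session, test, L_gold_test)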