# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of Snorkel's more advanced features, so we'll assume you've followed the Intro tutorial.

Let's start by reloading from the last notebook.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession

session = SnorkelSession()

In [2]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

train = session.query(ChemicalDisease).filter(ChemicalDisease.split == 0).all()
dev = session.query(ChemicalDisease).filter(ChemicalDisease.split == 1).all()
test = session.query(ChemicalDisease).filter(ChemicalDisease.split == 2).all()

print('Training set:\t{0} candidates'.format(len(train)))
print('Dev set:\t{0} candidates'.format(len(dev)))
print('Test set:\t{0} candidates'.format(len(test)))

Training set:	8272 candidates
Dev set:	888 candidates
Test set:	4620 candidates


In [3]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

# Computing features

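We first compute features for the candidates; the `SparseLogisticRegression` model trained below takes these as input.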

In [4]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

%time F_train = featurizer.apply(split=0)
F_train

Clearing existing...
Running UDF...

CPU times: user 10min 35s, sys: 5.96 s, total: 10min 41s
Wall time: 14min 51s


<8272x122840 sparse matrix of type '<type 'numpy.int64'>'
	with 448906 stored elements in Compressed Sparse Row format>

In [5]:
%%time
F_dev  = featurizer.apply_existing(split=1)
F_test = featurizer.apply_existing(split=2)

Clearing existing...
Running UDF...

Clearing existing...
Running UDF...

CPU times: user 5min 27s, sys: 2.73 s, total: 5min 30s
Wall time: 5min 37s


Or, if the features have already been computed, reload them:

In [6]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

In [7]:
F_train = featurizer.load_matrix(session, split=0)
F_dev = featurizer.load_matrix(session, split=1)
F_test = featurizer.load_matrix(session, split=2)

# Training `SparseLogReg`

To start, we train a sparse logistic regression model instead of an LSTM. First, we reload the training marginals:

In [19]:
from snorkel.annotations import load_marginals
train_marginals = load_marginals(session, F_train, split=0)

In [20]:
from snorkel.learning import SparseLogisticRegression
from snorkel.learning.utils import RandomSearch, ListParameter, RangeParameter

# Searching over learning rate and L1/L2 penalties
rate_param = RangeParameter('lr', 1e-6, 1e-2, step=1, log_base=10)
l1_param  = RangeParameter('l1_penalty', 1e-6, 1e-2, step=1, log_base=10)
l2_param  = RangeParameter('l2_penalty', 1e-6, 1e-2, step=1, log_base=10)

# NOTE: A larger search (n) would likely lead to a higher score!
searcher = RandomSearch(SparseLogisticRegression, [rate_param, l1_param, l2_param], F_train,
                        Y_train=train_marginals, n=5)

Initialized RandomSearch search of size 5. Search space size = 125.
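
The reported search space size can be reproduced by hand. The sketch below assumes that a `RangeParameter` with `step=1` and `log_base=10` enumerates the powers of ten from `1e-6` to `1e-2` inclusive, which is consistent with the size of 125 printed above. The `log_grid` helper is hypothetical and is not part of Snorkel.

```python
from itertools import product

def log_grid(low_exp, high_exp, step=1, base=10):
    """Hypothetical helper: base**e for e = low_exp, low_exp + step, ..., high_exp."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1, step)]

# Assumed grids for the three parameters searched above.
lr_values = log_grid(-6, -2)
l1_values = log_grid(-6, -2)
l2_values = log_grid(-6, -2)

# Every combination of (lr, l1_penalty, l2_penalty).
search_space = list(product(lr_values, l1_values, l2_values))
print(len(lr_values), len(search_space))
```

With `n=5`, the random search therefore tries only 5 of the 125 combinations, which is why a larger `n` is likely to find a better configuration.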


In [21]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)
L_gold_dev

<888x1 sparse matrix of type '<type 'numpy.int64'>'
	with 888 stored elements in Compressed Sparse Row format>

In [22]:
%%time
import numpy as np
np.random.seed(1701)
disc_model, run_stats = searcher.fit(F_dev, L_gold_dev, n_epochs=50, rebalance=0.5, print_freq=25)

[1] Testing lr = 1.00e-02, l1_penalty = 1.00e-06, l2_penalty = 1.00e-04
[SparseLogisticRegression] Training model
[SparseLogisticRegression] n_train=6736  #epochs=50  batch size=256
[SparseLogisticRegression] Epoch 0 (0.40s)	Average loss=0.785354
[SparseLogisticRegression] Epoch 25 (10.63s)	Average loss=0.711141
[SparseLogisticRegression] Epoch 49 (21.19s)	Average loss=0.716004
[SparseLogisticRegression] Training done (21.19s)
[SparseLogisticRegression] F1 Score: 0.509433962264
[SparseLogisticRegression] Model saved as <SparseLogisticRegression_0>
[2] Testing lr = 1.00e-04, l1_penalty = 1.00e-06, l2_penalty = 1.00e-03
[SparseLogisticRegression] Training model
[SparseLogisticRegression] n_train=6736  #epochs=50  batch size=256
[SparseLogisticRegression] Epoch 0 (0.45s)	Average loss=1.361492
[SparseLogisticRegression] Epoch 25 (12.97s)	Average loss=1.265687
[SparseLogisticRegression] Epoch 49 (22.52s)	Average loss=1.236134
[SparseLogisticRegression] Training done (22.52s)
[SparseLogistic

In [23]:
run_stats

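|   | lr | l1_penalty | l2_penalty | Prec. | Rec. | F1 |
|---|----|------------|------------|-------|------|----|
| 3 | 0.001 | 0.0001 | 0.001 | 0.42029 | 0.685811 | 0.521181 |
| 0 | 0.01 | 1e-06 | 0.0001 | 0.423767 | 0.638514 | 0.509434 |
| 4 | 0.01 | 1e-06 | 0.0001 | 0.41704 | 0.628378 | 0.501348 |
| 1 | 0.0001 | 1e-06 | 0.001 | 0.423888 | 0.611486 | 0.500692 |
| 2 | 0.001 | 0.01 | 0.0001 | 0.387914 | 0.672297 | 0.491965 |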
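
Model selection here amounts to picking the configuration with the highest dev-set F1. As a minimal sketch, the snippet below copies the values from the `run_stats` output into plain dictionaries and selects the best run. The key names (`run`, `f1`, and so on) are chosen for illustration and are not taken from Snorkel.

```python
# Dev-set results copied from the run_stats output; "run" is the index column.
runs = [
    {"run": 3, "lr": 1e-3, "l1_penalty": 1e-4, "l2_penalty": 1e-3, "f1": 0.521181},
    {"run": 0, "lr": 1e-2, "l1_penalty": 1e-6, "l2_penalty": 1e-4, "f1": 0.509434},
    {"run": 4, "lr": 1e-2, "l1_penalty": 1e-6, "l2_penalty": 1e-4, "f1": 0.501348},
    {"run": 1, "lr": 1e-4, "l1_penalty": 1e-6, "l2_penalty": 1e-3, "f1": 0.500692},
    {"run": 2, "lr": 1e-3, "l1_penalty": 1e-2, "l2_penalty": 1e-4, "f1": 0.491965},
]

# Select the configuration with the highest dev-set F1.
best = max(runs, key=lambda r: r["f1"])
print(best["run"], best["lr"], best["l1_penalty"], best["l2_penalty"], best["f1"])
```

Runs 0 and 4 share the same hyperparameters yet score differently, so some of the spread between configurations comes from randomness in training rather than from the parameters themselves.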


### Scoring on the test set

Finally, we'll evaluate our performance on the blind test set of 500 documents. We'll load the gold labels the same way we did for the development set, and use the `error_analysis` method of our extraction model to see how we did.

In [13]:
from load_external_annotations import load_external_labels
load_external_labels(session, ChemicalDisease, split=2, annotator='gold')

AnnotatorLabels created: 0


In [14]:
from snorkel.annotations import load_gold_labels
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
L_gold_test

<4620x1 sparse matrix of type '<type 'numpy.int64'>'
	with 4620 stored elements in Compressed Sparse Row format>

In [24]:
_, _, _, _ = disc_model.error_analysis(session, F_test, L_gold_test)

Scores (Un-adjusted)
Pos. class accuracy: 0.728
Neg. class accuracy: 0.507
Precision            0.416
Recall               0.728
F1                   0.53
----------------------------------------
TP: 1095 | FP: 1535 | TN: 1580 | FN: 410
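
The scores above follow directly from the four confusion counts. The sketch below recomputes them in plain Python. The `binary_metrics` function is a hypothetical helper written for this illustration, not a Snorkel function.

```python
def binary_metrics(tp, fp, tn, fn):
    """Recompute standard binary classification scores from confusion counts."""
    precision = tp / float(tp + fp)      # fraction of predicted positives that are correct
    recall = tp / float(tp + fn)         # fraction of true positives recovered
    f1 = 2 * precision * recall / (precision + recall)
    neg_accuracy = tn / float(tn + fp)   # fraction of true negatives recovered
    return precision, recall, f1, neg_accuracy

# Counts taken from the error analysis output above.
precision, recall, f1, neg_accuracy = binary_metrics(tp=1095, fp=1535, tn=1580, fn=410)
print(round(precision, 3), round(recall, 3), round(f1, 2), round(neg_accuracy, 3))
# prints: 0.416 0.728 0.53 0.507
```

Note that the positive class accuracy reported above is the same quantity as recall.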



# Part V: Training an LSTM extraction model

In the intro tutorial, we automatically featurized the candidates and trained a linear model over these features. Here, we'll train a more complicated model for relation extraction: an LSTM network. You can read more about LSTMs [here](https://en.wikipedia.org/wiki/Long_short-term_memory) or [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). An LSTM is a type of recurrent neural network that automatically generates a numerical representation of each candidate from the sentence text, so there is no need to featurize explicitly as in the intro tutorial. LSTMs take longer to train, and Snorkel doesn't currently support hyperparameter searches for them. We'll train a single model here, but feel free to try out other parameter sets. Just make sure to use the development set, and not the test set, for model selection.

**Note: Again, training for more epochs than below will greatly improve performance. Try it out!**

In [26]:
from snorkel.learning import reRNN

train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   20,
    'dropout':    0.5,
    'rebalance':  0.25,
    'print_freq': 5
}

lstm = reRNN(seed=1701, n_threads=None)
lstm.train(train, train_marginals, X_dev=dev, Y_dev=L_gold_dev, **train_kwargs)

[reRNN] Training model
[reRNN] n_train=4491  #epochs=20  batch size=256
[reRNN] Epoch 0 (27.01s)	Average loss=0.663680	Dev F1=29.93
[reRNN] Epoch 5 (165.15s)	Average loss=0.582636	Dev F1=49.40
[reRNN] Epoch 10 (296.02s)	Average loss=0.569274	Dev F1=49.92
[reRNN] Epoch 15 (425.11s)	Average loss=0.566656	Dev F1=51.51
[reRNN] Epoch 19 (526.96s)	Average loss=0.566059	Dev F1=50.78
[reRNN] Model saved as <reRNN>
[reRNN] Training done (530.02s)
[reRNN] Loaded model <reRNN>


In [27]:
lstm.score(test, L_gold_test)

(0.47311212814645309, 0.54950166112956811, 0.5084537350138334)
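
The returned tuple can be read as (precision, recall, F1). This ordering is an assumption here, but it is supported by the numbers: the third value is the harmonic mean of the first two, as the short check below illustrates.

```python
# Values copied from the output of lstm.score above; the (precision, recall, F1)
# ordering is an assumption that the check below supports.
precision, recall, f1 = (0.47311212814645309, 0.54950166112956811, 0.5084537350138334)

# F1 is defined as the harmonic mean of precision and recall.
harmonic_mean = 2 * precision * recall / (precision + recall)
difference = abs(harmonic_mean - f1)
print(round(harmonic_mean, 4), round(f1, 4))
```

On this run the LSTM reaches a test F1 of about 0.51, slightly below the 0.53 of the sparse logistic regression model, with higher precision and lower recall.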