# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of the more advanced features of Snorkel, so we'll assume you've followed the Intro tutorial.

Let's start by reloading the candidates and training marginals from the last notebook.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

PARALLELISM = 6

import os
os.environ['SNORKELDB'] = 'postgres://localhost:5432/semparse_cdr'

from snorkel import SnorkelSession
session = SnorkelSession()

In [2]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

train = session.query(ChemicalDisease).filter(ChemicalDisease.split == 0).all()
dev = session.query(ChemicalDisease).filter(ChemicalDisease.split == 1).all()
test = session.query(ChemicalDisease).filter(ChemicalDisease.split == 2).all()

print('Training set:\t{0} candidates'.format(len(train)))
print('Dev set:\t{0} candidates'.format(len(dev)))
print('Test set:\t{0} candidates'.format(len(test)))

Training set:	8272 candidates
Dev set:	888 candidates
Test set:	4620 candidates


In [3]:
from snorkel.annotations import load_marginals
train_marginals = load_marginals(session, split=0)

# (NEW) Featurize

In [4]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

In [5]:
# Run once to compute and store the features; later sessions can reload them with load_matrix.
# %time F_train = featurizer.apply(split=0, parallelism=PARALLELISM)
# F_train

In [6]:
# %time F_dev  = featurizer.apply_existing(split=1, parallelism=PARALLELISM)
# F_dev

In [7]:
# %time F_test = featurizer.apply_existing(split=2, parallelism=PARALLELISM)
# F_test

In [8]:
F_train = featurizer.load_matrix(session, split=0)
F_train

<8272x122840 sparse matrix of type '<type 'numpy.float64'>'
	with 448906 stored elements in Compressed Sparse Row format>

In [9]:
F_dev = featurizer.load_matrix(session, split=1)
F_dev

<888x122840 sparse matrix of type '<type 'numpy.float64'>'
	with 27734 stored elements in Compressed Sparse Row format>

In [10]:
F_test = featurizer.load_matrix(session, split=2)
F_test

<4620x122840 sparse matrix of type '<type 'numpy.float64'>'
	with 138487 stored elements in Compressed Sparse Row format>

# Training Logistic Regression

In [11]:
from snorkel.annotations import load_gold_labels
L_gold_train = load_gold_labels(session, annotator_name='gold', split=0)
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

In [12]:
from snorkel.learning import SparseLogisticRegression
disc_model = SparseLogisticRegression()

In [13]:
from snorkel.learning.utils import MentionScorer
from snorkel.learning import RandomSearch, ListParameter, RangeParameter

# Search over the learning rate and the L1/L2 penalties, each on a log scale from 1e-6 to 1e-2
rate_param = RangeParameter('lr', 1e-6, 1e-2, step=1, log_base=10)
l1_param  = RangeParameter('l1_penalty', 1e-6, 1e-2, step=1, log_base=10)
l2_param  = RangeParameter('l2_penalty', 1e-6, 1e-2, step=1, log_base=10)

searcher = RandomSearch(session, disc_model, F_train, train_marginals, [rate_param, l1_param, l2_param], n=20)

Initialized RandomSearch search of size 20. Search space size = 125.
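
The reported search space size of 125 follows from the three `RangeParameter` definitions: each spans the exponents -6 through -2 in steps of one, giving five values per parameter and 5 × 5 × 5 = 125 combinations, of which 20 are tried. The sketch below reproduces that arithmetic in plain Python. The helpers `log_range` and `sample_configs` are hypothetical names used for illustration, not Snorkel APIs, and sampling without replacement is an assumption of this sketch rather than a statement about how `RandomSearch` draws its configurations.

```python
import itertools
import random


def log_range(low_exp, high_exp, step=1, base=10):
    """Return base**low_exp, ..., base**high_exp, moving `step` exponents at a time."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1, step)]


def sample_configs(param_grid, n, seed=0):
    """Enumerate the full grid, then draw n distinct configurations at random.

    Assumption: sampling is without replacement; Snorkel may behave differently.
    """
    names = sorted(param_grid)
    space = list(itertools.product(*(param_grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(space, min(n, len(space)))
    return len(space), [dict(zip(names, combo)) for combo in chosen]


# Mirrors the three parameters searched over in the cell above.
grid = {
    "lr": log_range(-6, -2),
    "l1_penalty": log_range(-6, -2),
    "l2_penalty": log_range(-6, -2),
}

space_size, configs = sample_configs(grid, n=20)
print("Search space size = {0}; sampled {1} configurations".format(space_size, len(configs)))
```

Sampling 20 of 125 points covers about a sixth of the grid, which is why the best configuration found should be read as a good one rather than the optimum.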


In [14]:
%time searcher.fit(F_dev, L_gold_dev, n_epochs=50, rebalance=True, print_freq=25)

[1] Testing lr = 1.00e-04, l1_penalty = 1.00e-03, l2_penalty = 1.00e-04
[SparseLR] lr=0.0001 l1=0.001 l2=0.0001
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=3686  #epochs=50  batch size=100
[SparseLR] Epoch 0 (1.01s)	Avg. loss=0.690457	NNZ=122840
[SparseLR] Epoch 25 (23.38s)	Avg. loss=-0.241990	NNZ=122840
[SparseLR] Epoch 49 (46.38s)	Avg. loss=-1.055238	NNZ=122840
[SparseLR] Training done (46.38s)
[SparseLR] Model saved. To load, use name
		SparseLR_0
[2] Testing lr = 1.00e-05, l1_penalty = 1.00e-06, l2_penalty = 1.00e-03
[SparseLR] lr=1e-05 l1=1e-06 l2=0.001
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=3686  #epochs=50  batch size=100
[SparseLR] Epoch 0 (1.03s)	Avg. loss=0.691683	NNZ=122840
[SparseLR] Epoch 25 (23.28s)	Avg. loss=0.595268	NNZ=122840
[SparseLR] Epoch 49 (44.58s)	Avg. loss=0.503365	NNZ=122840
[SparseLR] Training done (44.58s)
[3] Testing lr = 1.00e-06, l1_penalty = 1.00e-03, l2_penalty = 1.00e-03
[SparseLR] lr=1e-06

|    | lr | l1_penalty | l2_penalty | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|
| 8 | 0.001 | 1e-06 | 0.01 | 0.37224 | 0.797297 | 0.507527 |
| 5 | 1e-05 | 0.01 | 0.001 | 0.379195 | 0.763514 | 0.506726 |
| 15 | 0.001 | 1e-05 | 0.001 | 0.35443 | 0.851351 | 0.500497 |
| 3 | 0.01 | 0.0001 | 1e-06 | 0.359584 | 0.817568 | 0.499484 |
| 17 | 0.01 | 0.001 | 1e-06 | 0.343381 | 0.885135 | 0.494806 |
| 14 | 0.0001 | 1e-05 | 0.001 | 0.343666 | 0.861486 | 0.491329 |
| 13 | 0.01 | 0.001 | 0.001 | 0.342896 | 0.847973 | 0.488327 |
| 16 | 0.001 | 0.01 | 0.01 | 0.341001 | 0.851351 | 0.486957 |
| 12 | 0.01 | 1e-05 | 1e-06 | 0.349702 | 0.793919 | 0.485537 |
| 19 | 0.001 | 0.01 | 0.001 | 0.344239 | 0.817568 | 0.484484 |


In [22]:
%time TP, FP, TN, FN = disc_model.score(session, F_dev, L_gold_dev)

Scores (Un-adjusted)
Pos. class accuracy: 0.797
Neg. class accuracy: 0.328
Precision            0.372
Recall               0.797
F1                   0.508
----------------------------------------
TP: 236 | FP: 398 | TN: 194 | FN: 60

CPU times: user 555 ms, sys: 13.5 ms, total: 569 ms
Wall time: 646 ms
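
The scores printed above can be recomputed by hand from the four confusion-matrix counts. The sketch below does so using the standard definitions of precision, recall, and F1; `binary_scores` is a hypothetical helper written for this illustration and is not part of Snorkel. It assumes that "Pos. class accuracy" means TP / (TP + FN) and "Neg. class accuracy" means TN / (TN + FP), which is consistent with the numbers shown.

```python
def binary_scores(tp, fp, tn, fn):
    """Compute standard binary classification scores from confusion-matrix counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    neg_accuracy = tn / float(tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "pos_accuracy": recall,  # fraction of true positives recovered
        "neg_accuracy": neg_accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the development-set output above.
dev_scores = binary_scores(tp=236, fp=398, tn=194, fn=60)
for name in ("pos_accuracy", "neg_accuracy", "precision", "recall", "f1"):
    print("{0:<14}{1:.3f}".format(name, dev_scores[name]))
```

The low negative-class accuracy reflects the 398 false positives: the model recovers most true relations but also flags many candidates that are not relations, which is what pulls precision down.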


In [26]:
from load_external_annotations import load_external_labels
load_external_labels(session, ChemicalDisease, split=2, annotator='gold')

AnnotatorLabels created: 0


In [27]:
from snorkel.annotations import load_gold_labels
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
L_gold_test

<4620x1 sparse matrix of type '<type 'numpy.float64'>'
	with 4620 stored elements in Compressed Sparse Row format>

In [28]:
%time _, _, _, _ = disc_model.score(session, F_test, L_gold_test)

Scores (Un-adjusted)
Pos. class accuracy: 0.789
Neg. class accuracy: 0.253
Precision            0.338
Recall               0.789
F1                   0.473
----------------------------------------
TP: 1188 | FP: 2328 | TN: 787 | FN: 317

CPU times: user 8.76 s, sys: 229 ms, total: 8.99 s
Wall time: 10.5 s


# Part V: Training an LSTM

In the intro tutorial, we automatically featurized the candidates and trained a linear model over those features. Here, we'll train a more complex model for relation extraction: an LSTM network. You can read more about LSTMs [here](https://en.wikipedia.org/wiki/Long_short-term_memory) or [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). An LSTM is a type of recurrent neural network that automatically generates a numerical representation of each candidate from the sentence text, so there is no need to featurize explicitly as we did in the intro tutorial.

LSTMs take longer to train, and Snorkel doesn't currently support hyperparameter searches for them. We'll train a single model here, but feel free to try out other parameter sets. Just make sure to use the development set, and not the test set, for model selection.
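
To make the idea of a learned numerical representation concrete, here is a minimal pure-Python sketch of a standard LSTM cell reading a sentence token by token and returning its final hidden state. This is not how `reLSTM` is implemented: the function names, the three-token sentence, the embedding values, the dimensions, and the random untrained weights are all made up for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases for the four LSTM gates."""
    params = {}
    for gate in ("i", "f", "o", "g"):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
             for _ in range(hidden_dim)]
        params[gate] = (W, [0.0] * hidden_dim)
    return params


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = x + h_prev  # concatenate the input with the previous hidden state
    pre = {}
    for gate, (W, b) in params.items():
        pre[gate] = [a + bias for a, bias in zip(matvec(W, z), b)]
    i = [sigmoid(a) for a in pre["i"]]    # input gate
    f = [sigmoid(a) for a in pre["f"]]    # forget gate
    o = [sigmoid(a) for a in pre["o"]]    # output gate
    g = [math.tanh(a) for a in pre["g"]]  # candidate cell update
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def encode(tokens, embeddings, params, hidden_dim):
    """Run the LSTM over a token sequence and return the final hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for token in tokens:
        h, c = lstm_step(embeddings[token], h, c, params)
    return h


rng = random.Random(0)
# Hypothetical 3-dimensional token embeddings; a real model learns these.
embeddings = {
    "chemical": [0.2, -0.1, 0.4],
    "causes": [0.0, 0.3, -0.2],
    "disease": [-0.3, 0.1, 0.5],
}
params = init_params(input_dim=3, hidden_dim=4, rng=rng)
sentence_vector = encode(["chemical", "causes", "disease"], embeddings, params, hidden_dim=4)
print(sentence_vector)
```

In a trained model, a final classification layer would map a vector like `sentence_vector` to the probability that the candidate expresses a chemical-induced-disease relation, and the gate weights would be learned from the training marginals rather than drawn at random.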

In [17]:
# from snorkel.contrib.learning import reLSTM

# lstm = reLSTM()
# %time lstm.train(
#     train, train_marginals, lr=0.005, dim=200, n_epochs=30,
#     dropout_rate=0.5, rebalance=0.25, print_freq=5
# )

### Scoring on the test set

Finally, we'll evaluate our performance on the blind test set of 500 documents. We'll load the gold labels much as we did for the development set, then use the `score` function of our extraction model to see how we did.

In [18]:
# from load_external_annotations import load_external_labels
# load_external_labels(session, ChemicalDisease, split=2, annotator='gold')

In [19]:
# from snorkel.annotations import load_gold_labels
# L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
# L_gold_test

In [20]:
# %time _, _, _, _ = lstm.score(session, test, L_gold_test)