# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part III: Training an End Extraction Model

In this final section of the tutorial, we'll use the noisy training labels we generated in the previous part to train our end extraction model.

For this tutorial, we will be training a Bi-LSTM, a state-of-the-art deep neural network implemented in [TensorFlow](https://www.tensorflow.org/).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

We repeat our definition of the `Spouse` `Candidate` subclass:

In [2]:
from snorkel.models import candidate_subclass

Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

We reload the probabilistic training labels:

In [3]:
from snorkel.annotations import load_marginals

train_marginals = load_marginals(session, split=0)
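
To make "probabilistic training labels" concrete, here is a minimal sketch of how a marginal can act as a soft target in a cross-entropy loss. It assumes the end model minimizes cross-entropy against the marginals, which is a common choice for probabilistic labels; it is not taken from the `reRNN` source, and the function names and example values are hypothetical.

```python
import math

def soft_cross_entropy(marginal, predicted, eps=1e-12):
    """Cross-entropy between a probabilistic label and a predicted probability.

    marginal:  P(candidate is a true spouse pair), a noisy training label in [0, 1]
    predicted: the model's predicted probability for the same candidate
    """
    # Clamp the prediction so log() is always defined.
    q = min(max(predicted, eps), 1.0 - eps)
    return -(marginal * math.log(q) + (1.0 - marginal) * math.log(1.0 - q))

def mean_soft_cross_entropy(marginals, predictions):
    """Average the per-candidate loss over a batch."""
    losses = [soft_cross_entropy(p, q) for p, q in zip(marginals, predictions)]
    return sum(losses) / len(losses)

# Hypothetical marginals and predictions, for illustration only.
example_marginals = [0.9, 0.5, 0.1]
example_predictions = [0.8, 0.5, 0.3]
loss = mean_soft_cross_entropy(example_marginals, example_predictions)
print("Mean loss: {0:.3f}".format(loss))
```

With a hard label (a marginal of exactly 0 or 1) this reduces to the usual binary cross-entropy, so the soft version only changes how much each uncertain candidate contributes.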

We also reload the candidates:

In [4]:
train_cands = session.query(Spouse).filter(Spouse.split == 0).order_by(Spouse.id).all()
dev_cands   = session.query(Spouse).filter(Spouse.split == 1).order_by(Spouse.id).all()
test_cands  = session.query(Spouse).filter(Spouse.split == 2).order_by(Spouse.id).all()

Finally, we load gold labels for evaluation:

In [5]:
from snorkel.annotations import load_gold_labels

L_gold_dev  = load_gold_labels(session, annotator_name='gold', split=1)
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)

In [56]:

L_gold_train  = load_gold_labels(session, annotator_name='gold', split=0, load_as_array=True)
print(L_gold_train.shape)
print(L_gold_train.sum())

AnnotatorLabels created: 0
(22254,)
0


In [66]:
obamas = [c for c in train_cands if 'Obama' in c.person1.get_span() + c.person2.get_span()]
print(len(obamas))
from pprint import pprint
pprint(obamas)

114
[Spouse(Span("Mark Wilson/", sentence=27040, chars=[10,21], words=[3,4]), Span("Obama on", sentence=27040, chars=[42,49], words=[11,12])),
 Spouse(Span(" Carson", sentence=33689, chars=[15,21], words=[4,5]), Span("Obama", sentence=33689, chars=[148,152], words=[25,25])),
 Spouse(Span("  Cruz's", sentence=33703, chars=[14,21], words=[3,5]), Span("Barack Obama", sentence=33703, chars=[155,166], words=[30,31])),
 Spouse(Span("Rafael", sentence=33703, chars=[30,35], words=[7,7]), Span("Barack Obama", sentence=33703, chars=[155,166], words=[30,31])),
 Spouse(Span("Peng", sentence=47389, chars=[17,20], words=[4,4]), Span("Michelle Obama", sentence=47389, chars=[88,101], words=[17,18])),
 Spouse(Span("Pope Francis", sentence=16912, chars=[0,11], words=[0,1]), Span("Barack Obama", sentence=16912, chars=[34,45], words=[5,6])),
 Spouse(Span("Barack Obama", sentence=16912, chars=[34,45], words=[5,6]), Span("Laudato Si", sentence=16912, chars=[116,125], words=[20,21])),
 Spouse(Span(" Carson",

In [64]:
print(len(obamas))

22254


Now we can set up our discriminative model. Here we specify the model and learning hyperparameters.

They can also be tuned automatically by searching over candidate values and scoring each setting on the dev set, using a [GridSearch](https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/utils.py) object.
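
The idea behind such a search can be sketched in a few lines of plain Python. This is an illustration of exhaustive grid search in general, not Snorkel's `GridSearch` API: `grid_search`, `train_and_score`, and `toy_dev_f1` are hypothetical names, and the toy scoring function stands in for "train the model with these hyperparameters and return its dev-set F1".

```python
import itertools

def grid_search(train_and_score, param_grid):
    """Try every combination in param_grid and keep the best-scoring one.

    train_and_score: callable taking hyperparameters as keyword arguments
                     and returning a score to maximize (e.g. dev-set F1)
    param_grid:      dict mapping each hyperparameter name to candidate values
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_dev_f1(lr, dropout):
    """Toy stand-in for training a model; peaks at lr=0.01, dropout=0.5."""
    return 1.0 - abs(lr - 0.01) * 10 - abs(dropout - 0.5)

best_params, best_score = grid_search(
    toy_dev_f1,
    {"lr": [0.001, 0.01, 0.1], "dropout": [0.25, 0.5]},
)
print(best_params, best_score)
```

Because every combination triggers a full training run, the grid is usually kept small, and the selection should use the dev set rather than the test set.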

In [59]:
from snorkel.learning.disc_models.rnn import reRNN

train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   20,
    'dropout':    0.5,
    'print_freq': 1,
    'max_sentence_length': 100
}

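lstm = reRNN(seed=1701, n_threads=None)
gold_train_labels = L_gold_train  # not used below; no gold labels were loaded for the training split
# Training on the noisy marginals, as described in this tutorial:
# lstm.train(train_cands, train_marginals, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)
# Here the model is instead trained on the dev candidates with their gold labels.
# It is scored on those same candidates, so the Dev F1 below is measured on the training data.
gold_dev_labels = L_gold_dev.toarray().squeeze()
gold_dev_labels[gold_dev_labels == -1] = 0  # map negative gold labels (-1) to 0, giving 0/1 training targets
lstm.train(dev_cands, gold_dev_labels, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)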

[reRNN] Training model
[reRNN] n_train=2811  #epochs=20  batch size=256
[reRNN] Epoch 0 (7.69s)	Average loss=0.319301	Dev F1=0.00
[reRNN] Epoch 1 (18.19s)	Average loss=0.225449	Dev F1=32.06
[reRNN] Epoch 2 (28.67s)	Average loss=0.200281	Dev F1=50.41
[reRNN] Epoch 3 (39.08s)	Average loss=0.144881	Dev F1=60.27
[reRNN] Epoch 4 (49.46s)	Average loss=0.121741	Dev F1=64.39
[reRNN] Epoch 5 (60.03s)	Average loss=0.110585	Dev F1=65.45
[reRNN] Epoch 6 (72.22s)	Average loss=0.100834	Dev F1=65.82
[reRNN] Epoch 7 (84.57s)	Average loss=0.085636	Dev F1=75.27
[reRNN] Epoch 8 (95.20s)	Average loss=0.080916	Dev F1=72.50
[reRNN] Epoch 9 (105.71s)	Average loss=0.077849	Dev F1=73.19
[reRNN] Epoch 10 (116.23s)	Average loss=0.085320	Dev F1=74.85
[reRNN] Epoch 11 (126.75s)	Average loss=0.080988	Dev F1=72.50
[reRNN] Epoch 12 (137.18s)	Average loss=0.071697	Dev F1=77.62
[reRNN] Epoch 13 (147.63s)	Average loss=0.067060	Dev F1=81.82
[reRNN] Epoch 14 (158.09s)	Average loss=0.063371	Dev F1=80.23
[reRNN] Epoch 15 (1

In [16]:
# from snorkel.learning.disc_models.rnn import reRNN

# train_kwargs = {
#     'lr':         0.01,
#     'dim':        50,
#     'n_epochs':   10,
#     'dropout':    0.25,
#     'print_freq': 1,
#     'max_sentence_length': 100
# }

# lstm = reRNN(seed=1701, n_threads=None)
# lstm.train(train_cands, train_marginals, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)

[reRNN] Training model
[reRNN] n_train=17205  #epochs=10  batch size=256
[reRNN] Epoch 0 (23.43s)	Average loss=0.566997	Dev F1=34.46
[reRNN] Epoch 1 (48.20s)	Average loss=0.535070	Dev F1=36.13
[reRNN] Epoch 2 (73.24s)	Average loss=0.533329	Dev F1=35.49
[reRNN] Epoch 3 (98.04s)	Average loss=0.532918	Dev F1=36.92
[reRNN] Epoch 4 (123.41s)	Average loss=0.532681	Dev F1=35.93
[reRNN] Epoch 5 (148.55s)	Average loss=0.532237	Dev F1=33.81
[reRNN] Epoch 6 (174.32s)	Average loss=0.531906	Dev F1=38.29
[reRNN] Epoch 7 (199.92s)	Average loss=0.531722	Dev F1=38.26
[reRNN] Epoch 8 (224.76s)	Average loss=0.531367	Dev F1=38.54
[reRNN] Model saved as <reRNN>
[reRNN] Epoch 9 (251.51s)	Average loss=0.531073	Dev F1=39.50
[reRNN] Model saved as <reRNN>
[reRNN] Training done (254.57s)
[reRNN] Loaded model <reRNN>


Now, we get the precision, recall, and F1 score from the discriminative model:

In [57]:
p, r, f1 = lstm.score(test_cands, L_gold_test)
print("Prec: {0:.3f}, Recall: {1:.3f}, F1 Score: {2:.3f}".format(p, r, f1))



Prec: 0.141, Recall: 0.211, F1 Score: 0.169


We can also retrieve the candidates grouped into sets of true positives, false positives, true negatives, and false negatives, along with a more detailed score report:

In [60]:
tp, fp, tn, fn = lstm.error_analysis(session, test_cands, L_gold_test)

Scores (Un-adjusted)
Pos. class accuracy: 0.183
Neg. class accuracy: 0.899
Precision            0.138
Recall               0.183
F1                   0.157
----------------------------------------
TP: 40 | FP: 250 | TN: 2233 | FN: 178
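
The figures in this report follow directly from the four counts on its last line. As a small worked example, the sketch below recomputes them; `score_report` is a hypothetical helper written for this illustration, not a Snorkel function, and it takes counts (for instance `len(tp)`) rather than the candidate sets themselves.

```python
def score_report(tp, fp, tn, fn):
    """Compute the report's metrics from confusion-matrix counts."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "pos_class_accuracy": recall,  # share of true pairs that were found
        "neg_class_accuracy": tn / float(tn + fp) if tn + fp else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Counts taken from the report above.
report = score_report(tp=40, fp=250, tn=2233, fn=178)
for name in ("pos_class_accuracy", "neg_class_accuracy", "precision", "recall", "f1"):
    print("{0:<20} {1:.3f}".format(name, report[name]))
```

Note that the positive-class accuracy is the same quantity as recall, which is why both lines of the report read 0.183.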



Note that if this is the final test set on which you will report final numbers, you should not inspect individual results, to avoid biasing them. You can, however, run the model on your _development set_ and inspect examples there for error analysis, as we did in the previous part with the generative labeling function model.

You may also be able to improve performance substantially by increasing the number of training epochs (`n_epochs` in `train_kwargs`)!

Finally, we can save the predictions of the model on the test set back to the database. (This also works for other candidate sets, such as unlabeled candidates.)

In [19]:
lstm.save_marginals(session, test_cands)

Saved 2697 marginals


##### More importantly, you completed the introduction to Snorkel! Give yourself a pat on the back!