<img align="left" src="imgs/logo.jpg" width="50px" style="margin-right:10px">
# Snorkel Workshop: Extracting Spouse Relations <br> from the News
## Part 4: Training our End Extraction Model w/ Logistic Regression

In this final section of the tutorial, we'll use the noisy training labels we generated in the previous part to train our end extraction model.

For this tutorial, we will train a simple but fairly effective logistic regression model. More generally, however, Snorkel plugs into many ML libraries, including [TensorFlow](https://www.tensorflow.org/), making it easy to use almost any state-of-the-art model as the end extractor!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import re
import numpy as np

# Connect to the database backend and initialize a Snorkel session
from lib.init import *
from snorkel.models import candidate_subclass
from snorkel.annotations import load_gold_labels

from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

In the cell above, we repeat our definition of the `Spouse` `Candidate` subclass and load the gold labels for the development set (`split=1`).

# 1 Training a `SparseLogisticRegression` Discriminative Model
We use the training marginals to train a discriminative model that classifies each `Candidate` as a true or false mention. We'll use a random hyperparameter search, evaluated on the development set labels, to find the best hyperparameters for our model. To run a hyperparameter search, we need labels for a development set. If they aren't already available, we can manually create labels using the Viewer.
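
To make "training on marginals" concrete, here is a minimal sketch of a noise-aware loss: the usual logistic cross-entropy, but computed against probabilistic labels in [0, 1] instead of hard 0/1 labels. The function names `sigmoid` and `noise_aware_loss` are hypothetical and used for illustration only; Snorkel's own objective may differ in details such as regularization terms.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noise_aware_loss(scores, marginals):
    """Average cross-entropy between predictions and soft labels.

    scores    -- raw model scores (logits), one per candidate
    marginals -- probabilistic training labels in [0, 1], one per candidate
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, p in zip(scores, marginals):
        q = sigmoid(z)  # predicted probability that the candidate is true
        total -= p * math.log(q + eps) + (1.0 - p) * math.log(1.0 - q + eps)
    return total / len(scores)
```

With a marginal of 0.9, a confident positive prediction gives a lower loss than a confident negative one, but the loss never reaches zero. This is how the uncertainty in the noisy labels carries over into training.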

## Feature Extraction
Instead of starting with a deep learning approach, let's look at a standard sparse logistic regression model. First, we need to extract features. This can take a while, but we only have to do it once!

In [2]:
from snorkel.annotations import FeatureAnnotator
from lib.features import hybrid_span_mention_ftrs

featurizer = FeatureAnnotator()

F_train = featurizer.load_matrix(session, split=0)
F_dev = featurizer.load_matrix(session, split=1)
F_test = featurizer.load_matrix(session, split=2)

if F_train.size == 0:    
    %time F_train = featurizer.apply(split=0, parallelism=2, annotation_generator=hybrid_span_mention_ftrs)
if F_dev.size == 0:     
    %time F_dev  = featurizer.apply_existing(split=1, parallelism=2, annotation_generator=hybrid_span_mention_ftrs)
if F_test.size == 0:
    %time F_test = featurizer.apply_existing(split=2, parallelism=2, annotation_generator=hybrid_span_mention_ftrs)

In [3]:
from snorkel.annotations import load_marginals
train_marginals = load_marginals(session, F_train, split=0)

The following code performs model selection by tuning our learning algorithm's hyperparameters.

In [5]:
from snorkel.learning.utils import MentionScorer
from snorkel.learning import RandomSearch, ListParameter, RangeParameter

from snorkel.learning import SparseLogisticRegression
disc_model = SparseLogisticRegression()

# Searching over learning rate
#rate_param        = RangeParameter('lr', 1e-6, 1e-2, step=1, log_base=10)
rate_param        = ListParameter('lr', [0.1 / train_marginals.shape[0]])
l1_param          = RangeParameter('l1_penalty', 1e-6, 1e-2, step=1, log_base=10)
l2_param          = RangeParameter('l2_penalty', 1e-6, 1e-2, step=1, log_base=10)
batch_size_param  = ListParameter('batch_size', [512]) # 32, 64, 
#b_param           = ListParameter('b', [0.5, 0.6, 0.7])
#balance_param     = ListParameter('rebalance', [0.0, 0.3, 0.5])

param_grid = [rate_param, l1_param, l2_param, batch_size_param] # b_param, balance_param

np.random.seed(1701)
searcher = RandomSearch(SparseLogisticRegression, param_grid, F_train,
                        Y_train=train_marginals, n=1, n_threads=4)

logreg, run_stats = searcher.fit(F_dev, L_gold_dev, n_epochs=2000, print_freq=250, n_threads=1)
print run_stats

Initialized RandomSearch search of size 1. Search space size = 25.
[1] Testing lr = 4.49e-06, l1_penalty = 1.00e-02, l2_penalty = 1.00e-04, batch_size = 512
[SparseLogisticRegression] Training model
[SparseLogisticRegression] n_train=17214  #epochs=2000  batch size=512
[SparseLogisticRegression] Epoch 0 (2.01s)	Average loss=291.227203
[SparseLogisticRegression] Epoch 250 (520.72s)	Average loss=290.167267
[SparseLogisticRegression] Epoch 500 (1050.85s)	Average loss=289.672699
[SparseLogisticRegression] Epoch 750 (1568.22s)	Average loss=289.481781
[SparseLogisticRegression] Epoch 1000 (2103.25s)	Average loss=289.478210
[SparseLogisticRegression] Epoch 1250 (2631.88s)	Average loss=289.584717
[SparseLogisticRegression] Epoch 1500 (3171.83s)	Average loss=289.755707
[SparseLogisticRegression] Epoch 1750 (3706.07s)	Average loss=289.960297
[SparseLogisticRegression] Epoch 1999 (4218.17s)	Average loss=290.183441
[SparseLogisticRegression] Training done (4218.17s)
[SparseLogisticRegression] F1 S
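
The log above reports a search space of size 25. That is consistent with each of the two penalty ranges expanding to five powers of ten (1e-6 through 1e-2) while the learning rate and batch size contribute one value each. The sketch below illustrates that arithmetic and the idea of drawing random configurations from the grid. It uses only the standard library; `log_range`, `search_space_size`, and `sample_configs` are hypothetical helpers, not Snorkel functions, and the sampling here is with replacement, which may not match what `RandomSearch` does internally.

```python
import random

def log_range(low_exp, high_exp, base=10):
    """Values base**low_exp, ..., base**high_exp, one per integer exponent."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1)]

# Hypothetical stand-in for the penalty and batch-size grid defined above
grid = {
    'l1_penalty': log_range(-6, -2),
    'l2_penalty': log_range(-6, -2),
    'batch_size': [512],
}

def search_space_size(grid):
    """Number of distinct configurations in the full grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

def sample_configs(grid, n, seed=1701):
    """Draw n configurations uniformly at random (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in sorted(grid.items())}
        for _ in range(n)
    ]

configs = sample_configs(grid, n=3)
```

Evaluating only a handful of sampled configurations rather than all 25 trades coverage for time. That matters here, since a single 2000-epoch run in the log above took over an hour.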

In [6]:
from snorkel.annotations import load_gold_labels

L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)
L_gold_dev.shape

(2811, 1)

## Examining Features
Extracting features allows us to inspect and interpret our learned weights.

In [7]:
from lib.scoring import *
print_top_k_features(session, logreg, F_train, top_k=25)

363725
[-0.44589204, u'BETWEEN_SEQ_LEMMAS[sprinkling and]']
[-0.42697382, u'BETWEEN_SEQ_LEMMAS[february and]']
[-0.42161661, u'WIN_RIGHT_SEQ_LEMMAS[-PRON- husband joe]']
[-0.41500682, u'BETWEEN_SEQ_LEMMAS[yogi , grow]']
[-0.40838289, u'WIN_RIGHT_SEQ_LEMMAS[admire throng]']
[-0.4075236, u'WIN_LEFT_SEQ_POS_TAGS[CC NNP :]']
[-0.40371552, u'BETWEEN_SEQ_LEMMAS[criticism ,]']
[-0.40125224, u'BETWEEN_SEQ_LEMMAS[people like to]']
[-0.39913782, u'BETWEEN_SEQ_LEMMAS[to lobby]']
[-0.39903417, u'BETWEEN_SEQ_LEMMAS[dean mcdermott tell]']
[-0.39445403, u'BETWEEN_SEQ_LEMMAS[abbott and -PRON-]']
[-0.3905988, u'BETWEEN_SEQ_LEMMAS[into a 50-acre]']
[-0.39038658, u'BETWEEN_SEQ_LEMMAS[and meet kim]']
[-0.38999209, u"BETWEEN_SEQ_LEMMAS[execution : ']"]
[-0.38748351, u'WIN_LEFT_SEQ_LEMMAS[-PRON- diamond ,]']
[-0.38694522, u'WIN_RIGHT_SEQ_LEMMAS[the launch party]']
[-0.38580975, u'BETWEEN_SEQ_LEMMAS[by question about]']
[-0.38542014, u'WIN_RIGHT_SEQ_LEMMAS[( bass guitar]']
[-0.38492721, u'BETWEEN_SEQ_LEMMAS[
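
The implementation of `print_top_k_features` lives in the workshop's `lib.scoring` module and is not shown here. As a rough sketch of the idea, and assuming the ranking is by absolute weight (an assumption; the visible output is also consistent with simply listing the most negative weights first), pairing each weight with its feature name and sorting could look like the following. The helper `top_k_features` and the toy data are invented for illustration.

```python
def top_k_features(weights, names, k):
    """Return the k (weight, name) pairs with the largest absolute weight."""
    pairs = sorted(zip(weights, names),
                   key=lambda pair: abs(pair[0]),
                   reverse=True)
    return pairs[:k]

# Toy weights and feature names, invented for illustration only
weights = [0.05, -0.44, 0.31, -0.02]
names = ['FEATURE_A', 'FEATURE_B', 'FEATURE_C', 'FEATURE_D']

top = top_k_features(weights, names, k=2)
```

Large-magnitude negative weights, like those listed above, mark features that push the model toward predicting that a candidate is not a spouse relation.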

## Evaluate on Test Data

In [None]:
from snorkel.annotations import load_gold_labels
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)

In [None]:
# Score the best model returned by the random search (`disc_model` was never trained)
_, _, _, _ = logreg.score(session, F_test, L_gold_test)
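
The `score` call above returns four values, which this cell discards. Assuming they correspond to the true positives, false positives, true negatives, and false negatives (an assumption; the notebook does not say), the usual summary metrics can be derived from their counts. The helper `precision_recall_f1` below is hypothetical and the example counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for illustration: 8 TP, 2 FP, 8 FN
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

True negatives do not enter any of the three metrics, which is why F1 is a common choice for extraction tasks where negative candidates vastly outnumber positive ones.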