# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part IV: Training a Model with Data Programming

In this part of the tutorial, we will train a statistical model to differentiate between true and false `Spouse` mentions.

We will train this model using _data programming_, and we will **ignore** the training labels provided with the training data. This is a more realistic scenario; in the wild, hand-labeled training data is rare and expensive. Data programming enables us to train a model using only a modest amount of hand-labeled data for validation and testing. For more information on data programming, see the [NIPS 2016 paper](https://arxiv.org/abs/1605.07723).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

We repeat our definition of the `Spouse` `Candidate` subclass from Parts II and III.

In [2]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

## Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of spouse relations. To train a model for this task, we first embed our `Spouse` candidates in a feature space.

In [3]:
from snorkel.annotations import FeatureManager
feature_manager = FeatureManager(Spouse)

We can create a new feature set:

**Note that this might take 5-10 min.**

In [None]:
%time F_train = feature_manager.create(session, split=0)

**OR** if we've already created one, we can simply load as follows:

In [4]:
%time F_train = feature_manager.load(session, split=0)

CPU times: user 3.38 s, sys: 292 ms, total: 3.67 s
Wall time: 3.61 s


Note that the returned matrix is a subclass of `scipy.sparse.csr_matrix` with extra helper methods for looking up the candidate behind a row and the feature key behind a column, which we demonstrate below:

In [5]:
F_train

<4780x117915 sparse matrix of type '<type 'numpy.float64'>'
	with 281349 stored elements in Compressed Sparse Row format>

In [6]:
F_train.get_candidate(session, 0)

Spouse(Span("Armu", parent=17332, chars=[0,3], words=[0,0]), Span("Mohoasa", parent=17332, chars=[72,78], words=[14,14]))

In [7]:
F_train.get_key(session, 0)

FeatureKey (TDL_LEMMA:PARENTS-OF-BETWEEN-MENTION-and-MENTION[None])
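
The row and column lookups above can be pictured with a small stand-in. This is only a sketch of the idea in plain Python, not Snorkel's implementation: the class `AnnotatedRows`, its attributes, and the toy data are all hypothetical, and rows are stored as dictionaries rather than in compressed sparse row format.

```python
class AnnotatedRows(object):
    """Toy sparse matrix that remembers what each row and column stands for.

    Hypothetical stand-in for illustration; not the class Snorkel returns.
    """

    def __init__(self, rows, candidates, keys):
        self.rows = rows              # list of {column index: value}, zeros omitted
        self.candidates = candidates  # candidates[i] is the object behind row i
        self.keys = keys              # keys[j] is the feature name behind column j

    @property
    def shape(self):
        return (len(self.rows), len(self.keys))

    @property
    def nnz(self):
        # Number of explicitly stored (non-zero) entries
        return sum(len(row) for row in self.rows)

    def get_candidate(self, i):
        return self.candidates[i]

    def get_key(self, j):
        return self.keys[j]


toy_matrix = AnnotatedRows(
    rows=[{0: 1.0, 2: 1.0}, {1: 1.0}],
    candidates=['candidate A', 'candidate B'],
    keys=['feature x', 'feature y', 'feature z'],
)
print(toy_matrix.shape)
print(toy_matrix.get_candidate(0))
print(toy_matrix.get_key(2))
```

Keeping the lookups next to the matrix means a row or column index found during error analysis can be traced back to a candidate or feature without a separate table.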

## Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.

In [8]:
import re
from snorkel.lf_helpers import get_left_tokens, get_right_tokens, get_between_tokens, get_text_between

In [9]:
spouses = {'wife', 'husband', 'ex-wife', 'ex-husband'}
family = {'father', 'mother', 'sister', 'brother', 'son', 'daughter',
              'grandfather', 'grandmother', 'uncle', 'aunt', 'cousin'}
family = family | {f + '-in-law' for f in family}
other = {'boyfriend', 'girlfriend', 'boss', 'employee', 'secretary', 'co-worker'}

def LF_husband_wife(c):
    return 1 if len(spouses.intersection(set(get_between_tokens(c)))) > 0 else 0

def LF_husband_wife_left_window(c):
    if len(spouses.intersection(set(get_left_tokens(c[0], window=2)))) > 0:
        return 1
    elif len(spouses.intersection(set(get_left_tokens(c[1], window=2)))) > 0:
        return 1
    else:
        return 0

def LF_no_spouse_in_sentence(c):
    return -1 if len(spouses.intersection(set(c[0].parent.words))) == 0 else 0

def LF_and_married(c):
    return 1 if 'and' in get_between_tokens(c) and 'married' in get_right_tokens(c) else 0
    
def LF_familial_relationship(c):
    return -1 if len(set(family).intersection(set(get_between_tokens(c)))) > 0 else 0

def LF_family_left_window(c):
    if len(family.intersection(set(get_left_tokens(c[0], window=2)))) > 0:
        return -1
    elif len(family.intersection(set(get_left_tokens(c[1], window=2)))) > 0:
        return -1
    else:
        return 0

def LF_other_relationship(c):
    coworker = ['boss', 'employee', 'secretary', 'co-worker']
    return -1 if len(set(coworker).intersection(set(get_between_tokens(c)))) > 0 else 0

For later convenience we group the labeling functions into a list.

In [10]:
LFs = [LF_husband_wife, LF_husband_wife_left_window, LF_no_spouse_in_sentence,
       LF_and_married, LF_familial_relationship, LF_family_left_window]
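
Conceptually, each labeling function votes 1 (true), -1 (false), or 0 (abstain) on a candidate, and applying a list of them yields one row of votes per candidate. The sketch below illustrates that convention on plain token lists. The toy candidates and the two functions are hypothetical simplifications; the real labeling functions above operate on Snorkel `Candidate` objects through the `lf_helpers` functions.

```python
# Hypothetical toy candidates: each is just the list of tokens found
# between two person mentions, not a Snorkel Candidate object.
SPOUSE_WORDS = {'wife', 'husband', 'ex-wife', 'ex-husband'}
FAMILY_WORDS = {'father', 'mother', 'sister', 'brother', 'son', 'daughter'}


def lf_spouse_between(tokens):
    """Vote 1 if a spouse word appears between the mentions, else abstain."""
    return 1 if SPOUSE_WORDS & set(tokens) else 0


def lf_family_between(tokens):
    """Vote -1 if a family word appears between the mentions, else abstain."""
    return -1 if FAMILY_WORDS & set(tokens) else 0


def apply_lfs(candidates, lfs):
    """Build a dense label matrix: one row per candidate, one column per LF."""
    return [[lf(c) for lf in lfs] for c in candidates]


toy_candidates = [
    ['and', 'his', 'wife'],
    ['and', 'her', 'brother'],
    ['met', 'with'],
]
label_matrix = apply_lfs(toy_candidates, [lf_spouse_between, lf_family_between])
print(label_matrix)  # [[1, 0], [0, -1], [0, 0]]
```

Most entries are 0 because most heuristics abstain on most candidates, which is why a sparse matrix is a natural fit for the real label matrix.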

## Applying Labeling Functions

First we construct a `LabelManager`.

In [11]:
from snorkel.annotations import LabelManager
label_manager = LabelManager(Spouse)

Next we run the `LabelManager` to apply the labeling functions to the training `CandidateSet`.

In [None]:
%time L_train = label_manager.create(session, split=0, f=LFs)
L_train

**OR** if we've already created the label matrix, we can simply load it:

In [12]:
%time L_train = label_manager.load(session, split=0)
L_train

CPU times: user 80 ms, sys: 44 ms, total: 124 ms
Wall time: 87.9 ms


<4780x9 sparse matrix of type '<type 'numpy.float64'>'
	with 5246 stored elements in Compressed Sparse Row format>

We can also add or rerun one or more labeling functions with the command below. Note that we set the argument `expand_key_set` to `True` to indicate that the set of matrix columns should be allowed to expand:

In [None]:
L_train = label_manager.update(session, True, split=0, f=[LF_other_relationship])
L_train

We can view statistics about the resulting label matrix:

In [13]:
L_train.lf_stats(session)

| Key | j | coverage | overlaps | conflicts |
|---|---|---|---|---|
| ajratner | 0 | 0.0 | 0.0 | 0.0 |
| gold | 1 | 0.0 | 0.0 | 0.0 |
| LF_husband_wife | 2 | 0.03682 | 0.023849 | 0.007113 |
| LF_husband_wife_left_window | 3 | 0.028661 | 0.022176 | 0.005439 |
| LF_no_spouse_in_sentence | 4 | 0.929289 | 0.054184 | 0.000209 |
| LF_and_married | 5 | 0.000209 | 0.000209 | 0.000209 |
| LF_familial_relationship | 6 | 0.053556 | 0.051883 | 0.008577 |
| LF_family_left_window | 7 | 0.041423 | 0.040167 | 0.004603 |
| LF_other_relationship | 8 | 0.007531 | 0.007531 | 0.0 |
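
To make the three statistics concrete, here is a minimal sketch that computes them on a small dense matrix. It assumes the usual reading of the columns: coverage is the fraction of candidates a labeling function labels, overlaps is the fraction it labels together with at least one other function, and conflicts is the fraction where another function gives a different non-zero label. The function name `toy_lf_stats` and the data are hypothetical; this is not the code behind `lf_stats`.

```python
def toy_lf_stats(label_matrix):
    """Return (coverage, overlaps, conflicts) for each column of a dense label matrix."""
    n = float(len(label_matrix))
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlapping = conflicting = 0
        for row in label_matrix:
            if row[j] == 0:
                continue  # this LF abstained on the candidate
            covered += 1
            # Non-zero votes cast by the other labeling functions
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlapping += 1
            if any(v != row[j] for v in others):
                conflicting += 1
        stats.append((covered / n, overlapping / n, conflicting / n))
    return stats


toy_L = [
    [1, 1, 0],    # LF 0 and LF 1 agree
    [1, -1, 0],   # LF 0 and LF 1 disagree
    [0, 0, -1],   # only LF 2 votes
    [0, 0, 0],    # every LF abstains
]
stats = toy_lf_stats(toy_L)
for j, (cov, ovl, con) in enumerate(stats):
    print(j, cov, ovl, con)
```

A function with high coverage but almost no overlap, like `LF_no_spouse_in_sentence` in the table above, is mostly labeling candidates that no other function touches, so its accuracy is harder to estimate from agreement alone.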


## Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision. Specifically, we estimate the parameters of a generative model (`GenerativeModel`) over the labeling function outputs.

In [14]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel(lf_propensity=True)
gen_model.train(L_train)


Processing
Compiling
	Initialized
	Weight matrix set
	Variable matrix step 1
	Variable matrix step 2
	Priors set
	Output factors set
	Deps set
Initializing FG
Loading FG
Learning FG
Processing weights


We now apply the generative model to the training candidates.

In [15]:
train_marginals = gen_model.marginals(L_train)
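
As a rough intuition for what the marginals represent, the sketch below turns labeling function votes into a probability by passing a weighted sum of votes through a sigmoid. It is a deliberate simplification: it assumes the labeling functions vote independently, takes the weights as given rather than learning them, and ignores labeling propensity, all of which Snorkel's `GenerativeModel` handles differently. The names `toy_marginals`, `toy_L`, and `toy_w` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_marginals(label_matrix, weights):
    """Estimate P(y = 1) per candidate from votes in {-1, 0, 1} and per-LF weights."""
    return [
        sigmoid(sum(w * vote for w, vote in zip(weights, row)))
        for row in label_matrix
    ]


toy_L = [
    [1, 1, 0],    # two positive votes
    [0, -1, -1],  # two negative votes
    [0, 0, 0],    # every LF abstains
]
toy_w = [1.0, 0.5, 2.0]  # hypothetical weights; larger means more trusted
probs = toy_marginals(toy_L, toy_w)
print(probs)
```

In this sketch a candidate on which every function abstains gets exactly 0.5, which carries no training signal in either direction.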

We can view the learned parameters.

In [16]:
gen_model.weights.lf_accuracy

array([-0.01915055,  0.00277274, -0.01514995, -0.0152869 ,  0.03066852,
        0.01110797, -0.0015556 , -0.00895907, -0.01574029])

In [17]:
gen_model.weights.lf_propensity

array([-2.0414346 , -2.04384821, -1.90371848, -1.92865249,  0.91401601,
       -2.08016101, -1.83523695, -1.87016857, -2.01907289])

## Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention. We'll use a random hyperparameter search, evaluated on the development set labels, to find the best hyperparameters for our model. To run a hyperparameter search, we need labels for a development set. If they aren't already available, we can manually create labels using the Viewer.

In [18]:
from snorkel.learning import LogReg
from snorkel.learning import RandomSearch, ListParameter, RangeParameter

iter_param = ListParameter('n_iter', [250, 500, 1000, 2000])
rate_param = RangeParameter('rate', 1e-4, 1e-2, step=0.75, log_base=10)
reg_param  = RangeParameter('mu', 1e-8, 1e-2, step=1, log_base=10)

disc_model = LogReg()
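
Training on probabilities instead of hard labels can be sketched with a small logistic regression whose loss is the expected log loss under the marginals. This is an illustrative pure-Python version under stated assumptions, not Snorkel's `LogReg`: the function `train_noise_aware_logreg` is hypothetical, it uses plain batch gradient descent, and it omits the regularization controlled by `mu`. The argument names `n_iter` and `rate` simply mirror the hyperparameters above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_noise_aware_logreg(X, marginals, n_iter=500, rate=1.0):
    """Fit weights by gradient descent, where marginals[i] is the estimated P(y_i = 1).

    For the expected log loss, the gradient with respect to the linear score
    of example i is sigmoid(score_i) - marginals[i].
    """
    w = [0.0] * len(X[0])
    n = float(len(X))
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, p in zip(X, marginals):
            err = sigmoid(sum(wk * xk for wk, xk in zip(w, x))) - p
            for k, xk in enumerate(x):
                grad[k] += err * xk
        w = [wk - rate * gk / n for wk, gk in zip(w, grad)]
    return w


# Two hypothetical binary features; the first co-occurs with high marginals.
toy_X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_p = [0.9, 0.8, 0.1, 0.2]
w = train_noise_aware_logreg(toy_X, toy_p)
print(w)
```

With soft targets, the model's prediction for a feature settles near the average marginal of the candidates that have it (about 0.85 and 0.15 here) instead of being pushed toward 0 or 1.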

First, we create features for the development set candidates.

Note that we use the training feature set, because those are the only features for which we have learned parameters. Features that were not encountered during training, e.g., a token that does not appear in the training set, are ignored, because we do not have any information about them.

To do so with the `FeatureManager`, we call `update` on the development split (`split=1`) and pass `False` for the `expand_key_set` argument to indicate that the set of feature keys should not be expanded with new `Feature` keys encountered during processing.
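
The effect of keeping the key set fixed can be sketched as follows: each development candidate's features are mapped onto the training columns, and any feature without a training column is dropped. The helper `featurize_with_fixed_keys` and the feature names are hypothetical and stand in for whatever `update` does internally.

```python
def featurize_with_fixed_keys(candidate_features, key_index):
    """Map per-candidate feature dicts onto existing columns, dropping unseen features."""
    rows = []
    for feats in candidate_features:
        row = {}
        for name, value in feats.items():
            col = key_index.get(name)
            if col is not None:  # feature was seen during training
                row[col] = value
        rows.append(row)
    return rows


# Hypothetical training key set: feature name -> column index
train_keys = ['WORD_wife', 'WORD_and', 'WORD_married']
key_index = {name: j for j, name in enumerate(train_keys)}

dev_features = [
    {'WORD_wife': 1.0, 'WORD_yesterday': 1.0},  # 'WORD_yesterday' is unseen
    {'WORD_colleague': 1.0},                    # nothing seen in training
]
dev_rows = featurize_with_fixed_keys(dev_features, key_index)
print(dev_rows)  # [{0: 1.0}, {}]
```

Because the columns are unchanged, the development matrix has the same width as the training matrix, so the learned weights line up with it directly.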

In [20]:
%time F_dev = feature_manager.update(session, False, split=1)
F_dev


Loading sparse Feature matrix...
CPU times: user 17.3 s, sys: 336 ms, total: 17.6 s
Wall time: 17.4 s


<223x117915 sparse matrix of type '<type 'numpy.float64'>'
	with 5648 stored elements in Compressed Sparse Row format>

**OR** if we've already created one, we can simply load as follows:

In [None]:
%time F_dev = feature_manager.load(session, split=1)
F_dev

Next, we load the development set labels and gold candidates we made in Part III.

In [21]:
L_gold_dev = label_manager.load(session, split=1)

In [23]:
dev_candidates = session.query(Spouse).filter(Spouse.split == 1).all()

Now we set up and run the hyperparameter search, training our model with different hyperparameters and picking the best model configuration to keep. We'll set the random seed to maintain reproducibility.

Note that we are fitting our model's parameters to the training set generated by our labeling functions, while we are picking hyperparameters with respect to the score over the development set labels, which we created by hand.
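
The division of labor described here (fit on the training marginals, select on the development score) can be sketched as a generic random search loop. This is a simplified illustration, not Snorkel's `RandomSearch`: the function `random_search`, the grid values, and the stand-in `train_fn` and `score_fn` are all hypothetical.

```python
import math
import random


def random_search(train_fn, score_fn, grid, n_trials, seed=0):
    """Sample configurations at random and keep the one with the best dev score."""
    rng = random.Random(seed)  # fixed seed so the search is reproducible
    best_model, best_params, best_score = None, None, None
    for _ in range(n_trials):
        # Sort the names so sampling order does not depend on dict ordering
        params = {name: rng.choice(values) for name, values in sorted(grid.items())}
        model = train_fn(**params)  # fit on the training set
        score = score_fn(model)     # evaluate on the development set
        if best_score is None or score > best_score:
            best_model, best_params, best_score = model, params, score
    return best_model, best_params, best_score


# Hypothetical search space, loosely modeled on the parameters above
grid = {
    'n_iter': [250, 500, 1000, 2000],
    'rate': [1e-4, 1e-3, 1e-2],
    'mu': [1e-8, 1e-5, 1e-2],
}


def train_fn(n_iter, rate, mu):
    # Stand-in for real training: the "model" just records its settings
    return {'n_iter': n_iter, 'rate': rate, 'mu': mu}


def score_fn(model):
    # Stand-in for a development-set score that happens to prefer rate near 1e-3
    return -abs(math.log10(model['rate']) + 3)


best_model, best_params, best_score = random_search(train_fn, score_fn, grid, n_trials=10)
print(best_params, best_score)
```

Only `score_fn` ever sees the hand-labeled development data, so those labels influence which configuration is kept but never the fitted weights themselves.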

In [24]:
from snorkel.learning.utils import MentionScorer

searcher = RandomSearch(disc_model, F_train, train_marginals, 10, MentionScorer, iter_param, rate_param, reg_param)

Initialized RandomSearch search of size 10. Search space size = 112.


In [25]:
np.random.seed(1701)
searcher.fit(F_dev, L_gold_dev, dev_candidates)

Testing n_iter = 1.00e+03, rate = 1.00e-04, mu = 1.00e-02
Training marginals (!= 0.5):	4689
Features:			117915
Using gradient descent...
	Learning epoch = 0	Step size = 0.0001
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 100	Step size = 9.04792147114e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 200	Step size = 8.18648829479e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 300	Step size = 7.40707032156e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 400	Step size = 6.70185906007e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 500	Step size = 6.06378944861e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 600	Step size = 5.48646907485e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 700	Step size = 4.96411413431e-05
	Loss = 3250.167130	Gradient magnitude = 85.672962
	Learning epoch = 800	Step size = 4.4914914861e-05
	

TypeError: get_candidate() takes exactly 3 arguments (2 given)

In [28]:
F_dev.get_candidate(session, 0)

Spouse(Span("Hefford", parent=27796, chars=[52,58], words=[8,8]), Span("Wickenheiser", parent=27796, chars=[35,46], words=[6,6]))

_Note that to train a model without tuning any hyperparameters--at your own risk!--just use the `train` method of the discriminative model. For instance, to train with 500 iterations and a learning rate of 0.001, you could run:_
```
disc_model.train(F_train, train_marginals, n_iter=500, rate=0.001)
```

In [None]:
disc_model.w.shape

In [None]:
%time disc_model.save(session, "Discriminative Params")

In [None]:
tp, fp, tn, fn = disc_model.score(F_dev, L_gold_dev, dev_candidates)
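
The four returned groups can be summarized as precision, recall, and F1. The sketch below assumes that `tp`, `fp`, and `fn` are collections of candidates (which is how `fn` is used with the viewer later on); the helper `precision_recall_f1` and the toy sets are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from collections of candidates."""
    n_tp, n_fp, n_fn = len(tp), len(fp), len(fn)
    precision = n_tp / float(n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / float(n_tp + n_fn) if n_tp + n_fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy stand-ins for the candidate sets
toy_tp = {'a', 'b', 'c'}
toy_fp = {'d'}
toy_fn = {'e', 'f', 'g'}
precision, recall, f1 = precision_recall_f1(toy_tp, toy_fp, toy_fn)
print(precision, recall, f1)
```

True negatives do not enter any of the three scores, which suits this task because true spouse mentions are a small minority of the candidates.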

## Viewing Examples
After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(fn, session, annotator_name="Tutorial Part IV User")
else:
    sv = None

In [None]:
sv

Next, in Part V, we will test our model on the test `CandidateSet`.