# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part IV: Training a Model with Data Programming

In this part of the tutorial, we will train a statistical model to differentiate between true and false `Spouse` mentions.

We will train this model using _data programming_, and we will **ignore** the training labels provided with the training data. This is a more realistic scenario; in the wild, hand-labeled training data is rare and expensive. Data programming enables us to train a model using only a modest amount of hand-labeled data for validation and testing. For more information on data programming, see the [NIPS 2016 paper](https://arxiv.org/abs/1605.07723).

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

We repeat our definition of the `Spouse` `Candidate` subclass from Parts II and III.

In [None]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

# (1) Creating and Modeling a Noisy Training Set

Our biggest step in the data programming pipeline is the creation--_and modeling_--of a noisy training set.  We'll approach this in three main steps:

i. **Creating labeling functions (LFs):** This is where most of our development time would go in a real application; labeling functions encode our heuristics and weak supervision signals to generate (noisy) labels for our training candidates

ii. **Applying the LFs:** Here, we actually use them to label our candidates!

iii. **Training a generative model of our training set:** Here we learn a model over our LFs, learning their respective accuracies automatically

## (1.i) Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.
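
To make the voting convention concrete, here is a minimal, self-contained sketch that does not use Snorkel. It assumes a candidate is just the list of tokens between two person mentions, and the names `lf_spouse_word` and `lf_family_word` are hypothetical. Each function returns `1` (true mention), `-1` (false mention), or `0` (abstain):

```python
# Illustrative only: a candidate is modeled here as a plain list of the
# tokens between two person mentions, not as a Snorkel Candidate object.
SPOUSE_WORDS = {'wife', 'husband', 'ex-wife', 'ex-husband'}
FAMILY_WORDS = {'father', 'mother', 'sister', 'brother', 'son', 'daughter'}

def lf_spouse_word(between_tokens):
    """Vote positive (1) if a spouse word appears, else abstain (0)."""
    return 1 if SPOUSE_WORDS & set(between_tokens) else 0

def lf_family_word(between_tokens):
    """Vote negative (-1) if a family word appears, else abstain (0)."""
    return -1 if FAMILY_WORDS & set(between_tokens) else 0

examples = [
    ['and', 'his', 'wife'],
    [',', 'the', 'son', 'of'],
    ['met', 'with'],
]
# One row per candidate, one column per labeling function.
votes = [[lf(tokens) for lf in (lf_spouse_word, lf_family_word)]
         for tokens in examples]
print(votes)  # [[1, 0], [0, -1], [0, 0]]
```

The last candidate receives no votes at all, which is common in practice: most labeling functions abstain on most candidates.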

In [None]:
import re
from snorkel.lf_helpers import get_left_tokens, get_right_tokens, get_between_tokens, get_text_between

In [None]:
spouses = {'wife', 'husband', 'ex-wife', 'ex-husband'}
family = {'father', 'mother', 'sister', 'brother', 'son', 'daughter',
              'grandfather', 'grandmother', 'uncle', 'aunt', 'cousin'}
family = family | {f + '-in-law' for f in family}
other = {'boyfriend', 'girlfriend', 'boss', 'employee', 'secretary', 'co-worker'}

def LF_husband_wife(c):
    return 1 if len(spouses.intersection(set(get_between_tokens(c)))) > 0 else 0

def LF_husband_wife_left_window(c):
    if len(spouses.intersection(set(get_left_tokens(c[0], window=2)))) > 0:
        return 1
    elif len(spouses.intersection(set(get_left_tokens(c[1], window=2)))) > 0:
        return 1
    else:
        return 0

def LF_no_spouse_in_sentence(c):
    return -1 if len(spouses.intersection(set(c[0].parent.words))) == 0 else 0

def LF_and_married(c):
    return 1 if 'and' in get_between_tokens(c) and 'married' in get_right_tokens(c) else 0
    
def LF_familial_relationship(c):
    return -1 if len(set(family).intersection(set(get_between_tokens(c)))) > 0 else 0

def LF_family_left_window(c):
    if len(family.intersection(set(get_left_tokens(c[0], window=2)))) > 0:
        return -1
    elif len(family.intersection(set(get_left_tokens(c[1], window=2)))) > 0:
        return -1
    else:
        return 0

def LF_other_relationship(c):
    coworker = ['boss', 'employee', 'secretary', 'co-worker']
    return -1 if len(set(coworker).intersection(set(get_between_tokens(c)))) > 0 else 0

For later convenience we group the labeling functions into a list.

In [None]:
LFs = [LF_husband_wife, LF_husband_wife_left_window, LF_no_spouse_in_sentence,
       LF_and_married, LF_familial_relationship, LF_family_left_window, LF_other_relationship]

## (1.ii) Applying the Labeling Functions

Next, we need to actually execute the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database.  We'll do this using the `LabelAnnotator` class, a UDF which we will again run with `UDFRunner`.  **Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.**  We start by setting up the class:

In [None]:
from snorkel.annotations import LabelAnnotator
from snorkel.udf import UDFRunner

labeler = UDFRunner(LabelAnnotator, candidate_subclass=Spouse, f=LFs)

Next, we load the **`id`s** of the candidates we want to label.  _Note that for larger sets, we'd want this to be a generator instead..._

In [None]:
train_cids = [c[0] for c in session.query(Spouse.id).filter(Spouse.split == 0).all()]

Finally, we run the `labeler`. Again, this can run in parallel if an appropriate database such as Postgres is being used:

In [None]:
%time labeler.apply(train_cids)

Now that we've created the labels (saved in the database), we'll load them in as a sparse matrix:

In [None]:
from snorkel.annotations import load_label_matrix
L_train = load_label_matrix(session, train_cids)

Note that the returned matrix is a special subclass of the `scipy.sparse.csr_matrix` class, with some special features which we demonstrate below:

In [None]:
L_train

In [None]:
L_train.get_candidate(session, 0)

In [None]:
L_train.get_key(session, 0)

We can view statistics about the resulting label matrix:

In [None]:
L_train.lf_stats(session)
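
As a rough illustration of what such statistics measure, the sketch below computes per-LF coverage, overlap, and conflict fractions on a toy dense matrix with NumPy. The function `lf_summary` is hypothetical and the definitions are the usual ones for binary labels; treat it as an approximation of the idea, not as the implementation behind `lf_stats`:

```python
import numpy as np

# Toy label matrix: rows are candidates, columns are labeling functions.
# Entries are 1 (positive), -1 (negative), or 0 (abstain).
L = np.array([
    [ 1,  0,  0],
    [ 1, -1,  0],
    [ 0, -1, -1],
    [ 0,  0,  0],
])

def lf_summary(L):
    """Return per-LF coverage, overlap, and conflict fractions."""
    labeled = L != 0
    n_votes = labeled.sum(axis=1, keepdims=True)
    # Coverage: fraction of candidates on which the LF does not abstain.
    coverage = labeled.mean(axis=0)
    # Overlap: the LF votes and at least one other LF also votes.
    overlaps = (labeled & (n_votes > 1)).mean(axis=0)
    # Conflict: the LF votes and some other LF casts the opposite vote.
    has_pos = (L == 1).any(axis=1, keepdims=True)
    has_neg = (L == -1).any(axis=1, keepdims=True)
    conflicts = (((L == 1) & has_neg) | ((L == -1) & has_pos)).mean(axis=0)
    return coverage, overlaps, conflicts

coverage, overlaps, conflicts = lf_summary(L)
```

Low coverage suggests an LF rarely fires, while a high conflict rate flags LFs that frequently disagree with the others and may deserve a closer look.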

## (1.iii) Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision. Specifically, we estimate the parameters of a `NaiveBayes` generative model.

In [None]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train)

We now apply the generative model to the training candidates.

In [None]:
train_marginals = gen_model.marginals(L_train)
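
To build intuition for what a marginal is, consider a simplified model in which the labeling functions vote independently, the class prior is uniform, and each LF's accuracy is already known. Under those assumptions, the log-odds of a true mention is the sum of each vote multiplied by the log-odds of that LF being correct. The sketch below uses hand-picked accuracies and a hypothetical helper `marginals_from_accuracies`; Snorkel's generative model instead learns its parameters from the label matrix, so this is an illustration only:

```python
import numpy as np

def marginals_from_accuracies(L, accuracies):
    """P(y = 1 | votes), assuming independent LFs and a uniform class prior."""
    acc = np.asarray(accuracies, dtype=float)
    weights = np.log(acc / (1.0 - acc))   # log-odds that each LF is correct
    log_odds = L @ weights                # abstains (0) contribute nothing
    return 1.0 / (1.0 + np.exp(-log_odds))

L = np.array([
    [ 1,  1,  0],   # two positive votes
    [ 1, -1,  0],   # two equally accurate LFs disagree
    [ 0,  0, -1],   # one negative vote from a weak LF
    [ 0,  0,  0],   # no votes at all
])
# Assumed accuracies, chosen by hand for this example.
marginals = marginals_from_accuracies(L, accuracies=[0.8, 0.8, 0.6])
```

Agreeing votes push the marginal toward 0 or 1, while conflicting or missing votes leave it at 0.5, which is why a histogram of the training marginals is a useful diagnostic.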

We can view the learned parameters.

In [None]:
gen_model.weights.lf_accuracy

In [None]:
import matplotlib.pyplot as plt
plt.hist(train_marginals)

# (2) Training our Primary Model

Next, we use the training set we've generated--now stored as an array of `train_marginals`--to train a discriminative model.  This part more closely follows a traditional ML process, and will consist of two steps:

i. **Create features**

ii. **Train our model**

## (2.i) Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of spouse relations. To train a model for this task, we first embed our `Spouse` candidates in a feature space. Here, we use the `FeatureAnnotator` UDF, which generates Snorkel's default features:

In [None]:
from snorkel.annotations import FeatureAnnotator
from snorkel.udf import UDFRunner

featurizer = UDFRunner(FeatureAnnotator, candidate_subclass=Spouse)
%time featurizer.apply(train_cids)

And again we then load the features as a matrix:

In [None]:
from snorkel.annotations import load_feature_matrix
F_train = load_feature_matrix(session, train_cids)
F_train

## (2.ii) Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention. We'll use a random hyperparameter search, evaluated on the development set labels, to find the best hyperparameters for our model. To run a hyperparameter search, we need labels for a development set. If they aren't already available, we can manually create labels using the Viewer.

In [None]:
from snorkel.learning import LogReg
disc_model = LogReg()
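
Training on probabilities rather than hard labels changes only the loss: the model minimizes the expected cross-entropy under the marginals. The following NumPy sketch shows the idea with plain gradient descent. The function names are hypothetical, the parameter names `rate` and `mu` simply mirror the hyperparameters tuned later in this part, and treating `mu` as an L2 penalty weight is an assumption of this sketch, not a statement about `LogReg` internals:

```python
import numpy as np

def train_noise_aware_logreg(X, marginals, n_iter=500, rate=0.1, mu=0.0):
    """Fit logistic regression to probabilistic labels by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the expected cross-entropy: (p - marginals) replaces
        # the usual (p - y) term used with hard 0/1 labels.
        err = p - marginals
        w -= rate * (X.T @ err / n + mu * w)
        b -= rate * err.mean()
    return w, b

def expected_log_loss(X, marginals, w, b):
    """Mean cross-entropy between predictions and probabilistic labels."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    eps = 1e-12
    return -np.mean(marginals * np.log(p + eps)
                    + (1 - marginals) * np.log(1 - p + eps))

# Toy data: two binary features and one soft label per candidate.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
soft_labels = np.array([0.9, 0.7, 0.2, 0.5])
w, b = train_noise_aware_logreg(X, soft_labels)
```

In this toy data the first feature co-occurs with high marginals and the second with low ones, so the fitted weights take opposite signs.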

First, we create features for the development set candidates.

Note that we use the training feature set, because those are the only features for which we have learned parameters. Features that were not encountered during training, e.g., a token that does not appear in the training set, are ignored, because we do not have any information about them.

**To use the training features we already created, we specify `create_new_keyset=False` when applying the featurizer to the dev set:**

In [None]:
dev_cids   = [c[0] for c in session.query(Spouse.id).filter(Spouse.split == 1).all()]
featurizer = UDFRunner(FeatureAnnotator, candidate_subclass=Spouse)
%time featurizer.apply(dev_cids, create_new_keyset=False)

In [None]:
F_dev = load_feature_matrix(session, dev_cids)
F_dev
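
The effect of reusing the training key set can be illustrated without Snorkel. In the sketch below, the hypothetical `featurize` function builds bag-of-words count vectors; when it is given an existing list of feature keys, any token outside that list is silently dropped, so the dev matrix has the same columns as the training matrix:

```python
def featurize(candidates, feature_keys=None):
    """Map token lists to count vectors.

    If feature_keys is None, a new key set is built from the candidates;
    otherwise the given keys are reused and unseen tokens are dropped.
    """
    if feature_keys is None:
        feature_keys = sorted({tok for cand in candidates for tok in cand})
    index = {key: j for j, key in enumerate(feature_keys)}
    rows = []
    for cand in candidates:
        row = [0] * len(feature_keys)
        for tok in cand:
            if tok in index:          # tokens unseen in training are ignored
                row[index[tok]] += 1
        rows.append(row)
    return rows, feature_keys

train = [['his', 'wife'], ['her', 'husband']]
dev = [['his', 'husband', 'fiancee']]   # 'fiancee' never appears in train

F_train_toy, keys = featurize(train)
F_dev_toy, _ = featurize(dev, feature_keys=keys)
```

Here `F_dev_toy` has four columns, one per training key, and the unseen token contributes nothing.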

Next, we load the development set labels and gold candidates we made in Part III.

In [None]:
from snorkel.annotations import load_annotator_labels
L_gold_dev = load_annotator_labels(session, dev_cids, annotator_name='gold')

Now we set up and run the hyperparameter search, training our model with different hyperparameters and picking the best model configuration to keep. We'll set the random seed to maintain reproducibility.

Note that we are fitting our model's parameters to the training set generated by our labeling functions, while we are picking hyperparameters based on the score over the development set labels, which we created by hand.

In [None]:
from snorkel.learning.utils import MentionScorer
from snorkel.learning import RandomSearch, ListParameter, RangeParameter

rate_param = RangeParameter('rate', 1e-8, 1e-2, step=1, log_base=10)
reg_param  = RangeParameter('mu', 1e-8, 1e-2, step=1, log_base=10)

searcher = RandomSearch(session, disc_model, F_train, train_marginals, [rate_param, reg_param], n=10)

In [None]:
np.random.seed(1701)
searcher.fit(F_dev, L_gold_dev)
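
Conceptually, a random search draws a handful of configurations from the parameter ranges, trains and scores a model for each, and keeps the best. The sketch below is a generic illustration, not Snorkel's `RandomSearch`: `log_range`, `random_search`, and `toy_dev_score` are hypothetical, and the grid assumes the ranges above expand to powers of ten between 1e-8 and 1e-2:

```python
import numpy as np

def log_range(low_exp, high_exp, base=10):
    """Values base**low_exp, ..., base**high_exp, one per integer exponent."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1)]

def random_search(train_and_score, param_grid, n=10, seed=1701):
    """Try n random configurations; return the best (score, config) pair."""
    rng = np.random.RandomState(seed)   # fixed seed for reproducibility
    best = None
    for _ in range(n):
        config = {name: values[rng.randint(len(values))]
                  for name, values in param_grid.items()}
        score = train_and_score(**config)
        if best is None or score > best[0]:
            best = (score, config)
    return best

# Stand-in for "train on the training marginals, score on the dev labels".
def toy_dev_score(rate, mu):
    return -abs(np.log10(rate) + 4) - abs(np.log10(mu) + 6)

grid = {'rate': log_range(-8, -2), 'mu': log_range(-8, -2)}
best_score, best_config = random_search(toy_dev_score, grid, n=10)
```

Because only 10 of the 49 possible configurations are tried, the result is the best configuration sampled, not necessarily the best one on the grid.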

_Note that to train a model without tuning any hyperparameters--at your own risk!--just use the `train` method of the discriminative model. For instance, to train with 500 iterations and a learning rate of 0.001, you could run:_
```
disc_model.train(F_train, train_marginals, n_iter=500, rate=0.001)
```

In [None]:
%time disc_model.save(session)

In [None]:
tp, fp, tn, fn = disc_model.score(session, F_dev, L_gold_dev)
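
The four groups returned above map directly onto the standard evaluation metrics. As a small sketch, assuming each returned object is a collection whose length gives the corresponding count, precision, recall, and F1 could be computed as follows; `prf1` is a hypothetical helper and the counts are made up for illustration:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, standing in for len(tp), len(fp), and len(fn).
precision, recall, f1 = prf1(tp=30, fp=10, fn=20)
```

True negatives do not enter any of the three metrics, which makes them well suited to tasks like this one where most candidates are false mentions.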

## Viewing Examples
After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(fn, session, annotator_name="Tutorial Part IV User")
else:
    sv = None

In [None]:
sv

Next, in Part V, we will test our model on the test `CandidateSet`.