# Intro. to Snorkel: Extracting Spouse Relations from the News
## Part 2: Writing Distant Supervision Labeling Functions


In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
import numpy as np
from snorkel.models import Candidate
from snorkel import SnorkelSession

session = SnorkelSession()

Snorkel requires that we formally define a type for our candidate.

In [None]:
from snorkel.models import candidate_subclass
try:
    Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
except Exception:
    # Re-running this cell raises because the candidate type already exists
    sys.stderr.write("Info: Candidate type already defined\n")

Next, we load the gold labels for the development set (`split=1`):

In [None]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

## 1. Distant Supervision Labeling Functions

In addition to writing labeling functions that describe text pattern-based heuristics for labeling training examples, we can also write labeling functions that distantly supervise examples. Here, we'll load in a list of known spouse pairs and check to see if the candidate pair matches one of these.

### DBpedia
http://wiki.dbpedia.org/
Our database of known spouses comes from DBpedia, a community-driven resource similar to Wikipedia but for curating structured data. We use a preprocessed snapshot as our knowledge base for all labeling function development.

In [None]:
import bz2

# Helper function to get last name
def last_name(s):
    name_parts = s.split(' ')
    return name_parts[-1] if len(name_parts) > 1 else None    

# Function to remove special characters from text
def strip_special(s):
    return ''.join(c for c in s if ord(c) < 128)

# Read in known spouse pairs and save as set of tuples
# (decode the raw bytes first so this works on both Python 2 and Python 3)
with bz2.BZ2File('data/spouses_dbpedia.csv.bz2', 'rb') as f:
    known_spouses = set(
        tuple(strip_special(x.decode('utf-8', 'ignore')).strip().split(','))
        for x in f.readlines()
    )
# Last name pairs for known spouses
last_names = set(
    (last_name(x), last_name(y))
    for x, y in known_spouses
    if last_name(x) and last_name(y)
)
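To make the behavior of these helpers concrete, the sketch below runs them on a made-up CSV line. The names and the raw line are illustrative assumptions, not entries from the DBpedia snapshot, and the bytes are decoded explicitly so that the sketch runs on its own.

```python
def last_name(s):
    name_parts = s.split(' ')
    return name_parts[-1] if len(name_parts) > 1 else None

def strip_special(s):
    return ''.join(c for c in s if ord(c) < 128)

# Hypothetical raw line (UTF-8 bytes), shaped like one row of the CSV file
raw_line = b"Jos\xc3\xa9 Garc\xc3\xada,Ana Smith\n"

# Same parsing steps as above: drop non-ASCII characters, trim, split on the comma
pair = tuple(strip_special(raw_line.decode('utf-8')).strip().split(','))
print(pair)

# last_name returns None for single-token names, so such pairs
# never make it into the last_names set
print(last_name(pair[0]), last_name(pair[1]), last_name("Madonna"))
```

Note that stripping non-ASCII characters changes accented names ("José" becomes "Jos"), so a candidate span that keeps its accents will not match the stripped entry exactly.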

### Example Entries

In [None]:
list(known_spouses)[0:10]

Load our helper functions

In [None]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

# Sandbox

Write your labeling functions below:

In [None]:
def LF_distant_supervision(c):
    p1, p2 = c.person1.get_span(), c.person2.get_span()
    return 1 if (p1, p2) in known_spouses or (p2, p1) in known_spouses else 0
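The `last_names` set built earlier is not used by the LF above. As a minimal sketch of how a looser, last-name-based match might look, the function below works on plain strings rather than Snorkel candidates. The name `lf_last_names` and the toy set are hypothetical; in the notebook you would pass `c.person1.get_span()` and `c.person2.get_span()` along with the real `last_names`.

```python
def last_name(s):
    name_parts = s.split(' ')
    return name_parts[-1] if len(name_parts) > 1 else None

# Tiny made-up stand-in for the last_names set built from DBpedia
toy_last_names = {("Doe", "Doe"), ("Smith", "Jones")}

def lf_last_names(p1, p2, last_names):
    """Return 1 if the last names of both mentions form a known pair (in either order), else 0."""
    if p1 == p2:
        return 0  # the same mention text twice is not evidence of a spouse pair
    n1, n2 = last_name(p1), last_name(p2)
    if n1 is None or n2 is None:
        return 0  # abstain when either mention has no last name
    return 1 if (n1, n2) in last_names or (n2, n1) in last_names else 0

print(lf_last_names("Jane Doe", "John Doe", toy_last_names))
print(lf_last_names("Alice Jones", "Bob Smith", toy_last_names))
print(lf_last_names("Cher", "John Doe", toy_last_names))
```

Matching on last names alone will likely fire more often than the full-name lookup, at the cost of more false positives, which is the usual trade-off with distant supervision.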

## Evaluating Labeling Functions

### Individual LF Statistics
A quick way to check a labeling function is to run it on the development set (or any other split) without saving its labels to the database. For example, we can collect every candidate to which this LF assigns a non-zero label (for this LF, every candidate it labels as true):

In [None]:
def eval_lf(lf, split, gold=None):
    labeled = []
    cands = session.query(Spouse).filter(Spouse.split == split).order_by(Candidate.id).all()
    for i,c in enumerate(cands):
        if lf(c) != 0:
            if gold is not None and gold.size != 0:
                labeled.append((c, gold[i,0]))
            else:
                labeled.append(c)
    print("Number labeled:", len(labeled))
    return labeled

In [None]:
labeled = eval_lf(LF_distant_supervision, 1)

We can then easily put this into the Viewer as usual (try it out!):
```
SentenceNgramViewer(labeled, session)
```
to see individual candidates

In [None]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)

or we can view candidates en masse, together with their gold labels.
WARNING: this is slow for very large candidate sets, so use with caution!

In [None]:
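# NOTE: display_candidate is not defined or imported in this notebook; it is
# assumed to be provided by the tutorial's helper code.
for c, label in eval_lf(LF_distant_supervision, 1, L_gold_dev):
    display_candidate(c, label=label)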

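We can also compare the LF's output against the gold labels on the development set with `test_LF`, which returns the true positives, false positives, true negatives, and false negatives: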

In [None]:
from snorkel.lf_helpers import test_LF
tp, fp, tn, fn = test_LF(session, LF_distant_supervision, split=1, annotator_name='gold')
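To show what these four groups mean, here is a minimal sketch, independent of Snorkel, that counts them from a list of LF outputs and a list of gold labels. It assumes gold labels are `1` or `-1` and treats an abstain (`0`) as a negative prediction; that treatment is an assumption made for illustration, not a statement of how `test_LF` handles abstentions. The function name `confusion_counts` and the sample data are hypothetical.

```python
def confusion_counts(lf_outputs, gold_labels):
    """Count true/false positives and negatives, treating any output other than 1 as negative."""
    n_tp = n_fp = n_tn = n_fn = 0
    for out, gold in zip(lf_outputs, gold_labels):
        predicted_positive = (out == 1)
        actually_positive = (gold == 1)
        if predicted_positive and actually_positive:
            n_tp += 1
        elif predicted_positive:
            n_fp += 1
        elif actually_positive:
            n_fn += 1
        else:
            n_tn += 1
    return n_tp, n_fp, n_tn, n_fn

# Made-up LF outputs and gold labels for five candidates
lf_outputs = [1, 1, 0, 0, 1]
gold_labels = [1, -1, 1, -1, 1]

n_tp, n_fp, n_tn, n_fn = confusion_counts(lf_outputs, gold_labels)
precision = n_tp / float(n_tp + n_fp) if (n_tp + n_fp) else 0.0
recall = n_tp / float(n_tp + n_fn) if (n_tp + n_fn) else 0.0
print(n_tp, n_fp, n_tn, n_fn, precision, recall)
```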

## 2. Applying the Labeling Functions

Next, we run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. We'll do this using the `LabelAnnotator` class, a UDF which we will again run with `UDFRunner`. **Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.** We start by grouping the labeling functions into a list and setting up the class:

In [None]:
LFs = [
    LF_distant_supervision
]

In [None]:
from snorkel.annotations import LabelAnnotator
labeler = LabelAnnotator(lfs=LFs)

Finally, we run the `labeler`. We set a random seed for reproducibility in case any LF uses a random number generator. Again, this can be run in parallel, provided that an appropriate database such as Postgres is being used:

In [None]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train
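Conceptually, the result is a matrix with one row per candidate and one column per LF. The sketch below builds such a matrix as a plain list of lists, without Snorkel or a database. The two toy LFs, the candidate tuples, and the helper `apply_lfs` are all hypothetical, and Snorkel itself stores the result as a sparse matrix rather than nested lists.

```python
# Made-up knowledge base and candidates, each candidate a (person1, person2) pair
known_pairs = {("Alice Jones", "Bob Smith")}
candidates = [
    ("Jane Doe", "John Doe"),
    ("Alice Jones", "Bob Smith"),
    ("Ann Lee", "Tom Ray"),
]

def lf_known_pair(c):
    # Positive label if the pair is in the knowledge base in either order
    return 1 if c in known_pairs or (c[1], c[0]) in known_pairs else 0

def lf_different_last_name(c):
    # Toy negative heuristic: different last names suggest "not spouses"
    return -1 if c[0].split(' ')[-1] != c[1].split(' ')[-1] else 0

def apply_lfs(lfs, candidates):
    """Return (keys, matrix): LF names and one row of labels per candidate."""
    keys = [lf.__name__ for lf in lfs]
    matrix = [[lf(c) for lf in lfs] for c in candidates]
    return keys, matrix

keys, L_toy = apply_lfs([lf_known_pair, lf_different_last_name], candidates)
print(keys)
for row in L_toy:
    print(row)
```

Most entries of a real label matrix are zero because each LF abstains on most candidates, which is why a sparse representation is a natural fit.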

If we've already created the labels (saved in the database), we can load them in as a sparse matrix here too:

In [None]:
L_train = labeler.load_matrix(session, split=0)
L_train

Note that the returned matrix is a subclass of `scipy.sparse.csr_matrix` with some additional features, which we demonstrate below:

In [None]:
L_train.get_candidate(session, 0)

In [None]:
L_train.get_key(session, 0)

We can also view statistics about the resulting label matrix.

* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a *conflicting* non-zero label for.

In [None]:
L_train.lf_stats(session)
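The three statistics defined above can also be computed by hand. The following is a minimal sketch on a small dense matrix written as nested lists; it follows the definitions in the list above and is not Snorkel's implementation. The function name `lf_summary` and the toy matrix are hypothetical.

```python
def lf_summary(L):
    """Compute coverage, overlap, and conflict for each column (LF) of a dense label matrix."""
    n_cands = float(len(L))
    n_lfs = len(L[0])
    summary = []
    for j in range(n_lfs):
        covered = overlapped = conflicted = 0
        for row in L:
            if row[j] == 0:
                continue  # LF j abstains on this candidate
            covered += 1
            # Non-zero labels emitted by the other LFs for the same candidate
            others = [row[k] for k in range(n_lfs) if k != j and row[k] != 0]
            if others:
                overlapped += 1
            if any(v != row[j] for v in others):
                conflicted += 1
        summary.append({
            "coverage": covered / n_cands,
            "overlap": overlapped / n_cands,
            "conflict": conflicted / n_cands,
        })
    return summary

# Toy matrix: 4 candidates (rows) x 2 LFs (columns), labels in {-1, 0, 1}
L_toy = [
    [1, 1],
    [1, -1],
    [0, 1],
    [0, 0],
]
stats = lf_summary(L_toy)
for j, s in enumerate(stats):
    print(j, s)
```

With a single LF, as in this notebook, overlap and conflict are necessarily zero; they become informative once more LFs are added to the list.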