<img align="left" src="imgs/logo.jpg" width="50px" style="margin-right:10px">

# Snorkel Workshop: Extracting Spouse Relations <br> from the News
## Part 2: Writing  Labeling Functions

In Snorkel, our primary interface through which we provide training signal to the end extraction model we are training is by writing **labeling functions (LFs)** (as opposed to hand-labeling massive training sets).  We'll go through some examples for our spouse classification task below.

A labeling function isn't anything special. It's just a Python function that accepts a candidate, or a row of the DataFrame, as the input argument and returns `1` if it says the pair of persons in the candidate were married at some point,  `-1` if the pair of persons in the candidate were never married, and `0` if it doesn't know how to vote and abstains. In practice, many labeling functions are unipolar: it labels only `1`s and `0`s, or it labels only `-1`s and `0`s.

Recall that our goal is to ultimately train a high-performance classification model that predicts which of our candidates are true spouse relations.  It turns out that we can do this by writing potentially low-quality labeling functions!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import re
import sys

import numpy as np
import pandas as pd
import scipy.sparse as sp

##  I. Background

### Preprocessing the Database

In a real application, there is a lot of data preparation, parsing, and database loading that needs to be completed before we dive into writing labeling functions. Here we've pre-generated candidates in a pandas DataFrame object per split (train,dev,test).

###  Using a _Development Set_ of Human-labeled Data

In our setting, we will use the phrase _development set_ to refer to a set of examples (here, a subset of our training set) which we label by hand and use to help us develop and refine labeling functions.  Unlike the _test set_, which we do not look at and use for final evaluation, we can inspect the development set while writing labeling functions. This is a list of `{-1,1}` labels.

In [2]:
import pickle

with open('fast_dev_data.pkl', 'rb') as f:
    dev_data = pickle.load(f)
    dev_labels = pickle.load(f)
    
with open('fast_train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)
    train_labels = pickle.load(f)

### Labeling Function Metrics

#### Coverage
One simple metric we can compute quickly is our _coverage_, the number of candidates labeled by our LF, on our training set (or any other set).

#### Precision / Recall / F1
If we have gold labeled data, we can also compute standard precision, recall, and F1 metrics for the output of a single labeling function. These metrics are computed over 4 _error buckets_: _True Positives_ (tp), _False Positives_ (fp), _True Negatives_ (tn), and _False Negatives_ (fn).

\begin{equation*}
precision = \frac{tp}{(tp + fp)}
\end{equation*}

\begin{equation*}
recall = \frac{tp}{(tp + fn)}
\end{equation*}

\begin{equation*}
F1 = 2 \cdot \frac{ (precision \cdot recall)}{(precision + recall)}
\end{equation*}

# II. Labeling Functions

## A. Pattern Matching Labeling Functions

One powerful form of labeling function design is defining sets of keywords or regular expressions that, as a human labeler, you know are correlated with the true label. In the terminology of [Bayesian inference](https://en.wikipedia.org/wiki/Statistical_inference#Bayesian_inference), this can be thought of as defining a [_prior_](https://en.wikipedia.org/wiki/Prior_probability) over your word features. 

For example, we could define a dictionary of terms that occur between person names in a candidate. One simple dictionary of terms indicating a true relation could be:
    
    spouses = {'husband', 'wife'}
 
We can then write a labeling function that checks for a match with these terms in the text that occurs between person names. To access the text between the person mentions, we can use the `get_between_tokens` preprocessors from the previous notebook.

    @labeling_function(resources=dict(spouses=spouses), preprocessors=[get_between_tokens])
    def LF_husband_wife(x: DataPoint, spouses: List[str]) -> int:
        return 1 if len(spouses.intersection(set(x.between_tokens))) > 0 else 0
        
The idea is that we can easily create dictionaries that encode themes or categories descibing all kinds of relationships between 2 people and then use these objects to _weakly supervise_ our classification task.

    other_relationship = {'boyfriend', 'girlfriend'}
    
**IMPORTANT** Good labeling functions manage a trade-off between high coverage and high precision. When constructing your dictionaries, think about building larger, noiser sets of terms instead of relying on 1 or 2 keywords. Sometimes a single word can be very predictive (e.g., `ex-wife`) but it's almost always better to define something more general, such as a regular expression pattern capturing _any_ string with the `ex-` prefix. 

In [3]:
from typing import List

from snorkel.labeling.apply import PandasLFApplier
from snorkel.labeling.lf import labeling_function
from snorkel.types import DataPoint

from spouse_preprocessors import get_between_tokens, get_left_tokens, get_right_tokens, \
get_person_last_names, get_person_text
    

spouses = {'spouse', 'wife', 'husband', 'ex-wife', 'ex-husband'}
@labeling_function(resources=dict(spouses=spouses), preprocessors=[get_between_tokens])
def LF_husband_wife(x: DataPoint, spouses: List[str]) -> int:
    return 1 if len(spouses.intersection(set(x.between_tokens))) > 0 else 0

@labeling_function(resources=dict(spouses=spouses), preprocessors=[get_left_tokens])
def LF_husband_wife_left_window(x: DataPoint, spouses: List[str]) -> int:
    if len(set(spouses).intersection(set(x.person1_left_tokens))) > 0:
        return 1
    elif len(set(spouses).intersection(set(x.person2_left_tokens))) > 0:
        return 1
    else:
        return 0

@labeling_function()
def LF_same_last_name(x: DataPoint) -> int:
    p1_ln, p2_ln = get_person_last_names(x)
    
    if p1_ln and p2_ln and p1_ln == p2_ln:
        return 1
    return 0

@labeling_function(preprocessors=[get_between_tokens, get_right_tokens])
def LF_and_married(x: DataPoint) -> int:
    return 1 if 'and' in x.between_tokens and 'married' in x.person2_right_tokens else 0    


family = ['father', 'mother', 'sister', 'brother', 'son', 'daughter',
              'grandfather', 'grandmother', 'uncle', 'aunt', 'cousin']
family = set(family+[f + '-in-law' for f in family])

@labeling_function(resources=dict(family=family), preprocessors=[get_between_tokens])
def LF_familial_relationship(x: DataPoint, family: List[str]) -> int:
    return 1 if len(family.intersection(set(x.between_tokens))) > 0 else 0  


@labeling_function(resources=dict(family=family), preprocessors=[get_left_tokens])
def LF_family_left_window(x: DataPoint, family: List[str]) -> int:
    if len(set(family).intersection(set(x.person1_left_tokens))) > 0:
        return -1
    elif len(set(family).intersection(set(x.person2_left_tokens))) > 0:
        return -1
    else:
        return 0

other = {'boyfriend', 'girlfriend' 'boss', 'employee', 'secretary', 'co-worker'}
@labeling_function(resources=dict(other=other), preprocessors=[get_between_tokens])
def LF_other_relationship(x: DataPoint, other: List[str]) -> int:
    return -1 if len(other.intersection(set(x.between_tokens))) > 0 else 0

In [4]:
applier = PandasLFApplier([LF_husband_wife,
                           LF_husband_wife_left_window,
                           LF_same_last_name,
                           LF_and_married, 
                           LF_familial_relationship,
                           LF_family_left_window,
                           LF_other_relationship])
L = applier.apply(dev_data)

100%|██████████| 2811/2811 [00:25<00:00, 111.46it/s]


**Viewing Performance Metrics**
If we have gold labeled data, we can evaluate formal metrics. It's useful to view specific error metrics for a particular LF. Below, we'll compute our empirical scores using human-labeled development set data and then look at performance metrics for `LF_husband_wife` LF. Notice the precision and recall are both only around 35\%!

In [5]:
from snorkel.model.metrics import coverage_score, f1_score, precision_score, recall_score

print("LF_husband_wife coverage: \t", coverage_score(dev_labels,L[:,0]))
print("LF_husband_wife F1 score:  \t", f1_score(dev_labels,L[:,0]))
print("LF_husband_wife precision:  \t", precision_score(dev_labels,L[:,0]))
print("LF_husband_wife F1 recall:  \t", recall_score(dev_labels,L[:,0]))

LF_husband_wife coverage: 	 0.08964781216648879
LF_husband_wife F1 score:  	 0.4208144796380091
LF_husband_wife precision:  	 0.36904761904761907
LF_husband_wife F1 recall:  	 0.48947368421052634


We can also look at the statistics for the LFs over the entire dev set.

In [6]:
from snorkel.labeling.analysis import lf_summary
lf_summary(L, Y=dev_labels)

Unnamed: 0,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
0,1,0.089648,0.03593,0.005336,93,0,0.369048
1,1,0.025258,0.020633,0.001067,30,0,0.422535
2,1,0.040555,0.01423,0.003557,95,0,0.166667
3,1,0.001423,0.0,0.0,2,0,0.5
4,1,0.115617,0.049449,0.032729,15,0,0.046154
5,-1,0.041266,0.03344,0.041266,2,0,0.948276
6,-1,0.007115,0.001067,0.007115,3,0,0.7


## B. Distant Supervision Labeling Functions

In addition to using factories that encode pattern matching heuristics, we can also write labeling functions that _distantly supervise_ examples. Here, we'll load in a list of known spouse pairs and check to see if the pair of persons in a candidate matches one of these.

**DBpedia**
http://wiki.dbpedia.org/
Out database of known spouses comes from DBpedia, which is a community-driven resource similar to Wikipedia but for curating structured data. We'll use a preprocessed snapshot as our knowledge base for all labeling function development.

We can look at some of the example entries from DBPedia and use them in a simple distant supervision labeling function.

Make sure `dbpedia.pkl` is in the `tutorials/workshop/` directory. 

In [7]:
import pickle 

with open('dbpedia.pkl', 'rb') as f:
     known_spouses = pickle.load(f)
        
list(known_spouses)[0:5]

[('Hank Azaria', 'Helen Hunt'),
 ('Guion Griffis Johnson', 'Guy Benton Johnson'),
 ('Joan Blackman', 'Joby Baker'),
 ('Adele Jergens', 'Glenn Langan'),
 ('Ahmad Tejan Kabbah', 'Patricia Kabbah')]

In [8]:
@labeling_function(resources=dict(known_spouses=known_spouses))
def LF_distant_supervision(x: DataPoint, known_spouses: List[str]) -> int:
    p1, p2 = get_person_text(x)
    return 1 if (p1, p2) in known_spouses or (p2, p1) in known_spouses else 0


# Helper function to get last name
def last_name(s):
    name_parts = s.split(' ')
    return name_parts[-1] if len(name_parts) > 1 else None 

# Last name pairs for known spouses
last_names = set([(last_name(x), last_name(y)) for x, y in known_spouses if last_name(x) and last_name(y)])

@labeling_function(resources=dict(last_names=last_names))
def LF_distant_supervision_last_names(x: DataPoint, last_names: List[str]) -> int:
    p1_ln, p2_ln = get_person_last_names(x)
    
    return 1 if (p1_ln != p2_ln) and ((p1_ln, p2_ln) in last_names or (p2_ln, p1_ln) in last_names) else 0 

In [9]:
applier = PandasLFApplier([LF_husband_wife,
                           LF_husband_wife_left_window,
                           LF_same_last_name,
                           LF_and_married, 
                           LF_familial_relationship,
                           LF_family_left_window,
                           LF_other_relationship,
                           LF_distant_supervision,
                           LF_distant_supervision_last_names])

In [10]:
dev_L = applier.apply(dev_data)
with open('dev_L.pkl', 'wb') as f:
    pickle.dump(dev_L, f)
    
train_L = applier.apply(train_data)
with open('train_L.pkl', 'wb') as f:
    pickle.dump(train_L, f)

100%|██████████| 2811/2811 [00:27<00:00, 101.80it/s]
100%|██████████| 22254/22254 [03:15<00:00, 114.03it/s]


## C. Writing Custom Labeling Functions

The strength of LFs is that you can write any arbitrary function and use it to supervise a classification task. This approach can combine many of the same strategies discussed above or encode other information. 

For example, we observe that when mentions of person names occur far apart in a sentence, this is a good indicator that the candidate's label is False. You can write a labeling function that uses preprocessors `get_text_between` or `get_between_tokens` to write such an LF!

In [11]:
@labeling_function()
def LF_new(x: DataPoint) -> int:
    return 1 if x.person1_word_idx[0] > 3 else 0 #TODO: Change this!

applier = PandasLFApplier([LF_new])

In [12]:
new_dev_L = applier.apply(dev_data)
sp.hstack((dev_L, new_dev_L), format='csr')
with open('dev_L.pkl', 'wb') as f:
    pickle.dump(dev_L, f)
    
new_train_L = applier.apply(train_data)
sp.hstack((train_L, new_train_L), format='csr')
with open('train_L.pkl', 'wb') as f:
    pickle.dump(train_L, f)

100%|██████████| 2811/2811 [00:00<00:00, 34482.00it/s]
100%|██████████| 22254/22254 [00:00<00:00, 33445.38it/s]
