<img align="left" src="imgs/logo.jpg" width="50px" style="margin-right:10px">

# Snorkel Workshop 
## Part 1: Snorkel API

We will walk through an example text classification task to explore how Snorkel works with user-defined labeling functions (LFs).

### Data Download

Make sure `train.pkl`, `dev.pkl`, and `test.pkl` are in the `tutorials/workshop/` directory.

If you do not have these files, run:

    wget https://www.dropbox.com/s/jmrvyaqew4zp9cy/spouse_data.zip
    unzip spouse_data.zip

### Classification Task
<img src="imgs/sentence.jpg" width="700px">

We want to classify each __candidate__, a pair of people mentioned in a sentence, as having been married at some point or not.

In the above example, our candidate represents the possible relation `(Barack Obama, Michelle Obama)`. As readers, we know this relation is true from external knowledge and from the keyword `wedding` occurring later in the sentence.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import pickle

import pandas as pd

In [2]:
# Each file holds two pickled objects in order: the data, then the labels.
with open('fast_dev_data.pkl', 'rb') as f:
    dev_data = pickle.load(f)
    dev_labels = pickle.load(f)
    
with open('fast_train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)
    train_labels = pickle.load(f)

`train_data` and `dev_data` are Pandas DataFrame objects, where each row represents a particular __candidate__. The DataFrames contain the following fields: `sentence`, the sentence the candidate appears in; `tokens`, the tokenized form of the sentence; and `person1_word_idx` and `person2_word_idx`, the word indices in `tokens` at which the first and second person's names appear, respectively.
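
To make the field layout concrete, here is a minimal sketch that uses a plain dictionary in place of a DataFrame row. The toy sentence and the helper `span_text` are hypothetical, and the sketch assumes that each index pair is an inclusive `(start, end)` range of token positions.

```python
# Hypothetical toy candidate; a plain dict stands in for one DataFrame row.
candidate_row = {
    "sentence": "Alice Smith married Bob Jones in 2010.",
    "tokens": ["Alice", "Smith", "married", "Bob", "Jones", "in", "2010", "."],
    "person1_word_idx": (0, 1),  # tokens 0..1 -> "Alice Smith"
    "person2_word_idx": (3, 4),  # tokens 3..4 -> "Bob Jones"
}


def span_text(tokens, span):
    """Join the tokens covered by an inclusive (start, end) span."""
    start, end = span
    return " ".join(tokens[start:end + 1])


person1 = span_text(candidate_row["tokens"], candidate_row["person1_word_idx"])
person2 = span_text(candidate_row["tokens"], candidate_row["person2_word_idx"])
print(person1, "|", person2)
```

If the spans in your data turn out to be end-exclusive, drop the `+ 1` in `span_text`.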

In [3]:
dev_data.head()

| | person1_word_idx | person2_word_idx | sentence | tokens |
|---|---|---|---|---|
| 0 | (1, 1) | (22, 24) | The Richards are half-sisters to Kathy Hilton,... | [The, Richards, are, half, -, sisters, to, Kat... |
| 1 | (1, 1) | (7, 8) | The Richards are half-sisters to Kathy Hilton,... | [The, Richards, are, half, -, sisters, to, Kat... |
| 2 | (7, 8) | (22, 24) | The Richards are half-sisters to Kathy Hilton,... | [The, Richards, are, half, -, sisters, to, Kat... |
| 3 | (6, 6) | (20, 21) | Prior to both his guests, Colbert's monologue ... | [Prior, to, both, his, guests, ,, Colbert, s, ... |
| 4 | (2, 2) | (4, 5) | People reported Williams and Ven Veen tied the... | [People, reported, Williams, and, Ven, Veen, t... |


You'll interact with these candidates while writing labeling functions in Snorkel. Let's look at a candidate in the development set:

In [4]:
from spouse_preprocessors import get_person_text

candidate = dev_data.loc[12]
person_names = get_person_text(candidate)

print("Sentence:   \t", candidate['sentence'])
print("Person 1:   \t", person_names[0])
print("Person 2:   \t", person_names[1])

Sentence:   	 Nutter has made clear his plans to take advantage of his position of prominence, both as Philadelphia mayor and as honorary co-chair of the World Meeting of Families, to advance the gay agenda when the Pope visits — and the faithful are left wondering why Chaput continues to refuse to do anything to remedy this,” Church Militant’s Christine Niles wrote September 22   .   
Person 1:   	 Nutter
Person 2:   	 Christine Niles


### Labeling Function Helpers

When writing labeling functions, there are several operations you will use over and over again: fetching text between mentions of the two people in a candidate, examining word windows around person mentions, etc.

We provide several helper functions in `spouse_preprocessors`: these are Python functions that you can apply to candidates in the DataFrame to return objects that are helpful during LF development. You can (and should!) write your own helper functions to help write LFs.

We provide an example of a preprocessor definition and usage here:

In [5]:
from snorkel.labeling.preprocess import Preprocessor, PreprocessorMode, preprocessor
from snorkel.types import FieldMap

@preprocessor
def get_text_between(tokens, person1_word_idx, person2_word_idx) -> FieldMap:
    """
    Returns the text between the two person mentions in the sentence for a candidate
    """
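    # Index pairs are inclusive (start, end) token positions, so the text
    # between the mentions starts one token after person 1 ends.
    # This assumes person 1 appears before person 2 in the sentence.
    start = person1_word_idx[1] + 1
    end = person2_word_idx[0]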
    text_between = ' '.join(tokens[start:end])
    return dict(text_between=text_between)

In [6]:
get_text_between.set_mode(PreprocessorMode.PANDAS)
get_text_between(candidate).text_between

'has made clear his plans to take advantage of his position of prominence , both as Philadelphia mayor and as honorary co - chair of the World Meeting of Families , to advance the gay agenda when the Pope visits — and the faithful are left wondering why Chaput continues to refuse to do anything to remedy this , ” Church Militant ’s'
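
The slicing logic above can be tried without Snorkel or the workshop data. The sketch below re-implements it as a plain function on a hypothetical toy sentence, then feeds the result into a keyword check of the kind a labeling function might perform. The names `text_between`, `keyword_vote`, `SPOUSE_KEYWORDS`, `POSITIVE`, and `ABSTAIN`, as well as the label values, are assumptions for illustration and are not part of the Snorkel API.

```python
# Hypothetical label values; the workshop may use different ones.
POSITIVE = 1
ABSTAIN = 0

# Hypothetical keyword list for illustration.
SPOUSE_KEYWORDS = {"married", "wife", "husband", "wedding"}


def text_between(tokens, person1_word_idx, person2_word_idx):
    """Join the tokens strictly between two inclusive (start, end) spans.

    Like the preprocessor above, this assumes person 1 precedes person 2.
    """
    start = person1_word_idx[1] + 1
    end = person2_word_idx[0]
    return " ".join(tokens[start:end])


def keyword_vote(tokens, person1_word_idx, person2_word_idx):
    """Vote POSITIVE if a spouse keyword occurs between the mentions."""
    between = text_between(tokens, person1_word_idx, person2_word_idx)
    words = set(between.lower().split())
    return POSITIVE if words & SPOUSE_KEYWORDS else ABSTAIN


tokens = ["Alice", "Smith", "married", "Bob", "Jones", "in", "2010", "."]
print(text_between(tokens, (0, 1), (3, 4)))
print(keyword_vote(tokens, (0, 1), (3, 4)))
```

Keeping the text extraction separate from the vote mirrors the split between preprocessors and labeling functions: the same extracted text can be reused by several different checks.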

### Candidate Preprocessors

We provide a set of helper functions for this task in `spouse_preprocessors.py` that take a candidate as input (in our case, a row of a DataFrame).

`get_between_tokens(candidate)`

`get_left_tokens(candidate)`

`get_right_tokens(candidate)`

In [7]:
from spouse_preprocessors import get_right_tokens

get_right_tokens.set_mode(PreprocessorMode.PANDAS)
get_right_tokens(candidate)

person1_word_idx                                                   (0, 0)
person2_word_idx                                                 (64, 65)
sentence                Nutter has made clear his plans to take advant...
tokens                  [Nutter, has, made, clear, his, plans, to, tak...
person1_right_tokens                              [has, made, clear, his]
person2_right_tokens                             [wrote, September, 22, ]
Name: 12, dtype: object
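
The output above suggests that `get_right_tokens` adds, for each person, a short window of tokens immediately to the right of the mention. A minimal sketch of such a windowing helper follows. The function `right_window` and its default window size of 4 are assumptions inferred from that output, not the actual implementation in `spouse_preprocessors.py`.

```python
def right_window(tokens, span, window=4):
    """Return up to `window` tokens that follow an inclusive (start, end) span."""
    end = span[1]
    return tokens[end + 1:end + 1 + window]


# Hypothetical toy sentence.
tokens = ["Alice", "Smith", "married", "Bob", "Jones", "in", "2010", "."]
print(right_window(tokens, (0, 1)))  # tokens after "Alice Smith"
print(right_window(tokens, (3, 4)))  # tokens after "Bob Jones"
```

Near the end of a sentence the slice simply returns fewer tokens, so no bounds check is needed.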