<img align="left" src="imgs/logo.jpg" width="50px" style="margin-right:10px">

# Snorkel Workshop 
## Part 1: Snorkel API

We will walk through an example text classification task to explore how Snorkel works with user-defined labeling functions (LFs). In this notebook and the ones that follow, run every cell (unless otherwise noted) before proceeding to the next notebook.

### Data Download

Make sure `train.pkl`, `dev.pkl`, and `test.pkl` are in the `tutorials/workshop/` directory.

If you do not have these files, run:

    wget https://www.dropbox.com/s/jmrvyaqew4zp9cy/spouse_data.zip
    unzip spouse_data.zip

### Classification Task
<img src="imgs/sentence.jpg" width="700px;">

We want to classify each __candidate__, that is, each pair of people mentioned in a sentence, as having been married at some point or not.

In the above example, our candidate represents the possible relation `(Barack Obama, Michelle Obama)`. As readers, we know this relation holds because of external knowledge and because the keyword `wedding` occurs later in the sentence.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import pickle
import pandas as pd

In [2]:
with open('new_dev_data.pkl', 'rb') as f:
    dev_data = pickle.load(f)
    dev_labels = pickle.load(f)
    
with open('new_train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)
    train_labels = pickle.load(f)

**Input Data:** `train_data` and `dev_data` are Pandas DataFrame objects in which each row represents a particular __candidate__. The DataFrames contain the following fields: `sentence`, the sentence the candidate appears in; `tokens`, the tokenized form of the sentence; and `person1_word_idx` and `person2_word_idx`, the `[start, end]` indices in the tokens at which the first and second person's names appear, respectively.

For the purposes of this tutorial, the data also includes certain **preprocessed fields**, which we discuss a few cells below.

In [3]:
dev_data[5:10]

| | person1_word_idx | person2_word_idx | sentence | tokens | person1_right_tokens | person2_right_tokens | between_tokens |
|---|---|---|---|---|---|---|---|
| 5 | (3, 4) | (8, 9) | Us Weekly reported Andy Cohen and Hanks’ wife ... | [Us, Weekly, reported, Andy, Cohen, and, Hanks... | [and, Hanks’, wife, Rita] | [,, who, plays, Williams’] | [and, Hanks’, wife] |
| 6 | (0, 0) | (2, 3) | Williams and Ven Veen arrived at the ranch ear... | [Williams, and, Ven, Veen, arrived, at, the, r... | [and, Ven, Veen, arrived] | [arrived, at, the, ranch] | [and] |
| 7 | (44, 45) | (51, 51) | As Agrecovery now offers a recycling solution ... | [As, Agrecovery, now, offers, a, recycling, so... | [, With, FIL, joining] | [is, now, supported, by] | [, With, FIL, joining, ,] |
| 8 | (11, 12) | (32, 32) | Joining him on the red carpet was his co-star ... | [Joining, him, on, the, red, carpet, was, his,... | [,, 26, ,, who] | [s, ,, with, the] | [,, 26, ,, who, looked, chic, in, a, very, fem... |
| 9 | (0, 0) | (4, 4) | Richards and her sister Kyle have been regular... | [Richards, and, her, sister, Kyle, have, been,... | [and, her, sister, Kyle] | [have, been, regular, cast] | [and, her, sister] |
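
To make the index convention concrete, here is a minimal sketch based on row 6 of the output above. It assumes the word indices are inclusive on both ends (consistent with `(0, 0)` for the single-token mention `Williams`) and uses a plain dict in place of a DataFrame row. `span_text` and `right_window` are hypothetical helpers written for this illustration, not functions from `spouse_preprocessors`, and the token list is cut off at the last token visible in the output.

```python
# A plain dict standing in for one DataFrame row (row 6 above, tokens truncated).
row = {
    "tokens": ["Williams", "and", "Ven", "Veen", "arrived", "at", "the", "ranch"],
    "person1_word_idx": (0, 0),
    "person2_word_idx": (2, 3),
}

def span_text(tokens, word_idx):
    """Join the tokens of a mention, treating (start, end) as inclusive."""
    start, end = word_idx
    return " ".join(tokens[start:end + 1])

def right_window(tokens, word_idx, size=4):
    """Return up to `size` tokens immediately after a mention (size is an assumption)."""
    end = word_idx[1]
    return tokens[end + 1:end + 1 + size]

person1 = span_text(row["tokens"], row["person1_word_idx"])
person2 = span_text(row["tokens"], row["person2_word_idx"])

# Tokens strictly between the end of person 1 and the start of person 2.
between = row["tokens"][row["person1_word_idx"][1] + 1:row["person2_word_idx"][0]]

print(person1)   # Williams
print(person2)   # Ven Veen
print(between)   # ['and']
print(right_window(row["tokens"], row["person1_word_idx"]))
```

Under these assumptions the results agree with the `between_tokens` and `person1_right_tokens` values shown for row 6, which suggests the preprocessed window fields hold the four tokens that follow each mention.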


You'll interact with these candidates while writing labeling functions in Snorkel. Let's look at one candidate from the development set:

In [4]:
from spouse_preprocessors import get_person_text

candidate = dev_data.loc[112]
person_names = get_person_text(candidate)

print("Sentence:   \t", candidate['sentence'])
print("Person 1:   \t", person_names[0])
print("Person 2:   \t", person_names[1])

Sentence:   	 The engagement between Josh, who was previously married to Diane Lane, and his  former assistant was revealed in May, after a two year romance.   
Person 1:   	 Josh
Person 2:   	 Diane Lane


### Labeling Function Helpers

When writing labeling functions, there are several operations you will use over and over again: fetching the text between mentions of the two people in a candidate, examining word windows around person mentions, etc.

We provide several helper functions in `spouse_preprocessors`. These are Python functions that you can apply to candidates in the DataFrame to return objects that are useful during LF development. You can (and should!) write your own helper functions as you develop LFs.

We provide an example of a preprocessor definition and usage here:

In [5]:
from snorkel.labeling.preprocess import Preprocessor, PreprocessorMode, preprocessor
from snorkel.types import FieldMap

@preprocessor
def get_text_between(tokens, person1_word_idx, person2_word_idx) -> FieldMap:
    """
    Returns the text between the two person mentions in the sentence for a candidate
    """
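    # Word indices are inclusive, so the span between the mentions starts one
    # token after person 1's last token and stops before person 2's first token.
    # If person 1 does not precede person 2, the slice (and the text) is empty.
    start = person1_word_idx[1] + 1
    end = person2_word_idx[0]
    text_between = ' '.join(tokens[start:end])
    return dict(text_between=text_between)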

In [6]:
get_text_between.set_mode(PreprocessorMode.PANDAS)
get_text_between(candidate).text_between

', who was previously married to'
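
A field like `text_between` is typically consumed inside a labeling function. The sketch below shows the idea in plain Python, without any Snorkel decorators. The label constants (`POSITIVE`, `ABSTAIN`), the keyword set, and the function name `lf_spouse_keyword_between` are illustrative assumptions and not part of the workshop code; check the label convention used in the later notebooks before reusing them.

```python
# Hypothetical label values, used for this sketch only.
POSITIVE = 1
ABSTAIN = -1

# Hypothetical keyword list; a real LF would likely use a longer one.
SPOUSE_KEYWORDS = {"married", "wife", "husband", "spouse"}

def lf_spouse_keyword_between(text_between):
    """Vote POSITIVE if a spouse keyword appears between the two mentions."""
    words = set(text_between.lower().split())
    return POSITIVE if words & SPOUSE_KEYWORDS else ABSTAIN

print(lf_spouse_keyword_between(", who was previously married to"))  # 1
print(lf_spouse_keyword_between("and"))                              # -1
```

Abstaining when no keyword is found, rather than voting negative, keeps the function from asserting anything about candidates it has no evidence for.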

### Candidate PreProcessors

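We provide a set of helper functions for this task in `spouse_preprocessors.py`. Each takes a candidate (in our case, a row of a DataFrame) as input. For the purposes of this tutorial, the outputs of two of these helpers are already stored as preprocessed fields in the data (the `between_tokens`, `person1_right_tokens`, and `person2_right_tokens` columns shown earlier):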

`get_between_tokens(candidate)`

`get_left_tokens(candidate)`

`get_right_tokens(candidate)`
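
The exact signatures and return values of these helpers are defined in `spouse_preprocessors.py`. As a rough illustration of what a left-window helper might compute, here is a plain-Python sketch. The name `left_window`, the default window size of 4, and the token list (an assumed tokenization of the start of the example sentence about Josh and Diane Lane) are all assumptions made for this example.

```python
# Assumed tokenization of the beginning of the example sentence.
tokens = ["The", "engagement", "between", "Josh", ",", "who", "was",
          "previously", "married", "to", "Diane", "Lane"]

def left_window(tokens, word_idx, size=4):
    """Return up to `size` tokens immediately before a mention.

    `word_idx` is an inclusive (start, end) pair. The lower bound is clamped
    at 0 because a negative slice start would count from the end of the list.
    """
    start = word_idx[0]
    return tokens[max(0, start - size):start]

print(left_window(tokens, (10, 11)))  # ['was', 'previously', 'married', 'to']
print(left_window(tokens, (3, 3)))    # ['The', 'engagement', 'between']
```

The second call shows the boundary case: only three tokens precede the mention, so the window is shorter than the requested size.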