# Snorkel Tutorial

### You will work with Wikipedia plot descriptions of films that are either comedy or drama.

First, import necessary libraries:

In [None]:
from data.preparer import load_film_dataset
from snorkel.labeling import labeling_function
from snorkel.labeling import LabelModel
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier
from analyzer import train_model
import re
import pandas as pd
import nltk
nltk.download("punkt")
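# Show full plot text instead of truncating it (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)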

## The Data

These movie plot descriptions are from [Kaggle](https://www.kaggle.com/jrobischon/wikipedia-movie-plots).
You will be labeling films as either "comedy" or "drama" based on their plot descriptions.

If you're not sure about the correct label, that's fine -- either make your best guess or just skip the example.

In [None]:
# The data should already be unzipped, so you can normally skip this cell.
# If it is not, replace PASSWORD with the archive password and uncomment the line below,
# or download the data directly from Kaggle.

#!unzip -P PASSWORD data/data.zip

Load the training, development, validation, and test splits:

In [None]:
df_train, df_dev, df_valid, df_test = load_film_dataset()
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

Define the labels for this task:

In [None]:
ABSTAIN = -1
DRAMA = 0
COMEDY = 1

Let's look at a few drama and comedy examples from the development set.

In [None]:
print("Some labeled examples: ")
display(df_dev[df_dev.label==DRAMA].sample(3))
display(df_dev[df_dev.label==COMEDY].sample(3))

## Writing Labeling Functions

__Your task for this tutorial is to write 5 labeling functions.__

Feel free to consult the internet or ask your experiment leader.

*(For the real task, you will be asked to write 10 labeling functions, as quickly and accurately as possible. You will still be allowed to use the internet in this phase, but not ask your experiment leader.)*

Each labeling function takes a single example `x` as input (its plot description is available as `x.text`) and returns `COMEDY`, `DRAMA`, or `ABSTAIN`.

In [None]:
@labeling_function()
def lf0(x):
    return DRAMA if "dying" in x.text.lower() else ABSTAIN
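A plain substring check such as `"dying" in x.text.lower()` also fires on words that merely contain the keyword ("studying" ends in "dying"). The sketch below shows one way to match whole words with a regular expression. The keyword lists and function names are hypothetical and chosen only for illustration, and the functions are left undecorated so the snippet runs without Snorkel. In the notebook you would add `@labeling_function()` above each one.

```python
import re
from types import SimpleNamespace

ABSTAIN, DRAMA, COMEDY = -1, 0, 1

# Hypothetical keyword lists; \b restricts matches to whole words.
COMEDY_PATTERN = re.compile(r"\b(hilarious|prank|pranks|slapstick)\b", re.IGNORECASE)
DRAMA_PATTERN = re.compile(r"\b(dying|grief|funeral)\b", re.IGNORECASE)

def lf_comedy_keywords(x):
    """Vote COMEDY if any comedy keyword appears as a whole word."""
    return COMEDY if COMEDY_PATTERN.search(x.text) else ABSTAIN

def lf_drama_keywords(x):
    """Vote DRAMA if any drama keyword appears as a whole word."""
    return DRAMA if DRAMA_PATTERN.search(x.text) else ABSTAIN

# Stand-in for a DataFrame row: any object with a `text` attribute works.
example = SimpleNamespace(text="She is studying when a prank goes wrong.")
print(lf_comedy_keywords(example))  # "prank" matches as a whole word
print(lf_drama_keywords(example))   # "studying" does not match "dying"
```

Whole-word matching trades some coverage for precision. Whether that trade is worthwhile depends on the keyword, so it is worth checking on the development set.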

Your turn! Try writing a function or editing the one above.

In [None]:
@labeling_function()
def lf1(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf2(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf3(x):
    return ABSTAIN

In [None]:
@labeling_function()
def lf4(x):
    return ABSTAIN

In [None]:
lfs = [lf0, lf1, lf2, lf3, lf4]

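Test your function (optional). `test_func` wraps a string in an object with a `text` attribute so that you can call a labeling function on it directly: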

In [None]:
from types import SimpleNamespace

def test_func(lf, example_text):
    x = SimpleNamespace(text=example_text)
    return lf(x)

In [None]:
test_func(lf0, "I'm not dying today")

## Applying Functions
To obtain training labels, we first apply every labeling function to every example, then train a model that combines their noisy outputs into a single label per example.
`L_train` and `L_dev` are matrices representing the label returned by each labeling function for each example in the training and development sets.
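To make the shape of such a matrix concrete, here is a minimal sketch that builds one by hand for three made-up plot summaries. It uses plain Python lists rather than `PandasLFApplier`, and the example texts and the `lf_wedding` function are hypothetical.

```python
from types import SimpleNamespace

ABSTAIN, DRAMA, COMEDY = -1, 0, 1

def lf_dying(x):
    return DRAMA if "dying" in x.text.lower() else ABSTAIN

def lf_wedding(x):  # hypothetical heuristic, for illustration only
    return COMEDY if "wedding" in x.text.lower() else ABSTAIN

texts = [
    "A dying man plans his daughter's wedding.",
    "Two friends crash a wedding.",
    "A detective hunts a thief.",
]
examples = [SimpleNamespace(text=t) for t in texts]
toy_lfs = [lf_dying, lf_wedding]

# One row per example, one column per labeling function.
L_toy = [[lf(x) for lf in toy_lfs] for x in examples]
for row in L_toy:
    print(row)
```

The first row contains two conflicting votes, the second a single vote, and the third only abstentions. These are the cases the label model has to reconcile.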

In [None]:
# Apply the LFs to the unlabeled training data, and the development data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

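Train the Snorkel label model to combine these noisy labels. With `tie_break_policy="abstain"`, examples for which the model cannot choose between the two classes are labeled `ABSTAIN` (-1).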

In [None]:
# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")
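For intuition about what "combining" means, the sketch below uses a simple majority vote over each row of a label matrix and abstains on ties. This is a deliberately simpler stand-in, not Snorkel's `LabelModel`, which instead learns how much to trust each labeling function. The toy matrix is made up.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Return the most common non-abstain label in a row, or ABSTAIN on a tie or no votes."""
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two labels have equal counts
    return ranked[0][0]

# Toy label matrix: 4 examples, 3 labeling functions.
L_toy = [
    [0, 1, 1],
    [0, -1, 1],
    [-1, -1, -1],
    [0, 0, -1],
]
toy_labels = [majority_vote(row) for row in L_toy]
print(toy_labels)  # [1, -1, -1, 0]
```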

## View Unlabeled Examples
Training examples that the label model left unlabeled (`ABSTAIN`) are a good source of ideas for new labeling functions. You can also try filtering or sorting them in other ways.

In [None]:
# You can filter for unlabeled data
df_unlabeled = df_train[df_train.label == ABSTAIN]
display(df_unlabeled.sample(5))
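One possible way to mine unlabeled examples for ideas is to count how many of them contain each word. The sketch below does this for a list of strings. The stopword list, the `top_words` helper, and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# Small hypothetical stopword list; extend as needed.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "his", "her", "with"}

def top_words(texts, n=5):
    """Return the n words that appear in the most texts, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z']+", text.lower()))  # count each word once per text
        counts.update(words - STOPWORDS)
    return counts.most_common(n)

sample_texts = [
    "The heist fails and the crew splits.",
    "A heist in Paris reunites the crew.",
    "An heiress plans a heist.",
]
print(top_words(sample_texts, n=2))
```

In the notebook, passing `df_unlabeled.text` instead of `sample_texts` should work the same way, assuming the plot descriptions are stored in the `text` column.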

## Analyze Results
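Summarize how each labeling function performs on the training and development sets. The development set has ground-truth labels, so its accuracy figures are real. The training "labels" were estimated by the label model above, so the training summary only shows how often each labeling function agrees with the label model.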

In [None]:
Y_train = df_train.label.values
train_analysis = LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=Y_train)
display("Training set results:", train_analysis)

In [None]:
Y_dev = df_dev.label.values
dev_analysis = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
display("Dev set results:", dev_analysis)
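Two of the quantities reported in these summaries are coverage (the fraction of examples on which a labeling function does not abstain) and empirical accuracy (the fraction of its non-abstain votes that match the given labels). The sketch below computes both by hand on a made-up label matrix. The helper `lf_stats` is hypothetical and not part of Snorkel.

```python
ABSTAIN = -1

def lf_stats(L, Y, j):
    """Return (coverage, empirical accuracy) for the labeling function in column j."""
    votes = [(row[j], y) for row, y in zip(L, Y) if row[j] != ABSTAIN]
    coverage = len(votes) / len(L)
    correct = sum(1 for vote, y in votes if vote == y)
    accuracy = correct / len(votes) if votes else float("nan")
    return coverage, accuracy

# Toy label matrix (4 examples, 2 labeling functions) and toy gold labels.
L_toy = [[0, 1], [0, -1], [-1, 1], [0, -1]]
Y_toy = [0, 1, 1, 0]

for j in range(2):
    coverage, accuracy = lf_stats(L_toy, Y_toy, j)
    print(f"LF {j}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

A labeling function with high accuracy but very low coverage labels few examples, while one with high coverage but accuracy near 0.5 adds mostly noise on a two-class task.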

## Save the Model
When you have finished writing your labeling functions, save the trained label model:

In [None]:
label_model.save("snorkel_tutorial_lfmodel.pkl")

## Train Model
We can train a simple bag-of-words model on these labels and check its accuracy on the test set.

(This step may take a while.)

In [None]:
train_model(label_model, df_train, df_valid, df_test, L_train)