# Introducing Snorkel

In this notebook we use Snorkel to enrich our data: tags with between 500 and 2,000 examples are labeled using weak supervision, producing enough labeled examples to train an accurate full model that includes these tags.

More information about Snorkel can be found at [Snorkel.org](https://www.snorkel.org/) :) For a basic introduction to Snorkel, see the [Spam Tutorial](http://syndrome:8888/notebooks/snorkel-tutorials/spam/01_spam_tutorial.ipynb). For an introduction to Multi-Task Learning (MTL), see [Multi-Task Tutorial](http://syndrome:8888/notebooks/snorkel-tutorials/multitask/multitask_tutorial.ipynb).

In [1]:
# Snorkel Introduction

from collections import OrderedDict
from glob import glob
import os
import random
import sys

# Turn off TensorFlow logging messages; this must be set before TensorFlow is imported
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import cupy
import dask.dataframe as dd
import numpy as np
import pandas as pd
import pyarrow
import snorkel
import spacy
import tensorflow as tf

# Add parent directory to path
parent_dir = os.path.dirname(os.getcwd())
sys.path.append(parent_dir)

# Make reproducible
random.seed(1337)

# For reproducibility in child processes; hash randomization for this
# kernel was already fixed when the interpreter started
os.environ["PYTHONHASHSEED"] = "1337"

In [2]:
TAG_LIMIT = 50

In [3]:
PATHS = {
    'questions': {
        'local': '../data/stackoverflow/Questions.Tags.{}.parquet/*.parquet',
        's3': 's3://stackoverflow-events/08-05-2019/Questions.Tags.{}.parquet/part-00000-93547d3c-1f08-40b5-923e-89b961d01fc2-c000.snappy.parquet',
    }
}

# Define a set of paths for each step for local and S3
PATH_SET = 'local' # 's3'

## Loading our Examples for Enrichment

In [4]:
path = PATHS['questions'][PATH_SET].format(TAG_LIMIT)

df = dd.read_parquet(
    path,
    engine='pyarrow',
)

In [5]:
pd.set_option('display.max_colwidth', 300)
df.sample(frac=0.001).head(5)

Unnamed: 0,_PostId,_AcceptedAnswerId,_Body,_Code,_Tags,_Label,_AnswerCount,_CommentCount,_FavoriteCount,_OwnerUserId,...,_AccountId,_UserId,_UserDisplayName,_UserDownVotes,_UserLocation,_ProfileImageUrl,_UserReputation,_UserUpVotes,_UserViews,_UserWebsiteUrl
108433,550148,550161.0,"Does python have something like C++'s using keyword? In C++ you can often drastically improve the readability of your code by careful usage of the ""using"" keyword, for example: \n\nbecomes\n\nDoes something similar exist for python, or do I have to fully qualify everything?\nI'll add the disclai...","void foo()\n{\n std::vector< std::map <int, std::string> > crazyVector;\n std::cout << crazyVector[0].begin()->first;\n}\n\nvoid foo()\n{\n using namespace std; // limited in scope to foo\n vector< map <int, string> > crazyVector;\n cout << crazyVector[0].begin()->first;\n}\n","[python, namespaces, using]",0,6,0,3.0,8123.0,...,5141.0,8123.0,Doug T.,196.0,"Charlottesville, VA United States",,47950.0,2012.0,3484.0,http://o19s.com/doug
91946,3165563,3165672.0,"Flexible logger class using standard streams in C++ i would like to create a flexible logger class. I want it to be able to output data to a file or to standard output. Also, i want to use streams. The class should look something like:\n\nThe is an enum and defines the formatting. What i want t...","class Logger\n{\nprivate:\n std::ostream m_out; // or ofstream, iostream? i don't know\npublic:\n\n void useFile( std::string fname);\n void useStdOut();\n\n void log( symbol_id si, int val );\n void log( symbol_id si, std::string str );\n //etc..\n};\n\nsymbol_id\nuse*\nm_out\nm_out...","[c++, logging, stream]",0,4,0,3.0,350605.0,...,142724.0,350605.0,PeterK,31.0,"Prague, Czech Republic",,4813.0,1242.0,452.0,
69371,1227310,1227535.0,"Question regarding Lucene scoring I have a question regarding Lucene scoring. I have two documents in the index, one contains ""my name"" and the other contains ""my first name"". When I search for the keyword ""my name"", the second document is listed above the first one. What I want is that if the d...",,[lucene],0,4,0,,150287.0,...,877061.0,150287.0,Truong Do,0.0,,,23.0,2.0,19.0,
144449,18996773,18996887.0,Make UIView constantly fade in and out - completion:^{ possible infinite loop? I've written the following code to make my UIView constantly fade in and out. (FadeAlphaValue is a BOOL)... \n\nIt works but I have a feeling it's going to cause some weird crash if I let it run forever... I'm not to ...,-(void) fade {\n [UIView animateWithDuration:1.0\n animations:^{\n fadeView.alpha = (int)fadeAlphaValue;\n }\n completion:^(BOOL finished){\n fadeAlphaValue=!fadeAlphaValue;\n ...,"[ios, objective-c, animation, memory-management]",0,1,2,1.0,2057171.0,...,405164.0,2057171.0,Albert Renshaw,64.0,"Atlanta, GA",https://i.stack.imgur.com/E9ThY.png,9237.0,1204.0,1784.0,http://www.Apps4Life.com
136543,12978913,12980193.0,"jquery visual website optimizer code I have a javascript code from visual website optimizer that will be placed on HTML header:\n\nbut I want to run this code only with this condition:\n\ni tried this but returns an error ""_vwo_code is not defined""\n\nPlease help me on how can I solve this? Than...","var _vwo_code=(function(){\nvar account_id=7237,\nsettings_tolerance=2000,\nlibrary_tolerance=1500,\nuse_existing_jquery=true,\n// DO NOT EDIT BELOW THIS LINE\nf=false,d=document;return{use_existing_jquery:function(){return use_existing_jquery;},library_tolerance:function(){return library_tolera...","[jquery, web, optimization]",0,1,0,,414930.0,...,181133.0,414930.0,scoohh,0.0,,,184.0,14.0,56.0,


## Enable spaCy GPU Support

That is, if you have a GPU. `spacy.prefer_gpu()` returns `True` when a GPU was activated and `False` when spaCy stays on the CPU.

In [6]:
spacy.prefer_gpu()

True

## Create spaCy `Docs` in a `Dask` Column

In [7]:
# Download the spaCy English model
spacy.cli.download('en_core_web_lg')

nlp = spacy.load("en_core_web_lg")

# df['_Lower_Body'] = df['_Body'].apply(lambda x: x.lower())
df['_SpacyDoc'] = df['_Body'].apply(lambda x: nlp(x), meta=('_Body', 'object'))

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [8]:
df.npartitions

30

In [None]:
df_sample = df.head(1000)

## Split the Data into Train/Test/Development Datasets

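We'll need to validate our labeling functions (LFs) in Snorkel, so we need train, test and __development__ datasets. The two-stage split below first holds out 30% of the sample, then assigns about two-thirds of that hold-out to test and the rest to development, giving roughly 70% train, 20% test and 10% development.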

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    df_sample, 
    df_sample['_Label'], 
    test_size=0.3,
    random_state=1337,
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev,
    y_test_dev,
    test_size=0.66667,
    random_state=1337,
)

X_train.shape, X_test.shape, X_dev.shape, y_train.shape, y_test.shape, y_dev.shape
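
The proportions of a two-stage split like the one above can be checked with a small standard-library sketch. The helper `three_way_split` and its parameters are hypothetical and only illustrate the arithmetic; this is not how `dask_ml` or scikit-learn implement their splitters, and their rounding can differ by a row or two.

```python
import random


def three_way_split(items, holdout_frac=0.3, test_frac_of_holdout=2 / 3, seed=1337):
    """Shuffle items, hold out a fraction, then divide the hold-out into test and dev."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout_frac)
    n_test = round(n_holdout * test_frac_of_holdout)

    train = shuffled[: len(shuffled) - n_holdout]
    holdout = shuffled[len(shuffled) - n_holdout:]
    test, dev = holdout[:n_test], holdout[n_test:]
    return train, dev, test


train, dev, test = three_way_split(range(1000))
print(len(train), len(dev), len(test))
```

With 1,000 rows this gives a train set of 700 rows, a development set of 100 and a test set of 200.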

## Label Function 1: Contains Tag

The first labeling function we'll create is a keyword search. We'll check whether a tag's keyword appears in the body of the post. This would be helpful for a question about HTML with the tag `html` where `html` also appears in the body of the post.
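
Stripped of Snorkel and spaCy, the idea can be sketched in a few lines of plain Python. The function `keyword_label`, the example posts and the label values are made up for illustration; a labeling function either votes for a label or abstains.

```python
ABSTAIN = -1


def keyword_label(body, keywords, label):
    """Vote `label` if any keyword is a substring of the post body, else abstain."""
    text = body.lower()
    if any(keyword.lower() in text for keyword in keywords):
        return label
    return ABSTAIN


html_vote = keyword_label("How do I center a div in HTML?", ["html"], label=3)
no_vote = keyword_label("Segfault when freeing a pointer", ["html"], label=3)

# Substring matching is crude: a one-letter keyword fires on unrelated text
noisy_vote = keyword_label("Question regarding Lucene scoring", ["c"], label=7)

print(html_vote, no_vote, noisy_vote)
```

The last call shows a limitation worth keeping in mind: because this is substring matching, very short keywords such as `c` match almost any post, so token-level matching may be a better fit for such tags.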

### Snorkel Preprocessors and LFs

To do this we'll use a [`snorkel.preprocess.preprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.preprocessor.html#snorkel.preprocess.preprocessor) called [`snorkel.preprocess.nlp.SpacyPreprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.nlp.SpacyPreprocessor.html) to create [`spacy.Doc`](https://spacy.io/api/doc) objects from our string documents. This gives us each document in several forms: its text, a list of [`spacy.Token`](https://spacy.io/api/token) objects, and a [`spacy.Doc.vector`](https://spacy.io/api/doc#vector) representation. The additional features of `SpacyPreprocessor`, `spacy.Doc` and `spacy.Token` make this very useful when writing LFs.
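
Conceptually, a preprocessor of this kind reads one field of a data point, derives a new field from it, and can cache the result so the same text is not parsed twice. The sketch below uses only the standard library; `CachingPreprocessor` is a hypothetical stand-in, whitespace splitting stands in for spaCy parsing, and none of this is Snorkel's actual implementation.

```python
from types import SimpleNamespace


class CachingPreprocessor:
    """Hypothetical stand-in: derive `doc_field` from `text_field`, caching by text."""

    def __init__(self, text_field, doc_field, parse):
        self.text_field = text_field
        self.doc_field = doc_field
        self.parse = parse   # any callable mapping text to a parsed object
        self._cache = {}     # text -> parsed object
        self.calls = 0       # how many times `parse` actually ran

    def __call__(self, x):
        text = getattr(x, self.text_field)
        if text not in self._cache:
            self.calls += 1
            self._cache[text] = self.parse(text)
        # Return an enriched copy instead of mutating the input data point
        enriched = SimpleNamespace(**vars(x))
        setattr(enriched, self.doc_field, self._cache[text])
        return enriched


pre = CachingPreprocessor("_Body", "_Doc", parse=str.split)
post = SimpleNamespace(_Body="How do I parse HTML with Python")

first = pre(post)
second = pre(post)  # served from the cache; `parse` is not called again

print(first._Doc)
print(pre.calls)
```

A labeling function that lists such a preprocessor in `pre` then sees the enriched data point, which is why the LFs in this notebook can read `x._Doc`.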

### Debugging `SpacyPreprocessor`

*If you have trouble instantiating `SpacyPreprocessor` because `en_core_web_sm` won't load, restart the kernel and run all cells again; the problem should resolve itself.*

In [None]:
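from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.labeling import LabelingFunction

ABSTAIN = -1


# Parses `_Body` with spaCy and stores the resulting Doc in `_Doc`
spacy_processor = SpacyPreprocessor(
    text_field='_Body',
    doc_field='_Doc',
    memoize=True,
    gpu=True,
)

def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the post body, else ABSTAIN."""
    # Tags are lowercase, so compare against the lowercased text
    text = x._Doc.text.lower()
    if any(word in text for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=ABSTAIN):
    # Accept a single keyword as well as a list; iterating over a bare string
    # would otherwise test each character instead of the whole word
    if isinstance(keywords, str):
        keywords = [keywords]
    return LabelingFunction(
        name=f"keyword_{'_'.join(keywords)}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
        pre=[spacy_processor],
    )


# For each tag, split on hyphen and create an LF that detects whether that keyword is present in the post.
# Tag/index pairs are fetched together so that each tag stays aligned with its index.
tag_index_pairs = df[['_Tag', '_Index']].drop_duplicates().compute()

keyword_lfs = OrderedDict()
for label_set, index in tag_index_pairs.itertuples(index=False, name=None):
    for label in label_set.split('-'):
        keyword_lfs[label] = make_keyword_lf(label, label=index)

list(keyword_lfs.items())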

### Apply our LFs

Let's try the first batch of tag keyword search LFs and see how they alone perform compared to the true labels.

In [None]:
from snorkel.labeling import LFAnalysis, PandasLFApplier

# SpacyPreprocessor reads `_Body`, so that is the only column the LFs need
X_train = X_train[['_Body']]
X_dev   = X_dev[['_Body']]

applier = PandasLFApplier(
    lfs=list(keyword_lfs.values()),
)

L_train = applier.apply(df=X_train)
L_dev   = applier.apply(df=X_dev)

In [None]:
# summary = LFAnalysis(L=L_dev, lfs=list(keyword_lfs.values())).lf_summary(Y=y_dev.values)
# summary

In [None]:
# summary[summary.index == 'keyword_base64']
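
To make "how they perform compared to the true labels" concrete, two of the simplest per-LF statistics are coverage (the fraction of rows where the LF does not abstain) and empirical accuracy (the fraction of its non-abstaining votes that match the true label). The sketch below computes both by hand on a made-up label matrix; `lf_coverage_and_accuracy`, `L_toy` and `y_toy` are hypothetical names, and this is an illustration rather than Snorkel's own analysis code.

```python
ABSTAIN = -1


def lf_coverage_and_accuracy(label_matrix, gold_labels):
    """Per-LF coverage and empirical accuracy; rows are examples, columns are LFs."""
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        # Keep only the rows where LF j actually voted
        votes = [
            (row[j], gold)
            for row, gold in zip(label_matrix, gold_labels)
            if row[j] != ABSTAIN
        ]
        coverage = len(votes) / n_rows
        accuracy = (
            sum(vote == gold for vote, gold in votes) / len(votes) if votes else None
        )
        stats.append({"coverage": coverage, "accuracy": accuracy})
    return stats


# Four examples, two LFs: LF 0 always votes class 0, LF 1 always votes class 1
L_toy = [
    [0, ABSTAIN],
    [0, 1],
    [ABSTAIN, 1],
    [0, ABSTAIN],
]
y_toy = [0, 1, 1, 1]

stats = lf_coverage_and_accuracy(L_toy, y_toy)
print(stats)
```

In this toy matrix the first LF covers three of four rows but is right only once, while the second covers two rows and is right both times. This is the kind of trade-off to look for when reviewing the keyword LFs on the development set.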

### Apply our LFs using PySpark

`PandasLFApplier` applies the LFs in a single process. To label the full dataset in parallel we use PySpark, for which Snorkel provides `SparkLFApplier`.

In [None]:
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
import pyspark.sql.types as T

from lib.utils import one_hot_encode

PATHS = {
    'bad_questions': {
        'local': '../data/stackoverflow/Questions.Bad.{}.{}.parquet',
        's3': 's3://stackoverflow-events/Questions.Bad.{}.{}.parquet',
    },
}

# Define a set of paths for each step for local and S3
PATH_SET = 'local' # 's3'

spark = SparkSession.builder\
    .appName('Deep Products - Create Weak Labels')\
    .config('spark.dynamicAllocation.enabled', True)\
    .config('spark.shuffle.service.enabled', True)\
    .getOrCreate()
sc = spark.sparkContext

tag_limit, stratify_limit, bad_limit = 2000, 2000, 500

bad_questions = spark.read.parquet(
    PATHS['bad_questions'][PATH_SET].format(tag_limit, bad_limit)
)

# Redone from top for the moment
from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.labeling import LabelingFunction
from snorkel.labeling.apply.spark import SparkLFApplier
from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction


ABSTAIN = -1


# Parses `_Body` with spaCy and stores the resulting Doc in `_Doc`
spacy_processor = SpacyPreprocessor(
    text_field='_Body',
    doc_field='_Doc',
    memoize=True,
    gpu=True,
)

def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the post body, else ABSTAIN."""
    # Tags are lowercase, so compare against the lowercased text
    text = x._Doc.text.lower()

    if any(word in text for word in keywords):
        return label
    return ABSTAIN


# For each tag, split on hyphen and create an LF that detects whether that keyword is present in the post.
# Tag/index pairs are fetched together so that each tag stays aligned with its index.
tag_index_pairs = df[['_Tag', '_Index']].drop_duplicates().compute()

keyword_lfs = OrderedDict()
for label_set, index in tag_index_pairs.itertuples(index=False, name=None):
    for label in label_set.split('-'):
        # Pass a one-element list: a bare string would be matched character by character
        keyword_lfs[label] = make_keyword_lf([label], label=index)

lf_1 = list(keyword_lfs.items())

## Document Vector Distances

Now let's try using the [`vector`](https://spacy.io/api/doc#vector) feature of the [`spacy.Doc`](https://spacy.io/api/doc) that is returned in the `_Doc` field by the `SpacyPreprocessor`.
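
The distance measure is not specified at this point in the notebook. As one common choice, the sketch below assumes cosine distance, using plain Python lists as stand-ins for `Doc.vector` arrays. The helper names and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)


post_vec = [1.0, 0.0, 1.0]   # stand-in for a post's document vector
tag_vec_a = [2.0, 0.0, 2.0]  # same direction as the post
tag_vec_b = [0.0, 3.0, 0.0]  # orthogonal to the post

near = cosine_distance(post_vec, tag_vec_a)
far = cosine_distance(post_vec, tag_vec_b)
print(near, far)
```

A labeling function built on this idea could vote for a tag when the distance between a post's vector and a reference vector for that tag falls below a threshold, and abstain otherwise. The threshold would need tuning on the development set.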