# Introducing Snorkel

In this notebook we use Snorkel to enrich our data. Tags with between 500 and 2,000 examples are labeled using weak supervision, which produces enough labeled examples to train an accurate full model that includes these new labels.

More information about Snorkel can be found at [Snorkel.org](https://www.snorkel.org/). For a basic introduction to Snorkel, see the [Spam Tutorial](http://syndrome:8888/notebooks/snorkel-tutorials/spam/01_spam_tutorial.ipynb). For an introduction to Multi-Task Learning (MTL), see the [Multi-Task Tutorial](http://syndrome:8888/notebooks/snorkel-tutorials/multitask/multitask_tutorial.ipynb).
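
As a rough illustration of what weak supervision means here, the sketch below combines the votes of several noisy labeling functions into one label per example. It uses a plain majority vote as a stand-in; Snorkel's own label model is more sophisticated, and the vote matrix is made up for illustration.

```python
from collections import Counter

ABSTAIN = -1  # conventional "no vote" value in Snorkel label matrices


def majority_vote(votes):
    """Combine one example's LF votes; abstain on ties or when nobody votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most common labels
    return ranked[0][0]


# Hypothetical votes from three labeling functions for four posts
label_matrix = [
    [0, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 2, 1],
    [3, 4, ABSTAIN],
]
weak_labels = [majority_vote(row) for row in label_matrix]
print(weak_labels)
```

Examples that end up with `ABSTAIN` receive no weak label and would simply be left out of the enriched training set.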

In [10]:
# Snorkel Introduction

from collections import OrderedDict 
from glob import glob
import os
import sys

import cupy
import pandas as pd
import pyarrow
import random
import snorkel
import spacy
import tensorflow as tf

# Add parent directory to path
parent_dir = os.path.dirname(os.getcwd())
sys.path.append(parent_dir)

# Make reproducible
random.seed(1337)

# Turn off TensorFlow logging messages
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# For reproducibility
os.environ["PYTHONHASHSEED"] = "1337"

In [11]:
TAG_LIMIT = 50

In [14]:
PATHS = {
    'questions': {
        'local': '../data/stackoverflow/Questions.Tags.{}.parquet/part-00000-93547d3c-1f08-40b5-923e-89b961d01fc2-c000.snappy.parquet',
        's3': 's3://stackoverflow-events/08-05-2019/Questions.Tags.{}.parquet/part-00000-93547d3c-1f08-40b5-923e-89b961d01fc2-c000.snappy.parquet',
    }
}

# Define a set of paths for each step for local and S3
PATH_SET = 's3' # 'local'

## Loading our Examples for Enrichment

In [15]:
path = PATHS['questions'][PATH_SET].format(TAG_LIMIT)#, BAD_LIMIT)

df = pd.read_parquet(
    path, 
    engine='pyarrow',
)

In [16]:
pd.set_option('display.max_colwidth', 300)
df.sample(50).head()

Unnamed: 0,_PostId,_AcceptedAnswerId,_Body,_Code,_Tags,_AnswerCount,_CommentCount,_FavoriteCount,_OwnerUserId,_OwnerDisplayName,...,_AccountId,_UserId,_UserDisplayName,_UserDownVotes,_UserLocation,_ProfileImageUrl,_UserReputation,_UserUpVotes,_UserViews,_UserWebsiteUrl
10397,329188,,"Report Generation in a Ruby Web Application I've been developing business apps, basically CRUD, in ASP.Net for years now, and am interested in learning another language and platform.\nAfter a few trips to Borders and poking around a bit on the web, I have not found much dealing with generating r...",,[ruby],5,0,,,Mike Thomas,...,,,,,,,,,,
4459,28036053,28036159.0,"Objective-C/ARC ivar vs property I've been up and down the Google and the Stack and read many articles if not outright debates over ivars and properties. But still, even after all this reading I remain confused.\nI understand ivar's are private and properties are typically used to expose (well)...","@interface MyClass\n@property(strong) NSMutableArray *myArray;\n@end\n\n@interface MyClass\n\n-(instancetype)init {\n\n if (self = [super init]) {\n\n self.myArray = [NSMutableArray array];\n\n // OR\n\n // Will this NOT call the Setter? Hence, leading\n // to pos...",[objective-c],4,8,,,user1068477,...,,,,,,,,,,
133304,2977779,2978212.0,"add xml node to xml file with python I wonder if it is better add an element by opening file, search 'good place' and add string which contains xml code.\nOr use some library... i have no idea. I know how can i get nodes and properties from xml through for example lxml but what's the simpliest a...",,"[python, xml]",2,0,,170961.0,,...,57071.0,170961.0,matiit,72.0,"Cardiff, United Kingdom",,5865.0,243.0,407.0,
106967,48491694,48491845.0,How get Environment Variables from lambda (nodejs aws-sdk) We can set up Environment Variables in aws-lambda for example via AWS SAM:\n\nHow can I get this variables from current lambda via Node JS AWS-SDK?\n,Environment:\n Variables:\n TABLE_NAME: !Ref Table\n,"[amazon-web-services, aws-lambda, nodes]",1,0,1.0,6345354.0,,...,8458983.0,6345354.0,Max Vinogradov,0.0,"Sumy, Sums'ka oblast, Ukraine",https://lh5.googleusercontent.com/-p49_IG9LR6c/AAAAAAAAAAI/AAAAAAAAABY/4XdJrDLMl10/photo.jpg?sz=128,150.0,43.0,30.0,https://www.linkedin.com/in/max-vinogradov/
94488,15663654,15664062.0,"Get unique lines I'm creating graph in graphViz and I need every connection to be display only once, how to transform this input using linux commands?\nINPUT\n\nDESIRED OUTPUT:\n\nso equals to and needs to be removed.\nI tried bot it didnt work with delimiter and don't know how to check for ...","aa -- bb[label=xyz]\nab -- bb[label=yzx]\naa -- bb[label=zxy]\nac -- ab[label=xyz]\nbb -- aa[label=xzy]\n\naa -- bb[label=xyz]\nab -- bb[label=yzx]\nac -- ab[label=xyz]\n\naa -- bb\nbb -- aa\nsort -k1,2 -u -t[\n[","[linux, bash, awk, unique, delimiter]",3,0,,619616.0,,...,307994.0,619616.0,Buksy,2.0,Slovakia,,5289.0,538.0,420.0,http://buksy.netkosice.sk/index.php


In [17]:
# %matplotlib inline

# # Make each bin 100 count, since range is atm 500-2,000
# df.groupby('_Tags').count()['_Body'].hist(bins=15)

## Sample the Data Initially

In [18]:
SAMPLE_SIZE = 1000


# Lowercase the text; the column is named `_Lower_Text` to match the field the labeling functions read later
df['_Lower_Text'] = df['_Body'].str.lower()

df_sample = df.sample(SAMPLE_SIZE, random_state=1337)

In [19]:
spacy.prefer_gpu()

# Download the spaCy english model
spacy.cli.download('en_core_web_lg')

nlp = spacy.load("en_core_web_lg")

doc1 = nlp(df['_Body'][0])
doc2 = nlp(df['_Body'][1])

print(type(doc1), type(doc2))
doc1.similarity(doc2)

# df_sample['spacy'] = df['_Body'].apply(lambda x: nlp(x))

# ABSTAIN = -1

# def keyword_lookup(x, keywords, label):
    
#     match = any(word in x.text for word in keywords)
#     # print(keywords, match, label, x)
#     if match:
#         return label
#     return ABSTAIN

# keyword_lookup(doc, ['base64'], 0)

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
<class 'spacy.tokens.doc.Doc'> <class 'spacy.tokens.doc.Doc'>


TypeError: Unsupported type <class 'numpy.ndarray'>

## Split the Data into Train/Test/Development Datasets

We'll need to validate our labeling functions (LFs) in Snorkel, so we need train, test and __development__ datasets.

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    df_sample, 
    df_sample['_Index'], 
    test_size=0.3,
    random_state=1337,
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev,
    y_test_dev,
    test_size=0.66667,
    random_state=1337,
)

X_train.shape, X_test.shape, X_dev.shape, y_train.shape, y_test.shape, y_dev.shape
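
The two chained splits above are easier to follow as overall fractions of the sample. This small sketch only redoes the arithmetic; the exact row counts depend on how scikit-learn rounds, so it does not report them.

```python
# Fractions implied by the two train_test_split calls above
holdout_frac = 0.3         # first split: test_size=0.3
test_of_holdout = 0.66667  # second split: test_size=0.66667

train_frac = 1 - holdout_frac
test_frac = holdout_frac * test_of_holdout
dev_frac = holdout_frac * (1 - test_of_holdout)

print(f"train={train_frac:.2f} dev={dev_frac:.2f} test={test_frac:.2f}")
```

So roughly 70% of the sample is used for training, 10% for development and 20% for testing.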

## Label Function 1: Contains Tag

The first labeling function we'll create is a keyword search: it checks whether a tag's keyword appears in the text of the post. For example, a question tagged `html` will often also contain the word `html` in its body.
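
Before bringing in Snorkel, the idea can be sketched in plain Python. The function name `keyword_vote`, the sample post and the label index `7` are all made up for illustration; this is not the notebook's labeling function.

```python
ABSTAIN = -1


def keyword_vote(text, keywords, label):
    """Return `label` if any keyword is a substring of `text`, else abstain."""
    return label if any(word in text for word in keywords) else ABSTAIN


# Hypothetical post body and label index, for illustration only
post = "how do i center a div in html using css?"
HTML_LABEL = 7

print(keyword_vote(post, ["html"], HTML_LABEL))  # keyword present
print(keyword_vote(post, ["ruby"], HTML_LABEL))  # keyword absent

# Pitfall: a bare string is iterated character by character, so "ruby"
# matches here because the letter "r" occurs in the post.
print(keyword_vote(post, "ruby", HTML_LABEL))
```

Because this is substring matching, a keyword such as `java` would also fire on posts that only mention `javascript`, so keyword LFs like this are best treated as noisy signals.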

### Snorkel Preprocessors and LFs

To do this we'll use a [`snorkel.preprocess.preprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.preprocessor.html#snorkel.preprocess.preprocessor) called [`snorkel.preprocess.nlp.SpacyPreprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.nlp.SpacyPreprocessor.html) to create [`spacy.Doc`](https://spacy.io/api/doc) objects from our string documents. This gives us each document in several forms: its text, a list of [`spacy.Token`](https://spacy.io/api/token) objects, and a [`spacy.Doc.vector`](https://spacy.io/api/doc#vector) representation. Additional features of `SpacyPreprocessor`, `spacy.Doc` and `spacy.Token` make this incredibly useful.
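
Conceptually, a preprocessor maps a data point to a copy of itself with an extra derived field, and memoization avoids repeating expensive work on text it has already seen. The sketch below imitates that behavior with the standard library only. It is not Snorkel's implementation, and `expensive_parse` is a hypothetical stand-in for running a spaCy pipeline.

```python
from types import SimpleNamespace

_cache = {}  # maps raw text to its processed form, mimicking memoize=True
calls = {"count": 0}


def expensive_parse(text):
    """Stand-in for running a spaCy pipeline; here it just splits on whitespace."""
    calls["count"] += 1
    return text.split()


def add_doc_field(x, text_field="_Lower_Text", doc_field="_Doc"):
    """Return a copy of data point `x` with a derived `doc_field` attached."""
    text = getattr(x, text_field)
    if text not in _cache:
        _cache[text] = expensive_parse(text)
    return SimpleNamespace(**vars(x), **{doc_field: _cache[text]})


point = SimpleNamespace(_Lower_Text="parse xml with lxml")
first = add_doc_field(point)
second = add_doc_field(point)  # served from the cache
print(first._Doc, calls["count"])
```

A labeling function that lists the preprocessor in its `pre` argument then sees the enriched data point, which is why the LFs in this notebook read `x._Doc`.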

### Debugging `SpacyPreprocessor`

*If you have trouble instantiating `SpacyPreprocessor` because `en_core_web_sm` won't load, restart the kernel and run all cells again; the problem should resolve itself.*

In [None]:
from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.labeling import LabelingFunction

ABSTAIN = -1


spacy_processor = SpacyPreprocessor(
    text_field='_Lower_Text',
    doc_field='_Doc',
    memoize=True,
    gpu=True,
)

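def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the document text, else ABSTAIN."""
    # A bare string would be iterated character by character, so wrap it in a list
    if isinstance(keywords, str):
        keywords = [keywords]

    match = any(word in x._Doc.text for word in keywords)
    if match:
        return label
    return ABSTAIN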

def make_keyword_lf(keywords, label=ABSTAIN):
    return LabelingFunction(
        name=f"keyword_{keywords}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
        pre=[spacy_processor],
    )


# For each keyword, split on hyphen and create an LF that detects if that tag is present in the data
keyword_lfs = OrderedDict()
for label_set, index in zip(df['_Tag'].unique(), df['_Index'].unique()):
    for label in label_set.split('-'):
        keyword_lfs[label] = make_keyword_lf(label, label=index)

list(keyword_lfs.items())

### Apply our LFs

Let's try the first batch of tag keyword search LFs and see how they alone perform compared to the true labels.

In [None]:
from snorkel.labeling import LFAnalysis, PandasLFApplier

X_train = X_train[['_Lower_Text']]
X_dev   = X_dev[['_Lower_Text']]

# Pass a concrete list of LFs rather than a dict view
applier = PandasLFApplier(
    lfs=list(keyword_lfs.values()),
)

L_train = applier.apply(df=X_train)
L_dev   = applier.apply(df=X_dev)
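
The label matrices returned by the applier have one row per example and one column per LF, with `-1` marking an abstention. As a rough sketch of the kind of per-LF statistics worth checking, the following computes coverage and empirical accuracy by hand on a tiny made-up matrix; the values in `L` and `y_true` are hypothetical.

```python
ABSTAIN = -1

# Hypothetical label matrix: rows are posts, columns are labeling functions
L = [
    [0, ABSTAIN],
    [0, 1],
    [ABSTAIN, 1],
    [ABSTAIN, ABSTAIN],
]
y_true = [0, 1, 1, 0]  # hypothetical gold labels


def coverage(L, j):
    """Fraction of examples on which LF `j` does not abstain."""
    return sum(row[j] != ABSTAIN for row in L) / len(L)


def empirical_accuracy(L, y, j):
    """Accuracy of LF `j` on the examples where it votes; None if it never votes."""
    voted = [(row[j], gold) for row, gold in zip(L, y) if row[j] != ABSTAIN]
    if not voted:
        return None
    return sum(vote == gold for vote, gold in voted) / len(voted)


for j in range(2):
    print(j, coverage(L, j), empirical_accuracy(L, y_true, j))
```

An LF with high coverage but low accuracy on the development set is a candidate for tightening or removal.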

In [None]:
# summary = LFAnalysis(L=L_dev, lfs=keyword_lfs.values()).lf_summary() # y_dev.as_matrix()
# summary

In [None]:
# summary[summary.index == 'keyword_base64']

### Apply our LFs using PySpark

For performance reasons, we want the parallel processing of PySpark.

In [None]:
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
import pyspark.sql.types as T

from lib.utils import one_hot_encode

PATHS = {
    'bad_questions': {
        'local': '../data/stackoverflow/Questions.Bad.{}.{}.parquet',
        's3': 's3://stackoverflow-events/Questions.Bad.{}.{}.parquet',
    },
}

# Define a set of paths for each step for local and S3
PATH_SET = 'local' # 's3'

spark = SparkSession.builder\
    .appName('Deep Products - Create Weak Labels')\
    .config('spark.dynamicAllocation.enabled', True)\
    .config('spark.shuffle.service.enabled', True)\
    .getOrCreate()
sc = spark.sparkContext

tag_limit, stratify_limit, bad_limit = 2000, 2000, 500

bad_questions = spark.read.parquet(
    PATHS['bad_questions'][PATH_SET].format(tag_limit, bad_limit)
)

# Redefined from the earlier cells for now (`Row` is already imported above)
from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.labeling import LabelingFunction
from snorkel.labeling.apply.spark import SparkLFApplier
from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction


ABSTAIN = -1


spacy_processor = SpacyPreprocessor(
    text_field='_Body',
    doc_field='_Doc',
    memoize=True,
    gpu=True,
)

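def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the document text, else ABSTAIN."""
    # A bare string would be iterated character by character, so wrap it in a list
    if isinstance(keywords, str):
        keywords = [keywords]

    match = any(word in x._Doc.text for word in keywords)

    if match:
        return label
    return ABSTAIN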


# For each keyword, split on hyphen and create an LF that detects if that tag is present in the data
keyword_lfs = OrderedDict()
for label_set, index in zip(df['_Tag'].unique(), df['_Index'].unique()):
    for label in label_set.split('-'):
        keyword_lfs[label] = make_keyword_lf(label, label=index)

lf_1 = list(keyword_lfs.items())


## Document Vector Distances

Now let's try using the [`vector`](https://spacy.io/api/doc#vector) feature of the [`spacy.Doc`](https://spacy.io/api/doc), which is returned in the `_Doc` field from the `SpacyPreprocessor`.
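
By default, spaCy's `Doc.similarity` is a cosine similarity between document vectors. The sketch below shows that measure in plain Python on toy vectors, which are made up; real spaCy document vectors have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for document vectors
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so a distance-based LF could vote for a tag when a post's vector is sufficiently similar to a reference vector for that tag.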