# Chapter 3: Introducing Snorkel

In this chapter I will introduce [Snorkel](https://www.snorkel.org/), which we'll use throughout the book. Snorkel is a software project ([github](https://github.com/snorkel-team/snorkel)), originally from the Hazy Research group at Stanford University, that enables the practice of *weak supervision*, *distant supervision*, *data augmentation* and *data slicing*.

The project has an excellent [Get Started](https://www.snorkel.org/get-started/) page, and I recommend you spend some time working the [tutorials](https://github.com/snorkel-team/snorkel-tutorials) before proceeding beyond this chapter. 

Snorkel implements an unsupervised generative model that accepts a matrix of weak labels for records in your training data and produces strong labels by learning the relationships between these weak labels through matrix factorization.
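
To make the idea of a weak label matrix concrete, here is a minimal sketch that combines votes by simple majority. It is a deliberately simpler stand-in for illustration, not Snorkel's generative model; the matrix values and the `majority_vote` helper are made up for this example.

```python
from collections import Counter

ABSTAIN = -1  # Snorkel's convention: -1 means "no vote"

# Hypothetical weak label matrix: one row per record, one column per labeling function
L = [
    [1, 1, -1],    # two votes for class 1, one abstain
    [0, -1, -1],   # a single vote for class 0
    [-1, -1, -1],  # nobody voted
    [1, 0, 1],     # disagreement, class 1 wins 2:1
]


def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the top two classes
    return ranked[0][0]


labels = [majority_vote(row) for row in L]
print(labels)  # [1, 0, -1, 1]
```

Majority vote treats every column as equally reliable. A label model instead tries to learn how much to trust each source of weak labels.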

In [52]:
import random
import sys
sys.path.append("..")

import numpy as np
import pandas as pd
import pyarrow

from lib import utils


# Make randomness reproducible
random.seed(31337)
np.random.seed(31337)

## Example Project: Labeling Amazon Github Repositories

I have previously hand labeled about 2,600 Github repositories belonging to Amazon and its subsidiaries into categories related to their purpose. We're going to use this dataset to introduce Snorkel.

### Hand Labeling this Data

In order to get a ground truth dataset against which to benchmark our Snorkel labeling, I hand labeled all Amazon Github projects in [this sheet](https://docs.google.com/spreadsheets/d/1wiesQSde5LwWV_vpMFQh24Lqx5Mr3VG7fk_e6yht0jU/edit?usp=sharing). The label categories are:

| Number | Code      | Description                          |
|--------|-----------|--------------------------------------|
| 0      | GENERAL   | A FOSS project of general utility    |
| 1      | API       | API library for AWS / Amazon product |
| 2      | RESEARCH  | A research paper and/or dataset      |
| 3      | DEAD      | Project is dead, no longer useful    |
| 4      | OTHER     | Uncertainty... what is this thing?   |

If you want to make corrections, please open the sheet, click on `File --> Make a Copy`, make any edits and then share the sheet with me.

In [53]:
readme_df = pd.read_parquet('../data/aws_github.parquet', engine='pyarrow')
readme_df.head()

| id       | full_name                             | html_url                                          | description                                       | readme                                             | label |
|----------|---------------------------------------|---------------------------------------------------|---------------------------------------------------|----------------------------------------------------|-------|
| 61861755 | alexa/alexa-skills-kit-sdk-for-nodejs | https://github.com/alexa/alexa-skills-kit-sdk-... | The Alexa Skills Kit SDK for Node.js helps you... | `<p align="center">\n <img src="https://m.medi...` | API   |
| 84138837 | alexa/alexa-cookbook                  | https://github.com/alexa/alexa-cookbook           | A series of sample code projects to be used fo... | `\n# Alexa Skill Building Cookbook\n\n<div styl...` | API   |
| 63275452 | alexa/skill-sample-nodejs-fact        | https://github.com/alexa/skill-sample-nodejs-fact | Build An Alexa Fact Skill                         | `# Build An Alexa Fact Skill\n<img src="https:/...` | API   |
| 81483877 | alexa/avs-device-sdk                  | https://github.com/alexa/avs-device-sdk           | An SDK for commercial device makers to integra... | `### What is the Alexa Voice Service (AVS)?\n\n...` | API   |
| 38904647 | alexa/alexa-skills-kit-sdk-for-java   | https://github.com/alexa/alexa-skills-kit-sdk-... | The Alexa Skills Kit SDK for Java helps you ge... | `<p align="center">\n <img src="https://m.medi...` | API   |


## Profile the Data

Let's take a quick look at the labels to see what we'll be classifying.

In [54]:
print(f'Total records: {len(readme_df.index):,}')

readme_df['label'].value_counts()

Total records: 2,568


API         2265
GENERAL      279
DEAD          14
RESEARCH       9
OTHER          1
Name: label, dtype: int64

### How much general utility do Amazon's Github projects have?

One question that occurs to me: how much general utility do Amazon's Github projects have? Let's compare the number of `GENERAL` purpose projects to the number of `API` projects.

In [55]:
api_count     = readme_df[readme_df['label'] ==     'API'].count(axis='index')['full_name']
general_count = readme_df[readme_df['label'] == 'GENERAL'].count(axis='index')['full_name']

general_pct = 100 * (general_count / (api_count + general_count))
api_pct     = 100 * (api_count / (api_count + general_count))

print(f'Percentage of projects having general utility:   {general_pct:,.3f}%')
print(f'Percentage of projects for Amazon products/APIs: {api_pct:,.3f}%')

Percentage of projects having general utility:   10.967%
Percentage of projects for Amazon products/APIs: 89.033%


### Simplify to `API` vs `GENERAL`

We throw out `DEAD`, `RESEARCH` and `OTHER` to focus on `API` vs `GENERAL`: is an open source project of general utility, or is it a client to a company's commercial products? Highly imbalanced classes are hard to deal with when building a classifier, and 1:9 for `GENERAL`:`API` is bad enough.

In [59]:
df = readme_df[readme_df['label'].isin(['API', 'GENERAL'])]

print(f'Total records with API/GENERAL labels: {len(df.index):,}')

df.head()

Total records with API/GENERAL labels: 2,544


| id       | full_name                             | html_url                                          | description                                       | readme                                             | label |
|----------|---------------------------------------|---------------------------------------------------|---------------------------------------------------|----------------------------------------------------|-------|
| 61861755 | alexa/alexa-skills-kit-sdk-for-nodejs | https://github.com/alexa/alexa-skills-kit-sdk-... | The Alexa Skills Kit SDK for Node.js helps you... | `<p align="center">\n <img src="https://m.medi...` | API   |
| 84138837 | alexa/alexa-cookbook                  | https://github.com/alexa/alexa-cookbook           | A series of sample code projects to be used fo... | `\n# Alexa Skill Building Cookbook\n\n<div styl...` | API   |
| 63275452 | alexa/skill-sample-nodejs-fact        | https://github.com/alexa/skill-sample-nodejs-fact | Build An Alexa Fact Skill                         | `# Build An Alexa Fact Skill\n<img src="https:/...` | API   |
| 81483877 | alexa/avs-device-sdk                  | https://github.com/alexa/avs-device-sdk           | An SDK for commercial device makers to integra... | `### What is the Alexa Voice Service (AVS)?\n\n...` | API   |
| 38904647 | alexa/alexa-skills-kit-sdk-for-java   | https://github.com/alexa/alexa-skills-kit-sdk-... | The Alexa Skills Kit SDK for Java helps you ge... | `<p align="center">\n <img src="https://m.medi...` | API   |


### Split our Data into Training and Test Data

In order to demonstrate Snorkel's capabilities, we need to create an experiment by splitting our data into training data `train_df` and test data `test_df`. Scikit-Learn is ubiquitous for this purpose :)

The point of Snorkel is that you don't need labels, so let's remove the labels from part of the dataset and use that for development.

In [73]:
from sklearn.model_selection import train_test_split

# Hold out 5% of the gold labeled data as a test set for evaluating LFs as we write them
train_df, test_df, train_labels, test_labels = train_test_split(
    df,
    df['label'],
    test_size=0.05
)
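
The split above is purely random, so with a 1:9 class ratio a 5% test set can end up with very few `GENERAL` records. One way to guard against that is a stratified split, which samples each class separately. The sketch below uses only the standard library to show the idea on made-up labels; `stratified_split` is a hypothetical helper, not part of Scikit-Learn, although `train_test_split` accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=31337):
    """Split record indices so each class contributes ~test_size of its members to the test set."""
    rng = random.Random(seed)

    # Group record indices by their class label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Always keep at least one record of every class in the test set
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


# Made-up labels with the same 9:1 imbalance as our dataset
labels = ["API"] * 90 + ["GENERAL"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2)

test_counts = {name: sum(labels[i] == name for i in test_idx) for name in ("API", "GENERAL")}
print(test_counts)  # {'API': 18, 'GENERAL': 2}
```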

## Working with Snorkel

Snorkel has three primary programming interfaces: Labeling Functions, Transformation Functions and Slicing Functions.

<img 
     alt="Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions"
     src="images/snorkel_apis_0.9.5.png"
     width="500px"
/>
<div align="center">Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions, from <a href="https://www.snorkel.org/">Snorkel.org</a></div>

### Labeling Functions (LFs)

A labeling function is a deterministic function used to label data as belonging to one class or another. They produce weak labels that in combination, through Snorkel’s generative models, can be used to generate strong labels for unlabeled data.

The [Snorkel paper](https://arxiv.org/pdf/1711.10160.pdf) explains that LFs are open ended: they can leverage information from multiple sources, both inside and outside the record. For example, LFs can operate over different parts of the input document, working with document metadata, entire texts, individual paragraphs, sentences or words, parts of speech, named entities extracted by preprocessors, text embeddings or any augmentation of the record whatsoever. They can simultaneously leverage external databases and rules through *distant supervision*. These might include vocabulary for keyword searches or heuristics defined by theoretical considerations or equations.

For example, an LF might use a preprocessor such as the included `SpacyPreprocessor` to run a text document through a language model for Named Entity Recognition (NER), and then look for words queried from WikiData that correspond to a given class. There are many ways to write LFs. We’ll define a broad taxonomy and then demonstrate some techniques from each.

The program interface for Labeling Functions is [`snorkel.labeling.LabelingFunction`](https://snorkel.readthedocs.io/en/v0.9.5/packages/_autosummary/labeling/snorkel.labeling.LabelingFunction.html#snorkel.labeling.LabelingFunction). They are instantiated with a name, a function reference, any resources the function needs and a list of any preprocessors to run on the data records before the labeling function runs.

<img alt="LabelingFunction API" src="images/labeling_function_api.png" width="600" />
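
To show how those four arguments fit together, here is a toy stand-in written with only the standard library. It is not Snorkel's class, just a sketch of the calling convention as I read it from the documentation linked above: preprocessors run first, then the function is called with the record and its resources. `ToyLabelingFunction`, `lowercase_name` and `name_has_keyword` are hypothetical names.

```python
from types import SimpleNamespace

ABSTAIN, API = -1, 1


class ToyLabelingFunction:
    """Toy stand-in mirroring the name / f / resources / pre arguments (not Snorkel's class)."""

    def __init__(self, name, f, resources=None, pre=None):
        self.name = name
        self.f = f
        self.resources = resources or {}
        self.pre = pre or []

    def __call__(self, x):
        # Run each preprocessor in order, then vote using the extra resources
        for preprocessor in self.pre:
            x = preprocessor(x)
        return self.f(x, **self.resources)


def lowercase_name(x):
    """Hypothetical preprocessor: return a copy of the record with a lowercased name."""
    return SimpleNamespace(**{**vars(x), "full_name": x.full_name.lower()})


def name_has_keyword(x, keywords):
    return API if any(k in x.full_name for k in keywords) else ABSTAIN


lf = ToyLabelingFunction(
    name="name_has_keyword",
    f=name_has_keyword,
    resources={"keywords": ["sdk"]},
    pre=[lowercase_name],
)

record = SimpleNamespace(full_name="alexa/AVS-Device-SDK")
print(lf(record))  # 1
```

Keeping the keyword list in `resources` rather than hard coding it in the function means the same function can back many LFs.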

### Defining Labeling Schema

In order to write our first labeling function, we need to define the label schema for our problem. The first label in any labeling schema is `-1` for `ABSTAIN`, which means "cast no vote" about the class of the record. This allows Snorkel Labeling Functions to vote only when they are certain, and is critical to how the system works since labeling functions have to perform better than random when they do vote or the Label Model won't work well.

The labels for this analysis are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |

In [98]:
ABSTAIN = -1
GENERAL = 0
API     = 1
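
Our gold labels are stored as strings, while labeling functions vote with these integers, so at some point the two need to be reconciled. A minimal sketch of that conversion follows; `LABEL_TO_INT` and `encode_labels` are hypothetical names introduced here for illustration.

```python
ABSTAIN = -1
GENERAL = 0
API     = 1

# Map the string codes in the `label` column onto the integer schema
LABEL_TO_INT = {"GENERAL": GENERAL, "API": API}


def encode_labels(labels):
    """Convert string labels to integers, failing loudly on an unknown code."""
    return [LABEL_TO_INT[label] for label in labels]


gold = encode_labels(["API", "GENERAL", "API"])
print(gold)  # [1, 0, 1]
```

`ABSTAIN` is left out of the mapping on purpose: it is a vote a labeling function can cast, not a class a record can belong to.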

### Writing our First Labeling Function

In order to write a labeling function, we must explore our data to find a property associated with a certain class of records that can be programmed as a heuristic. Let's inspect some of our records. The classes are imbalanced 9:1, so let's pull a stratified sample of both labels.

Look at the data table produced by the records below and try to eyeball any patterns among the `API` and the `GENERAL` records. Do you see any markers for `API` records or `GENERAL` records?

In [88]:
# Set Pandas to display more than 10 rows
pd.set_option('display.max_rows', 100)

api_df     = test_df[test_df['label'] ==     'API'].sample(frac=0.1).head(10).sort_values(by='label')
general_df = test_df[test_df['label'] == 'GENERAL'].sample(frac=0.7).head(10).sort_values(by='label')

# DataFrame.append was removed in pandas 2.0, so concatenate instead
pd.concat([api_df, general_df]).head(20)

| id        | full_name                                          | html_url                                          | description                                       | readme                                              | label |
|-----------|----------------------------------------------------|---------------------------------------------------|---------------------------------------------------|-----------------------------------------------------|-------|
| 43451986  | aws/aws-iot-device-sdk-arduino-yun                 | https://github.com/aws/aws-iot-device-sdk-ardu... | SDK for connecting to AWS IoT from an Arduino ... | `<h1 align = "center">AWS IoT Arduino Yún SDK</...` | API   |
| 218382910 | aws-samples/golden-ami-pipeline-with-tenable-s...  | https://github.com/aws-samples/golden-ami-pipe... | The golden AMI pipeline enables creation, dist... | `## Golden AMI Pipeline with Tenable.io\n\nCrea...` | API   |
| 52386995  | awsdocs/aws-java-developer-guide                   | https://github.com/awsdocs/aws-java-developer-... | Official repository of the AWS SDK for Java De... | `.. Copyright 2010-2017 Amazon.com, Inc. or its...` | API   |
| 63461880  | awslabs/amazon-inspector-finding-forwarder         | https://github.com/awslabs/amazon-inspector-fi... | Python scripts to run in AWS Lambda to process... | `# AmazonInspectorLambdaFindingProcessor\nThis ...` | API   |
| 61861729  | alexa/skill-sample-nodejs-adventure-game           | https://github.com/alexa/skill-sample-nodejs-a... | This tool provides an easy to use front-end th... | `# Build An Alexa Gamebook Skill\n<img src="ht...`  | API   |
| 27615980  | awslabs/amazon-kinesis-client-ruby                 | https://github.com/awslabs/amazon-kinesis-clie... | A Ruby interface for the Amazon Kinesis Client... | `# Amazon Kinesis Client Library for Ruby\n\nTh...` | API   |
| 135637462 | alexa/alexa-guided-walkthrough-using-node-sdk      | https://github.com/alexa/alexa-guided-walkthro... | Code walkthrough guides to show the ins and ou... | `# Alexa Skill Guided Walkthrough using the Nod...` | API   |
| 111005850 | awsdocs/aws-single-sign-on-user-guide              | https://github.com/awsdocs/aws-single-sign-on-... | The open source version of the AWS Single Sign... | `AWS Single Sign-On (SSO) Docs\n\nThe open sour...`  | API   |
| 222821523 | aws-samples/aws-iot-securetunneling-localproxy     | https://github.com/aws-samples/aws-iot-securet... | AWS Iot Secure Tunneling local proxy reference... | `## AWS IoT Secure Tunneling Local Proxy Refere...` | API   |
| 216931235 | aws-samples/amazon-textract-and-amazon-compreh...  | https://github.com/aws-samples/amazon-textract... | Extract, Validate and Visualize medical claims... | `## Automating a claims adjudication workflow u...` | API   |


### Detecting Patterns

Looking at the `full_name` and `html_url` columns, it appears that projects with `sdk` in the name are `API` projects. Let's filter down to those records to check.

In [97]:
sdk_df = test_df[test_df['full_name'].str.contains('sdk')]

print(f'Total SDK records: {len(sdk_df.index)}')

sdk_df.groupby('label').count()['full_name']

Total SDK records: 9


label
API        8
GENERAL    1
Name: full_name, dtype: int64

## Building an SDK Labeling Function

There is an 8:1 `API`:`GENERAL` ratio of labels among records with `sdk` in their full_name. This is more than good enough for a Labeling Function (LF), since they only have to be better than random! Cool, eh? Don't worry, the `LabelModel` will figure out which signal from which LF to use :) It's like magic!

This is called a keyword labeling function, the simplest type. Despite their simplicity, keyword LFs are incredibly powerful ways to inject subject matter expertise into a project. In the chapter on Weak Supervision, we'll get into the various types of LFs and the strategies researchers and Snorkel users have come up with for labeling data. For now we'll create this and a couple of other LFs and see where that gets us.

In [107]:
# The verbose way to define an LF
from snorkel.labeling import LabelingFunction


sdk_lf = LabelingFunction(
    name="name_contains_sdk",
    f=lambda x: API if 'sdk' in x.full_name.lower() else ABSTAIN,
)

print(sdk_lf)


# The short form way to define an LF
from snorkel.labeling import labeling_function


@labeling_function()
def sdk_lf(x):
    return API if 'sdk' in x.full_name.lower() else ABSTAIN

print(sdk_lf)

LabelingFunction name_contains_sdk, Preprocessors: []
LabelingFunction sdk_lf, Preprocessors: []


## Testing our `LabelingFunction`

Snorkel comes with tools to help you run your LFs on your dataset to see how they perform.
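
In Snorkel itself, the `snorkel.labeling` module provides `PandasLFApplier` to run LFs over a DataFrame and `LFAnalysis` to summarize them. To show what such a summary measures, here is a hand-rolled sketch using only the standard library. The records are made up for illustration and the numbers are not results from our dataset.

```python
from types import SimpleNamespace

ABSTAIN, GENERAL, API = -1, 0, 1


def sdk_lf(x):
    return API if "sdk" in x.full_name.lower() else ABSTAIN


# Hypothetical records with gold labels, for illustration only
records = [
    SimpleNamespace(full_name="alexa/avs-device-sdk", label=API),
    SimpleNamespace(full_name="example/some-sdk-helper", label=GENERAL),
    SimpleNamespace(full_name="example/plain-tool", label=GENERAL),
    SimpleNamespace(full_name="awslabs/amazon-kinesis-client-ruby", label=API),
]

votes = [sdk_lf(r) for r in records]

# Pair each non-abstain vote with the gold label of its record
voted = [(v, r.label) for v, r in zip(votes, records) if v != ABSTAIN]

coverage = len(voted) / len(records)                         # fraction of records the LF voted on
accuracy = sum(v == gold for v, gold in voted) / len(voted)  # fraction of those votes that were right

print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")  # coverage=0.50 accuracy=0.50
```

Coverage and accuracy pull in different directions: a broader keyword votes on more records but is usually wrong more often.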

## Writing more `LabelingFunctions`

We need more than just one vote per record, so we'll write several more labeling functions.

## Utilities for Creating Keyword LFs

We'll be creating several keyword labeling functions, so we're going to write some utility functions to make this more efficient. These come from the Snorkel Spam tutorial, and later we'll extend their capabilities to remove the need to write code for keyword LFs.

In [None]:
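def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword appears in the record's text, else ABSTAIN"""
    # NOTE: assumes each record has a `text` field, as in the Spam tutorial
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


# The tutorial's default label was SPAM, which isn't defined in our schema; we default to API
def make_keyword_lf(keywords, label=API):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )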
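
To see how keyword lookups behave before wiring them into Snorkel, here is a standalone sketch that binds keyword lists to the same lookup logic with `functools.partial`. The keyword lists and sample records are hypothetical, and the `text` field is an assumption carried over from the Spam tutorial.

```python
from functools import partial
from types import SimpleNamespace

ABSTAIN, GENERAL, API = -1, 0, 1


def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword appears in the record's text, else ABSTAIN."""
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


# Hypothetical keyword lists, one per class
api_lf = partial(keyword_lookup, keywords=["sdk", "aws"], label=API)
general_lf = partial(keyword_lookup, keywords=["framework"], label=GENERAL)

records = [
    SimpleNamespace(text="The Alexa Skills Kit SDK for Node.js"),
    SimpleNamespace(text="A general purpose deep learning framework"),
    SimpleNamespace(text="Build An Alexa Fact Skill"),
]

# One row per record, one column per keyword LF
votes = [[lf(r) for lf in (api_lf, general_lf)] for r in records]
print(votes)  # [[1, -1], [-1, 0], [-1, -1]]
```

Note that this is plain substring matching, so a keyword like `aws` would also fire on words such as "laws". Matching on word boundaries is a possible refinement.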

In [27]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()


In [None]:
utils.markdown_to_text(df['readme'].iloc[0])
utils.markdown_to_code(df['readme'].iloc[0])

In [None]:
import io
import re

from bs4 import BeautifulSoup
from markdown import markdown


def markdown_to_code(markdown_text):
    """Extract source code from Markdown: fenced blocks plus inline snippets"""
    code_blocks = []
    code_snippets = []  # Inline snippets get combined into a single block

    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            # Collect lines until the closing fence, or EOF if the fence is never closed
            code_block = []
            while True:
                block_line = f.readline()
                if not block_line or "```" in block_line:
                    break
                code_block.append(block_line)
            code_blocks.append("".join(code_block))
        else:
            # Capture every inline `snippet` on the line
            code_snippets.extend(re.findall("`([^`]+)`", line))

    # Now combine all snippets into one code block
    code_blocks.append(' '.join(code_snippets))

    return '\n'.join(code_blocks)


def markdown_to_text(markdown_text):
    """Extract plaintext - minus the code snippets - from Markdown"""
    text_blocks = []
    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            # Skip lines until the closing fence, or EOF if the fence is never closed
            while True:
                block_line = f.readline()
                if not block_line or "```" in block_line:
                    break
        else:
            # Drop inline `snippets` so only prose remains
            text_blocks.append(re.sub("`[^`]+`", "", line))

    # Render the remaining Markdown to HTML, then pull out the text nodes
    md = ''.join(text_blocks)
    html = markdown(md)
    soup = BeautifulSoup(html, 'lxml')

    # Keep every text node except bare newlines
    out_text = [text for text in soup.find_all(text=True) if text != '\n']
    return out_text

print(df['readme'].iloc[6][1204:-1])

markdown_to_text(df['readme'].iloc[6])