# Chapter 3: Introducing Snorkel

In this chapter I will introduce [Snorkel](http://snorkel.org), which we'll use throughout the book. [Snorkel](https://www.snorkel.org/) is a software project ([github](https://github.com/snorkel-team/snorkel)) originally from the Hazy Research group at Stanford University enabling the practice of *weak supervision*, *distant supervision*, *data augmentation* and *data slicing*.

The project has an excellent [Get Started](https://www.snorkel.org/get-started/) page, and I recommend you spend some time working the [tutorials](https://github.com/snorkel-team/snorkel-tutorials) before proceeding beyond this chapter. 

Snorkel implements an unsupervised generative model that accepts a matrix of weak labels for records in your training data and produces strong labels by learning the relationships between these weak labels through matrix factorization.

In [1]:
import random
import sys
sys.path.append("..")

import numpy as np
import pandas as pd
import pyarrow

from lib import utils


# Make randomness reproducible
random.seed(31337)
np.random.seed(31337)

[nltk_data] Downloading package punkt to /home/rjurney/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rjurney/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Example Project: Labeling Amazon Github Repositories

I have previously hand labeled about 2,600 Github repositories belonging to Amazon and its subsidiariesinto categories related to their purpose. We're going to use this dataset to introduce Snorkel.

### Hand Labeling this Data

In order to get a ground truth dataset against which to benchmark our Snorkel labeling, I hand labeled all Amazon Github projects in [this sheet](https://docs.google.com/spreadsheets/d/1wiesQSde5LwWV_vpMFQh24Lqx5Mr3VG7fk_e6yht0jU/edit?usp=sharing). The label categories are:

| Number | Code      | Description                          |
|--------|-----------|--------------------------------------|
| 0      | GENERAL   | A FOSS project of general utility    |
| 1      | API       | API library for AWS / Amazon product |
| 2      | RESEARCH  | A research paper and/or dataset      |
| 3      | DEAD      | Project is dead, no longer useful    |
| 3      | OTHER     | Uncertainty... what is this thing?   |

If you want to make corrections, please open the sheet, click on `File --> Make a Copy`, make any edits and then share the sheet with me.

In [99]:
readme_df = pd.read_parquet('../data/aws_github.parquet', engine='pyarrow')

readme_df = readme_df.sample(frac=1)

readme_df = readme_df.drop('html_url', axis=1)

readme_df = readme_df.fillna('')

readme_df.head()

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
155613615,aws-robotics/health-metrics-collector-ros1,This is a node for ROS1 that collects metrics ...,# health_metric_collector\n\n\n## Overview\nTh...,API
30503327,c9/c9.ide.language.javascript.eslint,The repository for c9.ide.language.javascript....,# c9.ide.language.javascript.eslint\n,GENERAL
79591397,aws-quickstart/quickstart-splunk-enterprise,AWS Quick Start Team,# Splunk Enterprise on AWS - Quick Start\n\nSo...,API
214026873,aws-samples/aws-codebuild-webhooks,A solution for CodeBuild custom webhook notifi...,# CodeBuild Webhooks\n\nA solution for CodeBui...,API
16440657,amazon-archives/kinesis-log4j-appender,ARCHIVED: Log4J Appender for writing data into...,# Archived\r\n\r\nThis is no longer supported....,API


## Profile the Data

Let's take a quick look at the labels to see what we'll be classifying.

In [100]:
print(f'Total records: {len(readme_df.index):,}')

readme_df['label'].value_counts()

Total records: 2,568


API         2265
GENERAL      279
DEAD          14
RESEARCH       9
OTHER          1
Name: label, dtype: int64

### How much general utility do Amazon's Github projects have?

One question that occurs to me to ask is - how much general utility do Amazon's Github projects have? Let's look at the number of `GENERAL` purpose compared to the number of `API` projects.

In [101]:
api_count     = readme_df[readme_df['label'] == '    API'].count(axis='index')['full_name']
general_count = readme_df[readme_df['label'] == 'GENERAL'].count(axis='index')['full_name']

general_pct = 100 * (general_count / (api_count + general_count))
api_pct     = 100 * (api_count / (api_count + general_count))

print(f'Percentage of projects having general utility:   {general_pct:,.3f}%')
print(f'Percentage of projects for Amazon products/APIs: {api_pct:,.3f}%')

Percentage of projects having general utility:   100.000%
Percentage of projects for Amazon products/APIs: 0.000%


### Simplify to `API` vs `GENERAL`

We throw out `DEAD`, `RESEARCH` and `OTHER` to focus on `API` vs `GENERAL` - is an open source project of general utility or is it a client to a company's commercial products? Highly imabalanced classes are hard to deal with when building a classifier, and 1:9 for `GENERAL`:`API` is bad enough.

In [102]:
df = readme_df[readme_df['label'].isin(['API', 'GENERAL'])]

print(f'Total records with API/GENERAL labels: {len(df.index):,}')

df.head()

Total records with API/GENERAL labels: 2,544


Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
155613615,aws-robotics/health-metrics-collector-ros1,This is a node for ROS1 that collects metrics ...,# health_metric_collector\n\n\n## Overview\nTh...,API
30503327,c9/c9.ide.language.javascript.eslint,The repository for c9.ide.language.javascript....,# c9.ide.language.javascript.eslint\n,GENERAL
79591397,aws-quickstart/quickstart-splunk-enterprise,AWS Quick Start Team,# Splunk Enterprise on AWS - Quick Start\n\nSo...,API
214026873,aws-samples/aws-codebuild-webhooks,A solution for CodeBuild custom webhook notifi...,# CodeBuild Webhooks\n\nA solution for CodeBui...,API
16440657,amazon-archives/kinesis-log4j-appender,ARCHIVED: Log4J Appender for writing data into...,# Archived\r\n\r\nThis is no longer supported....,API


### Split our Data into Training and Validation Data

In order to demonstrate Snorkel's capabilities, we need to create an experiment by splitting our data into three datasets:

* A hand labeled development dataset `dev_df` we will use to determine if our LFs work
* An unlabeled training dataset `train_df` that Snorkel's LabelModel will use to learn the labels
* A hand labeled test dataset `test_df` used to validate that the discriminative model we train with our labeled data works

The point of Snorkel is that you don't need labels - so we won't be using labels with the training dataset, `train_df`. Therefore we delete that variable to keep ourselves honest :) We also keep the development dataset `dev_df` small to demonstrate that you only need to label a small amount of representative data.

Once we've prepared our three dataset splits, because the labeled dev dataset `dev_df` is small, we run a value count for each of its labels to verify we have an adequate number of each label. It looks like we have around ten, which will do. People use Snorkel without any labels at all but at least ten of each label is very helpful in evaluating the performance, as we code, of the data programs we'll be writing to label data/

In [103]:
from sklearn.model_selection import train_test_split

# First split into a dev/train dataset we'll split next and a test dataset for our final model
dev_train_df, test_df, train_labels, test_labels = train_test_split(
    df,
    df['label'],
    test_size=0.75
)

# Then split the dev/train data to create a small labeled dev dataset and a larger unlabeled training dataset
dev_df, train_df, dev_labels, train_labels = train_test_split(
    dev_train_df,
    dev_train_df['label'],
    test_size=0.7
)

# Make sure our split of records makes sense
print(f'Total dev records:   {len(dev_df.index):,}')
print(f'Total train records: {len(train_df.index):,}')
print(f'Total test records:  {len(test_df.index):,}')

# Remove the training data labels - normally we would not have labeled these yet - this is why we're using Snorkel!
del train_labels

# Count labels in the dev set
dev_labels.value_counts(), test_labels.value_counts()

Total dev records:   190
Total train records: 446
Total test records:  1,908


(API        172
 GENERAL     18
 Name: label, dtype: int64,
 API        1703
 GENERAL     205
 Name: label, dtype: int64)

## Working with Snorkel

Snorkel has three primary programming interfaces: Labeling Functions, Transformation Functions and Slicing Functions.

<img 
     alt="Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions"
     src="images/snorkel_apis_0.9.5.png"
     width="500px"
/>
<div align="center">Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions, from <a href="https://www.snorkel.org/">Snorkel.org</a></div>

### Labeling Functions (LFs)

A labeling function is a deterministic function used to label data as belonging to one class or another. They produce weak labels that in combination, through Snorkel’s generative models, can be used to generate strong labels for unlabeled data.

The [Snorkel paper](https://arxiv.org/pdf/1711.10160.pdf) explains that LFs are open ended, that is that they can leverage information from multiple sources - both inside and outside the record. For example LFs can operate over different parts of the input document, working with document metadata, entire texts, individual paragraphs, sentences or words, parts of speech, named entities extracted by preprocessors, text embeddings or any augmentation of the record whatsoever. They can simultaneously leverage external databases and rules through *distant supervision*. These might include vocabulary for keyword searches, heuristics defined by theoretical considerations or equations, 

For example, a preprocessor might run a text document through a language model such as the included `SpacyPreprocessor` to run Named Entity Resolution (NER) and then look for words queried from WikiData that correspond to a given class. There are many ways to write LFs. We’ll define a broad taxonomy and then demonstrate some techniques from each.

The program interface for Labeling Functions is [`snorkel.labeling.LabelingFunction`](https://snorkel.readthedocs.io/en/v0.9.5/packages/_autosummary/labeling/snorkel.labeling.LabelingFunction.html#snorkel.labeling.LabelingFunction). They are instantiated with a name, a function reference, any resources the function needs and a list of any preprocessors to run on the data records before the labeling function runs.

<img alt="LabelingFunction API" src="images/labeling_function_api.png" width="600" />

### Defining Labeling Schema

In order to write our first labeling function, we need to define the label schema for our problem. The first label in any labeling schema is `-1` for `ABSTAIN`, which means "cast no vote" about the class of the record. This allows Snorkel Labeling Functions to vote only when they are certain, and is critical to how the system works since labeling functions have to perform better than random when they do vote or the Label Model won't work well.

The labels for this analysis are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |

In [104]:
# Define our numeric labels as integers
ABSTAIN = -1
GENERAL = 0
API     = 1


def map_labels(x):
    """Map string labels to integers"""
    if x == 'API':
        return API
    if x == 'GENERAL':
        return GENERAL


dev_labels    =   dev_labels.apply(map_labels, convert_dtype=True)
test_labels   =  test_labels.apply(map_labels, convert_dtype=True)

dev_labels.shape, test_labels.shape

((190,), (1908,))

### Writing our First Labeling Function

In order to write a labeling function, we must describe our data to associate a property with a certain class of records that can be programmed as a heuristic. Let's inspect some of our records. The classes are imbalanced 9:1, so lets pull a stratified sample of both labels.

Look at the data table produced by the records below and try to eyeball any patterns among the `API` and the `GENERAL` records. Do you see any markers for `API` records or `GENERAL` records?

In [105]:
# Set Pandas to display more than 10 rows
pd.set_option('display.max_rows', 100)

api_df     = dev_df[dev_df['label'] ==     'API'].sample(frac=1).head(20).sort_values(by='label')
general_df = dev_df[dev_df['label'] == 'GENERAL'].sample(frac=1).head(10).sort_values(by='label')

api_df.append(general_df).head(30)

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
117173209,awsdocs/amazon-redshift-management-guide,The open source version of the Amazon Redshift...,## amazon-redshift-management-guide\n\nThe ope...,API
174389100,aws/aws-app-mesh-controller-for-k8s,A controller to help manage App Mesh resources...,[![CircleCI](https://circleci.com/gh/aws/aws-a...,API
93118780,aws-quickstart/quickstart-datastax,AWS Quick Start Team,# quickstart-datastax\n## DataStax Enterprise ...,API
170767546,alexa/Alexa-Gadgets-Raspberry-Pi-Samples,This repository enables you to prototype an A...,# Alexa Gadgets Raspberry Pi Samples\n\nQuickl...,API
121428452,aws-quickstart/quickstart-cognizant-jupiter,AWS Quick Start Team,# quickstart-cognizant-jupiter\n## Jupiter on ...,API
48217670,awslabs/route53-dynamic-dns-with-lambda,"A Dynamic DNS system built with API Gateway, L...",# route53-dynamic-dns-with-lambda\n### A Dynam...,API
171544039,aws/amazon-elastic-inference-tools,Amazon Elastic Inference tools and utilities.,## Amazon Elastic Inference Tools\n\nAmazon El...,API
93426515,alexa/skill-sample-csharp-fact,An Alexa Skill Sample showing how to build a f...,# Build An Alexa Fact Skill in C#\n![](https:/...,API
74159230,awslabs/cloudwatch-logs-analyze-data,"A Lambda function that builds an on-demand, sc...",# Cloudwatch Logs Analyze data\n\n### Package ...,API
31339965,aws-samples/aws-training-demo,AWS Technical Trainers Demos,# aws-training-demo\n\nThis repository contain...,API


### Detecting Patterns

In looking at the `full_name` and `html_url`, it looks like projects with `sdk` in the title are `API` projects. Lets filter down to those records to see.

In [106]:
sdk_df = dev_df[dev_df['full_name'].str.contains('sdk')]

print(f'Total SDK records: {len(sdk_df.index)}')

sdk_df.groupby('label').count()['full_name']

Total SDK records: 9


label
API    9
Name: full_name, dtype: int64

## Building an SDK Labeling Function

There is an 11:0 `API`:`GENERAL` ratio of labels among records with `sdk` in their full_name. This is more than good enough for a Labeling Function (LF), since they only have to be better than random! Cool, eh? Don't worry, the `LabelModel` will figure out which signal from which LF to use :) It's like magic!

This is called a keyword labeling function, the simplest type. Despite their simplicity, keyword LFs are incredibly powerful ways to inject subject matter expertise into a project. In the chapter on Weak Supervision, we'll get into the various types of LFs and the strategies researchers and Snorkel users have come up with for labeling data. For now we'll create this and a couple of other LFs and see where that gets us.

In [107]:
# The verbosse way to define an LF
from snorkel.labeling import LabelingFunction


sdk_lf = LabelingFunction(
    name="name_contains_sdk_lf",
    f=lambda x: API if 'sdk' in x.full_name.lower() else ABSTAIN,
)

print(sdk_lf)


# The short form way to define an LF
from snorkel.labeling import labeling_function


@labeling_function()
def name_contains_sdk_lf(x):
    return API if 'sdk' in x.full_name.lower() else ABSTAIN

print(sdk_lf)

LabelingFunction name_contains_sdk_lf, Preprocessors: []
LabelingFunction name_contains_sdk_lf, Preprocessors: []


## Testing our `LabelingFunction`

Snorkel comes with tools to help you run your LFs on your dataset to see how they perform. We're using Pandas, so we use [`snorkel.labeling.PandasLFApplier`](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.PandasLFApplier.html) to apply our list of label functions (in this case just one) to the hand-labeled development dataset `dev_df` and the unlabeled training dataset `train_df`. Note that there are also `LFAppliers` for [Dask](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.dask.DaskLFApplier.html) and [PySpark](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.spark.SparkLFApplier.html#snorkel.labeling.apply.spark.SparkLFApplier). This 

In [108]:
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier


lfs = [sdk_lf]

# Instantiate our LF applier with our list of LabelFunctions (just one for now)
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel
100%|██████████| 190/190 [00:00<00:00, 46022.05it/s]
100%|██████████| 446/446 [00:00<00:00, 56017.84it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
name_contains_sdk_lf,0,[1],0.047368,0.0,0.0,9,0,1.0


In [109]:
# Run the same LF analysis on the unlabeled training data, accuracy yet unknown
LFAnalysis(L=L_train,  lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
name_contains_sdk_lf,0,[1],0.047085,0.0,0.0


## Interpreting the `LFAnalysis` Summary

Looking at the tables above coverage of our first LF is about 6%, which means that it abstains by voting `ABSTAIN`/`-1` 94% of the time. In practice we need enough `LabelingFunctions` to cover more of the data than this and we must also write at least one LF per unique tag. Now that we've got an LF for `API`, let's write one for `GENERAL`.

## Writing Another `LabelingFunction`

We need more than just one vote to accurately label our data, so now we're going to inspect the data again and arrive at several more LFs - data programs - to label the data as either `API` or `GENERAL`.

### Inspecting the Development Data

To begin, let's write a function to perform the operation we did above to create a DataFrame showing a mix of `API` and `GENERAL` labels to get a sense of the difference between them. This is the point at which we are injection domain expertise as a form of supervision. Convenient this is about software, as we are the domain experts :)

In [110]:
def stratified_sample(a_df, b_df, labels, n=[20, 10]):
    """Given two pd.DataFrames, their labels and desired ratios, acreate a stratified sample and display n records"""
    a_sample_df = a_df[a_df['label'] == labels[0]].sample(frac=1).head(n[0]).sort_values(by='label')
    b_sample_df = b_df[b_df['label'] == labels[1]].sample(frac=1).head(n[1]).sort_values(by='label')

    return a_df.append(b_df).head(sum(n))


stratified_sample(api_df, general_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
117173209,awsdocs/amazon-redshift-management-guide,The open source version of the Amazon Redshift...,## amazon-redshift-management-guide\n\nThe ope...,API
174389100,aws/aws-app-mesh-controller-for-k8s,A controller to help manage App Mesh resources...,[![CircleCI](https://circleci.com/gh/aws/aws-a...,API
93118780,aws-quickstart/quickstart-datastax,AWS Quick Start Team,# quickstart-datastax\n## DataStax Enterprise ...,API
170767546,alexa/Alexa-Gadgets-Raspberry-Pi-Samples,This repository enables you to prototype an A...,# Alexa Gadgets Raspberry Pi Samples\n\nQuickl...,API
121428452,aws-quickstart/quickstart-cognizant-jupiter,AWS Quick Start Team,# quickstart-cognizant-jupiter\n## Jupiter on ...,API
48217670,awslabs/route53-dynamic-dns-with-lambda,"A Dynamic DNS system built with API Gateway, L...",# route53-dynamic-dns-with-lambda\n### A Dynam...,API
171544039,aws/amazon-elastic-inference-tools,Amazon Elastic Inference tools and utilities.,## Amazon Elastic Inference Tools\n\nAmazon El...,API
93426515,alexa/skill-sample-csharp-fact,An Alexa Skill Sample showing how to build a f...,# Build An Alexa Fact Skill in C#\n![](https:/...,API
74159230,awslabs/cloudwatch-logs-analyze-data,"A Lambda function that builds an on-demand, sc...",# Cloudwatch Logs Analyze data\n\n### Package ...,API
31339965,aws-samples/aws-training-demo,AWS Technical Trainers Demos,# aws-training-demo\n\nThis repository contain...,API


### Creating an Ion `LabelingFunction`

I notice that there are two projects labeled `GENERAL` that have the word "ion" in their project name. I happen to know that Ion is Amazon's storage format for complex data, and that it is a project with general utility. 

#### Investingating the "ion"/`GENERAL` Pattern

Let's investigate and if it pans out we'll write another LF. 

In [111]:
dev_df[dev_df['full_name'].str.contains('ion')]

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
196085120,aws-samples/aws-cloudformation-publisher,AWS CloudFormation Publisher packages your Clo...,## AWS Cloudformation Publisher\n\nAWS CloudFo...,API
224040189,aws-samples/amazon-sagemaker-time-series-predi...,,This repository contains the material for the ...,API
152124418,aws-samples/aws-cloudformation-advanced-reinve...,Lab Materials for re:Invent 2018 workshop - Ha...,## AWS Cloudformation Advanced Reinvent 2018\n...,API
120396031,aws-quickstart/connect-integration-perficient-...,AWS Quick Start Team,# connect-integration-perficient-msdynamics\n#...,API
95870058,aws-samples/amazon-rekognition-video-analyzer,A working prototype for capturing frames off o...,Create a Serverless Pipeline for Video Frame A...,API
108622915,aws-samples/aws-lambda-manage-rds-connections,Sample code for dynamically managing RDS/RDBMS...,# Dynamic Connections Management for RDS/RDBMS...,API
145019226,amzn/ion-hive-serde,A Apache Hive SerDe (short for serializer/dese...,## Amazon Ion Hive Serde\n\nA Apache Hive SerD...,GENERAL
195899036,awslabs/smart-product-solution,The Smart Product Solution is a customer deplo...,"## Smart Product Solution\nSmart, connected pr...",API
125123613,aws-samples/SageMaker_seq2seq_WordPronunciation,Sequence to Sequence modeling have seen great ...,## SageMaker_seq2seq_WordPronunciation\n\nSequ...,API
173164152,awslabs/aws-cloudformation-template-formatter,cfn-format is a command line tool and Go libra...,[![GitHub version](https://badge.fury.io/gh/aw...,API


#### Iterating on our Pattern

Ah, it looks like "ion" isn't good enough, as it is picking up lots of other words with "ion" in them. Lets try "/ion" since the examples we can see have that pattern"

In [112]:
dev_df[dev_df['full_name'].str.contains('/ion')]

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
145019226,amzn/ion-hive-serde,A Apache Hive SerDe (short for serializer/dese...,## Amazon Ion Hive Serde\n\nA Apache Hive SerD...,GENERAL
55726413,amzn/ion-docs,Source for the GitHub Pages for Ion.,# ion-docs\n\nThis repository contains the con...,GENERAL


Looks good! While 3:0 is not overwhelming support I happen to know there are many Ion projects and it is likely they mostly follow this pattern. Remember, `LabelingFunctions` don't have to be perfect - they just have to perform better than random. The magic of Snorkel's `LabelModel` is that it is unsupervised and models the interactions between LFs as a generative, graphical model it then uses to predict strong labels. When combined, these LFs give the model enough signal work do its job, turning multiple weak labels into one strong label.

### Writing the Ion Labeling Function

Now that we have the pattern, we can write another keyword LF.

In [113]:
@labeling_function()
def name_contains_slash_ion(x):
    return GENERAL if '/ion' in x.full_name.lower() else ABSTAIN


# Update our list of LFs to include this one
lfs = [name_contains_sdk_lf, name_contains_slash_ion]

# Create and apply a new Pandas 
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel
100%|██████████| 190/190 [00:00<00:00, 33006.87it/s]
100%|██████████| 446/446 [00:00<00:00, 34535.04it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
name_contains_sdk_lf,0,[1],0.047368,0.0,0.0,9,0,1.0
name_contains_slash_ion,1,[0],0.010526,0.0,0.0,2,0,1.0


In [114]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
name_contains_sdk_lf,0,[1],0.047085,0.0,0.0
name_contains_slash_ion,1,[0],0.002242,0.0,0.0


### Evaluating the LF Analysis

This LF works but has low coverage. We'll have to do better in terms of coverage if we're going to do a good job labeling `GENERAL` projects!

### Writing Another `LabelingFunction`

Again let's inspect the data and look what pops out.

In [115]:
stratified_sample(api_df, general_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
117173209,awsdocs/amazon-redshift-management-guide,The open source version of the Amazon Redshift...,## amazon-redshift-management-guide\n\nThe ope...,API
174389100,aws/aws-app-mesh-controller-for-k8s,A controller to help manage App Mesh resources...,[![CircleCI](https://circleci.com/gh/aws/aws-a...,API
93118780,aws-quickstart/quickstart-datastax,AWS Quick Start Team,# quickstart-datastax\n## DataStax Enterprise ...,API
170767546,alexa/Alexa-Gadgets-Raspberry-Pi-Samples,This repository enables you to prototype an A...,# Alexa Gadgets Raspberry Pi Samples\n\nQuickl...,API
121428452,aws-quickstart/quickstart-cognizant-jupiter,AWS Quick Start Team,# quickstart-cognizant-jupiter\n## Jupiter on ...,API
48217670,awslabs/route53-dynamic-dns-with-lambda,"A Dynamic DNS system built with API Gateway, L...",# route53-dynamic-dns-with-lambda\n### A Dynam...,API
171544039,aws/amazon-elastic-inference-tools,Amazon Elastic Inference tools and utilities.,## Amazon Elastic Inference Tools\n\nAmazon El...,API
93426515,alexa/skill-sample-csharp-fact,An Alexa Skill Sample showing how to build a f...,# Build An Alexa Fact Skill in C#\n![](https:/...,API
74159230,awslabs/cloudwatch-logs-analyze-data,"A Lambda function that builds an on-demand, sc...",# Cloudwatch Logs Analyze data\n\n### Package ...,API
31339965,aws-samples/aws-training-demo,AWS Technical Trainers Demos,# aws-training-demo\n\nThis repository contain...,API


### Investigating Quick Start LFs

I see a pattern wherein proejct names with "quickstart" and project descriptions with "Quick Start" seem to be `API` projects. Let's see if we're right by isolating and inspecting these records and then counting the number of labels for this subset.

In [116]:
# First look for 
quickstart_name_df = dev_df[dev_df['full_name'].str.contains('quickstart')]
quickstart_name_df

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
88662666,aws-quickstart/quickstart-pivotal-cloudfoundry,AWS Quick Start Team,# quickstart-pivotal-cloudfoundry\n## Pivotal ...,API
82721804,aws-quickstart/quickstart-git2s3,AWS Quick Start Team,# quickstart-git2s3\n## Git webhooks with AWS ...,API
61226787,aws-quickstart/quickstart-microsoft-exchange,AWS Quick Start Team,# quickstart-microsoft-exchange\n## Microsoft ...,API
93118780,aws-quickstart/quickstart-datastax,AWS Quick Start Team,# quickstart-datastax\n## DataStax Enterprise ...,API
120396031,aws-quickstart/connect-integration-perficient-...,AWS Quick Start Team,# connect-integration-perficient-msdynamics\n#...,API
95723564,aws-quickstart/quickstart-bitnami-wordpress,AWS Quick Start Team,# quickstart-bitnami-wordpress\n## WordPress H...,API
200247219,aws-quickstart/quickstart-boomi-molecule,AWS Quick Start Team,# quickstart-boomi-molecule\n## Boomi Molecule...,API
166293447,aws-quickstart/quickstart-eks-newrelic-infrast...,AWS Quick Start Team,# quickstart-eks-newrelic-infrastructure\n## N...,API
184812907,aws-quickstart/quickstart-ibaset-solumina,AWS Quick Start Team,,API
128856142,aws-quickstart/quickstart-titian-mosaic,AWS Quick Start Team,# quickstart-titian-mosaic\n## Titian Mosaic F...,API


In [117]:
quickstart_df['label'].value_counts()

API    15
Name: label, dtype: int64

In [119]:
dev_df['description_lower'] = dev_df['description'].str.lower()
quickstart_desc_df = dev_df[dev_df['description_lower'].str.contains('quick start')]
quickstart_desc_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
88662666,aws-quickstart/quickstart-pivotal-cloudfoundry,AWS Quick Start Team,# quickstart-pivotal-cloudfoundry\n## Pivotal ...,API,aws quick start team
82721804,aws-quickstart/quickstart-git2s3,AWS Quick Start Team,# quickstart-git2s3\n## Git webhooks with AWS ...,API,aws quick start team
61226787,aws-quickstart/quickstart-microsoft-exchange,AWS Quick Start Team,# quickstart-microsoft-exchange\n## Microsoft ...,API,aws quick start team
93118780,aws-quickstart/quickstart-datastax,AWS Quick Start Team,# quickstart-datastax\n## DataStax Enterprise ...,API,aws quick start team
120396031,aws-quickstart/connect-integration-perficient-...,AWS Quick Start Team,# connect-integration-perficient-msdynamics\n#...,API,aws quick start team
95723564,aws-quickstart/quickstart-bitnami-wordpress,AWS Quick Start Team,# quickstart-bitnami-wordpress\n## WordPress H...,API,aws quick start team
200247219,aws-quickstart/quickstart-boomi-molecule,AWS Quick Start Team,# quickstart-boomi-molecule\n## Boomi Molecule...,API,aws quick start team
166293447,aws-quickstart/quickstart-eks-newrelic-infrast...,AWS Quick Start Team,# quickstart-eks-newrelic-infrastructure\n## N...,API,aws quick start team
184812907,aws-quickstart/quickstart-ibaset-solumina,AWS Quick Start Team,,API,aws quick start team
128856142,aws-quickstart/quickstart-titian-mosaic,AWS Quick Start Team,# quickstart-titian-mosaic\n## Titian Mosaic F...,API,aws quick start team


In [120]:
quickstart_desc_df['label'].value_counts()

API    17
Name: label, dtype: int64

### Evaluating Quick Start Strategy

So it looks like both the `full_name` pattern of `quickstart` (15 `API` labels) and the lowercase `description` pattern of `quick start` (17 `API` labels) both work. The description pattern matches two more records, otherwise they fully overlap. I'm going to leave both LFs in and move on to writing more LFs before we deal with evaluating results.

## Utilities for Creating Keyword LFs

We'll be creating several keyword labeling functions, so we're going to write some utility functions to make this more efficient. These come from the Snorkel Spam tutorial, and later we'll extend their capabilities to remove the need to write code for keyword LFs.

In [None]:
def keyword_lookup(x, keywords, label):
    """Lookup a keyword in a """
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

In [None]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()

In [None]:
utils.markdown_to_text(df['readme'].iloc[0])
utils.markdown_to_code(df['readme'].iloc[0])

In [None]:
import io
import re

from bs4 import BeautifulSoup
from markdown import markdown


def markdown_to_code(markdown_text):
    """Extract source code from Markdown snippets"""
    code_blocks = []
    code_snippets = [] # These get a single block

    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            code_block = [f.readline()]
            while re.search("```", code_block[-1]) is None:
                code_block.append(f.readline())
            code_blocks.append("".join(code_block[:-1]))
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')
                    code_snippets.append(group)
    
    # Now combine all snippets into one code block
    code_blocks.append(' '.join(code_snippets))
    
    return '\n'.join(code_blocks)


def markdown_to_text(markdown_text):
    """Extract plaintext - minus the code snippets - from Markdown"""
    text_blocks = []
    f = io.StringIO(markdown_text)
    i = 0
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            print('is_block')
            first_record = f.readline()
            second_record = f.readline()
            print(f'first_record: {first_record}')
            print(f'second_record: {second_record}')
            code_block = [first_record]
            while re.search("```", code_block[-1]) is None:
                print('inside_block')
                f.readline()
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')

            text_blocks.append(line)
        i += 1
    
    md = ''.join(text_blocks)
    html = markdown(md)
    soup = BeautifulSoup(html, 'lxml')
    text = soup.find_all(text=True)
    out_text = []
    for text in text:
        if text == '\n':
            pass
        else:
            out_text.append(text)
    return out_text

print(df['readme'].iloc[6][1204:-1])

markdown_to_text(df['readme'].iloc[6])