# Chapter 3: Introducing Snorkel

In this chapter I will introduce [Snorkel](http://snorkel.org), which we'll use throughout the book. [Snorkel](https://www.snorkel.org/) is a software project ([github](https://github.com/snorkel-team/snorkel)) originally from the Hazy Research group at Stanford University enabling the practice of *weak supervision*, *distant supervision*, *data augmentation* and *data slicing*.

The project has an excellent [Get Started](https://www.snorkel.org/get-started/) page, and I recommend you spend some time working the [tutorials](https://github.com/snorkel-team/snorkel-tutorials) before proceeding beyond this chapter. 

Snorkel implements an unsupervised generative model that accepts a matrix of weak labels for records in your training data and produces strong labels by learning the relationships between these weak labels through matrix factorization.

In [1]:
import random
import sys
sys.path.append("..")

import numpy as np
import pandas as pd
import pyarrow

from lib import utils


# Make randomness reproducible
random.seed(31337)
np.random.seed(31337)

[nltk_data] Downloading package punkt to /home/rjurney/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rjurney/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Example Project: Labeling Amazon Github Repositories

I have previously hand labeled about 2,600 Github repositories belonging to Amazon and its subsidiariesinto categories related to their purpose. We're going to use this dataset to introduce Snorkel.

### Hand Labeling this Data

In order to get a ground truth dataset against which to benchmark our Snorkel labeling, I hand labeled all Amazon Github projects in [this sheet](https://docs.google.com/spreadsheets/d/1wiesQSde5LwWV_vpMFQh24Lqx5Mr3VG7fk_e6yht0jU/edit?usp=sharing). The label categories are:

| Number | Code      | Description                          |
|--------|-----------|--------------------------------------|
| 0      | GENERAL   | A FOSS project of general utility    |
| 1      | API       | API library for AWS / Amazon product |
| 2      | RESEARCH  | A research paper and/or dataset      |
| 3      | DEAD      | Project is dead, no longer useful    |
| 3      | OTHER     | Uncertainty... what is this thing?   |

If you want to make corrections, please open the sheet, click on `File --> Make a Copy`, make any edits and then share the sheet with me.

In [99]:
readme_df = pd.read_parquet('../data/aws_github.parquet', engine='pyarrow')

readme_df = readme_df.sample(frac=1)

readme_df = readme_df.drop('html_url', axis=1)

readme_df = readme_df.fillna('')

readme_df.head()

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
155613615,aws-robotics/health-metrics-collector-ros1,This is a node for ROS1 that collects metrics ...,# health_metric_collector\n\n\n## Overview\nTh...,API
30503327,c9/c9.ide.language.javascript.eslint,The repository for c9.ide.language.javascript....,# c9.ide.language.javascript.eslint\n,GENERAL
79591397,aws-quickstart/quickstart-splunk-enterprise,AWS Quick Start Team,# Splunk Enterprise on AWS - Quick Start\n\nSo...,API
214026873,aws-samples/aws-codebuild-webhooks,A solution for CodeBuild custom webhook notifi...,# CodeBuild Webhooks\n\nA solution for CodeBui...,API
16440657,amazon-archives/kinesis-log4j-appender,ARCHIVED: Log4J Appender for writing data into...,# Archived\r\n\r\nThis is no longer supported....,API


## Profile the Data

Let's take a quick look at the labels to see what we'll be classifying.

In [100]:
print(f'Total records: {len(readme_df.index):,}')

readme_df['label'].value_counts()

Total records: 2,568


API         2265
GENERAL      279
DEAD          14
RESEARCH       9
OTHER          1
Name: label, dtype: int64

### How much general utility do Amazon's Github projects have?

One question that occurs to me to ask is - how much general utility do Amazon's Github projects have? Let's look at the number of `GENERAL` purpose compared to the number of `API` projects.

In [101]:
api_count     = readme_df[readme_df['label'] == '    API'].count(axis='index')['full_name']
general_count = readme_df[readme_df['label'] == 'GENERAL'].count(axis='index')['full_name']

general_pct = 100 * (general_count / (api_count + general_count))
api_pct     = 100 * (api_count / (api_count + general_count))

print(f'Percentage of projects having general utility:   {general_pct:,.3f}%')
print(f'Percentage of projects for Amazon products/APIs: {api_pct:,.3f}%')

Percentage of projects having general utility:   100.000%
Percentage of projects for Amazon products/APIs: 0.000%


### Simplify to `API` vs `GENERAL`

We throw out `DEAD`, `RESEARCH` and `OTHER` to focus on `API` vs `GENERAL` - is an open source project of general utility or is it a client to a company's commercial products? Highly imabalanced classes are hard to deal with when building a classifier, and 1:9 for `GENERAL`:`API` is bad enough.

In [102]:
df = readme_df[readme_df['label'].isin(['API', 'GENERAL'])]

print(f'Total records with API/GENERAL labels: {len(df.index):,}')

df.head()

Total records with API/GENERAL labels: 2,544


Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
155613615,aws-robotics/health-metrics-collector-ros1,This is a node for ROS1 that collects metrics ...,# health_metric_collector\n\n\n## Overview\nTh...,API
30503327,c9/c9.ide.language.javascript.eslint,The repository for c9.ide.language.javascript....,# c9.ide.language.javascript.eslint\n,GENERAL
79591397,aws-quickstart/quickstart-splunk-enterprise,AWS Quick Start Team,# Splunk Enterprise on AWS - Quick Start\n\nSo...,API
214026873,aws-samples/aws-codebuild-webhooks,A solution for CodeBuild custom webhook notifi...,# CodeBuild Webhooks\n\nA solution for CodeBui...,API
16440657,amazon-archives/kinesis-log4j-appender,ARCHIVED: Log4J Appender for writing data into...,# Archived\r\n\r\nThis is no longer supported....,API


### Split our Data into Training and Validation Data

In order to demonstrate Snorkel's capabilities, we need to create an experiment by splitting our data into three datasets:

* A hand labeled development dataset `dev_df` we will use to determine if our LFs work
* An unlabeled training dataset `train_df` that Snorkel's LabelModel will use to learn the labels
* A hand labeled test dataset `test_df` used to validate that the discriminative model we train with our labeled data works

The point of Snorkel is that you don't need labels - so we won't be using labels with the training dataset, `train_df`. Therefore we delete that variable to keep ourselves honest :) We also keep the development dataset `dev_df` small to demonstrate that you only need to label a small amount of representative data.

Once we've prepared our three dataset splits, because the labeled dev dataset `dev_df` is small, we run a value count for each of its labels to verify we have an adequate number of each label. It looks like we have around ten, which will do. People use Snorkel without any labels at all but at least ten of each label is very helpful in evaluating the performance, as we code, of the data programs we'll be writing to label data/

In [144]:
from sklearn.model_selection import train_test_split

# First split into a dev/train dataset we'll split next and a test dataset for our final model
dev_train_df, test_df, train_labels, test_labels = train_test_split(
    df,
    df['label'],
    test_size=0.75
)

# Then split the dev/train data to create a small labeled dev dataset and a larger unlabeled training dataset
dev_df, train_df, dev_labels, train_labels = train_test_split(
    dev_train_df,
    dev_train_df['label'],
    test_size=0.65
)

# Make sure our split of records makes sense
print(f'Total dev records:   {len(dev_df.index):,}')
print(f'Total train records: {len(train_df.index):,}')
print(f'Total test records:  {len(test_df.index):,}')

# Remove the training data labels - normally we would not have labeled these yet - this is why we're using Snorkel!
del train_labels

# Count labels in the dev set
dev_labels.value_counts(), test_labels.value_counts()

Total dev records:   222
Total train records: 414
Total test records:  1,908


(API        191
 GENERAL     31
 Name: label, dtype: int64,
 API        1699
 GENERAL     209
 Name: label, dtype: int64)

## Working with Snorkel

Snorkel has three primary programming interfaces: Labeling Functions, Transformation Functions and Slicing Functions.

<img 
     alt="Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions"
     src="images/snorkel_apis_0.9.5.png"
     width="500px"
/>
<div align="center">Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions, from <a href="https://www.snorkel.org/">Snorkel.org</a></div>

### Labeling Functions (LFs)

A labeling function is a deterministic function used to label data as belonging to one class or another. They produce weak labels that in combination, through Snorkel’s generative models, can be used to generate strong labels for unlabeled data.

The [Snorkel paper](https://arxiv.org/pdf/1711.10160.pdf) explains that LFs are open ended, that is that they can leverage information from multiple sources - both inside and outside the record. For example LFs can operate over different parts of the input document, working with document metadata, entire texts, individual paragraphs, sentences or words, parts of speech, named entities extracted by preprocessors, text embeddings or any augmentation of the record whatsoever. They can simultaneously leverage external databases and rules through *distant supervision*. These might include vocabulary for keyword searches, heuristics defined by theoretical considerations or equations, 

For example, a preprocessor might run a text document through a language model such as the included `SpacyPreprocessor` to run Named Entity Resolution (NER) and then look for words queried from WikiData that correspond to a given class. There are many ways to write LFs. We’ll define a broad taxonomy and then demonstrate some techniques from each.

The program interface for Labeling Functions is [`snorkel.labeling.LabelingFunction`](https://snorkel.readthedocs.io/en/v0.9.5/packages/_autosummary/labeling/snorkel.labeling.LabelingFunction.html#snorkel.labeling.LabelingFunction). They are instantiated with a name, a function reference, any resources the function needs and a list of any preprocessors to run on the data records before the labeling function runs.

<img alt="LabelingFunction API" src="images/labeling_function_api.png" width="600" />

### Defining Labeling Schema

In order to write our first labeling function, we need to define the label schema for our problem. The first label in any labeling schema is `-1` for `ABSTAIN`, which means "cast no vote" about the class of the record. This allows Snorkel Labeling Functions to vote only when they are certain, and is critical to how the system works since labeling functions have to perform better than random when they do vote or the Label Model won't work well.

The labels for this analysis are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |

In [145]:
# Define our numeric labels as integers
ABSTAIN = -1
GENERAL = 0
API     = 1


def map_labels(x):
    """Map string labels to integers"""
    if x == 'API':
        return API
    if x == 'GENERAL':
        return GENERAL


dev_labels    =   dev_labels.apply(map_labels, convert_dtype=True)
test_labels   =  test_labels.apply(map_labels, convert_dtype=True)

dev_labels.shape, test_labels.shape

((222,), (1908,))

### Writing our First Labeling Function

In order to write a labeling function, we must describe our data to associate a property with a certain class of records that can be programmed as a heuristic. Let's inspect some of our records. The classes are imbalanced 9:1, so lets pull a stratified sample of both labels.

Look at the data table produced by the records below and try to eyeball any patterns among the `API` and the `GENERAL` records. Do you see any markers for `API` records or `GENERAL` records?

In [146]:
# Set Pandas to display more than 10 rows
pd.set_option('display.max_rows', 100)

api_df     = dev_df[dev_df['label'] ==     'API'].sample(frac=1).head(20).sort_values(by='label')
general_df = dev_df[dev_df['label'] == 'GENERAL'].sample(frac=1).head(10).sort_values(by='label')

api_df.append(general_df).head(30)

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
92872366,aws-samples/aws-ai-tracking-bot,"Code samples related to ""Activity Tracking wit...",# tracking-bot\n\nA sample application that us...,API
119881237,awsdocs/amazon-ec2-user-guide,The open source version of the Amazon EC2 User...,## Amazon EC2 User Guide for Linux\n\nThe open...,API
222821523,aws-samples/aws-iot-securetunneling-localproxy,AWS Iot Secure Tunneling local proxy reference...,## AWS IoT Secure Tunneling Local Proxy Refere...,API
168239718,aws-robotics/lex-ros2,ROS packages for facilitating the use of AWS c...,# lex_node\n\n## Overview\nThe ROS `lex_node` ...,API
148551161,aws-samples/machine-learning-using-k8s,Train and Deploy Machine Learning Models on Ku...,# Machine Learning Frameworks on Kubernetes\n\...,API
109730420,aws-samples/amazon-lex-bot-test,Script to test an Amazon Lex bot using the Ama...,# amazon-lex-bot-test\n\nThis is an example sc...,API
95593035,aws-quickstart/quickstart-informatica-secureat...,AWS Quick Start Team,# quickstart-informatica-secureatsource\n## In...,API
191425391,aws/aws-greengrass-core-sdk-js,Greengrass Nodejs SDK,"<div id=""content"">\n\n<div id=""filecontents"">\...",API
129298819,aws-samples/amazon-sagemaker-brain-segmentation,A Jupyter notebook w/ script demonstrating how...,## Amazon Sagemaker Brain Segmentation\n\nA Ju...,API
31675346,aws-samples/py-flask-signup-docker,Sample Python application to show the capabili...,# py-flask-signup-docker\nThis Python sample a...,API


### Detecting Patterns

In looking at the `full_name` and `html_url`, it looks like projects with `sdk` in the title are `API` projects. Lets filter down to those records to see.

In [147]:
sdk_df = dev_df[dev_df['full_name'].str.contains('sdk')]

print(f'Total SDK records: {len(sdk_df.index)}')

sdk_df.groupby('label').count()['full_name']

Total SDK records: 10


label
API    10
Name: full_name, dtype: int64

## Building an SDK Labeling Function

There is an 15:1 `API`:`GENERAL` ratio of labels among records with `sdk` in their full_name. This is more than good enough for a Labeling Function (LF), since they only have to be better than random! Cool, eh? Don't worry, the `LabelModel` will figure out which signal from which LF to use :) It's like magic!

This is called a keyword labeling function, the simplest type. Despite their simplicity, keyword LFs are incredibly powerful ways to inject subject matter expertise into a project. In the chapter on Weak Supervision, we'll get into the various types of LFs and the strategies researchers and Snorkel users have come up with for labeling data. For now we'll create this and a couple of other LFs and see where that gets us.

In [191]:
# The verbosse way to define an LF
from snorkel.labeling import LabelingFunction


sdk_lf = LabelingFunction(
    name="sdk_lf",
    f=lambda x: API if 'sdk' in x.full_name.lower() else ABSTAIN,
)

print(sdk_lf)


# The short form way to define an LF
from snorkel.labeling import labeling_function


@labeling_function()
def sdk_lf(x):
    return API if 'sdk' in x.full_name.lower() else ABSTAIN

print(sdk_lf)

LabelingFunction sdk_lf, Preprocessors: []
LabelingFunction sdk_lf, Preprocessors: []


## Testing our `LabelingFunction`

Snorkel comes with tools to help you run your LFs on your dataset to see how they perform. We're using Pandas, so we use [`snorkel.labeling.PandasLFApplier`](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.PandasLFApplier.html) to apply our list of label functions (in this case just one) to the hand-labeled development dataset `dev_df` and the unlabeled training dataset `train_df`. Note that there are also `LFAppliers` for [Dask](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.dask.DaskLFApplier.html) and [PySpark](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.spark.SparkLFApplier.html#snorkel.labeling.apply.spark.SparkLFApplier). This 

In [192]:
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier


lfs = [sdk_lf]

# Instantiate our LF applier with our list of LabelFunctions (just one for now)
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel
100%|██████████| 222/222 [00:00<00:00, 52416.99it/s]
100%|██████████| 414/414 [00:00<00:00, 61138.01it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.045045,0.0,0.0,10,0,1.0


In [193]:
# Run the same LF analysis on the unlabeled training data, accuracy yet unknown
LFAnalysis(L=L_train,  lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
sdk_lf,0,[1],0.050725,0.0,0.0


## Interpreting the `LFAnalysis` Summary

Looking at the tables above coverage of our first LF is about 6%, which means that it abstains by voting `ABSTAIN`/`-1` 94% of the time. In practice we need enough `LabelingFunctions` to cover more of the data than this and we must also write at least one LF per unique tag. Now that we've got an LF for `API`, let's write one for `GENERAL`.

## Writing Another `LabelingFunction`

We need more than just one vote to accurately label our data, so now we're going to inspect the data again and arrive at several more LFs - data programs - to label the data as either `API` or `GENERAL`.

### Inspecting the Development Data

To begin, let's write a function to perform the operation we did above to create a DataFrame showing a mix of `API` and `GENERAL` labels to get a sense of the difference between them. This is the point at which we are injection domain expertise as a form of supervision. Convenient this is about software, as we are the domain experts :)

In [166]:
def stratified_sample(df, labels, n=[20, 10]):
    """Given two pd.DataFrames, their labels and desired ratios, acreate a stratified sample and display n records"""
    a_sample_df = df[df['label'] == labels[0]].sample(frac=1).head(n[0]).sort_values(by='label')
    b_sample_df = df[df['label'] == labels[1]].sample(frac=1).head(n[1]).sort_values(by='label')

    return a_sample_df.append(b_sample_df).head(sum(n))



stratified_sample(dev_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
103607613,aws-quickstart/quickstart-tibco-jaspersoft,AWS Quick Start Team,# quickstart-tibco-jaspersoft\n## TIBCO Jasper...,API,aws quick start team
95870058,aws-samples/amazon-rekognition-video-analyzer,A working prototype for capturing frames off o...,Create a Serverless Pipeline for Video Frame A...,API,a working prototype for capturing frames off o...
39026916,aws/amazon-cognito-dotnet,Official repository for Amazon Cognito Sync Ma...,# AWS Sync Manager SDK for .NET (Amazon Cognit...,API,official repository for amazon cognito sync ma...
92872366,aws-samples/aws-ai-tracking-bot,"Code samples related to ""Activity Tracking wit...",# tracking-bot\n\nA sample application that us...,API,"code samples related to ""activity tracking wit..."
2344662,amazon-archives/aws-tvm-anonymous,ARCHIVED: Token Vending Machine for Anonymous ...,\nARCHIVED\n--------\n\nAWS has released Amazo...,API,archived: token vending machine for anonymous ...
170180132,awslabs/aws-lambda-java-AWSHealth-check,Java SAM Lambda module for periodically checki...,## AWS Lambda Java module for periodic checkin...,API,java sam lambda module for periodically checki...
224273355,aws-samples/serverless-refarch-for-proxysql,AWS Serverless Reference Architecture for Prox...,# Serverless Reference Architecture for Proxy...,API,aws serverless reference architecture for prox...
72564812,amazon-archives/serverless-image-resizing,ARCHIVED,# Archived\n\nSee https://github.com/awslabs/s...,API,archived
180665776,awslabs/aws-service-catalog-factory,This is a framework where you define a Service...,# aws-service-catalog-factory\n\n![logo](./doc...,API,this is a framework where you define a service...
140887504,aws-amplify/aws-amplify.github.io,Amplify Framework Website,# aws-amplify.github.io\nWebsite\n,API,amplify framework website


### Creating an Ion `LabelingFunction`

I notice that there are two projects labeled `GENERAL` that have the word "ion" in their project name. I happen to know that Ion is Amazon's storage format for complex data, and that it is a project with general utility. 

#### Investingating the "ion"/`GENERAL` Pattern

Let's investigate and if it pans out we'll write another LF. 

In [167]:
dev_df[dev_df['full_name'].str.contains('ion')]

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
148686462,aws-samples/aws-serverless-subscription-servic...,Building Serverless Subscription Service using...,# Building Subscription Service\n\n## Summary ...,API,building serverless subscription service using...
177833827,aws-quickstart/connect-integration-radish-choi...,AWS Quick Start Team,# connect-integration-radish-choiceview\n## Am...,API,aws quick start team
94404082,aws-samples/aws-rekognition-workshop-twitter-bot,This workshop walks you through creating a sma...,# Build A Rekognition Powered Twitter Bot\n\n#...,API,this workshop walks you through creating a sma...
225433531,amzn/ion-hash-dotnet,A .NET implementation of Amazon Ion Hash.,## My Project\n\nTODO: Fill this README out!\n...,GENERAL,a .net implementation of amazon ion hash.
228691108,aws-samples/amazon-ec2-hpc-automation-samples,AWS HPC-based samples,,API,aws hpc-based samples
150444544,aws-quickstart/quickstart-suse-cloud-applicati...,AWS Quick Start Team,# quickstart-suse-cloud-application-platform\n...,API,aws quick start team
172956400,aws-samples/aws-cloudformation-apigw-sap-idocs,This repository contains sample Lambda functio...,## AWS Cloudformation Apigw Sap Idocs\n\nThis ...,API,this repository contains sample lambda functio...
129298819,aws-samples/amazon-sagemaker-brain-segmentation,A Jupyter notebook w/ script demonstrating how...,## Amazon Sagemaker Brain Segmentation\n\nA Ju...,API,a jupyter notebook w/ script demonstrating how...
95870058,aws-samples/amazon-rekognition-video-analyzer,A working prototype for capturing frames off o...,Create a Serverless Pipeline for Video Frame A...,API,a working prototype for capturing frames off o...
202444854,aws-samples/aws-modernization-with-snyk,AWS Modernization Code Samples with Synk,# Snyk AWS Workshop\n\nThis workshop shows som...,API,aws modernization code samples with synk


#### Iterating on our Pattern

Ah, it looks like "ion" isn't good enough, as it is picking up lots of other words with "ion" in them. Lets try "/ion" since the examples we can see have that pattern"

In [168]:
dev_df[dev_df['full_name'].str.contains('/ion')]

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
225433531,amzn/ion-hash-dotnet,A .NET implementation of Amazon Ion Hash.,## My Project\n\nTODO: Fill this README out!\n...,GENERAL,a .net implementation of amazon ion hash.
178966086,amzn/ion-kotlin-builder,This library provides Kotlin style type-safe b...,[![Build Status](https://travis-ci.org/amzn/io...,GENERAL,this library provides kotlin style type-safe b...


Looks good! While 3:0 is not overwhelming support I happen to know there are many Ion projects and it is likely they mostly follow this pattern. Remember, `LabelingFunctions` don't have to be perfect - they just have to perform better than random. The magic of Snorkel's `LabelModel` is that it is unsupervised and models the interactions between LFs as a generative, graphical model it then uses to predict strong labels. When combined, these LFs give the model enough signal work do its job, turning multiple weak labels into one strong label.

### Writing the Ion Labeling Function

Now that we have the pattern, we can write another keyword LF.

In [197]:
@labeling_function()
def ion_lf(x):
    return GENERAL if '/ion' in x.full_name.lower() else ABSTAIN


# Update our list of LFs to include this one
lfs = [sdk_lf, ion_lf]

# Create and apply a new Pandas 
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel
100%|██████████| 222/222 [00:00<00:00, 31576.76it/s]
100%|██████████| 414/414 [00:00<00:00, 32938.94it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.045045,0.0,0.0,10,0,1.0
ion_lf,1,[0],0.009009,0.0,0.0,2,0,1.0


In [198]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
sdk_lf,0,[1],0.050725,0.0,0.0
ion_lf,1,[0],0.007246,0.0,0.0


### Evaluating the LF Analysis

This LF works but has low coverage. We'll have to do better in terms of coverage if we're going to do a good job labeling `GENERAL` projects!

### Writing Another `LabelingFunction`

Again let's inspect the data and look what pops out.

In [171]:
stratified_sample(dev_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
72667060,awslabs/statelint,A Ruby gem that provides a command-line valida...,# statelint\nA Ruby gem that provides a comman...,API,a ruby gem that provides a command-line valida...
146550078,alexa/alexa-video-multimodal,This repository contains sample code and refer...,,API,this repository contains sample code and refer...
84363643,aws-samples/aws-step-functions-ebs-snapshot-mgmt,Example architecture for integrating AWS Step ...,# aws-step-functions-ebs-snapshot-mgmt\n\nExam...,API,example architecture for integrating aws step ...
144768346,alexa/skill-sample-csharp-smarthome-switch,This is a basic Alexa Smart Home skill sample ...,## Skill Sample : Smarthome Switch (C#)\n\nThi...,API,this is a basic alexa smart home skill sample ...
58993179,awsdocs/aws-toolkit-eclipse-user-guide,Content for the AWS Toolkit for Eclipse User G...,".. Copyright 2010-2016 Amazon.com, Inc. or its...",API,content for the aws toolkit for eclipse user g...
192413997,aws-quickstart/quickstart-dotnet-serverless-cicd,AWS Quick Start Team,# quickstart-dotnet-serverless-cicd\n## .NET S...,API,aws quick start team
211362544,awslabs/aws-sam-cli-app-templates,,# AWS SAM CLI Application Templates\n\nThis re...,API,
210716595,awslabs/ml-io,A high performance data access library for mac...,[![Download](https://img.shields.io/conda/pn/m...,API,a high performance data access library for mac...
135637462,alexa/alexa-guided-walkthrough-using-node-sdk,Code walkthrough guides to show the ins and ou...,# Alexa Skill Guided Walkthrough using the Nod...,API,code walkthrough guides to show the ins and ou...
51314087,awslabs/aws-config-rules,"[Node, Python, Java] Repository of sample Cust...",# AWS Config Rules Repository\n\nAWS Community...,API,"[node, python, java] repository of sample cust..."


### Investigating Quick Start LFs

I see a pattern wherein proejct names with "quickstart" and project descriptions with "Quick Start" seem to be `API` projects. Let's see if we're right by isolating and inspecting these records and then counting the number of labels for this subset.

In [172]:
# First look for 
quickstart_name_df = dev_df[dev_df['full_name'].str.contains('quickstart')]
quickstart_name_df

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
62259697,aws-quickstart/quickstart-cloudvideoediting,AWS Quick Start Team,# quickstart-cloudvideoediting\n## Cloud Video...,API,aws quick start team
192413997,aws-quickstart/quickstart-dotnet-serverless-cicd,AWS Quick Start Team,# quickstart-dotnet-serverless-cicd\n## .NET S...,API,aws quick start team
177833827,aws-quickstart/connect-integration-radish-choi...,AWS Quick Start Team,# connect-integration-radish-choiceview\n## Am...,API,aws quick start team
96921124,aws-quickstart/quickstart-datalake-wandisco,AWS Quick Start Team,# quickstart-datalake-wandisco\n## Hybrid Data...,API,aws quick start team
193150511,aws-quickstart/quickstart-amazon-redshift,AWS Quick Start Team,# quickstart-amazon-redshift\n## Modular archi...,API,aws quick start team
150444544,aws-quickstart/quickstart-suse-cloud-applicati...,AWS Quick Start Team,# quickstart-suse-cloud-application-platform\n...,API,aws quick start team
82721804,aws-quickstart/quickstart-git2s3,AWS Quick Start Team,# quickstart-git2s3\n## Git webhooks with AWS ...,API,aws quick start team
58955758,aws-quickstart/quickstart-trendmicro-deepsecurity,AWS Quick Start Team,# quickstart-trendmicro-deepsecurity\n## Trend...,API,aws quick start team
103607613,aws-quickstart/quickstart-tibco-jaspersoft,AWS Quick Start Team,# quickstart-tibco-jaspersoft\n## TIBCO Jasper...,API,aws quick start team
215645430,aws-quickstart/quickstart-compliance-irap-prot...,AWS Quick Start Team,,API,aws quick start team


In [173]:
quickstart_df['label'].value_counts()

API    15
Name: label, dtype: int64

In [246]:
dev_df['description_lower'] = dev_df['description'].str.lower()
quickstart_desc_df = dev_df[dev_df['description_lower'].str.contains('quick start')]

del dev_df['description_lower']

quickstart_desc_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
62259697,aws-quickstart/quickstart-cloudvideoediting,AWS Quick Start Team,# quickstart-cloudvideoediting\n## Cloud Video...,API,aws quick start team
192413997,aws-quickstart/quickstart-dotnet-serverless-cicd,AWS Quick Start Team,# quickstart-dotnet-serverless-cicd\n## .NET S...,API,aws quick start team
177833827,aws-quickstart/connect-integration-radish-choi...,AWS Quick Start Team,# connect-integration-radish-choiceview\n## Am...,API,aws quick start team
96921124,aws-quickstart/quickstart-datalake-wandisco,AWS Quick Start Team,# quickstart-datalake-wandisco\n## Hybrid Data...,API,aws quick start team
193150511,aws-quickstart/quickstart-amazon-redshift,AWS Quick Start Team,# quickstart-amazon-redshift\n## Modular archi...,API,aws quick start team
150444544,aws-quickstart/quickstart-suse-cloud-applicati...,AWS Quick Start Team,# quickstart-suse-cloud-application-platform\n...,API,aws quick start team
82721804,aws-quickstart/quickstart-git2s3,AWS Quick Start Team,# quickstart-git2s3\n## Git webhooks with AWS ...,API,aws quick start team
58955758,aws-quickstart/quickstart-trendmicro-deepsecurity,AWS Quick Start Team,# quickstart-trendmicro-deepsecurity\n## Trend...,API,aws quick start team
103607613,aws-quickstart/quickstart-tibco-jaspersoft,AWS Quick Start Team,# quickstart-tibco-jaspersoft\n## TIBCO Jasper...,API,aws quick start team
215645430,aws-quickstart/quickstart-compliance-irap-prot...,AWS Quick Start Team,,API,aws quick start team


In [247]:
quickstart_desc_df['label'].value_counts()

API    18
Name: label, dtype: int64

### Evaluating Quick Start Strategy

So it looks like both the `full_name` pattern of `quickstart` (15 `API` labels) and the lowercase `description` pattern of `quick start` (18 `API` labels) both work. The description pattern matches two more records, otherwise they fully overlap. I'm going to leave both LFs in and move on to writing more LFs before we deal with evaluating results.

### Writing Another `LabelingFunction`

We're not done yet! We need two more LFs to demonstrate Snorkel's `LabelModel`. Lets do a `GENERAL` LF now. We start again by eyeballing the data.

In [257]:
# Change the maximum column width if we've set it longer below
pd.set_option('display.max_colwidth', 100)

stratified_sample(dev_df, ['API', 'GENERAL'], n=[10, 20])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
111071339,aws/ec2-hibernate-linux-agent,A Hibernating Agent for Linux on Amazon EC2,# The EC2 Spot hibernation agent.\n\n## License\nThe code is released under Apache License Vesio...,API
158597738,aws-samples/amazon-rds-data-api-demo,Its an example Lambda app which showcases how to run queries using SDK for Aurora Serverless Dat...,## Amazon RDS Data API Demo\n\nIts an example Lambda function which showcases how to run queries...,API
75421061,amazon-archives/naip-on-aws,"NAIP on AWS is a web application that uses Amazon S3, Amazon API Gateway, and AWS Lambda to crea...","﻿# NAIP on AWS viewer\n\nThe NAIP on AWS viewer is a serverless, lightweight, fast, and infinite...",API
196250259,aws-samples/amazon-ec2-global-dashboard,Monitor how many EC2 instances are running across all regions with a simple dashboard.,# Amazon EC2 Global Dashboard\n\nMonitor how many EC2 instances are running across all regions w...,API
223240917,aws-samples/aws-elemental-conductor-amazon-sns,This split and stitch addon works with Elemental Conductor for acceleration of transcoding.,"# Elemental-SnS\n### Warning: If you don't know what Elemental Conductor does, you probably shou...",API
95912417,awslabs/aws-ec2rescue-linux,Amazon Web Services Elastic Compute Cloud (EC2) Rescue for Linux is a python-based tool that all...,[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/aws-ec2rescue-l...,API
26375672,boto/boto3-sample,"Boto 3 sample application using Amazon Elastic Transcoder, S3, SNS, SQS, and AWS IAM.",=========================\nBoto 3 Sample Application\n=========================\nThis applicatio...,API
198678892,aws-cloudformation/aws-cloudformation-coverage-roadmap,The AWS CloudFormation Public Coverage Roadmap,## CloudFormation Public Coverage Roadmap\n\nThe AWS CloudFormation Public Coverage Roadmap\n\n#...,API
200130789,aws-samples/aws-news-feed-chime-webhook,"Serverless app which picks RSS feed of ""Whats new"" and publishes to an Amazon Chime room using a...","## AWS News Feed for Chime\n\nServerless app which picks RSS feed of ""Whats new"" and publishes t...",API
69492541,awslabs/cloudwatch-api-tracker,This application (in the form of a lambda function) will publish CloudWatch metrics based on API...,# AWS API Usage Tracker\n\nThis application was designed to give customers greater insight into ...,API


### Evaluating a Cloud9 LF Strategy

I see there are several project that are part of the [Cloud9 IDE](https://aws.amazon.com/cloud9/), an open source project of `GENERAL` utility which Amazon acquired. Let's check out a Cloud 9 `LabelingFunction`. 

In [249]:
c9_df = dev_df[dev_df['full_name'].str.contains('c9/')]
c9_df

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30425383,c9/c9.ide.language.html.diff,"The repository for c9.ide.language.html.diff, ...",# c9.ide.language.html.diff\n,GENERAL
30425323,c9/c9.ide.openfiles,"The repository for c9.ide.openfiles, a Cloud9 ...",# c9.ide.openfiles\n,GENERAL
5428660,c9/vfs-local,A VFS implementation for the local file-system.,# VFS Local\n\n[![Build Status](https://secure...,GENERAL
33198322,c9/c9.ide.run.debug.xdebug,Cloud9 debugger plugin for Xdebug,# `c9.ide.run.debug.xdebug`\n\n[Cloud9](https:...,GENERAL
30425296,c9/c9.ide.closeconfirmation,"The repository for c9.ide.closeconfirmation, a...",# c9.ide.closeconfirmation\n,GENERAL
3589061,c9/architect,A simple yet powerful plugin system for large-...,# Architect\n\nArchitect is a simple but power...,GENERAL
30425358,c9/c9.ide.undo,"The repository for c9.ide.undo, a Cloud9 core ...",# c9.ide.undo\n,GENERAL
33680762,c9/c9.automate,"The repository for c9.automate, a Cloud9 core ...",# c9.ide.automate\n,GENERAL
44630167,c9/c9.ide.test.mocha,"The repository for c9.ide.test.mocha, a Cloud9...",# c9.ide.test.mocha\n,GENERAL


In [250]:
c9_df['label'].value_counts()

GENERAL    9
Name: label, dtype: int64

### Writing a Cloud9 `LabelingFunction`

We're getting to be old pros now, so lets write another LF for Cloud9 projects.

In [251]:
@labeling_function()
def cloud9_lf(x):
    return GENERAL if 'c9/' in x.full_name.lower() else ABSTAIN

## Additional `LabelingFunctions`

So far the only form of LF we've introduced is the keyword LF. We'll be introducing more methods of labeling data when we cover Weak and Distant Supervision. For now I'm going to write several more LFs to make the `LabelModel` work.

First we will show longer columns to investigate the READMEs and then we will write a bunch of LFs at once, listing the strategy for each.

In [259]:
# Show more of the README columns
pd.set_option('display.max_colwidth', 600)

stratified_sample(dev_df, ['API', 'GENERAL'], n=[10,20])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
110872803,awsdocs/aws-directory-service-admin-guide,The open source version of the AWS Directory Service docs. You can submit feedback & requests for changes by submitting issues in this repo or by making proposed changes & submitting a pull request.,Amazon Directory Service Docs\n\nThe open source version of the AWS Directory Service docs. You can submit feedback & requests for changes by submitting issues in this repo or by making proposed changes & submitting a pull request.\n\n## License Summary\n\nThe documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.\n\nThe sample code within this documentation is made available under a modified MIT license. See the LICENSE-SAMPLECODE file.\n,API
45854377,aws-samples/reinvent2015-practicaldynamodb,,"#### Introduction\nThis project is used for demonstrating how Amazon DynamoDB could be used together with AWS Lambda to perform real-time and batch analysis of domain specific data. Real-time analysis is done using DynamDB streams as an event source of a Lambda function. Batch processing utilizes the parallel scan Action of DynamoDB to distribute work to Lambda. Although this is a Maven project, AWS Lambda functions cannot be deployed by Maven. It is expected to use Eclipse to deploy the AWS Lambda functions and run the sample code.\n\n#### Prerequisite\n* [Install Eclipse to your computer...",API
111959454,aws-samples/reinvent-2017-deeplens-workshop,A reference Lambda function that predicts image labels for a image using a MXNet built deep learning model,## Reinvent 2017 Deeplens Workshop\n\nA reference Lambda function that predicts image labels for a image using a MXNet built deep learning model\n\n## License Summary\n\nThis sample code is made available under a modified MIT license. See the LICENSE file.\n,API
112677679,aws-samples/aws-rekognition-lex-demo,Demonstrates how to use AWS Lex to control/use AWS Rekognition.,## AWS Rekognition Lex Demo\n\nDemonstrates how to use AWS Lex to control/use AWS Rekognition.\n\n## License\n\nThis library is licensed under the Apache 2.0 License. \n,API
192413997,aws-quickstart/quickstart-dotnet-serverless-cicd,AWS Quick Start Team,"# quickstart-dotnet-serverless-cicd\n## .NET Serverless CI/CD on the AWS Cloud\n\n.NET Framework is a managed execution environment for applications that provides memory management, class libraries, versioning, and other software development tools.\n\nThis Quick Start builds a .NET serverless CI/CD (continuous integration and continuous delivery) environment on the Amazon Web Services (AWS) Cloud to provide a pipeline for .NET Framework workloads. It can perform the following functions:\n\n- Fetch the latest source code and save it to a source artifact store.\n- Automatically build the app...",API
14697005,awslabs/amazon-kinesis-connectors,,"# Amazon Kinesis Connector Library\n\nThe **Amazon Kinesis Connector Library** helps Java developers integrate [Amazon Kinesis][aws-kinesis] with other AWS and non-AWS services. The current version of the library provides connectors for [Amazon DynamoDB][aws-dynamodb], [Amazon Redshift][aws-redshift], [Amazon S3][aws-s3], [Elasticsearch][Elasticsearch]. The library also includes [sample connectors](#samples) of each type, plus Apache Ant build files for running the samples.\n\n## Requirements\n\n + **Amazon Kinesis Client Library**: In order to use the Amazon Kinesis Connector Library, you...",API
224273355,aws-samples/serverless-refarch-for-proxysql,AWS Serverless Reference Architecture for ProxySQL,"# Serverless Reference Architecture for ProxySQL\n\nThis reference architecture aims to build a serverless connection pooling adapter with proxysql on AWS Fargate and help AWS Lambda better connects to RDS for MySQL or Aurora databases.\n\n![](images/overview.png)\n\n\n\n# What is ProxySQL\n\n[ProxySQL](https://github.com/sysown/proxysql) is a high performance, high availability, protocol aware proxy for MySQL and forks (like Percona Server and MariaDB). It's a common solution to split SQL reads and writes on Amazon Aurora clusters or RDS for MySQL clusters*.\n\n**How to use ProxySQL with...",API
220111204,aws-samples/alexa-skill-with-sap-data-and-scp,This repository showcases a sample architecture for building an Alexa Skill with SAP data using SAP Cloud Platform as the integration layer.,,API
133736604,aws-samples/aws-iot-device-defender-agent-sdk-python,"Example implementation of a Device Defender metrics collection agent, and other Device Defender Python samples.","##########################################\nAWS IoT Device Defender Agent SDK (Python)\n##########################################\n\nExample implementation of an AWS IoT Device Defender metrics collection agent,\nand other Device Defender Python samples.\n\nThe provided sample agent can be used as a basis to implement a custom metrics collection agent.\n\n\n*************\nPrerequisites\n*************\n\nMinimum System Requirements\n===========================\n\nThe Following requirements are shared with the `AWS IoT Device SDK for Python <https://github.com/aws/aws-iot-device-sdk-python>...",API
36835292,amzn/amazon-instant-access-sdk-java,Java SDK to aid in 3p integration with Instant Access,"# Amazon-instant-access-sdk-java\n\n\n## Installation Guide\n1. Download the zip file on GitHub:\n ```bash\n git clone https://github.com/amzn/amazon-instant-access-sdk-java.git\n ```\n2. Import the SDK project as a Maven project in your IDE.\n3. Implement the servlets required for subscriptions or one time purchase.\n4. Run Junit tests to ensure everything is working correctly.\n\n#Example Implementation of AccountLinkingServlet\nThe following example is available under the examples.servlet package:\n```java\n/*\n * Copyright 2016-2019 Amazon.com, Inc. or its affiliates. All Righ...",API


In [267]:
@labeling_function()
def alexa_lf(x):
    """If it has 'alexa' in the full name it is probably an Alexa skill, an API project"""
    return API if 'alexa' in x.full_name.lower() else ABSTAIN


@labeling_function()
def api_lf(x):
    """If it has 'api' in the name it is probably an API project"""
    return API if 'api' in x.full_name.lower() else ABSTAIN


@labeling_function()
def walkthrough_lf(x):
    """If it has 'walkthrough' in the full name or description, it is an example of an API project"""
    return API if ('walkthrough' in x.full_name.lower() or 'walkthrough' in x.description.lower()) else ABSTAIN


@labeling_function()
def skill_lf(x):
    """If it has 'skill' in the full name or description, it is probably an Alexa skill"""
    return API if ('skill' in x.full_name.lower() or 'skill' in x.description.lower()) else ABSTAIN


@labeling_function()
def kit_lf(x):
    """If 'kit' in the description, it is probably an API project"""
    return API if 'skill' in x.description.lower() else ABSTAIN


@labeling_function()
def ext_desc_lf(x):
    """If 'extension' appears in the description, it is probably an API project"""
    return API if 'extension' in x.description.lower() else ABSTAIN


@labeling_function()
def ext_readme_lf(x):
    """If 'extension' appears in the readme, it is probably an API project"""
    return API if 'extension' in x.description.lower() else ABSTAIN


@labeling_function()
def aws_name_lf(x):
    """IF 'aws' appears in the name it is probably an API project"""
    return API if 'aws' in x.full_name.lower() else ABSTAIN


@labeling_function()
def aws_description_lf(x):
    """IF 'aws' appears in the description it is probably an API project"""
    return API if 'aws' in x.description.lower() else ABSTAIN


@labeling_function()
def aws_readme_lf(x):
    """IF 'aws' appears in the readme it is probably an API project"""
    return API if 'aws' in x.readme.lower() else ABSTAIN


@labeling_function()
def integrate_desc_lf(x):
    """If 'integrate' or 'integration' are in the description it is probably an API project"""
    return API if ('integrate' in x.description.lower() or 'integration' in x.description.lower()) else ABSTAIN


@labeling_function()
def integrate_readme_lf(x):
    """If 'integrate' or 'integration' are in the description it is probably an API project"""
    return API if ('integrate' in x.readme.lower() or 'integration' in x.readme.lower()) else ABSTAIN


@labeling_function()
def dataset_lf(x):
    """If 'dataset' is in the description, it is probably a GENERAL academic contribution"""
    return API if ('dataset' in x.description.lower() or 'dataset' in x.readme.lower()) else ABSTAIN


@labeling_function()
def demo_name_lf(x):
    """If 'demo' appears in the full_name it is probably an API example"""
    return API if 'demo' in x.full_name.lower() else ABSTAIN


@labeling_function()
def demo_desc_lf(x):
    """If 'demo' appears in the description it is probably an API example"""
    return API if 'demo' in x.description.lower() else ABSTAIN


@labeling_function()
def demo_readme_lf(x):
    """If 'demo' appears in the readme it is probably an API example"""
    return API if 'demo' in x.readme.lower() else ABSTAIN


@labeling_function()
def ajax_lf(x):
    """If 'ajaxorg' appears in the full name it is probably a GENERAL utility"""
    return GENERAL if 'ajaxorg' in x.full_name.lower() else ABSTAIN

In [268]:
lfs = [
    sdk_lf,
    ion_lf,
    cloud9_lf,
    api_lf,
    walkthrough_lf,
    skill_lf,
    kit_lf,
    ext_desc_lf,
    ext_readme_lf,
    aws_name_lf,
    aws_description_lf,
    aws_readme_lf,
    integrate_desc_lf,
    integrate_readme_lf,
    dataset_lf,s
    demo_name_lf,
    demo_desc_lf,
    demo_readme_lf,
    ajax_lf,
]

# Create and apply a new Pandas 
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel
100%|██████████| 222/222 [00:00<00:00, 3466.88it/s]
100%|██████████| 414/414 [00:00<00:00, 3682.81it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.045045,0.045045,0.0,10,0,1.0
ion_lf,1,[0],0.009009,0.0,0.0,2,0,1.0
cloud9_lf,2,[0],0.040541,0.0,0.0,9,0,1.0
api_lf,3,[1],0.018018,0.013514,0.0,4,0,1.0
walkthrough_lf,4,[1],0.004505,0.004505,0.0,1,0,1.0
skill_lf,5,[1],0.045045,0.045045,0.0,10,0,1.0
kit_lf,6,[1],0.040541,0.040541,0.0,9,0,1.0
ext_desc_lf,7,[1],0.013514,0.013514,0.0,3,0,1.0
ext_readme_lf,8,[1],0.013514,0.013514,0.0,3,0,1.0
aws_name_lf,9,[1],0.824324,0.747748,0.0,174,9,0.95082


In [None]:
def keyword_lookup(x, keywords, label):
    """Lookup a keyword in a """
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

In [None]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()

In [None]:
utils.markdown_to_text(df['readme'].iloc[0])
utils.markdown_to_code(df['readme'].iloc[0])

In [None]:
import io
import re

from bs4 import BeautifulSoup
from markdown import markdown


def markdown_to_code(markdown_text):
    """Extract source code from Markdown snippets"""
    code_blocks = []
    code_snippets = [] # These get a single block

    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            code_block = [f.readline()]
            while re.search("```", code_block[-1]) is None:
                code_block.append(f.readline())
            code_blocks.append("".join(code_block[:-1]))
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')
                    code_snippets.append(group)
    
    # Now combine all snippets into one code block
    code_blocks.append(' '.join(code_snippets))
    
    return '\n'.join(code_blocks)


def markdown_to_text(markdown_text):
    """Extract plaintext - minus the code snippets - from Markdown"""
    text_blocks = []
    f = io.StringIO(markdown_text)
    i = 0
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            print('is_block')
            first_record = f.readline()
            second_record = f.readline()
            print(f'first_record: {first_record}')
            print(f'second_record: {second_record}')
            code_block = [first_record]
            while re.search("```", code_block[-1]) is None:
                print('inside_block')
                f.readline()
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')

            text_blocks.append(line)
        i += 1
    
    md = ''.join(text_blocks)
    html = markdown(md)
    soup = BeautifulSoup(html, 'lxml')
    text = soup.find_all(text=True)
    out_text = []
    for text in text:
        if text == '\n':
            pass
        else:
            out_text.append(text)
    return out_text

print(df['readme'].iloc[6][1204:-1])

markdown_to_text(df['readme'].iloc[6])