# Chapter 3: Introducing Snorkel

In this chapter I will introduce [Snorkel](http://snorkel.org), which we'll use throughout the book. [Snorkel](https://www.snorkel.org/) is a software project ([github](https://github.com/snorkel-team/snorkel)) originally from the Hazy Research group at Stanford University enabling the practice of *weak supervision*, *distant supervision*, *data augmentation* and *data slicing*.

The project has an excellent [Get Started](https://www.snorkel.org/get-started/) page, and I recommend you spend some time working the [tutorials](https://github.com/snorkel-team/snorkel-tutorials) before proceeding beyond this chapter. 

Snorkel implements an unsupervised generative model that accepts a matrix of weak labels for records in your training data and produces strong labels by learning the relationships between these weak labels through matrix factorization.

In [466]:
import random
import sys
import warnings

sys.path.append("..")
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import pyarrow

from lib import utils


# Have the notebook span the screen
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Reset column width
pd.set_option('display.max_colwidth', 200)

# Make randomness reproducible
random.seed(31337)
np.random.seed(31337)

## Example Project: Labeling Amazon Github Repositories

I have previously hand labeled about 2,600 Github repositories belonging to Amazon and its subsidiariesinto categories related to their purpose. We're going to use this dataset to introduce Snorkel.

### Hand Labeling this Data

In order to get a ground truth dataset against which to benchmark our Snorkel labeling, I hand labeled all Amazon Github projects in [this sheet](https://docs.google.com/spreadsheets/d/1wiesQSde5LwWV_vpMFQh24Lqx5Mr3VG7fk_e6yht0jU/edit?usp=sharing). The label categories are:

| Number | Code      | Description                          |
|--------|-----------|--------------------------------------|
| 0      | GENERAL   | A FOSS project of general utility    |
| 1      | API       | API library for AWS / Amazon product |
| 2      | RESEARCH  | A research paper and/or dataset      |
| 3      | DEAD      | Project is dead, no longer useful    |
| 3      | OTHER     | Uncertainty... what is this thing?   |

If you want to make corrections, please open the sheet, click on `File --> Make a Copy`, make any edits and then share the sheet with me.

In [369]:
readme_df = pd.read_parquet('../data/aws_github.parquet', engine='pyarrow')

readme_df = readme_df.sample(frac=1)

readme_df = readme_df.drop('html_url', axis=1)

readme_df = readme_df.fillna('')

readme_df.head()

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5528710,boto/bwclient-js,Botoweb JS client,,API
235887328,aws-samples/aws-sdk-js-tests,,,API
85223732,aws-samples/ecs-mxnet-example,An example project to deploy MXNet inference API with Docker on Amazon ECS. Uses CodePipeline and CodeBuild to build the image to deploy to ECS.,"## Deploy a MXNet predict function to Amazon ECS using CodeCommit and CodePipeline\n\nThis project will create an automated workflow that will provision, configure and orchestrate a pipeline trigg...",API
114181725,awsdocs/aws-certificate-user-guide,The open source version of the AWS Certificate Manager user guide.,## AWS Certificate User Guide\n\nThe open source version of the AWS Certificate Manager user guide.\n\n## License Summary\n\nThe documentation is made available under the Creative Commons Attribut...,API
79949815,amazon-archives/golang-deployment-pipeline,"An example of infrastructure and application CI/CD with AWS CodePipeline, AWS CodeBuild, AWS CloudFormation and AWS CodeDeploy","# Deployment Pipeline for Go Applications on AWS\n\n![pipeline-screenshot](images/pipeline-screenshot.png)\n\nThis repository provides an easy-to-deploy pipeline for the development, testing, buil...",API


## Profile the Data

Let's take a quick look at the labels to see what we'll be classifying.

In [370]:
print(f'Total records: {len(readme_df.index):,}')

readme_df['label'].value_counts()

Total records: 2,568


API         2265
GENERAL      279
DEAD          14
RESEARCH       9
OTHER          1
Name: label, dtype: int64

### How much general utility do Amazon's Github projects have?

One question that occurs to me to ask is - how much general utility do Amazon's Github projects have? Let's look at the number of `GENERAL` purpose compared to the number of `API` projects.

In [371]:
api_count     = readme_df[readme_df['label'] == 'API'].count(axis='index')['full_name']
general_count = readme_df[readme_df['label'] == 'GENERAL'].count(axis='index')['full_name']

general_pct = 100 * (general_count / (api_count + general_count))
api_pct     = 100 * (api_count / (api_count + general_count))

print(f'Percentage of projects having general utility:   {general_pct:,.3f}%')
print(f'Percentage of projects for Amazon products/APIs: {api_pct:,.3f}%')

Percentage of projects having general utility:   10.967%
Percentage of projects for Amazon products/APIs: 89.033%


### Simplify to `API` vs `GENERAL`

We throw out `DEAD`, `RESEARCH` and `OTHER` to focus on `API` vs `GENERAL` - is an open source project of general utility or is it a client to a company's commercial products? Highly imabalanced classes are hard to deal with when building a classifier, and 1:9 for `GENERAL`:`API` is bad enough.

In [372]:
df = readme_df[readme_df['label'].isin(['API', 'GENERAL'])]

print(f'Total records with API/GENERAL labels: {len(df.index):,}')

df.head()

Total records with API/GENERAL labels: 2,544


Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5528710,boto/bwclient-js,Botoweb JS client,,API
235887328,aws-samples/aws-sdk-js-tests,,,API
85223732,aws-samples/ecs-mxnet-example,An example project to deploy MXNet inference API with Docker on Amazon ECS. Uses CodePipeline and CodeBuild to build the image to deploy to ECS.,"## Deploy a MXNet predict function to Amazon ECS using CodeCommit and CodePipeline\n\nThis project will create an automated workflow that will provision, configure and orchestrate a pipeline trigg...",API
114181725,awsdocs/aws-certificate-user-guide,The open source version of the AWS Certificate Manager user guide.,## AWS Certificate User Guide\n\nThe open source version of the AWS Certificate Manager user guide.\n\n## License Summary\n\nThe documentation is made available under the Creative Commons Attribut...,API
79949815,amazon-archives/golang-deployment-pipeline,"An example of infrastructure and application CI/CD with AWS CodePipeline, AWS CodeBuild, AWS CloudFormation and AWS CodeDeploy","# Deployment Pipeline for Go Applications on AWS\n\n![pipeline-screenshot](images/pipeline-screenshot.png)\n\nThis repository provides an easy-to-deploy pipeline for the development, testing, buil...",API


### Split our Data into Training and Validation Data

In order to demonstrate Snorkel's capabilities, we need to create an experiment by splitting our data into three datasets:

* A hand labeled development dataset `dev_df` we will use to determine if our LFs work
* An unlabeled training dataset `train_df` that Snorkel's LabelModel will use to learn the labels
* A hand labeled test dataset `test_df` used to validate that the discriminative model we train with our labeled data works

The point of Snorkel is that you don't need labels - so we won't be using labels with the training dataset, `train_df`. Therefore we delete that variable to keep ourselves honest :) We also keep the development dataset `dev_df` small to demonstrate that you only need to label a small amount of representative data.

Once we've prepared our three dataset splits, because the labeled dev dataset `dev_df` is small, we run a value count for each of its labels to verify we have an adequate number of each label. It looks like we have around ten, which will do. People use Snorkel without any labels at all but at least ten of each label is very helpful in evaluating the performance, as we code, of the data programs we'll be writing to label data/

In [373]:
from sklearn.model_selection import train_test_split

# First split into a dev/train dataset we'll split next and a test dataset for our final model
dev_train_df, test_df, train_labels, test_labels = train_test_split(
    df,
    df['label'],
    test_size=0.75
)

# Then split the dev/train data to create a small labeled dev dataset and a larger unlabeled training dataset
dev_df, train_df, dev_labels, train_labels = train_test_split(
    dev_train_df,
    dev_train_df['label'],
    test_size=0.65
)

# Make sure our split of records makes sense
print(f'Total dev records:   {len(dev_df.index):,}')
print(f'Total train records: {len(train_df.index):,}')
print(f'Total test records:  {len(test_df.index):,}')

# Remove the training data labels - normally we would not have labeled these yet - this is why we're using Snorkel!
del train_labels

# Count labels in the dev set
dev_labels.value_counts(), test_labels.value_counts()

Total dev records:   222
Total train records: 414
Total test records:  1,908


(API        206
 GENERAL     16
 Name: label, dtype: int64,
 API        1683
 GENERAL     225
 Name: label, dtype: int64)

## Working with Snorkel

Snorkel has three primary programming interfaces: Labeling Functions, Transformation Functions and Slicing Functions.

<img 
     alt="Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions"
     src="images/snorkel_apis_0.9.5.png"
     width="500px"
/>
<div align="center">Snorkel Programming Interface: Labeling Functions, Transformation Functions and Slicing Functions, from <a href="https://www.snorkel.org/">Snorkel.org</a></div>

### Labeling Functions (LFs)

A labeling function is a deterministic function used to label data as belonging to one class or another. They produce weak labels that in combination, through Snorkel’s generative models, can be used to generate strong labels for unlabeled data.

The [Snorkel paper](https://arxiv.org/pdf/1711.10160.pdf) explains that LFs are open ended, that is that they can leverage information from multiple sources - both inside and outside the record. For example LFs can operate over different parts of the input document, working with document metadata, entire texts, individual paragraphs, sentences or words, parts of speech, named entities extracted by preprocessors, text embeddings or any augmentation of the record whatsoever. They can simultaneously leverage external databases and rules through *distant supervision*. These might include vocabulary for keyword searches, heuristics defined by theoretical considerations or equations, 

For example, a preprocessor might run a text document through a language model such as the included `SpacyPreprocessor` to run Named Entity Resolution (NER) and then look for words queried from WikiData that correspond to a given class. There are many ways to write LFs. We’ll define a broad taxonomy and then demonstrate some techniques from each.

The program interface for Labeling Functions is [`snorkel.labeling.LabelingFunction`](https://snorkel.readthedocs.io/en/v0.9.5/packages/_autosummary/labeling/snorkel.labeling.LabelingFunction.html#snorkel.labeling.LabelingFunction). They are instantiated with a name, a function reference, any resources the function needs and a list of any preprocessors to run on the data records before the labeling function runs.

<img alt="LabelingFunction API" src="images/labeling_function_api.png" width="600" />

### Defining Labeling Schema

In order to write our first labeling function, we need to define the label schema for our problem. The first label in any labeling schema is `-1` for `ABSTAIN`, which means "cast no vote" about the class of the record. This allows Snorkel Labeling Functions to vote only when they are certain, and is critical to how the system works since labeling functions have to perform better than random when they do vote or the Label Model won't work well.

The labels for this analysis are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |

In [374]:
# Define our numeric labels as integers
ABSTAIN = -1
GENERAL = 0
API     = 1


def map_labels(x):
    """Map string labels to integers"""
    if x == 'API':
        return API
    if x == 'GENERAL':
        return GENERAL


dev_labels    =   dev_labels.apply(map_labels, convert_dtype=True)
test_labels   =  test_labels.apply(map_labels, convert_dtype=True)

dev_labels.shape, test_labels.shape

((222,), (1908,))

### Writing our First Labeling Function

In order to write a labeling function, we must describe our data to associate a property with a certain class of records that can be programmed as a heuristic. Let's inspect some of our records. The classes are imbalanced 9:1, so lets pull a stratified sample of both labels.

Look at the data table produced by the records below and try to eyeball any patterns among the `API` and the `GENERAL` records. Do you see any markers for `API` records or `GENERAL` records?

In [375]:
# Set Pandas to display more than 10 rows
pd.set_option('display.max_rows', 100)

api_df     = dev_df[dev_df['label'] ==     'API'].sample(frac=1).head(20).sort_values(by='label')
general_df = dev_df[dev_df['label'] == 'GENERAL'].sample(frac=1).head(10).sort_values(by='label')

api_df.append(general_df).head(30)

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
69507910,aws-quickstart/quickstart-github-enterprise,AWS Quick Start Team,## GitHub Enterprise on the AWS Cloud\nAWS provides a comprehensive set of services and tools for deploying Microsoft Windows-based workloads on its highly reliable and secure cloud infrastructure...,API
121416098,amazon-archives/aws-fargate-workshop,Running Containers on AWS Fargate,## AWS Fargate Workshop\n\nRunning Containers on AWS Fargate\n\n## License\n\nThis library is licensed under the Apache 2.0 License.\n,API
112677679,aws-samples/aws-rekognition-lex-demo,Demonstrates how to use AWS Lex to control/use AWS Rekognition.,## AWS Rekognition Lex Demo\n\nDemonstrates how to use AWS Lex to control/use AWS Rekognition.\n\n## License\n\nThis library is licensed under the Apache 2.0 License. \n,API
216632467,aws-samples/amazon-sagemaker-custom-container,,# Bring your own training-completed model with SageMaker by building a custom container\n\n\n[Amazon SageMaker](https://aws.amazon.com/sagemaker/) provides every developer and data scientist with ...,API
120348783,awsdocs/aws-elemental-medialive-user-guide,The open source version of the AWS Elemental MediaConvert user guide,## AWS Elemental Medialive User Guide\n\nThe open source version of the AWS Elemental MediaConvert user guide\n\n## License Summary\n\nThe documentation is made available under the Creative Common...,API
118665838,aws-samples/step-functions-ruby-activity-worker,The code contains a example implementation of Step Functions activity worker written in Ruby,## Step Functions Ruby Activity Worker\n\nThe code contains a example implementation of Step Functions activity worker written in Ruby.\n\n## Summary\n\nThis package defines the Step Functions ref...,API
99021933,aws-quickstart/quickstart-datalake-cognizant-talend,AWS Quick Start Team,"# quickstart-datalake-cognizant-talend\n## Data Lake on the AWS Cloud with Talend Big Data Platform, AWS Services, and Cognizant Best Practices\n\n\nThis Quick Start builds a data lake environment...",API
43578604,aws/aws-sdk-php-v3-bridge,A compatibility pack for services no longer supported in V3 of the AWS SDK for PHP,# AWS SDK for PHP - Version 3 Upgrade Bridge\n\n[![@awsforphp on Twitter](http://img.shields.io/badge/twitter-%40awsforphp-blue.svg?style=flat)](https://twitter.com/awsforphp)\n[![Build Status](ht...,API
8207846,aws-samples/opsworks-demo-php-photo-share-app,A sample PHP application for running on AWS OpsWorks,# AWS OpsWorks PHP Demo App - Photo Share\n\nDirections on how to launch this sample app on AWS OpsWorks can be found in the article: [Walkthrough: Deploying a\nPHP application that leverages the ...,API
157290496,aws-samples/aws-waf-embargoed-countries-ofac,The article provides a push-button solution to protect your infrastructure against incoming traffic from embargoed countries as defined by OFAC,## How to use AWS WAF to filter incoming traffic from embargoed countries\n\nThis project provides you with an automated solution that applies geography-based IP (GeoIP) restrictions based on a de...,API


### Detecting Patterns

In looking at the `full_name` and `html_url`, it looks like projects with `sdk` in the title are `API` projects. Lets filter down to those records to see.

In [376]:
sdk_df = dev_df[dev_df['full_name'].str.contains('sdk')]

print(f'Total SDK records: {len(sdk_df.index)}')

sdk_df.groupby('label').count()['full_name']

Total SDK records: 7


label
API    7
Name: full_name, dtype: int64

## Building an SDK Labeling Function

There is an 15:1 `API`:`GENERAL` ratio of labels among records with `sdk` in their full_name. This is more than good enough for a Labeling Function (LF), since they only have to be better than random! Cool, eh? Don't worry, the `LabelModel` will figure out which signal from which LF to use :) It's like magic!

This is called a keyword labeling function, the simplest type. Despite their simplicity, keyword LFs are incredibly powerful ways to inject subject matter expertise into a project. In the chapter on Weak Supervision, we'll get into the various types of LFs and the strategies researchers and Snorkel users have come up with for labeling data. For now we'll create this and a couple of other LFs and see where that gets us.

In [377]:
# The verbosse way to define an LF
from snorkel.labeling import LabelingFunction


sdk_lf = LabelingFunction(
    name="sdk_lf",
    f=lambda x: API if 'sdk' in x.full_name.lower() else ABSTAIN,
)

print(sdk_lf)


# The short form way to define an LF
from snorkel.labeling import labeling_function


@labeling_function()
def sdk_lf(x):
    return API if 'sdk' in x.full_name.lower() else ABSTAIN

print(sdk_lf)

LabelingFunction sdk_lf, Preprocessors: []
LabelingFunction sdk_lf, Preprocessors: []


## Testing our `LabelingFunction`

Snorkel comes with tools to help you run your LFs on your dataset to see how they perform. We're using Pandas, so we use [`snorkel.labeling.PandasLFApplier`](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.PandasLFApplier.html) to apply our list of label functions (in this case just one) to the hand-labeled development dataset `dev_df` and the unlabeled training dataset `train_df`. Note that there are also `LFAppliers` for [Dask](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.dask.DaskLFApplier.html) and [PySpark](https://snorkel.readthedocs.io/en/latest/packages/_autosummary/labeling/snorkel.labeling.apply.spark.SparkLFApplier.html#snorkel.labeling.apply.spark.SparkLFApplier). This 

In [378]:
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier


lfs = [sdk_lf]

# Instantiate our LF applier with our list of LabelFunctions (just one for now)
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

  from pandas import Panel


100%|██████████| 222/222 [00:00<00:00, 36553.82it/s]


100%|██████████| 414/414 [00:00<00:00, 54715.21it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.031532,0.0,0.0,7,0,1.0


In [379]:
# Run the same LF analysis on the unlabeled training data, accuracy yet unknown
LFAnalysis(L=L_train,  lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
sdk_lf,0,[1],0.055556,0.0,0.0


## Interpreting the `LFAnalysis` Summary

Looking at the tables above coverage of our first LF is about 6%, which means that it abstains by voting `ABSTAIN`/`-1` 94% of the time. In practice we need enough `LabelingFunctions` to cover more of the data than this and we must also write at least one LF per unique tag. Now that we've got an LF for `API`, let's write one for `GENERAL`.

## Writing Another `LabelingFunction`

We need more than just one vote to accurately label our data, so now we're going to inspect the data again and arrive at several more LFs - data programs - to label the data as either `API` or `GENERAL`.

### Inspecting the Development Data

To begin, let's write a function to perform the operation we did above to create a DataFrame showing a mix of `API` and `GENERAL` labels to get a sense of the difference between them. This is the point at which we are injection domain expertise as a form of supervision. Convenient this is about software, as we are the domain experts :)

In [467]:
# Write a function to pull a stratified sample (one with non-random, prederminted proportions on a field's value)
def stratified_sample(df, labels, n=[20, 10], sorted=False):
    """Given two pd.DataFrames, their labels and desired ratios, acreate a stratified sample and display n records, optionaly sorted"""
    a_sample_df = df[df['label'] == labels[0]].sample(frac=1).head(n[0]).sort_values(by='label')
    b_sample_df = df[df['label'] == labels[1]].sample(frac=1).head(n[1]).sort_values(by='label')
    
    # Combine the two samples
    out_df = a_sample_df.append(b_sample_df).head(sum(n))
    
    # Optionally sort by the full_name to see groupings to infer LFs
    if sorted:
        out_df.sort_values('full_name', axis=0, inplace=True)
    
    return out_df



stratified_sample(dev_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
43081197,aws-samples/reinvent2015-dev309,"Code examples for the AWS re:Invent 2015 presentation ""Large Scale Metrics Analysis in Ruby""","# Sample Code for ""Large Scale Metrics Analysis in Ruby""\n\nThis repository contains annotated example code to accompany the AWS re:Invent\n2015 presentation DEV309: ""Large Scale Metrics Analysis ...",API
154230305,aws/aws-secretsmanager-caching-python,The AWS Secrets Manager Python caching client enables in-process caching of secrets for Python applications.,## AWS Secrets Manager Python caching client\n\nThe AWS Secrets Manager Python caching client enables in-process caching of secrets for Python applications.\n\n## Getting Started\n\n### Required P...,API
172770379,awslabs/aws-config-resource-schema,AWS Config resource schema define the properties and types of AWS Config resource configuration items (CIs). Resource CI schema are used by developers when performing advanced resource queries and...,## AWS Config Resource Schema\n\nAWS Config resource property files define the properties and types of the AWS Config resource configuration items (CIs) that are searchable using the `SelectResour...,API
111019154,amazon-archives/aws-servicebroker-s3,AWS Service Broker deployment module for Amazon Simple Storage Service,# Amazon S3 for the AWS Service Broker\n\nThis project has been archived and merged into the [aws-servicebroker](https://github.com/awslabs/aws-servicebroker/) repository.\n\n## License\n\nThis li...,API
126059403,awslabs/aws-serverless-financial-functions,Contains a collection of serverless apps that wrap common financial functions as AWS Lambda functions,## AWS Serverless Financial Functions\n\nThis is a collection of serverless apps that wrap common financial functions in AWS Lambda functions. The financial functions' names and interfaces are ide...,API
114183120,aws/efs-utils,Utilities for Amazon Elastic File System (EFS),# efs-utils\n\nUtilities for Amazon Elastic File System (EFS)\n\nThe `efs-utils` package has been verified against the following Linux distributions:\n\n| Distribution | Package Type | `init` Syst...,API
189488985,awsdocs/aws-iot-things-graph-user-guide,The open source version of the AWS IoT Things Graph docs. You can submit feedback & requests for changes by submitting issues in this repo or by making proposed changes & submitting a pull request.,## AWS IoT Things Graph User Guide\n\nThe open source version of the AWS IoT Things Graph docs. You can submit feedback & requests for changes by submitting issues in this repo or by making propos...,API
47997949,aws-samples/lambda-apigateway-twilio-tutorial,Getting started with AWS Lambda + Amazon API Gateway. Use Twilio MMS to upload photos to S3 without servers.,#Lambda + API Gateway Example \n\nThis example uses [Twilio](https://www.twilio.com/) to save an image from your mobile phone to the AWS cloud. A user sends an image using MMS to a Twilio phone n...,API
148227812,aws-samples/aws-ai-ml-workshop-kr,A collection of localized (Korean) AWS AI/ML workshop materials for hands-on labs.,## AWS AI/ML Workshop - Korea\n\nA collection of localized (Korean) AWS AI/ML workshop materials for hands-on labs. \n\n## Directory Structure\n\nHands-on materials wiil get enriched over time as ...,API
137248633,aws-samples/aws-serverless-appsync-app,This workshop shows you how to build a Web Application that demonstrates how easy it is to create data driven web applications all with no servers. You will build a serverless web application that...,"# Serverless Web Application with AppSync Workshop\n\n<a href=""https://www.youtube.com/watch?v=sQN28Jo-nak"" target=""_blank""><img src=""images/twitch.png"" align=""center"" width=""500"" alt=""Serverless ...",API


### Creating an Ion `LabelingFunction`

I notice that there are two projects labeled `GENERAL` that have the word "ion" in their project name. I happen to know that Ion is Amazon's storage format for complex data, and that it is a project with general utility. 

#### Investingating the "ion"/`GENERAL` Pattern

Let's investigate and if it pans out we'll write another LF. 

In [468]:
dev_df[dev_df['full_name'].str.contains('ion')]

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
125903080,twitchdev/extensions-hello-world,The Simplest Extension in the (Hello) World,# Extensions-Hello-World\nThe Simplest Extension in the (Hello) World.\n\n## Motivation\nThe Hello World sample is designed to get you started building a Twitch Extension quickly. It contains all ...,API
152624998,aws-cloudformation/cloudformation-cli-java-plugin,The CloudFormation Provider Development Toolkit Java Plugin allows you to autogenerate java code based on an input schema.,## AWS CloudFormation Resource Provider Java Plugin\n\nThe CloudFormation CLI (cfn) allows you to author your own resource providers that can be used by CloudFormation.\n\nThis plugin library help...,API
225431200,aws-samples/aws-reinvent-2019-builders-session-opn215,,# Intelligent Automation with AWS and Snort IDS\r\n\r\n## Description\r\nThis project demonstrates some of the ways to can add value to your existing Snort IDS system by integrating it with AWS.\r...,API
78151622,awslabs/serverless-photo-recognition,A collection of 3 lambda functions that are invoked by Amazon S3 or Amazon API Gateway to analyze uploaded images with Amazon Rekognition and save picture labels to ElasticSearch (written in Kotlin),\n#### Creator: Vladimir Budilov\n* [LinkedIn](https://www.linkedin.com/in/vbudilov/)\n* [Medium](https://medium.com/@budilov)\n\nServerless Photo Recognition\n====================================...,API
124436956,awsdocs/amazon-chime-administration-guide,"The open source version of the Amazon Chime Administration Guide. To submit feedback or requests for changes, submit an issue or make changes and submit a pull request.","## Amazon Chime Administration Guide\n\nThe open source version of the Amazon Chime Administration Guide. To submit feedback or requests for changes, submit an issue or make changes and submit a p...",API
114841782,awsdocs/amazon-migrationhub-user-guide,The open source version of the Amazon Migration Hub docs.,## Amazon Migrationhub User Guide\n\nThe open source version of the Amazon Migration Hub docs. \n\n## License Summary\n\nThe documentation is made available under the Creative Commons Attribution-...,API
134776216,aws-samples/aws-media-services-vod-automation,Sample code and CloudFormation scripts for automating Video on Demand workflows on AWS,# VOD Automation Toolkit\n\nThis project contains examples for automating Video On Demand (VOD) workflows on AWS. These are code samples to get you started on common tasks rather than an end to e...,API
155783103,awslabs/machine-learning-for-telecommunications,"A base solution that helps to generate insights from their data. The solution provides a framework for an end-to-end machine learning process including ad-hoc data exploration, data processing and...",# AWS Machine Learning for All\n\nMachine Learning for All is a solution that helps data scientists in the industry get started using machine learning to generate insights from their data. The sol...,API
222581713,aws-samples/aws-multi-region-bc-dr-workshop,,# Mythical Mysfits: Building Multi-Region Applications that Align with BC/DR Objectives\n\n## Overview\n![mysfits-welcome](/images/mysfits-welcome.png)\n\n**Mythical Mysfits** is a (fictional) pet...,API
110870783,awsdocs/aws-encryption-sdk-docs,"Explains how to use the AWS Encryption SDK, a library that enables secure client-side encryption. The Encryption SDK uses cryptography best practices to protect your data and the encryption keys u...",# AWS Encryption SDK Developer Guide\n\nThis repository contains the open source version of the [AWS Encryption SDK Developer\nGuide](https://docs.aws.amazon.com/encryption-sdk/latest/developer-gu...,API


#### Iterating on our Pattern

Ah, it looks like "ion" isn't good enough, as it is picking up lots of other words with "ion" in them. Lets try "/ion" since the examples we can see have that pattern"

In [469]:
dev_df[dev_df['full_name'].str.contains('/ion')]

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
43976244,amzn/ion-java,Java streaming parser/serializer for Ion.,# Amazon Ion Java\nA Java implementation of the [Ion data notation](http://amzn.github.io/ion-docs).\n\n[![Build Status](https://travis-ci.org/amzn/ion-java.svg?branch=master)](https://travis-ci.o...,GENERAL


Looks good! While 3:0 is not overwhelming support I happen to know there are many Ion projects and it is likely they mostly follow this pattern. Remember, `LabelingFunctions` don't have to be perfect - they just have to perform better than random. The magic of Snorkel's `LabelModel` is that it is unsupervised and models the interactions between LFs as a generative, graphical model it then uses to predict strong labels. When combined, these LFs give the model enough signal work do its job, turning multiple weak labels into one strong label.

### Writing the Ion Labeling Function

Now that we have the pattern, we can write another keyword LF.

In [470]:
@labeling_function()
def ion_lf(x):
    return GENERAL if '/ion' in x.full_name.lower() else ABSTAIN


# Update our list of LFs to include this one
lfs = [sdk_lf, ion_lf]

# Create and apply a new Pandas 
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)



100%|██████████| 222/222 [00:00<00:00, 33679.44it/s]


100%|██████████| 414/414 [00:00<00:00, 35393.53it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.031532,0.0,0.0,7,0,1.0
ion_lf,1,[0],0.004505,0.0,0.0,1,0,1.0


In [471]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
sdk_lf,0,[1],0.055556,0.0,0.0
ion_lf,1,[0],0.009662,0.0,0.0


### Evaluating the LF Analysis

This LF works but has low coverage. We'll have to do better in terms of coverage if we're going to do a good job labeling `GENERAL` projects!

### Writing Another `LabelingFunction`

Again let's inspect the data and look what pops out.

In [472]:
stratified_sample(dev_df, ['API', 'GENERAL'])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
128856142,aws-quickstart/quickstart-titian-mosaic,AWS Quick Start Team,"# quickstart-titian-mosaic\n## Titian Mosaic FreezerManagement on the AWS Cloud\n\n\nThis Quick Start deploys Mosaic FreezerManagement, a comprehensive and cost-effective software solution for man...",API
167454167,aws-samples/alexa-skill-authentication,This Alexa skill provides users the ability to get flash briefings on Company's KPI's. The skill demonstrates how to use AWS Lambda with SNS to enable authentication within a skill by providing us...,# Alexa Skill Authentication Sample\n\n## About the Alexa skill \n\nThis Alexa Skill provides users the ability to get flash briefings on Company's KPI's. The Skill demonstrates how to use AWS Lam...,API
43081197,aws-samples/reinvent2015-dev309,"Code examples for the AWS re:Invent 2015 presentation ""Large Scale Metrics Analysis in Ruby""","# Sample Code for ""Large Scale Metrics Analysis in Ruby""\n\nThis repository contains annotated example code to accompany the AWS re:Invent\n2015 presentation DEV309: ""Large Scale Metrics Analysis ...",API
148227812,aws-samples/aws-ai-ml-workshop-kr,A collection of localized (Korean) AWS AI/ML workshop materials for hands-on labs.,## AWS AI/ML Workshop - Korea\n\nA collection of localized (Korean) AWS AI/ML workshop materials for hands-on labs. \n\n## Directory Structure\n\nHands-on materials wiil get enriched over time as ...,API
85223732,aws-samples/ecs-mxnet-example,An example project to deploy MXNet inference API with Docker on Amazon ECS. Uses CodePipeline and CodeBuild to build the image to deploy to ECS.,"## Deploy a MXNet predict function to Amazon ECS using CodeCommit and CodePipeline\n\nThis project will create an automated workflow that will provision, configure and orchestrate a pipeline trigg...",API
111019009,amazon-archives/aws-servicebroker-rds,AWS Service Broker deployment module for Amazon Relational Database Service,# Amazon RDS for the AWS Service Broker\n\nThis project has been archived and merged into the [aws-servicebroker](https://github.com/awslabs/aws-servicebroker/) repository.\n\n## License\n\nThis l...,API
226389559,twitchdev/issues,Issue tracker for third party developers.,# Third party developer product bug reports (beta)\nThis repository provides a means for third party developers to report bugs – unexpected errors or flaws – related to Twitch developer products s...,API
173168715,aws/aws-secretsmanager-caching-go,The AWS Secrets Manager Go caching client enables in-process caching of secrets for Go applications.,## AWS Secrets Manager Go Caching Client\n\nThe AWS Secrets Manager Go caching client enables in-process caching of secrets for Go applications.\n\n## Getting Started\n\n### Required Prerequisites...,API
182310705,awsdocs/amazon-lightsail-developer-guide,"The open source version of the Amazon Lightsail docs. To submit feedback or requests for changes, submit an issue or make changes and submit a pull request.","## Amazon Lightsail Developer Guide\n\nThe open source version of the Amazon Lightsail docs. To submit feedback or requests for changes, submit an issue or make changes and submit a pull request.\...",API
65770217,awslabs/aws-dx-monitor,Simple AWS Direct Connect monitoring with Amazon CloudWatch.,# aws-dx-monitor\n\nThe purpose of ***aws-dx-monitor*** is enabling customers to monitor [AWS Direct Connect](https://aws.amazon.com/directconnect/) runtime configuration items with [Amazon Cloud...,API


### Investigating Quick Start LFs

I see a pattern wherein proejct names with "quickstart" and project descriptions with "Quick Start" seem to be `API` projects. Let's see if we're right by isolating and inspecting these records and then counting the number of labels for this subset.

In [473]:
# First look for 
quickstart_name_df = dev_df[dev_df['full_name'].str.contains('quickstart')]
quickstart_name_df

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
162490438,aws-quickstart/quickstart-cohesive-vns3,AWS Quick Start Team,# quickstart-cohesive-vns3\n## Cohesive Networks VNS3 on AWS\n\nThis Quick Start helps you deploy a Cohesive Networks VNS3 overlay network on the Amazon Web Services (AWS) Cloud in about 15 minute...,API
135639697,aws-quickstart/quickstart-spotinst-ecs,AWS Quick Start Team,# quickstart-spotinst-elastigroup-ecs\n## Spotinst Elastigroup for Amazon ECS on the AWS Cloud\n\nThis Quick Start sets up an AWS architecture for Spotinst Elastigroup for Amazon Elastic Container...,API
128856142,aws-quickstart/quickstart-titian-mosaic,AWS Quick Start Team,"# quickstart-titian-mosaic\n## Titian Mosaic FreezerManagement on the AWS Cloud\n\n\nThis Quick Start deploys Mosaic FreezerManagement, a comprehensive and cost-effective software solution for man...",API
61229297,aws-quickstart/quickstart-microsoft-sql,AWS Quick Start Team,# quickstart-microsoft-sql\n## SQL Server on AWS with Windows Server Failover Clustering and Always On Availability Groups\n\nAWS provides a comprehensive set of services and tools for deploying M...,API
99021933,aws-quickstart/quickstart-datalake-cognizant-talend,AWS Quick Start Team,"# quickstart-datalake-cognizant-talend\n## Data Lake on the AWS Cloud with Talend Big Data Platform, AWS Services, and Cognizant Best Practices\n\n\nThis Quick Start builds a data lake environment...",API
69507910,aws-quickstart/quickstart-github-enterprise,AWS Quick Start Team,## GitHub Enterprise on the AWS Cloud\nAWS provides a comprehensive set of services and tools for deploying Microsoft Windows-based workloads on its highly reliable and secure cloud infrastructure...,API
156335297,aws-quickstart/quickstart-memsql,AWS Quick Start Team,"# quickstart-memsql\n## MemSQL on the AWS Cloud\n\nThis Quick Start helps you to deploy MemSQL, a distributed, highly scalable SQL database, on the Amazon Web Services (AWS) Cloud.\n\nMemSQL inges...",API
148649910,aws-quickstart/quickstart-varnish-enterprise,AWS Quick Start Team,# quickstart-varnish-enterprise\n## Varnish on the AWS Cloud\n\n\nThis Quick Start deploys Varnish Enterprise (VE) on the Amazon Web Services (AWS) Cloud in about 30 minutes.\n\nVE is the commerci...,API
184812907,aws-quickstart/quickstart-ibaset-solumina,AWS Quick Start Team,,API
108919385,aws-quickstart/quickstart-aviatrix-controller,AWS Quick Start Team,# quickstart-aviatrix-controller\n\n\nThis readme covers five Amazon Web Services (AWS) Quick Starts that help you build a highly available Aviatrix Controller in a virtual private cloud (VPC) on ...,API


In [474]:
quickstart_df['label'].value_counts()

API    15
Name: label, dtype: int64

In [488]:
dev_df['description_lower'] = dev_df['description'].str.lower()
quickstart_desc_df = dev_df[dev_df['description_lower'].str.contains('quick start')]

del dev_df['description_lower']

quickstart_desc_df

Unnamed: 0_level_0,full_name,description,readme,label,description_lower
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
162490438,aws-quickstart/quickstart-cohesive-vns3,AWS Quick Start Team,"# quickstart-cohesive-vns3\n## Cohesive Networks VNS3 on AWS\n\nThis Quick Start helps you deploy a Cohesive Networks VNS3 overlay network on the Amazon Web Services (AWS) Cloud in about 15 minutes, following best practices from AWS and Cohesive Networks. Specifically, this environment can help organizations with workloads that fall within the scope of the U.S. Health Insurance Portability and Accountability Act (HIPAA). It addresses certain technical requirements in the Privacy, Security, and Breach Notification Rules (45 C.F.R. Parts 160 and 164) under the HIPAA Administrative Simplifica...",API,aws quick start team
135639697,aws-quickstart/quickstart-spotinst-ecs,AWS Quick Start Team,"# quickstart-spotinst-elastigroup-ecs\n## Spotinst Elastigroup for Amazon ECS on the AWS Cloud\n\nThis Quick Start sets up an AWS architecture for Spotinst Elastigroup for Amazon Elastic Container Service (Amazon ECS) and deploys it into your AWS account in about 7 minutes.\n\nSpotinst Elastigroup is an application scaling service. Similar to Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling groups, Spotinst Elastigroup is designed to optimize performance and costs by leveraging Spot Instances combined with On-Demand and Reserved Instances.\n\nUsing a combination of automated Spot Ins...",API,aws quick start team
128856142,aws-quickstart/quickstart-titian-mosaic,AWS Quick Start Team,"# quickstart-titian-mosaic\n## Titian Mosaic FreezerManagement on the AWS Cloud\n\n\nThis Quick Start deploys Mosaic FreezerManagement, a comprehensive and cost-effective software solution for managing and tracking all types of sample inventory, backed by a full audit trail.\n\nMosaic FreezerManagement provides a flexible interface to define and record properties for any type of sample or container, and manages your entire hierarchy of storage, including freezers, shelves, and cupboards. Other features include an intuitive search interface and expiration date tracking.\n\nThis Quick Start ...",API,aws quick start team
61229297,aws-quickstart/quickstart-microsoft-sql,AWS Quick Start Team,"# quickstart-microsoft-sql\n## SQL Server on AWS with Windows Server Failover Clustering and Always On Availability Groups\n\nAWS provides a comprehensive set of services and tools for deploying Microsoft Windows-based workloads on its highly reliable and secure cloud infrastructure. This Quick Start implements a high availability solution built with Windows Server and SQL Server running on Amazon EC2, using the Always On availability groups feature of SQL Server Enterprise edition.\n\nThe deployment includes Windows Server Failover Clustering (WSFC) and clustered SQL Server 2016 or 2017 i...",API,aws quick start team
99021933,aws-quickstart/quickstart-datalake-cognizant-talend,AWS Quick Start Team,"# quickstart-datalake-cognizant-talend\n## Data Lake on the AWS Cloud with Talend Big Data Platform, AWS Services, and Cognizant Best Practices\n\n\nThis Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying Talend Big Data Platform components and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).\n\nThe Quick Start also provides an optional sample dataset and Talend jobs developed by Cognizant Technology Solutions to illustrate big data practices for inte...",API,aws quick start team
69507910,aws-quickstart/quickstart-github-enterprise,AWS Quick Start Team,"## GitHub Enterprise on the AWS Cloud\nAWS provides a comprehensive set of services and tools for deploying Microsoft Windows-based workloads on its highly reliable and secure cloud infrastructure. This Quick Start deploys GitHub Enterprise on the AWS Cloud.\n\nGitHub Enterprise is a development and collaboration platform built on Git that enables developers to build and share software easily and effectively. It provides an integrated platform for continuous integration and development, a non-linear workflow for collaboration, and in-depth monitoring and auditing for administrators. By dep...",API,aws quick start team
156335297,aws-quickstart/quickstart-memsql,AWS Quick Start Team,"# quickstart-memsql\n## MemSQL on the AWS Cloud\n\nThis Quick Start helps you to deploy MemSQL, a distributed, highly scalable SQL database, on the Amazon Web Services (AWS) Cloud.\n\nMemSQL ingests data continuously to perform operational analytics on billions of rows of data in relational SQL, JSON, geospatial, and full-text search formats. MemSQL can handle both database workloads and data warehouse workloads, meeting transactional and analytical requirements. Together, MemSQL and AWS provide a compelling platform for building real-time applications.\n\nYou can use the AWS CloudFormatio...",API,aws quick start team
148649910,aws-quickstart/quickstart-varnish-enterprise,AWS Quick Start Team,"# quickstart-varnish-enterprise\n## Varnish on the AWS Cloud\n\n\nThis Quick Start deploys Varnish Enterprise (VE) on the Amazon Web Services (AWS) Cloud in about 30 minutes.\n\nVE is the commercial enterprise version of the open-source HTTP engine and reverse HTTP proxy, Varnish Cache (VC). Both versions of Varnish speed up a website by caching (storing) a copy of a page served by your web server the first time a user visits your page. The next time the user requests the same page, the cache will serve the copy quickly, instead of requesting the page from the web server again. VE provides...",API,aws quick start team
184812907,aws-quickstart/quickstart-ibaset-solumina,AWS Quick Start Team,,API,aws quick start team
108919385,aws-quickstart/quickstart-aviatrix-controller,AWS Quick Start Team,# quickstart-aviatrix-controller\n\n\nThis readme covers five Amazon Web Services (AWS) Quick Starts that help you build a highly available Aviatrix Controller in a virtual private cloud (VPC) on the AWS Cloud. \n\nYou can deploy the following solutions by using the Aviatrix Controller: \n\n- [Deploying Aviatrix Next-Gen Global Transit Hub on AWS](Transit-Hub-README.md)\n- [Deploying Aviatrix User VPN on AWS](User-VPN-README.md)\n- [Deploying Aviatrix FQDN Egress Filtering on AWS](FQDN-Egress-README.md)\n- [Deploying Aviatrix Site to Cloud VPN on AWS](Site2Cloud-VPN-README.md)\n- [Deployin...,API,aws quick start team


In [489]:
quickstart_desc_df['label'].value_counts()

API    14
Name: label, dtype: int64

### Evaluating Quick Start Strategy

So it looks like both the `full_name` pattern of `quickstart` (15 `API` labels) and the lowercase `description` pattern of `quick start` (18 `API` labels) both work. The description pattern matches two more records, otherwise they fully overlap. I'm going to leave both LFs in and move on to writing more LFs before we deal with evaluating results.

### Writing Another `LabelingFunction`

We're not done yet! We need two more LFs to demonstrate Snorkel's `LabelModel`. Lets do a `GENERAL` LF now. We start again by eyeballing the data.

In [490]:
# Change the maximum column width if we've set it longer below
pd.set_option('display.max_colwidth', 200)

stratified_sample(dev_df, ['API', 'GENERAL'], n=[10, 20])

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
81992883,aws-samples/opsworks-chef-automate-demo,An example CloudFormation template and cookbook for Chef,# opsworks-chef-automate-demo\n\nAn example CloudFormation template and cookbook for OpsWorks for Chef Automate\n\n\n## Contents\n\n* `cf/vpc-webserver.yaml`\n Example cloudformation template t...,API
224262047,awslabs/amazon-athena-cross-account-catalog,🌉 Reference implementation for granting cross-account AWS Glue Data Catalog access from Amazon Athena,,API
40384751,aws-samples/aws-device-farm-sample-app-for-ios,,# AWS Device Farm Sample App for iOS\n\nThis is a sample native iOS app that contains many of the stock iOS components and elements. It also contains multiple [Calabash tests](https://github.com/a...,API
147290704,amzn/emukit-playground,A web page explaining concepts of statistical emulation and making decisions under uncertainty in an interactive way.,"<div align=""center""><img width=""100"" src=""https://github.com/amzn/emukit-playground/raw/master/img/taxi.png"" /></div>\n<h1 align=""center"">Emukit Playground</h1>\n<p align=""center"">Learn about key ...",API
76065184,aws-samples/startup-kit-serverless-workload,"An example serverless RESTful API, to be deployed via the AWS Serverless Application Model (SAM).",# AWS Startup Kit Serverless Workload\n\nAn example serverless application project: a RESTful API backed by DynamoDB. The architecture is as follows:\n\n![Architecture](images/architecture.jpg)\n...,API
43467495,aws-samples/lambda-refarch-streamprocessing,Serverless Reference Architecture for Real-time Stream Processing,# Serverless Reference Architecture: Real-time Stream Processing\nREADME Languages: [DE](README/README-DE.md) | [ES](README/README-ES.md) | [FR](README/README-FR.md) | [IT](README/README-IT.md) |...,API
100396349,aws/amazon-ecs-cluster-state-service,Materialized local view of your ECS cluster state built on top of the Amazon ECS event stream.,# amazon-ecs-cluster-state-service\n\n### Description\n\nThe amazon-ecs-cluster-state-service consumes events from a stream of all changes to containers and instances across your Amazon ECS cluste...,API
234551970,aws-samples/aws-robomaker-sample-application-meirorunner,This sample application can run on AWS RoboMaker and demonstrate reinforcement learning machine learning for robotics,,API
98501101,aws-samples/serverless-codecommit-examples,Examples of serverless automation to process CodeCommit repository changes using CloudWatch Events.,# AWS CodeCommit Serverless Samples\n\nThe samples in this repository demonstrate several uses of AWS Lambda to process Amazon CloudWatch Events in response to changes to a AWS CodeCommit Git repo...,API
167454167,aws-samples/alexa-skill-authentication,This Alexa skill provides users the ability to get flash briefings on Company's KPI's. The skill demonstrates how to use AWS Lambda with SNS to enable authentication within a skill by providing us...,# Alexa Skill Authentication Sample\n\n## About the Alexa skill \n\nThis Alexa Skill provides users the ability to get flash briefings on Company's KPI's. The Skill demonstrates how to use AWS Lam...,API


### Evaluating a Cloud9 LF Strategy

I see there are several project that are part of the [Cloud9 IDE](https://aws.amazon.com/cloud9/), an open source project of `GENERAL` utility which Amazon acquired. Let's check out a Cloud 9 `LabelingFunction`. 

In [491]:
c9_df = dev_df[dev_df['full_name'].str.contains('c9/')]
c9_df

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30425276,c9/c9.ide.ace.keymaps,"The repository for c9.ide.ace.keymaps, a Cloud9 core plugin",# c9.ide.ace.keymaps\n,GENERAL
30425371,c9/c9.ide.language.css,"The repository for c9.ide.language.css, a Cloud9 core plugin",# c9.ide.language.css\n,GENERAL
4225718,c9/node-netutil,"utils to find free ports in a range, checking if a port is open, etc","node.js network utils\n=====================\n\nprovides:\n\nFind the first free port on the server within the given range:\n\n`findFreePort(start, end, hostname, callback)`\n\n\nCheck whether the...",GENERAL
30425336,c9/c9.ide.readonly,"The repository for c9.ide.readonly, a Cloud9 core plugin",# c9.ide.readonly\n,GENERAL
30503327,c9/c9.ide.language.javascript.eslint,"The repository for c9.ide.language.javascript.eslint, a Cloud9 core plugin",# c9.ide.language.javascript.eslint\n,GENERAL
33198322,c9/c9.ide.run.debug.xdebug,Cloud9 debugger plugin for Xdebug,# `c9.ide.run.debug.xdebug`\n\n[Cloud9](https://c9.io/) core plugin for [Xdebug](http://xdebug.org/) and other DBGP\ndebuggers.\n\n\nto install xdebug for php use\n\n```sh\nsudo apt-get update\nsu...,GENERAL
30425303,c9/c9.ide.fontawesome,"The repository for c9.ide.fontawesome, a Cloud9 core plugin",# c9.ide.fontawesome\n,GENERAL
30425338,c9/c9.ide.recentfiles,"The repository for c9.ide.recentfiles, a Cloud9 core plugin",# c9.ide.recentfiles\n,GENERAL


In [492]:
c9_df['label'].value_counts()

GENERAL    8
Name: label, dtype: int64

### Writing Cloud9 `LabelingFunctions`

We're getting to be old pros now, so lets write three more LFs for Cloud9 projects.

In [501]:
@labeling_function()
def cloud9_name_lf(x):
    """If the full name contains c9/ it is part of the Cloud9 IDE project which is GENERAL"""
    return GENERAL if 'c9/' in x.full_name.lower() else ABSTAIN


@labeling_function()
def cloud9_description_lf(x):
    """If the name or abbreviation for Cloud9 IDE is in the description, it is GENERAL"""
    return GENERAL if any(s in x.description.lower() for s in ('cloud9', 'cloud 9', 'c9')) else ABSTAIN


@labeling_function()
def cloud9_readme_lf(x):
    """If the name or abbreviatin for Cloud9 IDE is in the readme, is is GENERAL"""
    return GENERAL if any(s in x.readme.lower() for s in ('cloud9', 'cloud 9', 'c9')) else ABSTAIN

## Additional `LabelingFunctions`

So far the only form of LF we've introduced is the keyword LF. We'll be introducing more methods of labeling data when we cover Weak and Distant Supervision. For now I'm going to write several more LFs to make the `LabelModel` work.

First we will show longer columns to investigate the READMEs and then we will write a bunch of LFs at once, listing the strategy for each.

In [502]:
# Show more of the README columns
pd.set_option('display.max_colwidth', 600)

stratified_sample(dev_df, ['API', 'GENERAL'], n=[10,20], sorted=True)

Unnamed: 0_level_0,full_name,description,readme,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
43976244,amzn/ion-java,Java streaming parser/serializer for Ion.,# Amazon Ion Java\nA Java implementation of the [Ion data notation](http://amzn.github.io/ion-docs).\n\n[![Build Status](https://travis-ci.org/amzn/ion-java.svg?branch=master)](https://travis-ci.org/amzn/ion-java)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.amazon.ion/ion-java/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.amazon.ion/ion-java)\n[![Javadoc](https://javadoc-badge.appspot.com/com.amazon.ion/ion-java.svg?label=javadoc)](http://www.javadoc.io/doc/com.amazon.ion/ion-java)\n\n## Setup\nThis repository contains a [git submodule](https:...,GENERAL
155782195,amzn/smoke-http,Generic HTTP client for Swift applications,"<p align=""center"">\n<a href=""https://travis-ci.com/amzn/smoke-http"">\n<img src=""https://travis-ci.com/amzn/smoke-http.svg?branch=master"" alt=""Build - Master Branch"">\n</a>\n<img src=""https://img.shields.io/badge/os-linux-green.svg?style=flat"" alt=""Linux"">\n<a href=""http://swift.org"">\n<img src=""https://img.shields.io/badge/swift-5.0-orange.svg?style=flat"" alt=""Swift 5.0 Compatible"">\n</a>\n<a href=""http://swift.org"">\n<img src=""https://img.shields.io/badge/swift-5.1-orange.svg?style=flat"" alt=""Swift 5.1 Compatible"">\n</a>\n<a href=""https://gitter.im/SmokeServerSide"">\n<img src=""https://img...",GENERAL
108919385,aws-quickstart/quickstart-aviatrix-controller,AWS Quick Start Team,# quickstart-aviatrix-controller\n\n\nThis readme covers five Amazon Web Services (AWS) Quick Starts that help you build a highly available Aviatrix Controller in a virtual private cloud (VPC) on the AWS Cloud. \n\nYou can deploy the following solutions by using the Aviatrix Controller: \n\n- [Deploying Aviatrix Next-Gen Global Transit Hub on AWS](Transit-Hub-README.md)\n- [Deploying Aviatrix User VPN on AWS](User-VPN-README.md)\n- [Deploying Aviatrix FQDN Egress Filtering on AWS](FQDN-Egress-README.md)\n- [Deploying Aviatrix Site to Cloud VPN on AWS](Site2Cloud-VPN-README.md)\n- [Deployin...,API
110728431,aws-samples/aws-app-defined-permissions-demo,"Demonstrates how to create fully application-defined, dynamic and bespoke access controls to AWS resources at scale with Amazon Cognito","# Combining static IAM roles with application logic to create application-defined, dynamic, bespoke access to AWS resources at scale.\n\nWhen you develop applications using Amazon Cognito you can grant end users direct access to AWS resources using temporary credentials based on their group membership. This is a great mechanism to reduce server load, simplify applications, maintain security and deliver solutions quickly. When working at scale, with hundreds of users or thousands of resources, there are challenges to overcome. Within a single AWS account, you are limited to 500 IAM roles...",API
179733885,aws-samples/aws-appsync-long-query,Invoke AWS services directly from AWS AppSync via extended HTTP data source support.,"## AWS Appsync Long Query\n\n> Invoke AWS services directly from AWS AppSync via extended HTTP data source support.\n\nAWS AppSync has been extended to support directly calling AWS services via HTTP data sources. AppSync will sign requests using the Signature Version 4 process to authorize requests via AWS IAM. This means you can now call a broad array of AWS services without the need to write an intermediary Lambda function. For example, you could start execution of an AWS Step Functions state machine, retrieve a secret from AWS Secrets Manager, or list available GraphQL APIs from AppSync...",API
40384751,aws-samples/aws-device-farm-sample-app-for-ios,,# AWS Device Farm Sample App for iOS\n\nThis is a sample native iOS app that contains many of the stock iOS components and elements. It also contains multiple [Calabash tests](https://github.com/awslabs/aws-device-farm-sample-app-for-ios/tree/master/features) to get you started. You can also use this app with the AWS Device Farm [Built-in Fuzz Test](http://docs.aws.amazon.com/devicefarm/latest/developerguide/test-types-built-in-fuzz.html).\n\nYou can use this app and example test suite as a reference for your own Device Farm tests.\n\n##### **Notes**\nAll of the views are programatically c...,API
223240917,aws-samples/aws-elemental-conductor-amazon-sns,This split and stitch addon works with Elemental Conductor for acceleration of transcoding.,"# Elemental-SnS\n### Warning: If you don't know what Elemental Conductor does, you probably should leave this page.\nThis split and stitch addon works with Elemental Conductor for acceleration of transcoding. It splits a whole video into pieces of jobs with input clippings; after all segments completes transcoding, the stitch server then joins those clips according to corresponding output profiles. \n\n### How to install this addon?\nIn order to make this addon work, we need to install the followings:\n* A. Prepare dependencies:\npython3, pip3\nffmpeg(this is preinstalled on Elemental Serv...",API
156282726,aws-samples/serverless-ai-workshop,This workshop demonstrates two methods of machine learning inference for global production using AWS Lambda and Amazon SageMaker,# ServerlessAI\n\n## Serverless machine learning inference with AWS Lambda and SageMaker \nThis workshop demonstrates two methods of machine learning inference for globally-scalable production using AWS Lambda and Amazon SageMaker. **[Scikit-learn](https://scikit-learn.org)** is a popular machine learning library that covers most aspects of shallow learning. With these techniques the library module and your custom inference code can be combined into a flexible package for immediate cloud deployment. \n\nScikit-learn is vast and deep. It's an essential tool for every data scientist's workbe...,API
154230305,aws/aws-secretsmanager-caching-python,The AWS Secrets Manager Python caching client enables in-process caching of secrets for Python applications.,"## AWS Secrets Manager Python caching client\n\nThe AWS Secrets Manager Python caching client enables in-process caching of secrets for Python applications.\n\n## Getting Started\n\n### Required Prerequisites\n\nTo use this client you must have:\n\n* Python 3.6 or newer. Use of Python versions 3.5 or older are not supported.\n* An Amazon Web Services (AWS) account to access secrets stored in AWS Secrets Manager.\n * **To create an AWS account**, go to [Sign In or Create an AWS Account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) and then choose **I am a new us...",API
119432569,awsdocs/amazon-athena-user-guide,"The open source version of the Amazon Athena documentation. To submit feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.","## Amazon Athena User Guide\n\nThe open source version of the Amazon Athena documentation. To submit feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.\n\n## License Summary\n\nThe documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.\n\nThe sample code within this documentation is made available under a modified MIT license. See the LICENSE-SAMPLECODE file.\n",API


In [515]:
@labeling_function()
def alexa_lf(x):
    """If it has 'alexa' in the full name it is probably an Alexa skill, an API project"""
    return API if 'alexa' in x.full_name.lower() else ABSTAIN


@labeling_function()
def api_lf(x):
    """If it has 'api' in the name it is probably an API project"""
    return API if 'api' in x.full_name.lower() else ABSTAIN


@labeling_function()
def walkthrough_lf(x):
    """If it has 'walkthrough' in the full name or description, it is an example of an API project"""
    return API if ('walkthrough' in x.full_name.lower() or 'walkthrough' in x.description.lower()) else ABSTAIN


@labeling_function()
def skill_lf(x):
    """If it has 'skill' in the full name or description, it is probably an Alexa skill"""
    return API if ('skill' in x.full_name.lower() or 'skill' in x.description.lower()) else ABSTAIN


@labeling_function()
def kit_lf(x):
    """If 'kit' in the description, it is probably an API project"""
    return API if 'skill' in x.description.lower() else ABSTAIN


@labeling_function()
def ext_desc_lf(x):
    """If 'extension' appears in the description, it is probably an API project"""
    return API if 'extension' in x.description.lower() else ABSTAIN


@labeling_function()
def ext_readme_lf(x):
    """If 'extension' appears in the readme, it is probably an API project"""
    return API if 'extension' in x.description.lower() else ABSTAIN


@labeling_function()
def aws_name_lf(x):
    """IF 'aws' appears in the name it is probably an API project"""
    return API if 'aws' in x.full_name.lower() else ABSTAIN


@labeling_function()
def aws_description_lf(x):
    """IF 'aws' appears in the description it is probably an API project"""
    return API if 'aws' in x.description.lower() else ABSTAIN


@labeling_function()
def aws_readme_lf(x):
    """IF 'aws' appears in the readme it is probably an API project"""
    return API if 'aws' in x.readme.lower() else ABSTAIN


@labeling_function()
def integrate_desc_lf(x):
    """If 'integrate' or 'integration' are in the description it is probably an API project"""
    return API if ('integrate' in x.description.lower() or 'integration' in x.description.lower()) else ABSTAIN


@labeling_function()
def integrate_readme_lf(x):
    """If 'integrate' or 'integration' are in the description it is probably an API project"""
    return API if ('integrate' in x.readme.lower() or 'integration' in x.readme.lower()) else ABSTAIN


@labeling_function()
def dataset_lf(x):
    """If 'dataset' is in the description, it is probably a GENERAL academic contribution"""
    return API if ('dataset' in x.description.lower() or 'dataset' in x.readme.lower()) else ABSTAIN


@labeling_function()
def demo_name_lf(x):
    """If 'demo' appears in the full_name it is probably an API example"""
    return API if 'demo' in x.full_name.lower() else ABSTAIN


@labeling_function()
def demo_desc_lf(x):
    """If 'demo' appears in the description it is probably an API example"""
    return API if 'demo' in x.description.lower() else ABSTAIN


@labeling_function()
def demo_readme_lf(x):
    """If 'demo' appears in the readme it is probably an API example"""
    return API if 'demo' in x.readme.lower() else ABSTAIN


@labeling_function()
def ajax_lf(x):
    """If 'ajaxorg' appears in the full name it is probably a GENERAL utility"""
    return GENERAL if 'ajaxorg' in x.full_name.lower() else ABSTAIN


@labeling_function()
def docs_lf(x):
    """If 'docs' in the full name it is probably an API documentation project"""
    return API if 'docs' in x.full_name.lower() else ABSTAIN


@labeling_function()
def elastic_readme_lf(x):
    """If elastic is in the full name it is probably part to do with elastic IPs, so APIs"""
    return API if 'elastic' in x.readme.lower() else ABSTAIN


@labeling_function()
def node_lf(x):
    return API if 'node' in x.full_name.lower() else ABSTAIN

### Add More `GENERAL` LFs

We now have a lot of `API` LFs but not enough `GENERAL` LFs. We'll see later how the coverage for LFs needs to approximately match the distribution of labels.

In [504]:
# Look at just GENERAL readmes - we need more GENERAL LFs
dev_df[dev_df['label'] == 'GENERAL']['readme'].to_frame()

Unnamed: 0_level_0,readme
id,Unnamed: 1_level_1
102120162,"# checkexport #\n\ncheckexport is a tool to make sure that all the stuff you export is actually\nused somewhere else.\n\nYou run it against a particular package, and within a scope. By default, the\nrepo root of the targeted package is used.\n\n## Install ##\n```bash\ngo get github.com/twitchtv/checkexport\n```\n\n## Examples ##\n\nCheck whether exported values in `github.com/golang/dep/internal/gps` are used\nanywhere else in `github.com/golang/dep`:\n\n```bash\n$ checkexport -scope=github.com/golang/dep/... github.com/golang/dep/internal/gps\n/Users/snelson/go/src/github.com/golang/dep/i..."
30425276,# c9.ide.ace.keymaps\n
30425371,# c9.ide.language.css\n
155782195,"<p align=""center"">\n<a href=""https://travis-ci.com/amzn/smoke-http"">\n<img src=""https://travis-ci.com/amzn/smoke-http.svg?branch=master"" alt=""Build - Master Branch"">\n</a>\n<img src=""https://img.shields.io/badge/os-linux-green.svg?style=flat"" alt=""Linux"">\n<a href=""http://swift.org"">\n<img src=""https://img.shields.io/badge/swift-5.0-orange.svg?style=flat"" alt=""Swift 5.0 Compatible"">\n</a>\n<a href=""http://swift.org"">\n<img src=""https://img.shields.io/badge/swift-5.1-orange.svg?style=flat"" alt=""Swift 5.1 Compatible"">\n</a>\n<a href=""https://gitter.im/SmokeServerSide"">\n<img src=""https://img..."
4225718,"node.js network utils\n=====================\n\nprovides:\n\nFind the first free port on the server within the given range:\n\n`findFreePort(start, end, hostname, callback)`\n\n\nCheck whether the given port is open:\n\n`isPortOpen(hostname, port, timeout, callback)`\n\n\nGet the hostname of the current server:\n\n`getHostName(callback)`"
30425336,# c9.ide.readonly\n
55705525,"OutPlan\n=======\n\nOutPlan is an A/B testing framework based on Facebook's [PlanOut](http://facebook.github.io/planout).\nIt's designed to work with Node and client-side JavaScript.\n\nOutPlan is based on [PlanOut.js](https://github.com/HubSpot/PlanOut.js),\nwhich does all the hard work. _Thanks!_ OutPlan however ""outclasses"" classic\nPlanOut by not using classes. The resulting API is clean and simple.\n\n## Installation\n\n```\nnpm install outplan\n```\n\n## Usage\n\nSet up an experiment as follows:\n\n```javascript\noutplan.create(""nice-colors"", [""A"", ""B""]);\n```\n\nand then evaluate th..."
30503327,# c9.ide.language.javascript.eslint\n
33198322,"# `c9.ide.run.debug.xdebug`\n\n[Cloud9](https://c9.io/) core plugin for [Xdebug](http://xdebug.org/) and other DBGP\ndebuggers.\n\n\nto install xdebug for php use\n\n```sh\nsudo apt-get update\nsudo apt-get install -y php5-dev\nsudo pecl install xdebug\nsudo mkdir -p /etc/php5/mods-available\necho ""; Xdebug extension installed by Cloud9\nzend_extension=xdebug.so\nxdebug.remote_enable=1\n"" | sudo tee --append /etc/php5/mods-available/xdebug.ini\nsudo php5enmod xdebug\n```\n\n## License\n\n[The MIT License](http://opensource.org/licenses/MIT)\n\nCopyright (c) 2015 Ajax.org B.V.\n"
30425303,# c9.ide.fontawesome\n


In [507]:
@labeling_function()
def elastic_name_lf(x):
    """If elastic is in the full name it is probably part of the Open Distro for Elasticsearch, so GENERAL"""
    return GENERAL if 'elastic' in x.full_name.lower() else ABSTAIN


@labeling_function()
def elastic_desc_lf(x):
    """If elasticsearch is in the full name it is probably part of the Open Distro for Elasticsearch, so GENERAL"""
    return GENERAL if 'elasticsearch' in x.description.lower() else ABSTAIN

@labeling_function()
def cloud9ide_lf(x):
    """If the full name starts with 'cloud9ide' it is probably part of Cloud9 IDE which is GENERAL"""
    return GENERAL if x.full_name.lower().startswith('cloud9ide') else ABSTAIN

In [516]:
lfs = [
    sdk_lf,
    ion_lf,
    cloud9_name_lf,
    cloud9_description_lf,
    cloud9_readme_lf,
    alexa_lf,
    api_lf,
    walkthrough_lf,
    skill_lf,
    kit_lf,
    ext_desc_lf,
    ext_readme_lf,
    aws_name_lf,
    aws_description_lf,
    aws_readme_lf,
    integrate_desc_lf,
    integrate_readme_lf,
    dataset_lf,
    demo_name_lf,
    demo_desc_lf,
    demo_readme_lf,
    # ajax_lf,
    docs_lf,
    elastic_name_lf,
    elastic_desc_lf,
    elastic_readme_lf,
    node_lf,
    cloud9ide_lf,
]

# Create and apply a new Pandas 
applier = PandasLFApplier(lfs=lfs)

# Apply the LFs to the data to generate a list of labels
L_dev   = applier.apply(df=dev_df)
L_train = applier.apply(df=train_df)

# Run an label function analysis on the results, to describe their output against the labeled development data
lf_df = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(dev_labels.values)

lf_df







100%|██████████| 222/222 [00:00<00:00, 2324.28it/s][A[A






  0%|          | 0/414 [00:00<?, ?it/s][A[A[A[A[A[A





 49%|████▊     | 201/414 [00:00<00:00, 2008.60it/s][A[A[A[A[A[A





100%|██████████| 414/414 [00:00<00:00, 1960.19it/s][A[A[A[A[A[A


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
sdk_lf,0,[1],0.031532,0.031532,0.0,7,0,1.0
ion_lf,1,[0],0.004505,0.0,0.0,1,0,1.0
cloud9_name_lf,2,[0],0.036036,0.036036,0.004505,8,0,1.0
cloud9_description_lf,3,[0],0.031532,0.031532,0.0,7,0,1.0
cloud9_readme_lf,4,[0],0.076577,0.076577,0.045045,7,10,0.411765
alexa_lf,5,[1],0.040541,0.040541,0.0,9,0,1.0
api_lf,6,[1],0.013514,0.013514,0.0,3,0,1.0
walkthrough_lf,7,[1],0.004505,0.004505,0.0,1,0,1.0
skill_lf,8,[1],0.036036,0.036036,0.0,8,0,1.0
kit_lf,9,[1],0.036036,0.036036,0.0,8,0,1.0


## Ensuring `LabelingFunctions` Balance

In order to achieve good performance we need to make certain that the coverage and accurate output numbers of our `LabelingFunctions` are comparable to the proportion in which the labels occur in the dataset.

### Analyze the `DataFrame` returned by `LFAnalysis`

Fortunately `LFAnalysis` returns a `pandas.DataFrame` that we can analyze. Let's group the data by the label the LF returns and determine the coverage and accurate output of all of our LFs combined. First we'll need to remove any LFs that did not return any non `ABSTAIN` labels because they will crash our creation of a `Single Polarity` field that we'll use to group the data. 

Next we'll group the hand-labeled development dataset by label and then compare the proportios between the two tables by joining them and computing the difference between the two sets of proportions.

Note: LFs can return more than one label, which we'll demonstrate in Chapter 5, Weak and Distant Supervision. For now we can assume each returns one label.

In [532]:
# Filter out any LabelingFunctions that didn't return any labels other than ABSTAIN
lf_df = lf_df[lf_df['Polarity'].apply(lambda x: True if len(x) > 0 else alse)]

# Create a single polarity field to group on to evaluate each label's statistics
lf_df['Single Polarity'] = lf_df['Polarity'].apply(lambda x: x[0])

# Group the data by the single polarity
total_lf_df = lf_df.groupby('Single Polarity').agg({'Coverage': 'sum', 'Correct': 'sum'})

# Add a proportion column for the correct values
total_lf_df['Correct Proportions'] = total_lf_df['Correct'].div(total_lf_df['Correct'].sum(), axis=0).multiply(100)
total_lf_df

Unnamed: 0_level_0,Coverage,Correct,Correct Proportions
Single Polarity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.184685,26,3.730273
1,3.067568,671,96.269727


In [538]:
# Compute the same proportions for our hand-labeled development dataset so we can compare
total_dev_df = dev_df.groupby('label').agg({'full_name': 'count'})
total_dev_df

Unnamed: 0_level_0,full_name
label,Unnamed: 1_level_1
API,206
GENERAL,16


### Fixing a Problem



In [None]:
import io
import re

from bs4 import BeautifulSoup
from markdown import markdown


def markdown_to_code(markdown_text):
    """Extract source code from Markdown snippets"""
    code_blocks = []
    code_snippets = [] # These get a single block

    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            code_block = [f.readline()]
            while re.search("```", code_block[-1]) is None:
                code_block.append(f.readline())
            code_blocks.append("".join(code_block[:-1]))
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')
                    code_snippets.append(group)
    
    # Now combine all snippets into one code block
    code_blocks.append(' '.join(code_snippets))
    
    return '\n'.join(code_blocks)


def markdown_to_text(markdown_text):
    """Extract plaintext - minus the code snippets - from Markdown"""
    text_blocks = []
    f = io.StringIO(markdown_text)
    i = 0
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            print('is_block')
            first_record = f.readline()
            second_record = f.readline()
            print(f'first_record: {first_record}')
            print(f'second_record: {second_record}')
            code_block = [first_record]
            while re.search("```", code_block[-1]) is None:
                print('inside_block')
                f.readline()
        else:
            code = re.match(".*`(.+?)`.*", line)
            if code:
                for group in code.groups():
                    line = line.replace(f'`{group}`', '')

            text_blocks.append(line)
        i += 1
    
    md = ''.join(text_blocks)
    html = markdown(md)
    soup = BeautifulSoup(html, 'lxml')
    text = soup.find_all(text=True)
    out_text = []
    for text in text:
        if text == '\n':
            pass
        else:
            out_text.append(text)
    return out_text

print(df['readme'].iloc[6][1204:-1])

markdown_to_text(df['readme']).to_frame()

In [None]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()

In [413]:
utils.markdown_to_text(df['readme'].iloc[7]), utils.markdown_to_code(df['readme'].iloc[9])

(['Build An Alexa City Guide Skill',
  'This Alexa sample skill is a template for a basic fact skill. Provided a list of interesting facts about a topic, Alexa will select a fact at random and tell it to the user when the skill is invoked.',
  'To ',
  'Get Started',
  ' click the button below:',
  'Or click ',
  'here',
  ' for instructions using the ASK CLI (command line interface).',
  'Additional Resources',
  'Community',
  'Amazon Developer Forums',
  ' - Join the conversation!',
  'Hackster.io',
  ' - See what others are building with Alexa.',
  'Tutorials & Guides',
  'Voice Design Guide',
  ' - A great resource for learning conversational and voice user interface design.',
  'Codecademy: Learn Alexa',
  ' - Learn how to build an Alexa Skill from within your browser with this beginner friendly tutorial on Codecademy!',
  'Documentation',
  'Official Alexa Skills Kit Node.js SDK',
  ' - The Official Node.js SDK Documentation',
  'Official Alexa Skills Kit Documentation',
  ' - O

In [415]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()

KeyboardInterrupt: 