# Amazon Open Source Analysis

This project analyzes Amazon's Free and Open Source Software (FOSS) contributions to
determine the extent to which Amazon behaves as a good citizen of the open source community.
Much has been said about Amazon and open source software: much of it good, from Amazon, and
much of it bad, from the community. But what is the reality? Amazon, Google, Microsoft and
other cloud providers cite numbers that suggest good citizenship, but those numbers
are never checked for accuracy, honesty and integrity. This project aims to change that by
rigorously evaluating the open source contributions of cloud providers, starting with Amazon.

This project is an attempt to check the veracity of the claim made by Deirdré Straughan, managing editor of the AWS Open Source blog, on April 8, 2019 that, "Amazon has contributed [over 1,800 projects](https://github.com/search?utf8=%E2%9C%93&q=+user%3Aalexa+user%3Aamzn+user%3Aaws+user%3Aawsdocs+user%3Aawslabs+user%3Aaws-quickstart+user%3Ablox+user%3Aboto+user%3Ac9+user%3Acorretto+user%3Afirecracker-microvm+user%3Aaws-robotics+user%3Aajaxorg+user%3Agluon-api+user%3Acloud9ide+user%3ACarbonado+user%3Agoodreads+user%3AIvonaSoftware+user%3Atwitchtv+user%3Atwitchdev+user%3Atwitchscience+user%3Ajustintv+user%3AZappos+user%3Aamazon-archives+user%3Aalexa-labs+user%3Aaws-samples+user%3Aaws-amplify+user%3Aaws-cloudformation+user%3Aaws-solutions+user%3Aopendistro-for-elasticsearch+user%3Aopendistro&type=Repositories&ref=advsearch&l=&l=)
across 30 GitHub organizations ranging from [Alexa](https://github.com/alexa) to
[Zappos](https://github.com/Zappos). You can search them all from
[aws.github.com](https://aws.github.io/)."

## Source: Snorkel Tutorials

This project uses code from the [Snorkel Tutorials](https://github.com/snorkel-team/snorkel-tutorials).

Thanks to the [Snorkel Team](https://www.snorkel.org/) for making this important work feasible.

In [1]:
import logging

# Make sure randomness is reproducible
import random
random.seed('31337')

import warnings
warnings.simplefilter("ignore")

In [2]:
import pandas as pd

# Load the AWS Github repo READMEs scraped from Github
df = pd.read_json('data/aws_repos.jsonl', lines=True)
print(f'Total repositories: {len(df.index):,}')
df.head(3)

Total repositories: 2,568


Unnamed: 0,id,node_id,name,full_name,private,owner,html_url,description,fork,url,...,archived,disabled,open_issues_count,license,forks,open_issues,watchers,default_branch,permissions,score
0,61861755,MDEwOlJlcG9zaXRvcnk2MTg2MTc1NQ==,alexa-skills-kit-sdk-for-nodejs,alexa/alexa-skills-kit-sdk-for-nodejs,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Node.js helps you...,False,https://api.github.com/repos/alexa/alexa-skill...,...,False,False,8,"{'key': 'apache-2.0', 'name': 'Apache License ...",670,8,2811,2.0.x,"{'admin': False, 'push': False, 'pull': True}",1
1,84138837,MDEwOlJlcG9zaXRvcnk4NDEzODgzNw==,alexa-cookbook,alexa/alexa-cookbook,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/alexa-cookbook,A series of sample code projects to be used fo...,False,https://api.github.com/repos/alexa/alexa-cookbook,...,False,False,13,"{'key': 'other', 'name': 'Other', 'spdx_id': '...",912,13,1557,master,"{'admin': False, 'push': False, 'pull': True}",1
2,63275452,MDEwOlJlcG9zaXRvcnk2MzI3NTQ1Mg==,skill-sample-nodejs-fact,alexa/skill-sample-nodejs-fact,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/skill-sample-nodejs-fact,Build An Alexa Fact Skill,False,https://api.github.com/repos/alexa/skill-sampl...,...,False,False,7,"{'key': 'apache-2.0', 'name': 'Apache License ...",1186,7,1002,master,"{'admin': False, 'push': False, 'pull': True}",1


## How many are original projects and how many are forks?

If the `fork` field is `True`, the repository is not an original Amazon project; it is a fork of someone else's project. Let's see how many repositories Amazon and its subsidiaries created and how many they forked. Forking a project does not amount to contributing one: it takes a single click on GitHub.

In [3]:
print(f"Total original repositories: {len(df[df['fork'] == False].index)}")

Total original repositories: 2568


In [4]:
df[df['full_name'] == 'awslabs/dynamodb-transactions']

Unnamed: 0,id,node_id,name,full_name,private,owner,html_url,description,fork,url,...,archived,disabled,open_issues_count,license,forks,open_issues,watchers,default_branch,permissions,score
660,10943591,MDEwOlJlcG9zaXRvcnkxMDk0MzU5MQ==,dynamodb-transactions,awslabs/dynamodb-transactions,False,"{'login': 'awslabs', 'id': 3299148, 'node_id':...",https://github.com/awslabs/dynamodb-transactions,,False,https://api.github.com/repos/awslabs/dynamodb-...,...,False,False,4,"{'key': 'apache-2.0', 'name': 'Apache License ...",87,4,303,master,"{'admin': False, 'push': False, 'pull': True}",1


Looks like all the repositories listed were created by Amazon or by companies Amazon has acquired.

## Get READMEs for each repository

The [Github README API](https://developer.github.com/v3/repos/contents/#get-the-readme) makes it very easy to download the README of a project. Let's fetch the README of every Amazon open source project on Github.

**Note: you need only run the following four code cells once, thereafter skip ahead to the 4th cell.**

### Pull Github Tokens from Environment

You need to have set the `GITHUB_USERNAME` and `GITHUB_TOKEN` environment variables in the shell from which you ran `docker-compose up` or `jupyter notebook`.

In [5]:
import base64
import os
import time


# Ensure that both essential environment variables have been set.
GITHUB_USERNAME = os.environ.get('GITHUB_USERNAME')
if not GITHUB_USERNAME:
    raise Exception('Environment variable GITHUB_USERNAME must be defined')

GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN')
if not GITHUB_TOKEN:
    raise Exception('Environment variable GITHUB_TOKEN must be defined')

### Define a `get_readme(full_name)` Function

We need a function `get_readme(full_name)` that will take a `full_name` and return a UTF-8 encoded Github README using the credentials set in the `GITHUB_USERNAME`/`GITHUB_TOKEN` environment variables.

In [6]:
import time
from urllib3.exceptions import ProtocolError

import requests
from requests.exceptions import ConnectionError, SSLError, Timeout


HEADERS = {
    'Accept-Encoding' : 'gzip'
}
GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN')
GITHUB_API = 'https://api.github.com/repos/{full_name}/readme'
TIMEOUT = 3.0
SLEEP_MINUTES = 11


class GithubRequestException(Exception):
    """Set the status code of a Github request for evaluating exceptions"""
    def __init__(self, message, status_code=None):
        self.message = message
        self.status_code = status_code


def get_readme(full_name):
    """Given the full name of a project, return a UTF-8 README or throw a GithubRequestException"""
    
    api_url = GITHUB_API.format(
        full_name=full_name,
    )
    
    try:
        response = requests.get(
            api_url,
            auth=(GITHUB_USERNAME, GITHUB_TOKEN),
            timeout=TIMEOUT,
            headers=HEADERS,
        )
        
        # Parse the README
        record = response.json()
        readme_64 = record.get('content')
        readme = None
        if readme_64:
            readme = base64.b64decode(
                readme_64
            ).decode()
        
        if response.status_code == 200:
            return readme
        else:
            raise GithubRequestException(
                f'Error: got response code {response.status_code}',
                response.status_code
            )

    # A failed request never produces a response object, so these errors carry no status code
    except SSLError:
        raise GithubRequestException('SSL Error')

    except Timeout:
        raise GithubRequestException(f'Timeout error: {TIMEOUT}s exceeded')

    except (ConnectionError, ProtocolError):
        raise GithubRequestException('Connection or protocol error')
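
The README endpoint returns a JSON record whose `content` field holds Base64 text, which may be wrapped across several lines. The decoding step can be tried offline with a minimal sketch like the one below. It uses a hand-built record rather than a live API response, and `decode_readme` and `fake_record` are illustrative names that are not part of this notebook.

```python
import base64


def decode_readme(record):
    """Return the README text from an API-style record, or None if it has no content."""
    content = record.get('content')
    if not content:
        return None
    # b64decode discards the newlines that wrap the encoded payload;
    # errors='replace' keeps one bad byte from aborting the whole README
    return base64.b64decode(content).decode('utf-8', errors='replace')


# Hand-built stand-in for a response body (an assumption, not real API output)
encoded = base64.encodebytes('# Example README\nHello, world.\n'.encode('utf-8')).decode('ascii')
fake_record = {'encoding': 'base64', 'content': encoded}

readme = decode_readme(fake_record)
print(readme)
```

Returning `None` for a record without content mirrors how `get_readme` leaves `readme` unset when the `content` field is missing.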

## Define Variables for READMEs

We want these defined in a separate cell in case we have to retry our requests.

In [7]:
# Use a dict to hold READMEs so we can see if they're done already, but also a list to create a Series from
readmes = {}
readme_nums = {}
readme_list = []

# Use a dict to hold do-over flags so we don't redo work
redos = {}

### Fetch READMEs for All Packages

Now we loop through the `full_names` of all packages and store their READMEs in a dict

In [16]:
from tqdm.notebook import tqdm


# Materialize the (index, full_name) pairs; Series.iteritems() was removed in pandas 2.0
full_names = list(df['full_name'].items())

# Given a project full_name (owner/repo) fetch the README
for i, full_name in tqdm(
    full_names,
    desc='Fetching READMEs for all packages'
):
    # We already did this one
    if isinstance(readmes.get(full_name), str) and readmes[full_name]:
        continue
    
    # Fetch the Github README, handling errors, logging any GithubRequestException
    # for doing over again
    try:
        readme = get_readme(full_name)

        # If we got a valid record back, insert it
        if readme and isinstance(readme, str):
            readmes[full_name] = readme
            readme_nums[i] = readme
            readme_list.append(readme)
            # Get rid of any redo record
            if full_name in redos:
                del redos[full_name]
        # Otherwise redo it and store empty string
        else:
            redos[full_name] = True
            readme_list.append(readme)

        # Don't flood Github
        time.sleep(0.1)
    
    except GithubRequestException as e:
        logging.error(e)
        redos[full_name] = True
        
        # status_code is an int, so compare against 403 rather than the string '403'
        if e.status_code == 403:
            sleep_time = SLEEP_MINUTES * 60
            logging.info(f'Ran into response code {e.status_code}! Sleeping for {SLEEP_MINUTES} minutes ...')
            time.sleep(sleep_time)

print(f'Need to redo {len(redos)} repos ...')

ERROR:root:('Error: got response code 404', 404)


Need to redo 29 repos ...
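
The fetch loop sleeps for a fixed `SLEEP_MINUTES` after a 403. GitHub's REST API also reports rate-limit state in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, so the wait could be computed instead of guessed. The sketch below assumes the headers arrive as a plain mapping with exactly these key names. `seconds_until_reset` and `padding` are hypothetical names, and `get_readme` would have to expose `response.headers` before this could be wired in.

```python
import time


def seconds_until_reset(headers, now=None, padding=5):
    """Return how long to sleep until the rate-limit window resets (0 if not limited)."""
    now = time.time() if now is None else now
    remaining = int(headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0
    # The reset header is a UTC epoch timestamp in seconds
    reset_at = int(headers.get('X-RateLimit-Reset', now))
    # Never return a negative wait, and pad slightly for clock skew
    return max(0, reset_at - int(now)) + padding


# Hypothetical header values for illustration
example_headers = {'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1600000600'}
wait = seconds_until_reset(example_headers, now=1600000000)
print(f'Sleep for {wait} seconds')
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to check without waiting on a real clock.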


In [22]:
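def add_readme(x):
    """Use the README dict to add the README for this full_name, or '' if it is missing"""
    return readmes.get(x['full_name'], "")

df['readme'] = df.apply(add_readme, axis=1)

# Count the repositories without a README. Incrementing an int argument inside
# the function would only change a local copy, so count from the column instead.
missed = int((df['readme'] == "").sum())
df.head()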

0


Unnamed: 0,id,node_id,name,full_name,private,owner,html_url,description,fork,url,...,disabled,open_issues_count,license,forks,open_issues,watchers,default_branch,permissions,score,readme
0,61861755,MDEwOlJlcG9zaXRvcnk2MTg2MTc1NQ==,alexa-skills-kit-sdk-for-nodejs,alexa/alexa-skills-kit-sdk-for-nodejs,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Node.js helps you...,False,https://api.github.com/repos/alexa/alexa-skill...,...,False,8,"{'key': 'apache-2.0', 'name': 'Apache License ...",670,8,2811,2.0.x,"{'admin': False, 'push': False, 'pull': True}",1,"<p align=""center"">\n <img src=""https://m.medi..."
1,84138837,MDEwOlJlcG9zaXRvcnk4NDEzODgzNw==,alexa-cookbook,alexa/alexa-cookbook,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/alexa-cookbook,A series of sample code projects to be used fo...,False,https://api.github.com/repos/alexa/alexa-cookbook,...,False,13,"{'key': 'other', 'name': 'Other', 'spdx_id': '...",912,13,1557,master,"{'admin': False, 'push': False, 'pull': True}",1,\n# Alexa Skill Building Cookbook\n\n<div styl...
2,63275452,MDEwOlJlcG9zaXRvcnk2MzI3NTQ1Mg==,skill-sample-nodejs-fact,alexa/skill-sample-nodejs-fact,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/skill-sample-nodejs-fact,Build An Alexa Fact Skill,False,https://api.github.com/repos/alexa/skill-sampl...,...,False,7,"{'key': 'apache-2.0', 'name': 'Apache License ...",1186,7,1002,master,"{'admin': False, 'push': False, 'pull': True}",1,"# Build An Alexa Fact Skill\n<img src=""https:/..."
3,81483877,MDEwOlJlcG9zaXRvcnk4MTQ4Mzg3Nw==,avs-device-sdk,alexa/avs-device-sdk,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/avs-device-sdk,An SDK for commercial device makers to integra...,False,https://api.github.com/repos/alexa/avs-device-sdk,...,False,54,"{'key': 'apache-2.0', 'name': 'Apache License ...",477,54,993,master,"{'admin': False, 'push': False, 'pull': True}",1,### What is the Alexa Voice Service (AVS)?\n\n...
4,38904647,MDEwOlJlcG9zaXRvcnkzODkwNDY0Nw==,alexa-skills-kit-sdk-for-java,alexa/alexa-skills-kit-sdk-for-java,False,"{'login': 'alexa', 'id': 17815977, 'node_id': ...",https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Java helps you ge...,False,https://api.github.com/repos/alexa/alexa-skill...,...,False,2,"{'key': 'apache-2.0', 'name': 'Apache License ...",720,2,715,2.0.x,"{'admin': False, 'push': False, 'pull': True}",1,"<p align=""center"">\n <img src=""https://m.medi..."


## Store the Data for Hand Labeling of a Sample

Store the data as CSV for hand labeling to guide our Labeling Function development. Also store to Parquet.

In [22]:
# df.columns

Index(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url',
       'description', 'fork', 'url', 'forks_url', 'keys_url',
       'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url',
       'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url',
       'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url',
       'languages_url', 'stargazers_url', 'contributors_url',
       'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url',
       'comments_url', 'issue_comment_url', 'contents_url', 'compare_url',
       'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url',
       'milestones_url', 'notifications_url', 'labels_url', 'releases_url',
       'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url',
       'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size',
       'stargazers_count', 'watchers_count', 'language', 'has_issues',
       'has_projects', 'has_downloads', 'has_wiki', 'has

### Write the Data Out to Parquet

Here we flatten the nested fields we need, remove the rest and store the data in Parquet format for later use. We also create a dummy/empty `label` column.

In [24]:
# import pyarrow

# # Show all columns
# pd.options.display.max_columns = None

# # Un-nest/flatten owner and license
# df['owner']   = df['owner'].apply(lambda x:  x['login'] if isinstance(x, dict) and 'login' in x else x)
# df['license'] = df['license'].apply(lambda x: x['name'] if isinstance(x, dict) and 'name'  in x else x)

# # Write the relevant columns to parquet format
# df[[
#     'id',
#     'node_id',
#     'name',
#     'full_name',
#     'owner',
#     'description',
#     'url',
#     'html_url',
#     'created_at',
#     'updated_at',
#     'pushed_at',
#     'homepage',
#     'size',
#     'stargazers_count',
#     'watchers_count',
#     'language',
#     'license',
#     'forks_count',
#     'open_issues_count',
#     'default_branch',
#     'score',
#     'readme',
# ]].to_parquet(
#     'data/Amazon.Repos.READMEs.2-26-2020.parquet'
# )

# # Load the data we previously enriched with Github READMEs
# df = pd.read_parquet('data/Amazon.Repos.READMEs.2-26-2020.parquet')

# Add a dummy column for labels
df['label'] = None

### Prepare Data for Hand Labeling

Here we prepare the data for labeling using [SMART](https://github.com/RTIInternational/SMART). I had to alter SMART with this [pull request](https://github.com/RTIInternational/SMART/pull/47) (explanation in [ticket #46](https://github.com/RTIInternational/SMART/issues/46)) to display multi-line project descriptions and annotation cards. This way we can show the project URL, description and the top part of the README in a readable way.

![Smart Project](images/SMART_multi_line_project_description.png)

![Smart Annotation Card](images/SMART_multi_line_annotation_card.png)

In [25]:
import re

from bs4 import BeautifulSoup
from markdown import markdown


def extract_text_plain(x):
    """Extract non-code text from a README, dropping URLs and collapsing newlines"""
    if not isinstance(x, str):
        x = ''
    # Name the parser explicitly so results don't depend on which parsers are installed
    doc = BeautifulSoup(x, 'html.parser')
    # Remove <code> elements from the tree before reading the text
    for code in doc.find_all('code'):
        code.extract()
    text = re.sub(r'http\S+', ' ', doc.text)
    text = re.sub(r'\n+', ' ', text)
    return text


# Add the README text to the dataframe
df['readme_text'] = df['readme'].apply(extract_text_plain)


def create_question(row):
    """Given a row in a DataFrame, add a question column from the full_name and description"""
    
    # Slicing already stops at the end of shorter strings, so no length check is needed
    head_len = 3000
    readme_text = row['readme_text'][:head_len]
    
    question = (
        f"""{row['url']}

{row['description']}

{readme_text}"""
    )
    row['question'] = question
    return row


# Select and rename the fields to the format that SMART expects. See https://github.com/RTIInternational/SMART
df_smart = df.apply(create_question, axis=1)[['id', 'question', 'label']]
df_smart = df_smart.rename(columns={'id': 'ID', 'question': 'Text', 'label': 'Label'})

# Write to CSV for SMART to load for hand labeling.
df_smart.to_csv(
    'data/Amazon_Open_Source_Analysis_Questions - SMART.csv',
    index=False,
)

## Load the READMEs

We now load the READMEs back and join them with the original repository DataFrame.

**Note: From now on we will simply load the data, skipping the previous cell.**

In [None]:
# Temporary load from other machine
import pyarrow

readme_df = pd.read_parquet('data/aws_readmes.parquet', engine='pyarrow')[['id', 'readme']]

# Join READMEs in and drop duplicate ID column
df_join = df.join(readme_df, lsuffix='_readme_df')
del df_join['id_readme_df']

df = df_join

df.head(3)

## Create spaCy Documents from READMEs

Set up the large English language model and have it merge multi-token named entities.

In [None]:
import spacy
from spacy.pipeline import merge_entities


# Enable a GPU if you have one
spacy.prefer_gpu()

# Download the spaCy english model
spacy.cli.download('en_core_web_lg')
nlp = spacy.load("en_core_web_lg")

# Merge multi-token entities together
# (spaCy 2.x takes the function itself; spaCy 3.x expects nlp.add_pipe('merge_entities'))
nlp.add_pipe(merge_entities)

nlp.pipeline

In [None]:
df['spacy'] = df['readme'].apply(nlp)
df.head(3)

## Load the Gold Labeled Data

Data was labeled via a [Google Sheet](https://docs.google.com/spreadsheets/d/1ULt0KxIdb5HUJCEMt_AmOuPbTvN1zg8UA_4RvjlVwXQ/edit?usp=sharing) and exported to CSV at [data/Amazon_Open_Source_Analysis_Gold.csv](data/Amazon_Open_Source_Analysis_Gold.csv).

### Note: Submitting Corrections or Additions

If you feel any labels are wrong, first read the definitions in the README and comment on the sheet. You may also copy the Google Sheet and continue labeling yourself if you want to ensure the accuracy of this analysis.

In [None]:
# Load all 2,469 records and then filter out the unlabeled ones (all but 200)
df_gold = pd.read_csv('data/Amazon_Open_Source_Analysis_Gold.csv')

df_gold = df_gold[df_gold['label'].notnull()]
print(f'Gold labeled records: {len(df_gold.index):,}')

df_gold = df.set_index('id').join(
    df_gold.set_index('id'),
    how='inner',
    on='id',
    rsuffix='_gold',
)

# Drop duplicate columns
df_gold = df_gold.drop(
    [
        'full_name_gold','url_gold','description_gold','fork_gold','forks_count_gold',
        'language_gold','homepage_gold','open_issues_count_gold','watchers_gold', 
        'readme_gold',
        
    ],
    axis=1,
)

# Drop ABSTAIN labels
df_gold = df_gold[df_gold['label'] != 'ABSTAIN']
print(f'Records minus ABSTAIN: {len(df_gold.index):,}')

df_gold[['full_name', 'label']].head(20)

## Check Label Imbalance

If the labels are highly imbalanced, it will throw off our `LabelModel`.

In [None]:
df_gold['label'].value_counts()
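
Raw counts are easier to judge as shares of the whole. In pandas, `value_counts(normalize=True)` gives this directly. The dependency-free sketch below shows the same computation on made-up labels. `label_shares` and `example_labels` are illustrative names, and the numbers are not the project's gold data.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its share of all labels, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Made-up labels for illustration; not the project's gold data
example_labels = ['API'] * 6 + ['EDUCATION'] * 3 + ['GENERAL'] * 1
shares = label_shares(example_labels)
for label, share in shares.items():
    print(f'{label:<10} {share:.0%}')
```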

### Defining Label Schema

The labels for this dataset are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |
| 2      | EDUCATION | An educational library for AWS    |
| 3      | DATASET   | REMOVED: An open dataset by Amazon|

In [None]:
ABSTAIN   = -1
GENERAL   = 0
API       = 1
EDUCATION = 2
# DATASET   = 3

label_pairs = [
    (ABSTAIN, 'ABSTAIN'),
    (GENERAL, 'GENERAL'),
    (API, 'API'),
    (EDUCATION, 'EDUCATION'),
    # (DATASET, 'DATASET'),
]

# Forward and reverse indexes to labels/names
number_to_name_dict = dict(label_pairs)
name_to_number_dict = {name: number for number, name in label_pairs}

In [None]:
# Map the labels to their numeric label numbers
def name_to_number(name):
    """Convert string labels from the Google Sheet to their numeric values"""
    return name_to_number_dict[name]


def number_to_name(num):
    """Convert numeric labels to their values in the Google Sheet"""
    return number_to_name_dict[num]


df_gold['label_num'] = df_gold['label'].apply(name_to_number)
df_gold[['full_name','label','label_num']].head(20)

## Now Create a Random Forest Model using a Sparse Representation to Pick Keyword Label Functions

We will use the spaCy doc we created to lemmatize as we tokenize the words, giving us better representations for feature importances.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS


STOP_WORDS = STOP_WORDS.union(set(['-', '=', "'", '/', '.', ';', '#', '##', '###', '<', '>', ':', '\n', '\n\n', '!', '[', ']', ')', '{', '}',]))


def tokenize(doc):
    """Tokenize: skip stop words, return proper nouns as n-grams and lemmas for everything else"""
    tokens = []
    for token in doc:
        # Drop stop words
        if token.text.lower() in STOP_WORDS:
            continue
            
        # Lemmatize anything that isn't a proper noun
        if token.pos_ != 'PROPN':
            tokens.append(token.lemma_.lower())
        else:
            tokens.append(token.text.lower())
    return tokens


df_gold['lemmas'] = df_gold['spacy'].apply(tokenize)
df_gold['lemmas']

## TF-IDF Vectorize and Split the Text Data and Labels into Train/Test Sets

We need to vectorize the data in a sparse representation to train the model and get feature importances for each word, so we use sklearn's `TfidfVectorizer` to give more important words more weight.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


vectorizer = TfidfVectorizer(
    analyzer='word',
    min_df=3,
    stop_words=None,
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None,
    lowercase=False,
    ngram_range=(1, 3)
)

df_gold_train, df_gold_test, train_labels, test_labels = train_test_split(
    df_gold,
    df_gold['label_num'],
    test_size=0.3,
    random_state=1337,
)

train_vec = vectorizer.fit_transform(
    df_gold_train['lemmas']
)
test_vec = vectorizer.transform(
    df_gold_test['lemmas']
)
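
To make the weighting concrete, here is a pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It deliberately ignores `min_df` and `ngram_range`, and the toy documents are hypothetical stand-ins for the `lemmas` column.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + freq)) + 1 for t, freq in doc_freq.items()}

    rows = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


# Toy token lists standing in for the `lemmas` column (hypothetical data)
docs = [['aws', 'sdk', 'java'], ['aws', 'sample', 'skill'], ['aws', 'sdk', 'python']]
weights = tfidf(docs)
print(weights[0])
```

In the first document, `aws` appears in every document and so gets the lowest weight, while `java` appears in only one and gets the highest.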

## Now Train a `RandomForestClassifier` and Determine Overall Feature Importances

A random forest model can give us overall feature importances directly, but it doesn't tell us which class they were important for or in which direction: for or against.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score


# Fit the model on the training data; a fixed random_state keeps the importances reproducible
clf = RandomForestClassifier(random_state=1337)
clf.fit(train_vec, train_labels)

# Score the model to see if it is worth using for inference
avg = 'weighted'
pred = clf.predict(test_vec)
print(f"Model weighted F1 score: {f1_score(test_labels, pred, average=avg)}")

# Display features and importances in a DataFrame
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names())
features = pd.DataFrame(
    {'importance': clf.feature_importances_},
    index=vectorizer.get_feature_names_out(),
)
features = features.sort_values(
    by=['importance'],
    ascending=False,
)
features[0:20]

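## Split the Data to Enable an Experiment

We need training and test sets to work with. The full, mostly unlabeled dataset serves as the training set, and the gold-labeled sample serves as the test set, which also doubles as the development set.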

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Use the common notation for df_train and df_test
df_train = df
y_train = np.zeros(
    len(df.index)
)
df_test = df_gold
y_test  = df_gold['label_num'].to_numpy()

## Create Utilities for Generating Keyword `LabelingFunctions`

Given a set of keywords and a set of fields, we want to generate labeling functions that search for those keywords in each field. Each field always gets its own `LabelingFunction`; the keywords can either be grouped into a single LF per field or split into one LF per keyword.

In [None]:
from snorkel.labeling import LabelingFunction


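def keyword_lookup(x, keywords, field, label):
    """Return label if any keyword occurs (case-insensitively) in the given field of x, else ABSTAIN"""
    value = x[field] if field in x else None
    # Skip missing or non-string values such as None or NaN
    if isinstance(value, str) and any(word.lower() in value.lower() for word in keywords):
        return label
    return ABSTAIN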


def make_keyword_lf(keywords, field, label=ABSTAIN, separate=False):
    """Given a list of keywords and a label, return a keyword search LabelingFunction"""
    prefix = 'separate_' if separate else ''
    name = f'{prefix}{keywords[0]}_field_{field}'        
    return LabelingFunction(
        name=name,
        f=keyword_lookup,
        resources=dict(
            keywords=keywords,
            field=field,
            label=label,
        ),
    )

def make_keyword_lfs(keywords, fields, label=ABSTAIN, separate=False):
    """Given a list of keywords and fields, make one or more LabelingFunctions for the keywords with each field"""
    lfs = []
    for field in fields:
        
        # Optionally make one LF per keyword
        if separate:
            for i, keyword in enumerate(keywords):
                lfs.append(
                    make_keyword_lf(
                        [keyword],
                        field,
                        label=label,
                        separate=separate,
                    )
                )
        # Optionally group keywords in a single LF for each field
        else:
            lfs.append(
                make_keyword_lf(
                    keywords,
                    field,
                    label=label,
                )
            )
    return lfs
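
One caveat: `keyword_lookup` matches substrings, so a short keyword such as 'ion' also fires on names like `awslabs/dynamodb-transactions`. A possible variant that matches whole tokens is sketched below. `keyword_lookup_tokens` is a hypothetical helper rather than part of Snorkel, the label constants are repeated so that the block runs on its own, and `amzn/ion-java` is used purely as an illustrative name.

```python
import re

ABSTAIN = -1
GENERAL = 0


def keyword_lookup_tokens(x, keywords, field, label):
    """Like keyword_lookup, but a keyword must match a whole token, not a substring."""
    value = x.get(field)
    if not isinstance(value, str):
        return ABSTAIN
    # Split on anything that is not a letter or digit, e.g. '/', '-', '_' and spaces
    tokens = set(re.split(r'[^a-z0-9]+', value.lower()))
    if any(word.lower() in tokens for word in keywords):
        return label
    return ABSTAIN


# Plain dicts stand in for DataFrame rows here
ion_repo = {'full_name': 'amzn/ion-java'}
other_repo = {'full_name': 'awslabs/dynamodb-transactions'}

ion_vote = keyword_lookup_tokens(ion_repo, ['ion'], 'full_name', GENERAL)
other_vote = keyword_lookup_tokens(other_repo, ['ion'], 'full_name', GENERAL)
print(ion_vote, other_vote)
```

The trade-off is that multi-word keywords such as 'add amazon' never match a single token, so those would still need the substring version.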

In [None]:
%%html
<style>
table {float:left}
</style>

## Label Schema

The labels for this dataset are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |
| 2      | EDUCATION | An educational library for AWS    |
| 3      | DATASET   | REMOVED: An open dataset by Amazon|

## Labeling Functions

Labeling functions each weakly label the data and need only be better than random. Snorkel's
unsupervised generative graphical model combines these weak labels into strong labels by 
looking at the overlap, conflict and coverage of each weak label set.

| Logic                                 | Fields                               | Label       | 200 Sample Accuracy |
|---------------------------------------|--------------------------------------|-------------|---------------------|
| If 'sdk' is in                        | `full_name`                          | `API`       |                     |
| If 'api' is in                        | `full_name`                          | `API`       |                     |
| If 'walkthrough' is in                | `full_name`, `description`, `readme` | `EDUCATION` |                     |
| If 'skill' / 'skills' is in           | `full_name`, `description`, `readme` | `EDUCATION` |                     |
| If 'kit' / 'kits' is in               | `description`, `readme`              | `EDUCATION` |                     |
| If 'toolbox' is in                    | `description`                        | `GENERAL`   |                     |
| If 'extension' is in                  | `description`, `readme`              | `API`       |                     |
| If 'add amazon' is in                 | `description`, `readme`              | `API`       |                     |
| If 'aws' is in                        | `full_name`, `description`           | `API`       |                     |
| If 'integrate' / 'integration' is in  | `full_name`, `description`, `readme` | `API`       |                     |
| If 'ion' is in                        | `full_name`                          | `GENERAL`   |                     |
| If 'sample' is in                     | `full_name`, `description`, `readme` | `EDUCATION` |                     |
| If 'dataset' is in                    | `full_name`, `description`           | `GENERAL`   |                     |
| If 'demonstrate' / 'demo' is in       | `full_name`, `description`, `readme` | `EDUCATION` |                     |
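
To get a feel for how weak votes combine, it helps to look at the simplest possible combiner: a majority vote over the non-abstaining labeling functions. This is a baseline sketch and not Snorkel's generative label model, which additionally estimates how accurate each LF is. Snorkel ships its own baseline as `MajorityLabelVoter` in `snorkel.labeling.model`. The `majority_vote` helper and `L_example` matrix below are illustrative.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(label_row):
    """Return the most common non-abstain vote in one row, or ABSTAIN on no votes or a tie."""
    votes = Counter(v for v in label_row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Each row holds the votes of four hypothetical LFs for one repository
L_example = [
    [1, 1, -1, 2],     # two votes for API (1), one for EDUCATION (2)
    [-1, -1, -1, -1],  # every LF abstains
    [0, 2, -1, -1],    # tie between GENERAL (0) and EDUCATION (2)
]
combined = [majority_vote(row) for row in L_example]
print(combined)
```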


In [None]:
# If it says SDK, it is probably an API library
sdk_lfs = make_keyword_lfs(
    keywords=['sdk'],
    fields=['full_name'],
    label=API,
)

# If api is in the name... it's an API project
api_lfs = make_keyword_lfs(
    keywords=['api'],
    fields=['full_name'],
    label=API,
)

# Walkthroughs be EDUCATION
walkthrough_lfs = make_keyword_lfs(
    keywords=['walkthrough'],
    fields=['full_name', 'description', 'readme'],
    label=EDUCATION,
)

# Anything mentioning a skill is usually an Alexa skill example, of which there are many
skill_lfs = make_keyword_lfs(
    keywords=['skill', 'skills'],
    fields=['full_name', 'description', 'readme'],
    label=EDUCATION,
)

# Kits be EDUCATION
kit_lfs = make_keyword_lfs(
    keywords=['kit', 'kits'],
    fields=['description', 'readme'],
    label=EDUCATION,
)

# Toolboxes are generally GENERAL
tool_lfs = make_keyword_lfs(
    keywords=['toolbox'],
    fields=['description'],
    label=GENERAL,
)

# Extensions are APIs
extension_lfs = make_keyword_lfs(
    keywords=['extension'],
    fields=['description', 'readme'],
    label=API,
)

# Add amazon means API
add_amazon_lfs = make_keyword_lfs(
    keywords=['add amazon'],
    fields=['description', 'readme'],
    label=API,
)

# Mentioning AWS in the name or description suggests an API library
aws_lfs = make_keyword_lfs(
    keywords=['aws'],
    fields=['full_name', 'description'],
    label=API,
)

# Integrations tend to be about APIs
integration_lfs = make_keyword_lfs(
    keywords=['integrate', 'integration'],
    fields=['full_name', 'description', 'readme'],
    label=API,
)

# Ion is a major GENERAL purpose project
ion_lfs = make_keyword_lfs(
    keywords=['ion'],
    fields=['full_name'],
    label=GENERAL,
)

# Sample tends to indicate EDUCATION
sample_lfs = make_keyword_lfs(
    keywords=['sample'],
    fields=['full_name', 'description', 'readme'],
    label=EDUCATION,
)

# Datasets tend to self describe themselves :)
dataset_lfs = make_keyword_lfs(
    keywords=['dataset'],
    fields=['full_name', 'description'],
    label=GENERAL,
)

# Demos be EDUCATION
demo_lfs = make_keyword_lfs(
    keywords=['demonstrate', 'demo'],
    fields=['full_name', 'description', 'readme'],
    label=EDUCATION,
)


# Add the LFs to one large list
lfs = sdk_lfs + \
      api_lfs + \
      walkthrough_lfs + \
      skill_lfs + \
      kit_lfs + \
      tool_lfs + \
      extension_lfs + \
      add_amazon_lfs + \
      aws_lfs + \
      integration_lfs + \
      ion_lfs + \
      sample_lfs + \
      dataset_lfs + \
      demo_lfs
lfs
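
The keyword LFs above can be pictured roughly as follows. This is a minimal sketch, not the notebook's `make_keyword_lfs`: the helper `keyword_vote`, its case-insensitive substring matching, the example record and the label values are all assumptions made for illustration.

```python
# Hypothetical label values; the notebook defines its own constants earlier.
ABSTAIN = -1
API = 1


def keyword_vote(record, keywords, fields, label):
    """Return `label` if any keyword occurs in any of the given fields, else ABSTAIN."""
    for field in fields:
        # Missing or empty fields are treated as empty text
        text = (record.get(field) or '').lower()
        if any(keyword in text for keyword in keywords):
            return label
    return ABSTAIN


# Hypothetical record with the same field names the LFs use
repo = {'full_name': 'aws/aws-sdk-go', 'description': 'AWS SDK for the Go programming language.'}

print(keyword_vote(repo, keywords=['sdk'], fields=['full_name'], label=API))
print(keyword_vote(repo, keywords=['walkthrough'], fields=['full_name'], label=API))
```

If the notebook's helper also matches substrings, the `ion` keyword on `full_name` would fire on any repository name containing those letters (for example one ending in `-extension`), which is worth keeping in mind when reading the LF summary.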

## Apply the Label Functions to the Data

Since we're using Pandas, we'll use `PandasLFApplier` to run the label functions over the train and test sets. The training data will be used to train a label model, while the test set will be used both for development (seeing how our label functions do) and for evaluating the label model.

In [None]:
from snorkel.labeling import PandasLFApplier


applier  = PandasLFApplier(lfs=lfs)
L_train  = applier.apply(df=df_train)
L_test   = applier.apply(df=df_test)

## Check Overall Label Coverage

We need to check how much of the data is covered by our different labelers in aggregate.

In [None]:
from matplotlib import pyplot as plt


def plot_label_frequency(L):
    plt.hist(
        (L != ABSTAIN).sum(axis=1),
        density=True,
        bins=range(L.shape[1])
    )
    plt.xlabel("Number of labels")
    plt.ylabel("Fraction of dataset")
    plt.show()


plot_label_frequency(L_train)
plot_label_frequency(L_test)

### Interpretation

The overall label coverage looks good. Now we need to look at each label's coverage.
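
The histograms can be complemented with a few numbers. The sketch below uses a toy label matrix (`L_toy` is hypothetical, not data from this project) to show how votes per record, overall coverage and per-LF coverage fall out of the same `L != ABSTAIN` comparison used in the plot.

```python
import numpy as np

ABSTAIN = -1  # Snorkel's convention for "no vote"

# Toy label matrix: 4 records (rows) x 3 labeling functions (columns)
L_toy = np.array([
    [ 1, -1,  1],
    [-1, -1, -1],
    [ 2, -1, -1],
    [ 1,  2,  1],
])

# Number of non-abstaining LFs per record (what the histogram bins)
votes_per_record = (L_toy != ABSTAIN).sum(axis=1)

# Fraction of records that received at least one vote
overall_coverage = (votes_per_record > 0).mean()

# Fraction of records each individual LF voted on
per_lf_coverage = (L_toy != ABSTAIN).mean(axis=0)

print(votes_per_record)
print(overall_coverage)
print(per_lf_coverage)
```

The same three expressions should work unchanged on `L_train` and `L_test`.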

## Analyze the Labeling Functions' Performance

Overall label coverage is good, but we also need the LF output to be spread approximately evenly across the labels. Otherwise the label model won't have enough data to make good inferences about how the LFs relate to one another.

To help with this, we first prepare a `DataFrame` of label function names and their corresponding text labels to add to the `LFAnalysis.lf_summary` output to make it clearer what the coverage is for each label. 

In [None]:
# Prepare a name/label DataFrame to join to the LF Summary DataFrame below
lf_names = [lf.name for lf in lfs]
lf_labels = [lf._resources['label'] for lf in lfs]
lf_label_names = [{'Labels': number_to_name_dict[l]} for l in lf_labels]
label_name_df = pd.DataFrame(lf_label_names, index=lf_names)
len(label_name_df.index)

### Run our `LFAnalysis` with Labels

Now we can run the LFAnalysis, get the summary and join our label names to see a clear indication of how well we're covering each label.

In [None]:
from snorkel.labeling import LFAnalysis

# Run the LF analysis on the gold labeled data
lfa = LFAnalysis(L=L_test, lfs=lfs)
lfa_df = lfa.lf_summary(Y=y_test)

# Join the label names in because the 'Polarity' field is confusing
lfa_label_df = lfa_df.join(label_name_df)
lfa_label_df

## Determine Coverage by Label

Now we can group by the label and determine the raw count correct, incorrect and total. This will give us an idea of coverage per label.

In [None]:
g_lf_df = lfa_label_df.groupby('Labels').agg({'Correct': 'sum', 'Incorrect': 'sum'})
g_lf_df['Total LFs'] = g_lf_df['Correct'] + g_lf_df['Incorrect']
sum_total_lf = g_lf_df['Total LFs'].sum()
g_lf_df['Total LF Ratio'] = g_lf_df['Total LFs'] / sum_total_lf
g_lf_df['Total Correct Ratio'] = g_lf_df['Correct'] / sum_total_lf

g_lf_df

## Compare to Gold Label Coverage

We need to compare this LF coverage to the raw gold label coverage, which will give us an idea of the disparity between the two.

In [None]:
g_df_gold = df_gold.groupby('label').agg({'name': 'count'})

sum_total_labels = g_df_gold['name'].sum()
g_df_gold['Total Gold Label Ratio'] = g_df_gold['name'] / sum_total_labels

g_df_gold.columns = ['Total Labels', 'Total Gold Label Ratio']
g_df_gold

## Combine and Stir

Now combine them to see the disparity.

In [None]:
combined_df = g_df_gold.join(g_lf_df)
combined_df['LF / Label Ratio'] = combined_df['Total LF Ratio'] / combined_df['Total Gold Label Ratio']
combined_df['Correct LF / Label Ratio'] = combined_df['Total Correct Ratio'] / combined_df['Total Gold Label Ratio']

combined_df

### Interpretation

It looks like we are heavily over-covering `EDUCATION` and under-covering `API` and, to a lesser extent (by magnitude), `GENERAL`. Let's fix that!
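
To make the ratio easier to read, here is the same arithmetic on made-up numbers. The counts below are hypothetical and are not results from this notebook; they only illustrate that a ratio near 1 means a class gets LF votes in proportion to how often it appears in the gold labels, while values well above or below 1 indicate over- or under-coverage.

```python
# Hypothetical counts for illustration only; they are not results from this notebook.
gold_counts = {'API': 50, 'EDUCATION': 30, 'GENERAL': 20}  # gold labels per class
lf_votes = {'API': 20, 'EDUCATION': 120, 'GENERAL': 10}    # LF votes (correct + incorrect)

total_gold = sum(gold_counts.values())
total_votes = sum(lf_votes.values())

ratios = {}
for label in gold_counts:
    gold_share = gold_counts[label] / total_gold   # like 'Total Gold Label Ratio'
    vote_share = lf_votes[label] / total_votes     # like 'Total LF Ratio'
    ratios[label] = vote_share / gold_share        # like 'LF / Label Ratio'
    print(f"{label:<10} gold={gold_share:.2f} votes={vote_share:.2f} ratio={ratios[label]:.2f}")
```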

## Determine Standout Tokens Per Class

Inspect the top 5 tokens in terms of TF-IDF per record then look for standouts per class.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


v = TfidfVectorizer()
m = v.fit_transform(df_gold['readme'])
index_to_word = dict([(value, key) for key, value in v.vocabulary_.items()])

term_rows = []
for row in m.toarray():
    words = []
    for i, val in enumerate(row):
        if val > 0:
            words.append((
                val,
                index_to_word[i]
            ))
    term_rows.append(
        [y[1] for y in sorted(words, key=lambda x: x[0], reverse=True)[0:5]]
    )

df_gold['top_terms'] = term_rows
df_short = df_gold[['full_name', 'description', 'label', 'top_terms', 'readme']]
api_df = df_short[df_short['label'] == 'API']

# See the READMEs
pd.set_option('display.max_colwidth', 300)
api_df[['full_name', 'description', 'label', 'top_terms', 'readme']]

api_df.to_csv('/home/rjurney/api_df.csv')
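
The cell above finds the top terms per record. One way to look for standouts per class is to count how often each term appears among the top terms of that class. The sketch below does this with toy data; the `records` list is a hypothetical stand-in for the `label` and `top_terms` columns.

```python
from collections import Counter, defaultdict

# Toy stand-ins for the `label` and `top_terms` columns built above (hypothetical values)
records = [
    ('API', ['sdk', 'client', 'aws', 'java', 'request']),
    ('API', ['sdk', 'api', 'aws', 'python', 'client']),
    ('EDUCATION', ['workshop', 'lab', 'aws', 'step', 'sample']),
    ('EDUCATION', ['sample', 'skill', 'alexa', 'lab', 'tutorial']),
]

# One Counter per class, accumulating how often each term makes a record's top list
term_counts = defaultdict(Counter)
for label, terms in records:
    term_counts[label].update(terms)

for label, counts in term_counts.items():
    print(label, counts.most_common(3))
```

With the real data, `zip(df_gold['label'], df_gold['top_terms'])` could take the place of `records`. Terms that rank high for one class and are absent from the others are candidates for new keyword LFs.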

## New LFs for API

We need some new LFs for API, so let's look at common n-grams of consequence for that class.

In [None]:
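# 'aws sdk' in the description or README suggests an API library.
# Named aws_sdk_lfs so it does not shadow the earlier sample_lfs.
aws_sdk_lfs = make_keyword_lfs(
    keywords=['aws sdk'],
    fields=['description', 'readme'],
    label=API,
)

# Mentions of the Alexa API suggest an API project
alexa_api_lfs = make_keyword_lfs(
    keywords=['alexa api'],
    fields=['description', 'readme'],
    label=API,
)

# The Alexa Skills Kit repositories are API libraries
skills_kit_lfs = make_keyword_lfs(
    keywords=['alexa-skills-kit', 'skills-kit'],
    fields=['full_name'],
    label=API,
)

# Repositories named skill-sample-* are examples, so EDUCATION
skill_sample_lfs = make_keyword_lfs(
    keywords=['skill-sample'],
    fields=['full_name'],
    label=EDUCATION,
)

# Workshops are EDUCATION
workshop_title_lfs = make_keyword_lfs(
    keywords=['workshop'],
    fields=['full_name'],
    label=EDUCATION,
)

# Add the new LFs to the original list (workshop_title_lfs was defined but never added)
new_lfs = lfs + \
    aws_sdk_lfs + \
    alexa_api_lfs + \
    skills_kit_lfs + \
    skill_sample_lfs + \
    workshop_title_lfs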

applier  = PandasLFApplier(lfs=new_lfs)
L_train  = applier.apply(df=df_train)
L_test   = applier.apply(df=df_test)

lfa = LFAnalysis(L=L_test, lfs=new_lfs)
lfa_df = lfa.lf_summary(Y=y_test)

# Prepare a name/label DataFrame to join to the LF Summary DataFrame below
lf_names = [lf.name for lf in new_lfs]
lf_labels = [lf._resources['label'] for lf in new_lfs]
lf_label_names = [{'Labels': number_to_name_dict[l]} for l in lf_labels]
label_name_df = pd.DataFrame(lf_label_names, index=lf_names)

lfa_label_df = lfa_df.join(label_name_df)
lfa_label_df

## Majority Label Voter Baseline

A simple baseline is helpful to evaluate the performance of the `LabelModel` we're about to train. We can use a majority vote labeler to label the data for comparison.

In [None]:
from snorkel.labeling import MajorityLabelVoter


majority_model = MajorityLabelVoter(cardinality=4)
preds_train = majority_model.predict(
    L=L_train,
)
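
To make the baseline concrete, the sketch below implements a plain majority vote over a toy label matrix. It is an illustration under stated assumptions rather than Snorkel's implementation: the function `majority_vote` and the matrix `L_toy` are hypothetical, and ties are resolved by abstaining, whereas the scoring cell further down uses `tie_break_policy='random'`.

```python
import numpy as np

ABSTAIN = -1


def majority_vote(L, cardinality):
    """Return the most-voted label per row; ABSTAIN if nobody voted or the top count is tied."""
    preds = []
    for row in L:
        votes = row[row != ABSTAIN]
        # Count votes for each of the `cardinality` classes
        counts = np.bincount(votes, minlength=cardinality)
        winners = np.flatnonzero(counts == counts.max())
        if votes.size == 0 or winners.size > 1:
            preds.append(ABSTAIN)
        else:
            preds.append(int(winners[0]))
    return np.array(preds)


L_toy = np.array([
    [ 1,  1,  2],   # two votes for 1, one for 2
    [-1, -1, -1],   # no votes
    [ 0,  3, -1],   # tie between 0 and 3
    [ 2, -1, -1],   # single vote for 2
])
print(majority_vote(L_toy, cardinality=4))
```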

## Generative LabelModel

Snorkel's generative `LabelModel` learns from the agreements and disagreements among the labeling functions and combines their votes into a single, higher-quality label for each record.

In [None]:
from snorkel.labeling import LabelModel


label_model = LabelModel(cardinality=4, verbose=True)
label_model.fit(
    L_train=L_train,
    n_epochs=500,
    lr=0.001,
    log_freq=100,
    seed=31337
)

In [None]:
majority_acc = majority_model.score(L=L_test, Y=y_test, tie_break_policy='random')[
    'accuracy'
]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=y_test, tie_break_policy='random')[
    'accuracy'
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

## Inspect Errors to Improve `LabelFunctions`

In [None]:
from snorkel.analysis import get_label_buckets


# Trim the fields for figuring out problems
df_viz = df_test[['full_name', 'description', 'label']]

# Display all errors for debugging purposes
pd.set_option('display.max_rows', len(df_viz.index))


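def get_mistakes(df, probs_test, buckets, labels, label_names):
    """Return the records in one (actual, predicted) bucket, annotated with both label names."""
    idx = buckets[labels]
    # Copy so that adding columns does not write into a slice of the original DataFrame
    df_fn = df.iloc[idx].copy()
    # Probability the model assigns to its most likely class for each record
    df_fn['probability'] = probs_test[idx].max(axis=1)
    df_fn['true label'] = label_names[0]
    df_fn['predicted label'] = label_names[1]
    return df_fn


def mistakes_df(df, label_model, L_test, y_test):
    """Compute a DataFrame of all records where the model's prediction differs from the gold label."""
    out_dfs = []

    probs_test = label_model.predict_proba(L=L_test)
    # Multi-class predictions; ties are returned as ABSTAIN by default
    preds_test = label_model.predict(L=L_test)

    # Group row positions by (gold label, predicted label)
    buckets = get_label_buckets(
        y_test,
        preds_test
    )
    print(buckets)

    for (actual, predicted) in buckets.keys():

        # Only show mistakes; ABSTAIN predictions can be filtered out afterwards
        if actual != predicted:

            actual_name    = number_to_name_dict[actual]
            predicted_name = number_to_name_dict[predicted]

            out_dfs.append(
                get_mistakes(
                    df,
                    probs_test,
                    buckets=buckets,
                    labels=(actual, predicted),
                    label_names=(actual_name, predicted_name)
                )
            )

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    if out_dfs:
        return pd.concat(out_dfs)

    # No mistakes at all: return an empty frame with the expected columns
    return pd.DataFrame(
        columns=list(df.columns) + ['probability', 'true label', 'predicted label']
    )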


pd.set_option('display.max_colwidth', 200)

m_df = mistakes_df(df_viz, majority_model, L_test, y_test)
m_df.head()
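
For readers unfamiliar with `get_label_buckets`, the sketch below shows roughly what the bucketing does: it groups row positions by their (actual, predicted) pair. The function `label_buckets` and the toy labels are hypothetical and stand in for the Snorkel helper, which returns arrays of indices rather than lists.

```python
from collections import defaultdict


def label_buckets(y_true, y_pred):
    """Map each (actual, predicted) pair to the list of row positions where it occurs."""
    buckets = defaultdict(list)
    for position, pair in enumerate(zip(y_true, y_pred)):
        buckets[pair].append(position)
    return dict(buckets)


# Toy labels; -1 stands for ABSTAIN
y_true = [1, 2, 1, 3]
y_pred = [1, 1, -1, 3]

buckets = label_buckets(y_true, y_pred)

# Mistakes are the buckets whose two labels differ
mistakes = {pair: rows for pair, rows in buckets.items() if pair[0] != pair[1]}
print(mistakes)
```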

In [None]:
m_df_voted = m_df[m_df['predicted label'] != 'ABSTAIN']
m_df_voted

In [None]:
# One row per test record, one column per LF, holding the raw numeric votes
votes_df = pd.DataFrame(
    data=L_test,
    index=df_viz.index,
    columns=lf_names,
)

# Translate every numeric vote into its label name, column by column
votes_df.apply(lambda col: col.map(number_to_name_dict))

# df1 = votes_df.gt(0, 0)
# s = df1.apply(lambda x: ', '.join(x.index[x]),axis=1)
# df_viz['positive_votes'] = s
# df_viz

## Filter out Unlabeled Data Points

In [None]:
from snorkel.labeling import filter_unlabeled_dataframe


probs_test = label_model.predict_proba(L=L_test)
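# Multi-class predictions from the label model (ties become ABSTAIN by default)
preds_test = label_model.predict(L=L_test)

# Group row positions by (gold label, predicted label)
buckets = get_label_buckets(
    y_test,
    preds_test
)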
print(buckets)


df_test_filtered, probs_test_filtered = filter_unlabeled_dataframe(
    X=df_viz, y=probs_test, L=L_test
)
df_test_filtered
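
The filtering step can be pictured as a boolean mask over the label matrix: a record is kept only if at least one LF voted on it. The sketch below shows the idea on toy data; `L_toy` and `names` are hypothetical, and this is an approximation of what `filter_unlabeled_dataframe` does rather than its source.

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix: 3 records x 2 labeling functions
L_toy = np.array([
    [ 1, -1],
    [-1, -1],   # no LF voted: this row is dropped
    [-1,  2],
])
names = np.array(['repo-a', 'repo-b', 'repo-c'])  # hypothetical records

# True where at least one LF did not abstain
mask = (L_toy != ABSTAIN).any(axis=1)
print(names[mask])
```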

In [None]:
m_no_abstain_df = m_df[m_df['predicted label'] != 'ABSTAIN']
len(m_no_abstain_df.index)

In [None]:
df[df['full_name'].str.startswith('alexa/alexa-skills-kit-sdk')][['full_name', 'description']]


In [None]:
# Look at Ion libraries
df_gold[df_gold['full_name'].str.contains('ion')][['full_name', 'description']]