# Part II: Crowdsourced Sentiment Analysis with Snorkel - Training an ML Model with Snorkel

In [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb) of the tutorial we saw how Snorkel's generative model can be used to resolve conflicts in crowdsourced answers for a sentiment analysis task. In this second part, we show how the output of Snorkel's generative model can provide the labeled data needed to train a logistic regression model that takes a tweet as input and predicts the associated sentiment. The tutorial is broken up into five parts, each covering a step in the pipeline:
1. Load files from Part I
2. Train Snorkel's generative model
3. Load and featurize the tweet bodies
4. Train a logistic regression model with Snorkel
5. Evaluation

## Step 1: Load Files from Part I

We first load certain dataframes and pickled files from Part I. These files are required in the subsequent steps. For more details on how the files were generated please check [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [3]:
# Load dataframes from parquet files
worker_labels = spark.read.parquet("data/worker_labels.parquet")
gold_answers = spark.read.parquet("data/gold_answers.parquet")

# Load maps
import pickle

def load_pickle(path):
    # Open in binary mode and close the file handle once the object is loaded
    with open(path, "rb") as f:
        return pickle.load(f)

task2ObjMap = load_pickle("data/task2ObjMap.pkl")
obj2TaskMap = load_pickle("data/obj2TaskMap.pkl")
worker2LFMap = load_pickle("data/worker2LFMap.pkl")
lf2WorkerMap = load_pickle("data/lf2WorkerMap.pkl")
taskLabels = load_pickle("data/taskLabels.pkl")
taskLabelsMap = load_pickle("data/taskLabelsMap.pkl")

## Step 2: Train Snorkel's Generative Model

We now generate the labeling matrix for Snorkel and train the corresponding generative model. Details on these steps are provided in [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [7]:
# The labeling matrix is represented
# as a sparse scipy array

# Imports
import numpy as np
from scipy import sparse

# Initialize dimensions of labeling matrix
objects = worker_labels.select("task_id").distinct().count()
LFs = worker_labels.select("worker_id").distinct().count()

# Initialize empty labeling matrix
L = sparse.lil_matrix((objects, LFs), dtype=np.int64)

# Iterate over crowdsourced labels and populate labeling matrix
for assigned_label in worker_labels.select("worker_id", "task_id", "label").collect():
    oid = task2ObjMap[assigned_label.task_id]
    LFid = worker2LFMap[assigned_label.worker_id]
    label = taskLabelsMap[assigned_label.label]
    L[oid, LFid] = label
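
To make the layout of the labeling matrix concrete, here is a minimal sketch on toy data, written for Python 3. The worker IDs, task IDs, and index maps below are hypothetical stand-ins for `worker2LFMap`, `task2ObjMap`, and `taskLabelsMap`; in particular, the choice of 1-based label codes with 0 meaning "no label" is an assumption, not something confirmed by the files from Part I.

```python
import numpy as np
from scipy import sparse

# Hypothetical crowd answers: (worker_id, task_id, label)
answers = [
    ("w1", "t1", "Positive"),
    ("w2", "t1", "Positive"),
    ("w1", "t2", "Negative"),
    ("w3", "t2", "Neutral"),
]

# Hypothetical index maps (stand-ins for task2ObjMap, worker2LFMap, taskLabelsMap)
task_index = {"t1": 0, "t2": 1}
worker_index = {"w1": 0, "w2": 1, "w3": 2}
label_index = {"Negative": 1, "Neutral": 2, "Positive": 3}  # assumed: 0 = no label

# One row per task, one column per worker (each worker acts as a labeling function)
L_toy = sparse.lil_matrix((len(task_index), len(worker_index)), dtype=np.int64)
for worker_id, task_id, label in answers:
    L_toy[task_index[task_id], worker_index[worker_id]] = label_index[label]

dense = L_toy.toarray()
print(dense)
```

Each row collects the answers for one task, and entries left at 0 mark workers who did not label that task. Because most workers label only a few tasks, a sparse format keeps the matrix small.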

In [8]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)

In [9]:
# Train the generative model
gen_model.train(
    L,
    reg_type=2,
    reg_param=0.01,
    epochs=10
)

Inferred cardinality: 5


## Step 3: Load and Featurize Tweets

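We now load the tweet bodies and featurize them. The resulting feature matrix, together with the marginal distribution over the possible labels that the generative model estimates for each task from the labeling matrix, is used to train the logistic regression model in Step 4.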

In [5]:
# Load tweet bodies in a dataframe
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
tweet_bodies = raw_crowd_answers.select("tweet_id", "tweet_body").distinct()

In [8]:
# Featurize each tweet
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()

%time F_train = featurizer.apply(split=0)
F_train
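
As a rough illustration of what a sparse feature matrix over tweets looks like, the sketch below builds a simple bag-of-words representation in Python 3. This is a swapped-in technique for illustration only: it is not what `FeatureAnnotator` computes, and the tweets, `vocab`, and `F_toy` are hypothetical names and data.

```python
from collections import Counter
from scipy import sparse

# Hypothetical tweet bodies
tweets = ["sunny day today", "rain rain again", "sunny again"]

vocab = {}                      # token -> column index, grown as tokens appear
rows, cols, vals = [], [], []
for row, text in enumerate(tweets):
    # Count each whitespace-separated token in the tweet
    for token, count in Counter(text.lower().split()).items():
        col = vocab.setdefault(token, len(vocab))
        rows.append(row)
        cols.append(col)
        vals.append(count)

# One row per tweet, one column per distinct token
F_toy = sparse.csr_matrix((vals, (rows, cols)), shape=(len(tweets), len(vocab)))
print(F_toy.toarray())
```

The result has the same general form as `F_train`: one row per training example and one column per feature, stored sparsely because each tweet uses only a few of the features.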

## Step 4: Train a Logistic Regression Model with Snorkel

In [None]:
# Obtain marginals for tweets from generative model
train_marginals = gen_model.marginals(L)

# Import logistic regression model
from snorkel.learning import SparseLogisticRegression

# Init model
disc_model_sparse = SparseLogisticRegression()

# Train model
disc_model_sparse.train(F_train, train_marginals, n_epochs=20, lr=0.001)
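
The key idea in this step is that the classifier is trained on probabilistic labels (the marginals) rather than on hard labels. The following NumPy sketch shows one way to do that for a small multinomial logistic regression, by minimizing cross-entropy against the soft targets with plain gradient descent. It is a simplified illustration under assumed toy data, not Snorkel's implementation; `softmax` and `train_soft_label_logreg` are hypothetical helper names.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_soft_label_logreg(X, marginals, n_epochs=200, lr=0.5):
    """Fit weights by gradient descent on cross-entropy against soft labels."""
    n, d = X.shape
    k = marginals.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(n_epochs):
        probs = softmax(X @ W + b)
        # Gradient of the mean cross-entropy with respect to the logits
        grad = (probs - marginals) / n
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy features (4 examples, 2 features) and toy marginals over 2 labels
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
marginals = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])

W, b = train_soft_label_logreg(X, marginals)
probs = softmax(X @ W + b)
pred = probs.argmax(axis=1)
print(pred)
```

Training against the marginals lets uncertain examples contribute less sharply than confident ones, which is why the generative model's output is passed to the classifier as a full distribution instead of a single label per tweet.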

## Step 5: Evaluation

We now evaluate the accuracy of Snorkel's model at identifying the correct label for each task by fusing the labels provided by different crowd contributors. To do this, we compare the MAP (maximum a posteriori) label assigned to each task against the provided ground truth data.
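
The comparison described here can be sketched on toy data as follows (Python 3). The MAP label is taken as the label with the highest marginal probability for each task. The label names, tweet IDs, and the assumption that column `k` of the marginals matrix corresponds to `label_names[k]` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical label names; column k of `marginals` is assumed to match label_names[k]
label_names = ["Negative", "Neutral", "Positive"]

# Toy marginals: one row per tweet, one column per label
marginals = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
tweet_ids = ["t1", "t2", "t3"]

# MAP label = label with the highest marginal probability
map_idx = marginals.argmax(axis=1)
inferred = {tid: label_names[i] for tid, i in zip(tweet_ids, map_idx)}

# Hypothetical ground truth
gold = {"t1": "Negative", "t2": "Positive", "t3": "Positive"}

def accuracy(inferred, gold):
    # Fraction of gold-labeled tweets whose inferred label matches
    correct = sum(1 for tid, label in gold.items() if inferred.get(tid) == label)
    return correct / len(gold)

acc = accuracy(inferred, gold)
print(inferred, acc)
```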

In [27]:
# Extract ground truth per tweet_id
gold_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-evaluated-agg-DFE.csv")
gold_crowd_answers.createOrReplaceTempView("gold_crowd_answers")
gold_answers = spark.sql("SELECT tweet_id, sentiment, tweet_body FROM gold_crowd_answers WHERE correct_category ='Yes' and correct_category_conf = 1")

In [28]:
# inferedLabels is assumed to map each tweet_id to the MAP label inferred by
# Snorkel; it is not defined in the cells above.
errors = 0
total = float(gold_answers.count())
for trueLabel in gold_answers.select("tweet_id", "sentiment", "tweet_body").collect():
    if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
        errors += 1
        print('*** Error ***')
        print('Original tweet: ' + trueLabel.tweet_body)
        print('Groundtruth label: ' + trueLabel.sentiment)
        print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
        print('\n')
# Double quotes keep the apostrophe in "Snorkel's" intact
print("Overall accuracy of Snorkel's model = {}".format((total - errors) / total))

*** Error ***
Original tweet: BLOG: Another Day, Another Round Of Thunderstorms {link}
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: RT @mention: It'll be sunny again by time you land RT @mention: Every time I think about going back to Portland it starts ra ...
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: #sunshine
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Positive


*** Error ***
Original tweet: This hot, dry, & windy weather is going to turn the #canola fast. Keep a close eye on it if you plan on swathing or pushing. #okanola
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: @mention It's supposed to go up to 70 today. Sunshine early. A slight chance for a shower or t-storm later. Going to my Aunt's house later.
Groundtruth label: N