# Part II: Crowdsourced Sentiment Analysis with Snorkel - Training an ML Model with Snorkel

In [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb) of the tutorial we saw how Snorkel's generative model can be used to resolve conflicts in crowdsourced answers for a sentiment analysis task. In this second part, we show how the output of Snorkel's generative model provides the labeled data needed to train a Logistic Regression model that takes a tweet as input and predicts its sentiment. The tutorial is broken up into four steps, each covering one stage of the pipeline:
1. Load files from Part I
2. Train Snorkel's generative model
3. Featurize tweets and train a Logistic Regression model with Snorkel
4. Evaluation

## Step 1: Load Files from Part I

We first load certain dataframes and pickled files from Part I. These files are required in the subsequent steps. For more details on how the files were generated please check [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [2]:
# Load dataframes from parquet files
worker_labels = spark.read.parquet("data/worker_labels.parquet")
gold_answers = spark.read.parquet("data/gold_answers.parquet")

# Load maps
import pickle

def load_pickle(path):
    # Open the file in a context manager so the handle is always closed
    with open(path, "rb") as f:
        return pickle.load(f)

task2ObjMap = load_pickle("data/task2ObjMap.pkl")
obj2TaskMap = load_pickle("data/obj2TaskMap.pkl")
worker2LFMap = load_pickle("data/worker2LFMap.pkl")
lf2WorkerMap = load_pickle("data/lf2WorkerMap.pkl")
taskLabels = load_pickle("data/taskLabels.pkl")
taskLabelsMap = load_pickle("data/taskLabelsMap.pkl")

## Step 2: Train Snorkel's Generative Model

We now generate the labeling matrix for Snorkel and train the corresponding generative model. Details on these steps are provided in [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [3]:
# The labeling matrix is represented
# as a sparse scipy array

# Imports
import numpy as np
from scipy import sparse

# Initialize dimensions of labeling matrix
objects = worker_labels.select("task_id").distinct().count()
LFs = worker_labels.select("worker_id").distinct().count()

# Initialize empty labeling matrix
L = sparse.lil_matrix((objects, LFs), dtype=np.int64)

# Iterate over crowdsourced labels and populate labeling matrix
for assigned_label in worker_labels.select("worker_id", "task_id", "label").collect():
    oid = task2ObjMap[assigned_label.task_id]
    LFid = worker2LFMap[assigned_label.worker_id]
    label = taskLabelsMap[assigned_label.label]
    L[oid, LFid] = label
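
To make the shape of the labeling matrix concrete, here is a minimal standalone sketch on toy data. The triples and the three index maps are hypothetical stand-ins for `worker_labels`, `task2ObjMap`, `worker2LFMap`, and `taskLabelsMap`; it assumes labels are mapped to positive integers, so an entry of 0 means the worker gave no label for that task.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: (worker_id, task_id, label) triples
toy_labels = [
    ("w1", "t1", "Positive"),
    ("w2", "t1", "Negative"),
    ("w1", "t2", "Neutral"),
]

# Hypothetical index maps (stand-ins for the maps loaded from Part I)
task_to_row = {"t1": 0, "t2": 1}
worker_to_col = {"w1": 0, "w2": 1}
label_to_int = {"Positive": 1, "Negative": 2, "Neutral": 3}

# One row per task, one column per worker (each worker acts as a labeling function)
L_toy = sparse.lil_matrix((len(task_to_row), len(worker_to_col)), dtype=np.int64)
for worker_id, task_id, label in toy_labels:
    L_toy[task_to_row[task_id], worker_to_col[worker_id]] = label_to_int[label]

dense = L_toy.toarray()
print(dense)
```

Worker `w2` never labeled task `t2`, so that cell keeps the sparse default of 0.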

In [4]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)

In [17]:
# Train the generative model
gen_model.train(
    L,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)

Inferred cardinality: 5


## Step 3: Featurize Tweets and Train a Logistic Regression Model with Snorkel

In this part of the tutorial we show how to use the output of Snorkel's generative model to train a discriminative model (here a Logistic Regression model) that classifies the sentiment of the available tweets.

The first task we need to perform is to load the raw tweet bodies into Snorkel:

In [18]:
# Load tweet bodies in a dataframe
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
# Deduplicate first, then sort: distinct() does not preserve an earlier ordering
tweet_bodies = raw_crowd_answers.select("tweet_id", "tweet_body").distinct().orderBy("tweet_id")

In [19]:
# Initialize a Snorkel session
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.contrib.models.context import RawText

session = SnorkelSession()

# Define a tweet candidate
Tweet = candidate_subclass('Tweet', ['tweet_body'], values=taskLabels)

# Generate and store the tweet candidates to be classified
for row in tweet_bodies.collect():
    tweet_text = RawText(stable_id=row.tweet_id, name=row.tweet_id, text=row.tweet_body)
    candidate = Tweet(tweet_body=tweet_text, split=0)
    session.add(candidate)
session.commit()


Now we will generate simple bag-of-words features that will be used by the Logistic Regression model. To do this we will use Snorkel's `FeatureAnnotator` class. All we need to provide as input to that class is a simple user-defined function (UDF) that takes a candidate as input and returns its bag-of-words features.

In [None]:
# We define a UDF that parses the body of a candidate (a tweet here)
# and returns the tokens of the tweet body as features
def bow_feature_generator(c):
    for tok in c.get_contexts()[0].text.split():
        yield tok, 1

# We now use the FeatureAnnotator provided by Snorkel
# to generate features for all candidates.
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator(f=bow_feature_generator)

%time F_train = featurizer.apply(split=0)
F_train
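
The UDF above can be tried outside Snorkel on a plain string. The sketch below mirrors `bow_feature_generator` but takes raw text instead of a candidate; `bow_features` and `toy_tweet` are hypothetical names used only for illustration. Note that a repeated token is emitted once per occurrence; how `FeatureAnnotator` combines such duplicates is not shown here.

```python
from collections import Counter

def bow_features(text):
    """Yield a (token, 1) pair for every whitespace-separated token."""
    for tok in text.split():
        yield tok, 1

# Hypothetical tweet body
toy_tweet = "sunny day sunny mood"

pairs = list(bow_features(toy_tweet))
# Count how often each token was emitted by the generator
counts = Counter(tok for tok, _ in pairs)

print(pairs)
print(counts)
```

Splitting on whitespace keeps punctuation and casing attached to tokens, so "Sunny!" and "sunny" would be distinct features under this scheme.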

The final step is to train a Logistic Regression model to predict the sentiment of each tweet. The corresponding model takes as input: (i) the generated features, and (ii) the marginals estimated by Snorkel's generative model.

In [None]:
from snorkel.learning import SparseLogisticRegression

disc_model_sparse = SparseLogisticRegression()

# Probabilistic training labels estimated by the generative model
train_marginals = gen_model.marginals(L)

# An alternative hyperparameter setting (disabled):
#disc_model_sparse.train(F_train, train_marginals, n_epochs=2000, lr=0.001,
#        batch_size=800, l2_penalty=0.1, print_freq=100)

# Train the discriminative model on the features and the marginals
disc_model_sparse.train(F_train, train_marginals, n_epochs=4000, lr=0.0001,
        batch_size=500, l2_penalty=0.001, print_freq=100)
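
Training on marginals rather than hard labels means the targets are probability distributions over classes. The following NumPy sketch illustrates that idea with a plain softmax regression fitted by full-batch gradient descent on the cross-entropy against soft labels. It is an illustration under stated assumptions, not Snorkel's implementation of `SparseLogisticRegression`; the function names, the toy data, and the hyperparameter defaults are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_soft_label_logreg(X, marginals, n_epochs=500, lr=0.5, l2_penalty=0.0):
    """Fit softmax regression to probabilistic labels (rows of `marginals` sum to 1)."""
    n, d = X.shape
    k = marginals.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(n_epochs):
        P = softmax(X.dot(W) + b)
        # Gradient of the mean cross-entropy with respect to the logits
        G = (P - marginals) / n
        W -= lr * (X.T.dot(G) + l2_penalty * W)
        b -= lr * G.sum(axis=0)
    return W, b

# Hypothetical toy data: two one-hot features, two classes
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
marginals = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])

W, b = train_soft_label_logreg(X, marginals)
probs = softmax(X.dot(W) + b)
pred = probs.argmax(axis=1)
print(pred)
```

Because the targets are soft, the fitted probabilities for rows that share the same features settle near the average of their marginals instead of being pushed toward 0 or 1.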

## Step 4: Evaluation

Finally, we evaluate the performance of the end-to-end Snorkel model against the ground-truth labels. As in [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb), we assign the final label of each tweet to be the MAP assignment given the marginal distribution returned by the Logistic Regression model.

In [None]:
# Get MAP assignment for each task
test_marginals = disc_model_sparse.marginals(F_train)
task_map_assignment = np.argmax(test_marginals, axis=1)
inferedLabels = {}
for i in range(len(task_map_assignment)):
    inferedLabels[obj2TaskMap[i]] =  taskLabels[task_map_assignment[i]+1]
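
The MAP step can be checked in isolation on a small array. In the sketch below, the marginals and both maps are hypothetical; in particular, the structure of `taskLabels` is defined in Part I and not shown here, so the 1-indexed dictionary is only an assumption chosen to match the `+1` offset used above.

```python
import numpy as np

# Hypothetical marginals: one row per tweet, one column per class
toy_marginals = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
])

# Hypothetical 1-indexed label map and row-to-tweet map
toy_task_labels = {1: "Negative", 2: "Neutral", 3: "Positive"}
toy_obj_to_task = {0: "tweet_a", 1: "tweet_b"}

# The MAP assignment is the column with the highest probability in each row
map_idx = np.argmax(toy_marginals, axis=1)

# Shift the 0-based column index by one to look up the 1-indexed label
toy_inferred = {
    toy_obj_to_task[i]: toy_task_labels[int(k) + 1]
    for i, k in enumerate(map_idx)
}
print(toy_inferred)
```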

In [None]:
errors = 0
total = float(gold_answers.count())
for trueLabel in gold_answers.select("tweet_id","sentiment","tweet_body").collect():
    if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
        errors += 1
        print('*** Error ***')
        print('Original tweet: ' + trueLabel.tweet_body)
        print('Groundtruth label: ' + trueLabel.sentiment)
        print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
        print('\n')
print('\n*** Overall Performance Statistics ***')
print('Wrongly inferred labels: ' + str(errors) + ' out of ' + str(total))
print("Accuracy of Snorkel's model = " + str((total - errors) / total))
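
The accuracy computation can also be written as a small reusable function over plain dictionaries. This is a sketch with hypothetical names and toy data; unlike the loop above, it counts a tweet with no prediction as an error instead of raising a `KeyError`.

```python
def accuracy(gold, predicted):
    """Return the fraction of gold-labeled items whose predicted label matches."""
    if not gold:
        raise ValueError("gold must contain at least one item")
    correct = sum(
        1 for item_id, label in gold.items()
        if predicted.get(item_id) == label
    )
    return correct / float(len(gold))

# Hypothetical gold and predicted labels keyed by tweet id
toy_gold = {
    "tweet_a": "Neutral",
    "tweet_b": "Positive",
    "tweet_c": "Negative",
    "tweet_d": "Negative",
}
toy_pred = {
    "tweet_a": "Neutral",
    "tweet_b": "Negative",
    "tweet_c": "Negative",
    "tweet_d": "Negative",
}

acc = accuracy(toy_gold, toy_pred)
print(acc)
```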