# Part II: Crowdsourced Sentiment Analysis with Snorkel - Training an ML Model with Snorkel for Sentiment Analysis over Unseen Tweets

In [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb) of the tutorial we saw how `Snorkel`'s generative model can be used to resolve conflicts in crowdsourced answers for a sentiment analysis task. In that part we assumed that crowd labels were available for all our tweets.

In this second part, we will show how the output of `Snorkel`'s generative model can be used to provide the necessary labeled data for training a Logistic Regression model that takes as input a tweet **for which no crowd labels are available** and predicts its sentiment. To emulate this setting we split our dataset into two parts: one with tweets for which crowd labels are available and one with tweets for which crowd labels are hidden from Snorkel.


The following tutorial is broken up into four parts, each covering a step in the pipeline:
1. Load files from Part I
2. Load and featurize tweets with Snorkel
3. Train an ML model with Snorkel
4. Evaluation

## Step 1: Load Files from Part I

We first load certain dataframes and pickled files from Part I. These files are required in the subsequent steps. For more details on how the files were generated please check [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [2]:
# Load dataframes from parquet files
worker_labels = spark.read.parquet("data/worker_labels.parquet")
gold_answers = spark.read.parquet("data/gold_answers.parquet")

# Load maps
import pickle

def load_pickle(path):
    # Open in binary mode and close the file handle once loading is done
    with open(path, "rb") as f:
        return pickle.load(f)

task2ObjMap = load_pickle("data/task2ObjMap.pkl")
obj2TaskMap = load_pickle("data/obj2TaskMap.pkl")
worker2LFMap = load_pickle("data/worker2LFMap.pkl")
lf2WorkerMap = load_pickle("data/lf2WorkerMap.pkl")
taskLabels = load_pickle("data/taskLabels.pkl")
taskLabelsMap = load_pickle("data/taskLabelsMap.pkl")

## Step 2: Load and Featurize Tweets

In this part we show how to use `Snorkel` to featurize the tweets.

The first task we need to perform is to load the raw tweet bodies into Snorkel:

In [3]:
# Load tweet bodies in a dataframe
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
tweet_bodies = raw_crowd_answers.select("tweet_id", "tweet_body").orderBy("tweet_id").distinct()

In [28]:
# Initialize a Snorkel session
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.contrib.models.context import RawText

session = SnorkelSession()

# Define a tweet candidate
Tweet = candidate_subclass('Tweet', ['tweet_body'], values=taskLabels)

# Generate and store the tweet candidates to be classified

# We split the tweets in two sets: one for which the crowd 
# labels are available to Snorkel and one for which we assume
# no crowd labels
total_tweets = tweet_bodies.count()
train_split = total_tweets*0.6

train_tweets = []
test_tweets = []

count = 0
for tweet_entry in tweet_bodies.collect():
    tweet_text = RawText(stable_id=tweet_entry.tweet_id, name=tweet_entry.tweet_id, text=tweet_entry.tweet_body)
    if count < train_split:
        tweet = Tweet(tweet_body=tweet_text, split=0)
        session.add(tweet)
        train_tweets.append(tweet_entry.tweet_id)
    else:
        tweet = Tweet(tweet_body=tweet_text, split=1)
        session.add(tweet)
        test_tweets.append(tweet_entry.tweet_id)
    count += 1
session.commit()
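
The positional 60/40 split used above can be sketched without Spark or Snorkel. This is only an illustration: `split_ids` is a hypothetical helper and the tweet ids are made up. It mirrors the `count < train_split` comparison in the loop above.

```python
def split_ids(ids, train_fraction=0.6):
    """Split an ordered list of ids into (train, test) by position.

    Hypothetical helper for illustration; it applies the same
    `position < cutoff` rule as the loop in the notebook cell.
    """
    cutoff = len(ids) * train_fraction
    train = [x for i, x in enumerate(ids) if i < cutoff]
    test = [x for i, x in enumerate(ids) if i >= cutoff]
    return train, test


# Made-up ids, assumed to be already sorted
example_ids = ["t01", "t02", "t03", "t04", "t05"]
train_ids, test_ids = split_ids(example_ids)
print(train_ids)
print(test_ids)
```

Because the split is positional, it is only reproducible if the ids arrive in a stable order.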


Now we will generate simple bag-of-words features that will be used by the Logistic Regression model. To do this we will use Snorkel's `FeatureAnnotator` class. All we need to provide as input to that class is a simple user-defined function (UDF) that takes a candidate as input and returns its bag-of-words features.

In [6]:
# We define a UDF that parses the body of a candidate (a tweet here)
# and returns the tokens of the tweet body as features
def bow_feature_generator(c):
    for tok in c.get_contexts()[0].text.split():
        yield tok, 1

# We now use the FeatureAnnotator provided by Snorkel
# to generate features for all candidates.
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator(f=bow_feature_generator)

%time F_train = featurizer.apply(split=0)
print(F_train.shape)

# Note: apply() builds a fresh feature key set on each call, so F_train and
# F_test can end up with different widths; Snorkel's apply_existing(split=1)
# reuses the key set created for the training split.
%time F_test = featurizer.apply(split=1)
print(F_test.shape)

Clearing existing...
Running UDF...

CPU times: user 5.47 s, sys: 151 ms, total: 5.63 s
Wall time: 6.15 s
(600, 3639)
Clearing existing...
Running UDF...

CPU times: user 3.99 s, sys: 129 ms, total: 4.12 s
Wall time: 4.23 s
(400, 2641)
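
As a rough, library-free illustration of bag-of-words featurization, the sketch below builds a vocabulary from training texts and maps any text to a count vector over that fixed vocabulary. The function names (`build_vocab`, `featurize`) and the sample texts are hypothetical and not part of Snorkel. Tokens that never appeared in training are dropped, which keeps train and test vectors the same width.

```python
def build_vocab(texts):
    """Map each whitespace-separated token seen in `texts` to a column index."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def featurize(text, vocab):
    """Return a dense count vector for `text` over a fixed vocabulary."""
    row = [0] * len(vocab)
    for tok in text.split():
        if tok in vocab:  # tokens unseen during training are ignored
            row[vocab[tok]] += 1
    return row


# Made-up example texts
train_texts = ["sunny day today", "rainy day"]
vocab = build_vocab(train_texts)
train_rows = [featurize(t, vocab) for t in train_texts]
test_row = featurize("sunny sunny weekend", vocab)
print(train_rows)
print(test_row)
```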


## Step 3: Train An ML Model with Snorkel

Now we show how to train an end-to-end model with `Snorkel` where crowdsourced labels are used to generate training data. 

First we need to train `Snorkel`'s generative model over the tweets in the training set. These are the ones for which noisy crowd labels are available.

To this end, we generate the labeling matrix for Snorkel and train the corresponding generative model. Details on these steps are provided in [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [9]:
# The labeling matrix is represented
# as a sparse scipy array

# Imports
import numpy as np
from scipy import sparse

# Initialize dimensions of labeling matrix
objects = len(train_tweets)
LFs = worker_labels.select("worker_id").distinct().count()

# Initialize empty labeling matrix
L_train = sparse.lil_matrix((objects, LFs), dtype=np.int64)

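# Iterate over crowdsourced labels and populate labeling matrix.
# A set makes the membership test O(1) instead of scanning the list.
train_tweet_ids = set(train_tweets)
for assigned_label in worker_labels.select("worker_id", "task_id", "label").collect():
    if assigned_label.task_id in train_tweet_ids: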
        oid = task2ObjMap[assigned_label.task_id]
        LFid = worker2LFMap[assigned_label.worker_id]
        label = taskLabelsMap[assigned_label.label]
        L_train[oid, LFid] = label
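
To make the layout of the labeling matrix concrete, here is a toy version with one row per tweet and one column per worker. The ids, the index maps, and the two-label scheme are made up for illustration. A dense NumPy array is used instead of a sparse matrix to keep the sketch short, and 0 is assumed to mean "this worker did not label this tweet", matching the entries left unfilled in the cell above.

```python
import numpy as np

# Hypothetical toy data: (worker_id, task_id, label) triples
votes = [
    ("w1", "t1", "Positive"),
    ("w2", "t1", "Negative"),
    ("w1", "t2", "Negative"),
    ("w3", "t2", "Negative"),
]

# Index maps analogous to task2ObjMap, worker2LFMap and taskLabelsMap
task_to_row = {"t1": 0, "t2": 1}
worker_to_col = {"w1": 0, "w2": 1, "w3": 2}
label_to_int = {"Positive": 1, "Negative": 2}  # 0 is reserved for "no label"

# Rows are tweets, columns are workers
L = np.zeros((len(task_to_row), len(worker_to_col)), dtype=np.int64)
for worker_id, task_id, label in votes:
    L[task_to_row[task_id], worker_to_col[worker_id]] = label_to_int[label]

print(L)
```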

We now use the generated labeling matrix to train the generative model.

In [10]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)




In [11]:
# Train the generative model
gen_model.train(
    L_train,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)

Inferred cardinality: 5
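
To give a feel for what marginals look like (one probability per class for each tweet), here is a deliberately simplified stand-in that normalizes vote counts per row of a labeling matrix. This is a plain vote-share baseline, not Snorkel's generative model, which additionally learns how accurate each worker is. The name `vote_marginals` and the toy matrix are hypothetical.

```python
import numpy as np


def vote_marginals(L, cardinality):
    """Turn a labeling matrix into per-row class probabilities by vote share.

    Entries of L are in 0..cardinality, where 0 means "no label".
    Rows without any vote get a uniform distribution.
    """
    L = np.asarray(L)
    counts = np.zeros((L.shape[0], cardinality), dtype=float)
    for k in range(1, cardinality + 1):
        counts[:, k - 1] = (L == k).sum(axis=1)
    totals = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / cardinality)
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
    return np.where(totals > 0, counts / safe_totals, uniform)


# Made-up labeling matrix: 3 tweets, 3 workers, 2 classes
L_toy = [[1, 2, 1], [0, 0, 0], [2, 2, 0]]
marginals = vote_marginals(L_toy, cardinality=2)
print(marginals)
```

Each row sums to 1, so the result can be used as a set of probabilistic training labels in the same way as `train_marginals` below.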


The final step is to train a Logistic Regression model to predict the sentiment of each tweet. The model takes as input: (i) the generated features, and (ii) the marginals estimated by `Snorkel`'s generative model.

In [None]:
from snorkel.learning import SparseLogisticRegression
disc_model_sparse = SparseLogisticRegression()
train_marginals = gen_model.marginals(L_train)

# Alternative setting with l2_penalty=0.01, marked by the author as the better one:
#disc_model_sparse.train(F_train, train_marginals, n_epochs=4000, lr=0.001,
#        batch_size=500, l2_penalty=0.01, print_freq=100)

disc_model_sparse.train(F_train, train_marginals, n_epochs=4000, lr=0.001,
        batch_size=500, l2_penalty=0.1, print_freq=100)

[SparseLR] lr=0.001 l1=0.0 l2=0.1
[SparseLR] Building model
[SparseLR] Training model
[SparseLR] #examples=600  #epochs=4000  batch size=500
[SparseLR] Epoch 0 (1.19s)	Avg. loss=1.768709	NNZ=18195
[SparseLR] Epoch 100 (3.52s)	Avg. loss=0.579655	NNZ=18195
[SparseLR] Epoch 200 (5.33s)	Avg. loss=0.371713	NNZ=18195
[SparseLR] Epoch 300 (7.11s)	Avg. loss=0.305302	NNZ=18195
[SparseLR] Epoch 400 (8.72s)	Avg. loss=0.274311	NNZ=18195
[SparseLR] Epoch 500 (10.51s)	Avg. loss=0.256409	NNZ=18195
[SparseLR] Epoch 600 (12.48s)	Avg. loss=0.244749	NNZ=18195
[SparseLR] Epoch 700 (14.39s)	Avg. loss=0.236602	NNZ=18195
[SparseLR] Epoch 800 (16.32s)	Avg. loss=0.230642	NNZ=18195
[SparseLR] Epoch 900 (18.37s)	Avg. loss=0.226133	NNZ=18195
[SparseLR] Epoch 1000 (20.21s)	Avg. loss=0.222629	NNZ=18195
[SparseLR] Epoch 1100 (22.01s)	Avg. loss=0.219845	NNZ=18195
[SparseLR] Epoch 1200 (23.95s)	Avg. loss=0.217594	NNZ=18195
[SparseLR] Epoch 1300 (26.89s)	Avg. loss=0.215750	NNZ=18195
[SparseLR] Epoch 1400 (29.83s)	Avg. 
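
Training on marginals means fitting the classifier to probabilistic rather than hard labels. The NumPy sketch below shows one common way to do that: multinomial logistic regression trained by gradient descent on the cross-entropy against soft targets, with an optional L2 penalty. It is an illustration under these assumptions, not the internals of `SparseLogisticRegression`; the function names, data, and hyperparameters are made up.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_soft_logreg(X, P, n_epochs=500, lr=0.5, l2_penalty=0.0):
    """Fit W and b so that softmax(X W + b) approximates the soft labels P."""
    n, d = X.shape
    k = P.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(n_epochs):
        Q = softmax(X.dot(W) + b)
        # Gradient of the mean cross-entropy with respect to the logits
        grad_logits = (Q - P) / n
        W -= lr * (X.T.dot(grad_logits) + l2_penalty * W)
        b -= lr * grad_logits.sum(axis=0)
    return W, b


# Toy data: two indicator features, two classes, probabilistic targets
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
W, b = train_soft_logreg(X, P)
pred = softmax(X.dot(W) + b)
print(pred.round(2))
```

Rows with identical features receive the same prediction, so the model settles near the average of their soft targets instead of committing to a single hard label.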

## Step 4: Evaluation

Finally, we evaluate the performance of the Logistic Regression model trained with `Snorkel` against the ground-truth labels of the tweets in the test set. We assign each tweet the MAP label, i.e., the most probable label under the marginal distribution returned by the Logistic Regression model.

In [54]:
# Get MAP assignment for each task
test_marginals = disc_model_sparse.marginals(F_test)
task_map_assignment = np.argmax(test_marginals, axis=1)
inferedLabels = {}
for i in range(len(task_map_assignment)):
    inferedLabels[obj2TaskMap[i+len(train_tweets)]] =  taskLabels[task_map_assignment[i]+1]    
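
The MAP step reduces to a row-wise `argmax` followed by a label lookup. A small sketch with made-up marginals and a hypothetical label table keyed by 1-based class index, mirroring the `+1` offset in the cell above:

```python
import numpy as np

# Hypothetical label table keyed by 1-based class index
labels = {1: "Positive", 2: "Negative", 3: "Neutral"}

# Made-up marginals: one row per tweet, one column per class
marginals = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# argmax returns 0-based column indices, hence the +1 when looking up labels
map_idx = np.argmax(marginals, axis=1)
map_labels = [labels[int(i) + 1] for i in map_idx]
print(map_labels)
```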

In [55]:
errors = 0
total = 0
for trueLabel in gold_answers.select("tweet_id", "sentiment", "tweet_body").collect():
    # Only score tweets that received a predicted label (the test split)
    if trueLabel.tweet_id in inferedLabels:
        total += 1.0
        if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
            errors += 1
            print('*** Error ***')
            print('Original tweet: ' + trueLabel.tweet_body)
            print('Groundtruth label: ' + trueLabel.sentiment)
            print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
            print('\n')
print('\n*** Overall Performance Statistics ***')
print('Wrongly inferred labels: ' + str(errors) + ' out of ' + str(total))
# Guard against an empty test set before dividing
if total > 0:
    print("Accuracy of Snorkel's model = " + str((total - errors) / total))

*** Error ***
Original tweet: US GAS: Warm-Weather Forecasts Lift Natural Gas Futures {link}
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Positive


*** Error ***
Original tweet: Fire Weather Watch issued May 17 at 4:21PM CDT expiring May 19 at 9:00PM CDT by NWS Lubbock... {link}
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Tweet not related to weather condition


*** Error ***
Original tweet: soo I pressed my way out to bible study in this weather! truly a #sacrifice
Groundtruth label: Negative
Snorkel label: Tweet not related to weather condition


*** Error ***
Original tweet: Nothing like sirens going off in the area to add a sense of immediacy to the weather reports.  At least the worst has passed to the north.
Groundtruth label: Negative
Snorkel label: Positive


*** Error ***
Original tweet: Sunny skies over by KC Library, Plaza branch. But ominous skies over Downtown KC. Sirens still ringing. {link}
Groundtruth

**Take-away**: As shown above, the performance of the trained discriminative model is the same as that of the generative model. The big difference between the two is that the discriminative model can label new tweets without the additional cost of obtaining crowd labels.