# Part II: Crowdsourced Sentiment Analysis with Snorkel - Training an ML Model with Snorkel for Sentiment Analysis over Unseen Tweets

In [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb) of the tutorial we saw how `Snorkel's` generative model can be used to resolve conflicts in crowdsourced answers for a sentiment analysis task. In that part we assumed that we have crowd labels for all our tweets. 

In this second part, we show how the output of `Snorkel's` generative model can provide the labeled data needed to train an LSTM that takes as input a tweet **for which no crowd labels are available** and predicts its sentiment. To emulate this setting, we split our dataset into two parts: one with tweets for which crowd labels are available, and one with tweets for which crowd labels are hidden from Snorkel.


The following tutorial is broken up into four parts, each covering a step in the pipeline:
1. Load files from Part I
2. Load and split data
3. Train an ML model with Snorkel
4. Evaluation

## Step 1: Load Files from Part I

We first load certain dataframes and pickled files from Part I. These files are required in the subsequent steps. For more details on how the files were generated please check [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [2]:
# Load dataframes from parquet files
worker_labels = spark.read.parquet("data/worker_labels.parquet")
gold_answers = spark.read.parquet("data/gold_answers.parquet")

# Load maps
import pickle
task2ObjMap = pickle.load( open( "data/task2ObjMap.pkl", "rb" ) )
obj2TaskMap = pickle.load( open( "data/obj2TaskMap.pkl", "rb" ) )
worker2LFMap = pickle.load( open( "data/worker2LFMap.pkl", "rb" ) )
lf2WorkerMap = pickle.load( open( "data/lf2WorkerMap.pkl", "rb" ) )
taskLabels = pickle.load( open( "data/taskLabels.pkl", "rb" ) )
taskLabelsMap = pickle.load( open( "data/taskLabelsMap.pkl", "rb" ) )

## Step 2: Load and Split Tweets into Train and Test Sets

In this part we show how to load the tweets into `Snorkel's` backend database and split them into a training set and a test set.

The first task we need to perform is to load the raw tweet bodies into Snorkel:

In [3]:
# Load tweet bodies in a dataframe
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
tweet_bodies = raw_crowd_answers.select("tweet_id", "tweet_body").orderBy("tweet_id").distinct()

In [4]:
# Initialize a Snorkel session
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.contrib.models.context import RawText

session = SnorkelSession()

# Define a tweet candidate
Tweet = candidate_subclass('Tweet', ['tweet_body'], values=taskLabels)

# Generate and store the tweet candidates to be classified

# We split the tweets in two sets: one for which the crowd 
# labels are not available to Snorkel (test) and one for which we assume
# crowd labels are obtained (to be used for training)
total_tweets = tweet_bodies.count()

# Take the first 10% of tweets as a test set
# The remaining 90% will be used for training
test_split = total_tweets*0.1

train_tweets = []
test_tweets = []

count = 0
for tweet_entry in tweet_bodies.collect():
    tweet_text = RawText(stable_id=tweet_entry.tweet_id, name=tweet_entry.tweet_id, text=tweet_entry.tweet_body)
    if count > test_split:
        tweet = Tweet(tweet_body=tweet_text, split=0)
        session.add(tweet)
        train_tweets.append(tweet_entry.tweet_id)
    else:
        tweet = Tweet(tweet_body=tweet_text, split=1)
        session.add(tweet)
        test_tweets.append(tweet_entry.tweet_id)
    count += 1
session.commit()
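
The split rule in the loop above can be checked in isolation. The following is a minimal sketch in plain Python, assuming only that tweets are visited in a fixed order; the helper name `split_indices` and the toy size of 20 are hypothetical and not part of the tutorial. It shows that, because the comparison is `count > test_split`, the test set receives one more item than `n_items * test_fraction` whenever that product is a whole number.

```python
def split_indices(n_items, test_fraction=0.1):
    """Return (test_idx, train_idx) using the same rule as the loop above.

    Hypothetical helper for illustration; it only mirrors the
    `count > test_split` comparison and touches no Snorkel objects.
    """
    test_split = n_items * test_fraction
    test_idx, train_idx = [], []
    for count in range(n_items):
        if count > test_split:
            train_idx.append(count)   # corresponds to split=0 (training)
        else:
            test_idx.append(count)    # corresponds to split=1 (test)
    return test_idx, train_idx


# Toy run with 20 items: test_split is 2.0, so counts 0, 1 and 2 go to test.
test_idx, train_idx = split_indices(20)
```

If an exact 10% test set matters, comparing with `count >= test_split` instead would be one way to get it, at the cost of changing which tweets land in each split.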

## Step 3: Train An ML Model with Snorkel

Now we show how to train an end-to-end model with `Snorkel` where crowdsourced labels are used to generate training data. 

First we need to train `Snorkel's` generative model over the tweets in the training set. These are the ones for which noisy crowd labels are available.

To this end, we generate the labeling matrix for Snorkel and train the corresponding generative model. Details on these steps are provided in [Part I](Crowdsourced_Sentiment_Analysis_Part1.ipynb).

In [5]:
# The labeling matrix is represented
# as a sparse scipy array

# Imports
import numpy as np
from scipy import sparse

# Initialize dimensions of labeling matrix
objects = len(train_tweets)
LFs = worker_labels.select("worker_id").distinct().count()

# Initialize empty labeling matrix
L_train = sparse.lil_matrix((objects, LFs), dtype=np.int64)

# Iterate over crowdsourced labels and populate labeling matrix
# A set gives constant-time membership checks for the training tweet ids
train_tweet_ids = set(train_tweets)
for assigned_label in worker_labels.select("worker_id", "task_id", "label").collect():
    if assigned_label.task_id in train_tweet_ids:
        # Test tweets occupy the first object ids, so shift the row index
        oid = task2ObjMap[assigned_label.task_id] - len(test_tweets)
        LFid = worker2LFMap[assigned_label.worker_id]
        label = taskLabelsMap[assigned_label.label]
        L_train[oid, LFid] = label
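
To make the row offset in the loop above concrete, here is a small sketch in plain Python. Every name and value in it (`votes`, `task_to_obj`, `worker_to_lf`, `label_to_int`, the tweet and worker ids) is hypothetical toy data, and a dictionary keyed by `(row, column)` stands in for the sparse matrix. It assumes, as in the code above, that test tweets hold the first object ids and that an unset cell means the worker gave no label.

```python
# Hypothetical toy data: (worker_id, task_id, label) triples.
votes = [
    ("w1", "t2", "Positive"),
    ("w2", "t2", "Negative"),
    ("w1", "t3", "Positive"),
    ("w2", "t0", "Negative"),  # t0 is a test tweet, so it is skipped
]
task_to_obj = {"t0": 0, "t1": 1, "t2": 2, "t3": 3}  # global object ids
worker_to_lf = {"w1": 0, "w2": 1}                   # one column per worker
label_to_int = {"Negative": 1, "Positive": 2}       # 0 is left for "no label"
test_tasks = ["t0", "t1"]
train_tasks = set(["t2", "t3"])


def build_label_matrix(votes):
    """Collect the non-empty cells of the training labeling matrix."""
    entries = {}
    for worker_id, task_id, label in votes:
        if task_id not in train_tasks:
            continue
        # Shift by the number of test tweets so training rows start at 0.
        row = task_to_obj[task_id] - len(test_tasks)
        col = worker_to_lf[worker_id]
        entries[(row, col)] = label_to_int[label]
    return entries


L_toy = build_label_matrix(votes)
```

Here tweet `t2` has global object id 2 but becomes row 0 of the training matrix, which is the same shift that `task2ObjMap[...] - len(test_tweets)` performs above.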

We now use the generated labeling matrix to train the generative model.

In [6]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)


In [7]:
# Train the generative model
gen_model.train(
    L_train,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)

Inferred cardinality: 5


The final step is to train a deep network (an LSTM) using the output of `Snorkel's` generative model as training data. The LSTM will be used to predict the sentiment of tweets for which no crowd labels are available. Training takes as input: (i) the tweet candidates in the training set, and (ii) the marginals estimated by `Snorkel's` generative model for those candidates.

In [None]:
train_marginals = gen_model.marginals(L_train)

from snorkel.contrib.rnn import textRNN

train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   200,
    'dropout':    0.2,
    'rebalance':  0.01,
    'print_freq': 5
}

lstm = textRNN(seed=1701, n_threads=None)
train_cands = session.query(Tweet).filter(Tweet.split == 0).order_by(Tweet.id).all()

lstm.train(train_cands, train_marginals, **train_kwargs)

[textRNN] Dimension=100  LR=0.01
[textRNN] Begin preprocessing
899
[textRNN] Preprocessing done (4.55s)
[textRNN] Training model
[textRNN] #examples=899  #epochs=200  batch size=256
[textRNN] Epoch 0 (2.41s)	Average loss=1.527904
[textRNN] Epoch 5 (10.57s)	Average loss=0.255238
[textRNN] Epoch 10 (19.79s)	Average loss=0.122817


## Step 4: Evaluation

Finally, we evaluate the performance of the trained LSTM model against the ground-truth labels of the tweets in the test set. We assign the final label of each tweet to be the MAP assignment given the marginal distribution returned by the LSTM.

In [None]:
# Get the MAP assignment for each test tweet
test_cands = session.query(Tweet).filter(Tweet.split == 1).order_by(Tweet.id).all()
test_marginals = lstm.marginals(test_cands)

# argmax indices are 0-based; the +1 shifts them to the indexing used by taskLabels
task_map_assignment = np.argmax(test_marginals, axis=1)
inferedLabels = {}
for i in range(len(task_map_assignment)):
    inferedLabels[obj2TaskMap[i]] = taskLabels[task_map_assignment[i]+1]
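
The MAP step above can be illustrated on a toy marginal matrix. This is a minimal sketch in plain Python, assuming (as the `+1` in the cell above suggests) that label ids start at 1 while argmax positions start at 0. The names `map_assignment`, `toy_marginals`, and `toy_labels` and their values are hypothetical.

```python
def map_assignment(marginals):
    """Return the 0-based index of the most probable class for each row."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in marginals]


# Hypothetical marginals for two tweets over three classes.
toy_marginals = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
# Hypothetical label lookup keyed from 1, mirroring the +1 offset above.
toy_labels = {1: "Negative", 2: "Neutral", 3: "Positive"}

predicted = [toy_labels[idx + 1] for idx in map_assignment(toy_marginals)]
```

For the first row the largest marginal is at position 1, which maps to label id 2 after the shift.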

In [None]:
errors = 0
total = 0
verbose = False
for trueLabel in gold_answers.select("tweet_id", "sentiment", "tweet_body").collect():
    if trueLabel.tweet_id in inferedLabels:
        total += 1
        if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
            errors += 1
            if verbose:
                print('*** Error ***')
                print('Original tweet: ' + trueLabel.tweet_body)
                print('Groundtruth label: ' + trueLabel.sentiment)
                print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
                print('\n')
print('\n*** Overall Performance Statistics ***')
print('Wrongly inferred labels: ' + str(errors) + ' out of ' + str(total))
# float() keeps the division exact under Python 2 as well
print("Accuracy of Snorkel's model = " + str(float(total - errors) / total))

**Take-away**: The trained discriminative model can be used to infer the sentiment of tweets without obtaining labels from human contributors.

**Disclaimer**: The LSTM used above is one of the simplest models and is trained on relatively few examples. Nonetheless, we see that its accuracy is comparable to that of humans, which is [expected to be around 80% for binary sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis).