# Part I: Crowdsourced Sentiment Analysis with Snorkel - Resolving Conflicts

In this part of the tutorial, we will walk through the process of using `Snorkel` to resolve conflicts in crowdsourced answers for a sentiment analysis task. The following tutorial is broken up into four core parts and a bonus part. Each part covers a step in the pipeline:
1. Preprocessing
2. Construction of a Snorkel Labeling Matrix
3. Conflict Resolution
4. Evaluation
5. Bonus: Comparison against Majority Vote

In this notebook, we preprocess the data collected by the crowd contributors using [Spark SQL and Dataframes](https://spark.apache.org/docs/latest/sql-programming-guide.html).

## Step 0: Sentiment Analysis of Tweets

In this tutorial we focus on the [Weather sentiment](https://www.crowdflower.com/data/weather-sentiment/) task from [Crowdflower](https://www.crowdflower.com/).

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather. A follow-up job then asked 10 contributors to grade the original sentiment evaluation. For the original grading, contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet. Thus, in many cases contributors assigned conflicting sentiment labels to the same tweet. 


The task comes with two data files (found in the `data` directory of the tutorial):
1. [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) contains the raw contributor answers for each of the 1,000 tweets.
2. [weather-evaluated-agg-DFE.csv](data/weather-evaluated-agg-DFE.csv) contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

**GOAL:** The goal of this tutorial is to demonstrate how `Snorkel` can be used to accurately infer a single sentiment label for each tweet, thereby denoising the collected contributor answers.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

## Step 3: Conflict Resolution

Until now we have converted the raw crowdsourced data into a labeling matrix that can be provided as input to `Snorkel`. We will now show how to:

1. Use `Snorkel's` generative model to learn the accuracy of each crowd contributor.
2. Use the learned model to estimate a marginal distribution over the domain of possible labels for each task.
3. Use the estimated marginal distribution to obtain the maximum a posteriori probability estimate for the label that each task takes.
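
The three steps above can be sketched on a toy example. The snippet below is not Snorkel's implementation: it assumes the worker accuracies are already known (Snorkel learns them from the labeling matrix), a uniform prior over labels, independent workers, and errors spread evenly over the wrong labels. The function `label_marginals`, the worker names, and the accuracy values are all hypothetical.

```python
import math


def label_marginals(votes, accuracies, cardinality):
    """Posterior over labels 1..cardinality for a single task.

    votes: dict worker -> label (workers who abstained are simply absent).
    accuracies: dict worker -> assumed probability that the worker is correct.
    """
    log_scores = []
    for candidate in range(1, cardinality + 1):
        score = 0.0
        for worker, label in votes.items():
            acc = accuracies[worker]
            # A wrong answer is assumed equally likely to be any other label.
            p = acc if label == candidate else (1.0 - acc) / (cardinality - 1)
            score += math.log(p)
        log_scores.append(score)
    # Normalize in log space to avoid underflow with many workers.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Step 1 (assumed done): hypothetical per-worker accuracies.
accuracies = {"w1": 0.95, "w2": 0.55, "w3": 0.55}
# One task where the most accurate worker disagrees with the other two.
votes = {"w1": 1, "w2": 2, "w3": 2}

# Step 2: marginal distribution over the five labels.
marginals = label_marginals(votes, accuracies, cardinality=5)
# Step 3: MAP estimate (labels are 1-based).
map_label = 1 + marginals.index(max(marginals))
print(map_label)
```

In this toy case the single high-accuracy worker outweighs the two near-chance workers, so the MAP label is 1 even though label 2 received more votes.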

### Reloading the label matrix

In [2]:
from snorkel.annotations import load_label_matrix
L_train = load_label_matrix(session, split=0)
L_train

<568x102 sparse matrix of type '<type 'numpy.float64'>'
	with 11360 stored elements in Compressed Sparse Row format>
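
As a quick sanity check on the reported shape, the arithmetic below relates the number of stored elements to the 20 contributors per tweet mentioned in Step 0. It assumes that rows correspond to tasks, columns to workers, and that every stored element is one non-abstain vote; the variable names are illustrative.

```python
# Shape and stored-element count as reported for L_train above.
n_tasks, n_workers = 568, 102
n_stored = 11360

# Average number of worker labels per task.
labels_per_task = n_stored / float(n_tasks)
# Fraction of (task, worker) pairs that carry a label.
density = n_stored / float(n_tasks * n_workers)

print("labels per task: %.1f" % labels_per_task)
print("fraction of labeled entries: %.3f" % density)
```

Under these assumptions each task carries exactly 20 labels, and roughly one fifth of the matrix entries are filled.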

### Importing and training a Snorkel generative model

First we import and initialize `Snorkel's` generative model.

In [3]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)

Then we train `Snorkel's` generative model by passing as input the labeling matrix that corresponds to the crowdsourced data.

In [None]:
# Train the generative model
gen_model.train(
    L_train,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)

Inferred cardinality: 5.0


### Inferring the marginal distribution
The following command uses the labeling matrix and the learned generative model to estimate the marginal distribution over the domain of possible labels for each task.

In [None]:
task_marginals = gen_model.marginals(L_train)

### Inferring the MAP assignment for each task
Each task corresponds to an independent random variable. Thus, we can simply associate each task with the most probable label under the estimated marginal distribution.

In [None]:
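# Get the MAP assignment for each task: the index of the largest marginal in each row
task_map_assignment = np.argmax(task_marginals, axis=1)
# Map each row index back to its tweet, and each 0-based argmax to a 1-based label key
inferedLabels = {}
for i in range(len(task_map_assignment)):
    inferedLabels[obj2TaskMap[i]] = taskLabels[task_map_assignment[i] + 1]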
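
The index arithmetic in this step is easy to get wrong, so here is a minimal standalone sketch of it. The marginals and the `task_labels` map are made up for illustration (the real `taskLabels` map is built earlier in the tutorial and may order the categories differently); the only point is that a 0-based argmax position is shifted by one to reach a 1-based label key.

```python
def argmax(values):
    """Index of the largest value; the first one wins on ties."""
    return max(range(len(values)), key=lambda k: values[k])


# Hypothetical label map keyed 1..5, using the categories from Step 0.
task_labels = {
    1: "Positive",
    2: "Negative",
    3: "I can't tell",
    4: "Neutral / author is just sharing information",
    5: "Tweet not related to weather condition",
}

# Made-up marginals for two tasks; each row sums to 1.
toy_marginals = [
    [0.70, 0.10, 0.05, 0.10, 0.05],
    [0.05, 0.15, 0.05, 0.60, 0.15],
]

# Shift the 0-based argmax by one to obtain the 1-based label key.
toy_map = [task_labels[argmax(row) + 1] for row in toy_marginals]
print(toy_map)
```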

## Step 4: Evaluation

We now evaluate how accurately `Snorkel's` model identifies the correct label for each task by fusing the labels provided by different crowd contributors. To do so, we compare the MAP label assigned to each task against the provided ground truth data.

In [None]:
# Extract ground truth per tweet_id
gold_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-evaluated-agg-DFE.csv")
gold_crowd_answers.createOrReplaceTempView("gold_crowd_answers")
gold_answers = spark.sql("SELECT tweet_id, sentiment, tweet_body FROM gold_crowd_answers WHERE correct_category ='Yes' and correct_category_conf = 1")

In [None]:
errors = 0
total = float(gold_answers.count())
for trueLabel in gold_answers.select("tweet_id","sentiment","tweet_body").collect():
    # Report every tweet whose inferred label disagrees with the gold label
    if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
        errors += 1
        print('*** Error ***')
        print('Original tweet: ' + trueLabel.tweet_body)
        print('Ground truth label: ' + trueLabel.sentiment)
        print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
        print('\n')
print('\n*** Overall Performance Statistics ***')
print('Wrongly inferred labels: ' + str(errors) + ' out of ' + str(total))
print("Accuracy of Snorkel's model = " + str((total - errors) / total))

We store certain dataframes and maps generated during the previous steps to persistent files. These will be used in [Part 2](Crowdsourced_Sentiment_Analysis_Part2.ipynb) of the tutorial.

In [None]:
# Save dataframes as parquet files
worker_labels.write.parquet("data/worker_labels.parquet", mode="overwrite")
gold_answers.write.parquet("data/gold_answers.parquet", mode="overwrite")

# Save maps as pickle files, closing each file after it is written
import pickle
maps_to_save = {
    "task2ObjMap": task2ObjMap,
    "obj2TaskMap": obj2TaskMap,
    "worker2LFMap": worker2LFMap,
    "lf2WorkerMap": lf2WorkerMap,
    "taskLabels": taskLabels,
    "taskLabelsMap": taskLabelsMap,
}
for name, obj in maps_to_save.items():
    with open("data/" + name + ".pkl", "wb") as f:
        pickle.dump(obj, f)
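
For reference, reading such a file back is the mirror image of writing it. The sketch below round-trips a small stand-in dictionary through a temporary directory; `toy_map` and the file name are hypothetical and unrelated to the tutorial's actual data files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one of the maps saved above.
toy_map = {0: "tweet_a", 1: "tweet_b"}

# Write to a temporary directory so the example leaves the data folder untouched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_map.pkl")

with open(path, "wb") as f:
    pickle.dump(toy_map, f)

# Files written in binary mode must also be read in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_map)
```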

## Bonus: Comparison against Majority Vote

As a bonus we evaluate majority voting against `Snorkel`. Given that we have 20 contributors per task (and most of them are better than chance) **we expect majority voting to perform extremely well**. However, as shown below, **Snorkel's model, which estimates the accuracy of each worker, makes fewer mistakes and achieves a higher accuracy.** Specifically, Majority Vote makes 9 mistakes versus the 3 mistakes that Snorkel makes.

In [None]:
# Majority vote evaluation
taskMVassignment = {}
# Rows of the label matrix are tasks (objects); columns are workers (LFs)
objects, LFs = L_train.shape
for i in range(objects):
    # Count the votes for each label on task i; 0 means the worker did not label it
    objectVotes = {}
    for j in range(LFs):
        label = int(L_train[i, j])
        if label != 0:
            objectVotes[label] = objectVotes.get(label, 0) + 1
    # Pick the label with the most votes
    assignedLabel = max(objectVotes, key=objectVotes.get)
    taskMVassignment[obj2TaskMap[i]] = taskLabels[assignedLabel]
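
The same vote counting can be written compactly with `collections.Counter`. The sketch below is a standalone illustration on made-up rows rather than the tutorial's label matrix; it treats 0 as an abstain and, as a design choice of this example only, breaks ties in favor of the smallest label so that the result does not depend on iteration order. The helper name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(row):
    """Most frequent non-zero label in a row, or None if nobody voted."""
    counts = Counter(label for label in row if label != 0)
    if not counts:
        return None
    best = max(counts.values())
    # Deterministic tie-break: prefer the smallest label.
    return min(label for label, n in counts.items() if n == best)


toy_rows = [
    [1, 1, 2, 0, 0],  # clear majority for label 1
    [2, 0, 4, 4, 2],  # tie between labels 2 and 4
    [0, 0, 0, 0, 0],  # every worker abstained
]
toy_votes = [majority_vote(row) for row in toy_rows]
print(toy_votes)
```

With 20 votes spread over five labels, ties are possible, so it is worth deciding explicitly how they are resolved.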

In [None]:
errors = 0
total = float(gold_answers.count())
for trueLabel in gold_answers.select("tweet_id","sentiment","tweet_body").collect():
    # Report every tweet whose majority-vote label disagrees with the gold label
    if str(trueLabel.sentiment) != str(taskMVassignment[trueLabel.tweet_id]):
        errors += 1
        print('*** Error ***')
        print('Original tweet: ' + trueLabel.tweet_body)
        print('Ground truth label: ' + trueLabel.sentiment)
        print('MV label: ' + taskMVassignment[trueLabel.tweet_id])
        print('\n')
print('\n*** Overall Performance Statistics ***')
print('Wrongly inferred labels: ' + str(errors) + ' out of ' + str(total))
print('Accuracy of Majority Voting model = ' + str((total - errors) / total))