# Part 0: Crowdsourced Sentiment Analysis with Snorkel - Resolving Conflicts

In this part of the tutorial, we will walk through the process of using `Snorkel` to resolve conflicts in crowdsourced answers for a sentiment analysis task. The following tutorial is broken up into four core parts and a bonus part. Each part covers a step in the pipeline:
1. Preprocessing
2. Construction of a Snorkel Labeling Matrix
3. Conflict Resolution
4. Evaluation
5. Bonus: Comparison against Majority Vote

In this notebook, we preprocess the data collected by the crowd contributors using [Spark SQL and Dataframes](https://spark.apache.org/docs/latest/sql-programming-guide.html).

## Step 0: Sentiment Analysis of Tweets

In this tutorial we focus on the [Weather sentiment](https://www.crowdflower.com/data/weather-sentiment/) task from [Crowdflower](https://www.crowdflower.com/).

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather. Contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet. An additional job was then run in which 10 contributors graded the original sentiment evaluation. Because so many contributors graded each tweet, in many cases they assigned conflicting sentiment labels to the same tweet.
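
To make the notion of a conflict concrete, the sketch below tallies the labels that several contributors gave to one tweet, reports whether they disagree, and picks the most frequent label. The votes are made-up illustrative values, and `summarize_votes` is a hypothetical helper, not part of Snorkel or of the dataset tooling.

```python
from collections import Counter

def summarize_votes(votes):
    """Return (has_conflict, plurality_label, counts) for one tweet's votes."""
    counts = Counter(votes)
    # More than one distinct label means the contributors disagree
    has_conflict = len(counts) > 1
    # most_common(1) returns the (label, count) pair with the highest count
    plurality_label = counts.most_common(1)[0][0]
    return has_conflict, plurality_label, counts

# Hypothetical votes for a single tweet (not taken from the dataset)
example_votes = ["Positive", "Positive", "Negative", "Positive", "I can't tell"]
has_conflict, plurality_label, counts = summarize_votes(example_votes)
print(has_conflict, plurality_label, dict(counts))
```

A simple vote count like this treats every contributor as equally reliable, which is the kind of baseline the bonus part compares against.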


The task comes with two data files, found in the `data` directory of the tutorial:
1. [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) contains the raw contributor answers for each of the 1,000 tweets.
2. [weather-evaluated-agg-DFE.csv](data/weather-evaluated-agg-DFE.csv) contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

**GOAL:** The goal of this tutorial is to demonstrate how `Snorkel` can be used to accurately infer a single sentiment label for each tweet, thus, denoising the collected contributor answers.

## Step 1: Preprocessing - Data Loading with Spark SQL and Dataframes

First, we initialize a `SparkSession`, which manages a connection to a local Spark master. We use it to preprocess the raw data and convert them to the format that `Snorkel` requires:

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

We can now load the raw data for our crowdsourcing task (stored in a local CSV file) into a dataframe.

In [2]:
# Load Raw Crowdsourcing Data
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
raw_crowd_answers.printSchema()

# Load Groundtruth Crowdsourcing Data
gold_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-evaluated-agg-DFE.csv")
gold_crowd_answers.createOrReplaceTempView("gold_crowd_answers")
gold_answers = spark.sql("SELECT tweet_id, sentiment, tweet_body FROM gold_crowd_answers WHERE correct_category ='Yes' and correct_category_conf = 1").orderBy("tweet_id")

# Keep Only the Tweets with Available Groundtruth
candidate_labeled_tweets = raw_crowd_answers.join(
    gold_answers, raw_crowd_answers.tweet_id == gold_answers.tweet_id
).select(
    raw_crowd_answers.tweet_id,
    raw_crowd_answers.tweet_body,
    raw_crowd_answers.worker_id,
    raw_crowd_answers.emotion,
)

root
 |-- _unit_id_: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- trust: string (nullable = true)
 |-- worker_id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- region: string (nullable = true)
 |-- city: string (nullable = true)
 |-- emotion: string (nullable = true)
 |-- tweet_id: string (nullable = true)
 |-- tweet_body: string (nullable = true)
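
The loading cell above filters the gold file with the `WHERE` clause and then keeps only the raw answers whose `tweet_id` survives that filter. As a rough illustration of that logic outside Spark, the following sketch does the same with plain Python dictionaries. The rows are invented for the example; only the column names come from the queries above.

```python
# Invented rows that mimic the columns used in the Spark queries
gold_rows = [
    {"tweet_id": "1", "sentiment": "Positive",
     "correct_category": "Yes", "correct_category_conf": "1"},
    {"tweet_id": "2", "sentiment": "Negative",
     "correct_category": "Yes", "correct_category_conf": "0.6"},
    {"tweet_id": "3", "sentiment": "Negative",
     "correct_category": "No", "correct_category_conf": "1"},
]
raw_rows = [
    {"tweet_id": "1", "worker_id": "a", "emotion": "Positive"},
    {"tweet_id": "1", "worker_id": "b", "emotion": "Negative"},
    {"tweet_id": "2", "worker_id": "a", "emotion": "Negative"},
    {"tweet_id": "3", "worker_id": "c", "emotion": "Positive"},
]

# Mirror the WHERE clause: correct_category = 'Yes' and correct_category_conf = 1
gold_by_tweet = {
    row["tweet_id"]: row["sentiment"]
    for row in gold_rows
    if row["correct_category"] == "Yes"
    and float(row["correct_category_conf"]) == 1.0
}

# Mirror the inner join: keep only raw answers whose tweet has a gold label
candidate_rows = [row for row in raw_rows if row["tweet_id"] in gold_by_tweet]

print(sorted(gold_by_tweet))
print(len(candidate_rows))
```

Note that the CSV columns are loaded as strings, which is why the sketch converts the confidence value before comparing it.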



As mentioned above, contributors can provide conflicting labels for the same tweet:

In [3]:
candidate_labeled_tweets.select("worker_id", "emotion", "tweet_body").orderBy("tweet_id").show()

+---------+--------------------+--------------------+
|worker_id|             emotion|          tweet_body|
+---------+--------------------+--------------------+
|  6498214|        I can't tell|I dunno which ass...|
|  7450342|Neutral / author ...|I dunno which ass...|
| 10752241|            Positive|I dunno which ass...|
| 10235355|            Negative|I dunno which ass...|
| 17475684|            Negative|I dunno which ass...|
|  6346694|Neutral / author ...|I dunno which ass...|
| 14806909|Neutral / author ...|I dunno which ass...|
| 19028457|            Positive|I dunno which ass...|
|  6737418|            Negative|I dunno which ass...|
| 14584835|            Negative|I dunno which ass...|
| 18381123|Neutral / author ...|I dunno which ass...|
| 16498372|Tweet not related...|I dunno which ass...|
|  7012325|            Positive|I dunno which ass...|
|  9333400|            Negative|I dunno which ass...|
| 10379699|            Positive|I dunno which ass...|
| 14298198|            Posit

## Generate Snorkel Candidates

In [4]:
# Extract worker votes
worker_labels = candidate_labeled_tweets.selectExpr("tweet_id as task_id", "worker_id", "emotion as label")
worker_labels.show()

+--------+---------+--------------------+
| task_id|worker_id|               label|
+--------+---------+--------------------+
|82846118| 18034918|Neutral / author ...|
|82510997| 18034918|            Positive|
|83271279| 18034918|            Negative|
|80058872| 18034918|            Positive|
|79188429| 18034918|Neutral / author ...|
|82838136| 18034918|            Positive|
|82680080| 18034918|            Positive|
|83258314| 18034918|Neutral / author ...|
|80052652| 18034918|Tweet not related...|
|83257167| 18034918|Neutral / author ...|
|84312699| 18034918|            Negative|
|84034743| 18034918|            Negative|
|84319723| 18034918|Neutral / author ...|
|83259244| 18034918|            Positive|
|84047815| 18034918|Neutral / author ...|
|83268482| 18034918|            Positive|
|82845437| 18034918|            Positive|
|82844107| 18034918|Neutral / author ...|
|83255526| 18034918|            Positive|
|82675594| 18034918|Tweet not related...|
+--------+---------+--------------

In [5]:
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.contrib.models.context import RawText

session = SnorkelSession()

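# Collect the distinct crowd labels; False is kept as the first entry
taskLabels = [False]
taskLabels.extend([str(i.label) for i in worker_labels.select("label").distinct().collect()])
# Map each label to its index in taskLabels
taskLabelsMap = {label: idx for idx, label in enumerate(taskLabels)}
# Note: this replaces the list computed above with the one stored in data/taskLabels.pkl
import pickle
with open("data/taskLabels.pkl", "rb") as f:
    taskLabels = pickle.load(f)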

tweet_bodies = candidate_labeled_tweets.select("tweet_id", "tweet_body").orderBy("tweet_id").distinct()
Tweet = candidate_subclass('Tweet', ['tweet_body'], values=taskLabels)

# Generate and store the tweet candidates to be classified

# We split the tweets in two sets: one for which the crowd 
# labels are not available to Snorkel (test) and one for which we assume
# crowd labels are obtained (to be used for training)
total_tweets = tweet_bodies.count()

# Take the first 10% of tweets as a test set
# The remaining 90% will be used for training
test_split = total_tweets*0.1

train_tweets = []
test_tweets = []
train_tweet_true_labels = []
test_tweet_true_labels = []


count = 0
for tweet_entry in tweet_bodies.collect():
    tweet_text = RawText(stable_id=tweet_entry.tweet_id, name=tweet_entry.tweet_id, text=tweet_entry.tweet_body)
    if count > test_split:
        tweet = Tweet(tweet_body=tweet_text, split=0)
        session.add(tweet)
        train_tweets.append(tweet_entry.tweet_id)
        train_tweet_true_labels.append(gold_answers.select("sentiment").filter("tweet_id == "+str(tweet_entry.tweet_id)).collect()[0].sentiment)
    else:
        tweet = Tweet(tweet_body=tweet_text, split=1)
        session.add(tweet)
        test_tweets.append(tweet_entry.tweet_id)
        test_tweet_true_labels.append(gold_answers.select("sentiment").filter("tweet_id == "+str(tweet_entry.tweet_id)).collect()[0].sentiment)
    count += 1
session.commit()
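
The split in the loop above is decided by comparing a running counter with `total_tweets*0.1`. The sketch below replays only that comparison on a hypothetical count of 1,000 tweets to show how many items land on each side. Because the comparison uses `>` and the counter starts at 0, the test set receives one tweet more than exactly 10%. `split_sizes` is an illustrative helper and is not part of the notebook.

```python
def split_sizes(total_tweets, test_fraction=0.1):
    """Replay the counter-based split and return (n_train, n_test)."""
    test_split = total_tweets * test_fraction
    n_train = 0
    n_test = 0
    for count in range(total_tweets):
        if count > test_split:
            n_train += 1  # corresponds to split=0 (training)
        else:
            n_test += 1   # corresponds to split=1 (test)
    return n_train, n_test

# Hypothetical total; the real value is tweet_bodies.count()
n_train, n_test = split_sizes(1000)
print(n_train, n_test)
```

If an exact 10% test set matters, comparing with `>=` instead would be one way to get it, at the cost of changing which tweets fall in each split.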

In [8]:
train_cands = session.query(Tweet).filter(Tweet.split == 0).order_by(Tweet.id).all()

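def label_generator(c):
    # Yield one (worker_id, label, 1) tuple per crowd vote on candidate c
    labels = worker_labels.select("worker_id","label").filter("task_id == "+str(c.tweet_body.name))
    for entry in labels.collect():
        yield entry.worker_id, entry.label, 1

# Print the worker votes for the first 10 training tweets
for i in range(10):
    for lab in label_generator(train_cands[i]):
        print("{} {}".format(lab[0], lab[1]))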

6346694 Positive
11426608 Positive
18112936 Positive
18465660 Positive
6737418 Positive
17475684 Negative
10752241 Negative
14584835 Positive
14391185 Positive
14806909 Negative
19028457 Positive
8939802 Positive
16498372 Positive
11040334 Positive
6363996 Positive
20043586 Positive
9040042 Positive
16573689 Positive
20095709 Positive
16846915 Positive
6346694 Positive
7450342 Neutral / author is just sharing information
6332651 Neutral / author is just sharing information
6737418 Positive
17475684 Neutral / author is just sharing information
14472526 Neutral / author is just sharing information
14400603 Positive
19028457 Negative
18354072 Positive
11040334 Neutral / author is just sharing information
14298198 Positive
17594578 Positive
6344001 Positive
16498372 Positive
13763729 Neutral / author is just sharing information
16846915 I can't tell
10235355 Neutral / author is just sharing information
16738677 Positive
19376841 Neutral / author is just sharing information
7860247 I can't 