# Part 0: Crowdsourced Sentiment Analysis with Snorkel - Resolving Conflicts

In this part of the tutorial, we will walk through the process of using `Snorkel` to resolve conflicts in crowdsourced answers for a sentiment analysis task. The following tutorial is broken up into four core parts and a bonus part. Each part covers a step in the pipeline:
1. Preprocessing
2. Construction of a Snorkel Labeling Matrix
3. Conflict Resolution
4. Evaluation
5. Bonus: Comparison against Majority Vote

In this notebook, we preprocess the data collected by the crowd contributors using [Spark SQL and Dataframes](https://spark.apache.org/docs/latest/sql-programming-guide.html).

## Step 0: Sentiment Analysis of Tweets

In this tutorial we focus on the [Weather sentiment](https://www.crowdflower.com/data/weather-sentiment/) task from [Crowdflower](https://www.crowdflower.com/).

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather. Contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet, so in many cases contributors assigned conflicting sentiment labels to the same tweet. An additional job was then run in which 10 contributors graded the original sentiment evaluation.
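To make the notion of a conflict concrete, here is a minimal sketch in plain Python that counts the labels each tweet received and reports the tweets with more than one distinct label. The triples and the helper names (`label_counts_per_tweet`, `conflicting_tweets`) are hypothetical and are not taken from the dataset or from Snorkel.

```python
from collections import Counter, defaultdict

# Hypothetical (tweet_id, worker_id, label) triples; not taken from the dataset.
answers = [
    ("t1", "w1", "Positive"),
    ("t1", "w2", "Negative"),
    ("t1", "w3", "Positive"),
    ("t2", "w1", "Negative"),
    ("t2", "w3", "Negative"),
]


def label_counts_per_tweet(triples):
    """Count how often each label was assigned to each tweet."""
    counts = defaultdict(Counter)
    for tweet_id, _worker_id, label in triples:
        counts[tweet_id][label] += 1
    return counts


def conflicting_tweets(triples):
    """Return the sorted ids of tweets that received more than one distinct label."""
    counts = label_counts_per_tweet(triples)
    return sorted(t for t, c in counts.items() if len(c) > 1)


print(conflicting_tweets(answers))
```

In this toy input only `t1` is reported, because its three workers disagree while both workers agree on `t2`.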


The task comes with two data files (found in the `data` directory of the tutorial):
1. [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) contains the raw contributor answers for each of the 1,000 tweets.
2. [weather-evaluated-agg-DFE.csv](data/weather-evaluated-agg-DFE.csv) contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

**GOAL:** The goal of this tutorial is to demonstrate how `Snorkel` can be used to accurately infer a single sentiment label for each tweet, thereby denoising the collected contributor answers.
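For reference, the simplest way to collapse many answers into one label is a majority vote, which is the baseline named in the bonus part of the outline. The sketch below is a plain-Python illustration under our own assumptions: the function name `majority_vote` and the alphabetical tie-breaking rule are ours and are not part of `Snorkel`.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the alphabetically first label."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Hypothetical answers for a single tweet
print(majority_vote(["Positive", "Negative", "Positive"]))
```

A majority vote treats every contributor as equally reliable, which is the assumption the tutorial's conflict-resolution step is meant to relax.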

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

## Step 0: Preprocessing - Data Loading with Spark SQL and Dataframes

First, we initialize a `SparkSession`, which manages a connection to a local Spark master. We use it to preprocess the raw data and convert it to the format `Snorkel` requires:

In [2]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

We can now load the raw data for our crowdsourcing task (stored in a local csv file) into a dataframe. 

In [3]:
# Load Raw Crowdsourcing Data
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
raw_crowd_answers.printSchema()

# Load Groundtruth Crowdsourcing Data
gold_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-evaluated-agg-DFE.csv")
gold_crowd_answers.createOrReplaceTempView("gold_crowd_answers")
gold_answers = spark.sql("SELECT tweet_id, sentiment, tweet_body FROM gold_crowd_answers WHERE correct_category ='Yes' and correct_category_conf = 1").orderBy("tweet_id")

# Keep Only the Tweets with Available Groundtruth
candidate_labeled_tweets = raw_crowd_answers.join(gold_answers, raw_crowd_answers.tweet_id == gold_answers.tweet_id).select(raw_crowd_answers.tweet_id,raw_crowd_answers.tweet_body,raw_crowd_answers.worker_id,raw_crowd_answers.emotion)

root
 |-- _unit_id_: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- trust: string (nullable = true)
 |-- worker_id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- region: string (nullable = true)
 |-- city: string (nullable = true)
 |-- emotion: string (nullable = true)
 |-- tweet_id: string (nullable = true)
 |-- tweet_body: string (nullable = true)
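The filter and join in the cell above can be sketched in plain Python to show what they keep. The rows below are made up for illustration; only the column names (`tweet_id`, `worker_id`, `emotion`, `sentiment`, `correct_category`, `correct_category_conf`) come from the query and schema above, and the helper names are hypothetical.

```python
# Hypothetical rows standing in for the two CSV files; the values are made up.
raw_rows = [
    {"tweet_id": "1", "worker_id": "w1", "emotion": "Positive"},
    {"tweet_id": "1", "worker_id": "w2", "emotion": "Negative"},
    {"tweet_id": "2", "worker_id": "w1", "emotion": "Negative"},
]
gold_rows = [
    {"tweet_id": "1", "sentiment": "Positive",
     "correct_category": "Yes", "correct_category_conf": "1"},
    {"tweet_id": "2", "sentiment": "Negative",
     "correct_category": "No", "correct_category_conf": "0.6"},
]


def trusted_gold(rows):
    """Keep gold rows whose category was confirmed with full confidence."""
    return {
        r["tweet_id"]: r["sentiment"]
        for r in rows
        if r["correct_category"] == "Yes"
        and float(r["correct_category_conf"]) == 1.0
    }


def keep_labeled(raw, gold):
    """Inner join: keep only crowd answers whose tweet has a trusted gold label."""
    return [r for r in raw if r["tweet_id"] in gold]


gold = trusted_gold(gold_rows)
candidates = keep_labeled(raw_rows, gold)
print(len(candidates))
```

Both answers for tweet `1` survive, while the answer for tweet `2` is dropped because its gold row does not pass the filter.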



As mentioned above, contributors can provide conflicting labels for the same tweet:

In [4]:
candidate_labeled_tweets.select("worker_id", "emotion", "tweet_body").orderBy("tweet_id").show()

+---------+--------------------+--------------------+
|worker_id|             emotion|          tweet_body|
+---------+--------------------+--------------------+
|  6498214|        I can't tell|I dunno which ass...|
|  7450342|Neutral / author ...|I dunno which ass...|
| 10752241|            Positive|I dunno which ass...|
| 10235355|            Negative|I dunno which ass...|
| 17475684|            Negative|I dunno which ass...|
|  6346694|Neutral / author ...|I dunno which ass...|
| 14806909|Neutral / author ...|I dunno which ass...|
| 19028457|            Positive|I dunno which ass...|
|  6737418|            Negative|I dunno which ass...|
| 14584835|            Negative|I dunno which ass...|
| 18381123|Neutral / author ...|I dunno which ass...|
| 16498372|Tweet not related...|I dunno which ass...|
|  7012325|            Positive|I dunno which ass...|
|  9333400|            Negative|I dunno which ass...|
| 10379699|            Positive|I dunno which ass...|
| 14298198|            Posit

## Step 1: Generate Snorkel Candidates

We'll start by generating a set of Snorkel `Candidate` objects representing the tweets. `Candidate` objects in Snorkel just represent the objects we wish to classify. All `Candidate` objects point to one or more `Context` objects; in this case, our candidates will each point to a single `Context` object representing the raw text of the tweet.

In [5]:
from snorkel.models import candidate_subclass

# We create a Candidate subclass, Tweet, which has one argument,
# tweet, pointing to the raw text of the tweet, and which can take
# on one of the label values collected in `values`
values = [
    r.emotion
    for r in candidate_labeled_tweets.select("emotion").distinct().collect()
]
Tweet = candidate_subclass('Tweet', ['tweet'], values=values)

In [6]:
from snorkel.models import RawText, Context, Candidate

# Make sure DB is cleared
session.query(Context).delete()
session.query(Candidate).delete()

# Now we create the candidates with a simple loop
tweet_bodies = candidate_labeled_tweets \
    .select("tweet_id", "tweet_body") \
    .distinct() \
    .orderBy("tweet_id")

# Generate and store the tweet candidates to be classified
# Note: We split the tweets in two sets: one for which the crowd 
# labels are not available to Snorkel (test, 10%) and one for which we assume
# crowd labels are obtained (to be used for training, 90%)
total_tweets = tweet_bodies.count()
test_split = total_tweets*0.1
for i, t in enumerate(tweet_bodies.collect()):
    split = 1 if i <= test_split else 0
    raw_text = RawText(stable_id=t.tweet_id, name=t.tweet_id, text=t.tweet_body)
    tweet = Tweet(tweet=raw_text, split=split)
    session.add(tweet)
session.commit()
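The split rule in the loop above can be checked in isolation. The sketch below reproduces the same comparison (`i <= total * 0.1`) in a hypothetical helper, `assign_splits`, and applies it to 632 items, which is the total implied by the 568 training and 64 test labels reported at the end of this notebook.

```python
def assign_splits(n_items, test_fraction=0.1):
    """Return one split id per item: 1 (test) for the leading items, 0 (train) otherwise.

    Mirrors the comparison used in the loop above: index i is a test item
    when i <= n_items * test_fraction.
    """
    threshold = n_items * test_fraction
    return [1 if i <= threshold else 0 for i in range(n_items)]


splits = assign_splits(632)
print(splits.count(0), splits.count(1))
```

Because the comparison is `<=` and the index starts at 0, the test set contains every index from 0 up to the threshold of 63.2, that is 64 tweets, slightly more than 10% of 632.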

## Storing the Worker Labels

Note: this is not the most efficient way to store the labels, but it is acceptable because the dataset is small.
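One possible way to reduce the cost, assuming all votes fit in memory, is to read the votes once and group them by tweet, so that each lookup becomes a dictionary access rather than a separate query. This is a sketch with hypothetical rows and helper names (`group_votes_by_tweet`, `votes_for`); it is not what this notebook does.

```python
from collections import defaultdict


def group_votes_by_tweet(rows):
    """Build {tweet_id: [(worker_id, label), ...]} in one pass over the rows."""
    votes = defaultdict(list)
    for tweet_id, worker_id, label in rows:
        votes[tweet_id].append((worker_id, label))
    return votes


# Hypothetical (tweet_id, worker_id, emotion) rows, assumed to be collected once
rows = [
    ("1", "w1", "Positive"),
    ("1", "w2", "Negative"),
    ("2", "w1", "Negative"),
]
votes = group_votes_by_tweet(rows)


def votes_for(tweet_id):
    """Yield (worker_id, label) pairs for one tweet without re-querying the data."""
    for pair in votes.get(tweet_id, []):
        yield pair


print(list(votes_for("1")))
```

A generator of this shape yields the same kind of `(worker_id, label)` pairs as a per-tweet query would, but the data is scanned only once.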

In [7]:
from snorkel.annotations import LabelAnnotator

# Extract worker votes
worker_labels = candidate_labeled_tweets.select("tweet_id", "worker_id", "emotion")

# Create a label generator
def worker_label_generator(t):
    """A generator over the different (worker_id, label_id) pairs for a Tweet."""
    labels = worker_labels \
        .select("worker_id", "emotion") \
        .filter("tweet_id == " + str(t.tweet.name)) \
        .collect()
    for row in labels:
        yield row.worker_id, row.emotion

labeler = LabelAnnotator(label_generator=worker_label_generator)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...

CPU times: user 5.61 s, sys: 567 ms, total: 6.17 s
Wall time: 1min 39s


<568x102 sparse matrix of type '<type 'numpy.float64'>'
	with 11360 stored elements in Compressed Sparse Row format>
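The shape of the output is consistent with the task description: 568 rows (training tweets) by 102 columns (workers), with 11,360 = 568 × 20 stored entries, that is 20 votes per tweet. The sketch below builds a small dense matrix with the same layout in plain Python. The helper `build_label_matrix`, the toy votes, and the convention that 0 means "no vote" are assumptions of this sketch, not a description of Snorkel's internals.

```python
def build_label_matrix(votes, tweet_ids, worker_ids, label_names):
    """Return a tweets x workers matrix of 1-based label ids (0 = no vote)."""
    row_of = {t: i for i, t in enumerate(tweet_ids)}
    col_of = {w: j for j, w in enumerate(worker_ids)}
    label_id = {name: k + 1 for k, name in enumerate(label_names)}
    matrix = [[0] * len(worker_ids) for _ in tweet_ids]
    for tweet_id, worker_id, label in votes:
        matrix[row_of[tweet_id]][col_of[worker_id]] = label_id[label]
    return matrix


# Hypothetical votes: two tweets, three workers, two label names
votes = [
    ("t1", "w1", "Positive"),
    ("t1", "w3", "Negative"),
    ("t2", "w2", "Negative"),
]
L = build_label_matrix(votes, ["t1", "t2"], ["w1", "w2", "w3"],
                       ["Positive", "Negative"])
stored = sum(1 for row in L for v in row if v != 0)
print(L, stored)
```

Most workers label only a fraction of the tweets, which is why a sparse format suits the real matrix: only 11,360 of its 568 × 102 = 57,936 cells are filled.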

## Obtaining the Ground-Truth Labels

In [8]:
train_cands = session.query(Tweet).filter(Tweet.split == 0).order_by(Tweet.id).all()

# Generate and store the golden labels for each tweet candidate.
# The labels are split into two lists, one for training data and one for testing data. 
# The raw groundtruth labels are stored with respect to their unique id (from 1-5).
candidate_labels = {}

# Iterate over splits
for split in range(2):
    # Init candidate labels
    candidate_labels[split] = []
    # Get candidates
    cands = session.query(Tweet).filter(Tweet.split == split).order_by(Tweet.id).all()
    # Iterate over candidates    
    for c in cands:
        # Get candidate tweet_id
        cand_tweet_id = c.tweet.name
        # Get candidate numeric label
        raw_label = gold_answers.select("sentiment").filter("tweet_id == "+str(cand_tweet_id)).collect()[0].sentiment
        # Add offset to start enumeration from one
        numeric_label = values.index(raw_label) + 1
        # Store label
        candidate_labels[split].append(numeric_label)
         
# Split candidate labels            
train_cand_labels = candidate_labels[0]
test_cand_labels = candidate_labels[1]

print(len(train_cand_labels))
print(len(test_cand_labels))

568
64
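The conversion from label strings to 1-based ids used above can be sketched on its own. The helper `to_numeric_labels` is hypothetical, and the order of `label_names` below is only illustrative: in the notebook the order comes from `distinct().collect()`, which we assume is not guaranteed to be the same across runs.

```python
def to_numeric_labels(raw_labels, label_names):
    """Map each label string to its 1-based position in label_names."""
    index = {name: i + 1 for i, name in enumerate(label_names)}
    return [index[label] for label in raw_labels]


# The five categories from the task description, in an illustrative order
label_names = [
    "Positive",
    "Negative",
    "I can't tell",
    "Neutral / author is just sharing information",
    "Tweet not related to weather condition",
]
print(to_numeric_labels(["Negative", "Positive", "I can't tell"], label_names))
```

Building the dictionary once avoids a linear `list.index` search per label, although with five categories the difference is negligible.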
