# Resolving Conflicts in Crowdsourced Data with Snorkel

In this tutorial, we walk through using `Snorkel` to resolve conflicts in crowdsourced answers for a sentiment analysis task. The tutorial is broken up into four parts, each covering a step in the pipeline:
1. Preprocessing
2. Construction of a Snorkel Labeling Matrix
3. Conflict Resolution
4. Evaluation

In this notebook, we preprocess the data collected by the crowd contributors using [Spark SQL and Dataframes](https://spark.apache.org/docs/latest/sql-programming-guide.html).

## Part 0: Sentiment Analysis of Tweets

In this tutorial we focus on the [Weather sentiment](https://www.crowdflower.com/data/weather-sentiment/) task from [Crowdflower](https://www.crowdflower.com/).

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather; a follow-up job then asked 10 contributors to grade the original sentiment evaluation. For the sentiment grading, contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet. Thus, in many cases contributors assigned conflicting sentiment labels to the same tweet. 
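To make the notion of a conflict concrete, the following sketch tallies the votes for a single tweet and reports whether the contributors disagree. The votes are made-up illustrative data, not rows from the dataset, and `summarize_votes` is a hypothetical helper introduced only for this example.

```python
from collections import Counter


def summarize_votes(votes):
    """Return (vote counts, has_conflict) for one tweet's list of labels."""
    counts = Counter(votes)
    # A conflict exists when contributors chose more than one distinct label.
    return counts, len(counts) > 1


# Hypothetical votes for one tweet (illustrative only).
votes = ["Positive", "Negative", "Positive", "I can't tell", "Positive"]
counts, has_conflict = summarize_votes(votes)
most_common_label = counts.most_common(1)[0][0]
print("{} (conflict: {})".format(most_common_label, has_conflict))
```

Picking the most common label, as in the last lines, is the simplest way to resolve such a conflict, but it treats every contributor as equally reliable.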


The task comes with two data files (found in the `data` directory of the tutorial):
1. [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) contains the raw contributor answers for each of the 1,000 tweets.
2. [weather-evaluated-agg-DFE.csv](data/weather-evaluated-agg-DFE.csv) contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

**GOAL:** The goal of this tutorial is to demonstrate how `Snorkel` can be used to accurately infer a single sentiment label for each tweet, thereby denoising the collected contributor answers.

## Part I: Preprocessing - Data Loading with Spark SQL and Dataframes

First, we initialize a `SparkSession`, which manages a connection to a local Spark master. This lets us preprocess the raw data and convert it to the format `Snorkel` expects:

In [1]:
# Initialize Spark Environment and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Snorkel Crowdsourcing Demo") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

We can now load the raw data for our crowdsourcing task (stored in a local CSV file) into a DataFrame.

In [2]:
# Load Crowdsourcing Data
raw_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-non-agg-DFE.csv")
raw_crowd_answers.printSchema()

root
 |-- _unit_id_: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- trust: string (nullable = true)
 |-- worker_id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- region: string (nullable = true)
 |-- city: string (nullable = true)
 |-- emotion: string (nullable = true)
 |-- tweet_id: string (nullable = true)
 |-- tweet_body: string (nullable = true)



As mentioned above, contributors can provide conflicting labels for the same tweet:

In [3]:
raw_crowd_answers.select("worker_id", "emotion", "tweet_body").orderBy("tweet_id").show()

+---------+--------------------+--------------------+
|worker_id|             emotion|          tweet_body|
+---------+--------------------+--------------------+
| 14806909|Neutral / author ...|I dunno which ass...|
|  7450342|Neutral / author ...|I dunno which ass...|
|  6737418|            Negative|I dunno which ass...|
| 18381123|Neutral / author ...|I dunno which ass...|
| 10752241|            Positive|I dunno which ass...|
| 14584835|            Negative|I dunno which ass...|
|  6346694|Neutral / author ...|I dunno which ass...|
| 19028457|            Positive|I dunno which ass...|
| 17475684|            Negative|I dunno which ass...|
|  6498214|        I can't tell|I dunno which ass...|
| 16498372|Tweet not related...|I dunno which ass...|
|  7012325|            Positive|I dunno which ass...|
|  9333400|            Negative|I dunno which ass...|
| 10379699|            Positive|I dunno which ass...|
| 14298198|            Positive|I dunno which ass...|
| 20043586|            Negat

## Part II: Construction of a Snorkel Labeling Matrix

We now demonstrate how to convert the raw crowd data stored in a Dataframe to the input assumed by `Snorkel`. 

### A recap of Snorkel's labeling matrix
`Snorkel` is a system for rapidly creating, modeling, and managing training data. It is built around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but `Snorkel` automatically models this process—learning, essentially, which labeling functions are more accurate than others—and then uses this to train an end model (for example, a deep neural network in TensorFlow).

The key input in `Snorkel's` programming model is a **labeling matrix**. Its rows correspond to the objects that need labels, and its columns to the labeling functions that assign labels to these objects. Different labeling functions can provide conflicting assignments for the same object. `Snorkel` leverages the agreement rates across labeling functions to automatically infer the accuracy of each function. The inferred labeling function accuracies can then be used to estimate the most probable label assignment for each object.
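As an illustration of the agreement rates mentioned above, the sketch below computes, for each pair of labeling functions in a tiny hand-written matrix, the fraction of commonly labeled objects on which they agree. The matrix and the `agreement_rate` helper are assumptions made for this example; this is only the raw statistic, not Snorkel's inference procedure.

```python
from itertools import combinations

# Toy labeling matrix: rows are objects, columns are labeling functions.
# 0 means the labeling function abstained (illustrative data only).
L_toy = [
    [1, 1, 2],
    [2, 2, 2],
    [1, 0, 1],
    [3, 3, 1],
]


def agreement_rate(L, a, b):
    """Fraction of objects labeled by both columns a and b on which they agree."""
    both = [(row[a], row[b]) for row in L if row[a] != 0 and row[b] != 0]
    if not both:
        return None  # the two labeling functions never overlap
    return sum(1 for x, y in both if x == y) / float(len(both))


num_lfs = len(L_toy[0])
rates = {
    (a, b): agreement_rate(L_toy, a, b)
    for a, b in combinations(range(num_lfs), 2)
}
```

Here columns 0 and 1 always agree where they overlap, while column 2 disagrees with both fairly often, which is the kind of signal that suggests column 2 is the less accurate one.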

### Labeling matrices for crowdsourcing
There is a one-to-one correspondence between the task of resolving conflicting answers by crowd contributors and `Snorkel's` core task. Each crowdsourcing task corresponds to an object (a row in the labeling matrix) and each worker to a different labeling function (a column) that assigns labels to a subset of objects.
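A small sketch of this correspondence, using hypothetical task ids, worker ids, and integer label keys: every task becomes a row, every worker a column, and a worker who did not answer a task is recorded as abstaining with the value 0.

```python
# Hypothetical (task, worker, label) answers; 0 is reserved for "abstain".
answers = [
    ("t1", "w1", 1), ("t1", "w2", 2),
    ("t2", "w1", 1), ("t2", "w3", 1),
    ("t3", "w2", 3),
]

tasks = sorted({t for t, _, _ in answers})      # one row per task
workers = sorted({w for _, w, _ in answers})    # one column per worker
row_of = {t: i for i, t in enumerate(tasks)}
col_of = {w: j for j, w in enumerate(workers)}

# Start from an all-abstain matrix and fill in the answers that exist.
matrix = [[0] * len(workers) for _ in tasks]
for task, worker, label in answers:
    matrix[row_of[task]][col_of[worker]] = label

for task, row in zip(tasks, matrix):
    print("{}: {}".format(task, row))
```

Most entries of a real crowdsourcing matrix are abstentions, because each worker sees only a small share of the tasks; this is why a sparse representation is a natural fit.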

Below we demonstrate how to map the raw crowdsourced data to a labeling matrix and how to use `Snorkel` to resolve disagreements across contributors.

First we perform a selection over the dataframe containing the raw crowdsourced data to obtain the labels assigned to different tasks by different workers. We refer to the result of this selection as `worker_labels`.

In [4]:
# Extract worker votes
worker_labels = raw_crowd_answers.selectExpr("tweet_id as task_id", "worker_id", "emotion as label")
worker_labels.show()

+--------+---------+--------------------+
| task_id|worker_id|               label|
+--------+---------+--------------------+
|82846118| 18034918|Neutral / author ...|
|82510997| 18034918|            Positive|
|83271279| 18034918|            Negative|
|80058872| 18034918|            Positive|
|80058809| 18034918|Neutral / author ...|
|79188429| 18034918|Neutral / author ...|
|82838136| 18034918|            Positive|
|82513588| 18034918|            Positive|
|84321017| 18034918|Neutral / author ...|
|82680080| 18034918|            Positive|
|82510259| 18034918|            Negative|
|83258314| 18034918|Neutral / author ...|
|80052652| 18034918|Tweet not related...|
|79189357| 18034918|            Positive|
|83257167| 18034918|Neutral / author ...|
|84312699| 18034918|            Negative|
|79187315| 18034918|Neutral / author ...|
|84034743| 18034918|            Negative|
|84319723| 18034918|Neutral / author ...|
|83259244| 18034918|            Positive|
+--------+---------+--------------------+

Before we populate the labeling matrix to be used as input for `Snorkel` we generate a series of maps that:

1. Map each `task_id` to a unique Object id represented as an integer.
2. Map each `worker_id` to a unique Labeling Function (LF) id represented as an integer.
3. Map each possible label from the active domain of the crowdsourced tasks to a unique integer key in 1..D. The value of 0 is reserved to denote that a worker (labeling function) abstains from assigning a label for a task (object).

We will later use these maps to populate the entries of the actual labeling matrix, which is stored as a sparse SciPy matrix.

In [5]:
# Generate task to object map
task2ObjMap = {}
oid = 0
obj2TaskMap = []
for task in worker_labels.select("task_id").distinct().orderBy("task_id").collect():
    task2ObjMap[task.task_id] = oid
    obj2TaskMap.append(task.task_id)
    oid += 1

# Generate workers map
worker2LFMap = {}
lfid = 0
lf2WorkerMap = []
for worker in worker_labels.select("worker_id").distinct().orderBy("worker_id").collect():
    worker2LFMap[worker.worker_id] = lfid
    lf2WorkerMap.append(worker.worker_id)
    lfid += 1
    
# Generate label map

# The special class label False corresponds to the label key 0 
# which means that a worker (labeling function) abstains from
# providing a label for a task (object).
taskLabels = [False] 
taskLabels.extend([str(i.label) for i in worker_labels.select("label").distinct().collect()])
taskLabelsMap = {}
for i in range(len(taskLabels)):
    taskLabelsMap[taskLabels[i]] = i

In [6]:
# Inspect the domain map for possible labels
for label in taskLabelsMap:
    print(str(label) + " <=> " + str(taskLabelsMap[label]))

False <=> 0
Positive <=> 3
Negative <=> 4
I can't tell <=> 2
Neutral / author is just sharing information <=> 5
Tweet not related to weather condition <=> 1
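The integer keys shown above depend on the order in which `distinct()` happens to return the labels, which is not guaranteed to be stable without an explicit ordering. If reproducible keys matter, one option is to sort the label strings before numbering them. The sketch below shows the idea on a plain Python list; `build_label_map` is a hypothetical helper, not part of the tutorial's code.

```python
def build_label_map(raw_labels, abstain=False):
    """Map each distinct label to a key in 1..D; key 0 is reserved for abstain."""
    labels = [abstain] + sorted(set(raw_labels))
    return labels, {label: key for key, label in enumerate(labels)}


# Illustrative input; in the tutorial these strings come from worker_labels.
raw = ["Positive", "Negative", "I can't tell", "Positive"]
labels, label_map = build_label_map(raw)
```

With sorted labels the same input always yields the same keys, at the cost of keys that differ from the ones printed above.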


We will now iterate over the entries of the `worker_labels` dataframe and use the above maps to populate the labeling matrix.

In [7]:
# The labeling matrix is represented
# as a sparse scipy array

# Imports
import numpy as np
from scipy import sparse

# Initialize dimensions of labeling matrix
objects = worker_labels.select("task_id").distinct().count()
LFs = worker_labels.select("worker_id").distinct().count()

# Initialize empty labeling matrix
L = sparse.lil_matrix((objects, LFs), dtype=np.int64)

# Iterate over crowdsourced labels and populate labeling matrix
for assigned_label in worker_labels.select("worker_id", "task_id", "label").collect():
    oid = task2ObjMap[assigned_label.task_id]
    LFid = worker2LFMap[assigned_label.worker_id]
    label = taskLabelsMap[assigned_label.label]
    L[oid, LFid] = label
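Note that the assignment `L[oid, LFid] = label` keeps only the last label if the same worker answered the same task more than once. Whether that occurs in this dataset is not checked in the tutorial. A minimal sketch of such a check on plain tuples, with hypothetical records and a hypothetical helper name:

```python
from collections import defaultdict


def find_repeated_votes(records):
    """Return {(task_id, worker_id): [labels]} for pairs that occur more than once."""
    seen = defaultdict(list)
    for task_id, worker_id, label in records:
        seen[(task_id, worker_id)].append(label)
    return {pair: labels for pair, labels in seen.items() if len(labels) > 1}


# Hypothetical records shaped like the rows of worker_labels.
records = [
    ("t1", "w1", "Positive"),
    ("t1", "w2", "Negative"),
    ("t1", "w1", "Negative"),  # w1 answers t1 a second time
]
repeated = find_repeated_votes(records)
```

An empty result means every matrix entry is backed by exactly one answer; a non-empty one lists the pairs that would be overwritten.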

## Part III: Conflict Resolution

Until now we have converted the raw crowdsourced data into a labeling matrix that can be provided as input to `Snorkel`. We will now show how to:

1. Use `Snorkel's` generative model to learn the accuracy of each crowd contributor.
2. Use the learned model to estimate a marginal distribution over the domain of possible labels for each task.
3. Use the estimated marginal distribution to obtain the maximum a posteriori probability estimate for the label that each task takes.
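To build intuition for steps 2 and 3, the sketch below computes label marginals for a single task under a deliberately simplified model: worker accuracies are assumed to be known, workers are assumed independent given the true label, the prior over labels is uniform, and a mistaken worker picks uniformly among the wrong labels. This is not Snorkel's generative model, and `label_marginals` and all numbers are hypothetical.

```python
def label_marginals(votes, accuracies, num_labels):
    """Posterior over labels 1..num_labels for one task.

    votes[j] is worker j's label (0 = abstain); accuracies[j] is the assumed
    probability that worker j is correct.
    """
    scores = []
    for k in range(1, num_labels + 1):
        score = 1.0
        for vote, acc in zip(votes, accuracies):
            if vote == 0:
                continue  # abstentions carry no evidence in this sketch
            score *= acc if vote == k else (1.0 - acc) / (num_labels - 1)
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


# Two reliable workers vote 1, one unreliable worker votes 2 (made-up numbers).
marginals = label_marginals(votes=[1, 1, 2], accuracies=[0.9, 0.8, 0.4], num_labels=3)
map_label = marginals.index(max(marginals)) + 1  # shift back to keys 1..D
```

The votes of accurate workers dominate the posterior, so the MAP label can differ from a plain majority vote when the majority consists of unreliable workers.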

### Importing and training a Snorkel generative model

First we import and initialize `Snorkel's` generative model.

In [8]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)




Then we train `Snorkel's` generative model by passing as input the labeling matrix that corresponds to the crowdsourced data.

In [9]:
# Train the generative model
gen_model.train(
    L,
    reg_type=2,
    reg_param=0.01,
    epochs=10
)

Inferred cardinality: 5


### Inferring the marginal distribution
The following command uses the labeling matrix and the learned generative model to estimate the marginal distribution over the domain of possible labels for each task.

In [10]:
task_marginals = gen_model.marginals(L)

### Inferring the MAP assignment for each task
Each task corresponds to an independent random variable. Thus, we can simply associate each task with the most probable label under the estimated marginal distribution.

In [11]:
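# Get MAP assignment for each task
task_map_assignment = np.argmax(task_marginals, axis=1)
inferedLabels = {}
for i in range(len(task_map_assignment)):
    # Marginal columns are 0-based while label keys start at 1 (key 0 is the
    # abstain placeholder), so shift the argmax index by one.
    inferedLabels[obj2TaskMap[i]] = taskLabels[task_map_assignment[i] + 1]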
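The index shift in the cell above is easy to get wrong, so here is the same decoding step on a toy example in plain Python. It assumes, as the `+1` in the cell implies, that the marginals have one column per real label and no column for the abstain key. The label names and probabilities are made up.

```python
labels = [False, "Negative", "Neutral", "Positive"]  # index 0 = abstain (toy domain)

# One row per task, one column per real label (no column for abstain).
toy_marginals = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]


def map_labels(marginals, labels):
    """Pick the most probable label per row, shifting past the abstain slot."""
    result = []
    for row in marginals:
        best_col = max(range(len(row)), key=row.__getitem__)
        result.append(labels[best_col + 1])
    return result


decoded = map_labels(toy_marginals, labels)
```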

## Part IV: Evaluation

We now evaluate the accuracy of `Snorkel's` model at identifying the correct label for each task by fusing the labels provided by different crowd contributors. To do so, we compare the MAP label assigned to each task against the provided ground-truth data.
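The comparison itself is a simple accuracy computation. A minimal sketch on hypothetical dictionaries keyed by task id (the `accuracy` helper and its treatment of missing predictions as errors are choices made for this example):

```python
def accuracy(predicted, gold):
    """Fraction of gold-labeled tasks whose predicted label matches.

    Tasks missing from `predicted` are counted as errors.
    """
    if not gold:
        raise ValueError("gold must contain at least one task")
    correct = sum(1 for task, label in gold.items() if predicted.get(task) == label)
    return correct / float(len(gold))


# Hypothetical task ids and labels.
predicted = {"t1": "Positive", "t2": "Negative", "t3": "Positive"}
gold = {"t1": "Positive", "t2": "Positive", "t3": "Positive", "t4": "Negative"}
score = accuracy(predicted, gold)
```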

In [12]:
# Extract ground truth per tweet_id
gold_crowd_answers = spark.read.format("csv").option("header", "true").csv("data/weather-evaluated-agg-DFE.csv")
gold_crowd_answers.createOrReplaceTempView("gold_crowd_answers")
gold_answers = spark.sql("SELECT tweet_id, sentiment, tweet_body FROM gold_crowd_answers WHERE correct_category ='Yes' and correct_category_conf = 1")

In [13]:
# Compare each gold label with the inferred MAP label and report mismatches
errors = 0
total = float(gold_answers.count())
for trueLabel in gold_answers.select("tweet_id", "sentiment", "tweet_body").collect():
    if trueLabel.sentiment != inferedLabels[trueLabel.tweet_id]:
        errors += 1
        print('*** Error ***')
        print('Original tweet: ' + trueLabel.tweet_body)
        print('Groundtruth label: ' + trueLabel.sentiment)
        print('Snorkel label: ' + inferedLabels[trueLabel.tweet_id])
        print('\n')
print("Overall accuracy of Snorkel's model = " + str((total - errors) / total))

*** Error ***
Original tweet: BLOG: Another Day, Another Round Of Thunderstorms {link}
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: RT @mention: It'll be sunny again by time you land RT @mention: Every time I think about going back to Portland it starts ra ...
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: #sunshine
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Positive


*** Error ***
Original tweet: This hot, dry, & windy weather is going to turn the #canola fast. Keep a close eye on it if you plan on swathing or pushing. #okanola
Groundtruth label: Neutral / author is just sharing information
Snorkel label: Negative


*** Error ***
Original tweet: @mention It's supposed to go up to 70 today. Sunshine early. A slight chance for a shower or t-storm later. Going to my Aunt's house later.
Groundtruth label: N