# Crowdsourcing tutorial
In this tutorial, we'll provide a simple walkthrough of how to use Snorkel to resolve conflicts
in a noisy, hybrid dataset for a sentiment analysis task.
We have crowdsourced labels for a portion of the training dataset, and combine these
with heuristic labeling functions to increase the number of training labels we have.
Like most Snorkel labeling pipelines, we'll use the denoised labels to train a deep learning
model which can be applied to new, unseen data to automatically make predictions!

In this tutorial, we're using the
[Weather Sentiment](https://data.world/crowdflower/weather-sentiment)
dataset from Figure Eight.
Our goal is to label each tweet as either positive or negative so that
we can train a language model over the tweets themselves that can be applied
to new, unseen data points.
Crowd workers were asked to grade the sentiment of a
particular tweet relating to the weather.
Crowd workers could choose among the following categories:

* Positive
* Negative
* I can't tell
* Neutral / author is just sharing information
* Tweet not related to weather condition

The catch is that 20 crowd workers graded each tweet, and in many cases
crowd workers assigned conflicting sentiment labels to the same tweet.
This is a common issue when dealing with crowdsourced labeling workloads.

We've also altered the data set to reflect a realistic crowdsourcing pipeline
where only a subset of our full training set has received crowd labels.
We'll encode the crowd labels themselves as labeling functions in order
to learn trust weights for each crowd worker, and write a few heuristic
labeling functions to cover the data points without crowd labels.
Snorkel's ability to build high-quality datasets from multiple noisy labeling
signals makes it an ideal framework to approach this problem.

We start by loading our data which has 287 examples in total.
We take 50 for our development set and 50 for our test set.
The remaining 187 examples form our training set.
This data set is very small, and we're primarily using it for demonstration purposes.
In particular, we'd expect to have access to many more unlabeled tweets in order to
train a high performance text model.

The labels above have been mapped to integers, which we show here.

## Loading Crowdsourcing Dataset

In [1]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("crowdsourcing")

In [2]:
from data import load_data, answer_mapping

crowd_answers, df_train, df_dev, df_test = load_data()
Y_dev = df_dev.sentiment.values
Y_test = df_test.sentiment.values

print("Answer to int mapping:")
for k, v in sorted(answer_mapping.items(), key=lambda kv: kv[1]):
    print(f"{k:<50}{v}")

Answer to int mapping:
I can't tell                                      -1
Negative                                          0
Positive                                          1
Neutral / author is just sharing information      2
Tweet not related to weather condition            3


First, let's take a look at our development set to get a sense of
what the tweets look like.

In [3]:
df_dev.head()

tweet_id (index),tweet_id,tweet_text,sentiment
79197834,79197834,@mention not in sunny dover! haha,1
80059939,80059939,It is literally pissing it down in sideways ra...,0
79196441,79196441,"Dear perfect weather, thanks for the vest lunc...",1
84047300,84047300,RT @mention: I can't wait for the storm tonigh...,1
83255121,83255121,60 degrees. And its almost the end of may. Wis...,0


Now let's take a look at the crowd labels.
We'll convert these into labeling functions.

In [4]:
crowd_answers.head()

tweet_id (index),worker_id,answer
82510997,18034918,1
82510997,7450342,1
82510997,18465660,1
82510997,17475684,0
82510997,14472526,1
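
To get a feel for how often workers disagree, one option is to tally the answers per tweet. The sketch below works on a small list of `(tweet_id, worker_id, answer)` tuples rather than the real `crowd_answers` frame. The first five rows mirror the output above; the rows for tweet `1` and the helper name `summarize_votes` are made up for illustration.

```python
from collections import Counter, defaultdict

# (tweet_id, worker_id, answer) rows. The first five mirror the output above;
# the two rows for tweet 1 are hypothetical.
rows = [
    (82510997, 18034918, 1),
    (82510997, 7450342, 1),
    (82510997, 18465660, 1),
    (82510997, 17475684, 0),
    (82510997, 14472526, 1),
    (1, 101, 0),
    (1, 102, 0),
]


def summarize_votes(rows):
    """Return {tweet_id: (majority_answer, has_conflict)}."""
    votes = defaultdict(Counter)
    for tweet_id, _worker_id, answer in rows:
        votes[tweet_id][answer] += 1
    summary = {}
    for tweet_id, counts in votes.items():
        majority_answer, _ = counts.most_common(1)[0]
        # A tweet has a conflict when workers gave more than one distinct answer.
        summary[tweet_id] = (majority_answer, len(counts) > 1)
    return summary


summary = summarize_votes(rows)
print(summary)  # {82510997: (1, True), 1: (0, False)}
```

A plain majority vote like this treats every worker as equally reliable, which is the assumption the label model later relaxes.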


## Writing Labeling Functions
Each crowd worker can be thought of as a single labeling function,
as each worker labels a subset of examples,
and may have errors or conflicting answers with other workers / labeling functions.
So we create one labeling function per worker.
We'll simply return the label the worker submitted for a given tweet, and abstain
if they didn't submit an answer for it.

### Crowd worker labeling functions

In [5]:
# Group answers by worker and keep only workers with more than 10 answers.
labels_by_annotator = crowd_answers.groupby("worker_id")
worker_dicts = {}
for worker_id in labels_by_annotator.groups:
    worker_df = labels_by_annotator.get_group(worker_id)[["answer"]]
    if len(worker_df) > 10:
        # Map tweet_id (the index) to this worker's answer.
        worker_dicts[worker_id] = dict(zip(worker_df.index, worker_df.answer))

print("Number of workers:", len(worker_dicts))

Number of workers: 68


In [6]:
from snorkel.labeling.lf import LabelingFunction


def f_pos(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 1 if label == 1 else -1


def f_neg(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 0 if label == 0 else -1


def get_worker_labeling_function(worker_id, f):
    worker_dict = worker_dicts[worker_id]
    name = f"worker_{worker_id}"
    return LabelingFunction(name, f=f, resources={"worker_dict": worker_dict})


worker_lfs_pos = [
    get_worker_labeling_function(worker_id, f_pos) for worker_id in worker_dicts
]
worker_lfs_neg = [
    get_worker_labeling_function(worker_id, f_neg) for worker_id in worker_dicts
]
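
To make the behavior of `f_pos` and `f_neg` concrete, here is a minimal sketch that calls the same two functions on stand-in data points, without Snorkel. The `worker_dict` contents and the tweet IDs `10`, `20`, and `30` are hypothetical, and `SimpleNamespace` stands in for a DataFrame row.

```python
from types import SimpleNamespace

ABSTAIN = -1

# Hypothetical worker: answered Positive (1) for tweet 10 and Negative (0) for tweet 20.
worker_dict = {10: 1, 20: 0}


def f_pos(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 1 if label == 1 else ABSTAIN


def f_neg(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 0 if label == 0 else ABSTAIN


# Tweet 30 was never seen by this worker, so both functions abstain on it.
tweets = [SimpleNamespace(tweet_id=t) for t in (10, 20, 30)]
pos_votes = [f_pos(x, worker_dict) for x in tweets]
neg_votes = [f_neg(x, worker_dict) for x in tweets]
print(pos_votes)  # [1, -1, -1]
print(neg_votes)  # [-1, 0, -1]
```

Note that any answer other than `0` or `1` (for example "I can't tell") also ends up as an abstain, since each function only passes through its own label.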

Let's take a quick look at how well they do on the development set.

In [7]:
from snorkel.labeling.apply import PandasLFApplier

lfs = worker_lfs_pos + worker_lfs_neg
lf_names = [lf.name for lf in lfs]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

In [8]:
from snorkel.labeling.analysis import LFAnalysis

LFAnalysis(L_dev).lf_summary(Y_dev, lf_names=lf_names).head(10)

LF,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
worker_6340330,0,[1],0.04,0.04,0.04,2,0,1.0
worker_6344001,1,[1],0.04,0.04,0.04,2,0,1.0
worker_6346694,2,[1],0.12,0.12,0.1,5,1,0.833333
worker_6363996,3,[1],0.04,0.04,0.02,2,0,1.0
worker_6371053,4,[1],0.06,0.06,0.06,2,1,0.666667
worker_6453108,5,[1],0.04,0.04,0.02,2,0,1.0
worker_6737418,6,[1],0.06,0.06,0.06,2,1,0.666667
worker_7325249,7,[1],0.1,0.1,0.1,4,1,0.8
worker_7450342,8,[],0.0,0.0,0.0,0,0,0.0
worker_7860247,9,[1],0.16,0.16,0.14,8,0,1.0
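
The Coverage, Overlaps, and Conflicts columns are all fractions of data points. The sketch below recomputes them for a toy label matrix, assuming the usual definitions: an LF covers a point when it does not abstain, overlaps when at least one other LF also labels that point, and conflicts when at least one other LF assigns a different label. The matrix and the helper `lf_stats` are hypothetical and not part of Snorkel's API.

```python
ABSTAIN = -1

# Toy label matrix: 4 data points (rows) x 3 labeling functions (columns).
L = [
    [1, 1, -1],
    [1, 0, -1],
    [-1, -1, -1],
    [0, -1, -1],
]


def lf_stats(L, j):
    """Return (coverage, overlaps, conflicts) for labeling function j."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Non-abstain votes from every other labeling function on this point.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n


stats = [lf_stats(L, j) for j in range(3)]
print(stats)  # [(0.75, 0.5, 0.25), (0.5, 0.5, 0.25), (0.0, 0.0, 0.0)]
```

The third column never votes, which matches the kind of row seen for `worker_7450342` above: empty polarity and zeros everywhere.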


So the crowd labels are quite good! But how much of our dev and training
sets do they cover?

In [9]:
print("Training set coverage:", LFAnalysis(L_train).label_coverage())
print("Dev set coverage:", LFAnalysis(L_dev).label_coverage())

Training set coverage: 0.5026737967914439
Dev set coverage: 0.5
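
Label coverage here is the fraction of data points that receive at least one non-abstain vote from any labeling function. A minimal sketch on a toy matrix, assuming that definition (the matrix and the standalone `label_coverage` helper are hypothetical):

```python
ABSTAIN = -1

# Toy label matrix: 4 data points x 2 labeling functions.
L = [
    [1, -1],
    [-1, -1],
    [-1, 0],
    [-1, -1],
]


def label_coverage(L):
    """Fraction of rows with at least one non-abstain vote."""
    labeled = sum(any(v != ABSTAIN for v in row) for row in L)
    return labeled / len(L)


coverage = label_coverage(L)
print(coverage)  # 0.5
```

By the same arithmetic, the training figure printed above is consistent with 94 of the 187 training tweets having at least one crowd label.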


### Additional labeling functions

We can mix the crowd worker labeling functions with labeling
functions of other types.
We'll use a few varied approaches and use the label model to learn
how to combine their values.

In [10]:
from snorkel.labeling.lf import labeling_function
from snorkel.labeling.preprocess import preprocessor
from textblob import TextBlob


@preprocessor()
def textblob_polarity(x):
    scores = TextBlob(x.tweet_text)
    x.polarity = scores.polarity
    return x


textblob_polarity.memoize = True


@labeling_function(preprocessors=[textblob_polarity])
def polarity_positive(x):
    return 1 if x.polarity > 0.3 else -1


@labeling_function(preprocessors=[textblob_polarity])
def polarity_negative(x):
    return 0 if x.polarity < -0.25 else -1


@labeling_function(preprocessors=[textblob_polarity])
def polarity_negative_2(x):
    return 0 if x.polarity <= 0.3 else -1
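
To see how the three thresholds interact, the sketch below applies the same comparisons directly to a few polarity scores, skipping TextBlob and Snorkel entirely. The scores are hypothetical, and the functions take a bare float instead of a data point.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def polarity_positive(polarity):
    return POSITIVE if polarity > 0.3 else ABSTAIN


def polarity_negative(polarity):
    return NEGATIVE if polarity < -0.25 else ABSTAIN


def polarity_negative_2(polarity):
    return NEGATIVE if polarity <= 0.3 else ABSTAIN


# Hypothetical polarity scores in [-1, 1].
votes_by_polarity = {
    p: (polarity_positive(p), polarity_negative(p), polarity_negative_2(p))
    for p in (-0.5, 0.0, 0.3, 0.8)
}
for p, votes in votes_by_polarity.items():
    print(p, votes)
# -0.5 (-1, 0, 0)
# 0.0 (-1, -1, 0)
# 0.3 (-1, -1, 0)
# 0.8 (1, -1, -1)
```

Because `polarity_positive` fires above 0.3 and `polarity_negative_2` fires at or below 0.3, every tweet gets a vote from exactly one of the two, so the text-based functions never leave a data point unlabeled.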

### Applying labeling functions to the training set

In [11]:
from snorkel.labeling.apply import PandasLFApplier

text_lfs = [polarity_positive, polarity_negative, polarity_negative_2]
lfs = text_lfs + worker_lfs_pos + worker_lfs_neg
lf_names = [lf.name for lf in lfs]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

In [12]:
LFAnalysis(L_dev).lf_summary(Y_dev, lf_names=lf_names).head(10)

LF,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
polarity_positive,0,[1],0.3,0.16,0.12,15,0,1.0
polarity_negative,1,[0],0.1,0.1,0.04,5,0,1.0
polarity_negative_2,2,[0],0.7,0.4,0.32,26,9,0.742857
worker_6340330,3,[1],0.04,0.04,0.04,2,0,1.0
worker_6344001,4,[1],0.04,0.04,0.04,2,0,1.0
worker_6346694,5,[1],0.12,0.12,0.1,5,1,0.833333
worker_6363996,6,[1],0.04,0.04,0.02,2,0,1.0
worker_6371053,7,[1],0.06,0.06,0.06,2,1,0.666667
worker_6453108,8,[1],0.04,0.04,0.02,2,0,1.0
worker_6737418,9,[1],0.06,0.06,0.06,2,1,0.666667


Using the text-based LFs, we've expanded coverage on both our training set
and dev set to 100%.
We'll now take these noisy and conflicting labels, and use the label model
to denoise and combine them.

In [13]:
print("Training set coverage:", LFAnalysis(L_train).label_coverage())
print("Dev set coverage:", LFAnalysis(L_dev).label_coverage())

Training set coverage: 1.0
Dev set coverage: 1.0


## Train Label Model And Generate Soft Labels

In [14]:
from snorkel.labeling.model.label_model import LabelModel

# Train label model.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=100, seed=123, log_freq=20, l2=0.1, lr=0.01)

Computing O...
Estimating \mu...
[0 epochs]: TRAIN:[loss=2.371]
[20 epochs]: TRAIN:[loss=0.568]


[40 epochs]: TRAIN:[loss=0.536]
[60 epochs]: TRAIN:[loss=0.522]
[80 epochs]: TRAIN:[loss=0.524]


Finished Training


As a spot-check for the quality of our label model, we'll score it on the dev set.

In [15]:
from snorkel.analysis.metrics import metric_score
from snorkel.analysis.utils import probs_to_preds

Y_dev_prob = label_model.predict_proba(L_dev)
Y_dev_pred = probs_to_preds(Y_dev_prob)

acc = metric_score(Y_dev, Y_dev_pred, probs=None, metric="accuracy")
print(f"Label Model Accuracy: {acc:.3f}")

Label Model Accuracy: 0.920
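
The scoring step boils down to two operations: pick the most probable class for each row, then count matches against the gold labels. The sketch below shows that idea in plain Python on hypothetical probabilities. The helpers `probs_to_hard_labels` and `accuracy` are stand-ins written for this example, and Snorkel's own `probs_to_preds` may treat ties differently (this version simply takes the first maximum).

```python
def probs_to_hard_labels(probs):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical probabilities: columns are P(negative), P(positive).
probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
y_true = [0, 1, 0, 0]

y_pred = probs_to_hard_labels(probs)
print(y_pred)                    # [0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75
```

On a 50-example dev set, an accuracy of 0.920 corresponds to 46 correct predictions.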


Look at that: we get 92% accuracy on the development set.
This is due to the abundance of high-quality crowd worker labels.
**Since we don't have these high quality crowdsourcing labels for the
test set or new incoming examples, we can't use the label model reliably
at inference time.**
In order to run inference on new incoming examples, we need to train a
discriminative model over the tweets themselves.
Let's generate a set of probabilistic labels for the training set.

In [16]:
Y_train_prob = label_model.predict_proba(L_train)

## Use Soft Labels to Train End Model

### Getting features from BERT
Since we have very limited training data, we cannot train a complex model with a lot of parameters, such as an LSTM. Instead, we use a pre-trained model, [BERT](https://github.com/google-research/bert), to generate embeddings for each of our tweets, and treat the embedding values as features.

In [17]:
import numpy as np
import torch
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def encode_text(text):
    input_ids = torch.tensor([tokenizer.encode(text)])
    return model(input_ids)[0].mean(1)[0].detach().numpy()


train_vectors = np.array(list(df_train.tweet_text.apply(encode_text).values))
test_vectors = np.array(list(df_test.tweet_text.apply(encode_text).values))
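
In `encode_text`, `model(input_ids)[0]` is the last hidden state with one vector per token, `.mean(1)` averages those vectors over the token axis, and the trailing `[0]` drops the batch dimension, leaving one fixed-length vector per tweet (768 values for `bert-base-uncased`). The sketch below shows the same mean pooling on plain lists; the helper `mean_pool` and the tiny 2-dimensional embeddings are hypothetical.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-length vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Hypothetical 3-token sentence with 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = mean_pool(tokens)
print(pooled)  # [2.0, 2.0]
```

Mean pooling gives every tweet a vector of the same length regardless of how many tokens it contains, which is what lets a simple classifier consume the features directly.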

### Model on soft labels
Now, we train a simple logistic regression model on the BERT features, using the
label model's probabilistic labels converted to hard predictions with `probs_to_preds`.

In [18]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(train_vectors, probs_to_preds(Y_train_prob))

print(f"Accuracy of trained model: {sklearn_model.score(test_vectors, Y_test)}")

Accuracy of trained model: 0.86
