# Crowdsourcing Tutorial
This tutorial shows how to use Snorkel in conjunction with crowdsourcing to create a training set for a sentiment analysis task.

Given:
- Crowdsourced labels account for about half of the training dataset.
- The crowdsourced labels are fairly accurate.
- They do not cover the entire training dataset.
- They are not available for the test set or during inference.

To make up for their lack of training set coverage, we combine crowdsourced labels with heuristic labeling functions to increase the number of training labels. We will use the denoised labels to train a deep learning model which can be applied to new, unseen data to automatically make predictions.

## Dataset Details
This tutorial uses the [Weather Sentiment](https://data.world/crowdflower/weather-sentiment) dataset from Figure Eight. The goal is to train a classifier that can label new tweets as expressing either a positive or negative sentiment.

Crowdworkers were asked to label the sentiment of a particular tweet relating to the weather. The catch is that 20 crowdworkers graded each tweet, and in many cases crowdworkers assigned conflicting sentiment labels to the same tweet. This is a common issue when dealing with crowdsourced labeling workloads.

Each crowdworker's labels are treated as coming from a single labeling function (LF). This allows us to learn a weight for how much to trust the labels from each crowdworker. We will also write a few heuristic labeling functions to cover the data points without crowd labels. Snorkel's ability to build a high-quality dataset from multiple noisy labeling signals makes it an ideal framework for this problem.
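
To make the conflict between workers concrete, here is a minimal sketch in plain Python of a toy label matrix, where each column holds one hypothetical worker's votes and `-1` marks an abstention. It resolves conflicts by unweighted majority vote, which is a simpler baseline than the weighted model Snorkel learns. The matrix values and the `majority_vote` helper are made up for illustration.

```python
from collections import Counter

ABSTAIN = -1

# Toy label matrix: one row per tweet, one column per (hypothetical) worker.
L_toy = [
    [1, 1, 0, ABSTAIN],                    # two positive votes, one negative
    [ABSTAIN, 0, 0, 1],                    # two negative votes, one positive
    [ABSTAIN, ABSTAIN, ABSTAIN, ABSTAIN],  # no worker labeled this tweet
]

def majority_vote(row):
    """Return the most common non-abstain label, or ABSTAIN if none or tied."""
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the two labels received the same number of votes
    return ranked[0][0]

toy_labels = [majority_vote(row) for row in L_toy]
print(toy_labels)
```

Majority vote treats every worker as equally reliable. Learning a weight per worker, as described above, lets more trustworthy workers count for more.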



## Preparation

In [1]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
  os.chdir("./crowdsourcing")

os.getcwd()

'/Users/scottchu/Projects/learning/snorkel-tutorials/crowdsourcing'

In [2]:
!pip install -r requirements.txt 



## Loading Crowdsourcing Dataset

- Data points: 287 in total (187 for training)
- Dev set: 50
- Test set: 50

In [3]:
from data import load_data

crowd_labels, df_train, df_dev, df_test = load_data()

Y_dev = df_dev.sentiment.values
Y_test = df_test.sentiment.values

In [4]:
crowd_labels.head()

tweet_id (index),worker_id,label
82510997,18034918,1
82510997,7450342,1
82510997,18465660,1
82510997,17475684,0
82510997,14472526,1


In [5]:
df_train.head()

tweet_id (index),tweet_id,tweet_text
82838854,82838854,@mention nope. Never got a chance to head out....
83258131,83258131,"Damn weather, just turn and stay sunny."
82512145,82512145,"Hi from sunny Boston. Weather is fine, wish yo..."
82511193,82511193,Ughhhhhh its so damn hot outside
81179800,81179800,This week in NYC will mark the longest stretch...


In [6]:
df_dev.head()

tweet_id (index),tweet_id,tweet_text,sentiment
79197834,79197834,@mention not in sunny dover! haha,1
80059939,80059939,It is literally pissing it down in sideways ra...,0
79196441,79196441,"Dear perfect weather, thanks for the vest lunc...",1
84047300,84047300,RT @mention: I can't wait for the storm tonigh...,1
83255121,83255121,60 degrees. And its almost the end of may. Wis...,0


In [7]:
df_test.head()

tweet_id (index),tweet_id,tweet_text,sentiment
82850069,82850069,RT @mention: Life is so much more manageable w...,1
81995718,81995718,@mention Sunshine! You lucky dog!,1
80058872,80058872,It is hot out here but it feels great,1
84313204,84313204,- this house is beyond cold..feels like imm la...,0
81996499,81996499,damn its goin rain all next week -.- where the...,0


## Writing Labeling Functions

Each crowdworker can be thought of as a single labeling function: each worker labels only a subset of the data points and may make errors or disagree with other workers and labeling functions. So we create one labeling function per worker. It simply returns the label the worker submitted for a given tweet, and abstains if the worker did not submit a label for it.
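
As a small illustration of this idea, the sketch below uses two hypothetical workers and made-up tweet IDs to show the lookup-or-abstain behaviour in plain Python, independent of Snorkel. The names `toy_worker_dicts` and `vote` are assumptions for this example only.

```python
ABSTAIN = -1

# Hypothetical workers, each mapping tweet_id -> submitted label.
toy_worker_dicts = {
    "worker_a": {101: 1, 102: 0},
    "worker_b": {102: 1},
}

def vote(worker_dict, tweet_id):
    """Return the worker's label for the tweet, or ABSTAIN if they gave none."""
    return worker_dict.get(tweet_id, ABSTAIN)

tweet_ids = [101, 102, 103]

# One row per tweet, one column per worker, in the order of toy_worker_dicts.
toy_matrix = [
    [vote(worker_dict, tweet_id) for worker_dict in toy_worker_dicts.values()]
    for tweet_id in tweet_ids
]
print(toy_matrix)
```

Tweet 102 shows a conflict between the two workers, and tweet 103 has no crowd label at all.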

### Crowdworker labeling functions

In [8]:
labels_by_annotator = crowd_labels.groupby("worker_id")

worker_dicts = {}
for worker_id in labels_by_annotator.groups:
  worker_df = labels_by_annotator.get_group(worker_id)[["label"]]
  worker_dicts[worker_id] = dict(zip(worker_df.index, worker_df.label))

print("Number of workers", len(worker_dicts))


Number of workers 100


In [9]:
from snorkel.labeling import LabelingFunction

ABSTAIN = -1

def worker_lf(x, worker_dict):
  return worker_dict.get(x.tweet_id, ABSTAIN)

def make_worker_lf(worker_id):
  worker_dict = worker_dicts[worker_id]
  name = f"worker_{worker_id}"
  return LabelingFunction(name, f=worker_lf, resources={"worker_dict": worker_dict})

worker_lfs = [make_worker_lf(worker_id) for worker_id in worker_dicts]

In [10]:
from snorkel.labeling import PandasLFApplier

applier = PandasLFApplier(lfs=worker_lfs)

L_train = applier.apply(df_train)

L_dev = applier.apply(df_dev)

100%|████████████████████████████████████████████████████████████████████| 187/187 [00:00<00:00, 2928.92it/s]
100%|██████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 2981.32it/s]


In [11]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_dev, worker_lfs).lf_summary(Y_dev).sample(5)



LF,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
worker_14472526,47,"[0, 1]",0.12,0.12,0.12,5,1,0.833333
worker_6498214,12,[1],0.02,0.02,0.02,0,1,0.0
worker_8449724,20,[1],0.06,0.06,0.04,3,0,1.0
worker_18500901,81,"[0, 1]",0.04,0.04,0.04,2,0,1.0
worker_15549817,54,[],0.0,0.0,0.0,0,0,0.0


In [12]:
print(f"Training set coverage: {100 * LFAnalysis(L_train).label_coverage(): 0.1f}%")
print(f"Dev set coverage: {100 * LFAnalysis(L_dev).label_coverage(): 0.1f}%")

Training set coverage:  50.3%
Dev set coverage:  50.0%
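
Coverage here means the fraction of data points that receive at least one non-abstain label. A minimal sketch of that computation on a made-up label matrix follows. The `coverage_of` helper is a stand-in written for this example, not Snorkel's implementation.

```python
ABSTAIN = -1

def coverage_of(label_matrix):
    """Fraction of rows that contain at least one non-abstain label."""
    if not label_matrix:
        return 0.0
    covered = sum(
        any(label != ABSTAIN for label in row) for row in label_matrix
    )
    return covered / len(label_matrix)

# Four made-up data points, two hypothetical labeling functions.
toy = [[1, -1], [-1, -1], [0, 1], [-1, -1]]
print(f"Toy coverage: {100 * coverage_of(toy):.1f}%")
```

With only the crowdworker labeling functions, roughly half of the rows look like the all-abstain rows in this toy matrix, which is why heuristic labeling functions are added next.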


In [13]:
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob

POSITIVE = 1
NEGATIVE = 0

@preprocessor(memoize=True)
def textblob_polarity(x):
  scores = TextBlob(x.tweet_text)
  x.polarity = scores.polarity
  return x

@labeling_function(pre=[textblob_polarity])
def polarity_positive(x):
  return POSITIVE if x.polarity > 0.3 else ABSTAIN

@labeling_function(pre=[textblob_polarity])
def polarity_negative(x):
  return NEGATIVE if x.polarity < -0.25 else ABSTAIN

@labeling_function(pre=[textblob_polarity])
def polarity_negative_2(x):
  return NEGATIVE if x.polarity <= 0.3 else ABSTAIN

In [14]:
text_lfs = [
  polarity_positive,
  polarity_negative,
  polarity_negative_2
]

lfs = text_lfs + worker_lfs

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

100%|████████████████████████████████████████████████████████████████████| 187/187 [00:00<00:00, 1383.81it/s]
100%|██████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 1467.96it/s]


In [15]:
LFAnalysis(L_dev, lfs=lfs).lf_summary(Y_dev).head()



LF,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
polarity_positive,0,[1],0.3,0.16,0.12,15,0,1.0
polarity_negative,1,[0],0.1,0.1,0.04,5,0,1.0
polarity_negative_2,2,[0],0.7,0.4,0.32,26,9,0.742857
worker_6332651,3,"[0, 1]",0.06,0.06,0.06,1,2,0.333333
worker_6336109,4,[],0.0,0.0,0.0,0,0,0.0


In [16]:
print(f"Training set coverage: {100 * LFAnalysis(L_train).label_coverage(): 0.1f}%")
print(f"Dev set coverage: {100 * LFAnalysis(L_dev).label_coverage(): 0.1f}%")

Training set coverage:  100.0%
Dev set coverage:  100.0%
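
Full coverage follows directly from the thresholds: `polarity_positive` fires when the polarity is above 0.3 and `polarity_negative_2` fires when it is at most 0.3, so every tweet gets at least one label. The sketch below replays the three threshold rules on a few made-up polarity scores without calling TextBlob. `text_lf_votes` is a hypothetical helper for illustration.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def text_lf_votes(polarity):
    """Votes of (polarity_positive, polarity_negative, polarity_negative_2)."""
    return (
        POSITIVE if polarity > 0.3 else ABSTAIN,
        NEGATIVE if polarity < -0.25 else ABSTAIN,
        NEGATIVE if polarity <= 0.3 else ABSTAIN,
    )

for polarity in (-0.8, 0.0, 0.3, 0.9):
    votes = text_lf_votes(polarity)
    # For any score, at least one of the three rules fires.
    assert any(v != ABSTAIN for v in votes)
    print(polarity, votes)
```

Note that `polarity_negative_2` also labels neutral and mildly positive tweets as negative, which is consistent with its lower empirical accuracy (about 0.74) in the summary table.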


## Train LabelModel and Generate Probabilistic Labels

In [17]:
from snorkel.labeling.model import LabelModel

# Train LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=100, seed=123, log_freq=20, l2=0.1, lr=0.01)

INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|                                                                             | 0/100 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=2.494]
INFO:root:[20 epochs]: TRAIN:[loss=0.635]
INFO:root:[40 epochs]: TRAIN:[loss=0.605]
 46%|██████████████████████████████▊                                    | 46/100 [00:00<00:00, 455.57epoch/s]INFO:root:[60 epochs]: TRAIN:[loss=0.590]
INFO:root:[80 epochs]: TRAIN:[loss=0.592]
100%|██████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 407.92epoch/s]
INFO:root:Finished Training


In [18]:
from snorkel.analysis import metric_score

preds_dev = label_model.predict(L_dev)

acc = metric_score(Y_dev, preds_dev, probs=None, metric="accuracy")

print(f"LabelModel Accuracy: {acc:.3f}")

LabelModel Accuracy: 0.920


The model achieves very high accuracy on the development set. This is due to the abundance of high-quality crowdworker labels. **Since we don't have these high-quality crowdsourced labels for the test set or new incoming data points, we can't use the LabelModel reliably at inference time**. In order to run inference on new incoming data points, we need to train a discriminative model over the tweets themselves.

In [19]:
preds_train = label_model.predict(L_train)
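
`predict` returns hard labels. Snorkel's `LabelModel` also exposes `predict_proba`, which returns one probability per class for each data point. The sketch below uses made-up probabilities, not output from the model trained above, to show how such soft labels relate to hard labels through an argmax. `to_hard_labels` is a hypothetical helper.

```python
# Made-up class probabilities [P(negative), P(positive)] for three tweets.
toy_probs = [
    [0.10, 0.90],
    [0.65, 0.35],
    [0.48, 0.52],
]

def to_hard_labels(probs):
    """Pick the index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

hard = to_hard_labels(toy_probs)
print(hard)
```

The third row is close to a coin flip, yet its hard label looks as certain as the first row's. Soft labels keep that uncertainty, while hard labels discard it.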

## Use Soft Labels to Train End Model

### Getting features from BERT

In [24]:
import numpy as np
import torch
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

INFO:pytorch_transformers.modeling_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /Users/scottchu/.cache/torch/pytorch_transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
INFO:pytorch_transformers.modeling_utils:Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:pytorch_

In [25]:
def encode_text(text):
  input_ids = torch.tensor([tokenizer.encode(text)])
  return model(input_ids)[0].mean(1)[0].detach().numpy()
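
`encode_text` takes the first model output (the per-token hidden states), averages it over the token axis, and returns one fixed-length vector per tweet. The pooling step on its own can be sketched with plain Python lists standing in for tensors. The tiny 3-dimensional embeddings and the `mean_pool` helper are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n_tokens = len(token_vectors)
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_vectors)]

# Three tokens, each with a made-up 3-dimensional embedding.
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 4.0],
    [2.0, 3.0, 0.0],
]
pooled = mean_pool(tokens)
print(pooled)
```

Because the average is taken over tokens, tweets of different lengths all map to vectors of the same size, which is what the downstream classifier requires.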

In [26]:
X_train = np.array(list(df_train.tweet_text.apply(encode_text).values))
X_test = np.array(list(df_test.tweet_text.apply(encode_text).values))

### Model on labels

In [37]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(X_train, preds_train)

LogisticRegression(solver='liblinear')

In [38]:
print(f"Accuracy of trained model: {sklearn_model.score(X_test, Y_test)}")

Accuracy of trained model: 0.86


## Further Reading
- [BERT](https://github.com/google-research/bert)