# Crowdsourcing tutorial
In this tutorial, we'll provide a simple walkthrough of how to use Snorkel to resolve conflicts
in a noisy crowdsourced dataset for a sentiment analysis task.
As in most Snorkel labeling pipelines, we'll use these denoised labels to train a deep learning model
that can be applied to new, unseen data to make predictions automatically.

In this tutorial, we're using the
[Weather Sentiment](https://data.world/crowdflower/weather-sentiment)
dataset from Figure Eight.
In this task, contributors were asked to grade the sentiment of a particular tweet relating
to the weather.
Contributors could choose among the following categories:

* Positive
* Negative
* I can't tell
* Neutral / author is just sharing information
* Tweet not related to weather condition

The catch is that 20 contributors graded each tweet, and in many cases contributors assigned
conflicting sentiment labels to the same tweet. Our goal is to label each tweet as either
positive or negative.

This is a common issue when dealing with crowdsourced labeling workloads.
We've also altered the dataset to reflect a realistic crowdsourcing pipeline
where only a subset of our full training set has received crowd labels.
We'll encode the crowd labels themselves as labeling functions in order to learn trust
weights for each crowdworker, and write a few heuristic labeling functions to cover the
data points without crowd labels.
Snorkel's ability to build high-quality datasets from multiple noisy labeling
signals makes it an ideal framework to approach this problem.
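
For comparison, the simplest way to resolve conflicting crowd labels is an unweighted majority vote, which treats every contributor as equally reliable. The sketch below is that naive baseline, not Snorkel's method; the votes in `toy_votes` and the helper `majority_vote` are made up for illustration.

```python
from collections import Counter

ABSTAIN = -1

# Hypothetical crowd answers (not rows from the real dataset):
# tweet -> list of worker votes, where 1 = positive and 0 = negative.
toy_votes = {
    "tweet_a": [1, 1, 0, 1],
    "tweet_b": [0, 1, 0, 0, 1],
    "tweet_c": [1, 0],
}


def majority_vote(votes):
    """Return the most common vote, or ABSTAIN on an empty list or a tie."""
    counts = Counter(votes).most_common()
    if not counts:
        return ABSTAIN
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


resolved = {tweet: majority_vote(votes) for tweet, votes in toy_votes.items()}
print(resolved)
```

A majority vote cannot break the tie on `tweet_c`, and it gives a careless worker the same say as a careful one. Learning a trust weight per worker is meant to address exactly that.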

We start by loading our data, which has 287 examples. We take 50 for our development set and 50 for our test set; the remaining 187 examples form our training set. Of these, 100 have crowd labels and the remaining 87 do not. This dataset is very small, and we're using it primarily for demonstration purposes.

The labels above have been mapped to integers, which we show here.

## Loading Crowdsourcing Dataset

In [1]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("crowdsourcing")

In [2]:
from data import load_data, answer_mapping

crowd_answers, df_train, df_dev, df_test = load_data()
Y_dev = df_dev.sentiment.values
Y_test = df_test.sentiment.values

print("Answer to int mapping:")
for k, v in sorted(answer_mapping.items(), key=lambda kv: kv[1]):
    print(f"{k:<50}{v}")

Answer to int mapping:
I can't tell                                      -1
Negative                                          0
Positive                                          1
Neutral / author is just sharing information      2
Tweet not related to weather condition            3


First, let's take a look at our development set to get a sense of what the tweets look like.

In [3]:
df_dev.head()

| tweet_id (index) | tweet_id | tweet_text | sentiment |
|---|---|---|---|
| 79197834 | 79197834 | @mention not in sunny dover! haha | 1 |
| 80059939 | 80059939 | It is literally pissing it down in sideways ra... | 0 |
| 79196441 | 79196441 | Dear perfect weather, thanks for the vest lunc... | 1 |
| 84047300 | 84047300 | RT @mention: I can't wait for the storm tonigh... | 1 |
| 83255121 | 83255121 | 60 degrees. And its almost the end of may. Wis... | 0 |


Now let's take a look at the crowd labels. We'll convert these into labeling functions.

In [4]:
crowd_answers.head()

| tweet_id (index) | worker_id | answer |
|---|---|---|
| 82510997 | 18034918 | 1 |
| 82510997 | 7450342 | 1 |
| 82510997 | 18465660 | 1 |
| 82510997 | 17475684 | 0 |
| 82510997 | 14472526 | 1 |


## Writing Labeling Functions
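Each crowd worker can be thought of as a labeling function: each worker labels only a subset of the examples and may make errors or disagree with other workers, just as labeling functions do. We therefore build labeling functions from each worker's answers, keeping only workers who submitted more than 10 answers. Because our goal is a binary positive/negative label, the code below creates two labeling functions per worker: one that votes positive when the worker answered "Positive" and one that votes negative when the worker answered "Negative". Both abstain on any other answer and on tweets the worker did not label.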
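
The lookup-and-abstain idea can be sketched in plain Python, without Snorkel. This is a simplified illustration: `toy_worker_dict`, `worker_vote`, and the tweet IDs are hypothetical, and the tutorial's own labeling functions additionally restrict each function to a single polarity.

```python
from types import SimpleNamespace

ABSTAIN = -1

# Hypothetical answers from one worker: tweet_id -> submitted label.
toy_worker_dict = {101: 1, 102: 0}


def worker_vote(x, worker_dict):
    """Return the worker's label for this tweet, or ABSTAIN if they never labeled it."""
    return worker_dict.get(x.tweet_id, ABSTAIN)


# Stand-ins for DataFrame rows; only the tweet_id attribute is needed here.
tweets = [SimpleNamespace(tweet_id=t) for t in (101, 102, 103)]
votes = [worker_vote(x, toy_worker_dict) for x in tweets]
print(votes)  # [1, 0, -1]
```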

### Crowd worker labeling functions

In [5]:
labels_by_annotator = crowd_answers.groupby("worker_id")
worker_dicts = {}
for worker_id in labels_by_annotator.groups:
    worker_df = labels_by_annotator.get_group(worker_id)[["answer"]]
    if len(worker_df) > 10:
        worker_dicts[worker_id] = dict(zip(worker_df.index, worker_df.answer))

print("Number of workers:", len(worker_dicts))

Number of workers: 79


In [6]:
from snorkel.labeling.lf import LabelingFunction


def f_pos(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 1 if label == 1 else -1


def f_neg(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 0 if label == 0 else -1


def get_worker_labeling_function(worker_id, f):
    worker_dict = worker_dicts[worker_id]
    name = f"worker_{worker_id}"
    return LabelingFunction(name, f=f, resources={"worker_dict": worker_dict})


worker_lfs_pos = [
    get_worker_labeling_function(worker_id, f_pos) for worker_id in worker_dicts
]
worker_lfs_neg = [
    get_worker_labeling_function(worker_id, f_neg) for worker_id in worker_dicts
]

Let's take a quick look at how well they do on the development set.

In [7]:
from snorkel.labeling.apply import PandasLFApplier

lfs = worker_lfs_pos + worker_lfs_neg
lf_names = [lf.name for lf in lfs]

applier = PandasLFApplier(lfs)
L_dev = applier.apply(df_dev)


In [8]:
from snorkel.labeling.analysis import LFAnalysis

LFAnalysis(L_dev).lf_summary(Y_dev, lf_names=lf_names).head(10)

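| Labeling function | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| worker_6340330 | 0 | [1] | 0.06 | 0.06 | 0.06 | 3 | 0 | 1.0 |
| worker_6340848 | 1 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |
| worker_6344001 | 2 | [1] | 0.22 | 0.22 | 0.14 | 11 | 0 | 1.0 |
| worker_6346694 | 3 | [1] | 0.18 | 0.18 | 0.12 | 8 | 1 | 0.888889 |
| worker_6348036 | 4 | [] | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| worker_6363996 | 5 | [1] | 0.16 | 0.16 | 0.08 | 8 | 0 | 1.0 |
| worker_6369809 | 6 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |
| worker_6371053 | 7 | [1] | 0.1 | 0.1 | 0.1 | 2 | 3 | 0.4 |
| worker_6453108 | 8 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |
| worker_6737418 | 9 | [1] | 0.08 | 0.08 | 0.06 | 3 | 1 | 0.75 |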
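
To make the Coverage and Emp. Acc. columns concrete, here is a minimal sketch that computes both from a small label matrix. It assumes coverage is the fraction of data points a labeling function does not abstain on, and empirical accuracy is correct votes divided by all non-abstaining votes. `L_toy`, `Y_toy`, and `coverage_and_accuracy` are hypothetical names, and this is not Snorkel's implementation.

```python
ABSTAIN = -1

# Hypothetical label matrix: one row per data point, one column per labeling function.
L_toy = [
    [1, -1, 0],
    [1, 0, -1],
    [-1, 0, 0],
    [1, -1, -1],
]
Y_toy = [1, 0, 0, 0]  # hypothetical gold labels


def coverage_and_accuracy(L, Y, j):
    """Return (coverage, empirical accuracy) for column j of the label matrix L."""
    votes = [(row[j], y) for row, y in zip(L, Y) if row[j] != ABSTAIN]
    coverage = len(votes) / len(L)
    correct = sum(1 for vote, y in votes if vote == y)
    # A labeling function that always abstains is reported with accuracy 0.0.
    accuracy = correct / len(votes) if votes else 0.0
    return coverage, accuracy


for j in range(3):
    print(j, coverage_and_accuracy(L_toy, Y_toy, j))
```

Under these definitions, a labeling function with only a handful of votes can show an accuracy of 1.0 while covering very little of the data, so the two columns are best read together.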


### Additional labeling functions

We can mix the crowd worker labeling functions with labeling functions of other types.
We'll use a few varied approaches and let the label model learn how to combine their outputs.

In [9]:
from snorkel.labeling.lf import labeling_function
from snorkel.labeling.preprocess import preprocessor
from textblob import TextBlob


@preprocessor()
def textblob_polarity(x):
    scores = TextBlob(x.tweet_text)
    x.polarity = scores.polarity
    return x


textblob_polarity.memoize = True


@labeling_function(preprocessors=[textblob_polarity])
def polarity_positive(x):
    return 1 if x.polarity > 0.3 else -1


@labeling_function(preprocessors=[textblob_polarity])
def polarity_negative(x):
    return 0 if x.polarity < -0.25 else -1


@labeling_function(preprocessors=[textblob_polarity])
def polarity_negative_2(x):
    return 0 if x.polarity <= 0.3 else -1
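
The three thresholds above can be checked without TextBlob by applying the same comparisons to a few polarity values directly (TextBlob polarity scores range from -1 to 1). The function names below are hypothetical stand-ins that mirror the labeling functions; they are not part of the tutorial's pipeline.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def positive_vote(polarity):
    # Mirrors polarity_positive: vote positive only above 0.3.
    return POSITIVE if polarity > 0.3 else ABSTAIN


def negative_vote(polarity):
    # Mirrors polarity_negative: vote negative only below -0.25.
    return NEGATIVE if polarity < -0.25 else ABSTAIN


def negative_vote_2(polarity):
    # Mirrors polarity_negative_2: vote negative at or below 0.3.
    return NEGATIVE if polarity <= 0.3 else ABSTAIN


for polarity in (-0.5, 0.0, 0.3, 0.6):
    votes = (positive_vote(polarity), negative_vote(polarity), negative_vote_2(polarity))
    print(polarity, votes)
```

Because `negative_vote_2` fires exactly when `positive_vote` abstains, the two together label every tweet. The stricter `negative_vote` adds a second, higher-precision negative signal for strongly negative text.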

### Applying labeling functions to the training set

In [10]:
from snorkel.labeling.apply import PandasLFApplier

text_lfs = [polarity_positive, polarity_negative, polarity_negative_2]
lfs = text_lfs + worker_lfs_pos + worker_lfs_neg
lf_names = [lf.name for lf in lfs]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

In [11]:
LFAnalysis(L_dev).lf_summary(Y_dev, lf_names=lf_names).head(10)

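| Labeling function | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| polarity_positive | 0 | [1] | 0.3 | 0.3 | 0.18 | 15 | 0 | 1.0 |
| polarity_negative | 1 | [0] | 0.1 | 0.1 | 0.1 | 5 | 0 | 1.0 |
| polarity_negative_2 | 2 | [0] | 0.7 | 0.7 | 0.62 | 26 | 9 | 0.742857 |
| worker_6340330 | 3 | [1] | 0.06 | 0.06 | 0.06 | 3 | 0 | 1.0 |
| worker_6340848 | 4 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |
| worker_6344001 | 5 | [1] | 0.22 | 0.22 | 0.16 | 11 | 0 | 1.0 |
| worker_6346694 | 6 | [1] | 0.18 | 0.18 | 0.12 | 8 | 1 | 0.888889 |
| worker_6348036 | 7 | [] | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| worker_6363996 | 8 | [1] | 0.16 | 0.16 | 0.1 | 8 | 0 | 1.0 |
| worker_6369809 | 9 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |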


## Train Label Model And Generate Soft Labels

In [12]:
from snorkel.labeling.model.label_model import LabelModel

# Train label model.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=100, seed=123, log_freq=20, l2=0.1, lr=0.01)

Computing O...
Estimating \mu...
[0 epochs]: TRAIN:[loss=2.186]


[20 epochs]: TRAIN:[loss=0.612]
[40 epochs]: TRAIN:[loss=0.544]
[60 epochs]: TRAIN:[loss=0.545]


[80 epochs]: TRAIN:[loss=0.541]
Finished Training


As a spot-check for the quality of our label model, we'll score it on the dev set.

In [13]:
from snorkel.analysis.metrics import metric_score
from snorkel.analysis.utils import probs_to_preds

Y_dev_prob = label_model.predict_proba(L_dev)
Y_dev_pred = probs_to_preds(Y_dev_prob)

acc = metric_score(Y_dev, Y_dev_pred, probs=None, metric="accuracy")
print(f"Label Model Accuracy: {acc:.3f}")

Label Model Accuracy: 1.000


Look at that: we get perfect accuracy on the 50-example development set. This is due to the abundance of high-quality crowd worker labels. To train a discriminative model, let's generate a set of probabilistic labels for the training set.

In [14]:
Y_train_prob = label_model.predict_proba(L_train)
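
To make the difference between probabilistic (soft) labels and hard predictions concrete, the sketch below converts made-up probability rows into hard labels by taking the most likely class, then scores them. It only approximates what `probs_to_preds` and `metric_score` do above: the tie-breaking rule here (first index wins) is an assumption, and all names and values are hypothetical.

```python
# Hypothetical soft labels: one [P(negative), P(positive)] row per tweet.
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
toy_gold = [0, 1, 0, 0]  # hypothetical gold labels


def argmax_preds(probs):
    """Pick the index of the largest probability in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def accuracy(gold, preds):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, preds)) / len(gold)


toy_preds = argmax_preds(toy_probs)
print(toy_preds, accuracy(toy_gold, toy_preds))
```

The hard predictions discard how confident each row was. The soft rows keep that information, which is why the end model below is trained on the probabilities rather than on the argmax.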

## Use Soft Labels to Train End Model

For simplicity and speed, we use a simple "bag of n-grams" feature representation: each data point is represented by a vector counting how often each word or 2-word combination appears in the tweet text.
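
As a rough illustration of the n-gram representation, the sketch below counts unigrams and bigrams for one made-up tweet in plain Python. It uses naive lowercase whitespace tokenization, which is an assumption and differs from `CountVectorizer`'s default tokenizer; `CountVectorizer` records counts rather than presence flags by default. The helper name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count word n-grams of length 1..max_n using naive whitespace tokenization."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("so sunny so nice")
print(features)
```

Each distinct n-gram seen across the training tweets becomes one column of the feature matrix, so most entries in any given row are zero.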

### Featurization

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

train_tokens = [row.tweet_text for _, row in df_train.iterrows()]
test_tokens = [row.tweet_text for _, row in df_test.iterrows()]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_tokens).toarray().astype("float")
X_test = vectorizer.transform(test_tokens).toarray().astype("float")

### Model on soft labels
Now we train a simple multilayer perceptron (MLP) on the bag-of-n-grams features, using the probabilistic labels obtained from our label model as targets.

In [16]:
import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))
model.compile("Adam", "categorical_crossentropy")
callbacks = model.fit(X_train, Y_train_prob, epochs=100, verbose=0)
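
Categorical cross-entropy accepts a probability distribution as the target, which is what lets the model train directly on soft labels. The sketch below computes that loss by hand for one made-up target; `soft_cross_entropy` and the probability values are hypothetical, and this is not Keras's implementation.

```python
import math


def soft_cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a soft target distribution and a predicted distribution."""
    return -sum(
        t * math.log(max(p, eps)) for t, p in zip(target_probs, predicted_probs)
    )


soft_target = [0.2, 0.8]   # hypothetical label-model output for one tweet
overconfident = [0.01, 0.99]
matched = [0.2, 0.8]

loss_overconfident = soft_cross_entropy(soft_target, overconfident)
loss_matched = soft_cross_entropy(soft_target, matched)
print(loss_overconfident, loss_matched)
```

The loss is lowest when the prediction reproduces the soft target, so the model is penalized for being more certain than the label model was about a given tweet.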


In [17]:
probs = model.predict(X_test)
preds = probs_to_preds(probs)
acc = metric_score(Y_test, preds=preds, metric="accuracy")
print(f"Test Accuracy when trained with soft training labels: {acc:.3f}")

Test Accuracy when trained with soft training labels: 0.640
