# Crowdsourcing tutorial
In this tutorial, we'll provide a simple walkthrough of how to use Snorkel alongside crowdsourcing to generate labels for a sentiment analysis task.
We have crowdsourced labels for about half of the training dataset.
The crowdsourcing labels are of a fairly high quality, but do not cover the entire training dataset, nor are they available for the test set or during inference.
To make up for their lack of training set coverage, we combine crowdsourcing labels with heuristic labeling functions to increase the number of training labels we have.
As in most Snorkel labeling pipelines, we'll use the denoised labels to train a
model that can be applied to new, unseen data to make predictions automatically.

In this tutorial, we're using the
[Weather Sentiment](https://data.world/crowdflower/weather-sentiment)
dataset from Figure Eight.
Our goal is to label each tweet as either positive or negative so that
we can train a language model over the tweets themselves that can be applied
to new, unseen data points.
Crowd workers were asked to grade the sentiment of a
particular tweet relating to the weather. They could say it was positive or
negative, or choose one of three other options saying they weren't sure it was
positive or negative.

The catch is that 20 crowd workers graded each tweet, and in many cases
crowd workers assigned conflicting sentiment labels to the same tweet.
This is a common issue when dealing with crowdsourced labeling workloads.
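
As a point of reference, the simplest way to resolve such conflicts is a majority vote over the workers' answers. The sketch below uses made-up answers (not taken from the dataset) and a hypothetical helper named `majority_vote`; it is only a baseline for intuition, not what Snorkel's LabelModel does.

```python
from collections import Counter

ABSTAIN = -1  # same convention as the tutorial: -1 means "no label"


def majority_vote(answers):
    """Return the most common non-abstain answer, or ABSTAIN on a tie or no votes."""
    votes = Counter(a for a in answers if a != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two answers are tied
    return ranked[0][0]


# Hypothetical answers from workers for one tweet (1 = positive, 0 = negative).
print(majority_vote([1, 1, 1, 0, 1]))  # 1
print(majority_vote([1, 0]))           # -1 (tie)
```

A majority vote treats every worker as equally reliable, which is the assumption the rest of this tutorial relaxes.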

We've also altered the dataset to reflect a realistic crowdsourcing pipeline
in which only a subset of our full training set has received crowd labels.
Since our objective is to classify tweets as positive or negative, we limited
the dataset to tweets that were either positive or negative.

We'll encode the crowd labels themselves as labeling functions in order
to learn trust weights for each crowd worker, and write a few heuristic
labeling functions to cover the data points without crowd labels.
Snorkel's ability to build high-quality datasets from multiple noisy labeling
signals makes it an ideal framework to approach this problem.
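
To give a feel for what a per-worker trust weight buys us, here is a minimal sketch of a weighted vote in which each worker's vote counts in proportion to the log-odds of an assumed accuracy. The worker names, accuracies, and the helper `weighted_vote` are all invented for illustration. This is not the LabelModel's algorithm, which estimates its weights from the label matrix alone, without ground-truth labels.

```python
import math


def weighted_vote(votes, accuracies):
    """Combine binary votes (0/1) using the log-odds of each worker's assumed accuracy.

    Returns 1 if the weighted score is strictly positive, otherwise 0.
    """
    score = 0.0
    for worker, vote in votes.items():
        acc = accuracies[worker]
        weight = math.log(acc / (1.0 - acc))  # more accurate workers get larger weights
        score += weight if vote == 1 else -weight
    return 1 if score > 0 else 0


# Hypothetical accuracies and votes; none of these come from the dataset.
accuracies = {"w1": 0.95, "w2": 0.6, "w3": 0.6}
votes = {"w1": 0, "w2": 1, "w3": 1}
print(weighted_vote(votes, accuracies))  # 0
```

In this toy case the single highly trusted worker outweighs the two less reliable workers who disagree with them.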

We start by loading our data, which has 287 examples in total.
We take 50 for our development set and 50 for our test set.
The remaining 187 examples form our training set.
This data set is very small, and we're primarily using it for demonstration purposes.
In particular, we'd expect to have access to many more unlabeled tweets in order to
train a high performance text model.
Since the dataset is already small, we skip
using a validation set.

The crowd answers have been mapped to integers; the mapping is printed below.

## Loading Crowdsourcing Dataset

In [1]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("crowdsourcing")

In [2]:
from data import load_data, answer_mapping

crowd_answers, df_train, df_dev, df_test = load_data()
Y_dev = df_dev.sentiment.values
Y_test = df_test.sentiment.values

print("Answer to int mapping:")
for k, v in sorted(answer_mapping.items(), key=lambda kv: kv[1]):
    print(f"{k:<50}{v}")

Answer to int mapping:
I can't tell                                      -1
Negative                                          0
Positive                                          1
Neutral / author is just sharing information      2
Tweet not related to weather condition            3


First, let's take a look at our development set to get a sense of
what the tweets look like.

In [3]:
import pandas as pd

# Don't truncate text fields in the display
pd.set_option("display.max_colwidth", 0)

df_dev.head()

| tweet_id (index) | tweet_id | tweet_text | sentiment |
|---|---|---|---|
| 79197834 | 79197834 | @mention not in sunny dover! haha | 1 |
| 80059939 | 80059939 | It is literally pissing it down in sideways rain. I have nothing to protect me from this monstrous weather. | 0 |
| 79196441 | 79196441 | Dear perfect weather, thanks for the vest lunch hour of all time. (@ Lady Bird Lake Trail w/ 2 others) {link} | 1 |
| 84047300 | 84047300 | RT @mention: I can't wait for the storm tonight :) | 1 |
| 83255121 | 83255121 | 60 degrees. And its almost the end of may. Wisconsin... I hate you. | 0 |


Now let's take a look at the crowd labels.
We'll convert these into labeling functions.

In [4]:
crowd_answers.head()

| tweet_id | worker_id | answer |
|---|---|---|
| 82510997 | 18034918 | 1 |
| 82510997 | 7450342 | 1 |
| 82510997 | 18465660 | 1 |
| 82510997 | 17475684 | 0 |
| 82510997 | 14472526 | 1 |


## Writing Labeling Functions
Each crowd worker can be thought of as a labeling function,
as each worker labels a subset of examples
and may make errors or conflict with other workers / labeling functions.
We keep only workers who submitted more than 10 answers and create two labeling functions for each:
one that votes positive when the worker answered positive, and one that votes negative when the worker answered negative.
Both abstain when the worker gave any other answer or didn't submit an answer for the tweet.

### Crowd worker labeling functions

In [5]:
labels_by_annotator = crowd_answers.groupby("worker_id")
worker_dicts = {}
for worker_id in labels_by_annotator.groups:
    worker_df = labels_by_annotator.get_group(worker_id)[["answer"]]
    if len(worker_df) > 10:
        worker_dicts[worker_id] = dict(zip(worker_df.index, worker_df.answer))

print("Number of workers:", len(worker_dicts))

Number of workers: 68


In [6]:
from snorkel.labeling.lf import LabelingFunction


def f_pos(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 1 if label == 1 else -1


def f_neg(x, worker_dict):
    label = worker_dict.get(x.tweet_id)
    return 0 if label == 0 else -1


def get_worker_labeling_function(worker_id, f):
    worker_dict = worker_dicts[worker_id]
    name = f"worker_{worker_id}"
    return LabelingFunction(name, f=f, resources={"worker_dict": worker_dict})


worker_lfs_pos = [
    get_worker_labeling_function(worker_id, f_pos) for worker_id in worker_dicts
]
worker_lfs_neg = [
    get_worker_labeling_function(worker_id, f_neg) for worker_id in worker_dicts
]
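
To make the abstain behaviour concrete, the sketch below re-declares the two functions in plain Python and runs them on a made-up worker dictionary. The tweet IDs and answers are hypothetical, and `SimpleNamespace` stands in for a DataFrame row so the example runs without Snorkel.

```python
from types import SimpleNamespace

ABSTAIN = -1


def f_pos(x, worker_dict):
    # Vote positive only if this worker marked the tweet positive.
    return 1 if worker_dict.get(x.tweet_id) == 1 else ABSTAIN


def f_neg(x, worker_dict):
    # Vote negative only if this worker marked the tweet negative.
    return 0 if worker_dict.get(x.tweet_id) == 0 else ABSTAIN


# Hypothetical worker: tweet 101 positive, 102 negative, 103 "neutral" (2).
toy_worker = {101: 1, 102: 0, 103: 2}

results = {}
for tweet_id in (101, 102, 103, 104):  # 104 was never shown to this worker
    x = SimpleNamespace(tweet_id=tweet_id)
    results[tweet_id] = (f_pos(x, toy_worker), f_neg(x, toy_worker))
    print(tweet_id, results[tweet_id])
```

Only tweets 101 and 102 receive a vote; the neutral answer and the unseen tweet both yield `(-1, -1)`.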

Let's take a quick look at how well they do on the development set.

In [7]:
from snorkel.labeling.apply import PandasLFApplier

lfs = worker_lfs_pos + worker_lfs_neg

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)


In [8]:
from snorkel.labeling.analysis import LFAnalysis

LFAnalysis(L_dev, lfs).lf_summary(Y_dev).head()

|  | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| worker_6340330 | 0 | [1] | 0.04 | 0.04 | 0.04 | 2 | 0 | 1.0 |
| worker_6344001 | 1 | [1] | 0.04 | 0.04 | 0.04 | 2 | 0 | 1.0 |
| worker_6346694 | 2 | [1] | 0.12 | 0.12 | 0.1 | 5 | 1 | 0.833333 |
| worker_6363996 | 3 | [1] | 0.04 | 0.04 | 0.02 | 2 | 0 | 1.0 |
| worker_6371053 | 4 | [1] | 0.06 | 0.06 | 0.06 | 2 | 1 | 0.666667 |
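
The Coverage, Correct, Incorrect, and Emp. Acc. columns are related by simple arithmetic. The sketch below recomputes them for a synthetic vote column built to have the same counts as the `worker_6346694` row (50 dev tweets, 6 votes, 5 of them correct); the helper `lf_stats` and the data are illustrative assumptions, not Snorkel code.

```python
ABSTAIN = -1


def lf_stats(lf_votes, gold):
    """Coverage, correct count, incorrect count, and empirical accuracy for one LF."""
    labeled = [(v, y) for v, y in zip(lf_votes, gold) if v != ABSTAIN]
    coverage = len(labeled) / len(lf_votes)
    correct = sum(v == y for v, y in labeled)
    accuracy = correct / len(labeled) if labeled else float("nan")
    return coverage, correct, len(labeled) - correct, accuracy


# Synthetic data: the LF votes positive on 6 of 50 tweets, and 5 of those are positive.
gold = [1] * 5 + [0] * 45
votes = [1] * 6 + [ABSTAIN] * 44
print(lf_stats(votes, gold))
```

Coverage is 6/50 = 0.12 and empirical accuracy is 5/6 ≈ 0.833, matching that row. Accuracy is computed only over the tweets the LF actually voted on.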


So the crowd labels are quite good! But how much of our dev and training
sets do they cover?

In [9]:
print("Training set coverage:", LFAnalysis(L_train).label_coverage())
print("Dev set coverage:", LFAnalysis(L_dev).label_coverage())

Training set coverage: 0.5026737967914439
Dev set coverage: 0.5
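
For intuition, label coverage is the fraction of data points that receive at least one non-abstain vote from any labeling function. A minimal pure-Python sketch on a toy label matrix (the matrix and the helper name are made up for illustration):

```python
ABSTAIN = -1


def label_coverage(L):
    """Fraction of rows (data points) with at least one non-abstain vote."""
    covered = sum(any(v != ABSTAIN for v in row) for row in L)
    return covered / len(L)


# Toy label matrix: 4 tweets x 3 labeling functions.
L_toy = [
    [1, ABSTAIN, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [ABSTAIN, 0, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
print(label_coverage(L_toy))  # 0.5
```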


### Additional labeling functions

To improve coverage of the training set, we can mix the crowd worker labeling functions with labeling
functions of other types.
For example, we can use [TextBlob](https://textblob.readthedocs.io/en/dev/index.html), a tool that provides a pretrained sentiment analyzer. We run TextBlob on our tweets and create some simple LFs that threshold its polarity score, similar to what we did in the spam_tutorial.

In [10]:
from snorkel.labeling.lf import labeling_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob


@preprocessor()
def textblob_polarity(x):
    scores = TextBlob(x.tweet_text)
    x.polarity = scores.polarity
    return x


textblob_polarity.memoize = True

# Label high polarity tweets as positive.
@labeling_function(pre=[textblob_polarity])
def polarity_positive(x):
    return 1 if x.polarity > 0.3 else -1


# Label low polarity tweets as negative.
@labeling_function(pre=[textblob_polarity])
def polarity_negative(x):
    return 0 if x.polarity < -0.25 else -1


# Similar to polarity_negative, but with higher coverage and lower precision.
@labeling_function(pre=[textblob_polarity])
def polarity_negative_2(x):
    return 0 if x.polarity <= 0.3 else -1
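
To see how the three thresholds interact, the sketch below applies the same comparisons to a few hand-picked polarity scores (TextBlob polarity lies in [-1, 1]). The helper `text_lf_votes` is hypothetical and bypasses TextBlob and Snorkel entirely.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def text_lf_votes(polarity):
    """Votes of the three polarity LFs for one polarity score in [-1, 1]."""
    return (
        POSITIVE if polarity > 0.3 else ABSTAIN,    # polarity_positive
        NEGATIVE if polarity < -0.25 else ABSTAIN,  # polarity_negative
        NEGATIVE if polarity <= 0.3 else ABSTAIN,   # polarity_negative_2
    )


for p in (-0.6, 0.0, 0.3, 0.8):
    print(p, text_lf_votes(p))
```

Because `polarity > 0.3` and `polarity <= 0.3` together cover every possible score, each tweet gets at least one vote from `polarity_positive` or `polarity_negative_2`. The price is that `polarity_negative_2` also fires on neutral and mildly positive scores, which is why it is the lower-precision LF.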

### Applying labeling functions to the training set

In [11]:
text_lfs = [polarity_positive, polarity_negative, polarity_negative_2]
lfs = text_lfs + worker_lfs_pos + worker_lfs_neg

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)


In [12]:
LFAnalysis(L_dev, lfs).lf_summary(Y_dev).head()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
polarity_positive,0,[1],0.3,0.16,0.12,15,0,1.0
polarity_negative,1,[0],0.1,0.1,0.04,5,0,1.0
polarity_negative_2,2,[0],0.7,0.4,0.32,26,9,0.742857
worker_6340330,3,[1],0.04,0.04,0.04,2,0,1.0
worker_6344001,4,[1],0.04,0.04,0.04,2,0,1.0


Using the text-based LFs, we've expanded coverage on both our training set
and dev set to 100%.
We'll now take these noisy and conflicting labels, and use the LabelModel
to denoise and combine them.

In [13]:
print("Training set coverage:", LFAnalysis(L_train).label_coverage())
print("Dev set coverage:", LFAnalysis(L_dev).label_coverage())

Training set coverage: 1.0
Dev set coverage: 1.0


## Train LabelModel And Generate Probabilistic Labels

In [14]:
from snorkel.labeling.model.label_model import LabelModel

# Train LabelModel.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=100, seed=123, log_freq=20, l2=0.1, lr=0.01)

As a spot-check for the quality of our LabelModel, we'll score it on the dev set.

In [15]:
from snorkel.analysis.metrics import metric_score
from snorkel.analysis.utils import probs_to_preds

Y_dev_prob = label_model.predict_proba(L_dev)
Y_dev_pred = probs_to_preds(Y_dev_prob)

acc = metric_score(Y_dev, Y_dev_pred, probs=None, metric="accuracy")
print(f"LabelModel Accuracy: {acc:.3f}")

LabelModel Accuracy: 0.920


Look at that, we get very high accuracy on the development set.
This is due to the abundance of high quality crowd worker labels.
**Since we don't have these high quality crowdsourcing labels for the
test set or new incoming examples, we can't use the LabelModel reliably
at inference time.**
In order to run inference on new incoming examples, we need to train a
discriminative model over the tweets themselves.
Let's generate a set of probabilistic labels for the training set.

In [16]:
Y_train_prob = label_model.predict_proba(L_train)
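
Each row of the probabilistic label matrix holds one probability per class, here P(negative) and P(positive). Turning these into hard labels amounts to taking the most probable class in each row, as the sketch below shows on made-up values. It is a simplified stand-in for `probs_to_preds`: ties go to the first class here, and the library function may break ties differently.

```python
def probs_to_preds_sketch(probs):
    """Pick the index of the largest probability in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Made-up probabilistic labels for three tweets: columns are P(negative), P(positive).
Y_prob_toy = [[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]]
print(probs_to_preds_sketch(Y_prob_toy))  # [0, 1, 1]
```

Note that the third tweet becomes a plain `1` even though the model was only 55% confident; hard labels discard that uncertainty.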

## Use Soft Labels to Train End Model

### Getting features from BERT
Since we have very limited training data, we cannot train a complex model like an LSTM with a lot of parameters. Instead, we use a pre-trained model, [BERT](https://github.com/google-research/bert), to generate embeddings for each of our tweets, and treat the embedding values as features.

In [17]:
import numpy as np
import torch
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def encode_text(text):
    input_ids = torch.tensor([tokenizer.encode(text)])
    return model(input_ids)[0].mean(1)[0].detach().numpy()


train_vectors = np.array(list(df_train.tweet_text.apply(encode_text).values))
test_vectors = np.array(list(df_test.tweet_text.apply(encode_text).values))
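
`encode_text` takes the model's per-token output vectors and averages them over the token dimension (`.mean(1)`), so every tweet ends up as one fixed-size vector regardless of its length. The pure-Python sketch below shows the same mean pooling on toy numbers; the vectors are invented and far smaller than BERT-base's 768 dimensions.

```python
def mean_pool(token_vectors):
    """Average per-token vectors element-wise into one fixed-size vector."""
    n_tokens = len(token_vectors)
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]


# Toy "embeddings": 3 tokens x 4 dimensions.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
print(mean_pool(tokens))  # [2.0, 1.0, 2.0, 2.0]
```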


### Model on soft labels
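Now, we train a simple logistic regression model on the BERT features, using labels
obtained from our LabelModel.
Because scikit-learn's `LogisticRegression` expects discrete class labels, we convert the
probabilistic labels to hard labels with `probs_to_preds` before fitting.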

In [18]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(train_vectors, probs_to_preds(Y_train_prob))

print(f"Accuracy of trained model: {sklearn_model.score(test_vectors, Y_test)}")

Accuracy of trained model: 0.86


We now have a model whose test accuracy (0.86) is not much lower than the LabelModel's dev accuracy (0.92). Unlike the LabelModel, it needs no crowd labels at inference time, so it is faster and cheaper than crowdsourcing and applicable to all future examples.

## Summary

In this tutorial, we accomplished the following:
* We showed how Snorkel can handle crowdsourced labels, combining them with other programmatic LFs to improve coverage.
* We showed how the LabelModel learns to combine inputs from crowd workers and other LFs by appropriately weighting them to generate high quality probabilistic labels.
* We showed that a classifier trained on the combined labels can achieve a fairly high accuracy while also generalizing to new, unseen examples.