# Training a Sentiment Analysis LSTM Using Noisy Crowd Labels

##### Adapted for pandas instead of Spark

In this tutorial, we walk through how to use Snorkel to resolve conflicts in a noisy crowdsourced dataset for a sentiment analysis task. We then use the denoised labels to train an LSTM sentiment analysis model that can automatically make predictions on new, unseen data. The tutorial covers three steps:

1. Creating basic Snorkel objects: `Candidates`, `Contexts`, and `Labels`
2. Training the `GenerativeModel` to resolve labeling conflicts
3. Training a simple LSTM sentiment analysis model, which can then be used on new, unseen data!

Note that this is a simple tutorial meant to give an overview of the mechanics of using Snorkel; we'll note places where more careful fine-tuning could be done!


### Task Detail: Weather Sentiments in Tweets

In this tutorial we focus on the [Weather sentiment](https://www.crowdflower.com/data/weather-sentiment/) task from [Crowdflower](https://www.crowdflower.com/).

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather. Contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet, and in many cases they assigned conflicting sentiment labels to the same tweet.

The task comes with two data files (found in the `data` directory of the tutorial):
1. [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) contains the raw contributor answers for each of the 1,000 tweets.
2. [weather-evaluated-agg-DFE.csv](data/weather-evaluated-agg-DFE.csv) contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np


## Step 1: Preprocessing - Data Loading

In [2]:
import pandas as pd

In [3]:
# Load raw crowdsourcing data
raw_crowd_answers = pd.read_csv("data/weather-non-agg-DFE.csv")

# Load groundtruth crowdsourcing data
gold_crowd_answers = pd.read_csv("data/weather-evaluated-agg-DFE.csv")

# Filter out low-confidence answers
high_conf = (gold_crowd_answers.correct_category == 'Yes') & (gold_crowd_answers.correct_category_conf == 1)
gold_answers = gold_crowd_answers.loc[high_conf, ['tweet_id', 'sentiment', 'tweet_body']]

# Keep only the tweets with available ground truth.
# The inner join on tweet_id drops raw answers whose tweet has no gold label;
# the suffixes disambiguate the columns that appear in both frames.
candidate_labeled_tweets = raw_crowd_answers.join(
    gold_answers.set_index('tweet_id', drop=False),
    on=['tweet_id'], lsuffix='.raw', rsuffix='.gold', how='inner')
candidate_labeled_tweets = candidate_labeled_tweets[['tweet_id.raw', 'tweet_body.raw', 'worker_id', 'emotion']]
candidate_labeled_tweets.columns = ['tweet_id', 'tweet_body', 'worker_id', 'emotion']

As mentioned above, contributors can provide conflicting labels for the same tweet:

In [4]:
# Show every worker's answer for a single tweet
candidate_labeled_tweets[candidate_labeled_tweets.tweet_id == 79185673] \
    .sort_values('worker_id')[['worker_id', 'emotion', 'tweet_body']]

## Step 2: Generating Snorkel Objects

### `Candidates`

`Candidates` are the core objects in Snorkel; they represent the items to be classified. We'll use a helper function to create a custom `Candidate` sub-class, `Tweet`, with values representing the possible labels that it can be classified with:

In [5]:
from snorkel import SnorkelSession
session = SnorkelSession()

from snorkel.models import candidate_subclass

values = list(candidate_labeled_tweets.emotion.unique())

Tweet = candidate_subclass('Tweet', ['tweet'], values=values)

### `Contexts`

All `Candidate` objects point to one or more `Context` objects, which represent the raw data that they are rooted in. In this case, our candidates will each point to a single `Context` object representing the raw text of the tweet.

Once we have defined the `Context` for each `Candidate`, we can commit them to the database. Note that we also split into two sets while doing this:

1. **Training set (`split=0`):** The tweets for which we have noisy, conflicting crowd labels; we will resolve these conflicts using the `GenerativeModel` and then use them as training data for the LSTM

2. **Test set (`split=1`):** We will pretend that we do not have any crowd labels for this split of the data, and use these to test the LSTM's performance on unseen data

In [6]:
from snorkel.models import Context, Candidate
from snorkel.contrib.models.text import RawText

In [7]:
# Make sure DB is cleared
session.query(Context).delete()
session.query(Candidate).delete()

632

In [8]:
# Collect the unique tweets (one row per tweet), sorted by tweet_id
tweet_bodies = candidate_labeled_tweets \
    [["tweet_id", "tweet_body"]] \
    .sort_values("tweet_id") \
    .drop_duplicates()

In [9]:
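# Generate and store the tweet candidates to be classified.
# Note: we split the tweets into two sets: a test set for which the crowd
# labels are withheld from Snorkel (intended: 10%) and a training set for
# which we assume crowd labels are obtained (intended: 90%).
# Caveat: `i` from iterrows() is the DataFrame index label, not a positional
# counter, so the test split is smaller than 10% here (629 tweets end up in
# the training split below). Use enumerate() for a positional split.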
total_tweets = len(tweet_bodies)
tweet_list = []
test_split = total_tweets*0.1
for i, t in tweet_bodies.iterrows():
    split = 1 if i <= test_split else 0
    raw_text = RawText(stable_id=t.tweet_id, name=t.tweet_id, text=t.tweet_body)
    tweet = Tweet(tweet=raw_text, split=split)
    tweet_list.append(tweet)
    session.add(tweet)
session.commit()
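
Because `iterrows()` yields index labels rather than positions, a split that should cover a fixed fraction of the tweets is easier to get right with `enumerate`. The following is a minimal sketch in plain Python, assuming a list of ids stands in for the rows of `tweet_bodies`; the helper `assign_splits` and the toy ids are hypothetical and not part of Snorkel.

```python
def assign_splits(items, test_fraction=0.1):
    """Return a list of (item, split) pairs; split 1 = test, 0 = train.

    Hypothetical helper: the first `test_fraction` of items by position
    go to the test split, regardless of any index labels they carry.
    """
    n_test = int(len(items) * test_fraction)
    return [(item, 1 if pos < n_test else 0) for pos, item in enumerate(items)]


# Toy ids standing in for the sorted, de-duplicated tweet ids.
tweet_ids = [101, 205, 317, 428, 530, 642, 755, 861, 973, 1084]
splits = assign_splits(tweet_ids, test_fraction=0.2)
n_test = sum(split for _, split in splits)
print(n_test, len(splits) - n_test)
```

With a DataFrame, the same effect could be had by iterating over `enumerate(df.itertuples())` and using the position instead of the index label.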

### `Labels`

Next, we'll store the labels for each of the training candidates in a sparse matrix (which will also automatically be saved to the Snorkel database), with one row for each candidate and one column for each crowd worker:

In [10]:
from snorkel.annotations import LabelAnnotator
from collections import defaultdict  

In [11]:
# A defaultdict works exactly like a normal dict, but it is initialized with a function (“default factory”)
# that takes no arguments and provides the default value for a nonexistent key.
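
As a quick illustration of that behavior, here is a minimal sketch that groups votes by tweet with `defaultdict(list)`. The ids and labels are made up for the example.

```python
from collections import defaultdict

# Toy (tweet_id, worker_id, label) votes; the ids are made up for illustration.
votes = [
    ("t1", "w1", "Positive"),
    ("t1", "w2", "Negative"),
    ("t2", "w1", "Positive"),
]

votes_by_tweet = defaultdict(list)
for tweet_id, worker_id, label in votes:
    # No need to check whether the key exists: a missing key starts as [].
    votes_by_tweet[tweet_id].append((worker_id, label))

print(dict(votes_by_tweet))
```

One thing to keep in mind: merely looking up a missing key (for example `votes_by_tweet["t3"]`) also inserts it with the default value.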

In [12]:
# Extract worker votes (one row per tweet/worker pair)
worker_labels = candidate_labeled_tweets[["tweet_id", "worker_id", "emotion"]]

In [13]:
worker_labels.shape

(12640, 3)

In [14]:
wls = defaultdict(list)
for i, row in worker_labels.iterrows():
    wls[str(row.tweet_id)].append((str(row.worker_id), row.emotion))

In [15]:
wls['82846118']

[('18034918', 'Neutral / author is just sharing information'),
 ('18465660', 'Neutral / author is just sharing information'),
 ('18927389', 'Neutral / author is just sharing information'),
 ('17475684', 'Neutral / author is just sharing information'),
 ('14472526', 'Neutral / author is just sharing information'),
 ('18806438', 'Neutral / author is just sharing information'),
 ('14584835', 'Neutral / author is just sharing information'),
 ('14400603', 'Neutral / author is just sharing information'),
 ('12063015', 'Tweet not related to weather condition'),
 ('19028457', 'Positive'),
 ('14466721', 'Neutral / author is just sharing information'),
 ('15847995', 'Neutral / author is just sharing information'),
 ('10197897', 'Neutral / author is just sharing information'),
 ('20043586', 'Neutral / author is just sharing information'),
 ('7325249', 'Negative'),
 ('17948184', 'Neutral / author is just sharing information'),
 ('18500901', "I can't tell"),
 ('11800825', 'Neutral / author is just 

In [16]:
# Create a label generator
def worker_label_generator(t):
    """A generator over the different (worker_id, label_id) pairs for a Tweet."""
    for worker_id, label in wls[t.tweet.name]:
        yield worker_id, label

In [17]:
labeler = LabelAnnotator(label_generator=worker_label_generator)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...


100%|███████████████████████████████████████████████████████████████████████████████| 629/629 [00:02<00:00, 247.78it/s]


Wall time: 2.61 s


<629x102 sparse matrix of type '<class 'numpy.int32'>'
	with 12580 stored elements in Compressed Sparse Row format>
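
To make the shape of this matrix concrete, here is a minimal dense sketch in plain Python. It assumes labels are encoded as integers 1..K, with 0 meaning that a worker did not label the tweet. The helper `build_label_matrix` and the toy data are hypothetical, and Snorkel stores the real matrix in sparse form.

```python
def build_label_matrix(votes_by_tweet, label_values):
    """Build a dense tweets x workers matrix from per-tweet vote lists.

    Entry (i, j) is the 1-based index of the label worker j gave tweet i,
    or 0 if that worker did not label that tweet (an assumed convention).
    """
    tweet_ids = sorted(votes_by_tweet)
    worker_ids = sorted({w for votes in votes_by_tweet.values() for w, _ in votes})
    col = {w: j for j, w in enumerate(worker_ids)}
    matrix = [[0] * len(worker_ids) for _ in tweet_ids]
    for i, t in enumerate(tweet_ids):
        for worker_id, label in votes_by_tweet[t]:
            matrix[i][col[worker_id]] = label_values.index(label) + 1
    return tweet_ids, worker_ids, matrix


label_values = ["Positive", "Negative", "Neutral"]
votes_by_tweet = {
    "t1": [("w1", "Positive"), ("w2", "Negative")],
    "t2": [("w2", "Neutral"), ("w3", "Neutral")],
}
tweet_ids, worker_ids, L = build_label_matrix(votes_by_tweet, label_values)
for row in L:
    print(row)
```

In the real matrix above, the 12580 stored elements over 629 rows correspond to 20 votes per tweet.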

Finally, we load the ground truth ("gold") labels for both the training and test sets, and store them as numpy arrays:

In [18]:
gold_labels = defaultdict(list)

# Get gold labels in verbose form
verbose_labels = dict([(str(t.tweet_id), t.sentiment) 
                       for i, t in gold_answers[["tweet_id", "sentiment"]].iterrows()])

In [19]:
# Iterate over splits, align with Candidate ordering

for split in range(2):
    cands = session.query(Tweet).filter(Tweet.split == split).order_by(Tweet.id).all() 
    for c in cands:
        # Encode each gold label as an integer in 1..5: its position in `values`, plus 1
        gold_labels[split].append(values.index(verbose_labels[c.tweet.name]) + 1) 

In [20]:
train_cand_labels = np.array(gold_labels[0])
test_cand_labels = np.array(gold_labels[1])

## Step 3: Resolving Crowd Conflicts with the Generative Model

So far, we have converted the raw crowdsourced data into a labeling matrix that can be provided as input to Snorkel. We will now show how to:

1. Use Snorkel's generative model to learn the accuracy of each crowd contributor.
2. Use the learned model to estimate a marginal distribution over the domain of possible labels for each task.
3. Use the estimated marginal distribution to obtain the maximum a posteriori probability estimate for the label that each task takes.
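
To build intuition for steps 2 and 3, the sketch below computes a marginal distribution for one tweet from worker votes and per-worker accuracies, then takes the most probable label. It is a simplified naive-Bayes-style model, not Snorkel's `GenerativeModel`. It assumes that workers vote independently, that errors are spread uniformly over the wrong labels, that the prior over labels is uniform, and that the accuracies are already known (Snorkel estimates them from the label matrix). All names are hypothetical.

```python
import math


def vote_marginals(votes, accuracies, n_labels):
    """Posterior over labels 1..n_labels for one item.

    votes: list of (worker_id, label) with label in 1..n_labels.
    accuracies: dict worker_id -> probability that the worker is correct
                (assumed known here, and strictly between 0 and 1).
    """
    log_post = [0.0] * n_labels  # uniform prior
    for worker_id, label in votes:
        acc = accuracies[worker_id]
        wrong = (1.0 - acc) / (n_labels - 1)  # mass on each wrong label
        for y in range(1, n_labels + 1):
            log_post[y - 1] += math.log(acc if label == y else wrong)
    # Normalize in a numerically stable way.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


accuracies = {"w1": 0.9, "w2": 0.6, "w3": 0.6}
votes = [("w1", 1), ("w2", 2), ("w3", 2)]
marginals = vote_marginals(votes, accuracies, n_labels=3)
map_label = marginals.index(max(marginals)) + 1  # MAP estimate, 1-based
print([round(p, 3) for p in marginals], map_label)
```

In this toy case a plain majority vote would pick label 2, while the accuracy-weighted model prefers label 1 because the single accurate worker outweighs the two weaker ones.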

In [21]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)

In [22]:
# Train the generative model
gen_model.train(
    L_train,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)

Inferred cardinality: 5


### Inferring the MAP assignment for each task
Each task corresponds to an independent random variable. Thus, we can simply associate each task with the most probable label based on the estimated marginal distribution and get an accuracy score:

In [23]:
gen_model.score(L_train, train_cand_labels)


0.9952305246422893

In [24]:
correct, incorrect = gen_model.error_analysis(session, L_train, train_cand_labels)
print("Number incorrect:{}".format(len(incorrect)))

Accuracy: 0.9952305246422893
Number incorrect:3


### Majority vote

It seems like we did well, but how well? This is a fairly simple task: we have 20 contributors per tweet, and most of them are far better than random. **We expect majority voting to perform extremely well**, so we can check against majority vote:

In [25]:
from collections import Counter

# Collect the majority vote answer for each tweet
mv = []
for i in range(L_train.shape[0]):
    c = Counter([L_train[i,j] for j in L_train[i].nonzero()[1]])
    mv.append(c.most_common(1)[0][0])
mv = np.array(mv)

# Count the number correct by majority vote
n_correct = np.sum([1 for i in range(L_train.shape[0]) if mv[i] == train_cand_labels[i]])
print("Accuracy:{}".format(n_correct / float(L_train.shape[0])))
print("Number incorrect:{}".format(L_train.shape[0] - n_correct))

Accuracy:0.9841017488076311
Number incorrect:10
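
One subtlety of majority voting is tie-breaking: `Counter.most_common` returns tied labels in the order they were first encountered, so the loop above picks whichever tied label appears first in the row. The following minimal sketch makes ties explicit. The helper name `majority_vote` is hypothetical, and 0 is assumed to mean "no vote".

```python
from collections import Counter


def majority_vote(row):
    """Return (label, is_tie) for one row of integer votes; 0 means no vote."""
    counts = Counter(v for v in row if v != 0)
    if not counts:
        return 0, False  # nobody voted
    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    # A tie exists when the runner-up has as many votes as the winner.
    is_tie = len(ranked) > 1 and ranked[1][1] == top_count
    return top_label, is_tie


print(majority_vote([1, 1, 2, 0, 1]))
print(majority_vote([2, 1, 0, 1, 2]))
```

Flagging ties lets you decide how to handle them, for instance by breaking them at random or excluding them from the count.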


We see that while majority vote makes 10 errors, the Snorkel model makes only 3!  What about an average crowd worker?

### Average human accuracy

We see that the average accuracy of a single crowd worker is in fact much lower:

In [26]:
accs = []
for j in range(L_train.shape[1]):
    n_correct = np.sum([1 for i in range(L_train.shape[0]) if L_train[i,j] == train_cand_labels[i]])
    acc = n_correct / float(L_train[:,j].nnz)
    accs.append(acc)
print("Mean Accuracy:{}".format(np.mean(accs)))

Mean Accuracy:0.729664764868133
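
The key detail in this computation is that each worker's accuracy is measured only over the tweets that worker actually labeled. The same idea is sketched below on a small dense matrix in plain Python; the helper `worker_accuracies` and the toy data are hypothetical.

```python
def worker_accuracies(matrix, gold):
    """Accuracy of each column (worker) over the rows that worker labeled.

    matrix: list of rows of integer votes, 0 meaning no vote.
    gold:   list of gold labels, one per row.
    """
    n_workers = len(matrix[0])
    accs = []
    for j in range(n_workers):
        labeled = [(row[j], g) for row, g in zip(matrix, gold) if row[j] != 0]
        if not labeled:
            accs.append(None)  # this worker labeled nothing
            continue
        accs.append(sum(1 for v, g in labeled if v == g) / len(labeled))
    return accs


L = [[1, 2, 0],
     [2, 2, 2],
     [3, 0, 1]]
gold = [1, 2, 3]
accs = worker_accuracies(L, gold)
print(accs, sum(accs) / len(accs))
```

Note that a plain mean over workers weights every worker equally, no matter how many tweets each one labeled.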


## Step 4: Training an ML Model with Snorkel for Sentiment Analysis over Unseen Tweets

In the previous step, we saw that Snorkel's generative model can help to denoise crowd labels automatically. However, what happens when we don't have noisy crowd labels for a tweet?

In this step, we'll use the estimates of the generative model as _probabilistic training labels_ to train a simple LSTM sentiment analysis model, which takes as input a tweet **for which no crowd labels are available** and predicts its sentiment.

First, we get the probabilistic training labels (_training marginals_), which are just the marginal estimates of the generative model:

In [27]:
train_marginals = gen_model.marginals(L_train)

In [28]:
from snorkel.annotations import save_marginals
save_marginals(session, L_train, train_marginals)

Saved 629 marginals
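
Training on probabilistic labels typically means minimizing the cross-entropy between the training marginals and the model's predicted distribution, rather than against a single hard label. The sketch below shows that loss for one example in plain Python. It illustrates the general idea under that assumption and is not taken from Snorkel's `TextRNN` implementation; the function name and the numbers are hypothetical.

```python
import math


def soft_cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) = -sum_y target[y] * log(predicted[y])."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_probs, predicted_probs))


marginal = [0.9, 0.1, 0.0]        # probabilistic label for one tweet
pred_good = [0.8, 0.15, 0.05]     # prediction that agrees with the marginal
pred_bad = [0.1, 0.8, 0.1]        # prediction that disagrees

loss_good = soft_cross_entropy(marginal, pred_good)
loss_bad = soft_cross_entropy(marginal, pred_bad)
print(round(loss_good, 4), round(loss_bad, 4))
```

With a one-hot target the same function reduces to the usual negative log-likelihood of the true class, so hard labels are a special case of this loss.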


Next, we'll train a simple LSTM:

In [29]:
from snorkel.learning.tensorflow import TextRNN

In [30]:
train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   200,
    'dropout':    0.2,
    'print_freq': 5
}

lstm = TextRNN(seed=1701, cardinality=Tweet.cardinality)
train_cands = session.query(Tweet).filter(Tweet.split == 0).order_by(Tweet.id).all()
lstm.train(train_cands, train_marginals, **train_kwargs)




[TextRNN] Training model
[TextRNN] n_train=629  #epochs=200  batch size=256
[TextRNN] Epoch 0 (1.82s)	Average loss=1.547467
[TextRNN] Epoch 5 (2.70s)	Average loss=0.177643
[TextRNN] Epoch 10 (3.58s)	Average loss=0.050922
[TextRNN] Epoch 15 (4.47s)	Average loss=0.054936
[TextRNN] Epoch 20 (5.35s)	Average loss=0.026458
[TextRNN] Epoch 25 (6.23s)	Average loss=0.024820
[TextRNN] Epoch 30 (7.13s)	Average loss=0.024731
[TextRNN] Epoch 35 (8.00s)	Average loss=0.027555
[TextRNN] Epoch 40 (8.88s)	Average loss=0.024720
[TextRNN] Epoch 45 (9.74s)	Average loss=0.029702
[TextRNN] Epoch 50 (10.62s)	Average loss=0.022299
[TextRNN] Epoch 55 (11.47s)	Average loss=0.023716
[TextRNN] Epoch 60 (12.36s)	Average loss=0.020875
[TextRNN] Epoch 65 (13.26s)	Average loss=0.027355
[TextRNN] Epoch 70 (14.14s)	Average loss=0.023160
[TextRNN] Epoch 75 (15.04s)	Average loss=0.023468
[TextRNN] Epoch 80 (15.92s)	Average loss=0.020987
[TextRNN] Epoch 85 (16.82s)	Average loss=0.019817
[TextRNN] Epoch 90 (17.70s)	Average 

In [31]:
test_cands = session.query(Tweet).filter(Tweet.split == 1).order_by(Tweet.id).all()
lstm.score(test_cands, test_cand_labels)

0.6666666666666666

The LSTM scores about 0.67 on the held-out tweets, which is already close to the 0.73 mean accuracy of an individual crowd worker! If we wanted to improve the score, we could tune the LSTM model using grid search (see the Intro tutorial), use [pre-trained word embeddings](https://nlp.stanford.edu/projects/glove/), or apply many other common techniques for getting state-of-the-art scores. Notably, we're doing this without using gold labels for training, relying instead on noisy crowd labels!
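
As a rough illustration of grid search, the sketch below tries every combination of a small hyperparameter grid and keeps the best-scoring one. It is a generic stdlib sketch, not Snorkel's grid search utility. The `toy_evaluate` function is a placeholder objective rather than a real model, and the grid values are arbitrary.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Placeholder objective, NOT a real model: peaks at lr=0.01, dropout=0.2.
    return -abs(params["lr"] - 0.01) - abs(params["dropout"] - 0.2)


param_grid = {"lr": [0.001, 0.01, 0.1], "dropout": [0.0, 0.2, 0.5]}
best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params)
```

In practice, `evaluate` would train a model with the given settings and score it on a held-out development split, keeping the test split untouched until the end.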

For more, check out the other tutorials!