# 📈 Snorkel Intro Tutorial: Data Augmentation

In this tutorial, we will walk through the process of using *transformation functions* (TFs) to perform data augmentation.
Like the labeling tutorial, our goal is to train a classifier to classify YouTube comments as `SPAM` or `HAM` (not spam).
In the [previous tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb),
we demonstrated how to label training sets programmatically with Snorkel.
In this tutorial, we'll assume that step has already been done, and start with labeled training data,
which we'll aim to augment using transformation functions.


* For more details on the task, check out the [labeling tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb)
* For an overview of Snorkel, visit [snorkel.org](http://snorkel.org)
* You can also check out the [Snorkel API documentation](https://snorkel.readthedocs.io/)


Data augmentation is a popular technique for increasing the size of labeled training sets by applying class-preserving transformations to create copies of labeled data points.
In the image domain, it is a crucial factor in almost every state-of-the-art result today and is quickly gaining
popularity in text-based applications.
Snorkel models the data augmentation process by applying user-defined *transformation functions* (TFs) in sequence.
You can learn more about data augmentation in
[this blog post about our NeurIPS 2017 work on automatically learned data augmentation](https://snorkel.org/tanda/).

The tutorial is divided into four parts:
1. **Loading Data**: We load a [YouTube comments dataset](http://www.dt.fee.unicamp.br/~tiago//youtubespamcollection/).
2. **Writing Transformation Functions**: We write Transformation Functions (TFs) that can be applied to training data points to generate new training data points.
3. **Applying Transformation Functions to Augment Our Dataset**: We apply a sequence of TFs to each training data point, using a random policy, to generate an augmented training set.
4. **Training a Model**: We use the augmented training set to train an LSTM model for classifying new comments as `SPAM` or `HAM`.

This next cell takes care of some notebook-specific housekeeping.
You can ignore it.

In [1]:
import os
import random

import numpy as np

# Make sure we're running from the spam/ directory
if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("spam")

# Turn off TensorFlow logging messages
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# For reproducibility
seed = 0
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed)
random.seed(seed)

If you want to display all comment text untruncated, change `DISPLAY_ALL_TEXT` to `True` below.

In [2]:
import pandas as pd


DISPLAY_ALL_TEXT = False

pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)

This next cell makes sure a spaCy English model is downloaded.
If this is your first time downloading this model, restart the kernel after executing the next cell.

In [3]:
# Download the spaCy English model
! python -m spacy download en_core_web_sm



✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## 1. Loading Data

We load the YouTube comments dataset and create Pandas DataFrame objects for the training, validation, and test sets.
The two main columns in the DataFrames are:
* **`text`**: Raw text content of the comment
* **`label`**: Whether the comment is `SPAM` (1) or `HAM` (0).

For more details, check out the [labeling tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb).

In [4]:
from utils import load_spam_dataset

df_train, _, df_valid, df_test = load_spam_dataset(load_train_labels=True)

# We pull out the label vectors for ease of use later
Y_valid = df_valid["label"].values
Y_train = df_train["label"].values
Y_test = df_test["label"].values

In [5]:
df_train.head()

| | author | date | text | label | video |
|---|---|---|---|---|---|
| 0 | Alessandro leite | 2014-11-05T22:21:36 | pls http://www10.vakinha.com.br/VaquinhaE.aspx... | 1 | 1 |
| 1 | Salim Tayara | 2014-11-02T14:33:30 | if your like drones, plz subscribe to Kamal Ta... | 1 | 1 |
| 2 | Phuc Ly | 2014-01-20T15:27:47 | go here to check the views :3 | 0 | 1 |
| 3 | DropShotSk8r | 2014-01-19T04:27:18 | Came here to check the views, goodbye. | 0 | 1 |
| 4 | css403 | 2014-11-07T14:25:48 | i am 2,126,492,636 viewer :D | 0 | 1 |


## 2. Writing Transformation Functions (TFs)

Transformation functions are functions that can be applied to a training data point to create another valid training data point of the same class.
For example, for image classification problems, it is common to rotate or crop images in the training data to create new training inputs.
Transformation functions should be atomic, e.g., a small rotation of an image or a change to a single word in a sentence.
We then compose multiple transformation functions when applying them to training data points.
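The idea of composing atomic transformations can be sketched in plain Python, independent of Snorkel. The two string transformations and the `compose` helper below are hypothetical names chosen for illustration; they are not part of the tutorial's TFs or of the Snorkel API.

```python
from functools import reduce


def lowercase(text):
    """Atomic transformation: lowercase the whole comment."""
    return text.lower()


def strip_exclamations(text):
    """Atomic transformation: remove exclamation marks."""
    return text.replace("!", "")


def compose(*transforms):
    """Return a function that applies the given transforms left to right."""

    def composed(text):
        return reduce(lambda acc, t: t(acc), transforms, text)

    return composed


augment = compose(lowercase, strip_exclamations)
augmented = augment("Check out my NEW channel!!")
print(augmented)
```

Each function makes one small change, and the composition produces a new variant of the comment that should still belong to the same class.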

Common ways to augment text include replacing words with their synonyms, or replacing named entities with other entities.
More information can be found in these articles on
[data augmentation in NLP](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28) and
[easy data augmentation techniques for NLP](https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610).
Our basic modeling assumption is that applying these operations to a comment generally shouldn't change whether it is `SPAM` or not.

Transformation functions in Snorkel are created with the
[`transformation_function` decorator](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.transformation_function.html#snorkel.augmentation.transformation_function),
which wraps a function that takes in a single data point and returns a transformed version of the data point.
If no transformation is possible, a TF can return `None` or the original data point.
If all the TFs applied to a data point return `None`, the data point won't be included in
the augmented dataset when we apply our TFs below.
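The "return `None` when no transformation is possible" contract can be illustrated without Snorkel. This is a minimal sketch: `expand_plz` is a made-up function, `SimpleNamespace` stands in for a data point with `text` and `label` fields, and the explicit copy is an assumption of the sketch rather than a statement about how the decorator works.

```python
from types import SimpleNamespace


def expand_plz(x):
    """Toy TF: replace the first 'plz' with 'please', or return None if absent."""
    if "plz" not in x.text:
        return None  # No valid transformation for this data point
    # Build a new data point so the original is left untouched
    x_new = SimpleNamespace(**vars(x))
    x_new.text = x.text.replace("plz", "please", 1)
    return x_new


with_plz = SimpleNamespace(text="plz subscribe to my channel", label=1)
without_plz = SimpleNamespace(text="Came here to check the views", label=0)

transformed = expand_plz(with_plz)
skipped = expand_plz(without_plz)

print(transformed.text, transformed.label)
print(skipped)
```

The label is carried over unchanged, which reflects the assumption that the transformation preserves the class.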

Just like the `labeling_function` decorator, the `transformation_function` decorator
accepts a `pre` argument for `Preprocessor` objects.
Here, we'll use a
[`SpacyPreprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.nlp.SpacyPreprocessor.html#snorkel.preprocess.nlp.SpacyPreprocessor).

In [6]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

In [7]:
import names
from snorkel.augmentation import transformation_function

# Pregenerate some random person names to replace existing ones with
# for the transformation strategies below
replacement_names = [names.get_full_name() for _ in range(50)]


# Replace a random named entity with a different entity of the same type.
@transformation_function(pre=[spacy])
def change_person(x):
    person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
    # If there is at least one person name, replace a random one. Else return None.
    if person_names:
        name_to_replace = np.random.choice(person_names)
        replacement_name = np.random.choice(replacement_names)
        x.text = x.text.replace(name_to_replace, replacement_name)
        return x


# Swap two adjectives at random.
@transformation_function(pre=[spacy])
def swap_adjectives(x):
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    # Check that there are at least two adjectives to swap.
    if len(adjective_idxs) >= 2:
        idx1, idx2 = sorted(np.random.choice(adjective_idxs, 2, replace=False))
        # Swap tokens in positions idx1 and idx2.
        x.text = " ".join(
            [
                x.doc[:idx1].text,
                x.doc[idx2].text,
                x.doc[1 + idx1 : idx2].text,
                x.doc[idx1].text,
                x.doc[1 + idx2 :].text,
            ]
        )
        return x
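The index arithmetic used in `swap_adjectives` can be checked on a plain list of tokens. In this sketch, whitespace-split strings stand in for spaCy tokens, and `swap_tokens` is a hypothetical helper, not part of the tutorial code.

```python
def swap_tokens(tokens, idx1, idx2):
    """Return a copy of tokens with positions idx1 < idx2 exchanged."""
    assert 0 <= idx1 < idx2 < len(tokens)
    return (
        tokens[:idx1]            # Everything before the first position
        + [tokens[idx2]]         # Second token moves to the first position
        + tokens[idx1 + 1 : idx2]  # Tokens in between stay in place
        + [tokens[idx1]]         # First token moves to the second position
        + tokens[idx2 + 1 :]     # Everything after the second position
    )


tokens = "what a great and catchy song".split()
swapped = swap_tokens(tokens, 2, 4)
print(" ".join(swapped))
```

Sorting the two sampled indices first, as the TF does, guarantees `idx1 < idx2`, so the middle slice is never reversed.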

We add some transformation functions that use `wordnet` from [NLTK](https://www.nltk.org/) to replace different parts of speech with their synonyms.

In [8]:
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")


def get_synonym(word, pos=None):
    """Get synonym for word given its part-of-speech (pos)."""
    synsets = wn.synsets(word, pos=pos)
    # Return None if wordnet has no synsets (synonym sets) for this word and pos.
    if synsets:
        words = [lemma.name() for lemma in synsets[0].lemmas()]
        if words[0].lower() != word.lower():  # Skip if synonym is same as word.
            # Multi word synonyms in wordnet use '_' as a separator e.g. reckon_with. Replace it with space.
            return words[0].replace("_", " ")


def replace_token(spacy_doc, idx, replacement):
    """Replace token in position idx with replacement."""
    return " ".join([spacy_doc[:idx].text, replacement, spacy_doc[1 + idx :].text])


@transformation_function(pre=[spacy])
def replace_verb_with_synonym(x):
    # Get indices of verb tokens in sentence.
    verb_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "VERB"]
    if verb_idxs:
        # Pick random verb idx to replace.
        idx = np.random.choice(verb_idxs)
        synonym = get_synonym(x.doc[idx].text, pos="v")
        # If there's a valid verb synonym, replace it. Otherwise, return None.
        if synonym:
            x.text = replace_token(x.doc, idx, synonym)
            return x


@transformation_function(pre=[spacy])
def replace_noun_with_synonym(x):
    # Get indices of noun tokens in sentence.
    noun_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "NOUN"]
    if noun_idxs:
        # Pick random noun idx to replace.
        idx = np.random.choice(noun_idxs)
        synonym = get_synonym(x.doc[idx].text, pos="n")
        # If there's a valid noun synonym, replace it. Otherwise, return None.
        if synonym:
            x.text = replace_token(x.doc, idx, synonym)
            return x


@transformation_function(pre=[spacy])
def replace_adjective_with_synonym(x):
    # Get indices of adjective tokens in sentence.
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    if adjective_idxs:
        # Pick random adjective idx to replace.
        idx = np.random.choice(adjective_idxs)
        synonym = get_synonym(x.doc[idx].text, pos="a")
        # If there's a valid adjective synonym, replace it. Otherwise, return None.
        if synonym:
            x.text = replace_token(x.doc, idx, synonym)
            return x

[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
tfs = [
    change_person,
    swap_adjectives,
    replace_verb_with_synonym,
    replace_noun_with_synonym,
    replace_adjective_with_synonym,
]

Let's check out a few examples of transformed data points to see what our TFs are doing.

In [10]:
from utils import preview_tfs

preview_tfs(df_train, tfs)

| | TF Name | Original Text | Transformed Text |
|---|---|---|---|
| 0 | change_person | Check out Berzerk video on my channel ! :D | Check out Jennifer Selby video on my channel ! :D |
| 1 | swap_adjectives | hey guys look im aware im spamming and it piss... | hey guys look im aware im spamming and it piss... |
| 2 | replace_verb_with_synonym | "eye of the tiger" "i am the champion" seems l... | "eye of the tiger" "i be the champion" seems l... |
| 3 | replace_noun_with_synonym | Hey, check out my new website!! This site is a... | Hey, check out my new web site !! This site is... |
| 4 | replace_adjective_with_synonym | I started hating Katy Perry after finding out ... | I started hating Katy Perry after finding out ... |


We notice a couple of things about the TFs.

* Sometimes they make trivial changes (e.g. `"website"` to `"web site"` for `replace_noun_with_synonym`).
  This can still be helpful for training our model, because it teaches the model to be invariant to such small changes.
* Sometimes they introduce incorrect grammar to the sentence (e.g. `swap_adjectives` swapping `"young"` and `"more"` above).

The TFs are expected to be heuristic strategies that indeed preserve the class most of the time, but
[don't need to be perfect](https://arxiv.org/pdf/1901.11196.pdf).
This is especially true when using automated
[data augmentation techniques](https://snorkel.org/tanda/)
which can learn to avoid particularly corrupted data points.
As we'll see below, Snorkel is compatible with such learned augmentation policies.

## 3. Applying Transformation Functions

We'll first define a `Policy` to determine what sequence of TFs to apply to each data point.
We'll start with a [`RandomPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.RandomPolicy.html)
that samples `sequence_length=2` TFs to apply uniformly at random per data point.
The `n_per_original` argument determines how many augmented data points to generate per original data point.
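To make the roles of `sequence_length`, `n_per_original`, and `keep_original` concrete, here is a minimal plain-Python sketch of uniform sequence sampling. It is an illustration under our own assumptions, not Snorkel's implementation; in particular, representing "keep the original" as an empty sequence is a choice made for this sketch, and `sample_uniform_sequences` is a hypothetical name.

```python
import random


def sample_uniform_sequences(
    n_tfs, sequence_length, n_per_original, keep_original, rng
):
    """Sample TF index sequences uniformly at random for one data point."""
    # An empty sequence means "apply nothing", i.e. keep the original
    sequences = [[]] if keep_original else []
    for _ in range(n_per_original):
        sequences.append([rng.randrange(n_tfs) for _ in range(sequence_length)])
    return sequences


rng = random.Random(0)
sequences = sample_uniform_sequences(
    n_tfs=5, sequence_length=2, n_per_original=2, keep_original=True, rng=rng
)
print(sequences)
```

With these settings, each original data point gets three sequences: one empty sequence and two sequences of two TF indices each.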

In [11]:
from snorkel.augmentation import RandomPolicy

random_policy = RandomPolicy(
    len(tfs), sequence_length=2, n_per_original=2, keep_original=True
)

In some cases, we can do better than uniform random sampling.
We might have domain knowledge that some TFs should be applied more frequently than others,
or have trained an [automated data augmentation model](https://snorkel.org/tanda/)
that learned a sampling distribution for the TFs.
Snorkel supports this use case with a
[`MeanFieldPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.MeanFieldPolicy.html),
which allows you to specify a sampling distribution for the TFs.
We give higher probabilities to the `replace_[X]_with_synonym` TFs, since those provide more information to the model.
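The effect of a non-uniform sampling distribution can be previewed with the standard library alone. The sketch below draws TF indices with `random.choices` using the same probabilities as the policy in this section; it only illustrates weighted sampling and does not reproduce how Snorkel samples internally.

```python
import random
from collections import Counter

tf_names = [
    "change_person",
    "swap_adjectives",
    "replace_verb_with_synonym",
    "replace_noun_with_synonym",
    "replace_adjective_with_synonym",
]
p = [0.05, 0.05, 0.3, 0.3, 0.3]
assert abs(sum(p) - 1.0) < 1e-9  # Probabilities must sum to 1

rng = random.Random(0)
draws = rng.choices(range(len(tf_names)), weights=p, k=10_000)
counts = Counter(draws)

# Empirical frequency of each TF across all draws
for idx, name in enumerate(tf_names):
    print(f"{name}: {counts[idx] / len(draws):.3f}")

synonym_share = sum(counts[idx] for idx in (2, 3, 4)) / len(draws)
print(f"Share of synonym TFs: {synonym_share:.3f}")
```

Roughly nine out of ten sampled TFs should be one of the synonym replacements, with `change_person` and `swap_adjectives` appearing only occasionally.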

In [12]:
from snorkel.augmentation import MeanFieldPolicy

mean_field_policy = MeanFieldPolicy(
    len(tfs),
    sequence_length=2,
    n_per_original=2,
    keep_original=True,
    p=[0.05, 0.05, 0.3, 0.3, 0.3],
)

To apply one or more TFs that we've written to a collection of data points according to our policy, we use a
[`PandasTFApplier`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.PandasTFApplier.html)
because our data points are represented with a Pandas DataFrame.
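Conceptually, applying a policy means running each sampled TF sequence on a data point and keeping the result only if something was actually transformed. The following is a simplified sketch of that behavior on plain strings; `apply_sequences` and the two toy TFs are hypothetical and are not the `PandasTFApplier` implementation.

```python
def apply_sequences(x, tfs, sequences):
    """Apply each TF index sequence to x; keep results where something changed."""
    results = []
    for seq in sequences:
        x_t = x
        applied = len(seq) == 0  # An empty sequence keeps the original
        for tf_idx in seq:
            out = tfs[tf_idx](x_t)
            if out is not None:  # None means "no valid transformation"
                x_t = out
                applied = True
        if applied:
            results.append(x_t)
    return results


def expand_plz(text):
    """Toy TF: expand 'plz', or return None if it does not occur."""
    return text.replace("plz", "please") if "plz" in text else None


def drop_exclamations(text):
    """Toy TF: remove exclamation marks, or return None if there are none."""
    return text.replace("!", "") if "!" in text else None


toy_tfs = [expand_plz, drop_exclamations]
sequences = [[], [0, 1], [1, 1]]  # Keep original, then two sampled sequences

augmented = apply_sequences("plz subscribe", toy_tfs, sequences)
print(augmented)
```

The last sequence contributes nothing because both of its TFs return `None` for this comment, which is why an augmented dataset can end up smaller than the nominal maximum.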

In [13]:
from snorkel.augmentation import PandasTFApplier

tf_applier = PandasTFApplier(tfs, mean_field_policy)
df_train_augmented = tf_applier.apply(df_train)
Y_train_augmented = df_train_augmented["label"].values

100%|██████████| 1586/1586 [00:26<00:00, 60.61it/s]




In [14]:
print(f"Original training set size: {len(df_train)}")
print(f"Augmented training set size: {len(df_train_augmented)}")

Original training set size: 1586
Augmented training set size: 2486


We have increased the size of our dataset by more than 50% using TFs!
Note that despite `n_per_original` being set to 2, our dataset may not exactly triple in size,
because sometimes TFs return `None` instead of a new data point
(e.g. `change_person` when applied to a sentence with no persons).
If you prefer to have exact proportions for your dataset, you can have TFs that can't perform a
valid transformation return the original data point instead of returning `None`, as the TFs in this tutorial do.
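The sizes printed above can be related to the policy settings with a little arithmetic. This sketch only uses the numbers reported in this section; interpreting the ratio as a "success rate" assumes that each sampled TF sequence yields at most one new data point.

```python
n_original = 1586       # Original training set size reported above
n_augmented_set = 2486  # Augmented training set size reported above
n_per_original = 2
keep_original = True

# Upper bound: every original kept, plus n_per_original new points for each
max_size = n_original * (n_per_original + int(keep_original))

# New data points actually produced by the TFs
n_new = n_augmented_set - n_original

# Growth factor of the training set
growth = n_augmented_set / n_original

# Fraction of sampled sequences that produced a new data point
success_rate = n_new / (n_original * n_per_original)

print(f"Maximum possible size: {max_size}")
print(f"New data points: {n_new}")
print(f"Growth factor: {growth:.2f}x")
print(f"Sequences yielding a new point: {100 * success_rate:.1f}%")
```

So the augmented set is about 1.57 times the size of the original, well short of the threefold maximum, because many sampled sequences contain only TFs that return `None` for a given comment.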

## 4. Training A Model

Our final step is to use the augmented data to train a model. We train an LSTM (Long Short-Term Memory) model, which is a standard architecture for text processing tasks.

The next cell makes Keras results reproducible. You can ignore it.

In [15]:
import tensorflow as tf

session_conf = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=1, inter_op_parallelism_threads=1
)

tf.compat.v1.set_random_seed(0)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

Now we'll train our LSTM on both the original and augmented datasets to compare performance.

In [16]:
from utils import featurize_df_tokens, get_keras_lstm, get_keras_early_stopping

X_train = featurize_df_tokens(df_train)
X_train_augmented = featurize_df_tokens(df_train_augmented)
X_valid = featurize_df_tokens(df_valid)
X_test = featurize_df_tokens(df_test)


def train_and_test(
    X_train,
    Y_train,
    X_valid=X_valid,
    Y_valid=Y_valid,
    X_test=X_test,
    Y_test=Y_test,
    num_buckets=30000,
):
    # Define a vanilla LSTM model with Keras
    lstm_model = get_keras_lstm(num_buckets)
    lstm_model.fit(
        X_train,
        Y_train,
        epochs=25,
        validation_data=(X_valid, Y_valid),
        callbacks=[get_keras_early_stopping(5)],
        verbose=0,
    )
    preds_test = lstm_model.predict(X_test)[:, 0] > 0.5
    return (preds_test == Y_test).mean()


acc_augmented = train_and_test(X_train_augmented, Y_train_augmented)
acc_original = train_and_test(X_train, Y_train)


Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping


Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
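The evaluation step at the end of `train_and_test` thresholds the predicted probabilities at 0.5 and reports the fraction of correct predictions. Here is the same computation in plain Python on made-up numbers; the probabilities and labels below are illustrative only and are not model outputs.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [int(p > threshold) for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


probs = [0.91, 0.12, 0.55, 0.40]  # Hypothetical predicted SPAM probabilities
labels = [1, 0, 0, 1]             # Hypothetical gold labels (1 = SPAM, 0 = HAM)

acc = accuracy(probs, labels)
print(f"Accuracy: {100 * acc:.1f}%")
```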


In [17]:
print(f"Test Accuracy (original training data): {100 * acc_original:.1f}%")
print(f"Test Accuracy (augmented training data): {100 * acc_augmented:.1f}%")

Test Accuracy (original training data): 91.2%
Test Accuracy (augmented training data): 92.8%


So using the augmented dataset did improve our model's test accuracy!
There is a lot more you can do with data augmentation, so try a few ideas
out on your own!