# Snorkel Intro Tutorial: Data Augmentation
Data augmentation is a popular technique for increasing the size of labeled training sets by applying **class-preserving** transformations to create modified copies of labeled data points.

In the image domain, it is a crucial factor in almost every state-of-the-art result today, and it is quickly gaining popularity in text-based applications.

Snorkel models the data augmentation process by applying user-defined **transformation functions (TFs)** in sequence. 

## Preparation

In [1]:
import os
import random

import numpy as np
import pandas as pd
import tensorflow as tf

In [2]:
if os.path.basename(os.getcwd()) == "snorkel-tutorials":
  os.chdir("./spam-classification")

print(os.getcwd())

/Users/scottchu/Projects/learning/snorkel-tutorials/spam-classification


In [3]:
!pip install -r requirements.txt -q

In [4]:
# Turn off TensorFlow logging messages
# (this is most effective when set before TensorFlow is imported)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [5]:
# For reproducibility
seed = 0
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed)
random.seed(seed)

In [6]:
DISPLAY_ALL_TEXT = False

pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)

In [7]:
# Download the spaCy english model
! python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## 1. Loading Data

In [8]:
from utils import load_spam_dataset

df_train, df_test = load_spam_dataset(load_train_labels=True)

Y_train = df_train["label"].values
Y_test = df_test["label"].values

In [9]:
df_train.head()

Unnamed: 0,author,date,text,label,video
0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx...,1,1
1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Ta...",1,1
2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,0,1
3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",0,1
4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",0,1


## 2. Writing Transformation Functions

Transformation functions are functions that can be applied to a training data point to create another valid training data point of the same class.

For example, for image classification problems, it is common to **rotate** or **crop** images in the training data to create new training inputs. 

Transformation functions should be **atomic** (a small rotation of an image, or changing a single word in a sentence). We then compose multiple transformation functions when applying them to training data points.
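
To make the idea of composing atomic transformations concrete, here is a minimal sketch in plain Python that does not use Snorkel. The two TFs, the dict-based data point, and the `apply_sequence` helper are all hypothetical and exist only for illustration. The sketch assumes that a TF returning `None` is skipped and that a new data point is produced only if at least one TF in the sequence applied; Snorkel's own applier may handle this differently.

```python
import copy
from typing import Callable, List, Optional

# A data point is modeled as a plain dict here (an assumption for this sketch).
DataPoint = dict
TransformFn = Callable[[DataPoint], Optional[DataPoint]]


def lowercase_first_word(x: DataPoint) -> Optional[DataPoint]:
    """Hypothetical atomic TF: lowercase the first word, or return None."""
    words = x["text"].split(" ")
    if not words[0] or words[0].islower():
        return None  # nothing to change, so no new data point
    words[0] = words[0].lower()
    x["text"] = " ".join(words)
    return x


def drop_trailing_exclamation(x: DataPoint) -> Optional[DataPoint]:
    """Hypothetical atomic TF: remove one trailing '!', or return None."""
    if not x["text"].endswith("!"):
        return None
    x["text"] = x["text"][:-1]
    return x


def apply_sequence(x: DataPoint, fns: List[TransformFn]) -> Optional[DataPoint]:
    """Apply TFs in order to a copy of x, skipping TFs that return None.

    Returns None if no TF in the sequence could be applied.
    """
    current = copy.deepcopy(x)  # never mutate the original data point
    applied = False
    for fn in fns:
        result = fn(copy.deepcopy(current))
        if result is not None:
            current = result
            applied = True
    return current if applied else None


original = {"text": "Check out my channel!", "label": 1}
augmented = apply_sequence(original, [lowercase_first_word, drop_trailing_exclamation])
print(augmented)  # the label is carried over unchanged
```

Because each TF makes only one small change, the same functions can be recombined into many different sequences.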

In [10]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

In [11]:
import names
from snorkel.augmentation import transformation_function

# Pregenerate some random person names to replace existing ones with
# for the transformation strategies below
replacement_names = [names.get_full_name() for _ in range(50)]

In [12]:
from utils import preview_tf

In [13]:
@transformation_function(pre=[spacy])
def change_person(x):
  person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
  # If there is at least one person name, replace a random one. Else return None.
  if person_names:
    name_to_replace = np.random.choice(person_names)
    replacement_name = np.random.choice(replacement_names)
    x.text = x.text.replace(name_to_replace, replacement_name)
    return x

preview_tf(df_train, change_person, 20)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,change_person,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" seems l..."
1,change_person,Οh my god ... Roar is the most liked video at ...,Οh my god ... Roar is the most liked video at ...
2,change_person,I started hating Katy Perry after finding out ...,I started hating Douglas Schmidt after finding...
3,change_person,i rekt ur mum last nite. cuz da haterz were 2 ...,i rekt ur mum last nite. cuz da haterz were 2 ...
4,change_person,share and like this page to win a hand signed ...,share and like this page to win a hand signed ...
5,change_person,WHATS UP EVERYONE!? :-) I Trying To Showcase M...,WHATS UP EVERYONE!? :-) I Trying To Showcase M...
6,change_person,••••►►My name is George and let me tell u EMIN...,••••►►My name is George and let me tell u EMIN...
7,change_person,The first billion viewed this because they tho...,The first billion viewed this because they tho...
8,change_person,You guys should check out this EXTRAORDINARY w...,You guys should check out this EXTRAORDINARY w...
9,change_person,i love Rihanna 😍😍😍😍[♧from Thailand♧]﻿,i love Malcolm Lafoe 😍😍😍😍[♧from Thailand♧]﻿


In [14]:
@transformation_function(pre=[spacy])
def swap_adjectives(x):
  adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
  # Check that there are at least two adjectives to swap.
  if len(adjective_idxs) >= 2:
    idx1, idx2 = sorted(np.random.choice(adjective_idxs, 2, replace=False))
    # Swap tokens in positions idx1 and idx2.
    x.text = " ".join(
      [
        x.doc[:idx1].text,
        x.doc[idx2].text,
        x.doc[1 + idx1 : idx2].text,
        x.doc[idx1].text,
        x.doc[1 + idx2 :].text,
      ]
    )
    return x

preview_tf(df_train, swap_adjectives, 20)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,swap_adjectives,hey guys look im aware im spamming and it piss...,hey guys look im more im spamming and it pisse...
1,swap_adjectives,I started hating Katy Perry after finding out ...,I started hating Katy Perry after finding out ...
2,swap_adjectives,i rekt ur mum last nite. cuz da haterz were 2 ...,i rekt ur mum sponswer nite. cuz da haterz wer...
3,swap_adjectives,You gotta say its funny. well not 2 billion wo...,You gotta say its funny. well not 2 billion wo...
4,swap_adjectives,"Hello! Do you like gaming, art videos, scienti...","Hello! Do you like gaming, art videos, scienti..."
5,swap_adjectives,Haha its so funny to see the salt of westerner...,Haha its so top to see the salt of westerners ...
6,swap_adjectives,Why does this video have so many views? Becaus...,Why does this video have so many views? Becaus...
7,swap_adjectives,Please help me give my son a grave. http://ww...,Please help me give my son a grave. anymore....
8,swap_adjectives,••••►►My name is George and let me tell u EMIN...,••••►►My name is George and let me tell u EMIN...
9,swap_adjectives,Hey I think I know what where dealing with her...,Hey I think I know what where dealing with her...
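
The five-slice join inside `swap_adjectives` can be easier to follow on a plain list of tokens. The sketch below uses a hypothetical `swap_tokens` helper and a whitespace split as a stand-in for the spaCy `Doc`; it illustrates the slicing logic only.

```python
def swap_tokens(tokens, idx1, idx2):
    """Return a new token list with positions idx1 < idx2 exchanged."""
    if not 0 <= idx1 < idx2 < len(tokens):
        raise ValueError("expected 0 <= idx1 < idx2 < len(tokens)")
    return (
        tokens[:idx1]               # everything before the first token
        + [tokens[idx2]]            # second token moves to the first position
        + tokens[idx1 + 1 : idx2]   # tokens in between stay in place
        + [tokens[idx1]]            # first token moves to the second position
        + tokens[idx2 + 1 :]        # everything after the second token
    )


tokens = "a funny but long video".split()
swapped = " ".join(swap_tokens(tokens, 1, 3))
print(swapped)
```

Joining a flat token list avoids the stray spaces that appear when one of the joined slices is empty, for example when the first adjective is the first token.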


In [15]:
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")

def get_synonym(word, pos=None):
  """Get synonym for word given its part-of-speech (pos)."""
  synsets = wn.synsets(word, pos=pos)
  # Return None if wordnet has no synsets (synonym sets) for this word and pos.
  if synsets:
    words = [lemma.name() for lemma in synsets[0].lemmas()]
    if words[0].lower() != word.lower():  # Skip if synonym is same as word.
      # Multi word synonyms in wordnet use '_' as a separator e.g. reckon_with. Replace it with space.
      return words[0].replace("_", " ")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/scottchu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [16]:
def replace_token(spacy_doc, idx, replacement):
  """Replace token in position idx with replacement."""
  return " ".join([
    spacy_doc[:idx].text, 
    replacement, 
    spacy_doc[1 + idx :].text
  ])
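
Note that joining the pieces with `" ".join` always puts a space on both sides of the replacement, which is why some previews in this tutorial show text such as `suck ,` or `amazing !`. A possible alternative, sketched here with a hypothetical `replace_span` helper, is to splice the replacement into the original string by character offsets. With a spaCy `Doc`, the offsets could presumably come from `token.idx` and `len(token.text)`; that wiring is an assumption and is not shown.

```python
def replace_span(text, start, end, replacement):
    """Replace text[start:end] with replacement, leaving other characters untouched."""
    return text[:start] + replacement + text[end:]


text = "Wow, GS sucks, in my opinion"
start = text.index("sucks")   # character offset of the word to replace
end = start + len("sucks")
result = replace_span(text, start, end, "suck")
print(result)
```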

In [17]:
@transformation_function(pre=[spacy])
def replace_verb_with_synonym(x):
  # Get indices of verb tokens in sentence.
  verb_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "VERB"]
  if verb_idxs:
    # Pick random verb idx to replace.
    idx = np.random.choice(verb_idxs)
    synonym = get_synonym(x.doc[idx].text, pos="v")
    # If there's a valid verb synonym, replace it. Otherwise, return None.
    if synonym:
      x.text = replace_token(x.doc, idx, synonym)
      return x

preview_tf(df_train, replace_verb_with_synonym, 20)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,replace_verb_with_synonym,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" seems l..."
1,replace_verb_with_synonym,Οh my god ... Roar is the most liked video at ...,Οh my god ... Roar is the most wish video at V...
2,replace_verb_with_synonym,I started hating Katy Perry after finding out ...,I started hating Katy Perry after finding out ...
3,replace_verb_with_synonym,This video is so racist!!! There are only anim...,This video is so racist!!! There be only anima...
4,replace_verb_with_synonym,You gotta say its funny. well not 2 billion wo...,You gotta say its funny. well not 2 billion wo...
5,replace_verb_with_synonym,reminds me of this song https://soundcloud.com...,remind me of this song https://soundcloud.com...
6,replace_verb_with_synonym,2 billion....Coming soon﻿,2 billion.... come soon﻿
7,replace_verb_with_synonym,••••►►My name is George and let me tell u EMIN...,••••►►My name is George and let me tell u EMIN...
8,replace_verb_with_synonym,Hey I think I know what where dealing with her...,Hey I think I know what where dealing with her...
9,replace_verb_with_synonym,"Almost 1 Bil. What? Wow, GS sucks, in my opini...","Almost 1 Bil. What? Wow, GS suck , in my opini..."


In [18]:
@transformation_function(pre=[spacy])
def replace_noun_with_synonym(x):
  # Get indices of noun tokens in sentence.
  noun_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "NOUN"]
  if noun_idxs:
    # Pick random noun idx to replace.
    idx = np.random.choice(noun_idxs)
    synonym = get_synonym(x.doc[idx].text, pos="n")
    # If there's a valid noun synonym, replace it. Otherwise, return None.
    if synonym:
      x.text = replace_token(x.doc, idx, synonym)
      return x

preview_tf(df_train, replace_noun_with_synonym, 20)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,replace_noun_with_synonym,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" seems l..."
1,replace_noun_with_synonym,You gotta say its funny. well not 2 billion wo...,You gotta say its funny. well not 2 billion wo...
2,replace_noun_with_synonym,"Hello! Do you like gaming, art videos, scienti...","Hello! Do you like gaming, art video , scienti..."
3,replace_noun_with_synonym,Check out my Music Videos! and PLEASE SUBSCRIB...,Check out my Music Videos! and PLEASE SUBSCRIB...
4,replace_noun_with_synonym,Love the way you lie - Driveshaft﻿,Love the manner you lie - Driveshaft﻿
5,replace_noun_with_synonym,Watch my videos xx﻿,Watch my video xx﻿
6,replace_noun_with_synonym,Why does this video have so many views? Becaus...,Why does this video have so many views? Becaus...
7,replace_noun_with_synonym,WHATS UP EVERYONE!? :-) I Trying To Showcase M...,WHATS UP EVERYONE!? :-) I Trying To Showcase M...
8,replace_noun_with_synonym,Get free gift cards and pay pal money!﻿,Get free gift card game and pay pal money!﻿
9,replace_noun_with_synonym,The first billion viewed this because they tho...,The first billion viewed this because they tho...


In [19]:
@transformation_function(pre=[spacy])
def replace_adjective_with_synonym(x):
  # Get indices of adjective tokens in sentence.
  adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
  if adjective_idxs:
    # Pick random adjective idx to replace.
    idx = np.random.choice(adjective_idxs)
    synonym = get_synonym(x.doc[idx].text, pos="a")
    # If there's a valid adjective synonym, replace it. Otherwise, return None.
    if synonym:
      x.text = replace_token(x.doc, idx, synonym)
      return x

preview_tf(df_train, replace_adjective_with_synonym, 20)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,replace_adjective_with_synonym,You gotta say its funny. well not 2 billion wo...,You gotta say its funny. well not 2 billion wo...
1,replace_adjective_with_synonym,Haha its so funny to see the salt of westerner...,Haha its so amusing to see the salt of western...
2,replace_adjective_with_synonym,••••►►My name is George and let me tell u EMIN...,••••►►My name is George and let me tell u EMIN...
3,replace_adjective_with_synonym,Shuffling all the way with LMFAO! I like this ...,Shuffling all the way with LMFAO! I like this ...
4,replace_adjective_with_synonym,Hey guys plz check out my youtube channel to c...,Hey guys plz check out my youtube channel to c...
5,replace_adjective_with_synonym,I cried this song bringing back some hard memo...,I cried this song bringing back some difficult...
6,replace_adjective_with_synonym,You guys should check out this EXTRAORDINARY w...,You guys should check out this EXTRAORDINARY w...
7,replace_adjective_with_synonym,You guys should check out this EXTRAORDINARY w...,You guys should check out this EXTRAORDINARY w...
8,replace_adjective_with_synonym,Super awesome video<br />﻿,ace awesome video<br />﻿
9,replace_adjective_with_synonym,"Check out our vids, our songs are awesome! And...","Check out our vids, our songs are amazing ! An..."


In [20]:
tfs = [
  change_person,
  swap_adjectives,
  replace_verb_with_synonym,
  replace_noun_with_synonym,
  replace_adjective_with_synonym,
]

In [21]:
from utils import preview_tfs

preview_tfs(df_train, tfs)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,change_person,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" seems l..."
1,swap_adjectives,hey guys look im aware im spamming and it piss...,hey guys look im aware im spamming and it piss...
2,replace_verb_with_synonym,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" look li..."
3,replace_noun_with_synonym,"""eye of the tiger"" ""i am the champion"" seems l...","""eye of the tiger"" ""i am the champion"" seems l..."
4,replace_adjective_with_synonym,You gotta say its funny. well not 2 billion wo...,You gotta say its amusing . well not 2 billion...


The TFs are expected to be heuristic strategies that preserve the class most of the time; they don't need to be perfect. This is especially true when using automated data augmentation techniques, which can learn to avoid particularly corrupted data points.

## 3. Applying Transformation Functions to Augment Our Dataset

In [22]:
from snorkel.augmentation import RandomPolicy

random_policy = RandomPolicy(
  len(tfs), 
  sequence_length=2, 
  n_per_original=2, 
  keep_original=True
)

In [23]:
from snorkel.augmentation import MeanFieldPolicy

mean_field_policy = MeanFieldPolicy(
  len(tfs),
  sequence_length=2,
  n_per_original=2,
  keep_original=True,
  p=[0.05, 0.05, 0.3, 0.3, 0.3]
)
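
As a rough mental model of what such a policy does, the sketch below samples sequences of TF indices with the standard library only. It is a conceptual illustration, not Snorkel's implementation: the `sample_tf_sequences` helper is hypothetical, and representing "keep the original" as an empty sequence is an assumption.

```python
import random


def sample_tf_sequences(p, sequence_length=2, n_per_original=2,
                        keep_original=True, seed=0):
    """Sample TF index sequences for one data point (hypothetical helper)."""
    rng = random.Random(seed)
    tf_indices = list(range(len(p)))
    # Assumption: an empty sequence stands for "keep the original unchanged".
    sequences = [[]] if keep_original else []
    for _ in range(n_per_original):
        # Each position in the sequence is drawn independently with weights p.
        sequences.append(rng.choices(tf_indices, weights=p, k=sequence_length))
    return sequences


p = [0.05, 0.05, 0.3, 0.3, 0.3]  # same weights as in the cell above
sequences = sample_tf_sequences(p)
print(sequences)
```

With these weights the three synonym TFs are each drawn six times as often as `change_person` or `swap_adjectives`. Passing equal weights would give the uniform behaviour of a random policy.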

In [24]:
from snorkel.augmentation import PandasTFApplier

tf_applier = PandasTFApplier(tfs, mean_field_policy)

df_train_augmented = tf_applier.apply(df_train)

Y_train_augmented = df_train_augmented["label"].values

100%|██████████| 1586/1586 [00:29<00:00, 53.77it/s]


In [25]:
print(f"Original training set size: {len(df_train)}")
print(f"Augmented training set size: {len(df_train_augmented)}")

Original training set size: 1586
Augmented training set size: 2429
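
The augmented set is well short of three times the original even though `n_per_original=2` and `keep_original=True`. One plausible reading, given that each TF returns `None` when it cannot apply, is that many sampled sequences produce no new data point. The arithmetic below relates the two printed sizes under the assumption that every original row is kept exactly once.

```python
n_original = 1586          # printed original training set size
n_augmented_total = 2429   # printed augmented training set size
n_per_original = 2         # policy setting used above

# Largest possible size if every sampled sequence produced a new data point.
upper_bound = n_original * (1 + n_per_original)

# New data points actually added, assuming all originals are kept once.
n_new = n_augmented_total - n_original

# Fraction of sampled sequences that yielded a new data point.
attempts = n_original * n_per_original
success_rate = n_new / attempts

print(f"Upper bound: {upper_bound}")
print(f"New data points: {n_new} of {attempts} attempts ({100 * success_rate:.1f}%)")
```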


In [26]:
df_train_augmented[["label", "text"]]

Unnamed: 0,label,text
0,1,pls http://www10.vakinha.com.br/VaquinhaE.aspx...
0,1,pls http://www10.vakinha.com.br/VaquinhaE.aspx...
0,1,pls http://www10.vakinha.com.br/VaquinhaE.aspx...
1,1,"if your like drones, plz subscribe to Kamal Ta..."
1,1,"if your like drones, plz subscribe to Kamal Ta..."
...,...,...
445,1,I hope everyone is in good spirits I&#39;m a h...
446,1,Lil m !!!!! Check hi out!!!!! Does live the wa...
446,1,Lil m !!!!! Check hi out!!!!! Does live the ma...
446,1,Lil m !!!!! Check hi out!!!!! Does live the ma...


## 4. Training a Model

In [27]:
import tensorflow as tf

session_conf = tf.compat.v1.ConfigProto(
  intra_op_parallelism_threads=1, 
  inter_op_parallelism_threads=1
)

tf.compat.v1.set_random_seed(0)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)


Metal device set to: Apple M1 Max

systemMemory: 64.00 GB
maxCacheSize: 21.33 GB




In [28]:
from utils import featurize_df_tokens

X_train = featurize_df_tokens(df_train)

X_train_augmented = featurize_df_tokens(df_train_augmented)

X_test = featurize_df_tokens(df_test)


In [41]:
def get_keras_lstm(num_buckets, embed_dim=16, rnn_state_size=64):
    lstm_model = tf.keras.Sequential()
    lstm_model.add(tf.keras.layers.Embedding(num_buckets, embed_dim))
    lstm_model.add(tf.keras.layers.LSTM(rnn_state_size, activation=tf.nn.relu))
    lstm_model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid))
    lstm_model.compile("Adagrad", "binary_crossentropy", metrics=["accuracy"])
    return lstm_model

In [48]:
def train_and_test(X_train, Y_train, X_test=X_test, Y_test=Y_test, num_buckets=30000):
  lstm_model = get_keras_lstm(num_buckets)
  lstm_model.fit(X_train, Y_train, epochs=5, verbose=0)
  preds_test = lstm_model.predict(X_test)[:, 0] > 0.5
  return (preds_test == Y_test).mean()

In [None]:
acc_augmented = train_and_test(X_train_augmented, Y_train_augmented)
acc_original = train_and_test(X_train, Y_train)

In [None]:
print(f"Test Accuracy (original training data): {100 * acc_original:.1f}%")
print(f"Test Accuracy (augmented training data): {100 * acc_augmented:.1f}%")

## Further Reading
- [Data Augmentation with Snorkel](https://www.snorkel.org/blog/tanda)
- Keras LSTM (Long Short Term Memory)
- TANDA Mean Field Model