# Data Augmentation

In this notebook, we augment the training dataset with new examples of the "Insincere" label to reduce the class imbalance.

Imports

I have found Keras's text preprocessing modules really useful; they are pulled from the Keras repository here.

In [50]:
import pandas as pd
import keras_text_preprocessing as text_preprocessor
import numpy as np
import random
from sklearn.neighbors import NearestNeighbors

Data Loading. We are using the Cleaned and Preprocessed Data here.

In [2]:
data = pd.read_csv('data/cleaned_train.txt')
data = data.dropna()

In [3]:
data_text = data['cleaned_text']
data_text = data_text.dropna()

To augment the data, we use word embeddings. In this case, each embedding has 300 dimensions.

### Word Tokenization

We will tokenize the sentences using Keras's tokenizer module. Each word is mapped to an integer, so a sentence becomes a sequence of integers that can be used to index into the corresponding word embeddings.
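
As a rough illustration of what this mapping looks like, the sketch below builds a frequency-ranked word index in plain Python and encodes a few sentences with it. It is a simplified stand-in and not Keras's implementation: `build_word_index` and `texts_to_sequences` are hypothetical helpers, the sample sentences are made up, and text is split on whitespace only.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 is left unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Encode each text as a list of word ranks, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            # Keep only ranks below the cap, i.e. the most frequent words.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["why is the sky blue", "why is the grass green", "is the sky green"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index)
toy_seqs_capped = texts_to_sequences(toy_texts, toy_index, num_words=4)
```

Because the ranks follow word frequency, a cap such as `MAX_WORDS` simply discards the rarest words.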

In [4]:
embed_size = 300
MAX_WORDS = 200000

In [8]:
tokenizer = text_preprocessor.Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(data_text)

In [9]:
seq = tokenizer.texts_to_sequences(data_text)

In [11]:
ones = data[data['target'] == 1]
ones_text = ones['cleaned_text']
ones_seq = tokenizer.texts_to_sequences(ones_text)

In this augmentation task, we are using FastText word embeddings trained on Common Crawl data (the `crawl-300d-2M.vec` file).

In [12]:
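embeddings_dict = {}
# Read as bytes and decode only the word; the numeric fields are parsed directly.
with open('../Embeddings/crawl-%dd-2M.vec' % embed_size, 'rb') as f:
    for line in f:
        splits = line.split()
        # Skip the "<count> <dim>" header line and any malformed line.
        if len(splits) != embed_size + 1:
            continue
        word = splits[0]
        vec = np.asarray(splits[1:], dtype='float32')
        embeddings_dict[word.decode()] = vec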
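
For reference, the FastText `.vec` text format starts with a header line holding the number of words and the vector dimension, followed by one line per word. The sketch below parses a tiny in-memory sample in that layout. The helper `parse_vec` and the sample values are illustrative assumptions, not part of this notebook's pipeline.

```python
import io

def parse_vec(handle, expected_dim):
    """Read fastText-style text vectors, skipping the '<count> <dim>' header line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # The header (and any malformed line) does not carry expected_dim values.
        if len(parts) != expected_dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny sample with 3-dimensional vectors (made-up numbers).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
toy_vectors = parse_vec(sample, expected_dim=3)
```

Checking the field count is a cheap way to keep the header from being stored as if it were a word.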

We will now use the Word Embeddings to build an Index mapping the words in our vocabulary to Vectors.

In [19]:
word_index = tokenizer.word_index

In [20]:
index_word = {}
for word, item in word_index.items():
    index_word[item] = word

In [21]:
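# Row 0 is unused because the tokenizer starts numbering words at 1.
vocab_size = min(MAX_WORDS, len(word_index)) + 1
embeddings_matrix = np.zeros((vocab_size, embed_size))

# word_index is ordered by frequency, so we can stop at the first word past the cap.
for word, posit in word_index.items():
    if posit >= vocab_size:
        break

    vec = embeddings_dict.get(word)
    if vec is None:
        # Out-of-vocabulary word: fall back to a random vector in [0, 1).
        vec = np.random.sample(embed_size)
        embeddings_dict[word] = vec

    embeddings_matrix[posit] = vec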
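
The same alignment between word ranks and matrix rows can be sketched without NumPy. This is a minimal illustration under assumptions: `build_matrix` is a hypothetical helper, the vectors are 2-dimensional toy values, and a seeded generator is used so the random fallback is repeatable.

```python
import random

def build_matrix(word_index, vectors, dim, max_words, seed=0):
    """Return a list of rows where row i holds the vector of the word ranked i."""
    rng = random.Random(seed)
    size = min(max_words, len(word_index)) + 1  # +1 because rank 0 is unused
    matrix = [[0.0] * dim for _ in range(size)]
    for word, rank in word_index.items():
        if rank >= size:
            continue  # outside the vocabulary cap
        vec = vectors.get(word)
        if vec is None:
            # Out-of-vocabulary word: fall back to a random vector in [0, 1).
            vec = [rng.random() for _ in range(dim)]
        matrix[rank] = list(vec)
    return matrix

toy_index = {"good": 1, "bad": 2, "zzyzx": 3}
toy_vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0]}
toy_matrix = build_matrix(toy_index, toy_vectors, dim=2, max_words=10)
```

One consequence worth keeping in mind: words that fall back to random vectors carry no semantic information, so their nearest neighbours are not meaningful synonyms.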

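We will use the k-nearest neighbors algorithm to find, for each of the `top_k` most frequent words in our vocabulary, the words whose embeddings are closest to it. scikit-learn's `NearestNeighbors` uses Euclidean distance by default, and each word's first neighbor is the word itself, which is why we request one extra neighbor.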

In [23]:
total_syns = 5
top_k = 20000

In [24]:
nearest_syns = NearestNeighbors(n_neighbors=total_syns+1).fit(embeddings_matrix)

In [25]:
neighbours_mat = nearest_syns.kneighbors(embeddings_matrix[1:top_k])[1]

In [26]:
synonyms = {x[0]: x[1:] for x in neighbours_mat}
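
To make the lookup concrete, here is a brute-force sketch of the same idea on toy data, using plain Euclidean distance. It is not how scikit-learn implements the search: `nearest_neighbours` is a hypothetical helper and the 2-dimensional vectors are invented for the example.

```python
import math

def nearest_neighbours(matrix, query_ids, k):
    """For each query row, return the ids of its k closest rows (self excluded)."""
    result = {}
    for q in query_ids:
        ranked = sorted(
            (i for i in range(len(matrix)) if i != q),
            key=lambda i: math.dist(matrix[q], matrix[i]),
        )
        result[q] = ranked[:k]
    return result

# Toy 2-d "embeddings"; row 0 is the unused padding row.
toy_words = {1: "good", 2: "great", 3: "bad", 4: "awful"}
toy_matrix = [[0.0, 0.0], [1.0, 1.0], [1.1, 0.9], [-1.0, -1.0], [-0.9, -1.2]]
toy_synonyms = nearest_neighbours(toy_matrix, query_ids=range(1, 5), k=1)
```

Here the sketch excludes the query row explicitly, whereas the notebook drops the first returned neighbor, which relies on each word being its own closest match.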

Let's view a few synonyms of words in our vocabulary:

In [47]:
# Index 0 is reserved by the tokenizer, so sample from 1..top_k-1.
for posit in np.random.choice(np.arange(1, top_k), 10):
    print(index_word[posit] + " : " + str([index_word[i] for i in synonyms[posit][:total_syns-1]]))

infps : ['lithromantics', 'staffhunting', 'slughorn', 'occs']
corruption : ['corrupt', 'corruptions', 'corrupted', 'malfeasance']
fifty : ['twenty', 'thirty', 'forty', 'sixty']
patton : ['davis', 'sherman', 'george', 'james']
gm : ['gms', 'g', 'ml', 'sr']
annoying : ['irritating', 'obnoxious', 'irksome', 'annoyingly']
coupon : ['coupons', 'discount', 'discounts', 'promo']
badge : ['badges', 'emblem', 'insignia', 'emblems']
curiosity : ['curiousity', 'curosity', 'curious', 'fascination']
could : ['would', 'might', 'can', 'should']


### Augmentation

To augment the actual sentences, we will use the following strategy:

We iterate over each sentence with a target of 1 and replace each word with a randomly chosen synonym with a probability of 50%. Words without stored synonyms (those outside the `top_k` most frequent) are left unchanged.

In [48]:
def augment_sentence(encoded_sentence, prob=0.5):
    # Work on a copy so the original sequence is left untouched.
    augmented = list(encoded_sentence)
    for posit, word_id in enumerate(augmented):
        # Replace the word with probability `prob`.
        if random.random() < prob:
            try:
                augmented[posit] = np.random.choice(synonyms[word_id])
            except KeyError:
                # No synonyms stored for this word (outside the top_k words).
                pass
    return augmented
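
The replacement strategy can be checked on toy data with a standalone sketch. It assumes a hypothetical `augment` helper and a made-up synonym table, and it takes a seeded generator so the behavior at the two extreme probabilities is easy to verify.

```python
import random

def augment(sequence, synonyms, prob=0.5, rng=None):
    """Return a new list where each id is swapped for a random synonym with probability prob."""
    rng = rng or random.Random()
    augmented = []
    for word_id in sequence:
        candidates = synonyms.get(word_id)
        has_candidates = candidates is not None and len(candidates) > 0
        if has_candidates and rng.random() < prob:
            augmented.append(rng.choice(list(candidates)))
        else:
            # No synonyms known, or the coin flip said to keep the word.
            augmented.append(word_id)
    return augmented

toy_synonyms = {1: [2, 3], 4: [5]}
original = [1, 4, 9, 1]
unchanged = augment(original, toy_synonyms, prob=0.0)
all_swapped = augment(original, toy_synonyms, prob=1.0, rng=random.Random(0))
```

With `prob=0.0` nothing changes, and with `prob=1.0` every word that has synonyms is replaced, while the input list itself is never modified.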

An example sentence before replacement:

In [111]:
' '.join([index_word[idx] for idx in ones_seq[6]])

'so like it marriage the american woman for with green it that more could they fee'

In [113]:
new_sequences = []
for ite in range(len(ones_seq)):
    new_sequences.append(augment_sentence(ones_seq[ite]))

The sentence after replacement:

In [114]:
' '.join([index_word[idx] for idx in new_sequences[6]])

'it like it marriage the canadian lady the in greeen it but more could themselves pay'

We will add these new sentences to the original DataFrame and save the result to a file.

In [170]:
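# Decode the augmented sequences back into text; a DataFrame built directly from
# the integer sequences would have one column per token instead of a text column.
new_text = [' '.join(index_word[idx] for idx in s) for s in new_sequences]
new_df = pd.DataFrame({'cleaned_text': new_text})

In [172]:
new_df['target'] = 1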

In [174]:
new_df.columns = ['cleaned_text', 'target']

In [177]:
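# data_text_label is not defined in the cells shown above; it is assumed here
# to be the text and label columns of the cleaned training data.
data_text_label = data[['cleaned_text', 'target']]
len(data_text_label)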

1303765

In [178]:
combined_df = pd.concat([data_text_label, new_df])

In [184]:
(combined_df == 1).sum()

cleaned_text         0
target          161450
dtype: int64
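
The two counts above are enough to estimate how much the class balance changed. The arithmetic below is a sketch that assumes exactly one augmented copy was added per original insincere row, which matches the loop over `ones_seq` but is not verified separately here.

```python
original_rows = 1303765   # len(data_text_label) reported above
positives_after = 161450  # rows with target == 1 after concatenation

# Assumption: exactly one augmented copy was added per original insincere row.
positives_before = positives_after // 2
added_rows = positives_after - positives_before
total_after = original_rows + added_rows

share_before = positives_before / original_rows
share_after = positives_after / total_after
print(f"insincere share before: {share_before:.2%}, after: {share_after:.2%}")
```

Under that assumption the insincere share rises from roughly 6% to roughly 12%, so the dataset remains imbalanced after a single augmentation pass.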

In [185]:
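# index=False avoids writing the duplicated row index left over from the concatenation.
combined_df.to_csv('data/augmented_quora_text.txt', index=False)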