 # Augmentations in NLP

Data augmentation techniques in NLP show substantial improvements on datasets with fewer than 500 observations, as illustrated by the original paper:

https://arxiv.org/abs/1901.11196

The paper considered here is "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks".




# **The simple data augmentation techniques are:**
1. SR : Synonym Replacement 
2. RD : Random Deletion
3. RS : Random Swap
4. RI : Random Insertion



In [1]:
# Used in all sections for managing data and files
import os
import numpy as np
from tqdm import tqdm
import pandas as pd
import pickle
import re

# NLTK is used for preprocessing text. You can find out more about each module in its documentation.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import inaugural, stopwords
from wordcloud import WordCloud, STOPWORDS

# Scikit-Learn is used for feature extraction and training a logistic regression model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [2]:
# Reading train dataset in labelled training data folder and making a dataframe from them

labelled_training_data_path = '../input/iitgaihackathonmlrw2022/labelled_train_data.csv'
train_df = pd.read_csv(labelled_training_data_path)
unlabelled_train_df = pd.read_csv('../input/iitgaihackathonmlrw2022/unlabelled_train_data.csv')
test_df = pd.read_csv('../input/iitgaihackathonmlrw2022/data_only_test.csv')

print(train_df.shape, test_df.shape, unlabelled_train_df.shape)

In [3]:
cols = ['characteristics_ch1', 'data_processing', 'description', 'extract_protocol_ch1',
        'hyb_protocol', 'label_ch1', 'label_protocol_ch1', 
        'molecule_ch1', 'organism_ch1', 'scan_protocol', 'source_name_ch1', 'title',
        'type']
cols

# cols = ['molecule_ch1','organism_ch1','type' , 'contact_country', 'title' ,'characteristics_ch1' ]

In [4]:
train_df = train_df.replace(np.nan, '', regex=True)
unlabelled_train_df = unlabelled_train_df.replace(np.nan, '', regex=True)
test_df = test_df.replace(np.nan, '', regex=True)

In [5]:
def remove_urls(line):
    line = re.sub(r'http\S+', '', line)
    line = re.sub(r'www\S+', '', line)
    line = re.sub(r'\S+\.txt', '', line)  # escape the dot so only tokens ending in a literal ".txt" are removed
    return line

In [6]:
def combine_text(x):
    all_text = ''
    for col in cols:
        all_text += x[col]
        all_text += ' '
    all_text = remove_urls(all_text)
    return all_text

In [7]:
train_df['features'] = train_df.apply(combine_text, axis=1)
unlabelled_train_df['features'] = unlabelled_train_df.apply(combine_text, axis=1)
test_df['features'] = test_df.apply(combine_text, axis=1)

In [8]:
# Preprocess the 'features' column and store the cleaned output in 'cleaned_feature'.
# This is done by the function below.

def preprocess(data_df):
    data_df['cleaned_feature'] = ''
    
    # Initializing Stopwords and Lemmatization objects
    stop_words = set(stopwords.words('english'))
    wordnet_lemm = WordNetLemmatizer()
    
    # Pattern matching characters that are not letters, digits, hyphens or spaces, so they can be removed
    alpha_or_numeric = "[^a-zA-Z0-9- ]"

    for index, row in tqdm(data_df.iterrows(), total=data_df.shape[0]):
    
        sample = row['features']
        
        # Replace characters other than letters, digits, hyphens and spaces with a blank space,
        # then lowercase the text. You can add more cleaning steps on top of these two.
        pre_txt = re.sub(alpha_or_numeric, " ", sample)
        pre_txt = pre_txt.lower()  # lowercase the cleaned text, not the raw sample
            
        
        # Removing stop words and lemmatizing different words in preprocessed text and making the final processed text
        sample_words = [wordnet_lemm.lemmatize(w) for w in pre_txt.split() if w not in stop_words and len(w)>1]
        pre_proc_ver = ' '.join(sample_words)
        
        data_df.loc[index, 'cleaned_feature'] = pre_proc_ver
        
    return data_df
        
        
# Cleaned Training set
cleaned_train_df = preprocess(train_df.copy())
# cleaned_unlabelled_train_df = preprocess(unlabelled_train_df.copy())
cleaned_test_df = preprocess(test_df.copy())

# 1. Synonym Replacement (SR)

Synonym replacement is a technique in which we replace a word with one of its synonyms.

To identify relevant synonyms we use WordNet.

In [9]:
from nltk.corpus import wordnet

def get_synonyms(word):
    
    synonyms = set()
    
    for syn in wordnet.synsets(word):
        for l in syn.lemmas():
            synonym = l.name().replace("_", " ").replace("-", " ").lower()
            synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
            synonyms.add(synonym) 
    if word in synonyms:
        synonyms.remove(word)
    
    return list(synonyms)

The `get_synonyms` function returns a cleaned list of synonyms for the given word, excluding the word itself.
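To make the cleanup step concrete, here is a minimal sketch of the normalisation that `get_synonyms` applies to each lemma name, run on a few hand-written strings rather than live WordNet output. The helper name `normalize_lemma` and the sample strings are assumptions made for illustration only.

```python
import string

# Characters kept by the cleanup step: lowercase ASCII letters and the space.
ALLOWED = set(string.ascii_lowercase + " ")

def normalize_lemma(name):
    """Mirror the cleanup in get_synonyms: underscores and hyphens become
    spaces, the text is lowercased, and every other character is dropped."""
    cleaned = name.replace("_", " ").replace("-", " ").lower()
    return "".join(ch for ch in cleaned if ch in ALLOWED)

# Hand-written examples in the style of WordNet lemma names (hypothetical).
samples = ["ice_cream", "Well-Known", "rock'n'roll", "A-1"]
normalized = [normalize_lemma(s) for s in samples]
print(normalized)
```

Note that digits are dropped as well, so a lemma such as `A-1` collapses to `a ` with a trailing space. Whether that matters depends on how much of your vocabulary contains numbers.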

Now we will replace the words with synonyms

In [10]:
from nltk.corpus import stopwords
stop_words = []
for w in stopwords.words('english'):
    stop_words.append(w)
print(stop_words)

In [11]:
import random

In [12]:

def synonym_replacement(words, n):
    
    words = words.split()
    
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        
        if num_replaced >= n: #only replace up to n words
            break

    sentence = ' '.join(new_words)

    return sentence


In [13]:
print(f" Example of Synonym Replacement: {synonym_replacement('hey man how are you doing',3)}")

To get a larger diversity of sentences, we can try replacing 1, 2, 3, ... words in the given sentence.
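One way to pick the number of words to change is to tie it to the sentence length, as the augmentation loops later in this notebook do with a 5% rate. The sketch below is only an illustration: the helper `num_words_to_change` and its `minimum` floor are hypothetical and not part of the notebook.

```python
import math

def num_words_to_change(sentence, alpha=0.05, minimum=1):
    """Return how many words to modify: alpha times the word count,
    rounded down, but never below `minimum` (hypothetical helper)."""
    n_words = len(sentence.split())
    return max(minimum, math.floor(alpha * n_words))

short = "hey man how are you doing"      # 6 words
long_text = " ".join(["word"] * 100)     # 100 words

n_short = num_words_to_change(short)
n_long = num_words_to_change(long_text)
print(n_short, n_long)
```

The floor of one word is a design choice: without it, any text shorter than 20 words would get `n = 0` at a 5% rate.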

Now let's take an example from our dataset and augment it to create 3 additional sentences per record.

In [14]:
trial_sent = cleaned_train_df['cleaned_feature'][4]
print(trial_sent)


In [15]:
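# Create 3 augmented sentences, replacing up to 1, 2 and 3 distinct words respectively.
# (Starting at n=0 would still replace one word, because the limit is checked after each replacement.)
for n in range(1, 4):
    print(f" Example of Synonym Replacement: {synonym_replacement(trial_sent,n)}")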

We are now able to augment this data :)

Each augmented text can be stored as a new row that keeps the remaining columns of the original record.

In [16]:
cols

In [17]:
cleaned_train_df = cleaned_train_df[cols + ['cleaned_feature']]
cleaned_train_df.columns

In [43]:
SR_data_list = []
for idx, row in tqdm(cleaned_train_df.iterrows(), total=cleaned_train_df.shape[0]):
    for j in range(7):
        new_data = []
        for col in cleaned_train_df.columns:
            data = cleaned_train_df.loc[idx, col]
            if col == 'cleaned_feature':
                new_data.append(synonym_replacement(data, np.floor(0.05 * len(data.split())).astype(int)))
            else:
                new_data.append(data)
        SR_data_list.append(new_data)

# construct pandas dataframe from the list
SR_cleaned_train_df = pd.DataFrame(SR_data_list, columns=cleaned_train_df.columns)

In [44]:
print(SR_cleaned_train_df.shape)
SR_cleaned_train_df.head()

# 2. Random Deletion (RD)

In Random Deletion, we randomly delete a word if a uniformly generated number between 0 and 1 is smaller than a pre-defined threshold. This allows for a random deletion of some words of the sentence.
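A minimal, self-contained sketch of the rule described above, using only the standard library. The function name `delete_words` and the seeded `random.Random` instance are assumptions made so the example is reproducible; the notebook's own `random_deletion` is defined in the next cell.

```python
import random

def delete_words(words, p, rng):
    """Keep a word when a uniform draw in [0, 1) exceeds p,
    so each word is dropped with probability of roughly p."""
    kept = [w for w in words if rng.random() > p]
    # Fall back to one randomly chosen word if everything was dropped.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
words = ["w%d" % i for i in range(1000)]
kept = delete_words(words, p=0.2, rng=rng)
kept_fraction = len(kept) / len(words)
print(round(kept_fraction, 2))
```

With `p = 0.2` roughly 80% of the words should survive, and the surviving words keep their original order.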



In [25]:
def random_deletion(words, p):

    words = words.split()
    
    # if there is at most one word, return it unchanged (as a string, like the other branches)
    if len(words) <= 1:
        return ' '.join(words)

    #randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # if every word was deleted, return a single random word (as a string, not a list)
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return words[rand_int]

    sentence = ' '.join(new_words)
    
    return sentence

Let's test this augmentation on our trial sentence.

In [26]:
print(random_deletion(trial_sent,0.2))
print(random_deletion(trial_sent,0.3))
print(random_deletion(trial_sent,0.4))

This could help reduce overfitting and may improve model accuracy.

In [45]:
RD_data_list = []
for idx, row in tqdm(cleaned_train_df.iterrows(), total=cleaned_train_df.shape[0]):
    for j in range(3):
        new_data = []
        for col in cleaned_train_df.columns:
            data = cleaned_train_df.loc[idx, col]
            if col == 'cleaned_feature':
                new_data.append(random_deletion(data, 0.2))
            else:
                new_data.append(data)
        RD_data_list.append(new_data)

# construct pandas dataframe from the list
RD_cleaned_train_df = pd.DataFrame(RD_data_list, columns=cleaned_train_df.columns)

In [28]:
print(RD_cleaned_train_df.shape)
RD_cleaned_train_df.head()


# 3. Random Swap (RS)

In Random Swap, we randomly swap the order of two words in a sentence.
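The operation can be sketched in a few lines of standard-library Python. The function name `swap_once` and the example sentence are assumptions for illustration; unlike the notebook's `swap_word`, this version draws two distinct indices directly instead of retrying.

```python
import random
from collections import Counter

def swap_once(words, rng):
    """Return a copy of `words` with two distinct positions exchanged."""
    if len(words) < 2:
        return list(words)                     # nothing to swap
    i, j = rng.sample(range(len(words)), 2)    # two different indices
    swapped = list(words)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

rng = random.Random(0)
original = "the quick brown fox jumps".split()
swapped = swap_once(original, rng)
print(" ".join(swapped))
```

A swap never changes which words are present, only their order, so bag-of-words features such as TF-IDF unigrams are unaffected by it.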


In [29]:
def swap_word(new_words):
    
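    # nothing to swap with fewer than two words (this also avoids calling randint on an empty list)
    if len(new_words) < 2:
        return new_words

    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0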
    
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        
        if counter > 3:
            return new_words
    
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 
    return new_words

# This will Swap the words



In [30]:
def random_swap(words, n):
    
    words = words.split()
    new_words = words.copy()
    # n is the number of words to be swapped
    for _ in range(n):
        new_words = swap_word(new_words)
        
    sentence = ' '.join(new_words)
    
    return sentence

In [31]:
print(random_swap(trial_sent,1))
print(random_swap(trial_sent,2))
print(random_swap(trial_sent,3))

Random swapping can make models more robust to word order and may in turn help text classification.

A high number of swaps may degrade the model.

Swapping can easily destroy the semantics of a sentence, so be careful when using this augmentation.



In [46]:
RS_data_list = []
for idx, row in tqdm(cleaned_train_df.iterrows(), total=cleaned_train_df.shape[0]):
    for j in range(3):
        new_data = []
        for col in cleaned_train_df.columns:
            data = cleaned_train_df.loc[idx, col]
            if col == 'cleaned_feature':
                new_data.append(random_swap(data, np.floor(0.05 * len(data.split())).astype(int)))
            else:
                new_data.append(data)
        RS_data_list.append(new_data)

# construct pandas dataframe from the list
RS_cleaned_train_df = pd.DataFrame(RS_data_list, columns=cleaned_train_df.columns)

In [35]:
print(RS_cleaned_train_df.shape)

# 4. Random Insertion (RI)
Finally, in Random Insertion, we insert a synonym of a randomly chosen word at a random position in the sentence.

Data augmentation operations should not change the true label of a sentence, as that would introduce unnecessary noise into the data. Inserting a synonym of a word in the sentence, as opposed to a random word, is more likely to be relevant to the context and retain the original label of the sentence.
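The idea can be sketched without WordNet by using a small hand-written synonym table. Everything named here (`TOY_SYNONYMS`, `insert_synonyms`, `max_tries`) is hypothetical and only stands in for the WordNet lookup used in the notebook.

```python
import random

# Toy synonym table standing in for WordNet (hypothetical entries).
TOY_SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}

def insert_synonyms(words, n, rng, max_tries=10):
    """Insert up to n synonyms of randomly chosen words at random positions."""
    result = list(words)
    for _ in range(n):
        for _ in range(max_tries):
            candidates = TOY_SYNONYMS.get(rng.choice(result), [])
            if candidates:
                # randint is inclusive, so the new word may also land at the very end.
                result.insert(rng.randint(0, len(result)), rng.choice(candidates))
                break
    return result

rng = random.Random(0)
original = "the quick dog looks happy".split()
augmented = insert_synonyms(original, n=2, rng=rng)
print(" ".join(augmented))
```

The original words always survive in their original order; only synonyms are added, which is why this operation is less likely to flip a label than inserting arbitrary words.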

In [36]:
def random_insertion(words, n):
    
    words = words.split()
    new_words = words.copy()
    
    for _ in range(n):
        add_word(new_words)
        
    sentence = ' '.join(new_words)
    return sentence

def add_word(new_words):
    
    synonyms = []
    counter = 0
    
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
        
    random_synonym = random.choice(synonyms)  # pick any synonym rather than always the first one
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)


In [37]:
print(random_insertion(trial_sent,1))
print(random_insertion(trial_sent,2))
print(random_insertion(trial_sent,3))

In [38]:
def aug(sent,n,p):
    print(f" Original Sentence : {sent}")
    print(f" SR Augmented Sentence : {synonym_replacement(sent,n)}")
    print(f" RD Augmented Sentence : {random_deletion(sent,p)}")
    print(f" RS Augmented Sentence : {random_swap(sent,n)}")
    print(f" RI Augmented Sentence : {random_insertion(sent,n)}")
    
    

In [39]:
aug("Hey everyone. This is a text classification problem",4,0.3)

In [47]:
RI_data_list = []
for idx, row in tqdm(cleaned_train_df.iterrows(), total=cleaned_train_df.shape[0]):
    for j in range(3):
        new_data = []
        for col in cleaned_train_df.columns:
            data = cleaned_train_df.loc[idx, col]
            if col == 'cleaned_feature':
                new_data.append(random_insertion(data, np.floor(0.05 * len(data.split())).astype(int)))
            else:
                new_data.append(data)
        RI_data_list.append(new_data)

# construct pandas dataframe from the list
RI_cleaned_train_df = pd.DataFrame(RI_data_list, columns=cleaned_train_df.columns)

In [42]:
print(RI_cleaned_train_df.shape)

The shape above shows how many augmented rows random insertion produced.

We covered the main Data Augmentation techniques in NLP. This is an active field of research, and the papers about this topic are quite recent.

Link to Paper : https://arxiv.org/abs/1901.11196

Repo Link : https://github.com/jasonwei20/eda_nlp


In [None]:
# Now append all the dataframes - cleaned_train_df, SR_cleaned_train_df, RS_cleaned_train_df, RI_cleaned_train_df, RD_cleaned_train_df

In [51]:
frames = [cleaned_train_df, SR_cleaned_train_df, RS_cleaned_train_df, RI_cleaned_train_df, RD_cleaned_train_df]
augmented_cleaned_train_df = pd.concat(frames, ignore_index=True)

In [52]:
import pickle

In [53]:
augmented_cleaned_train_df.to_pickle('augmented_cleaned_train_df.pkl', protocol=4)

In [54]:
augmented_cleaned_train_df.shape
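
As a sanity check on the shape printed above, the expected row count can be worked out from the number of copies each loop in this notebook generates per original row (7 for SR, 3 each for RD, RS and RI, plus the original row itself). The sketch below assumes a hypothetical training set of 100 rows; in the notebook you would use `len(cleaned_train_df)` instead.

```python
# Copies generated per original row by each augmentation loop in this notebook.
copies_per_row = {"original": 1, "SR": 7, "RD": 3, "RS": 3, "RI": 3}

def expected_rows(n_original, copies):
    """Total rows after concatenating the original frame with every augmented frame."""
    return n_original * sum(copies.values())

n_original = 100  # hypothetical size, used only for illustration
total = expected_rows(n_original, copies_per_row)
print(total)  # 1700
```

In other words, the concatenated frame should be 17 times the size of the cleaned training set.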

![EDA_nlp.PNG](attachment:EDA_nlp.PNG)

# Final Thoughts Regarding EDA:




* EDA might not help much if you’re using a large enough dataset.

* Models that have been pre-trained on massive datasets probably don’t need EDA, so in most cases Transformer-based models won’t require it.

* Generating augmented data similar to the original data introduces some degree of noise that helps prevent overfitting. This gives a clear picture of why augmentation helps in building robust models.

* EDA can introduce new vocabulary through the synonym replacement and random insertion operations, allowing models to generalize to words in the test set that were not in the training set.

* Finally, these effects are more pronounced for smaller datasets.


I am open to new ideas and suggestions, and I hope you were able to learn something useful.

*Thank you so much for reading the kernel!*

