# NLP Albumentation

Have you ever wanted to build a cool NLP project such as text summarization, a question-answering bot, or finding semantic meaning in text, only to get poor performance from the model because of a lack of training data? With the current growth in NLP, these projects can be implemented quickly with the help of models available on platforms like HuggingFace, which are pretrained on large volumes of text data. Since these models are trained on generic datasets like Wikipedia, Stack Exchange, Yahoo Answers, etc., they tend to have only average performance on domain-specific tasks. For example, if we are dealing with medical data, it is recommended to take these pretrained models and fine-tune them on medical datasets.

But the challenge here is collecting the data for training. These models require large volumes of text data. Similar to the data augmentation techniques used in Computer Vision, text data can be augmented too, which this notebook calls NLP Albumentation. These techniques are applied to the existing text to increase the size of the training set, which can improve model performance on NLP tasks.

This notebook demonstrates some NLP Albumentation techniques for generating more training data from existing data.

In [2]:
import pandas as pd
import numpy as np

# Import NLTK libraries
import nltk
from nltk.corpus import stopwords
nltk.download('inaugural')
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\rraja\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rraja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\rraja\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rraja\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Import NLP Augmentation related libraries

In [3]:
from nlpaug.util.file.download import DownloadUtil
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

### Read data

In [4]:
df = pd.read_excel("reviews.xlsx")

In [5]:
df.head()

Unnamed: 0,Reviews,Sentiment
0,Good restaurant with modern design great chil...,1
1,The bath and shower room was amazing Beautifu...,1
2,Backyard of the hotel is total mess shouldn t...,0
3,Funky rooms well designed hotel friendly and ...,1
4,Bed was extremely comfy and the staff where w...,1


In [6]:
df.shape

(227, 2)

There are 227 hotel reviews in the dataset. I know, this is very small for NLP tasks. Still, for the purpose of the experiment, I take this small dataset, build a classifier, and note its performance. Then I use NLP augmentation techniques to enlarge the dataset and run the classifier again to compare its performance.

In [69]:
"""
def preprocess_speech(text):
    '''
    Function to lower the speech text, remove punctuations and stop words from the speech and perform lemmatization.
    Args: 
        text (str): raw text.
    Returns:
        preprocessed text.
    '''
    lem=WordNetLemmatizer()
    lower_text = text.lower()
    tokens = nltk.word_tokenize(lower_text)
    punct_removed = [token for token in tokens if token.isalnum()] # remove punctuations from speech
    
    stopwords_removed = [word for word in punct_removed if word not in stopwords.words("english")]
    
    final_words=[lem.lemmatize(w) for w in stopwords_removed if len(w) > 2]
    
    return " ".join(final_words)
"""


Preprocessing the text data is not required here because, for text augmentation to work, we need to preserve the semantic meaning of the text.

In [9]:
#df['preprocessed_reviews'] = df['Reviews'].apply(preprocess_speech)

In [7]:
df.head()

Unnamed: 0,Reviews,Sentiment
0,Good restaurant with modern design great chil...,1
1,The bath and shower room was amazing Beautifu...,1
2,Backyard of the hotel is total mess shouldn t...,0
3,Funky rooms well designed hotel friendly and ...,1
4,Bed was extremely comfy and the staff where w...,1


In [8]:
# Split the data into train and test sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(df['Reviews'],df['Sentiment'],test_size=0.3)

In [9]:
Train_Y.value_counts()

0    82
1    76
Name: Sentiment, dtype: int64

The target classes are roughly evenly distributed in the training set (82 reviews with label 0, 76 with label 1).

In [10]:
# Vectorize the words with TF-IDF, which scores how important a word is in a document relative to the corpus
Tfidf_vect = TfidfVectorizer(max_features=5000) # keep at most the 5000 most frequent terms
Tfidf_vect.fit(df['Reviews']) # note: fitted on all reviews, including those later used for testing

TfidfVectorizer(max_features=5000)
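To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization (a simplification of what `TfidfVectorizer` does), the smoothed idf formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The `tfidf` function and the toy reviews are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_reviews = ["great room great staff", "dirty room", "friendly staff"]
toy_vectors = tfidf(toy_reviews)
for review, vec in zip(toy_reviews, toy_vectors):
    print(review, {t: round(w, 3) for t, w in vec.items()})
```

In the first toy review, "great" ends up with a higher weight than "room": it occurs twice in that review and in no other, while "room" is shared with the second review.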

In [11]:
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

### Build Support Vector Machine classifier

In [12]:
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

SVC(gamma='auto', kernel='linear')

In [13]:
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)

In [14]:
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  91.30434782608695


An accuracy of about 91% is not bad given the very small dataset.

### Build Naive Bayes Classifier

In [15]:
# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

MultinomialNB()

In [16]:
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)

In [17]:
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  86.95652173913044


An accuracy of about 87% is not bad given the very small dataset.
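As a quick sanity check on the two baseline scores, the arithmetic below reads the split sizes off the outputs above (227 reviews in total, 82 + 76 = 158 of them in the training set). The `accuracy_percent` helper is a hypothetical name used only for this sketch.

```python
n_reviews = 227
n_train = 82 + 76              # from Train_Y.value_counts()
n_test = n_reviews - n_train   # 69 reviews are left for testing

def accuracy_percent(n_correct, n_total):
    """Accuracy as a percentage: correct predictions over all predictions."""
    return 100.0 * n_correct / n_total

svm_pct = accuracy_percent(63, n_test)   # matches the reported SVM score
nb_pct = accuracy_percent(60, n_test)    # matches the reported Naive Bayes score
per_review = accuracy_percent(1, n_test) # what a single test review is worth

print(round(svm_pct, 2), round(nb_pct, 2), round(per_review, 2))
```

With only 69 test reviews, each prediction is worth roughly 1.45 percentage points, so the gap between the two classifiers comes down to 3 reviews.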

## NLPAUG
Data Augmentation library for text data

In [18]:
final_df = df[['Reviews', 'Sentiment']]

In [19]:
final_df.shape

(227, 2)

### OCR Augmenter
Simulate OCR engine error

In [20]:
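temp_df = final_df
aug = nac.OcrAug()
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_texts = aug.augment(text, n=3)  # three OCR-noised variants per review
    for a in augmented_texts:
        new_rows.append({'Reviews': a, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)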

In [21]:
final_df.shape

(908, 2)

In [22]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
903,Ckeat Customer Service Staff very kespun8ive t...,1
904,Ckeat Customer Service Staff very responsive t...,1
905,Beautiful koum that was ready early with the m...,1
906,Beautiful room that was ready early with the m...,1
907,Beautiful k0om that was ready early with the m...,1


For example, room became koum and responsive became kespun8ive; these are some of the OCR errors introduced.
I generated 3 augmented texts from each input text, which is why the dataframe now has 908 rows (227 + 3 × 227).
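The idea behind this kind of augmenter can be sketched in a few lines of plain Python. The confusion table below is hypothetical: it is read off the sample output above (r → k, o → u or 0, s → 8, G → C or 9) and is not nlpaug's actual mapping. The function name `ocr_noise` and its parameters are likewise made up for this sketch.

```python
import random

# Hypothetical confusion table, taken from the errors visible in the output above.
OCR_CONFUSIONS = {"r": ["k"], "o": ["u", "0"], "s": ["8"], "G": ["C", "9"]}

def ocr_noise(text, char_prob=0.3, seed=None):
    """Replace confusable characters with a look-alike, each with probability char_prob."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < char_prob:
            out.append(rng.choice(OCR_CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)

sample = "Great Customer Service Staff very responsive"
# Three variants per input, mirroring n=3 in the cell above.
variants = [ocr_noise(sample, seed=s) for s in range(3)]
for v in variants:
    print(v)
```

Because every substitution is one character for one character, the variants keep the length and word boundaries of the original, which is why the augmented reviews remain readable.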

### Random Character Augmenter
insert character randomly

In [23]:
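temp_df = final_df[:228]  # the 227 original reviews plus the first OCR-augmented row
aug = nac.RandomCharAug(action="insert")
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)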

In [24]:
final_df.shape

(1136, 2)

In [25]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
1131,The VlUocati^on was pezruf8ect and so Abeauhti...,1
1132,No NNegaItrive,0
1133,Great Customer Service Staff very reks8po(nsiv...,1
1134,*BGeautFiful krocom that was preadoy early wit...,1
1135,Good resFtcaur0ant with m9oderVn design 9kbeXa...,1


### Random Character Augmenter
swap character randomly

In [26]:
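temp_df = final_df[:228]
aug = nac.RandomCharAug(action="swap")
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)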

In [27]:
final_df.shape

(1364, 2)

In [28]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
1359,The location was prfeetc and so beautiufl Only...,1
1360,No Engtaiev,0
1361,Great Customer Service Staff very rsepnosiev t...,1
1362,Beautiful room that was ready early with the m...,1
1363,Good restaurant whit modern design 9keat chill...,1


### Random Character Augmenter
substitute character randomly

In [29]:
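temp_df = final_df[:228]
aug = nac.RandomCharAug(action="substitute")
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)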

In [30]:
final_df.shape

(1592, 2)

In [31]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
1587,The lvcaWUon was perfect and so beautiful Only...,1
1588,No NV^acive,0
1589,Great Customer Skrvyc( Staff berb responsive t...,1
1590,Beautiful room that was ready early !iMh the m...,1
1591,Go%e restaurant with mFdexn dedig( 9k8aq chill...,1


### Random Character Augmenter
delete character randomly

In [32]:
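temp_df = final_df[:228]
aug = nac.RandomCharAug(action="delete")
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)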

In [33]:
final_df.shape

(1820, 2)

In [34]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
1815,The catio was perfect and so bautfu Oy a tram ...,1
1816,No egtve,0
1817,Great uster Service Staff very esponse to our ...,1
1818,Beautiful room that was eay early with the mos...,1
1819,Good restaurant th oden eign 9keat chill uot 1...,1
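The four character-level actions used above (insert, swap, substitute, delete) can be sketched in plain Python as follows. This is only an illustration of the idea, not nlpaug's implementation: the names `augment_word` and `augment_text`, the `word_prob` parameter, and the choice of one edit per selected word are all assumptions made for this sketch.

```python
import random
import string

def augment_word(word, action, rng):
    """Apply one random character-level edit to a single word."""
    if len(word) < 2:
        return word  # too short to edit safely
    i = rng.randrange(len(word))
    ch = rng.choice(string.ascii_letters + string.digits)
    if action == "insert":
        return word[:i] + ch + word[i:]
    if action == "substitute":
        return word[:i] + ch + word[i + 1:]
    if action == "delete":
        return word[:i] + word[i + 1:]
    if action == "swap":
        j = min(i, len(word) - 2)  # swap the characters at positions j and j + 1
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    raise ValueError(f"unknown action: {action}")

def augment_text(text, action, word_prob=0.3, seed=None):
    """Edit each word with probability word_prob and leave the others untouched."""
    rng = random.Random(seed)
    return " ".join(
        augment_word(w, action, rng) if rng.random() < word_prob else w
        for w in text.split()
    )

review = "Beautiful room that was ready early"
for action in ("insert", "swap", "substitute", "delete"):
    print(action, "->", augment_text(review, action, seed=7))
```

Swapping only reorders characters, so it tends to produce the mildest noise; insert and substitute introduce characters that never appeared in the original word.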


### Spelling Augmenter

In [35]:
temp_df = final_df[:228]
aug = naw.SpellingAug()
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_texts = aug.augment(text, n=3)  # three misspelled variants per review
    for a in augmented_texts:
        new_rows.append({'Reviews': a, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)

In [36]:
final_df.shape

(2504, 2)

In [37]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
2499,Beautiful room that was ready ealy with the mo...,1
2500,Beautiful room that was ready early with Athe ...,1
2501,Bood reastaurant wiyh monderns design 9keat ch...,1
2502,Goog Resataurant with mudren design 9keat chil...,1
2503,God restaurant with mordern desingn 9keat chil...,1


### Synonym Augmenter

In [38]:
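temp_df = final_df[:228]
aug = naw.SynonymAug(aug_src='wordnet')
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)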

In [39]:
final_df.shape

(2732, 2)

In [40]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
2727,The location be gross and so beautiful Only a ...,1
2728,No Negative,0
2729,Great Customer Service Faculty very responsive...,1
2730,Beautiful room that be ready early with the mo...,1
2731,Honorable eatery with modern design 9keat chil...,1
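Synonym replacement can be illustrated without WordNet by using a small lookup table. The table `TOY_SYNONYMS`, the function `synonym_replace`, and the `word_prob` parameter below are hypothetical stand-ins for this sketch; the cell above uses WordNet through nlpaug instead.

```python
import random

# Hypothetical hand-written synonym table standing in for WordNet.
TOY_SYNONYMS = {
    "good": ["fine", "decent"],
    "restaurant": ["eatery"],
    "staff": ["personnel"],
    "beautiful": ["lovely"],
}

def synonym_replace(text, synonyms, word_prob=0.3, seed=None):
    """Replace words that have a known synonym, each with probability word_prob."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < word_prob:
            replacement = rng.choice(options)
            # Keep the capitalisation of the original word.
            if word[0].isupper():
                replacement = replacement.capitalize()
            out.append(replacement)
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("Good restaurant with modern design", TOY_SYNONYMS, word_prob=1.0, seed=0))
```

A lookup that ignores context can pick a sense that does not fit the sentence. The output above shows this: perfect became gross and Staff became Faculty, so it may be worth spot-checking synonym-augmented rows before training on them.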


### Random Word Augmenter
swap word randomly

In [41]:
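temp_df = final_df[:228]
aug = naw.RandomWordAug(action="swap")
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)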

In [42]:
final_df.shape

(2960, 2)

In [43]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
2955,The location was and perfect so Only beautiful...,1
2956,Negative no,0
2957,Great Customer Service Staff very responsive t...,1
2958,Beautiful room that ready was early with the m...,1
2959,With good restaurant design modern 9keat chill...,1


### Random Word Augmenter
delete word randomly

In [44]:
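temp_df = final_df[:228]
aug = naw.RandomWordAug(action="delete")  # "delete" is the default action, made explicit here
new_rows = []
for i in range(len(temp_df)):
    text = temp_df.iloc[i]['Reviews']
    sentiment = temp_df.iloc[i]['Sentiment']
    augmented_text = aug.augment(text)[0]  # augment() returns a list; keep the single variant
    new_rows.append({'Reviews': augmented_text, 'Sentiment': sentiment})

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)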

In [45]:
final_df.shape

(3188, 2)

In [46]:
final_df.tail()

Unnamed: 0,Reviews,Sentiment
3183,The location perfect and so Only a tram ride a...,1
3184,Negative,0
3185,Great Service Staff very our individual also h...,1
3186,Beautiful room was ready early with the most c...,1
3187,Good with modern design uot p1ace pakr the hot...,1
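The two word-level actions above can be sketched in plain Python as well. The functions `swap_words` and `delete_words` and the `word_prob` parameter are hypothetical names for this illustration, and the sketch swaps a single adjacent pair, which is a simplification of what the augmenter does.

```python
import random

def swap_words(text, seed=None):
    """Swap one randomly chosen pair of adjacent words."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def delete_words(text, word_prob=0.3, seed=None):
    """Drop each word with probability word_prob, always keeping at least one."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= word_prob]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty review
    return " ".join(kept)

review = "Beautiful room that was ready early"
print(swap_words(review, seed=1))
print(delete_words(review, seed=1))
```

For a bag-of-words model such as the TF-IDF features used in this notebook, swapping words leaves the feature vector unchanged, so the swapped rows act as exact duplicates from the classifier's point of view. Deleting words does change the vector.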


In [47]:
# Split the augmented data into train and test sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(final_df['Reviews'],final_df['Sentiment'],test_size=0.3)

In [48]:
Train_Y.value_counts()

0    1129
1    1102
Name: Sentiment, dtype: int64

In [49]:
# Vectorize the words with TF-IDF, which scores how important a word is in a document relative to the corpus
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(df['Reviews']) # fitted on the original reviews only, so tokens seen only in augmented text are ignored

TfidfVectorizer(max_features=5000)

In [50]:
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

### Build SVM Classifier

In [51]:
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

SVC(gamma='auto', kernel='linear')

In [52]:
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)

In [53]:
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  98.6415882967607


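As you can see, after enlarging the dataset using NLP albumentation techniques, the accuracy of the SVM classifier rises from about 91% to about 99%. Note that the split was made after augmentation, so augmented variants of the same review can appear in both the training and test sets, which likely inflates this score.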

### Naive Bayes Classifier

In [54]:
# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

MultinomialNB()

In [55]:
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)

In [56]:
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  98.32810867293625


Similarly, the accuracy of the Naive Bayes classifier rises from about 87% to about 98% after augmentation.

Overall, I took a small dataset, applied a few NLP albumentation techniques, and built basic classifiers.
This shows the impact that augmentation can have on text data.

Further experimentation can be done in this area to get better results with the text data.
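One experiment that may be worth trying is to split first and augment only the training part. In this notebook the train/test split happens after augmentation, so noisy copies of a review can end up on both sides of the split. The sketch below shows the alternative order on toy data. The names `split_then_augment` and `drop_last_word` are hypothetical, and the stand-in augmenter could be replaced by any function that maps text to text.

```python
import random

def split_then_augment(rows, augment, test_fraction=0.3, seed=0):
    """rows: list of (text, label) pairs. Only the training part is augmented."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Augmented copies inherit the label of the review they were derived from.
    augmented = [(augment(text), label) for text, label in train]
    return train + augmented, test

def drop_last_word(text):
    """Stand-in augmenter for the demo: remove the final word."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text

toy_rows = [(f"review number {i} was fine", i % 2) for i in range(10)]
train_rows, test_rows = split_then_augment(toy_rows, drop_last_word)
print(len(train_rows), len(test_rows))
```

With this order the test set contains only untouched original reviews, so the measured accuracy reflects how the classifier handles text it has not seen in any form.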