# Text Preprocessing and Augmentation

In this notebook, we work with text data. First, we learn how to preprocess text. Then, we apply data augmentation to text. The Enron email dataset is used throughout to demonstrate text mining/NLP techniques.

# 1. Data Background

In 2000, [Enron](https://en.wikipedia.org/wiki/Enron) was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.


The Enron fraud is one of the largest cases of corporate fraud in American history. Founded in 1985, Enron Corporation filed for bankruptcy at the end of 2001. Before its fall, Fortune magazine had named Enron "America's most innovative company" for six consecutive years. So what happened? Who were the culprits?

In this notebook, we work with an email corpus from Enron employees and learn how to analyze text data for fraud analysis.

In [1]:
### if we are using google colab, we need to run this cell to specify the path for data loading
import sys, os
if 'google.colab' in sys.modules:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    # specify the path of the folder containing "file_name" by changing the lecture index:
    lecture_index = '02'
    path_to_file = '/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture{}/'.format(lecture_index)
    print(path_to_file)
    # change current path to the folder containing "file_name"
    os.chdir(path_to_file)
    !pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture02/
/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture02


In [2]:
basefn = "..//data//"
import pandas as pd
df_corpus = pd.read_csv(basefn + "enron_emails_clean.csv")

In [3]:
df_corpus.head()

Unnamed: 0,Message-ID,Date,content
0,<8345058.1075840404046.JavaMail.evans@thyme>,2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...
1,<1512159.1075863666797.JavaMail.evans@thyme>,2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...
3,<10369289.1075860831062.JavaMail.evans@thyme>,2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...
4,<26728895.1075860815046.JavaMail.evans@thyme>,2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501..."


#### Exact Word Match
A simple way to analyze text data is a keyword-based query: for example, look for any email mentioning 'money'. The query word can be any informative term.

In [4]:
# Select emails whose content contains 'money' (case-sensitive match)
df_corpus.loc[df_corpus['content'].str.contains('money', na=False)].head(3)


Unnamed: 0,Message-ID,Date,content
0,<8345058.1075840404046.JavaMail.evans@thyme>,2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...
3,<10369289.1075860831062.JavaMail.evans@thyme>,2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...


Usually you want to search for more than one term. For example, in fraud analysis, you may prepare a full **fraud word list** of terms that could flag fraudulent clients and/or transactions.

Here, we create a list containing the following words/terms:

* 'enron stock'
* 'sell stock'
* 'stock bonus'
* 'sell enron stock'

In [5]:
# Create a list of terms to search for
searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# Join the terms with '|' so that an email matches if it contains any one of them
filtered_emails = df_corpus.loc[df_corpus['content'].str.contains('|'.join(searchfor), na=False)]
filtered_emails.head(2)

Unnamed: 0,Message-ID,Date,content
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...
8,<19319259.1075862176360.JavaMail.evans@thyme>,2001-10-30 16:05:30,"2-5, so much for the nice machine\n\n\n -----O..."


In [6]:
print("Number of returned fraud emails is {}".format(filtered_emails.shape[0]))

Number of returned fraud emails is 13


Recall is quite low because a search term must match the email text exactly, including case. For example, an email containing "SELL stock" is not counted. In the following, we use text preprocessing techniques to improve recall.
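
To see how much case alone affects the match, here is a minimal sketch that uses only the standard library's `re` module. The `sample_emails` list is made up for illustration and is not taken from the Enron data. With pandas, a comparable effect should be available by passing `case=False` to `str.contains`.

```python
import re

# Same search terms as in the notebook
searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# Hypothetical emails, for illustration only
sample_emails = [
    "Please SELL stock before Friday.",
    "I saw enron stock reached $26.",
    "Lunch at noon?",
]

# Escape each term so it is matched literally, then join into one pattern
pattern = "|".join(re.escape(term) for term in searchfor)

# Case-sensitive match versus case-insensitive match
exact = [e for e in sample_emails if re.search(pattern, e)]
ignore_case = [e for e in sample_emails if re.search(pattern, e, flags=re.IGNORECASE)]

print(len(exact), len(ignore_case))
```

The case-insensitive search also picks up the email with "SELL stock", which the exact match misses.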

# 2. Data Preprocessing

In any machine learning task, data preprocessing matters. Below, we go through some common text preprocessing steps:

1. Lower casing
2. Removal of Punctuations
3. Removal of Stopwords
4. Chat word conversion
5. Spelling correction

These are common text preprocessing steps, but we do not need to apply all of them every time. The right steps depend on the use case, so we need to choose them carefully.

For example, in sentiment analysis we should not remove emojis or emoticons, since they convey important information about sentiment. Other use cases call for similar decisions.
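
One way to keep the choice of steps explicit is to treat each step as a function and apply a per-task list of them in order. The sketch below is only an illustration: the helper names (`to_lower`, `strip_punctuation`, `apply_steps`) and the two configurations are assumptions, not part of any library.

```python
import string

def to_lower(text):
    """Convert the text to lower case."""
    return text.lower()

def strip_punctuation(text):
    """Remove every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def apply_steps(text, steps):
    """Apply each preprocessing function in order."""
    for step in steps:
        text = step(text)
    return text

# Hypothetical per-task configurations
keyword_search_steps = [to_lower, strip_punctuation]
casing_sensitive_steps = [strip_punctuation]  # a task where casing carries information

sample = "SELL stock, now!"
searchable = apply_steps(sample, keyword_search_steps)
cased = apply_steps(sample, casing_sensitive_steps)
print(searchable)
print(cased)
```

Switching a step on or off then only means editing the list for that task.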

In [18]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string

### Lower Casing

Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

This is most helpful for frequency-based text featurization techniques such as TF-IDF, because it merges variants of the same word, which reduces duplication and gives correct counts / TF-IDF values.

This may not be helpful for tasks like part-of-speech tagging (where proper casing gives some information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).

By default, lower casing is done by most modern vectorizers and tokenizers, such as sklearn's TfidfVectorizer (`lowercase=True`) and the Keras Tokenizer (`lower=True`). We need to set these options to false when our use case requires the original casing.
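
To see why lower casing matters for count-based features, here is a small sketch using only the standard library. The sample sentence is made up, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

text = "Text mining turns raw TEXT into features. text is everywhere."

# Counts on the raw tokens: the three casings are separate keys
raw_counts = Counter(text.split())

# Counts after lower casing: the three casings collapse into one key
lower_counts = Counter(text.lower().split())

print(raw_counts["Text"], raw_counts["TEXT"], raw_counts["text"])
print(lower_counts["text"])
```

Without lower casing, each variant is counted once; after lower casing, the single key 'text' has a count of three.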

In [19]:
df_corpus["content_lower"] = df_corpus["content"].str.lower()
df_corpus.head()

Unnamed: 0,Message-ID,Date,content,Tag,content_lower
0,<8345058.1075840404046.JavaMail.evans@thyme>,2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,0,investools advisory\na free digest of trusted ...
1,<1512159.1075863666797.JavaMail.evans@thyme>,2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,0,----- forwarded by richard b sanders/hou/ect o...
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...,1,hey you are not wearing your target purple shi...
3,<10369289.1075860831062.JavaMail.evans@thyme>,2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,0,leslie milosevich\n1042 santa clara avenue\nal...
4,<26728895.1075860815046.JavaMail.evans@thyme>,2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",0,"rini twait\n1010 e 5th ave\nlongmont, co 80501..."


### Removal of Punctuations

Another common text preprocessing technique is to remove punctuation from the text. This is again a text standardization step that treats 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation symbols to remove carefully, depending on the use case. For example, `string.punctuation` in Python contains the following symbols:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

We can add or remove symbols as needed.
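
As a sketch of such a customization, the snippet below keeps `$` and `-` so that amounts and phone numbers stay readable. The choice of symbols to keep and the names `KEEP`, `PUNCT_SUBSET` and `remove_selected_punctuation` are assumptions made for illustration.

```python
import string

# Hypothetical choice: keep "$" and "-" so amounts and phone numbers stay readable
KEEP = "$-"
PUNCT_SUBSET = "".join(ch for ch in string.punctuation if ch not in KEEP)

def remove_selected_punctuation(text):
    """Remove every character in PUNCT_SUBSET from text."""
    return text.translate(str.maketrans("", "", PUNCT_SUBSET))

sample = "Stock reached $26, call 713-345-3612!"
cleaned = remove_selected_punctuation(sample)
print(cleaned)
```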

In [21]:
# drop the new column created in last cell
df_corpus.drop(["content_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df_corpus["content_wo_punct"] = df_corpus["content"].apply(remove_punctuation)
df_corpus.head()

Unnamed: 0,Message-ID,Date,content,Tag,content_wo_punct
0,<8345058.1075840404046.JavaMail.evans@thyme>,2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,0,INVESTools Advisory\nA Free Digest of Trusted ...
1,<1512159.1075863666797.JavaMail.evans@thyme>,2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,0,Forwarded by Richard B SandersHOUECT on 09202...
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...,1,hey you are not wearing your target purple shi...
3,<10369289.1075860831062.JavaMail.evans@thyme>,2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,0,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...
4,<26728895.1075860815046.JavaMail.evans@thyme>,2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",0,Rini Twait\n1010 E 5th Ave\nLongmont CO 80501\...


### Removal of Stopwords

Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. In cases like part-of-speech tagging, we should not remove them, as they provide very valuable information about the POS.

Stopword lists are already compiled for different languages and we can use them directly. For example, the English stopword list from the nltk package is loaded below.

In [25]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
# The full list can be inspected with: ", ".join(stopwords.words('english'))
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    # NLTK's list is lower case and this comparison is case-sensitive
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df_corpus["content_wo_stop"] = df_corpus["content_wo_punct"].apply(remove_stopwords)
df_corpus.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Message-ID,Date,content,Tag,content_wo_punct,content_wo_stop
0,<8345058.1075840404046.JavaMail.evans@thyme>,2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,0,INVESTools Advisory\nA Free Digest of Trusted ...,INVESTools Advisory A Free Digest Trusted Inve...
1,<1512159.1075863666797.JavaMail.evans@thyme>,2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,0,Forwarded by Richard B SandersHOUECT on 09202...,Forwarded Richard B SandersHOUECT 09202000 070...
2,<26118676.1075862176383.JavaMail.evans@thyme>,2001-10-30 16:15:17,hey you are not wearing your target purple shi...,1,hey you are not wearing your target purple shi...,hey wearing target purple shirt today I mine I...
3,<10369289.1075860831062.JavaMail.evans@thyme>,2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,0,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,Leslie Milosevich 1042 Santa Clara Avenue Alam...
4,<26728895.1075860815046.JavaMail.evans@thyme>,2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",0,Rini Twait\n1010 E 5th Ave\nLongmont CO 80501\...,Rini Twait 1010 E 5th Ave Longmont CO 80501 rt...
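
In the output above, capitalized tokens such as "A" and "I" survive. The stopword list is lower case and the comparison is case-sensitive, while the input was not lower cased. A minimal sketch of a case-insensitive variant follows; it uses a tiny hypothetical stopword set (`STOPWORDS_DEMO`) in place of NLTK's list so that it runs without any download.

```python
# A tiny hypothetical stopword set, standing in for NLTK's English list
STOPWORDS_DEMO = {"a", "i", "of", "the", "is", "am", "not"}

def remove_stopwords_ci(text, stopwords=STOPWORDS_DEMO):
    """Drop tokens whose lower-cased form is a stopword; keep the casing of the rest."""
    return " ".join(w for w in str(text).split() if w.lower() not in stopwords)

sample = "A Free Digest of Trusted Advice I am not reading"
filtered = remove_stopwords_ci(sample)
print(filtered)
```

Alternatively, lower casing the text before stopword removal gives the same filtering, at the cost of losing the original casing.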


### Chat Words Conversion

This is an important text preprocessing step when dealing with chat data. People use many abbreviations in chat, so it can help to expand them before analysis.

The list of chat slang words below comes from this [repo](https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt). We can use it for our conversion and add more words as needed.

In [26]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [27]:
chat_words_map_dict = {}
for line in chat_words_str.strip().split("\n"):
    # split on the first "=" only: left side is the abbreviation, right side its expansion
    cw, cw_expanded = line.split("=", 1)
    chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_map_dict)

def chat_words_conversion(text):
    """Replace each whitespace-separated token that is a known chat word with its expansion."""
    new_text = []
    for w in text.split():
        # the lookup ignores case but needs the whole token to match,
        # so "BRB," with attached punctuation is left unchanged
        new_text.append(chat_words_map_dict.get(w.upper(), w))
    return " ".join(new_text)

chat_words_conversion("one minute BRB")

'one minute Be Right Back'
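
The conversion above splits on whitespace, so an abbreviation with punctuation attached (for example "BRB!") is not expanded. One possible workaround, sketched below, is a regular expression with word boundaries. The dictionary `chat_map_demo` is a small hypothetical subset of the full map, and `expand_chat_words` is an illustrative name.

```python
import re

# A small hypothetical subset of the chat-word map above
chat_map_demo = {
    "BRB": "Be Right Back",
    "ASAP": "As Soon As Possible",
    "FYI": "For Your Information",
}

# Longest keys first so a longer abbreviation wins over a shorter one
keys = sorted(chat_map_demo, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand_chat_words(text):
    """Expand known abbreviations even when punctuation is attached."""
    return pattern.sub(lambda m: chat_map_demo[m.group(1).upper()], text)

expanded = expand_chat_words("one minute, brb! Reply ASAP.")
print(expanded)
```

Either variant can expand ordinary words by mistake: short entries such as `U`, `IC` or `ATM` may mean something else in business email, so the map is worth pruning for the corpus at hand.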

### Spelling Correction

Another important text preprocessing step is spelling correction. Typos are common in text data, and we may want to correct them before analysis.

In this notebook, we use the Python package pyspellchecker for spelling correction.

In [29]:
!pip install pyspellchecker -q


In [30]:
from spellchecker import SpellChecker

spell = SpellChecker()
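def correct_spellings(text):
    """Replace each word the checker does not know with its most likely correction."""
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # correction() may return None in some pyspellchecker versions
            # when no candidate is found; fall back to the original word
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)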

text = "speling correctin"
correct_spellings(text)

'spelling correcting'

# 3. NLPAUG

![nlpaug logo](https://github.com/makcedward/nlpaug/blob/master/res/logo_small.png?raw=true)

In general, the more data we have, the better performance we can achieve. Moreover, generating more samples for the minority class is one way to address class imbalance. However, annotating a large amount of training data is very costly, and in some applications, including fraud detection, it is impossible to obtain many examples labeled as fraud. Proper data augmentation can therefore help boost model performance.

Because language is highly complex, text is harder to augment than images, where we can simply crop out a portion of the image. Here, we explore the nlpaug library, a Python library for augmenting text in machine learning projects.

Its features include:

1. Generate synthetic data for improving model performance without manual effort
2. Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
3. Plug and play to any neural network frameworks (e.g. PyTorch, TensorFlow)
4. Support textual and audio input


Install Package

In [10]:
!pip install nlpaug==1.1.10 -q


In [13]:
df_corpus['Tag'] = 0
df_corpus.loc[df_corpus['content'].str.contains('|'.join(searchfor), na=False), 'Tag'] = 1
df_corpus['Tag'].value_counts()

0    2077
1      13
Name: Tag, dtype: int64

The nlpaug library provides various textual augmenters, including character, word and sentence augmenters.

In this section, we only explore word-level augmentation based on [WordNet](https://wordnet.princeton.edu/): substituting words with their WordNet synonyms.

You can find other augmenters [here](https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb).
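
To make the idea of synonym substitution concrete before using the library, here is a toy sketch. It is not how nlpaug is implemented. The synonym table `SYNONYMS` is hand-written and hypothetical, whereas nlpaug looks synonyms up in WordNet, and the function `synonym_substitute` with its parameters `p` and `seed` is an assumption made for illustration.

```python
import random

# Hypothetical hand-written synonym table; nlpaug uses WordNet instead
SYNONYMS = {
    "boost": ["raise", "lift"],
    "difficult": ["hard", "tough"],
    "money": ["cash", "funds"],
}

def synonym_substitute(text, synonyms, p=0.5, seed=None):
    """Replace each word that has synonyms with a random one, with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

sentence = "it is difficult to make more money"
# Different seeds give different augmented variants of the same sentence
for i in range(3):
    print(synonym_substitute(sentence, SYNONYMS, seed=i))
```

Each call returns a variant of the sentence with the same length in words. A dictionary lookup like this ignores context, so a substituted word may not fit the sentence it lands in.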

In [14]:
texts = df_corpus.loc[df_corpus.Tag==1, 'content'].tolist()
# for better visualization, find the shortest email
short_email = min(texts, key=len)
short_email

'Ken,\n\nI am not a smart man. However I saw enron stock reached $26, I worry one thing.\n\nUnder current situation it is difficult to make more money. How could we boost stock price?\n\nEnron hired enough people already. It is the time to cut cost, freeze hiring new people now.\n\nFreezing hire is much more better than lay off in a near future.\n\nThanks.\n\nRobert (Xiaowu) Huang\nDatabase Technologies\n713-345-3612'

In [15]:
import nlpaug.augmenter.word as naw

#### Install WordNet

In [16]:
# download nltk resources
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [17]:
aug = naw.SynonymAug(aug_src='wordnet')
augmented_texts = aug.augment(short_email, n=5)  # n is the number of augmented texts to generate
print("Original:")
print(short_email)
print("Augmented Texts:")
for augmented_text in augmented_texts:
    print(augmented_text)

Original:
Ken,

I am not a smart man. However I saw enron stock reached $26, I worry one thing.

Under current situation it is difficult to make more money. How could we boost stock price?

Enron hired enough people already. It is the time to cut cost, freeze hiring new people now.

Freezing hire is much more better than lay off in a near future.

Thanks.

Robert (Xiaowu) Huang
Database Technologies
713-345-3612
Augmented Texts:
Ken, Atomic number 53 am not a smart man. However I saw enron line reached $ 26, I worry one thing. Under current situation it is difficult to make more than money. How could we boost stock certificate price? Enron hired enough hoi polloi already. It be the time to turn off cost, freeze hiring new people now. Freezing hire is much more better than lay off in a near future. Thanks. Robert (Xiaowu) Huang Database Technologies 713 - 345 - 3612
Ken, I am not a saucy man. However One saw enron stock certificate reached $ 26, I worry one thing. Under current situatio