<a href="https://colab.research.google.com/github/rahiakela/transfer-learning-for-natural-language-processing/blob/main/2-getting-started-with-baselines/1_linear_and_tree_based_models_for_email_sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear & Tree-based models for Email Sentiment Classification

Our goal is to establish a set of baselines for a pair of concrete NLP problems, which we will later be able to use to measure progressive improvements gained from leveraging increasingly sophisticated transfer learning
approaches. In the process of doing this, we aim to advance your general NLP instincts and refresh your understanding of typical procedures involved in setting up problem-solving pipelines for such problems. You will review techniques ranging from tokenization to data structure and model selection. We first train some traditional machine learning models from scratch to establish some preliminary baselines for these problems.

We will focus on a pair of important representative example NLP problems – spam
classification of email, and sentiment classification of movie reviews. This exercise will arm you with a number of important skills, including some tips for obtaining, visualizing and preprocessing data. 

Three major model classes will be covered, namely linear models such as logistic regression, decision-tree-based models such as random forests, and neural-network-based models such as ELMo. These classes are additionally represented by support vector machines (SVMs) with linear kernels, gradient-boosting machines (GBMs) and BERT respectively. 

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/content-classification-supervised-models.png?raw=1' width='800'/>



## Setup

In [1]:
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import email        # email package for processing email messages
import random
import re
import time


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC                              # Support Vector Classification model
from sklearn.ensemble import RandomForestClassifier      # random forest classifier library
from sklearn.model_selection import GridSearchCV         # for tuning parameters systematically
from sklearn.ensemble import GradientBoostingClassifier  # GBM algorithm
from sklearn import metrics                              # additional scikit-learn functions
from sklearn.model_selection import cross_val_score

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"rahiakela","key":"<redacted>"}'}

In [3]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# download dataset from kaggle
kaggle datasets download -d wcukierski/enron-email-dataset
unzip -qq enron-email-dataset.zip

kaggle datasets download -d rtatman/fraudulent-email-corpus
unzip -qq fraudulent-email-corpus.zip

rm -rf enron-email-dataset.zip fraudulent-email-corpus.zip

kaggle.json
Downloading enron-email-dataset.zip to /content
 96% 345M/358M [00:06<00:00, 36.2MB/s]
100% 358M/358M [00:06<00:00, 54.1MB/s]
Downloading fraudulent-email-corpus.zip to /content
 91% 5.00M/5.52M [00:00<00:00, 14.9MB/s]
100% 5.52M/5.52M [00:00<00:00, 15.9MB/s]




## Preprocessing Email Spam Classification Example Data

Here, we are interested in developing an algorithm that can detect whether any given email is spam or not, at scale. To do this, we will build a dataset from two separate sources – the popular Enron email corpus as a proxy for email that is not spam, and a collection of “419” fraudulent emails as a proxy for email that is spam.

We will view this as a supervised classification task, where we will first train a classifier on a collection of emails labeled as either spam or not spam. 

In particular, we will sample the Enron corpus (the largest public email collection, related to the notorious Enron financial scandal) as a proxy for emails that are not spam, and sample “419” fraudulent emails, representing the best-known type of spam, as a proxy for emails that are spam. Both datasets are openly available on [Kaggle](https://www.kaggle.com/wcukierski/enron-email-dataset).

The Enron corpus contains about half a million emails written by employees of the Enron Corporation, as collected by the Federal Energy Regulatory Commission for the purposes of investigating the collapse of the company. It has been used extensively in the literature to study machine learning methods for email applications and is often the first data source researchers working with emails look to for initial experimentation with algorithm prototypes. On Kaggle, it is
available as a single-column .csv file with one email per row. Note that this data is still cleaner than one can expect to typically find in many practical applications in the wild.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/spam-email-preprocessing.png?raw=1' width='800'/>

The body of the email will first be separated from the headers of the email, some statistics about the dataset will be teased out to get a sense for the data, stopwords will be removed from the email, and it will then be classified as either spam or not spam.

### Loading and Visualizing the Fraudulent Email Corpus

Let’s load the “419” fraudulent email corpus, so that we have some example data in our training set representing the “spam” class. The Enron emails are loaded in the next section.

> Since this dataset comes as a .txt file, versus a .csv, the preprocessing steps are slightly different. First
of all, we have to specify the encoding when reading the file as latin1, otherwise the default encoding option of
utf-8 will fail. It is often the case in practice that one needs to experiment with a number of different encodings,
with the aforementioned two being the most popular ones, to get some datasets to read correctly. Additionally,
note that because this .txt file is one big column of emails (with headers) separated by line breaks and white
space, and is not separated nicely into rows with one email per row – as was the case for the Enron corpus – we
can’t use Pandas to neatly load it as was done before. We will read all the emails into a single string, and split
the string on a code word that appears close to the beginning of each email’s header, i.e., “From r”.

In [None]:
filepath = "./fradulent_emails.txt"
with open(filepath, "r", encoding="latin1") as file:
  data = file.read()
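
When the right encoding is not known in advance, the trial-and-error described above can be automated. The following is only a minimal sketch: `read_text_with_fallback` is a hypothetical helper, not part of the notebook, and the candidate list is an assumption. Note that `latin1` maps every possible byte to a character and therefore never fails, so it has to come last.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin1")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Hypothetical helper for illustration; the candidate encodings are an
    assumption and should be adapted to the data at hand.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")
```

With the file used here, a call such as `data, enc = read_text_with_fallback("./fradulent_emails.txt")` would report which encoding was actually used.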

Print the first 2000 characters of the file string (this shows only the very beginning of the corpus), and notice the keyword `From r` close to the beginning of each email header.

In [None]:
print(data[:2000])

Split on the code word `From r` appearing close to the beginning of each email.

In [None]:
fraud_emails = data.split("From r")
print("Successfully loaded {} spam emails!".format(len(fraud_emails)))
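
One caveat with a plain `str.split` is that it also splits wherever the code word happens to occur inside a message body, and it yields an empty first chunk when the text starts with the code word (presumably why the first element is dropped further down). The toy string below is made up for illustration; anchoring the split at the start of a line with a regular expression is one possible refinement, not what the notebook does.

```python
import re

# Two made-up emails; the first body happens to contain "From r" mid-line.
toy = (
    "From r Mon Jan 1 2001\nSubject: hello\n\nGreetings From rich uncle\n"
    "From r Tue Jan 2 2001\nSubject: again\n\nSecond body\n"
)

naive = toy.split("From r")                  # splits on every occurrence
anchored = re.split(r"(?m)^From r", toy)     # splits only at the start of a line

# Drop empty chunks, such as the one before the very first delimiter.
naive_emails = [chunk for chunk in naive if chunk.strip()]
anchored_emails = [chunk for chunk in anchored if chunk.strip()]

print(len(naive_emails), len(anchored_emails))
```

On this toy input the naive split reports three "emails" while the anchored split recovers the intended two.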

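Now that the fraudulent data is loaded as a list, we can convert it into a Pandas DataFrame and extract the message bodies. This uses the extract_messages function defined in the Enron section below, so run that cell first.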

In [None]:
fraud_bodies = extract_messages(pd.DataFrame(fraud_emails, columns=["message"], dtype=str))
fraud_bodies_df = pd.DataFrame(fraud_bodies[1:])

fraud_bodies_df.head()

### Loading and Visualizing the Enron Corpus

The first thing we need to do is load the data with the popular Pandas library, and to take a peek at a slice of the data to make sure we have a good sense of what it looks like.

In [11]:
filepath = "./emails.csv"

# Read the enron data into a pandas.DataFrame called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


In [None]:
# take a closer look at the first email
print(emails.loc[0]["message"])

We see that the messages are contained within the message column of the resulting DataFrame, with the extra fields at the beginning of each message (including Message-ID, To, From, etc.) being referred to as the message’s header information or simply header.

Traditional spam classification methods derive features from the header information for classifying the message as spam or not. Here, we would like to perform the same task based on the content of the message only. One possible motivation for this approach is the fact that email training data may often be de-identified in practice due to privacy concerns and regulations, thereby making header info unavailable. Thus, we need to separate the headers from the messages in our dataset.

In [2]:
def extract_messages(df):
  messages = []
  for item in df["message"]:
    # Return a message object structure from a string
    e = email.message_from_string(item)
    # get message body
    message_body = e.get_payload()
    messages.append(message_body)
  print("Successfully retrieved message body from e-mails!")
  return messages
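
Note that `get_payload()` returns a plain string only for non-multipart messages; for a multipart message it returns a list of sub-messages. The sketch below shows one way to always end up with text. `extract_text_body` is a hypothetical helper, and keeping only `text/plain` parts is an assumption about what counts as the body.

```python
import email


def extract_text_body(raw_message):
    """Return the body as a string, joining text/plain parts if multipart.

    Hypothetical helper for illustration; not part of the notebook.
    """
    msg = email.message_from_string(raw_message)
    if not msg.is_multipart():
        return msg.get_payload()
    # walk() visits the container and every sub-part; keep only plain text
    parts = [
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain"
    ]
    return "\n".join(parts)
```

Whether this matters for a given corpus depends on how many of its messages are multipart; for simple single-part messages the result is the same as calling `get_payload()` directly.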

In [12]:
bodies = extract_messages(emails)

Successfully retrieved message body from e-mails!


In [13]:
# We then can display some processed emails
bodies_df = pd.DataFrame(bodies)
print(bodies_df.head())

                                                   0
0                          Here is our forecast\n\n 
1  Traveling to have a business meeting takes the...
2                     test successful.  way to go!!!
3  Randy,\n\n Can you send me a schedule of the s...
4                Let's shoot for Tuesday at 11:45.  


In [14]:
# extract random 10000 enron email bodies for building dataset
bodies_df = pd.DataFrame(random.sample(bodies, 10000))

# expand default pandas display options to make emails more clearly visible when printed
pd.set_option("display.max_colwidth", 300)
# you could do print(bodies_df.head()), but Jupyter displays this nicer for pandas DataFrames
bodies_df.head()

Unnamed: 0,0
0,"Thanks for your suggestions. I forgot about you leaving at 4:30. By girls\nhave both those movies. At age 2, my girls could not sit down long enough\nto watch a movie. They would watch a few minutes at a time of Barney.\n\n-----Original Message-----\nFrom: Michelle.Lokay@enron.com [mailto:Mi..."
1,"PLEASE NOTE THAT THE DATES WE WERE CONSIDERING, JANUARY 10 & JANUARY 17, ARE NOT GOING TO WORK.\n\nPlease let me know your availability for the following date:\n\nJanuary 18 at 8:00 a.m. to approximately 1:00 p.m.\n\nThanks,\n\nRosie\n\n"
2,some broker dinner. it is just a freebie.
3,"A copy of Dr. Oren's paper, Efficient Intrazonal Transmission Pricing, is\nattached. In addition, Judge Cooper issued Order No. 7 today. A copy is\nattached for your information.\n\nSee you in the Commission Workshop on February 15.\n\nParviz Adib, Ph. D.\nDirector of Market Oversight Division..."
4,"Hi Harry,\n?\nJust thought of saying hi since it has been an long time.? How's everything \nin Houston?? Things in banking are slowing down.? I'm just hoping that this \nproject I'm staffed on will go live.? There is a 90% chance.? Most activity \nnow in Asia is M&A related.? You see cross-b..."


The following (commented out) code is arguably the more "pythonic" way of achieving the extraction of bodies from messages. It is only 2 lines long and achieves the same result.

In [None]:
#messages = emails["message"].apply(email.message_from_string)
#bodies_df = messages.apply(lambda x: x.get_payload()).sample(10000)

### Email text preprocessing

Having loaded both datasets, we are now ready to sample emails from each one into a single DataFrame that will represent the overall dataset covering both classes of emails. Before doing this, we must decide how many samples to draw from each class. Ideally, the number of samples in each class will represent the natural distribution of emails in the wild, i.e., if we expect our classifier to encounter 60% spam emails and 40% nonspam emails when deployed, then a ratio such as 600 to 400 respectively might make sense.

**Note that with a severe imbalance in the data, such as 99% nonspam and 1% spam, a classifier may simply learn to predict nonspam most of the time, an issue that needs to be considered when building datasets.** Since this is an idealized experiment, and we do not have any information on natural distributions of classes, we will
assume a 50/50 split. 
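
To make the imbalance issue concrete, here is a small sketch with made-up labels (the 0/1 coding is only for this toy example): a "classifier" that always predicts nonspam looks excellent by accuracy while catching no spam at all.

```python
import numpy as np

# Toy labels: 990 nonspam (0) and 10 spam (1), i.e. a 99% / 1% split.
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate model that predicts "nonspam" for every email.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
spam_recall = (y_pred[y_true == 1] == 1).mean()  # fraction of spam caught

print(f"accuracy={accuracy:.2f}, spam recall={spam_recall:.2f}")
```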

We also need to give some thought to how we are going to tokenize the emails, i.e., split emails into subunits of text - words, sentences, etc. To start off, we will tokenize into words, as this is the most common approach. 

We must also decide the maximum number of tokens per email, and the maximum length of each token, to ensure that the occasional “extremely long” email does not bog down the performance of our classifier. 

We do all these by specifying the following general hyperparameters, which will later be tuned experimentally to enhance performance as needed:

In [7]:
n_sample = 1000   # number of samples to generate in each class - 'spam', 'not spam'
maxtokens = 50    # the maximum number of tokens per document
maxtokenlen = 20  # the maximum length of each token

With these hyperparameters specified, we can now create a single DataFrame for the overarching training dataset. Let’s take the opportunity to also perform the remaining preprocessing tasks, namely tokenizing and removing stop words and punctuation.

#### Tokenization

Let’s proceed by defining a function to tokenize emails by splitting them into words.

In [8]:
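def tokenize(row):
  if row is None or row == "":   # compare by value; `is ""` tests object identity
    tokens = []                  # empty input yields an empty token list
  else:
    tokens = str(row).split(" ")[:maxtokens]  # split on single spaces, keep at most maxtokens tokens
  return tokens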
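
Splitting on a single space does not split on newlines or tabs, and consecutive spaces produce empty tokens. The sketch below contrasts this with `str.split()` called without arguments, which splits on any run of whitespace. Both function names are hypothetical, and the sample text is made up; fused tokens such as 'gameand' in the output further down are consistent with this behaviour.

```python
maxtokens = 50  # same hyperparameter value as above


def tokenize_on_space(text):
    # behaviour of the tokenizer above: split on single space characters
    return str(text).split(" ")[:maxtokens]


def tokenize_on_whitespace(text):
    # alternative: split on any run of spaces, tabs or newlines
    return str(text).split()[:maxtokens]


sample = "Randy,\n\n Can you send  me a schedule"
print(tokenize_on_space(sample))
print(tokenize_on_whitespace(sample))
```

The first call keeps the newlines attached to `Randy,` and emits an empty token for the double space; the second yields seven clean tokens.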

#### Remove punctuation and unnecessary characters

Taking another look at the emails displayed earlier, we see that they contain a lot of punctuation characters, and the spam emails tend to be capitalized. 

**In order to ensure that classification is done based on language content only, we have to remove punctuation marks and other non-word characters from the emails.** We do this by employing regular expressions with Python’s built-in `re` module. We also normalize words by turning them into lower case.

In [9]:
def reg_expressions(row):
  tokens = []
  try:
    for token in row:
      token = token.lower()                 # make all characters lower case
      token = re.sub(r"[\W\d]", "", token)  # strip non-word characters and digits
      token = token[:maxtokenlen]           # truncate all tokens to hyperparameter maxtokenlen
      tokens.append(token)
  except (TypeError, AttributeError):
    # row was not an iterable of strings (e.g. a missing value): emit one empty token
    tokens.append("")
  return tokens
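
To see what the pattern `[\W\d]` does in practice, here is a small self-contained sketch. `clean_token` is a hypothetical helper that mirrors the three steps above, and the example tokens are made up. Note that the underscore counts as a word character and survives, while tokens made only of digits and punctuation become empty strings.

```python
import re


def clean_token(token, maxtokenlen=20):
    token = token.lower()                 # normalize case
    token = re.sub(r"[\W\d]", "", token)  # remove non-word characters and digits
    return token[:maxtokenlen]            # truncate long tokens


examples = ["Hello,", "11:45.", "snake_case", "Michelle.Lokay@enron.com"]
cleaned = [clean_token(t) for t in examples]
print(cleaned)
# ['hello', '', 'snake_case', 'michellelokayenronco']
```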

#### Stop-word removal

Finally, let’s define a function to remove stopwords - words that occur so frequently in language that they offer no useful information for classification. This includes words such as “the” and “are”, and the popular library NLTK provides a heavily used list that we will employ.

In [10]:
stop_words = stopwords.words("english")

def stop_word_removal(row):
  # drop stopwords and empty tokens; return a list rather than a one-shot filter iterator
  tokens = [token for token in row if token and token not in stop_words]
  return tokens
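
Stopword lists such as NLTK's are lower-case, so the outcome depends on whether tokens are lower-cased before or after filtering. The sketch below illustrates this with a tiny made-up stopword set (an assumption, used here only to avoid downloading the NLTK list) and a hypothetical helper name.

```python
# Hypothetical mini stopword set for illustration; the notebook uses NLTK's list.
STOP_WORDS = {"the", "are", "i", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t and t not in stop_words]


tokens = ["The", "reports", "are", "late", "I", "think"]

# Filtering first misses capitalized stopwords such as "The" and "I".
filter_then_lower = [t.lower() for t in remove_stop_words(tokens)]

# Lower-casing first lets the filter catch them.
lower_then_filter = remove_stop_words([t.lower() for t in tokens])

print(filter_then_lower)   # ['the', 'reports', 'late', 'i', 'think']
print(lower_then_filter)   # ['reports', 'late', 'think']
```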

### Assemble both Datasets

We are now going to put all these functions together to build the single dataset representing both classes. Most methods expect this dataset to be a NumPy array in order to process it, so we convert it to that form after combining the emails.

Now, putting all the preprocessing steps together we assemble our dataset...

In [15]:
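# Tokenize (truncating to maxtokens), remove stopwords, then lower-case, strip punctuation/digits and truncate each token to maxtokenlen
# Note: stopwords are removed before lower-casing here, so capitalized stopwords such as "The" are not removed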

# Apply predefined processing functions
enron_emails = bodies_df.iloc[:, 0].apply(tokenize)
enron_emails = enron_emails.apply(stop_word_removal)
enron_emails = enron_emails.apply(reg_expressions)
# sample the right number of emails from each class.
enron_emails = enron_emails.sample(n_sample)

# Apply predefined processing functions
spam_emails = fraud_bodies_df.iloc[:, 0].apply(tokenize)
spam_emails = spam_emails.apply(stop_word_removal)
spam_emails = spam_emails.apply(reg_expressions)
# sample the right number of emails from each class.
spam_emails = spam_emails.sample(n_sample)

# convert to Numpy array
raw_data = pd.concat([enron_emails, spam_emails], axis=0).values

Now, let’s take a peek at the result to make sure things are proceeding as expected:

In [16]:
print("Shape of combined data is:", raw_data.shape)
print("Data is:")
print(raw_data)

Shape of combined data is: (2000,)
Data is:
[list(['start', 'date', '', 'hourahead', 'hour', '', 'hourahead', 'schedule', 'download', 'failed', 'manual', 'intervention', 'required'])
 list(['might', 'get', 'together', 'little', 'earlier', 'depending', 'time', 'texaslsu', 'gameand', 'margaritas', 'babeshanna', 'husserenron', '', 'amto', 'eric', 'basshouectect', 'timothy', 'blanchardhoueesees', 'matthew', 'lenharthouectect', 'chad', 'landryhouectect', 'mmmarcantelequivacom', 'valgeneresaccomcc', 'subject', 'hey', 'guys', 'christen', 'i', 'going', 'cookoutpool', 'gathering', 'saturday', 'around', ''])
 list(['the', 'main', 'list', 'bemichael', 'farmer', '', 'ceo', 'merchanting', 'michaelfarmermgmccco', 'boettcher', 'thomasboettchermgmcc', 'hutchinson', '', 'chairman', 'mg', 'ltd', 'michaelhutchinsonmgl', 'jones', '', 'md', 'mg', 'ltd', '', 'head', 'trading', 'timjonesmgltdcoukrus', 'plackett', '', 'head', 'options', 'trading', 'russellplackettmgltd', 'schirmeister', '', 'director', 'marke

We see that the resulting array has divided the text into word units, as we intended to.

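Let’s create the headers (labels) corresponding to these emails. They must follow the order of concatenation in raw_data above, i.e., n_sample=1000 Enron (non-spam) emails followed by n_sample=1000 spam emails. As written, the first block is labeled 1 and the second block 0: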

In [17]:
categories = ["spam", "notspam"]
header = ([1] * n_sample)
header.extend(([0] * n_sample)) 

We are now ready to convert this NumPy array into numerical features that can actually be fed to the algorithms for classification.

### Converting the Email Text Into Numbers

We start by employing what is often considered the simplest method for vectorizing words, i.e., converting them into numerical vectors – the bag-of-words model. This model simply counts the frequency of word tokens contained in each email and thereby represents it as a vector of such frequency counts.

Please observe that in doing this, we only retain tokens that appear more than once, as captured by the variable “used_tokens”. This enables us to keep the vector dimensions significantly lower than they would be otherwise. Please also
note that one can achieve this using various built-in vectorizers in the popular library scikit-learn.

We also note that the scikit-learn vectorization methods include counting occurrences of sequences of any n words, or n-grams, as well as the tf-idf approach, both important fundamental concepts you should brush up on if rusty. For the problems looked at here, we did not notice an improvement when using these vectorization methods over the bag-of-words approach.

The assemble_bag() function assembles a new dataframe containing all the unique words found in the text documents. It counts the word frequency and then returns the new dataframe.

In [20]:
def assemble_bag(data):
  used_tokens = []
  all_tokens = []

  for item in data:
    for token in item:
      if token in all_tokens:
        # If token has been seen before, append it to output list used_tokens
        if token not in used_tokens:
          used_tokens.append(token)
      else:
        all_tokens.append(token)

  df = pd.DataFrame(0, index=np.arange(len(data)), columns=used_tokens)

  # Create a Pandas DataFrame counting frequencies of vocabulary words – corresponding to columns, in each email – corresponding to rows
  for i, item in enumerate(data):
    for token in item:
      if token in used_tokens:
        df.at[i, token] += 1  # label-based scalar update; chained df.iloc[i][token] may modify a copy

  return df
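
The function above relies on list membership tests and cell-by-cell DataFrame updates, which can be slow for a few thousand emails. The following is a sketch of an equivalent counting scheme using `collections.Counter` and a NumPy array. `assemble_bag_fast` is a hypothetical name, and its column order (order of first appearance) may differ from that of `assemble_bag`.

```python
from collections import Counter

import numpy as np
import pandas as pd


def assemble_bag_fast(data):
    """Bag-of-words counts for tokens occurring more than once overall.

    Hypothetical alternative for illustration; column order may differ
    from assemble_bag.
    """
    totals = Counter(token for item in data for token in item)
    vocab = [token for token, count in totals.items() if count > 1]
    column = {token: j for j, token in enumerate(vocab)}  # O(1) lookups

    counts = np.zeros((len(data), len(vocab)), dtype=int)
    for i, item in enumerate(data):
        for token in item:
            j = column.get(token)
            if j is not None:
                counts[i, j] += 1
    return pd.DataFrame(counts, columns=vocab)


toy = [["a", "b", "a"], ["b", "c"], ["d"]]
bag = assemble_bag_fast(toy)
print(bag)
```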

We are now ready to convert these into numerical vectors!!

Having defined the assemble_bag function, let’s use it to actually carry out the vectorization and visualize it as follows:

In [21]:
# create bag-of-words model
enron_spam_bag = assemble_bag(raw_data) 

# this is the list of words in our bag-of-words model
predictors = [column for column in enron_spam_bag.columns]
print(enron_spam_bag)

         hourahead  mg  ltd  ...  wire  europeit  commenced  inspired
0     2          2   0    0  ...     0         0          0         0
1     2          0   0    0  ...     0         0          0         0
2     8          0   2    2  ...     0         0          0         0
3     0          0   0    0  ...     0         0          0         0
4     2          0   0    0  ...     0         0          0         0
...  ..        ...  ..  ...  ...   ...       ...        ...       ...
1995  5          0   0    0  ...     0         0          0         0
1996  0          0   0    0  ...     0         0          0         0
1997  0          0   0    0  ...     1         1          1         0
1998  0          0   0    0  ...     0         0          0         1
1999  0          0   0    0  ...     0         0          0         0

[2000 rows x 4788 columns]


The column labels indicate words in the vocabulary of the bag-of-words model, and the numerical entries in each row correspond to the frequency counts of each such word for each of the 2000 emails in our dataset. Notice that it is an extremely sparse DataFrame, i.e., it consists mostly of values of 0.


Having fully vectorized the dataset, we must remember that it is not shuffled with respect to classes, i.e., it contains n_sample = 1000 Enron (non-spam) emails followed by an equal number of spam emails. Depending on how this dataset is split, in our case by picking the first 70% for training and the remainder for testing, this could lead to a training set dominated by one class and a test set containing only the other, which would obviously lead to failure. In order to create a randomized ordering of class samples in the dataset, we will need to shuffle the data in unison with the header/list of labels.


In [22]:
# shuffle raw data first
def unison_shuffle_data(data, header):
  p = np.random.permutation(len(header))
  data = data[p, :]
  header = np.asarray(header)[p]

  return data, list(header)
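
The key property of a unison shuffle is that the same permutation is applied to rows and labels, so each row keeps its own label. The sketch below checks this on made-up data in which the first column of each row stores that row's label. `unison_shuffle` is a hypothetical seeded variant (the `seed` parameter is an addition for reproducibility), not the function above.

```python
import numpy as np


def unison_shuffle(data, labels, seed=None):
    rng = np.random.default_rng(seed)   # seeded generator for reproducibility
    p = rng.permutation(len(labels))    # one permutation for both arrays
    return data[p, :], np.asarray(labels)[p]


toy_labels = [1, 1, 1, 0, 0, 0]
# Column 0 stores the label of each row, column 1 a unique row id.
toy_data = np.array([[label, 10 * i] for i, label in enumerate(toy_labels)])

shuffled_data, shuffled_labels = unison_shuffle(toy_data, toy_labels, seed=0)

# Every row still carries its own label after shuffling.
print((shuffled_data[:, 0] == shuffled_labels).all())
```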

As the very last step of preparing the email dataset for training by our baseline classifiers, we split it into independent training and testing or validation sets. This will allow us to evaluate the performance of the classifier on a set of data that was not used for training, an important thing
to ensure in machine learning practice. We elect to use 70% of the data for training, and 30% for testing/validation afterwards.

In [23]:
data, header = unison_shuffle_data(enron_spam_bag.values, header)

# split into independent 70% training and 30% testing sets
idx = int(0.7 * data.shape[0])  # get 70% index value

# 70% of data for training
train_x = data[:idx, :]
train_y = header[:idx]

# remaining 30% for testing
test_x = data[idx:, :]
test_y = header[idx:]

print("train_x/train_y list details, to make sure they are of the right form:")
print(len(train_x))
print(train_x)
print(len(train_y))
print(train_y[:5])

train_x/train_y list details, to make sure they are of the right form:
1400
[[ 8  0  0 ...  0  0  0]
 [11  0  0 ...  0  0  0]
 [ 6  1  0 ...  0  0  0]
 ...
 [ 3  0  0 ...  0  0  0]
 [ 1  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]
1400
[1, 1, 1, 0, 0]


Since 70% of 2000 is 1400, looks good! (for n_sample=1000)
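
As an aside, scikit-learn's `train_test_split` can shuffle and split in one step, and its `stratify` argument keeps the class ratio the same in both parts. The sketch below uses made-up data and is an alternative to the manual shuffle-and-slice above, not what the notebook does; `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, ordered by class (10 + 10)
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 10 + [0] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 30% held out for testing
    stratify=y,        # preserve the 50/50 class ratio in both parts
    random_state=0,    # arbitrary seed, for reproducibility
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())
```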

## Generalized Linear Models