# Lab 5: Spam Detection

In this assignment, we will build a recurrent neural network to classify an SMS text message
as "spam" or "not spam". In the process, you will:
    
1. Clean and process text data for machine learning.
2. Understand and implement a character-level recurrent neural network.
3. Use torchtext to build recurrent neural network models.
4. Understand batching for a recurrent neural network, and use torchtext to implement RNN batching.

Colab Link: https://colab.research.google.com/drive/1M0TAcacWcxSti2co7sAfEg16wzAHFnlI?usp=sharing

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

If you are interested in using the most recent version of torchtext, you can look at the following document to see how to convert the legacy version to the new version:
https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

In [None]:
!pip install torchtext==0.6 torch==1.11

## Part 1. Data Cleaning [15 pt]

We will be using the "SMS Spam Collection Data Set" available at http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

There is a link to download the "Data Folder" at the very top of the webpage. Download the zip file, unzip it, and upload the file `SMSSpamCollection` to Colab.    

In [None]:
SMSSpamCollection = '/content/gdrive/MyDrive/Colab Notebooks/SMS_Spam_Collection/SMSSpamCollection'

### Part (a) [2 pt]

Open up the file in Python, and print out one example of a spam SMS, and one example of a non-spam SMS.

What is the label value for a spam message, and what is the label value for a non-spam message?

In [None]:
for line in open(SMSSpamCollection):
    if line.split()[0] == 'spam':
        print('Spam message example:', line[5:])
        print('Spam label:', line.split()[0])
        print('')
        break

for line in open(SMSSpamCollection):
    if line.split()[0] == 'ham':
        print('Non-spam message example:', line[4:])
        print('Non-spam label:', line.split()[0])
        break

Spam message example: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

Spam label: spam

Non-spam message example: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

Non-spam label: ham
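
Since the file is later loaded as tab-separated values, splitting each line at the first tab is a slightly more robust way to separate the label from the message than slicing at a fixed offset. The following is a minimal sketch; `sample_lines` and `parse_line` are illustrative names that are not part of the assignment, and the sample lines are shortened stand-ins for real file content.

```python
# Two illustrative lines in the "label<TAB>message" layout of the data file.
sample_lines = [
    "ham\tGo until jurong point, crazy..\n",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n",
]

def parse_line(line):
    """Split one raw line into (label, message) at the first tab."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

for raw in sample_lines:
    label, message = parse_line(raw)
    print(f"{label!r}: {message!r}")
```

Splitting with a maximum of one split keeps any tab characters inside the message itself intact.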


### Part (b) [1 pt]

How many spam messages are there in the data set?
How many non-spam messages are there in the data set?


In [None]:
spam_count = 0
ham_count = 0
for msg in open(SMSSpamCollection):
    if msg.split()[0] == 'spam':
        spam_count += 1
    elif msg.split()[0] == 'ham':
        ham_count += 1
print('There are', spam_count, 'spam messages')
print('There are', ham_count, 'non-spam messages')

There are 747 spam messages
There are 4827 non-spam messages
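
The same count can be written more compactly with `collections.Counter`. This is only a sketch on a few made-up lines (`sample_lines` and `count_labels` are hypothetical names); the last part reuses the two totals printed above to express the imbalance as a fraction.

```python
from collections import Counter

def count_labels(lines):
    """Count how often each label (the text before the first tab) occurs."""
    return Counter(line.split("\t", 1)[0] for line in lines if line.strip())

# Made-up lines in the same layout as the data file.
sample_lines = ["ham\tsee you at 5\n", "spam\tWIN a prize now\n", "ham\tok\n"]
counts = count_labels(sample_lines)
print(counts["spam"], counts["ham"])

# Fraction of spam in the full data set, using the counts reported above.
spam_total, ham_total = 747, 4827
spam_fraction = spam_total / (spam_total + ham_total)
print(f"spam fraction: {spam_fraction:.3f}")
```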


### Part (c) [4 pt]

We will be using the package `torchtext` to load, process, and batch the data.
A tutorial on torchtext is available below. This tutorial uses the same
Sentiment140 data set that we explored during lecture.

https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8

Unlike what we did during lecture, we will be building a **character level RNN**.
That is, we will treat each **character** as a token in our sequence,
rather than each **word**.

Identify two advantages and two disadvantages of modelling SMS text
messages as a sequence of characters rather than a sequence of words.

In [None]:
# Advantages

  # 1) Treating each character as a token requires less memory for the
  # vocabulary, since there are far fewer distinct characters than distinct words.
  # 2) A character-level model is more robust to misspellings, abbreviations
  # and unusual spellings, which are common in SMS text: there are no
  # out-of-vocabulary words, and the model can pick up patterns within words.

# Disadvantages

  # 1) Higher computational cost: a message contains many more characters
  # than words, so the RNN must process much longer sequences.
  # 2) The model has to learn word-level meaning from raw characters and carry
  # information across more time steps, which can make training harder and
  # may lower accuracy compared to a word-level RNN.
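
To make the trade-off concrete, the sketch below tokenizes one made-up message both ways and checks how a misspelled word fares against a tiny hypothetical training vocabulary. All names and strings here are illustrative assumptions.

```python
msg = "Free entry 2 win tkts"

char_tokens = list(msg)      # character-level: one token per character
word_tokens = msg.split()    # word-level: one token per whitespace-separated word
print(len(char_tokens), len(word_tokens))

# A tiny hypothetical training vocabulary.
train_words = {"free", "entry", "win"}
char_vocab = set("".join(train_words))

unseen = "freee"  # a misspelling that never appeared in training
print(unseen in train_words)                  # unknown as a word
print(all(c in char_vocab for c in unseen))   # but every character is known
```

The character sequence is several times longer than the word sequence, while the misspelled word is out of vocabulary only at the word level.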

### Part (d) [1 pt]

We will be loading our data set using `torchtext.data.TabularDataset`. The
constructor will read directly from the `SMSSpamCollection` file.

For the data file to be read successfully, we
need to specify the **fields** (columns) in the file.
In our case, the dataset has two fields:

- a text field containing the SMS messages,
- a label field which will be converted into a binary label.

Split the dataset into `train`, `valid`, and `test`. Use a 60-20-20 split.
You may find this torchtext API page helpful:
https://torchtext.readthedocs.io/en/latest/data.html#dataset

Hint: There is a `Dataset` method that can perform the random split for you.

In [None]:
import torchtext

text_field = torchtext.data.Field(sequential=True,
                       tokenize=lambda x: x,
                       include_lengths=True,
                       batch_first=True,
                       use_vocab=True)

label_field = torchtext.data.Field(sequential=False,
                                   use_vocab=False,
                                   is_target=True,
                                   batch_first=True,
                                   preprocessing=lambda x: int(x == 'spam'))

fields = [('label', label_field), ('sms', text_field)]

dataset = torchtext.data.TabularDataset(SMSSpamCollection,
                                        "tsv",
                                        fields)

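# stratified=True keeps the spam/non-spam ratio roughly equal across the three splits
training_data, validation_data, testing_data = dataset.split(split_ratio=[0.60, 0.20, 0.20],
                                                             stratified=True)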
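
For intuition, a 60-20-20 random split can be sketched in plain Python as follows. This is not how torchtext implements it; `random_split` is a hypothetical helper that ignores labels entirely, so it does not stratify by class.

```python
import random

def random_split(examples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the examples and cut them into train/valid/test parts."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(ratios[0] * len(shuffled))
    n_valid = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains
    return train, valid, test

train, valid, test = random_split(range(100))
print(len(train), len(valid), len(test))
```

Every example lands in exactly one of the three parts, which is the property that matters for an honest evaluation.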

### Part (e) [2 pt]

You saw in part (b) that there are many more non-spam messages than spam messages.
This **imbalance** in our training data will be problematic for training.
We can fix this disparity by duplicating spam messages in the training set,
so that the training set is roughly **balanced**.

Explain why having a balanced training set is helpful for training our neural network.

Note: if you are not sure, try removing the below code and training your model.

In [None]:
# save the original training examples
old_train_examples = training_data.examples
# get all the spam messages in `train`
train_spam = []
for item in training_data.examples:
    if item.label == 1:
        train_spam.append(item)
# duplicate each spam message 6 more times
training_data.examples = old_train_examples + train_spam * 6

In [None]:
# Having a balanced training set is helpful because we don't want the model to
# be biased towards the class with more training data. With the counts from
# part (b), a model that always predicts "not spam" would already be right on
# roughly 87% of messages (4827 of 5574) without learning anything about spam.
# Duplicating spam messages makes mistakes on spam contribute about as much to
# the loss as mistakes on non-spam, so the model is pushed to learn features
# of both classes.
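
The effect of the duplication step can be checked on a toy list. The sketch below uses a hypothetical `oversample` helper and made-up `(label, text)` pairs with roughly the same class ratio as the data set; the last line reuses the counts from part (b) to compute the accuracy of always predicting "not spam".

```python
def oversample(examples, is_minority, extra_copies):
    """Return the examples plus `extra_copies` additional copies of each minority example."""
    minority = [ex for ex in examples if is_minority(ex)]
    return list(examples) + minority * extra_copies

# Toy data: 13 non-spam and 2 spam examples.
toy = [("ham", f"msg {i}") for i in range(13)] + [("spam", "win"), ("spam", "free")]
balanced = oversample(toy, lambda ex: ex[0] == "spam", extra_copies=6)

n_spam = sum(1 for label, _ in balanced if label == "spam")
n_ham = len(balanced) - n_spam
print(n_spam, n_ham)

# Accuracy of a model that always predicts "not spam" on the full data set.
majority_accuracy = 4827 / (4827 + 747)
print(f"{majority_accuracy:.3f}")
```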

### Part (f) [1 pt]

We need to build the vocabulary on the training data by running the below code.
This finds all the possible character tokens in the training set.

Explain what the variables `text_field.vocab.stoi` and `text_field.vocab.itos` represent.

In [None]:
text_field.build_vocab(training_data)
print(text_field.vocab.stoi)
print(text_field.vocab.itos)

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7db044bb70d0>>, {'<unk>': 0, '<pad>': 1, ' ': 2, 'e': 3, 'o': 4, 't': 5, 'a': 6, 'n': 7, 'r': 8, 'i': 9, 's': 10, 'l': 11, 'u': 12, 'h': 13, '0': 14, 'd': 15, 'c': 16, '.': 17, 'm': 18, 'y': 19, 'w': 20, 'p': 21, 'g': 22, '1': 23, 'f': 24, 'b': 25, '2': 26, 'T': 27, '8': 28, 'k': 29, 'E': 30, 'v': 31, '5': 32, 'S': 33, 'C': 34, 'O': 35, 'I': 36, '4': 37, 'N': 38, '7': 39, 'A': 40, 'x': 41, '3': 42, '6': 43, 'R': 44, '!': 45, '9': 46, ',': 47, 'P': 48, 'M': 49, 'W': 50, 'L': 51, 'U': 52, 'H': 53, 'D': 54, 'B': 55, 'F': 56, 'G': 57, 'Y': 58, "'": 59, '/': 60, '?': 61, '£': 62, '&': 63, '-': 64, 'X': 65, ':': 66, 'z': 67, 'V': 68, 'j': 69, 'K': 70, '*': 71, 'J': 72, ')': 73, '+': 74, ';': 75, '(': 76, '"': 77, 'q': 78, 'Q': 79, '>': 80, '#': 81, '@': 82, '=': 83, 'Z': 84, 'ü': 85, 'Ü': 86, '$': 87, '‘': 88, '\x92': 89, '[': 90, ']': 91, '<': 92, '%': 93, '_': 94, '|': 95, '¡': 96, '’': 97, '…': 98, '\

In [None]:
# text_field.vocab.stoi is a dictionary mapping each character token to its
# numerical identifier.
# stoi means "string to index": it maps a string token to an integer index.

# text_field.vocab.itos is a list of character tokens, where the position of
# each token in the list is its numerical identifier.
# itos means "index to string": it maps an integer index to a string token.
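
The relationship between the two structures can be illustrated with a small hand-built vocabulary. `build_vocab` below is a hypothetical helper, not torchtext's implementation; it assumes tokens are ordered by frequency, with ties broken alphabetically.

```python
from collections import Counter

def build_vocab(texts, specials=("<unk>", "<pad>")):
    """Build (stoi, itos) over the characters of `texts`, most frequent first."""
    counts = Counter(ch for text in texts for ch in text)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    itos = list(specials) + ordered                     # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # string -> index
    return stoi, itos

stoi, itos = build_vocab(["aab", "abc"])
print(itos)
print([stoi[ch] for ch in "cab"])
```

The two mappings are inverses of each other: `itos[stoi[token]]` returns the original token for every token in the vocabulary.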

### Part (g) [2 pt]

The tokens `<unk>` and `<pad>` were not in our SMS text messages.
What do these two values represent?

In [None]:
# <unk> stands for an unknown token: any character that is not in the
# vocabulary built from the training set is mapped to <unk>.

# <pad> is the padding token. It is appended to shorter character sequences
# so that all sequences in a batch have the same length.
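
A minimal sketch of how both special tokens are used when turning text into a batch. The indices 0 and 1 match the vocabulary printed in part (f); the three-letter vocabulary and the helper names are assumptions for illustration only.

```python
UNK, PAD = 0, 1
stoi = {"<unk>": UNK, "<pad>": PAD, "a": 2, "b": 3, "c": 4}

def encode(text):
    """Map each character to its index, falling back to <unk> for unseen characters."""
    return [stoi.get(ch, UNK) for ch in text]

def pad_batch(sequences):
    """Right-pad every sequence with <pad> up to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([encode("abc"), encode("az")])
print(batch)
```

Here `z` is not in the vocabulary, so it becomes index 0, and the shorter sequence receives one trailing index 1.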

### Part (h) [2 pt]

Since text sequences are of variable length, `torchtext` provides a `BucketIterator` data loader,
which batches similar length sequences together. The iterator also provides functionalities to
pad sequences automatically.

Take a look at 10 batches in `train_iter`. What is the maximum length of the
input sequence in each batch? How many `<pad>` tokens are used in each of the 10
batches?

In [None]:
train_iter = torchtext.data.BucketIterator(training_data,
                                           batch_size=32,
                                           sort_key=lambda x: len(x.sms), # to minimize padding
                                           sort_within_batch=True,        # sort within each batch
                                           repeat=False)                  # do not repeat: one pass over the data per loop

In [None]:
batch_num = 1
for batch in train_iter:
  if batch_num <= 10:
    print('batch number', batch_num, 'has max length:', int(batch.sms[1][0]))
    pad = 0
    for msg in range(0, len(batch.sms[1])):
      pad = pad + (batch.sms[1][0] - batch.sms[1][msg])
    print('batch number', batch_num, 'has', int(pad), '<pad> tokens \n')
    batch_num += 1
  else:
    break
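
The padding count depends only on the sequence lengths in each batch, which also shows why grouping similar lengths helps. The sketch below uses hypothetical helpers and made-up lengths; it is not tied to the actual batches produced by `train_iter`.

```python
def padding_stats(lengths):
    """Return (max length, number of <pad> tokens) for one batch of sequence lengths."""
    max_len = max(lengths)
    n_pad = sum(max_len - n for n in lengths)
    return max_len, n_pad

def total_padding(lengths, batch_size):
    """Total <pad> tokens when `lengths` is cut into consecutive batches."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    return sum(padding_stats(batch)[1] for batch in batches)

print(padding_stats([12, 12, 11, 9]))

lengths = [5, 50, 6, 48]
print(total_padding(lengths, batch_size=2))           # batches in arbitrary order
print(total_padding(sorted(lengths), batch_size=2))   # similar lengths grouped together
```

Grouping the short messages together and the long messages together cuts the padding in this toy case from 87 tokens to 3.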

## Part 2. Model Building [8 pt]

Build a recurrent neural network model, using an architecture of your choosing.
Use the one-hot embedding of each character as input to your recurrent network.
Use one or more fully-connected layers to make the prediction based on your
recurrent network output.

Instead of using the RNN output value for the final token, another often used
strategy is to max-pool over the entire output array. That is, instead of calling
something like:

```
out, _ = self.rnn(x)
self.fc(out[:, -1, :])
```

where `self.rnn` is an `nn.RNN`, `nn.GRU`, or `nn.LSTM` module, and `self.fc` is a
fully-connected
layer, we use:

```
out, _ = self.rnn(x)
self.fc(torch.max(out, dim=1)[0])
```

This works reasonably in practice. An even better alternative is to concatenate the
max-pooling and average-pooling of the RNN outputs:

```
out, _ = self.rnn(x)
out = torch.cat([torch.max(out, dim=1)[0],
                 torch.mean(out, dim=1)], dim=1)
self.fc(out)
```

We encourage you to try out all these options. The way you pool the RNN outputs
is one of the "hyperparameters" that you can choose to tune later on.
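
The three pooling options differ in the shape they hand to the fully-connected layer. The sketch below uses a random tensor as a stand-in for the RNN output (an assumption, so that no model is needed) and prints the resulting shapes.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 4, 7, 5

# Stand-in for the output of an RNN created with batch_first=True.
out = torch.randn(batch_size, seq_len, hidden_size)

last = out[:, -1, :]                     # output at the final time step
max_pool = torch.max(out, dim=1)[0]      # element-wise max over time
mean_pool = torch.mean(out, dim=1)       # element-wise mean over time
concat = torch.cat([max_pool, mean_pool], dim=1)

print(last.shape, max_pool.shape, concat.shape)
```

Note that the concatenated version has `2 * hidden_size` features, so the fully-connected layer that follows it would need `nn.Linear(2 * hidden_size, num_classes)`.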

In [None]:
# You might find this code helpful for obtaining
# PyTorch one-hot vectors.

ident = torch.eye(10)
print(ident[0]) # one-hot vector
print(ident[1]) # one-hot vector
x = torch.tensor([[1, 2], [3, 4]])
print(ident[x]) # one-hot vectors

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]]])


In [None]:
onehot_size = len(text_field.vocab.itos)

In [None]:
class RNN(nn.Module):
    def __init__(self, onehot_size, hidden_size, num_classes):
        super(RNN, self).__init__()
        self.name = "Spam_Detection_RNN"
        self.emb = torch.eye(onehot_size)  # fixed one-hot lookup table (not trained)
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(onehot_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.emb[x]                                    # (batch, seq_len, onehot_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size)   # initial hidden state
        out, _ = self.rnn(x, h0)                           # (batch, seq_len, hidden_size)
        out = self.fc(torch.max(out, dim=1)[0])            # max-pool over time, then classify
        return out

## Part 3. Training [16 pt]

### Part (a) [4 pt]

Complete the `get_accuracy` function, which will compute the
accuracy (rate) of your model across a dataset (e.g. validation set).
You may modify `torchtext.data.BucketIterator` to make your computation
faster.

In [None]:
def get_accuracy(model, data):
    """ Compute the accuracy of the `model` across a dataset `data`

    Example usage:

    >>> model = MyRNN() # to be defined
    >>> get_accuracy(model, valid) # the variable `valid` is from above
    """
    correct, total = 0, 0

    data_loader = torchtext.data.BucketIterator(data, batch_size=64,
                                               sort_key=lambda x: len(x.sms),
                                               sort_within_batch=True,
                                               repeat=False)

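    with torch.no_grad():  # gradients are not needed for evaluation
        for message, label in data_loader:
            output = model(message[0])             # message is (token indices, lengths)
            pred = output.max(1, keepdim=True)[1]  # index of the larger logit
            correct += pred.eq(label.view_as(pred)).sum().item()
            total += message[0].shape[0]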

    return correct / total

### Part (b) [4 pt]

Train your model. Plot the training curve of your final model.
Your training curve should have the training/validation loss and
accuracy plotted periodically.

Note: Not all of your batches will have the same batch size.
In particular, if your training set does not divide evenly by
your batch size, there will be a batch that is smaller than
the rest.
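
One consequence of uneven batch sizes: if each batch loss is already a mean over its batch (the default for `nn.CrossEntropyLoss`), a plain average of batch losses over-weights the small final batch. The sketch below, with made-up numbers and a hypothetical helper, weights each batch by its size instead.

```python
def epoch_average_loss(batch_losses, batch_sizes):
    """Per-example average loss over an epoch, given the mean loss of each batch."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Made-up example: two full batches and one small, high-loss final batch.
losses = [0.5, 0.5, 2.0]
sizes = [64, 64, 2]

naive = sum(losses) / len(losses)
weighted = epoch_average_loss(losses, sizes)
print(f"naive: {naive:.3f}, weighted: {weighted:.3f}")
```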

In [None]:
def train_RNN(model, training_data, validation_data, batch_size, num_epochs=10, learning_rate=0.0005):
  criterion = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
  training_losses, validation_losses, training_accuracy, validation_accuracy = [], [], [], []
  epochs = []
  train_loader = torchtext.data.BucketIterator(training_data, batch_size=batch_size,
                                               sort_key=lambda x: len(x.sms),
                                               sort_within_batch=True,
                                               repeat=False)
  validation_loader = torchtext.data.BucketIterator(validation_data,
                                               batch_size=batch_size,
                                               sort_key=lambda x: len(x.sms),
                                               sort_within_batch=True,
                                               repeat=False)

  for epoch in range(num_epochs):
    for msg, labels in train_loader:
      optimizer.zero_grad()
      prediction = model(msg[0])
      loss = criterion(prediction, labels)
      loss.backward()
      optimizer.step()
    training_losses.append(float(loss))

    for msg, labels in validation_loader:
      prediction = model(msg[0])
      loss = criterion(prediction, labels)
    validation_losses.append(float(loss))

    epochs.append(epoch)
    training_accuracy.append(get_accuracy(model, training_data))
    validation_accuracy.append(get_accuracy(model, validation_data))
    print("Epoch %d; Validation Loss %f; Training Accuracy %f; Validation Accuracy %f" % (
          epoch+1, loss, training_accuracy[-1], validation_accuracy[-1]))
    model_path = "model_{0}_bs{1}_lr{2}_epoch{3}".format(model.name, batch_size, learning_rate, epoch)
    torch.save(model.state_dict(), model_path)

  # plotting
  plt.title("Training Curve")
  plt.plot(training_losses, label="Train")
  plt.plot(validation_losses, label="Validation")
  plt.xlabel("Epoch")
  plt.ylabel("Loss")
  plt.legend(loc='best')
  plt.show()

  plt.title("Training Curve")
  plt.plot(epochs, training_accuracy, label="Train")
  plt.plot(epochs, validation_accuracy, label="Validation")
  plt.xlabel("Epoch")
  plt.ylabel("Accuracy")
  plt.legend(loc='best')
  plt.show()

  print("Highest Training Accuracy: {}".format(max(training_accuracy)))
  print("Highest Validation Accuracy: {}".format(max(validation_accuracy)))

In [None]:
model_1 = RNN(onehot_size, 100, 2)
train_RNN(model_1, training_data, validation_data, batch_size=64, num_epochs=25, learning_rate=0.0005)

### Part (c) [4 pt]

Choose at least 4 hyperparameters to tune. Explain how you tuned the hyperparameters.
You don't need to include your training curve for every model you trained.
Instead, explain what hyperparameters you tuned, what the best validation accuracy was,
and the reasoning behind the hyperparameter decisions you made.

For this assignment, you should tune more than just your learning rate and epoch.
Choose at least 2 hyperparameters that are unrelated to the optimizer.

In [None]:
# I will first change the hidden size of the RNN (the number of hidden units,
# which is also the input size of the fully-connected layer). Since the model
# classifies whether the message is spam or not spam, changing the hidden size
# changes the model's capacity. I doubled it from 100->200

model_2 = RNN(onehot_size, 200, 2)
train_RNN(model_2, training_data, validation_data, batch_size=64, num_epochs=25, learning_rate=0.0005)

In [None]:
# I will now change the batch size to try to reduce noise
# Also reduce the number of epochs since the model seems to be overfitting a bit
# increased batch size 64->128, decreased epochs 25->20

model_3 = RNN(onehot_size, 200, 2)
train_RNN(model_3, training_data, validation_data, batch_size=128, num_epochs=20, learning_rate=0.0005)

In [None]:
# Too much overfitting, try decreasing batch size to 32

model_4 = RNN(onehot_size, 200, 2)
train_RNN(model_4, training_data, validation_data, batch_size=32, num_epochs=20, learning_rate=0.0005)

In [None]:
# I will keep batch size at 64, since batch size 32 overfits too.
# Now I will decrease learning rate to try to improve model performance. Decreased from 0.0005->0.0001

model_5 = RNN(onehot_size, 200, 2)
train_RNN(model_5, training_data, validation_data, batch_size=64, num_epochs=25, learning_rate=0.0001)

In [None]:
# Now I will try increasing learning rate to try to improve model performance.
# Increased from 0.0001->0.001

model_6 = RNN(onehot_size, 200, 2)
train_RNN(model_6, training_data, validation_data, batch_size=64, num_epochs=20, learning_rate=0.001)

In [None]:
# I will keep learning rate at 0.0005.
# Lastly I will use a new RNN model with output = self.fc(out[:, -1, :])

In [None]:
class RNN(nn.Module):
    def __init__(self, onehot_size, hidden_size, num_classes):
        self.name = "Spam_Detection_RNN"
        super(RNN, self).__init__()
        self.emb = torch.eye(onehot_size)
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(onehot_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.emb[x]
        h0 = torch.zeros(1, x.size(0), self.hidden_size)
        output, _ = self.rnn(x, h0)
        output = self.fc(output[:, -1, :])
        return output

In [None]:
model_7 = RNN(onehot_size, 200, 2)
train_RNN(model_7, training_data, validation_data, batch_size=64, num_epochs=20, learning_rate=0.0005)

In [None]:
# I will retrain model_2 with 20 epochs instead and that will be the best model

class RNN(nn.Module):
    def __init__(self, onehot_size, hidden_size, num_classes):
        self.name = "Spam_Detection_RNN"
        super(RNN, self).__init__()
        self.emb = torch.eye(onehot_size)
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(onehot_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.emb[x]
        h0 = torch.zeros(1, x.size(0), self.hidden_size)
        output, _ = self.rnn(x, h0)
        output = self.fc(torch.max(output, dim=1)[0])
        return output

In [None]:
model_8 = RNN(onehot_size, 200, 2)
train_RNN(model_8, training_data, validation_data, batch_size=64, num_epochs=20, learning_rate=0.0005)

### Part (d) [2 pt]

Before we deploy a machine learning model, we usually want to have a better understanding
of how our model performs beyond its validation accuracy. An important metric to track is
*how well our model performs in certain subsets of the data*.

In particular, what is the model's error rate amongst data with negative labels?
This is called the **false positive rate**.

What about the model's error rate amongst data with positive labels?
This is called the **false negative rate**.

Report your final model's false positive and false negative rate across the
validation set.

In [None]:
# Create a Dataset of only spam validation examples
valid_spam = torchtext.data.Dataset(
    [e for e in validation_data.examples if e.label == 1],
    validation_data.fields)

# Create a Dataset of only non-spam validation examples
valid_nospam = torchtext.data.Dataset(
    [e for e in validation_data.examples if e.label == 0],
    validation_data.fields)


valid_spam_accuracy = get_accuracy(model_8, valid_spam)
valid_nospam_accuracy = get_accuracy(model_8, valid_nospam)

false_positive_rate = (1 - valid_nospam_accuracy)*100
false_negative_rate = (1 - valid_spam_accuracy)*100

print("The false positive rate is:", false_positive_rate, "%")
print("The false negative rate is:", false_negative_rate, "%")

The false positive rate is: 0.7253886010362698 %
The false negative rate is: 7.9999999999999964 %
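
Both rates can also be computed directly from a list of labels and predictions, without building separate datasets. The sketch below uses made-up labels and predictions (1 = spam, 0 = not spam) and a hypothetical `error_rates` helper.

```python
def error_rates(labels, preds):
    """Return (false positive rate, false negative rate) for binary labels."""
    false_pos = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    false_neg = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return false_pos / negatives, false_neg / positives

labels = [0, 0, 0, 0, 1, 1]
preds = [0, 1, 0, 0, 1, 0]
fpr, fnr = error_rates(labels, preds)
print(f"FPR: {fpr:.2f}, FNR: {fnr:.2f}")
```

The false positive rate divides by the number of negative examples and the false negative rate by the number of positive examples, which is why each equals one minus the accuracy on that subset.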


### Part (e) [2 pt]

The impact of a false positive vs a false negative can be drastically different.
If our spam detection algorithm was deployed on your phone, what is the impact
of a false positive on the phone's user? What is the impact of a false negative?

In [None]:
# A false positive means that a non-spam message was classified as a spam message.
# If our spam detection algorithm was deployed on a phone, it could mean that
# the user might miss important messages since they were classified and sorted
# into the spam folder instead of into their messages.

# A false negative means that a spam message was classified as a non-spam message.
# For the phone's user, this could mean that the user is receiving a lot of spam
# messages or even malicious/phishing messages which could harm the user or their
# phone. This can be annoying to the user since they are constantly getting
# messages which they have to spend time deciding if they're valid or not.

## Part 4. Evaluation [11 pt]

### Part (a) [1 pt]

Report the final test accuracy of your model.

In [None]:
testing_accuracy = get_accuracy(model_8, testing_data)
print("The best performing model's test accuracy is: ", testing_accuracy)

The best performing model's test accuracy is:  0.9757630161579892


### Part (b) [3 pt]

Report the false positive rate and false negative rate of your model across the test set.

In [None]:
testing_spam = torchtext.data.Dataset(
    [e for e in testing_data.examples if e.label == 1],
    testing_data.fields)

test_nospam = torchtext.data.Dataset(
    [e for e in testing_data.examples if e.label == 0],
    testing_data.fields)

test_spam_accuracy = get_accuracy(model_8, testing_spam)
test_nospam_accuracy = get_accuracy(model_8, test_nospam)

false_positive_rate = (1 - test_nospam_accuracy)*100
false_negative_rate = (1 - test_spam_accuracy)*100

print("The false positive rate is:", false_positive_rate, "%")
print("The false negative rate is:", false_negative_rate, "%")

The false positive rate is: 1.2435233160621784 %
The false negative rate is: 10.738255033557042 %


### Part (c) [3 pt]

What is your model's prediction of the **probability** that
the SMS message "machine learning is sooo cool!" is spam?

Hint: To begin, use `text_field.vocab.stoi` to look up the index
of each character in the vocabulary.

In [None]:
msg = "machine learning is sooo cool!"

msg_index = []
for char in msg:
  msg_index.append(text_field.vocab.stoi[char])

test_msg = torch.LongTensor([msg_index])
prediction = model_8(test_msg)
probability = F.softmax(prediction, dim=1)
print("The probability the message is spam is", float(probability[0][1]*100), "%")

The probability the message is spam is 2.230325222015381 %
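
The softmax step turns the model's two output scores into probabilities that sum to one. The sketch below works through it by hand on a pair of hypothetical logits (not the model's actual outputs), and checks the two-class identity that the spam probability equals a sigmoid of the logit difference.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [not spam, spam].
logits = [2.0, -1.0]
probs = softmax(logits)
print(f"P(spam) = {probs[1]:.4f}")

# With two classes, softmax reduces to a sigmoid of the logit difference.
sigmoid = 1 / (1 + math.exp(-(logits[1] - logits[0])))
print(f"sigmoid = {sigmoid:.4f}")
```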


### Part (d) [4 pt]

Do you think detecting spam is an easy or difficult task?

Since machine learning models are expensive to train and deploy, it is very
important to compare our models against baseline models: a simple
model that is easy to build and inexpensive to run that we can compare our
recurrent neural network model against.

Explain how you might build a simple baseline model. This baseline model
can be a simple neural network (with very few weights), a hand-written algorithm,
or any other strategy that is easy to build and test.

**Do not actually build a baseline model. Instead, provide instructions on
how to build it.**

In [None]:
# Compared to computer vision tasks, one might think detecting spam is a relatively
# easy task, however detecting spam is actually a very difficult task since spam
# messages use manipulative language and try to target vulnerable users'
# emotions to try to make it seem like a human sent the message.
# As spam messages continue to get more convincing as more language is
# introduced and understood online, it gets harder to detect spam.

# This is how I would build a simple baseline model.
# I would build it using a lot of words/symbols/emoticons that commonly
# occur in spam messages to use as an indicator for whether the message might
# be spam or not. If the content of the message passes a certain threshold of
# potential spam phrases, it has a higher probability of being spam. We can
# calculate the spam probability for each message and use that probability for
# the baseline model. While this would not be as accurate as an RNN, it
# would be a lot simpler to build and test, hence could be used as a good
# baseline to ensure the RNN's performance exceeds the baseline model's accuracy.