<a href="https://colab.research.google.com/github/shicong621/Colab/blob/main/PA5_Shicong_Wang.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment Five: Spam Detection with a Neural Network

In this assignment, you are asked to build a neural network that can detect whether a given SMS message is spam.

The provided files are:
1. `spam_train.csv`: a CSV file containing the training data, where the `text` column holds the SMS messages and the `label` column indicates whether each message is 'ham' (0) or 'spam' (1).
2. `spam_test.csv`: a CSV file containing the testing data, in the same format as `spam_train.csv`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import torch

torch.cuda.is_available()

True

**Step 1: Compute each SMS message vector as the average of the vectors of the words in it.**

Just like the last assignment, we compute the 'representation' of each message, i.e., its vector, by averaging the vectors of its words. Last time the word vectors came from Word2Vec; this time we use pre-trained [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) instead. Specifically, we use the `glove.6B.100d` embedding to look up the vector of each word in a message, skipping any word that is not in the `glove.6B.100d` vocabulary.

In other words, you need to:
1. Have a [basic idea](https://nlp.stanford.edu/pubs/glove.pdf) of how GloVe provides pre-trained word embeddings (vectors).
2. Download and extract word vectors from `glove.6B.100d`, contained in `glove.6B.zip`.
3. Compute the message vectors by averaging the vectors of words in the message.
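
The steps above can be sketched on toy data. This is only a minimal illustration, not a reference solution: the two-dimensional vectors and the helper names `load_glove` and `message_vector` are made up for the example (the real `glove.6B.100d.txt` has one word followed by 100 floats per line), and falling back to a zero vector for a message with no known words is an assumption.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text format: each line is a word followed by its vector."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        if parts:
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def message_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens (zeros if none are known)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)

# Toy stand-in for the GloVe file, with 2-dimensional vectors.
toy_file = io.StringIO("free 1.0 0.0\nprize 0.0 1.0\nhello 0.5 0.5\n")
toy_embeddings = load_glove(toy_file)

# "xyzzy" is not in the toy vocabulary, so it is skipped.
vec = message_vector(["free", "prize", "xyzzy"], toy_embeddings, dim=2)
empty_vec = message_vector(["xyzzy"], toy_embeddings, dim=2)
print(vec, empty_vec)
```

Checking `t in embeddings` against the dictionary itself keeps each lookup constant-time, which matters with a vocabulary of several hundred thousand words.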

In [3]:
import os
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from nltk import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

# Load the small English spaCy pipeline used for sentence splitting and lemmatization.
nlp = spacy.load('en_core_web_sm')


In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [50]:
SPAM_TRAIN = os.path.join("spam_train.csv")
SPAM_TEST  = os.path.join("spam_test.csv")

train = pd.read_csv(SPAM_TRAIN)
test  = pd.read_csv(SPAM_TEST)

train_X_list = list(train['text'])
train_Y = np.array(train['label'])
test_X_list = list(test['text'])
test_Y = np.array(test['label'])

In [51]:
# Map each GloVe word to its 100-dimensional vector.
embeddings_dict = {}
with open("glove.6B.100d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

def preprocess(text):
    """Turn each message into a list of sentences, each a list of lemmas."""
    return_list = []
    print("the number of sentences to be processed is: ", len(text))

    # load the language model for English
    nlp = spacy.load('en_core_web_sm')
    for content in text:
        doc = nlp(content)
        assert doc.has_annotation("SENT_START")

        # segment the message into sentences and lower-case them
        content_2 = []
        for sent in doc.sents:
            content_2.append(sent.text.lower())

        # tokenize each sentence and keep the lemma of every token
        content_3 = []
        for sent in content_2:
            doc = nlp(sent)
            words = []
            for token in doc:
                words.append(token.lemma_)
            content_3.append(words)
        return_list.append(content_3)

    return return_list

In [52]:
train_X_list = preprocess(train_X_list)
test_X_list  = preprocess(test_X_list)


train_X = []
for message in train_X_list:
    rep_vec_list = []
    for sentence in message:
        for word in sentence:
            # look the word up in the dict directly; scanning a list of all keys is far slower
            if word in embeddings_dict:
                rep_vec_list.append(embeddings_dict[word])
    # a message with no in-vocabulary words yields NaN here (mean of an empty array)
    train_X.append(np.array(rep_vec_list).mean(axis=0))
train_X = np.array(train_X)


test_X = []
for message in test_X_list:
    rep_vec_list = []
    for sentence in message:
        for word in sentence:
            if word in embeddings_dict:
                rep_vec_list.append(embeddings_dict[word])
    # kept as a list for now so that NaN entries can be removed before conversion
    test_X.append(np.array(rep_vec_list).mean(axis=0))

the number of sentences to be processed is:  1000
the number of sentences to be processed is:  494




**Step 2: Build a 'dataset + data loader' that can feed data to train your model with PyTorch.**

Our goal is to train a spam detection model (classification). Here's an [example](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) of how a classifier is trained. Although it is for image classification, the idea is very similar:

1. Prepare/build a dataset and load it with data loader;
2. Prepare/build a model that takes the data input and predicts; and 
3. Prepare/build the optimizer and loss functions to train the model with the dataset.

Naturally, the next thing to do is to prepare the data. We do it by building a `Dataset` and a `DataLoader` with PyTorch.

You may refer to [this page](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to get an idea of how to make a `Dataset` and a `DataLoader`.

Hints:
1. Make sure `__init__`, `__len__` and `__getitem__` of your dataset class are implemented properly. In particular, `__getitem__` should return the message vector at the given index together with its label.
2. Don't compute the message vector inside `__getitem__`; precompute the vectors once, otherwise the training process will slow down A LOT.
3. Make sure shuffling is turned on in your data loader, because the data in the CSV file is not shuffled.
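
As a quick way to check batch shapes before writing a custom class, the hints above can be tried out with PyTorch's built-in `TensorDataset`, which wraps tensors that are already computed. This is a sketch on random stand-in data, not the assignment's dataset class; the sizes (10 messages, batch size 4) are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-ins for precomputed message vectors and their labels.
features = torch.randn(10, 100)        # 10 messages, 100-dimensional vectors
labels = torch.randint(0, 2, (10,))    # 0 = ham, 1 = spam (int64 by default)

# The vectors are computed once, up front; indexing only slices tensors.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for batch_x, batch_y in loader:
    batch_shapes.append((tuple(batch_x.shape), tuple(batch_y.shape)))
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others, so any per-batch statistic should be weighted by the batch size when averaged.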



In [8]:
from torch.utils.data import DataLoader, Dataset

In [53]:
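# Drop test messages that have no in-vocabulary words: their mean vector is NaN.
keep = [i for i, vec in enumerate(test_X) if not np.isnan(vec).any()]
test_X = np.array([test_X[i] for i in keep])
test_Y = test_Y[keep]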

class Data(Dataset):
    """Dataset of precomputed message vectors and their ham/spam labels."""

    def __init__(self, X, y):
        self.X = torch.from_numpy(X.astype(np.float32))
        # class labels must be 64-bit integers for CrossEntropyLoss (np.long is deprecated)
        self.y = torch.from_numpy(y.astype(np.int64))
        self.len = self.X.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return self.len


batch_size = 64
train_data = Data(train_X, train_Y)
test_data = Data(test_X, test_Y)
train_dataloader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(dataset=test_data, batch_size=batch_size, shuffle=True)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if __name__ == '__main__':


**Step 3: Build the neural net model.** 

Once the data is ready, we need to design and implement our neural network model.

You should look [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html) to see how a model can be defined.

The model does not need to be complicated. An example structure could be:

1. linear layer 100 x 15
2. ReLU activation layer
3. linear layer 15 x 2 (think about why the output size is 2 instead of 1)
4. Softmax activation layer

But feel free to test out other combinations of linear layers and activation functions, and check later whether they make a significant difference to the model performance.

In [10]:
import torch.nn as nn
import torch.nn.functional as F

In [11]:
class ToyNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(100, 15)   # 100-d message vector -> 15 hidden units
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(15, 2)     # 15 hidden units -> 2 classes (ham, spam)
        # dim=1 normalizes over the class dimension; omitting dim triggers a warning
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

toyNN = ToyNN()

**Step 4: Train the model with optimizer and loss function.**

Lastly, we need to set up the [optimizer](https://pytorch.org/docs/stable/optim.html) and [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions) to train the model. You may refer to the links for more details. Specifically, we need Stochastic Gradient Descent (SGD) as the optimizer and `CrossEntropyLoss` as the loss function.
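
`nn.CrossEntropyLoss` applies log-softmax to its input internally, so it expects raw, unnormalized scores (logits). The sketch below uses made-up logits for two messages to compare the loss on logits with the loss on probabilities that have already gone through softmax; the tensors and variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])   # made-up scores for 2 messages
labels = torch.tensor([0, 1])                      # 0 = ham, 1 = spam

criterion = nn.CrossEntropyLoss()

# Cross-entropy on logits equals negative log-likelihood on log-softmax.
ce = criterion(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Passing probabilities instead applies softmax a second time inside the loss.
probs = F.softmax(logits, dim=1)
ce_on_probs = criterion(probs, labels)

# The predicted class is the same either way: softmax preserves the ordering.
predicted = probs.argmax(dim=1)
print(ce.item(), manual.item(), ce_on_probs.item(), predicted.tolist())
```

When the inputs are probabilities in [0, 1], the loss for a two-class problem cannot drop below ln(1 + e^-1) ≈ 0.313, which is worth keeping in mind when reading loss values from a model that ends in a Softmax layer.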

The last thing to do is to train the model for several epochs and evaluate its performance periodically. For example, train the model for 5000 epochs and evaluate it every 100 epochs. If you are not sure how the training works, you may refer to the [classification model tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) to see how it is typically done. Don't forget to print the average loss of each epoch to check that the model is being optimized properly.
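
One way to organize such a loop is sketched below. It is an illustration under stated assumptions, not the required solution: the data is synthetic and linearly separable rather than the SMS vectors, the epoch count, learning rate and evaluation interval are arbitrary, `evaluate` is a hypothetical helper, and the model outputs raw scores (no Softmax layer), which is the input `nn.CrossEntropyLoss` expects.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: the label depends only on the sign of feature 0.
X = torch.randn(200, 100)
y = (X[:, 0] > 0).long()
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

model = nn.Sequential(nn.Linear(100, 15), nn.ReLU(), nn.Linear(15, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

def evaluate(model, loader):
    """Return accuracy over the whole loader rather than per batch."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            correct += (model(inputs).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

history = []  # average training loss of each epoch
for epoch in range(1, 51):
    model.train()
    epoch_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * labels.size(0)  # weight by batch size
    avg_loss = epoch_loss / len(train_loader.dataset)
    history.append(avg_loss)
    if epoch % 10 == 0:  # evaluate periodically instead of every epoch
        test_acc = evaluate(model, test_loader)
        print(f"epoch {epoch}: avg loss {avg_loss:.4f}, test acc {test_acc:.3f}")
```

Summing correct predictions and sample counts across batches gives one accuracy figure for the whole test set, which is not skewed by a smaller final batch.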

The evaluation metric should be the [**accuracy**](https://en.wikipedia.org/wiki/Confusion_matrix) of predicting ham/spam on the testing data, i.e., (TP+TN)/(TP+TN+FP+FN). The highest accuracy should be at least **90%**. Try different settings of model structure, learning rate, and number of training epochs to achieve that level of accuracy.
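
As a worked example of the accuracy formula, the sketch below counts the four confusion-matrix cells for a small made-up set of labels and predictions, treating spam (1) as the positive class. The function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN, treating `positive` as the spam class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total

# Made-up labels (1 = spam, 0 = ham) and predictions for 8 messages.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts))  # (2, 4, 1, 1) 0.75
```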

In [12]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(toyNN.parameters(), lr=0.001, momentum=0.9)

In [64]:
for epoch in range(100):  # loop over the dataset multiple times

    loss_values = []
    for i, data in enumerate(train_dataloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass and loss for this mini-batch
        outputs = toyNN(inputs)
        loss = criterion(outputs, labels)
        loss_values.append(loss.item())

        # mini-batch training accuracy: share of rows whose highest-scoring class matches the label
        pred = torch.max(outputs, 1)[1].eq(labels).sum()
        acc = pred * 100.0 / len(inputs)
        print('Epoch: {}, Loss: {}, Accuracy: {}%'.format(epoch+1, loss.item(), acc.numpy()))

        # backward pass and parameter update
        loss.backward()
        optimizer.step()



Epoch: 1, Loss: 0.3633579611778259, Accuracy: 96.875%
Epoch: 1, Loss: 0.44047826528549194, Accuracy: 87.5%
Epoch: 1, Loss: 0.4076085388660431, Accuracy: 90.625%
Epoch: 1, Loss: 0.4160230755805969, Accuracy: 92.1875%
Epoch: 1, Loss: 0.371744841337204, Accuracy: 96.875%
Epoch: 1, Loss: 0.4353334605693817, Accuracy: 89.0625%
Epoch: 1, Loss: 0.4045264720916748, Accuracy: 90.625%
Epoch: 1, Loss: 0.3613974153995514, Accuracy: 98.4375%
Epoch: 1, Loss: 0.41133442521095276, Accuracy: 90.625%
Epoch: 1, Loss: 0.3656587600708008, Accuracy: 96.875%
Epoch: 1, Loss: 0.39611202478408813, Accuracy: 93.75%
Epoch: 1, Loss: 0.409018874168396, Accuracy: 90.625%
Epoch: 1, Loss: 0.42006319761276245, Accuracy: 90.625%
Epoch: 1, Loss: 0.37587684392929077, Accuracy: 95.3125%
Epoch: 1, Loss: 0.37861883640289307, Accuracy: 92.1875%
Epoch: 1, Loss: 0.3641502261161804, Accuracy: 100.0%
Epoch: 2, Loss: 0.41342002153396606, Accuracy: 90.625%
Epoch: 2, Loss: 0.3829618990421295, Accuracy: 95.3125%
Epoch: 2, Loss: 0.386

In [62]:
with torch.no_grad():
    for i, data in enumerate(test_dataloader, 0):
        inputs, labels = data
        outputs = toyNN(inputs)
        # accuracy on this test mini-batch
        pred = torch.max(outputs, 1)[1].eq(labels).sum()
        acc = pred * 100.0 / len(inputs)
        # `loss` is still the last training mini-batch loss, so the same value is printed for every test batch
        print('Loss: {}, Accuracy: {}%'.format(loss.item(), acc.numpy()))

Loss: 0.3595186769962311, Accuracy: 93.75%
Loss: 0.3595186769962311, Accuracy: 89.0625%
Loss: 0.3595186769962311, Accuracy: 85.9375%
Loss: 0.3595186769962311, Accuracy: 93.75%
Loss: 0.3595186769962311, Accuracy: 84.375%
Loss: 0.3595186769962311, Accuracy: 93.75%
Loss: 0.3595186769962311, Accuracy: 95.3125%
Loss: 0.3595186769962311, Accuracy: 93.33333587646484%


