# SMS Spam Binary Classification

The data comes from Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset?resource=download).

In [15]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import torch
import torchtext

from torchtext.legacy.data import Field, BucketIterator, TabularDataset

In [2]:
import nltk
nltk.download('punkt')

from nltk import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SereneWizard\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
data = pd.read_csv('dataset/spam.csv', encoding='latin-1')  # forward slashes avoid invalid '\d' and '\s' escapes
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


From this glimpse of the dataframe, we can see that we only need the columns `v1` and `v2`. 

In [4]:
to_not_drop = ['v1', 'v2']
data = data.filter(items=to_not_drop, axis=1)
# Another way: 
# data = data.drop(data.columns.difference(to_not_drop), axis=1)
new_names = {'v1':'labels', 'v2':'text'}
data = data.rename(index=str, columns=new_names)
data.head()

Unnamed: 0,labels,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Next, the data is split into a training set (80%) and a test set (20%):

In [21]:
train, test = train_test_split(data, test_size=0.2, random_state=29)
train.shape, test.shape

((4457, 2), (1115, 2))

After the random split, the indices of `train` and `test` are no longer contiguous, so we reset them: 

In [6]:
train, test = train.reset_index(drop=True), test.reset_index(drop=True)  # reset_index is not in-place
train, test

(     labels                                               text
 0       ham  Joy's father is John. Then John is the ____ of...
 1       ham  Good Morning my Dear........... Have a great &...
 2       ham                             Prakesh is there know.
 3       ham  IåÕm cool ta luv but v.tired 2 cause i have be...
 4       ham            Call me da, i am waiting for your call.
 ...     ...                                                ...
 4452    ham  Aww that's the first time u said u missed me w...
 4453    ham      Dude ive been seeing a lotta corvettes lately
 4454    ham       I am taking half day leave bec i am not well
 4455   spam  8007 25p 4 Alfie Moon's Children in Need song ...
 4456   spam  This message is brought to you by GMW Ltd. and...
 
 [4457 rows x 2 columns],
      labels                                               text
 0       ham  I've not called you in a while. This is hoping...
 1       ham                  Sorry, I'll call later in meeting
 2      spam

In [24]:
train.to_csv('dataset/train.csv', index=False)
test.to_csv('dataset/test.csv', index=False)

Next, the text has to be tokenized into words. For that we use `word_tokenize` from the `nltk` library, which relies on the `punkt` models downloaded above and is NLTK's standard tokenizer.  
The `Field()` object in `torchtext` specifies how an individual text field is preprocessed and treated. Here, it is configured to tokenize each message into words. 

In [13]:
TEXT = torchtext.legacy.data.Field(tokenize = word_tokenize)

Similarly, a second field is defined for the labels of these text messages. A dedicated class, `LabelField()`, is used for this. With `dtype = torch.float`, it converts the `spam` and `ham` labels to floating-point values. 

In [16]:
LABEL = torchtext.legacy.data.LabelField(dtype = torch.float)

In [17]:
datafields = [('labels', LABEL), ('text', TEXT)]

These two field objects, mapped to their columns in `datafields`, define how each data column is processed on import.  
Next, the raw data that these fields apply to is imported. For this, a `TabularDataset` object is created, which can read CSV and several other file formats.  
Its `splits()` method loads the training and test files as separate datasets in one call. 

In [19]:
trn, tst = torchtext.legacy.data.TabularDataset.splits(path = './dataset', 
                                                       train = 'train.csv', 
                                                       test = 'test.csv', 
                                                       format = 'csv', 
                                                       skip_header = True,
                                                       fields = datafields)

Let's look at a subset of the training dataset. 

In [22]:
trn[:5]

[<torchtext.legacy.data.example.Example at 0x1a3625b5ac0>,
 <torchtext.legacy.data.example.Example at 0x1a362d8d430>,
 <torchtext.legacy.data.example.Example at 0x1a362d8d700>,
 <torchtext.legacy.data.example.Example at 0x1a362d8dc70>,
 <torchtext.legacy.data.example.Example at 0x1a3627fe0d0>]

Every record in the `TabularDataset` is an `Example` object. Looking inside one of them: 

In [23]:
trn[5].__dict__.keys()

dict_keys(['labels', 'text'])

In [28]:
print(trn[5].text)

['Sorry', 'i', "'ve", 'not', 'gone', 'to', 'that', 'place', '.', 'I.ll', 'do', 'so', 'tomorrow', '.', 'Really', 'sorry', '.']


In [29]:
print(trn[5].labels)

ham


To look at all the attributes of an `Example`, the built-in `vars()` function can be used. 

In [30]:
print(vars(trn.examples[5]))

{'labels': 'ham', 'text': ['Sorry', 'i', "'ve", 'not', 'gone', 'to', 'that', 'place', '.', 'I.ll', 'do', 'so', 'tomorrow', '.', 'Really', 'sorry', '.']}


The next step is to convert the words into numbers. Conceptually, each word is first represented with one-hot encoding.  
With one-hot encoding, however, the feature vector for a single word is as long as the vocabulary, which is very large.  
To limit the size of the feature vector, we build the vocabulary on the training data with a maximum size of 10000. That is, only the 10000 most frequent words are kept, and every word outside this top 10000 is treated as an "unknown" word. 

In [35]:
TEXT.build_vocab(trn, max_size = 10000)

Training data will be used once again to build a vocabulary for the labels. 

In [36]:
LABEL.build_vocab(trn)

In [37]:
print(f'Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}')
print(f'Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}')

Unique tokens in TEXT vocabulary: 10002
Unique tokens in LABEL vocabulary: 2


The size of the LABEL vocabulary (2) is as expected, but the TEXT vocabulary has 10002 entries rather than 10000. The two extra tokens in the TEXT vocab are as follows: 

1. \<unk\>: Unknown words outside of the top 10000 words. 
2. \<pad\>: Padding to make sentences the same length. 
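
The vocabulary size of `max_size + 2` can be reproduced with a small sketch in plain Python. `build_vocab_sketch` and `numericalize` below are hypothetical helpers, not torchtext's implementation. They only mimic the behaviour described above on a toy corpus: the two special tokens come first, followed by the most frequent words.

```python
from collections import Counter

def build_vocab_sketch(token_lists, max_size):
    """Hypothetical stand-in for build_vocab: specials first, then the top `max_size` tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = ['<unk>', '<pad>']
    itos += [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Toy corpus (made up): 'call' occurs 3 times, 'now' twice, the rest once
toy_corpus = [['call', 'me', 'now'], ['call', 'now'], ['free', 'call']]
itos, stoi = build_vocab_sketch(toy_corpus, max_size=2)

def numericalize(tokens):
    """Map tokens to IDs; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

print(itos)                                     # ['<unk>', '<pad>', 'call', 'now']
print(numericalize(['free', 'call', 'now']))    # [0, 2, 3]
```

With `max_size=2` the sketch ends up with 4 entries, in the same way that `max_size = 10000` leads to 10002 entries here.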

The text messages have different lengths, but all sequences in a batch must have the same length to be stacked into one tensor for the RNN, so the shorter messages are padded with the \<pad\> token. 
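
As a minimal illustration of padding, the sketch below pads lists of token IDs to the length of the longest one. `pad_batch` is a hypothetical helper written for this example, the token IDs are made up, and the padding index of 1 is an assumption.

```python
def pad_batch(sequences, pad_idx=1):
    """Pad each list of token IDs to the length of the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

batch = [[4, 9, 12], [7], [3, 5]]   # three messages of different lengths
padded = pad_batch(batch)
print(padded)  # [[4, 9, 12], [7, 1, 1], [3, 5, 1]]
```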

Next, looking at the most common 50 words in the vocab:

In [39]:
print(TEXT.vocab.freqs.most_common(50))

[('.', 3896), ('to', 1695), ('I', 1572), (',', 1484), ('you', 1476), ('?', 1231), ('!', 1128), ('a', 1063), ('the', 948), ('...', 890), ('i', 766), ('&', 745), ('and', 689), ('in', 642), ('is', 632), ('u', 621), (';', 621), ('me', 589), (':', 575), ('for', 531), ('..', 521), ('my', 491), ('it', 481), ('your', 463), ('of', 457), ('have', 415), ('on', 404), ('that', 399), ('2', 394), (')', 389), ("'s", 375), ('now', 322), ("'m", 317), ('call', 315), ('do', 313), ('are', 309), ('at', 303), ('be', 303), ("n't", 301), ('not', 296), ('or', 288), ('U', 286), ('with', 284), ('will', 274), ('can', 270), ('*', 264), ('gt', 261), ('get', 261), ('lt', 259), ('so', 251)]


Note that the numbers above are occurrence counts in the training data, not word IDs. Although we described the encoding as one-hot, torchtext stores only a unique integer ID per word. This is the compact form of a one-hot vector: the ID is the position of the single 1 in the feature vector. 
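
The link between a word's integer ID and its one-hot vector can be checked in plain Python. The sketch below uses a made-up 4-word vocabulary and a made-up 3-dimensional embedding table. It shows that multiplying a one-hot vector with the table selects the same row as indexing the table with the ID, which is why storing only the ID is enough.

```python
def one_hot(index, vocab_size):
    """One-hot vector with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def vec_times_matrix(vec, matrix):
    """Row vector times matrix, written out with plain lists."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Toy embedding table: 4 words x 3 dimensions (values are arbitrary)
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.5, -0.4, 0.9],
    [-0.7, 0.8, 0.6],
]

word_id = 2
via_one_hot = vec_times_matrix(one_hot(word_id, 4), embedding_table)
via_lookup = embedding_table[word_id]
print(via_one_hot == via_lookup)  # True
```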

These IDs can be mapped back to words with the `itos` (index-to-string) mapping, and the ID of each string can be looked up with the `stoi` (string-to-index) mapping. 

In [40]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', '.', 'to', 'I', ',', 'you', '?', '!', 'a']


In [41]:
print(LABEL.vocab.stoi)

defaultdict(None, {'ham': 0, 'spam': 1})


The last preprocessing step before feeding the data to the NN model is to build the batch iterators. We use `BucketIterator()` with a batch size of 64. It groups messages of similar length into the same batch, which keeps the amount of padding low: 

In [46]:
batch_size = 64

train_iter, test_iter = torchtext.legacy.data.BucketIterator.splits(
    (trn, tst), 
    batch_size = batch_size, 
    sort_key = lambda x: len(x.text), 
    sort_within_batch = False)
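
The idea behind bucketing can be sketched in plain Python. This is not how `BucketIterator` is implemented; `chunk` and `count_padding` are hypothetical helpers, and the message lengths are made up. The sketch only compares how many padding slots are needed when batches are formed in arrival order versus after sorting by length.

```python
def chunk(items, batch_size):
    """Split a list into consecutive batches of `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def count_padding(batches):
    """Padding slots needed if every batch is padded to its longest sequence."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

# Made-up messages, represented only by their lengths
lengths = [3, 20, 4, 18, 5, 19, 2, 21]
messages = [[0] * n for n in lengths]

naive = chunk(messages, 4)                        # batches in arrival order
bucketed = chunk(sorted(messages, key=len), 4)    # similar lengths together

print(count_padding(naive), count_padding(bucketed))  # 72 12
```

On this toy input, grouping by length reduces the padding from 72 slots to 12.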

## Design of Neural Network

In [51]:
import torch.nn as nn

In [52]:
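class RNN(nn.Module):

    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        # Maps each word index to a trainable dense vector
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # Single-layer, unidirectional vanilla RNN
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        # Maps the final hidden state to the output logit(s)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)        # [sentence_length, batch_size, embedding_dim]
        output, hidden = self.rnn(embedded)    # output: all time steps, hidden: final time step
        hidden_1D = hidden.squeeze(0)          # [batch_size, hidden_dim]
        # Sanity check: the last time step of `output` is the final hidden state
        assert torch.equal(output[-1, :, :], hidden_1D)
        return self.fc(hidden_1D)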

Variables into the model:

* `input_dim`: the size of the one-hot feature vectors, i.e. the number of entries in the vocabulary. 

* `embedding_dim`: the dimension of the word embeddings, the dense vector representations of the words that are learned while training the RNN model. It is a hyperparameter.

* `hidden_dim`: the dimension of the hidden state of the RNN. It is also a hyperparameter that we choose. 

* `output_dim`: the dimension of the output vector (1 for binary classification, since a single value is enough to distinguish class 0 from class 1).

And the dimensions of the feed-forward variables: 

* `text`: \[sentence_length, batch_size \]: Every input sentence is a list of word indices, i.e. the compact form of the one-hot encoded words. 

* `embedded`: \[sentence_length, batch_size, embedding_dim \]: Each word in each sentence is now represented by its dense embedding. 

* `output`: \[sentence_length, batch_size, hidden_dim \]: The hidden states of all time steps, i.e. one per word, stacked along the first dimension. 

* `hidden`: \[1, batch_size, hidden_dim \]: There is one final hidden state for each sentence. This final hidden state of the RNN is fed into the linear layer. 

The leading dimension of size 1 in `hidden` is removed with the `squeeze()` function. For this single-layer, unidirectional RNN, the last time step of `output` should equal `hidden`, which the `assert` in `forward` checks. 
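
The shapes listed above can be checked on a random batch. The sketch below assumes only that `torch` is installed. The sizes are toy values chosen for the example, not the ones used in this notebook, and the input is random indices rather than real messages.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only
vocab_size, embedding_dim, hidden_dim = 50, 8, 16
sentence_length, batch_size = 7, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_dim)

# Random word indices standing in for a batch of tokenized messages
text = torch.randint(0, vocab_size, (sentence_length, batch_size))

embedded = embedding(text)
output, hidden = rnn(embedded)

print(text.shape)      # torch.Size([7, 4])
print(embedded.shape)  # torch.Size([7, 4, 8])
print(output.shape)    # torch.Size([7, 4, 16])
print(hidden.shape)    # torch.Size([1, 4, 16])

# The last time step of `output` matches the final hidden state
last_step_matches = torch.allclose(output[-1], hidden.squeeze(0))
print(last_step_matches)
```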

## Training the model

In [55]:
input_dim = len(TEXT.vocab)

embedding_dim = 100

hidden_dim = 256

output_dim = 1

Instantiating the RNN model: 

In [56]:
model = RNN(input_dim,
            embedding_dim, 
            hidden_dim, 
            output_dim)

The model is trained with the Adam optimizer, using a learning rate of $10^{-6}$.  
The loss function is Binary Cross-Entropy (BCE) with logits, which applies a sigmoid to the model output to get a probability between 0 and 1 and then computes the cross-entropy for binary classification. 
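
To make the "sigmoid plus cross-entropy" combination concrete, here is a minimal sketch for a single example in plain Python. The function names are hypothetical and this is not PyTorch's code: `bce_naive` applies the two steps literally, while `bce_with_logits` uses a commonly used numerically stable rewriting of the same quantity. The logits and targets are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Same loss computed directly from the logit in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0), (0.0, 1.0)]:
    print(logit, target,
          round(bce_naive(logit, target), 6),
          round(bce_with_logits(logit, target), 6))
```

A logit of 0 corresponds to a probability of 0.5 and gives a loss of $\ln 2 \approx 0.693$ for either label. With its default settings, `nn.BCEWithLogitsLoss` averages this per-example loss over the batch.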

In [65]:
lr = 1e-6
optimizer = torch.optim.Adam(model.parameters(), lr = lr)
loss_criterion = nn.BCEWithLogitsLoss()

Next, we build a helper function that runs one epoch of training: 

In [62]:
def trainNN(model, iterator, optimizer, loss_criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1) # Squeeze because output is [batch_size, 1]
        
        loss = loss_criterion(predictions, batch.labels)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        
        correct = (rounded_preds == batch.labels).float()
        acc = correct.sum() / len(correct)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
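
One detail of the helper above is that it averages the per-batch accuracies. When the last batch is smaller than the others, this can differ slightly from the accuracy over all examples. The sketch below illustrates the difference on made-up results (1 = correct, 0 = wrong). Both function names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average of per-batch accuracies, as in the helper above."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def overall_accuracy(batches):
    """Correct predictions divided by the total number of examples."""
    return sum(sum(b) for b in batches) / sum(len(b) for b in batches)

# Made-up correctness flags; the last batch is smaller than the others
batches = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1]]

print(mean_of_batch_accuracies(batches))  # 0.75
print(overall_accuracy(batches))          # 0.8
```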
        
        

In [67]:
num_epochs = 5

for epoch in range(num_epochs):
    train_loss, train_acc = trainNN(model, train_iter, optimizer, loss_criterion)
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% |')

| Epoch: 01 | Train Loss: 0.678 | Train Acc: 84.40%
| Epoch: 02 | Train Loss: 0.649 | Train Acc: 85.54%
| Epoch: 03 | Train Loss: 0.623 | Train Acc: 85.63%
| Epoch: 04 | Train Loss: 0.598 | Train Acc: 85.56%
| Epoch: 05 | Train Loss: 0.576 | Train Acc: 85.64%


In [68]:
epoch_loss = 0
epoch_acc = 0

In [69]:
model.eval()

RNN(
  (embedding): Embedding(10002, 100)
  (rnn): RNN(100, 256)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)

In [71]:
with torch.no_grad():
    for batch in test_iter:
        predictions = model(batch.text).squeeze(1)
        
        loss = loss_criterion(predictions, batch.labels)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        
        correct = (rounded_preds == batch.labels).float()
        acc = correct.sum() / len(correct)
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
test_loss = epoch_loss / len(test_iter)
test_acc = epoch_acc / len(test_iter)

# Report the metrics computed on the test set
print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')
