**Name:Sina Mansourdehghan**

# 0. Introduction

In this notebook, we aim to make a classifier to identify spam messages. We will use a dataset that is consisted of 5000 SMS texts. Some of theses texts are labeled as `spam` while the rest are considered `ham`.

For this aim, we will use **BERT** word-embeddings from the `transformers` library. We will not train a transformer, as it requires a lot of GPU power, but we will fine-tune a pre-trained transformer encoder (**BERT**) for our classification problem.


In [1]:
!pip install --quiet transformers torch

[K     |████████████████████████████████| 5.8 MB 31.6 MB/s 
[K     |████████████████████████████████| 7.6 MB 52.5 MB/s 
[K     |████████████████████████████████| 182 kB 61.9 MB/s 
[?25h

In [2]:
# IMPORTS
from math import ceil
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

from transformers import BertTokenizer, BertModel

In [3]:
# Check if CUDA is available
cuda_available = torch.cuda.is_available()

# Set the device
device = torch.device('cuda' if cuda_available else 'cpu')

# 1. Data

In [4]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)

In [5]:
df.head(5)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
# Replace "spam" labels with 1 and "ham" labels with 0
df['label'] = df['label'].replace({"spam": 1, "ham": 0})

In [7]:
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
# Split the dataframe into train and val sets
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.1)

In [9]:
class CustomDataset:
    def __init__(self, df):
        # Store the dataframe as an instance variable
        self.df = df

    def __getitem__(self, index):
        # Get the row at the given index from the dataframe
        row = self.df.iloc[index]

        # Return the features and label as a tuple
        return row['text'], row['label']


class CustomDataloader:
    def __init__(self, dataset, batch_size, shuffle=False):
        # Store the dataset and batch size as instance variables
        self.dataset = dataset.df
        self.batch_size = batch_size

        # Calculate the number of batches
        self.num_batches = len(self.dataset) // self.batch_size

        # Set the shuffle flag
        self.shuffle = shuffle

    def __len__(self):
        # Return the number of batches
        return self.num_batches

    def __iter__(self, calm=True):
        # Get the indices of the rows in the dataset
        indices = list(range(len(self.dataset)))

        # Shuffle the indices if required
        if self.shuffle:
            np.random.shuffle(indices)

        # Iterate over the batches
        for i in range(self.num_batches):
            # Get the indices for the current batch
            batch_indices = indices[i*self.batch_size : (i+1)*self.batch_size]

            # Get the rows for the current batch
            rows = self.dataset.iloc[batch_indices]

            # Convert the rows to tensors
            features = list(rows['text'].values)
            labels = rows['label'].values
            # features = torch.from_numpy(rows['text'].values.astype(np.float32))
            labels = torch.from_numpy(rows['label'].values)

            # Yield the batch as a tuple
            yield features, labels

In [10]:
# Example usage
# df = pd.read_csv('data.csv')
dataset = CustomDataset(df)
dataloader = CustomDataloader(dataset, batch_size=32, shuffle=True)
i=0
for text, label in dataloader:
    if i<5:
        print('*******************\nbatch: %d'% i) 
        print('number of labels=\n',len(label))
        print('a sample text=\n',text[2])
        print('Type',type(text))
        print('Type label',type(label[0].item()))
    i+=1


###################### (5 points) ##########################

*******************
batch: 0
number of labels=
 32
a sample text=
 Hmm...my uncle just informed me that he's paying the school directly. So pls buy food.
Type <class 'list'>
Type label <class 'int'>
*******************
batch: 1
number of labels=
 32
a sample text=
 WIN: We have a winner! Mr. T. Foley won an iPod! More exciting prizes soon, so keep an eye on ur mobile or visit www.win-82050.co.uk
Type <class 'list'>
Type label <class 'int'>
*******************
batch: 2
number of labels=
 32
a sample text=
 i want to grasp your pretty booty :)
Type <class 'list'>
Type label <class 'int'>
*******************
batch: 3
number of labels=
 32
a sample text=
 Yo theres no class tmrw right?
Type <class 'list'>
Type label <class 'int'>
*******************
batch: 4
number of labels=
 32
a sample text=
 Hi the way I was with u 2day, is the normal way&this is the real me. UR unique&I hope I know u 4 the rest of mylife. Hope u find wot was lost.
Type <class 'list'>
Type label <class 'int'>


# 2. Pretrained Language Model

In this section we will use the pretrained **BERT** model from the `transformers` library with its respective `tokenizer`. **BERT** is a transformer encoder which is suited for various downstream NLP tasks namely *Sequence classification*.

In [14]:
# Defining the tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained("bert-base-uncased")


Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
# Define the input text
text = "This is a sample text."

# Tokenize the text
tokens_ = bert_tokenizer.tokenize(text)
print(tokens_)

['this', 'is', 'a', 'sample', 'text', '.']


In [16]:
# text = "What is your name?"
text = ["What is your name?",'the']
tokenized = bert_tokenizer(text, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
print(len(tokenized))
encoding = bert_model(**tokenized)
print(len(encoding))

3
2


In [17]:
tokenized

{'input_ids': tensor([[ 101, 2054, 2003, 2115, 2171, 1029,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 1996,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0

In [18]:
encoding.pooler_output.size() # batch, hidden_state (embeding)

torch.Size([2, 768])

In [19]:
# Maqximum length: 128
tokenized.input_ids

tensor([[ 101, 2054, 2003, 2115, 2171, 1029,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 1996,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0, 

The 'tokenized' variable contains:

{
    
    'input_ids': tensor([[  101,  1037,  2023,  2003,  2062,  8362,  1012,   102]]),

    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),

    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])

}

What `bert_tokenizer` gets as input. (text, max_length, padding, truncation, return_tensors):

<font color=red>The bert_tokenizer() function is a convenience function provided by the transformers library that combines the steps of tokenization and tensor conversion into a single function call. The function takes several arguments, including:

* max_length: The maximum number of tokens to include in the output. If the input text has more tokens than the specified maximum length, the extra tokens will be truncated.

* padding: Specifies whether to pad the output tensor with zeros to ensure that it has the specified maximum length.

* truncation: Specifies whether to truncate the input text if it has more tokens than the specified maximum length.

* return_tensors: Specifies the format in which to return the tensor. In this case, the tensor will be returned as a PyTorch tensor.

The bert_tokenizer() function returns a dictionary containing the tokenized input text, along with several other pieces of information such as the input mask and the attention mask</font>


# 3. Model

If you inspect the `encoding` of the `BERT`, you will realize that `BERT` gives a vector for each of the tokens included in the input sentence. However, all of these word tokens are not needed for a simple classification task.

Instead, we can use the first token representation, as it captures the whole tokens meanings. `BERT` provides this token for us in a special variable called `pooler_output`. We will use this `pooler_output` as the input of our classification head inside our classifier model.
![BERT pooler output](https://miro.medium.com/max/1100/1*Or3YV9sGX7W8QGF83es3gg.webp)

In [20]:
# pooler_output is the embedding of the [CLS] special token. In many cases,
# it is considered a valid representation of the complete sentence. last_hidden_state contains the final embeddings of 
#all tokens in the sentence from the last hidden state.

In [21]:
class SpamClassifier(nn.Module):
    def __init__(self, embedding_tokenizer, embedding_model):
        super().__init__()
        # Set the embedding size and instantiate the tokenizer and embedding model
        self.embedding_size = embedding_model.config.hidden_size
        self.tokenizer = embedding_tokenizer
        self.embedding = embedding_model

        # Define the classifier and sigmoid layers
        self.classifier = nn.Linear(self.embedding_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # tokenize the input text and convert the tokens to a tensor
        tokens = self.tokenizer(x, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
        tokens=tokens.to(device)
        # generate embeddings for the input text
        pooler_output = self.embedding(**tokens).pooler_output 
        pooler_output=pooler_output.to(device)
        # pass the pooler_output to the classifier layer
        output = self.classifier(pooler_output)
        # apply the sigmoid activation function to the output
        out=self.sigmoid(output)
        out=out.squeeze()
        return out


    def predict(self, x):
        # Get the predicted probability for the input text
        prob = self.forward(x)
        # Convert the probability to a binary class (0 or 1)
        pred = (prob > 0.5)
        pred = pred.int()
        return pred


In [22]:
model = SpamClassifier(bert_tokenizer, bert_model)
model=model.to(device)

In [23]:
model.predict(['salam','salamw2'])

tensor([0, 1], device='cuda:0', dtype=torch.int32)

In [24]:
device

device(type='cuda')

# 4. Training and Evaluation

In [None]:
# # tokenize the input text and convert the tokens to a tensor
# tokens = bert_tokenizer(A, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
# # generate embeddings for the input text
# pooler_output = bert_model(**tokens).pooler_output

In [27]:
# define the learning parameters here (lr and epochs.)
# then initilizer your model, an appropriate optimizer
# and loss function.
import torch.optim as optim

lr = 0.01
epochs = 10
model = SpamClassifier(bert_tokenizer, bert_model)
optimizer = optim.SGD(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [None]:
type(criterion)

torch.nn.modules.loss.CrossEntropyLoss

In [None]:
from tqdm import tqdm
# define the loss function and the optimizer


# train the model for a specified number of epochs
num_epochs = 2
model=model.to(device)
for epoch in range(num_epochs):
    # set the model to training mode
    model.train()

    # use tqdm to display a progress bar during training
    with tqdm(train_dataloader) as pbar:
        for text, label in pbar:
            label=label.to(device)
            # make predictions on the batch of data
            prediction = model(text)

            # calculate the loss
            loss = criterion(prediction, label.float())

            # reset the gradients
            optimizer.zero_grad()

            # backpropagate the loss
            loss.backward()

            # update the model's parameters
            optimizer.step()

            # update the progress bar
            pbar.set_description(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')

    # set the model to evaluation mode
    model.eval()

    # use the model to make predictions on the validation data
    with torch.no_grad():
        for text, label in val_dataloader:
            label=label.to(device)
            predictions = model.predict(text)
            # calculate the accuracy of the model on the validation data
            accuracy = (predictions == label).float().mean().item()
            print(f'Accuracy: {accuracy:.4f}')

Epoch 1/2, Loss: 20.7225: 100%|██████████| 156/156 [01:27<00:00,  1.78it/s]


Accuracy: 0.1875
Accuracy: 0.2188
Accuracy: 0.0938
Accuracy: 0.2188
Accuracy: 0.0938
Accuracy: 0.1562
Accuracy: 0.2188
Accuracy: 0.1250
Accuracy: 0.1875
Accuracy: 0.0000
Accuracy: 0.1875
Accuracy: 0.0625
Accuracy: 0.0625
Accuracy: 0.1250
Accuracy: 0.1562
Accuracy: 0.1250
Accuracy: 0.0625


Epoch 2/2, Loss: 3.4273: 100%|██████████| 156/156 [01:29<00:00,  1.74it/s]


Accuracy: 0.9062
Accuracy: 0.9375
Accuracy: 0.8438
Accuracy: 0.8750
Accuracy: 0.8438
Accuracy: 0.9062
Accuracy: 0.8750
Accuracy: 0.9062
Accuracy: 0.8438
Accuracy: 0.8438
Accuracy: 0.7812
Accuracy: 0.8750
Accuracy: 0.8438
Accuracy: 0.8125
Accuracy: 0.9375
Accuracy: 0.8125
Accuracy: 0.8125


# 5. Using HuggingFace

[HuggingFace library](http://huggingface.co/) has built a nice API for NLP tasks around the transformers. To get familiar with this comrehensive library, In this section you are asked to use the huggingface `Trainer`, `Dataset`, and `BertForSequenceClassification` to do what we did above again.

Feel free to refer to the library documentation to learn about these modules.

In [None]:
# use huggingface Trainer and Dataset API and train the 
# `SpamClassifier`. You should not use the `SpamClassifier`
# we implemented previously. Instead you should use 
# `BertForSequenceClassification` here.

In [28]:
import transformers
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Load the BERT model (which is going to be fine-tuned on our data) and tokenizer
tokenizer2 = transformers.BertTokenizer.from_pretrained('bert-base-cased')
model2 = transformers.BertForSequenceClassification.from_pretrained('bert-base-cased')
model2=model2.to(device)

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [36]:
# Define the Dataset class for the spam email dataset
class SpamDataset(torch.utils.data.Dataset):
  def __init__(self, emails, labels):
    self.emails = emails
    self.labels = torch.tensor(labels)

  def __len__(self):
    return len(self.emails)

  def __getitem__(self, index):
    email = self.emails[index]
    label = self.labels[index]
    input_ids = tokenizer2.encode(email, padding="max_length", return_tensors='pt')
    # tokenized = bert_tokenizer(email, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
    # input_ids=tokenized.input_ids
    input_ids = input_ids.squeeze(0)
    return  {'input_ids': input_ids, 'labels': label}


In [43]:
# Create a DataLoader for the spam email dataset
spam_dataset = SpamDataset(list(train.text.values), list(train.label.values))
val_dataset = SpamDataset(list(val.text.values), list(val.label.values))
data_loader = torch.utils.data.DataLoader(spam_dataset, batch_size=32, shuffle=True)

# Define the loss function and optimization algorithm
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model2.parameters(), lr=1e-3)

In [38]:
spam_dataset.__len__()

5572

In [44]:
# Initialize the Trainer class
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    num_train_epochs=1,
    seed=0,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model2,
    args=args,
    train_dataset=spam_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

# Fine-tune the model for a number of epochs
trainer.train()


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 5014
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 627
  Number of trainable parameters = 108311810


Step,Training Loss,Validation Loss
500,0.0992,0.051854


***** Running Evaluation *****
  Num examples = 558
  Batch size = 8
Saving model checkpoint to output/checkpoint-500
Configuration saved in output/checkpoint-500/config.json
Model weights saved in output/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from output/checkpoint-500 (score: 0.05185406655073166).


TrainOutput(global_step=627, training_loss=0.09037785020551423, metrics={'train_runtime': 536.0955, 'train_samples_per_second': 9.353, 'train_steps_per_second': 1.17, 'total_flos': 1319238831575040.0, 'train_loss': 0.09037785020551423, 'epoch': 1.0})