**Name: Mohammad Bagher Soltani**

**Std. No.: 98105813**

# 0. Introduction

In this notebook, we build a classifier to identify spam messages. We will use a dataset that consists of roughly 5,500 SMS texts. Some of these texts are labeled `spam`, while the rest are labeled `ham`.

To this end, we will use **BERT** embeddings from the `transformers` library. We will not train a transformer from scratch, as that requires a lot of GPU power; instead, we will fine-tune a pre-trained transformer encoder (**BERT**) for our classification problem.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd '/content/drive/MyDrive/Deep Learning HW4'

/content/drive/MyDrive/Deep Learning HW4


In [None]:
!pip install --quiet transformers torch


In [None]:
# IMPORTS
from math import ceil

import pandas as pd
import numpy as np

from tqdm import tqdm
from copy import deepcopy

import torch
import torch.nn as nn

from transformers import BertTokenizer, BertModel
from sklearn.model_selection import train_test_split

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# 1. Data

In [None]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)

In [None]:
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
######################   TODO 1.1   ########################
# change the label column so that `spam` labels get `1` 
# and `ham` gets `0`
###################### (2 points) ##########################
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
######################   TODO 1.2   ########################
# split the dataframe into two sections of train and val. 
# keep the train size 10 times of val.
###################### (3 points) ##########################
data_split = train_test_split(df, test_size=0.1, random_state=42)
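
The split ratio follows directly from `test_size`. The sketch below reproduces the arithmetic in plain Python, assuming that `train_test_split` rounds the validation size up (`ceil(test_size * n)`) and gives the remainder to training. `split_sizes` is a hypothetical helper, and 5572 is the row count shown by the tokenization progress bar later in this notebook.

```python
from math import ceil

def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size.

    Assumption: the validation count is rounded up and the rest goes to training.
    """
    n_val = ceil(test_size * n_samples)
    return n_samples - n_val, n_val

# test_size=0.1, as used in the cell above
n_train, n_val = split_sizes(5572, 0.1)
print(n_train, n_val, round(n_train / n_val, 2))

# test_size=1/11, which targets a 10:1 ratio
n_train_11, n_val_11 = split_sizes(5572, 1 / 11)
print(n_train_11, n_val_11, round(n_train_11 / n_val_11, 2))
```

With `test_size=0.1` the training set is about 9 times the size of the validation set; a ratio of 10 corresponds to `test_size=1/11`.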

In [None]:
######################   TODO 1.3   ########################
# based on what you did in homework 1, create a dataset and 
# a dataloader. Your dataset should return a text with its 
# respective label when iterated.
###################### (10 points) ##########################

class CustomDataset:
    def __init__(self, df):
        self.data = df['text'].values
        self.labels = df['label'].values

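    def __getitem__(self, index):
        # `index` may be a single integer or an array of indices (one batch).
        # `.values` returns numpy arrays, which are indexed directly (no `.iloc`).
        texts = self.data[index]
        labels = self.labels[index]
        if isinstance(texts, np.ndarray):
            texts = texts.tolist()  # the tokenizer expects a list of strings
        return texts, labels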
    
    def __len__(self):
        return len(self.data)


class CustomDataloader:
    def __init__(self, dataset, batch_size, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.indexes = np.arange(len(dataset))
        if shuffle:
            np.random.shuffle(self.indexes)

    def __len__(self):
        # floor division: a trailing partial batch is dropped
        return len(self.dataset) // self.batch_size

    def __iter__(self):
        for i in range(len(self)):
            start = i * self.batch_size
            end = start + self.batch_size
            idx = self.indexes[start:end]
            batch = self.dataset[idx]
            
            yield batch
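
How many batches the loader yields depends on whether the last, partial batch is kept. A minimal sketch of that choice, using a hypothetical `batch_slices` helper rather than the classes above (the sample count of 558 is only illustrative):

```python
from math import ceil

def batch_slices(n_items, batch_size, drop_last=True):
    """Yield (start, end) index pairs for consecutive batches."""
    if drop_last:
        n_batches = n_items // batch_size
    else:
        n_batches = ceil(n_items / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        # clamp the end so the final batch never runs past the data
        yield start, min(start + batch_size, n_items)

# 558 samples with a batch size of 100
dropped = list(batch_slices(558, 100, drop_last=True))
kept = list(batch_slices(558, 100, drop_last=False))
print(len(dropped), dropped[-1])
print(len(kept), kept[-1])
```

Floor division, as in `CustomDataloader.__len__`, silently skips the trailing samples; using `ceil` keeps them at the cost of one smaller final batch.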

In [None]:
######################   TODO 1.4   ########################
# initialize a dataloader for each of your train and val
# splits.
###################### (5 points) ##########################
train_ds = CustomDataset(data_split[0])
val_ds = CustomDataset(data_split[1])

batch_size = 100
train_dl = CustomDataloader(train_ds, batch_size, shuffle=True)
val_dl = CustomDataloader(val_ds, batch_size, shuffle=True)

# 2. Pretrained Language Model

In this section, we will use the pretrained **BERT** model from the `transformers` library together with its `tokenizer`. **BERT** is a transformer encoder that is suited to various downstream NLP tasks, such as *sequence classification*.

In [None]:
# Defining the tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained("bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
text = "What is your name?"
tokenized = bert_tokenizer(text, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
encoding = bert_model(**tokenized)
print(encoding.pooler_output.shape)

torch.Size([1, 768])


**TODO 2.1.** In the section below, explain the arguments that `bert_tokenizer` takes as input (text, max_length, padding, truncation, return_tensors). *(10 points)*

<font color=red>
text: the string (or list of strings) to be tokenized.
<br>
<br>
max_length: the maximum length of the tokenized sequence. Length is measured
<br>
in tokens and includes the special tokens [CLS] and [SEP].
<br>
<br>
padding: specifies how padding is applied. Setting it to "max_length" appends
<br>
[PAD] tokens to the end of the sequence until its total length equals
<br>
max_length (the [CLS] and [SEP] tokens at the start and end count towards
<br>
this length). Setting it to "longest" pads every sequence to the length of
<br>
the longest sequence in the batch.
<br>
<br>
truncation: if the tokenized text is longer than max_length - 2 tokens, the
<br>
extra tokens are removed so that, together with [CLS] and [SEP], the
<br>
sequence is exactly max_length tokens long.
<br>
<br>
return_tensors: if not set, the tokenizer returns ordinary Python lists of integers.
<br>
Otherwise, it returns tensors: 'tf' is for tensorflow.constant, 'pt' is
<br>
for torch.Tensor, 'np' is for numpy.ndarray.
</font>
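
The interplay of `max_length`, `padding` and `truncation` can be illustrated with a toy stand-in. This is only a sketch: `toy_encode` is a hypothetical helper that splits on whitespace instead of running the real WordPiece tokenizer, and it returns token strings rather than ids.

```python
def toy_encode(text, max_length, padding="max_length", truncation=True):
    """Whitespace stand-in for a BERT-style tokenizer (NOT the real algorithm)."""
    tokens = text.lower().split()
    if truncation:
        # leave room for the two special tokens
        tokens = tokens[: max_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    attention_mask = [1] * len(tokens)
    if padding == "max_length":
        n_pad = max(0, max_length - len(tokens))
        tokens += ["[PAD]"] * n_pad
        attention_mask += [0] * n_pad  # padding positions are masked out
    return {"tokens": tokens, "attention_mask": attention_mask}

short = toy_encode("what is your name", max_length=8)
long = toy_encode("free entry in a weekly competition to win tickets", max_length=8)
print(short["tokens"])
print(long["tokens"])
```

Both results have exactly `max_length` entries: the short text is padded, while the long one is cut so that [CLS] and [SEP] still fit.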


# 3. Model

If you inspect the `encoding` returned by `BERT`, you will see that `BERT` produces a vector for each token in the input sentence. However, not all of these token vectors are needed for a simple classification task.

Instead, we can use the representation of the first token ([CLS]), which summarizes the whole sequence. `BERT` exposes it through an output field called `pooler_output`, which is the final hidden state of this first token passed through a dense layer with a tanh activation. We will use `pooler_output` as the input of the classification head inside our classifier model.
![BERT pooler output](https://miro.medium.com/max/1100/1*Or3YV9sGX7W8QGF83es3gg.webp)
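
To make the role of `pooler_output` concrete, here is a dependency-free sketch of the computation: the hidden state of the first token goes through a dense layer with a tanh activation, and a linear head with a sigmoid turns the result into a spam probability. The sizes and weights are made up for illustration (BERT-base uses 768 dimensions and learned weights), and `pool_first_token` and `spam_probability` are hypothetical names.

```python
import math

def pool_first_token(hidden_states, weight, bias):
    """tanh(W @ h_cls + b), where h_cls is the first token's hidden state."""
    h_cls = hidden_states[0]
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(weight, bias)
    ]

def spam_probability(pooled, head_weight, head_bias):
    """Linear layer followed by a sigmoid, giving a value in (0, 1)."""
    logit = sum(w * p for w, p in zip(head_weight, pooled)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy input: 3 tokens with hidden size 2 (illustrative values only).
hidden_states = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
pooled = pool_first_token(
    hidden_states, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0]
)
prob = spam_probability(pooled, head_weight=[2.0, -1.0], head_bias=0.0)
print(pooled, prob, round(prob))
```

Rounding the probability is the same as thresholding it at 0.5, which is how a predicted class can be read off a sigmoid output.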

In [None]:
class SpamClassifier(nn.Module):
    def __init__(self, embedding_tokenizer, embedding_model):
        super().__init__()
        ######################   TODO 3.1   ########################
        # construct layers and structure of the network
        self.embedding_size = 768

        self.tokenizer = embedding_tokenizer
        self.embedding = embedding_model
        self.classifier = nn.Linear(self.embedding_size, 1)
        self.sigmoid = nn.Sigmoid()
        ###################### (10 points) #########################

    def forward(self, x):
        ######################   TODO 3.2   ########################
        # implement the forward pass of your model. first tokenize
        # the sentence, then get the embeddings from your language
        # model, then use the `pooler_output` for your classifier
        # layer.
        tokenized = self.tokenizer(x, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
        tokenized = tokenized.to(device)
        encoding = self.embedding(**tokenized)
        pooler_output = encoding.pooler_output
        output = self.sigmoid(self.classifier(pooler_output))
        output = output.squeeze(1)
        return output
        ###################### (10 points) #########################

    def predict(self, x):
        ######################   TODO 3.3   ########################
        # get the predicted class of x.
        output = self(x)
        prediction = output.round()
        return prediction
        ###################### (5 points) #########################

# 4. Training and Evaluation

In [None]:
######################   TODO 4.1   ########################
# define the learning parameters here (lr and epochs).
# then initialize your model, an appropriate optimizer
# and loss function.
lr = 1e-4
epochs = 2

embedding_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
embedding_model = BertModel.from_pretrained("bert-base-uncased")

model = SpamClassifier(embedding_tokenizer, embedding_model).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr)
criterion = nn.BCELoss()
###################### (10 points) ##########################

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
######################   TODO 4.2   ########################
# implement your training loop and train your model.
# return to homework 1 if needed.
train_losses = []
train_accs = []
val_losses = []
val_accs = []

best_model = None
best_val_loss = np.inf  # start at +inf so the first epoch always improves on it

for epoch in range(epochs):
  
  train_loss = 0
  total = 0
  correct = 0
  model.train()
  with tqdm(enumerate(train_dl), total=len(train_dl)) as pbar:
    for _, (data, labels) in pbar:
      optimizer.zero_grad()  # clear gradients left over from the previous step
      pred = model(data)
      labels = torch.Tensor(labels).to(device)
      loss = criterion(pred, labels)
      train_loss += loss.item()
      loss.backward()
      optimizer.step()

      total += len(data)
      correct += (pred.round() == labels).sum().item()

      # note: the criterion returns a per-batch mean, so train_loss / total
      # is the mean batch loss divided by the batch size
      pbar.set_description(f'Epoch {epoch + 1}: train_loss={(train_loss / total):.3f}, train_accuracy={(correct / total):.4f}')
  
  train_losses.append(train_loss)
  train_accs.append(correct / total)

  val_loss = 0
  total = 0
  correct = 0
  model.eval()
  with torch.no_grad():
    with tqdm(enumerate(val_dl), total=len(val_dl)) as pbar:
      for _, (data, labels) in pbar:
        pred = model(data)
        labels = torch.Tensor(labels).to(device)
        loss = criterion(pred, labels)
        val_loss += loss.item()

        total += len(data)
        correct += (pred.round() == labels).sum().item()

        pbar.set_description(f'Epoch {epoch + 1}: val_loss={(val_loss / total):.3f}, val_accuracy={(correct / total):.4f}')
  
  val_losses.append(val_loss)
  val_accs.append(correct / total)
  # keep a copy of the model with the lowest validation loss so far
  if val_loss < best_val_loss:
    best_val_loss = val_loss
    best_model = deepcopy(model)

###################### (10 points) ##########################

Epoch 1: train_loss=0.003, train_accuracy=0.9264: 100%|██████████| 50/50 [01:32<00:00,  1.85s/it]
Epoch 1: val_loss=0.001, val_accuracy=0.9680: 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]
Epoch 2: train_loss=0.009, train_accuracy=0.6674: 100%|██████████| 50/50 [01:33<00:00,  1.86s/it]
Epoch 2: val_loss=0.004, val_accuracy=0.8760: 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]
Epoch 3: train_loss=0.007, train_accuracy=0.7360: 100%|██████████| 50/50 [01:33<00:00,  1.87s/it]
Epoch 3: val_loss=0.004, val_accuracy=0.8760: 100%|██████████| 5/5 [00:03<00:00,  1.48it/s]
Epoch 4: train_loss=0.005, train_accuracy=0.8642: 100%|██████████| 50/50 [01:34<00:00,  1.88s/it]
Epoch 4: val_loss=0.006, val_accuracy=0.8760: 100%|██████████| 5/5 [00:03<00:00,  1.48it/s]
Epoch 5: train_loss=0.004, train_accuracy=0.8642: 100%|██████████| 50/50 [01:34<00:00,  1.90s/it]
Epoch 5: val_loss=0.004, val_accuracy=0.8760: 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]
Epoch 6: train_loss=0.005, train_accuracy=0.8642: 
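
Keeping the best checkpoint only works if the running best starts at positive infinity, so that the first epoch always counts as an improvement. A minimal sketch using the validation losses logged above for epochs 1 to 5 (`track_best` is a hypothetical helper; a real loop would also copy the model at each improvement):

```python
import math

def track_best(val_losses):
    """Return (best_epoch, best_loss), with epochs numbered from 1."""
    best_epoch, best_loss = None, math.inf  # +inf: any real loss is an improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss

logged = [0.001, 0.004, 0.004, 0.006, 0.004]  # val_loss values from the log above
print(track_best(logged))
```

Starting from `-math.inf` instead would make the comparison false on every epoch, so no checkpoint would ever be stored.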

# 5. Using HuggingFace

The [HuggingFace library](http://huggingface.co/) provides a convenient API for NLP tasks built around transformers. To get familiar with this comprehensive library, in this section you are asked to use the HuggingFace `Trainer`, `Dataset`, and `BertForSequenceClassification` to repeat what we did above.

Feel free to refer to the library documentation to learn about these modules.

In [None]:
!pip install --quiet datasets

In [None]:
######################   TODO 5.1   ########################
# use huggingface Trainer and Dataset API and train the 
# `SpamClassifier`. You should not use the `SpamClassifier`
# we implemented previously. Instead you should use 
# `BertForSequenceClassification` here.
###################### (25 points) #########################
from datasets import Dataset
from transformers import Trainer, BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
ds = Dataset.from_pandas(df)

def tokenize_fn(example):
  return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_ds = ds.map(tokenize_fn).remove_columns(column_names=["text"])
split_ds = tokenized_ds.train_test_split(test_size=0.1)

def compute_metrics(eval_pred):
  # eval_pred holds the raw logits, shape (n_samples, 2), and the true labels
  predictions, labels = eval_pred
  # the predicted class is the index of the largest logit
  predictions = np.argmax(predictions, axis=-1)
  # accuracy computed with numpy directly instead of `load_metric`
  return {"accuracy": float((predictions == labels).mean())}


model = BertForSequenceClassification.from_pretrained('bert-base-cased')
train_dataset = split_ds["train"]
eval_dataset = split_ds["test"]

trainer = Trainer(
    model=model,
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
    )
trainer.train()

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embedd

  0%|          | 0/5572 [00:00<?, ?ex/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/pytorch_model.bin
Some weights of the model checkpoin


Saving model checkpoint to tmp_trainer/checkpoint-500
Configuration saved in tmp_trainer/checkpoint-500/config.json
Model weights saved in tmp_trainer/checkpoint-500/pytorch_model.bin


Step,Training Loss
500,0.081
1000,0.0365
1500,0.0177


Saving model checkpoint to tmp_trainer/checkpoint-1000
Configuration saved in tmp_trainer/checkpoint-1000/config.json
Model weights saved in tmp_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to tmp_trainer/checkpoint-1500
Configuration saved in tmp_trainer/checkpoint-1500/config.json
Model weights saved in tmp_trainer/checkpoint-1500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1881, training_loss=0.037321191085012334, metrics={'train_runtime': 1387.3032, 'train_samples_per_second': 10.843, 'train_steps_per_second': 1.356, 'total_flos': 3957716494725120.0, 'train_loss': 0.037321191085012334, 'epoch': 3.0})
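
For a two-class `BertForSequenceClassification` model, the predictions handed to `compute_metrics` are raw logits with one column per class, so the predicted label is the index of the larger logit rather than a rounded value. A dependency-free sketch of that step (`accuracy_from_logits` is a hypothetical helper, and the numbers are made up):

```python
def accuracy_from_logits(logits, labels):
    """Argmax over each row of logits, then the fraction that matches labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four samples, two classes (0 = ham, 1 = spam); illustrative values only.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))
```

The third sample is misclassified because its class-0 logit is slightly larger, which gives an accuracy of 3 out of 4.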