# Assignment 3: Fine-tuning BERT for Classification Tasks (15 Marks)

## Due: March 24, 2022

Welcome to Assignment 3 of our course on Natural Language Processing. As the name suggests, in this assignment you will learn how to fine-tune a pretrained model like BERT on a downstream task to obtain much better performance than the methods discussed so far. Like previous assignments, we will continue to work on the SST-2 sentiment dataset, and we will also introduce a new task: the [Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398). This assignment will also make heavy use of [Hugging Face's Transformers library](https://huggingface.co/docs/transformers/index). Don't worry if you are not familiar with the library; we will discuss its usage in detail.

Note: Access to a GPU will be crucial for working on this assignment. So do select a GPU runtime in Colab before you start working.

Suggested Reading: [Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](https://arxiv.org/pdf/1810.04805.pdf)


In [2]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    sst_data_dir = "gdrive/MyDrive/PlakshaNLP/Assignment3/data/SST-2"
    mrpc_data_dir = "gdrive/MyDrive/PlakshaNLP/Assignment3/data/MRPC"
except Exception:  # not running on Colab (or the mount failed): fall back to local paths
    sst_data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/source/Assignment3/data/SST-2"
    mrpc_data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/source/Assignment3/data/MRPC"

Mounted at /content/gdrive


In [3]:
# Install required libraries
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install matplotlib
!pip install transformers
!pip install scikit-learn
!pip install seaborn


In [4]:
# We start by importing libraries that we will be making use of in the assignment.
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.optim import Adam
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import copy
import tqdm

Similar to last time, we will again be working on the Stanford Sentiment dataset (SST-2). This time we will also create a validation set by splitting the training data, which we will use for model selection.

In [4]:
# We can use pandas to load the datasets
train_df = pd.read_csv(f"{sst_data_dir}/train.tsv", sep = "\t")
test_df = pd.read_csv(f"{sst_data_dir}/dev.tsv", sep = "\t")

# We reserve 2% of the training data for validation
train_df, val_df = train_test_split(train_df, test_size=0.02, random_state = 42)

print(f"Number of Training Examples: {len(train_df)}")
print(f"Number of Validation Examples: {len(val_df)}")
print(f"Number of Test Examples: {len(test_df)}")

Number of Training Examples: 66002
Number of Validation Examples: 1347
Number of Test Examples: 872


In [5]:
# View a sample of the dataset
train_df.head()

Unnamed: 0,sentence,label
30842,the earnestness of its execution and skill of ...,1
11789,while cherish does n't completely survive its ...,0
21981,"this sad , compulsive life",0
13804,if i stay positive,1
44994,filmmakers david weissman and bill weber benef...,1


## Task 1: Tokenization and Data Preparation

As discussed in the lectures, BERT and other pretrained language models use sub-word tokenization, i.e. individual words can be split into constituent subwords to reduce the vocabulary size. The Transformers library provides tokenizers for all the popular language models. Below we demonstrate how to create and use these tokenizers.

In [16]:
# Import the BertTokenizer from the library
from transformers import BertTokenizer

# Load a pre-trained BERT Tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


`BertTokenizer.from_pretrained` is used to load a pre-trained tokenizer. Notice that we provide the argument `"bert-base-uncased"` to the method. This refers to the variant of BERT that we want to use. The term "base" means we want to use the smaller BERT variant, i.e. the one with 12 layers, and "uncased" means that the text is lower-cased before tokenization, so upper-case and lower-case characters are treated identically. The four main English variants of BERT are:

- `bert-base-uncased`
- `bert-base-cased`
- `bert-large-uncased`
- `bert-large-cased`

Now that we have loaded the tokenizer, let's see how to use it.

The `tokenize` method can be used to split the text into a sequence of tokens.

In [7]:
bert_tokenizer.tokenize("a high-spirited musical that exquisitely blends music , and high drama .")

['a',
 'high',
 '-',
 'spirited',
 'musical',
 'that',
 'exquisite',
 '##ly',
 'blend',
 '##s',
 'music',
 ',',
 'and',
 'high',
 'drama',
 '.']

Notice how the tokenizer splits the text not only into words but also into subwords: for example, "exquisitely" is split into "exquisite" and "##ly". The `##` prefix marks a piece that continues the preceding token.
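To make the splitting behaviour above more concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, the general idea behind WordPiece-style tokenizers. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and written only for illustration; this is not the library's implementation, and the real BERT vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first and shrink until a vocabulary entry matches
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches: the whole word maps to the unknown token
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this sketch)
toy_vocab = {"exquisite", "##ly", "blend", "##s", "music", "drama"}

print(wordpiece_tokenize("exquisitely", toy_vocab))  # ['exquisite', '##ly']
print(wordpiece_tokenize("blends", toy_vocab))       # ['blend', '##s']
print(wordpiece_tokenize("zebra", toy_vocab))        # ['[UNK]']
```

Because rare words decompose into pieces that are already in the vocabulary, the vocabulary can stay relatively small while still covering most text.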

Another use case of the tokenizer is to convert the tokens into indices. This is important because BERT and almost all language models take a sequence of token ids as input, which they map to embeddings. The `convert_tokens_to_ids` method can be used to do this.

In [8]:
sentence = "a high-spirited musical that exquisitely blends music , and high drama ."
tokens = bert_tokenizer.tokenize(sentence)
token_ids = bert_tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1037, 2152, 1011, 24462, 3315, 2008, 19401, 2135, 12586, 2015, 2189, 1010, 1998, 2152, 3689, 1012]
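Conceptually, converting tokens to ids is a dictionary lookup, and the model then uses each id to select a row of its embedding table. The sketch below illustrates both steps with plain Python. The vocabulary, the ids and the 2-dimensional embedding table are all made up for illustration and do not match BERT's, and `tokens_to_ids` is a hypothetical helper rather than the library's method.

```python
def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token's integer id, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# Hypothetical toy vocabulary; the ids are made up and do not match BERT's
toy_vocab = {"[UNK]": 0, "a": 1, "high": 2, "drama": 3, "exquisite": 4, "##ly": 5}

ids = tokens_to_ids(["a", "high", "drama", "exquisite", "##ly", "zebra"], toy_vocab)
print(ids)  # [1, 2, 3, 4, 5, 0]

# Toy embedding table with one 2-dimensional row per vocabulary entry (made-up values)
embedding_table = [[float(i), 0.5 * i] for i in range(len(toy_vocab))]

# The embedding of a token is simply the table row selected by its id
embeddings = [embedding_table[i] for i in ids]
print(embeddings[1])  # [2.0, 1.0]
```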


The two steps can also be combined by simply calling the tokenizer object

In [9]:
bert_tokenizer(sentence)

{'input_ids': [101, 1037, 2152, 1011, 24462, 3315, 2008, 19401, 2135, 12586, 2015, 2189, 1010, 1998, 2152, 3689, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Notice that it returns several fields in addition to the ids. The `"input_ids"` are the token ids that we obtained in the previous cell, with two additional ids: the sequence now starts with 101 and ends with 102. These are special tokens and correspond to the \[CLS\] and \[SEP\] tokens used by BERT.

`"token_type_ids"` indicates which sequence each token belongs to. This is mainly used for sentence-pair tasks and can be ignored for now.

`"attention_mask"` is a mask vector that indicates whether a particular token is a real token or padding. Padding is extremely important when we are dealing with variable-length sequences, which is almost always the case. Through padding we can ensure that all the sequences in a batch are of the same size. However, while processing the sequence we need to ignore these padding tokens, hence a mask is required to identify them.
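Although `token_type_ids` can be ignored for single sentences, it matters for sentence-pair tasks such as paraphrase detection. The sketch below shows the layout BERT uses for a pair, `[CLS] A [SEP] B [SEP]`, with type 0 for the first segment and type 1 for the second. The helper `build_pair_inputs` and the token ids of the two sentences are made up for illustration; with the library, passing two texts to the tokenizer (for example `bert_tokenizer(sentence_a, sentence_b)`) should produce the same kind of layout.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] seen in the tokenizer output above


def build_pair_inputs(ids_a, ids_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching token type ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Made-up token ids standing in for two short tokenized sentences
pair_input_ids, pair_token_type_ids = build_pair_inputs([7, 8, 9], [10, 11])
print(pair_input_ids)       # [101, 7, 8, 9, 102, 10, 11, 102]
print(pair_token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```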

Padding can be enabled by providing a value for `max_length` argument and setting `padding="max_length"`, as shown below

In [10]:
tokenizer_output = bert_tokenizer(sentence, max_length=32, padding="max_length", truncation = True, return_tensors="pt")
input_ids = tokenizer_output["input_ids"]
attn_mask = tokenizer_output["attention_mask"]
print(f"Input Ids:\n {input_ids}\n")

print(f"Attention Mask:\n {attn_mask}\n")

Input Ids:
 tensor([[  101,  1037,  2152,  1011, 24462,  3315,  2008, 19401,  2135, 12586,
          2015,  2189,  1010,  1998,  2152,  3689,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])

Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])



Notice how 0s are appended to the input ids, and the same is reflected in `attn_mask`, where `0` indicates that the token is padding and `1` means otherwise. `truncation = True` ensures that a sequence longer than `max_length` gets truncated. Setting `return_tensors="pt"` returns the outputs as torch tensors.
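The combined effect of `padding="max_length"` and `truncation=True` can be sketched in plain Python. The helper `pad_or_truncate` below is hypothetical and only mimics the behaviour visible in this notebook's outputs: it assumes the ids already start with \[CLS\] and end with \[SEP\], pads with id 0, and keeps the closing \[SEP\] when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long.

    Assumes `token_ids` already starts with [CLS] and ends with [SEP].
    """
    if len(token_ids) > max_length:
        # Drop tokens from the end but keep the closing [SEP]
        token_ids = token_ids[:max_length - 1] + [token_ids[-1]]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A 10-token sequence padded to length 12
short_ids = [101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102]
padded_ids, padded_mask = pad_or_truncate(short_ids, max_length=12)
print(padded_ids)   # [101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A 13-token sequence truncated to length 12: the last content token is dropped, [SEP] is kept
long_ids = [101, 2009, 1005, 1055, 4030, 1011, 1011, 2200, 1010, 2200, 4030, 1012, 102]
truncated_ids, truncated_mask = pad_or_truncate(long_ids, max_length=12)
print(truncated_ids)  # [101, 2009, 1005, 1055, 4030, 1011, 1011, 2200, 1010, 2200, 4030, 102]
```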

## Task 1.1: Custom Dataset Class (2 Marks)

Now that we know how to use the Hugging Face tokenizers, we can define a custom `torch.utils.data.Dataset` class, like we did in the previous assignments, to process and store the data and to provide a way to iterate through the dataset. Implement the `SST2BertDataset` class below. Recall that to create a custom class you need to implement 3 methods: `__init__`, `__len__` and `__getitem__`.

In [9]:
from torch.utils.data import Dataset, DataLoader

class SST2BertDataset(Dataset):
    
    def __init__(self, sentences, labels, seq_len, bert_variant = "bert-base-uncased"):
        """
        Constructor for the `SST2BertDataset` class. Stores the `sentences` and `labels` which can then be used by
        other methods. Also initializes the tokenizer
        
        Inputs:
            - sentences (list) : A list of movie reviews
            - labels (list): A list of sentiment labels corresponding to each review
            - seq_len (int): Length of the sequence to use.
                             If the number of tokens is lower than `seq_len`, add padding; otherwise truncate
            - bert_variant (str): Name of the pretrained BERT variant whose tokenizer is loaded
        """
    
        self.tokenizer = BertTokenizer.from_pretrained(bert_variant)
        self.seq_len = seq_len
        self.sentences = sentences
        self.labels = labels
        self.tokenized_tensors = [self.tokenizer(sentence,max_length = self.seq_len, padding = 'max_length', truncation = True, return_tensors = 'pt') for sentence in sentences]
 

        
    def __len__(self):
        """
        Returns the length of the dataset i.e. the number of reviews present in the dataset
        """
        return len(self.labels)
    
    def __getitem__(self, idx):
        """
        Returns the training example corresponding to review present at the `idx` position in the dataset
        
        Inputs:
            - idx (int): Index corresponding to the review,label to be returned
            
        Returns:
            - input_ids (torch.tensor): Indices of the tokens in the sentence at `idx` position.
                                        Shape of the tensor should be (`seq_len`,)
            - mask (torch.tensor): Attention mask indicating which tokens are padded.
                                   Shape of the tensor should be (`seq_len`,)
            - label (int): Sentiment label for the corresponding sentence
        
        Hint: To get the output from the tokenizer in the form of torch tensors set return_tensors="pt" when calling self.tokenizer 
        """
        
        input_ids = self.tokenized_tensors[idx]['input_ids']
        mask = self.tokenized_tensors[idx]['attention_mask']
        label = self.labels[idx]
        
        return input_ids.squeeze(0), mask.squeeze(0), label

In [12]:
print("Running Sample Test Cases")

sample_sentences = ["unflinchingly bleak and desperate",
                    "it 's slow -- very , very slow .",
                    "it 's a charming and often affecting journey ."]
sample_labels = [0, 0, 1]
sample_seq_len = 12
sample_dataset = SST2BertDataset(sample_sentences, sample_labels, sample_seq_len)

print(f"Sample Test Case 1: Checking if `__len__` is implemented correctly")
dataset_len= len(sample_dataset)
expected_len = len(sample_labels)
print(f"Dataset Length: {dataset_len}")
print(f"Expected Length: {expected_len}")
assert len(sample_dataset) == len(sample_sentences)
print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 2: Checking if `__getitem__` is implemented correctly for `idx= 0`")
sample_idx = 0
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102, 0, 0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
expected_label = 0
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 3: Checking if `__getitem__` is implemented correctly for `idx= 1`")
sample_idx = 1
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([101, 2009, 1005, 1055, 4030, 1011, 1011, 2200, 1010, 2200, 4030, 102])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
expected_label = 0
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 4: Checking if `__getitem__` is implemented correctly for `idx= 2`")
sample_idx = 2
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([101, 2009, 1005, 1055, 1037, 11951, 1998, 2411, 12473, 4990, 1012, 102])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
expected_label = 1
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")



Running Sample Test Cases
Sample Test Case 1: Checking if `__len__` is implemented correctly
Dataset Length: 3
Expected Length: 3
Sample Test Case Passed!
****************************************

Sample Test Case 2: Checking if `__getitem__` is implemented correctly for `idx= 0`
input_ids:
 tensor([  101,  4895, 10258,  2378,  8450,  2135, 21657,  1998,  7143,   102,
            0,     0])
Expected input_ids:
 tensor([  101,  4895, 10258,  2378,  8450,  2135, 21657,  1998,  7143,   102,
            0,     0])
mask:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
Expected mask:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
label:
 0
Expected label:
 0
Sample Test Case Passed!
****************************************

Sample Test Case 3: Checking if `__getitem__` is implemented correctly for `idx= 1`
input_ids:
 tensor([ 101, 2009, 1005, 1055, 4030, 1011, 1011, 2200, 1010, 2200, 4030,  102])
Expected input_ids:
 tensor([ 101, 2009, 1005, 1055, 4030, 1011, 1011, 2200, 1010, 2200, 4030,  10

Creating Datasets and Dataloaders for train, validation and test data. Since pretrained models like BERT have millions of parameters, it is common to use a smaller batch size to reduce the memory footprint.

In [13]:
seq_len = 128
batch_size = 16

train_sentences, train_labels = train_df["sentence"].values, train_df["label"].values
val_sentences, val_labels = val_df["sentence"].values, val_df["label"].values
test_sentences, test_labels = test_df["sentence"].values, test_df["label"].values

train_dataset = SST2BertDataset(train_sentences, train_labels, seq_len=seq_len)
val_dataset = SST2BertDataset(val_sentences, val_labels, seq_len=seq_len)
test_dataset = SST2BertDataset(test_sentences, test_labels, seq_len=seq_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
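As a quick sanity check on the loaders above, the number of batches per epoch follows from the dataset sizes printed earlier and the batch size of 16. The helper `num_batches` is a hypothetical name, and the sketch assumes the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`).

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


# Dataset sizes as printed earlier in the notebook
batch_counts = {
    name: num_batches(n, 16)
    for name, n in [("train", 66002), ("val", 1347), ("test", 872)]
}
print(batch_counts)  # {'train': 4126, 'val': 85, 'test': 55}
```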

## Task 2: Implementing and Training BERT-based Classifier

Similar to pretrained tokenizers, the Transformers library also provides numerous pre-trained language models that can be fine-tuned on a wide variety of downstream tasks. We demonstrate the usage of these models below.

In [19]:
# Import BertModel from the library
from transformers import BertModel

# Create an instance of pretrained BERT
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

As you can see, very similar to how we created the pre-trained tokenizer, we can load a pretrained BERT model by calling `BertModel.from_pretrained("bert-base-uncased")`. The result is just a PyTorch `nn.Module`, like `nn.Linear`, and can similarly be plugged into a network architecture. Also, notice that the model contains 12 BERT layers, where each layer consists of a self-attention layer followed by a sequence of linear layers and activation functions (MLP), as we discussed when talking about the Transformer architecture in the lecture.

In [15]:
sentence = "a high-spirited musical that exquisitely blends music , and high drama ."
tokenizer_output = bert_tokenizer(sentence, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

output = bert_model(input_ids, attention_mask = attn_mask)
output

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.5230, -0.4870, -0.0666,  ..., -0.2906,  0.5917,  0.5204],
                                                        [-0.5298,  0.0526, -0.0479,  ..., -0.5181,  0.6624,  0.5428],
                                                        [-0.5554, -0.4201,  0.1434,  ..., -0.0929,  0.5689, -0.3167],
                                                        ...,
                                                        [-0.0330, -0.1500,  0.3019,  ..., -0.3996,  0.4959, -0.2757],
                                                        [-0.5161, -1.0316,  0.1923,  ...,  0.6252,  0.9381, -0.6699],
                                                        [ 0.5373,  0.1544, -0.3276,  ...,  0.3932, -0.5570, -0.0681]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
     

As you can see, calling `bert_model` returns several different things. Let's go through them one by one.

In [16]:
last_hidden_state = output.last_hidden_state
print(f"input_ids shape: {input_ids.shape}")
print(f"last_hidden_state shape: {last_hidden_state.shape}")

input_ids shape: torch.Size([1, 18])
last_hidden_state shape: torch.Size([1, 18, 768])


For an input of shape `[1, 18]`, i.e. a single sequence of 18 tokens, `last_hidden_state` is a tensor of shape `[1, 18, 768]` holding the 768-dimensional contextual embedding of each of the 18 tokens. These representations can then be used to solve a downstream task by adding a linear layer or MLP on top, and are particularly useful for sequence-labelling tasks.

In [17]:
pooler_output = output.pooler_output
print(f"input_ids shape: {input_ids.shape}")
print(f"pooler_output shape: {pooler_output.shape}")

input_ids shape: torch.Size([1, 18])
pooler_output shape: torch.Size([1, 768])


`pooler_output` is an aggregate representation of the entire sentence and can be thought of as a sentence embedding. It is obtained by passing the final hidden state of the \[CLS\] token through a linear layer followed by a tanh activation. This can be useful for sentence-level tasks like sentiment analysis.
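The computation behind `pooler_output` can be sketched as a dense layer followed by a tanh, applied to the \[CLS\] hidden state only. The example below uses plain Python lists and a made-up 3-dimensional vector with an identity weight matrix. The function name `pooler` and all numbers are assumptions for illustration; the real layer operates on 768-dimensional vectors with learned weights.

```python
import math


def pooler(cls_vector, weight, bias):
    """Compute tanh(W @ h_cls + b) for a single [CLS] vector using plain lists."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Made-up 3-dimensional example; the hidden size of bert-base is 768
h_cls = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]  # identity weights, chosen only to keep the arithmetic easy to follow
b = [0.0, 0.0, 0.0]

pooled = pooler(h_cls, W, b)
print(pooled)  # every entry lies strictly between -1 and 1 because of the tanh
```

Since tanh squashes every coordinate into the interval (-1, 1), the pooled vector has a bounded range regardless of the scale of the hidden state.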

Apart from these two, we can also obtain other values by providing additional arguments. For example, if we want the attention maps, which can be useful for interpreting the model's behavior, we can specify `output_attentions=True` while calling the model.

In [18]:
output = bert_model(input_ids, attention_mask = attn_mask, output_attentions=True)
attentions = output.attentions
print(f"Data type of attentions output: {type(attentions)}")
print(f"Number of elements: {len(attentions)}")
print(f"Shape of individual element: {attentions[0].shape}")
print(f"Example attention map: {attentions[0][0,0]}")

Data type of attentions output: <class 'tuple'>
Number of elements: 12
Shape of individual element: torch.Size([1, 12, 18, 18])
Example attention map: tensor([[0.0394, 0.1036, 0.0293, 0.0427, 0.0286, 0.0309, 0.0662, 0.0268, 0.0501,
         0.0200, 0.0675, 0.0327, 0.0375, 0.0721, 0.0347, 0.0549, 0.0806, 0.1825],
        [0.0649, 0.0452, 0.0417, 0.0654, 0.0789, 0.0430, 0.0496, 0.0633, 0.0430,
         0.0539, 0.0473, 0.0497, 0.0574, 0.0621, 0.0497, 0.0575, 0.0777, 0.0497],
        [0.0283, 0.0353, 0.0166, 0.0490, 0.0736, 0.0902, 0.0363, 0.1293, 0.0442,
         0.0875, 0.0423, 0.0785, 0.0449, 0.0196, 0.0160, 0.1193, 0.0377, 0.0513],
        [0.0383, 0.0532, 0.0579, 0.0280, 0.0793, 0.0480, 0.0412, 0.0809, 0.0484,
         0.0647, 0.0456, 0.0459, 0.0744, 0.0305, 0.0602, 0.0957, 0.0593, 0.0482],
        [0.0096, 0.0322, 0.0157, 0.1455, 0.0169, 0.0352, 0.0554, 0.0405, 0.0580,
         0.0260, 0.0883, 0.0265, 0.0851, 0.0867, 0.0172, 0.0415, 0.1230, 0.0967],
        [0.0380, 0.0107, 0.0485, 0

As you can see, `attentions` is a tuple of 12 elements, one per layer of the network. Each element has shape `[1, 12, 18, 18]`: for the single input sequence, it holds 12 attention maps, one for each of the 12 heads in that layer. A single attention map is an 18x18 matrix in which row i describes how much token i attends to every token in the sequence.
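To see where such a map comes from, and why the attention mask keeps padding from influencing the result, here is a minimal single-head sketch of scaled dot-product attention weights in plain Python. It is a simplification under stated assumptions: the same toy vectors are used as both queries and keys (a real layer applies separate learned projections), and padded positions are set to negative infinity, whereas real implementations typically add a large negative value instead.

```python
import math


def attention_map(queries, keys, attention_mask):
    """Softmax-normalised attention weights for one head; padded keys get zero weight."""
    d = len(queries[0])
    rows = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Masked (padding) positions are pushed to -inf so that softmax assigns them 0
        scores = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Three made-up 2-dimensional token vectors; the third position is padding
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [1, 1, 0]

weights = attention_map(vectors, vectors, mask)
for row in weights:
    print([round(w, 4) for w in row])
```

Each row sums to 1 and the column belonging to the padded position is exactly 0, so the padded token contributes nothing to the representations of the real tokens.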

### Task 2.1: Implementing BERT-based Classifier (2 Marks)

In this task you will implement a BERT-based classifier in PyTorch, very similar to how we created bag-of-words classifiers in the previous assignments. Here, in addition to `nn.Linear`, we will use `BertModel` as a component. Implement the `BertClassifierModel` module below with the architecture BertModel->Linear->Sigmoid.

In [10]:
class BertClassifierModel(nn.Module):
    
    def __init__(self, d_hidden = 768, bert_variant = "bert-base-uncased"):
        """
        Define the architecture of Bert-Based classifier.
        You will mainly need to define 3 components, first a BERT layer
        using `BertModel` from transformers library,
        a linear layer to map the representation from Bert to the output,
        and a sigmoid layer to map the score to a probability
        
        Inputs:
            - d_hidden (int): Size of the hidden representations of bert
            - bert_variant (str): BERT variant to use
        """
        super(BertClassifierModel, self).__init__()
        self.bert_layer = BertModel.from_pretrained(bert_variant)
        self.output_layer = nn.Linear(d_hidden,1)
        self.sigmoid_layer = nn.Sigmoid()
        
      
        
    def forward(self, input_ids, attn_mask):
        """
        Passes the inputs forward through the network and obtains the prediction
        
        Inputs:
            - input_ids (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the sequence of token ids
            - attn_mask (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the attention mask such that padded tokens are 0 and rest 1
                                        
        Returns:
          - output (torch.tensor): A torch tensor of shape [batch_size,] obtained after passing the input to the network
                                        
        
        Hint: Recall which of the outputs from BertModel is appropriate for the sentence classification task.
        """
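        # Encode the batch with BERT; attention maps are not needed for classification
        bert_output = self.bert_layer(input_ids, attention_mask = attn_mask)
        # `pooler_output` is the sentence-level representation of shape [batch_size, d_hidden]
        pooled = bert_output.pooler_output
        logits = self.output_layer(pooled)
        output = self.sigmoid_layer(logits)
        
        return output.squeeze(-1) # Question: Why do squeeze() here? 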

In [20]:
print(f"Running Sample Test Cases!")
torch.manual_seed(42)
model = BertClassifierModel()

print("Sample Test Case 1")
sentence = "a high-spirited musical that exquisitely blends music , and high drama ."
tokenizer_output = bert_tokenizer(sentence, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]
bert_cls_out = model(input_ids, attn_mask).detach().numpy()
expected_bert_cls_out = np.array([0.43614867])
print(f"Input Sentence: {sentence}")
print(f"Model Output: {bert_cls_out}")
print(f"Expected Output: {expected_bert_cls_out}")

assert bert_cls_out.shape == expected_bert_cls_out.shape
assert np.allclose(bert_cls_out, expected_bert_cls_out, 1e-4)
print("Test Case Passed! :)")
print("******************************\n")

print("Sample Test Case 2 (Checking how padding effects the output. It shouldn't!)")
sentence = "a high-spirited musical that exquisitely blends music , and high drama ."
tokenizer_output = bert_tokenizer(sentence,max_length = 30, padding = "max_length", return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]
bert_cls_out = model(input_ids, attn_mask).detach().numpy()
expected_bert_cls_out = np.array([0.43614867])
print(f"Input Sentence: {sentence}")
print(f"Model Output: {bert_cls_out}")
print(f"Expected Output: {expected_bert_cls_out}")

assert bert_cls_out.shape == expected_bert_cls_out.shape
assert np.allclose(bert_cls_out, expected_bert_cls_out, 1e-4)
print("Test Case Passed! :)")
print("******************************\n")

print("Sample Test Case 3. Checking if the model works for batched inputs")
sentences = [
    "a high-spirited musical that exquisitely blends music , and high drama .",
    "unflinchingly bleak and desperate"
]
tokenizer_output = bert_tokenizer(sentences,max_length = 30, padding = "max_length", return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]
bert_cls_out = model(input_ids, attn_mask).detach().numpy()
expected_bert_cls_out = np.array([0.43614867, 0.46988717])
print(f"Input Sentences: {sentences}")
print(f"Model Output: {bert_cls_out}")
print(f"Expected Output: {expected_bert_cls_out}")

assert bert_cls_out.shape == expected_bert_cls_out.shape
assert np.allclose(bert_cls_out, expected_bert_cls_out, 1e-4)
print("Test Case Passed! :)")
print("******************************\n")


Running Sample Test Cases!


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Sample Test Case 1
Input Sentence: a high-spirited musical that exquisitely blends music , and high drama .
Model Output: [0.43614867]
Expected Output: [0.43614867]
Test Case Passed! :)
******************************

Sample Test Case 2 (Checking how padding effects the output. It shouldn't!)
Input Sentence: a high-spirited musical that exquisitely blends music , and high drama .
Model Output: [0.43614867]
Expected Output: [0.43614867]
Test Case Passed! :)
******************************

Sample Test Case 3. Checking if the model works for batched inputs
Input Sentences: ['a high-spirited musical that exquisitely blends music , and high drama .', 'unflinchingly bleak and desperate']
Model Output: [0.43614867 0.46988717]
Expected Output: [0.43614867 0.46988717]
Test Case Passed! :)
******************************



### Task 2.2: Training and Evaluating the Model (5 Marks)

Now that we have implemented the custom Dataset and a BERT-based classifier model, we can start training and evaluating the model. This time we will modify the training loop slightly: at the end of each training epoch we will evaluate on the validation data and check the accuracy. Based on this, we will select the model from the epoch that obtains the highest validation accuracy. You will need to implement the `train` and `evaluate` functions below.
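The model-selection rule described above can be sketched independently of BERT. In the toy loop below, a dictionary stands in for the model parameters and the validation accuracies are made-up numbers. With a real PyTorch model, one common option is to snapshot `copy.deepcopy(model.state_dict())` in the same place.

```python
import copy

# Hypothetical validation accuracies, one per epoch (made-up numbers)
val_accuracies = [0.90, 0.93, 0.92]

params = {"weight": 0.0}  # stand-in for the model parameters
best_acc, best_epoch, best_params = float("-inf"), None, None

for epoch, val_acc in enumerate(val_accuracies):
    params["weight"] += 1.0  # stand-in for one epoch of training updates
    if val_acc > best_acc:   # keep a snapshot only when validation accuracy improves
        best_acc, best_epoch = val_acc, epoch
        # deepcopy so that later training updates do not overwrite the snapshot
        best_params = copy.deepcopy(params)

print(best_epoch, best_acc, best_params)  # 1 0.93 {'weight': 2.0}
```

Without the deep copy, `best_params` would keep pointing at the live parameters and would silently end up holding the weights of the last epoch rather than the best one.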

In [11]:
def get_accuracy(pred_labels, act_labels):
  """
  Calculates the accuracy value by comparing predicted labels with actual labels

  Inputs:
    - pred_labels (numpy.ndarray) : A numpy 1d array containing predicted labels. 
    - act_labels (numpy.ndarray): A numpy 1d array containing actual labels (of same size as pred_labels). 

  Returns:
    - accuracy (float): Number of correct predictions / Total number of predictions

  """
  # Count the positions where the predicted label matches the actual label
  correct_predictions = sum(1 for prediction, actual in zip(pred_labels, act_labels) if prediction == actual)
  accuracy = correct_predictions / len(pred_labels)

  return accuracy



In [12]:
import numpy as np

def convert_probs_to_labels(probs, threshold = 0.5):
  """
  Convert the probabilities to labels by using the specified threshold

  Inputs:
    - probs (numpy.ndarray): A numpy 1d array containing the probabilities predicted by the classifier model
    - threshold (float): A threshold value above which we assign a positive label i.e. 1, and 0 otherwise

  Returns:
    - labels (numpy.ndarray): Labels obtained after thresholding
    
  """
    
  # Vectorized thresholding; returns a numpy array, as documented above
  labels = (np.asarray(probs) > threshold).astype(int)

  return labels

In [13]:
def evaluate(model, test_dataloader, threshold = 0.5, device = "cpu"):
    """
    Evaluates `model` on test dataset

    Inputs:
        - model (BertClassifierModel): BERT based classifier model to be evaluated
        - test_dataloader (torch.utils.DataLoader): A dataloader defined over the test dataset
        - threshold (float): Probability Threshold above which we consider label as 1 and 0 below
        - device (str): Device to evaluate the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - accuracy (float): Accuracy averaged over the batches of the test dataset 
    """
    
    model.eval()
    model = model.to(device)
    accuracy = 0
    
    with torch.no_grad():
      for test_batch in test_dataloader:
        features, masks, labels = test_batch

        # BERT expects integer (long) token ids and attention masks; transfer them to device
        features = features.long().to(device)
        masks = masks.long().to(device)

        # Probability predictions from the model
        pred_probs = model(features, masks)

        # Convert predictions and labels to numpy arrays from torch tensors 
        pred_probs = pred_probs.detach().cpu().numpy()
        labels = labels.detach().cpu().numpy()

        # Threshold the probabilities and get accuracy of predictions for this batch
        pred_labels = convert_probs_to_labels(pred_probs, threshold)
        batch_accuracy = get_accuracy(pred_labels, labels)

        accuracy += batch_accuracy

      # Divide by number of batches to get average accuracy
      accuracy = accuracy / len(test_dataloader)

    return accuracy

    

def train(model, train_dataloader, val_dataloader,
          lr = 1e-5, num_epochs = 3,
          device = "cpu"):
    """
    Runs the training loop. Define the loss function as BCELoss like the last time
    and optimizer as Adam and train for `num_epochs` epochs.

    Inputs:
        - model (BertClassifierModel): BERT based classifier model to be trained
        - train_dataloader (torch.utils.DataLoader): A dataloader defined over the training dataset
        - val_dataloader (torch.utils.DataLoader): A dataloader defined over the validation dataset
        - lr (float): The learning rate for the optimizer
        - num_epochs (int): Number of epochs to train the model for.
        - device (str): Device to train the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - best_model (BertClassifierModel): model corresponding to the highest validation accuracy (checked at the end of each epoch)
        - best_val_accuracy (float): Validation accuracy corresponding to the best epoch
    """
    epoch_loss = 0
    for param in model.parameters():
        param.requires_grad = True
    best_val_accuracy = float("-inf")
    best_model = None

    # 1. Define Loss function and optimizer
    loss_fn = nn.BCELoss()
    optimizer = Adam(model.parameters(),lr=lr)

    for epoch in range(num_epochs):
        model = model.to(device)
        model.train() # Since we are evaluating model at the end of every epoch, it is important to bring it back to train mode
        epoch_loss = 0
        
        # 2. Write Training Loop (store the loss for each batch in epoch_loss like done in previous assignments)
        for train_batch in tqdm.tqdm(train_dataloader):
            # Zero out any gradients stored in the previous steps
            optimizer.zero_grad()

            # Unwrap the batch to get features and labels
            features, masks, labels = train_batch

            # BCELoss expects float targets, while BERT expects integer (long) token ids and attention masks.
            # Cast each tensor to the right type and transfer it to device
            features = features.long().to(device)
            masks = masks.long().to(device)
            labels = labels.float().to(device)


            # Step 3: Feed the input features to the model to get predictions
            preds = model(features, masks)

            # Step 4: Compute the loss and perform backward pass
            loss = loss_fn(preds,labels)
            loss.backward()

            # Step 5: Take optimizer step
            optimizer.step()
            # Store loss value for tracking
            epoch_loss += loss.item()

        
        epoch_loss = epoch_loss / len(train_dataloader)
        
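        # 3. Evaluate on validation data by calling `evaluate` and store the validation accuracy in `val_accuracy`
        print('Evaluating')
        # Pass `device` so that evaluation runs on the same device as training (it defaults to "cpu" otherwise)
        val_accuracy = evaluate(model, val_dataloader, device = device)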
        # Model selection
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = copy.deepcopy(model) # Create a copy of model
        
        print(f"Epoch {epoch} completed | Average Training Loss: {epoch_loss} | Validation Accuracy: {val_accuracy}")

 
    return best_model, best_val_accuracy
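
One detail of `evaluate` is worth keeping in mind: it averages the per-batch accuracies, which matches the overall accuracy only when every batch has the same size. The sketch below illustrates the difference with hypothetical counts (they are not results from this notebook); the helper names are made up for this illustration.

```python
def mean_of_batch_accuracies(correct_per_batch, batch_sizes):
    """Average the per-batch accuracies (the quantity `evaluate` reports)."""
    per_batch = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(per_batch) / len(per_batch)


def overall_accuracy(correct_per_batch, batch_sizes):
    """Total correct predictions divided by total number of examples."""
    return sum(correct_per_batch) / sum(batch_sizes)


# Hypothetical example: two full batches of 16 and a smaller final batch of 8
batch_sizes = [16, 16, 8]
correct_per_batch = [16, 16, 4]

batch_mean = mean_of_batch_accuracies(correct_per_batch, batch_sizes)
overall = overall_accuracy(correct_per_batch, batch_sizes)

print(f"Mean of batch accuracies: {batch_mean:.4f}")
print(f"Overall accuracy: {overall:.4f}")
```

The smaller final batch counts as much as a full batch in the first number, so the two can differ slightly whenever the dataset size is not a multiple of the batch size.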

In [28]:
import warnings
warnings.filterwarnings("ignore")

In [29]:
torch.manual_seed(42)
print("Training on 100 data points for sanity check")
sample_sentences = train_df["sentence"].values.tolist()[:100]
sample_labels = train_df["label"].values.tolist()[:100]
sample_dataset = SST2BertDataset(sample_sentences, sample_labels, seq_len=32)
sample_dataloader = DataLoader(sample_dataset, batch_size=4)

model = BertClassifierModel()
best_model, best_val_acc = train(model, sample_dataloader, sample_dataloader, num_epochs = 5, device = "cuda")
print(f"Best Validation Accuracy: {best_val_acc}")
print(f"Expected Best Validation Accuracy: {0.99}")

Training on 100 data points for sanity check


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 25/25 [00:02<00:00, 10.50it/s]


Evaluating
Epoch 0 completed | Average Training Loss: 0.6932587218284607 | Validation Accuracy: 0.75


100%|██████████| 25/25 [00:01<00:00, 13.62it/s]


Evaluating
Epoch 1 completed | Average Training Loss: 0.6349137878417969 | Validation Accuracy: 0.9


100%|██████████| 25/25 [00:01<00:00, 13.73it/s]


Evaluating
Epoch 2 completed | Average Training Loss: 0.46960262894630433 | Validation Accuracy: 0.99


100%|██████████| 25/25 [00:01<00:00, 13.54it/s]


Evaluating
Epoch 3 completed | Average Training Loss: 0.34585016429424287 | Validation Accuracy: 1.0


100%|██████████| 25/25 [00:01<00:00, 13.69it/s]


Evaluating
Epoch 4 completed | Average Training Loss: 0.19754688173532486 | Validation Accuracy: 1.0
Best Validation Accuracy: 1.0
Expected Best Validation Accuracy: 0.99


You can expect a validation accuracy of about 0.99 by the end of training. It is this high because we trained on just 100 examples and validated on those same examples as a sanity check. This is often done to debug the model and training loop. Let's now train on the entire dataset. This can take some time, approximately 50 minutes per epoch depending on the GPU (the run logged below took about 25 minutes per epoch), since we are fine-tuning all 12 layers of BERT.

In [30]:
model = BertClassifierModel()
best_model, best_val_acc = train(model, train_loader, val_loader, num_epochs = 3, device = "cuda")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 4126/4126 [24:34<00:00,  2.80it/s]


Evaluating
Epoch 0 completed | Average Training Loss: 0.21276546122846648 | Validation Accuracy: 0.9522058823529411


100%|██████████| 4126/4126 [24:34<00:00,  2.80it/s]


Evaluating
Epoch 1 completed | Average Training Loss: 0.11131476718110542 | Validation Accuracy: 0.9595588235294118


100%|██████████| 4126/4126 [24:35<00:00,  2.80it/s]


Evaluating
Epoch 2 completed | Average Training Loss: 0.07546952022857757 | Validation Accuracy: 0.9522058823529411
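
To see which epoch the model-selection rule in `train` keeps, here is a minimal sketch that replays the rule on the three validation accuracies copied from the log above. It only mirrors the comparison logic; no model is involved.

```python
# Validation accuracies copied from the training log above (epochs 0, 1, 2)
val_accuracies = [0.9522058823529411, 0.9595588235294118, 0.9522058823529411]

best_epoch, best_val_accuracy = None, float("-inf")
for epoch, val_accuracy in enumerate(val_accuracies):
    # The strict ">" means a later epoch replaces the current best only if it is strictly better
    if val_accuracy > best_val_accuracy:
        best_epoch, best_val_accuracy = epoch, val_accuracy

print(f"Best epoch: {best_epoch} | Best validation accuracy: {best_val_accuracy:.4f}")
```

Note that the training loss keeps decreasing in epoch 2 while the validation accuracy drops back, which is why the selection is based on validation accuracy rather than on training loss.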


You should expect about 95% validation accuracy. Let's now check how this model performs on the test data.

In [31]:
# torch.save(best_model,'gdrive/MyDrive/PlakshaNLP/Assignment3/best_model.pt')

In [None]:
best_model_ = torch.load('gdrive/MyDrive/PlakshaNLP/Assignment3/best_model.pt')

In [32]:
test_accuracy = evaluate(best_model, test_loader, threshold = 0.5, device = "cuda")
print(test_accuracy)

0.9238636363636363


As you can see, we get about 92% accuracy on the test data! Compare this with the ~80% accuracy that we had been getting with the Bag of Words models in previous assignments. This shows how powerful these pre-trained contextual representations can be in solving such NLP tasks.

### Task 2.3: Making Predictions from scratch (1 Mark)

Similar to assignment 1, implement the function `predict_text` that takes as input the sentence/document to be classified and runs it through the BERT classifier model to obtain the prediction.

In [14]:
def predict_text(text, model, tokenizer, threshold = 0.5,device = "cpu"):
    """
    Predicts the sentiment label for a piece of text using the BERT classifier model
    
    Inputs:
        - text (str): The sentence/document whose sentiment is to be predicted
        - model (BertClassifierModel): Fine-tuned BERT based classifier model
        - tokenizer (BertTokenizer): Pre-trained BERT tokenizer
        - threshold (float): Probability Threshold above which we consider label as 1 and 0 below
        - device (str): Device to run the model on. Can be either 'cuda' (for using gpu) or 'cpu'
    Returns:
        - pred_label (int): Predicted sentiment of the document (1 for positive, 0 for negative)
    """
    
    model = model.to(device)
    model.eval()

    # Tokenize the text into padded input ids and an attention mask (a batch of size 1)
    tensors = tokenizer([text],max_length = 128, padding = 'max_length', truncation = True, return_tensors = 'pt')
    input_id = tensors['input_ids'].to(device)
    mask = tensors['attention_mask'].to(device)

    # Gradients are not needed for inference
    with torch.no_grad():
        prob = model(input_id,mask).tolist()[0]
    pred_label = 1 if prob>threshold else 0
    
    return pred_label

In [46]:
print("Running Sample Test Cases")

print("Sample Test Case 1")
sample_document = "this movie was great"
predicted_label = predict_text(sample_document, best_model, bert_tokenizer)
expected_label = 1
print(f"Sample Text: {sample_document}")
print(f"Predicted Label: {predicted_label}")
print(f"Expected Label: {expected_label}")

assert predicted_label == expected_label

print("**********************************\n")

print("Sample Test Case 2")
sample_document = "A terrible film, 2 hours of my life that I will never get back"
predicted_label = predict_text(sample_document, best_model, bert_tokenizer)
expected_label = 0
print(f"Sample Text: {sample_document}")
print(f"Predicted Label: {predicted_label}")
print(f"Expected Label: {expected_label}")

assert predicted_label == expected_label

print("**********************************\n")



Running Sample Test Cases
Sample Test Case 1
Sample Text: this movie was great
Predicted Label: 1
Expected Label: 1
**********************************

Sample Test Case 2
Sample Text: A terrible film, 2 hours of my life that I will never get back
Predicted Label: 0
Expected Label: 0
**********************************



## Task 3: Fine-tuning BERT on Microsoft Research Paraphrase Corpus (5 Marks)

The Microsoft Research Paraphrase Corpus (MRPC) consists of sentence pairs extracted from online news sources, and the task is to identify whether the two sentences are paraphrases of each other, i.e. whether they have the same meaning. Unlike SST-2, this task operates on a pair of sentences instead of a single sentence. However, the way BERT is trained makes it very easy to handle a pair of sentences by just separating them with a \[SEP\] token.

<img src="https://i.ibb.co/Nx8mK1P/bert-sentence-pair.jpg" alt="bert-sentence-pair" border="0">
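
The layout in the figure can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: whitespace splitting stands in for BERT's WordPiece tokenizer, the pair is assumed to fit within `seq_len`, and `build_pair_input` is a hypothetical helper, not part of any library. In practice the Hugging Face tokenizer accepts a second text argument and returns `token_type_ids` that play the role of the segment ids below.

```python
def build_pair_input(tokens_a, tokens_b, seq_len):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP], padded to seq_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > seq_len:
        raise ValueError("This sketch assumes the pair fits within seq_len")

    # Segment ids: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(tokens)

    num_pad = seq_len - len(tokens)
    tokens = tokens + ["[PAD]"] * num_pad
    segment_ids = segment_ids + [0] * num_pad
    attention_mask = attention_mask + [0] * num_pad
    return tokens, segment_ids, attention_mask


# Made-up sentence pair, split on whitespace instead of WordPiece
tokens, segment_ids, attention_mask = build_pair_input(
    "the dollar surged".split(), "the dollar rose sharply".split(), seq_len=12
)
print(tokens)
print(segment_ids)
print(attention_mask)
```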

Hence we just need to modify the custom dataset to do this concatenation operation, and the rest of the code for models, training and evaluation can essentially stay the same! We load the dataset below:

In [5]:
def load_mrpc_dataset(split = "train"):
    filename = os.path.join(mrpc_data_dir, f"msr_paraphrase_{split}.txt")
    sentence1s = []
    sentence2s = []
    labels = []
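    with open(filename) as f:
        for i,line in enumerate(f):
            # Skip the header row
            if i == 0:
                continue
            # Strip the trailing newline so that it does not end up in sentence2
            row = line.rstrip("\n").split("\t")
            # Column 0 holds the label, columns 3 and 4 hold the two sentences
            sentence1 = row[3]
            sentence2 = row[4]
            label = row[0]
            sentence1s.append(sentence1)
            sentence2s.append(sentence2)
            labels.append(int(label))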
    
    return pd.DataFrame({
        "sentence1": sentence1s,
        "sentence2" : sentence2s,
        "label" : labels
    })


mrpc_train_df = load_mrpc_dataset("train")
mrpc_train_df, mrpc_val_df = train_test_split(mrpc_train_df, test_size=0.1, random_state=42)
mrpc_test_df = load_mrpc_dataset("test")

print(f"Number of Training Examples: {len(mrpc_train_df)}")
print(f"Number of Validation Examples: {len(mrpc_val_df)}")
print(f"Number of Test Examples: {len(mrpc_test_df)}")

Number of Training Examples: 3668
Number of Validation Examples: 408
Number of Test Examples: 1725
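
As a quick sanity check on these sizes, the arithmetic below reproduces the split from the counts printed above. It assumes the validation size is rounded up when `test_size` is a fraction, and it uses a batch size of 16, the value chosen later in this notebook.

```python
import math

n_total = 3668 + 408   # rows in the MRPC training file, from the counts printed above
test_size = 0.1        # fraction held out for validation
batch_size = 16        # batch size used later in this notebook

# Assumption: the validation size is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_total)
n_train = n_total - n_val

# Number of batches per epoch, and the size of the last (partial) validation batch
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_val_batch = n_val - (val_batches - 1) * batch_size

print(n_train, n_val, train_batches, val_batches, last_val_batch)
```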


In [6]:
mrpc_train_df.head()

| | sentence1 | sentence2 | label |
|---|---|---|---|
| 1789 | No Americans were reported among the casualtie... | None of the casualties was Americans, said Cap... | 1 |
| 393 | Microsoft is preparing to alter its Internet E... | Microsoft Corp. is preparing changes to its In... | 1 |
| 2390 | "This fire is going to have a great potential ... | "The fire is going to have great potential to ... | 1 |
| 1940 | Federal offices were to remain closed for a se... | The Government shut down in Washington, and fe... | 1 |
| 170 | In Canada, the booming dollar will be in focus... | In Canada, the surging dollar was in focus aga... | 1 |


The `"sentence1"` and `"sentence2"` columns contain the two sentences respectively, and the `"label"` column contains the label, where 1 indicates that the two sentences are paraphrases and 0 that they are not.

From here we remove the training wheels and ask you to implement the fine-tuning pipeline for this task yourself. As mentioned before, very few changes are needed to the functions/classes we have already defined for fine-tuning on the SST-2 dataset. We will evaluate based on whether you could fine-tune the model on the MRPC dataset and evaluate it on its test set. You should expect an accuracy of about 83% on the test set.

In [17]:
seq_len = 128
batch_size = 16

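# Join each sentence pair with a literal "[SEP]" marker so that the single-sentence SST2BertDataset can be reused as-is.
# This relies on the BERT tokenizer recognizing "[SEP]" in the text as its special separator token.
train_sentences, train_labels = mrpc_train_df["sentence1"].values+"[SEP]"+mrpc_train_df["sentence2"].values, mrpc_train_df["label"].values
val_sentences, val_labels = mrpc_val_df["sentence1"].values+"[SEP]"+mrpc_val_df["sentence2"].values, mrpc_val_df["label"].values
test_sentences, test_labels = mrpc_test_df["sentence1"].values+"[SEP]"+mrpc_test_df["sentence2"].values, mrpc_test_df["label"].values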

train_dataset = SST2BertDataset(train_sentences, train_labels, seq_len=seq_len)
val_dataset = SST2BertDataset(val_sentences, val_labels, seq_len=seq_len)
test_dataset = SST2BertDataset(test_sentences, test_labels, seq_len=seq_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [20]:
model = BertClassifierModel()
best_model, best_val_acc = train(model, train_loader, val_loader, num_epochs = 3, device = "cuda")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 230/230 [02:49<00:00,  1.36it/s]


Evaluating




Epoch 0 completed | Average Training Loss: 0.5677496584861175 | Validation Accuracy: 0.7980769230769231


100%|██████████| 230/230 [02:48<00:00,  1.37it/s]


Evaluating
Epoch 1 completed | Average Training Loss: 0.3825737151438775 | Validation Accuracy: 0.8125


100%|██████████| 230/230 [02:48<00:00,  1.37it/s]


Evaluating
Epoch 2 completed | Average Training Loss: 0.24170980938588796 | Validation Accuracy: 0.8100961538461539


In [21]:
print('Validation accuracy :',best_val_acc)

Validation accuracy : 0.8125
