<a href="https://colab.research.google.com/github/sinaabbasi1/NLP-MSc/blob/main/Assignments/Assignment%2003/NLP_Assignment_03_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Efficient Fine-tuning

Efficient fine-tuning is an emerging family of techniques in NLP that reduce the computational cost and time required to fine-tune pre-trained models. Fine-tuning a pre-trained language model for a downstream task has become a prevalent paradigm and has shown great success in improving performance, but it can incur high computational costs and long training times. Efficient fine-tuning approaches aim to address these limitations with techniques such as knowledge distillation, parameter sharing, and pruning. These approaches have shown promising results in reducing training time and memory requirements while matching or even surpassing the effectiveness of traditional fine-tuning.

## Adapter-tuning

[Adapter-tuning](https://proceedings.mlr.press/v97/houlsby19a/houlsby19a.pdf) is a parameter-efficient technique for fine-tuning a pre-trained language model on a downstream task. Rather than training all parameters of the pre-trained model, this method trains only small adapter modules injected into each layer, while the remaining parameters stay frozen.

For this part, your task is to fine-tune the pre-trained [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) model using the adapter technique on the IMDB Movie Review dataset, which contains labeled reviews for sentiment analysis.

Finally, the performance of the fine-tuned model should be evaluated on the test set by computing metrics like accuracy, precision, recall, and F1-score.

### Some notes:

* You can load the pre-trained model from Hugging Face.
* You are not allowed to use the existing adapter library.
* Feel free to experiment with different hyperparameters such as learning rate, batch size, and the number of epochs to find values that produce satisfactory results.



# Prerequisites

First, we install and import libraries we'll need later.

In [None]:
!pip install datasets
!pip install transformers
# The torchtext IMDB datapipe requires the `portalocker` package
!pip install 'portalocker>=2.0.0'

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

from torch.utils.data import DataLoader, Dataset
from transformers import RobertaModel, RobertaTokenizer

import torchtext.datasets as datasets

import os, math
import numpy as np
from tqdm.notebook import tqdm
import random
import copy

import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import clear_output

clear_output()

Next, we'll set the random seeds for reproducibility.

In [None]:
SEED = 43

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Setting the device option:

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# The Data

In this exercise we use the IMDB Movie Review dataset. It contains 50k reviews, each labeled as positive or negative. We use 20k of them (10k for training and 10k for testing), taken as a fixed slice of each split.

In [None]:
ROOT = './data'
train_data, test_data = datasets.IMDB(root=ROOT)

In [None]:
print(f'Number of train samples are {len(list(train_data))}')
print(f'Number of test samples are {len(list(test_data))}')

Number of train samples are 25000
Number of test samples are 25000


Next, we define some key variables that are used later during training.

In [None]:
MAX_LEN = 300
BATCH_SIZE = 32
LR = 1e-05
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)

To use the IMDB dataset with the RoBERTa model, we need to define a specific Dataset class. In this class we tokenize the data and change the labels from 1/2 to 0/1.

In [None]:
class IMDBDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        # data is a list that contains tuples in the form of: (1/2: int (as sentiments), 'text': str)
        self.tokenizer = tokenizer
        self.data = list(data)[5000:15000]
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    # The labels are 1 for negative and 2 for positive
    # Here we want to change it to 0 and 1
    @staticmethod
    def label_change(label):
        if label == 1:
            return 0
        else:
            return 1

    def __getitem__(self, index):
        text = str(self.data[index][1])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.label_change(self.data[index][0]), dtype=torch.float)
        }

In [None]:
train_dataset = IMDBDataset(train_data, tokenizer, MAX_LEN)
test_dataset = IMDBDataset(test_data, tokenizer, MAX_LEN)

In [None]:
len(train_dataset)

10000
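
The `IMDBDataset` class above keeps the fixed slice `[5000:15000]` of each split. If a random subset is preferred instead, one option is to sample indices without replacement. The sketch below is only an illustration: `sample_subset` and `toy_data` are hypothetical names, and the toy list stands in for the `(label, text)` tuples of the real dataset.

```python
import random


def sample_subset(examples, k, seed=43):
    """Return k examples drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    indices = rng.sample(range(len(examples)), k)
    return [examples[i] for i in indices]


# Toy stand-in for the (label, text) tuples of the IMDB data (assumption)
toy_data = [(1 if i < 50 else 2, f"review {i}") for i in range(100)]
subset = sample_subset(toy_data, k=20)
print(len(subset))
```

Sampling this way avoids depending on the order in which the reviews are stored, at the cost of holding the full list in memory first.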

Here we define our dataloaders:

In [None]:
train_dataloader = data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)
test_dataloader = data.DataLoader(test_dataset, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)
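
Because both loaders use `drop_last=True`, the final incomplete batch is discarded. The small sketch below reproduces the arithmetic on a dummy dataset of the same size (`toy_dataset` is a placeholder, not the IMDB data), which is why the evaluation later reports 9984 predictions rather than 10000.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset with 10000 items, mirroring len(train_dataset)
toy_dataset = TensorDataset(torch.zeros(10000, 1))
loader = DataLoader(toy_dataset, batch_size=32, drop_last=True)

n_batches = len(loader)                       # 10000 // 32
n_samples_seen = n_batches * loader.batch_size
print(n_batches, n_samples_seen)  # 312 9984
```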

# The Model

The model we are going to use is RoBERTa. We want to modify it using adapter layers.

In [None]:
# The adapter is a bottleneck that projects the input down to a smaller dimension and then back up to the original dimension
class Adapter(nn.Module):

    def __init__(self, size = 6, model_dim = 768):
        super().__init__()
        self.adapter_block = nn.Sequential(
            nn.Linear(model_dim, size),
            nn.ReLU(),
            nn.Linear(size, model_dim)
        )

    def forward(self, x):

        ff_out = self.adapter_block(x)
        # Skip connection
        adapter_out = ff_out + x

        return adapter_out
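
To get a feel for how small the bottleneck is, the sketch below counts the parameters of one adapter block built with the same defaults as the `Adapter` class above (`model_dim=768`, `size=6`). The variable names are illustrative only.

```python
import torch.nn as nn

model_dim, size = 768, 6  # same defaults as the Adapter class above

bottleneck = nn.Sequential(
    nn.Linear(model_dim, size),   # down-projection: 768*6 weights + 6 biases
    nn.ReLU(),
    nn.Linear(size, model_dim),   # up-projection: 6*768 weights + 768 biases
)
n_params = sum(p.numel() for p in bottleneck.parameters())
expected = (model_dim * size + size) + (size * model_dim + model_dim)
print(n_params, expected)  # 9990 9990
```

For comparison, a single `nn.Linear(768, 768)` layer holds 590,592 parameters. With two adapters in each of the 12 layers, as in the injection code later in this notebook, the adapters add 24 × 9,990 = 239,760 parameters in total.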

In [None]:
# This class takes a given layer and adds an adapter after it
class Adaptered(nn.Module):
    def __init__(self, orig_layer):
        super().__init__()
        self.orig_layer = orig_layer
        self.adapter = Adapter()

    def forward(self, *x):
        orig_out = self.orig_layer(*x)
        output = self.adapter(orig_out)

        return output
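
The adapter paper linked above describes initializing adapters close to the identity function, so that the wrapped model initially behaves like the pre-trained one. The `Adapter` class in this notebook uses PyTorch's default initialization instead. As a sketch of the idea only, the hypothetical `ZeroInitAdapter` below zeroes the up-projection so that wrapping a layer leaves its output unchanged before training.

```python
import torch
import torch.nn as nn


class ZeroInitAdapter(nn.Module):
    """Hypothetical adapter variant whose up-projection starts at zero."""

    def __init__(self, size=6, model_dim=768):
        super().__init__()
        self.down = nn.Linear(model_dim, size)
        self.up = nn.Linear(size, model_dim)
        # With zero weights and bias, the bottleneck contributes nothing at first
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Bottleneck plus skip connection, as in the Adapter class above
        return self.up(torch.relu(self.down(x))) + x


layer_norm = nn.LayerNorm(768)
adapter = ZeroInitAdapter()
x = torch.randn(2, 5, 768)  # (batch, sequence, hidden)

wrapped_out = adapter(layer_norm(x))
print(torch.allclose(wrapped_out, layer_norm(x)))  # True
```

Whether this helps on IMDB is not tested here; it is one design option to try if training is unstable in the first steps.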

In [None]:
# Adding adapters into RoBERTa architecture
class Roberta_with_adapter(nn.Module):

    def __init__(self, model):
        super().__init__()
        self.model = model

        # Freeze the original model parameters. In adapter-tuning, gradients are only needed for the
        # adapter modules (added below), the classification head, and the LayerNorm parameters.
        for name, params in model.named_parameters():
            if not(params.requires_grad and (('LayerNorm' in name) or ('bare_roberta' not in name))):
                params.requires_grad = False

        # Embed adapter layers into the transformer blocks
        for i in range(12):
            self.model.bare_roberta.encoder.layer[i].attention.output.LayerNorm = Adaptered(self.model.bare_roberta.encoder.layer[i].attention.output.LayerNorm)
            self.model.bare_roberta.encoder.layer[i].output.LayerNorm = Adaptered(self.model.bare_roberta.encoder.layer[i].output.LayerNorm)

    def get_model(self):
        return self.model
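
A quick way to check that the freezing logic behaves as intended is to count trainable versus total parameters. The helper below, `count_parameters`, is a hypothetical utility that is not part of the original notebook. It is shown on a toy module so that it runs without downloading RoBERTa, but it can be called on any `nn.Module`.

```python
import torch.nn as nn


def count_parameters(module):
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total


# Toy stand-in: one frozen "backbone" layer, a LayerNorm, and a small head
toy = nn.Sequential(nn.Linear(768, 768), nn.LayerNorm(768), nn.Linear(768, 2))
for p in toy[0].parameters():
    p.requires_grad = False  # freeze the first layer only

trainable, total = count_parameters(toy)
print(f"{trainable} / {total} parameters are trainable")
```

Calling the same helper on the adapter-augmented model should show that only a small fraction of the parameters receive gradients.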


In [None]:
# Defining the RoBERTa model and adding a classification head on top of it
class RobertaClass(nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.bare_roberta = RobertaModel.from_pretrained("roberta-base")
        self.dense = nn.Linear(768, 768)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.bare_roberta(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0] # last_hidden_state (batch_size, sequence_length, hidden_size)
        pooler = hidden_state[:, 0] # take only the CLS token from last_hidden_state
        pooler = self.dense(pooler)
        pooler = nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [None]:
model = Roberta_with_adapter(RobertaClass()).get_model()
model.to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaClass(
  (bare_roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): A

# Fine-tuning

Here we fine-tune the model built in the previous section.

In [None]:
# Defining the loss function and optimizer
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=LR)

criterion = nn.CrossEntropyLoss()
criterion = criterion.to(device)

In [None]:
# Helper that returns the number of correct predictions (a count, not a ratio)
def calculate_accuracy(preds, targets):
    n_correct = (preds==targets).sum().item()
    return n_correct

In [None]:
def train_model(model, optimizer, data_loader, criterion):
    # Set model to train mode
    model.train()

    # loss per epoch and number of correct predictions in order to calculate the accuracy
    epoch_loss = 0
    n_correct = 0

    for data in tqdm(data_loader, desc='Training', leave=False):

        ## Step 1: Move input data to device (only strictly necessary if we use GPU)
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        ## Step 2: Run the model on the input data
        preds = model(ids, mask, token_type_ids)

        ## Step 3: Calculate the loss and accuracy
        loss = criterion(preds, targets)

        max_val, max_idx = torch.max(preds.data, dim=1)
        n_correct += calculate_accuracy(max_idx, targets)

        ## Step 4: Perform backpropagation
        # Before calculating the gradients, we need to ensure that they are all zero.
        # The gradients would not be overwritten, but actually added to the existing ones.
        optimizer.zero_grad()
        # Perform backpropagation
        loss.backward()

        ## Step 5: Update the parameters
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(data_loader), (n_correct / (len(data_loader) * data_loader.batch_size))

In [None]:
EPOCHS = 6

# Training loop
for epoch in tqdm(range(EPOCHS), desc='Epochs'):

    train_loss, train_acc = train_model(model, optimizer, train_dataloader, criterion)

    print(f'Epoch: {epoch + 1:02}')
    print(f'\tTrain loss: {train_loss:.3f} | Train acc: {train_acc * 100:.2f}%')

Epoch: 01
	Train loss: 0.570 | Train acc: 74.54%

Epoch: 02
	Train loss: 0.522 | Train acc: 76.18%

Epoch: 03
	Train loss: 0.427 | Train acc: 81.00%

Epoch: 04
	Train loss: 0.393 | Train acc: 82.79%

Epoch: 05
	Train loss: 0.376 | Train acc: 83.77%

Epoch: 06
	Train loss: 0.365 | Train acc: 84.01%


# Evaluation

Finally, we evaluate the performance of the fine-tuned model on the test set by computing metrics like accuracy, precision, recall, and F1-score.

In [None]:
def eval_model(model, data_loader, criterion):
    # Set model to eval mode
    model.eval()

    # loss per epoch and number of correct predictions in order to calculate the accuracy
    epoch_loss = 0
    n_correct = 0
    preds_list = list()
    targets_list = list()

    with torch.no_grad(): # Deactivate gradients for the following code
        for data in tqdm(data_loader, desc='Evaluation', leave=False):

            ## Step 1: Move input data to device (only strictly necessary if we use GPU)
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            # print(targets.tolist())
            targets_list.extend(targets.tolist())

            ## Step 2: Run the model on the input data
            preds = model(ids, mask, token_type_ids).squeeze()
            # print(preds.tolist())
            preds_list.extend(preds.tolist())

            ## Step 3: Calculate the loss and accuracy
            loss = criterion(preds, targets)

            max_val, max_idx = torch.max(preds.data, dim=1)
            n_correct += calculate_accuracy(max_idx, targets)

            epoch_loss += loss.item()

    return epoch_loss / len(data_loader), (n_correct / (len(data_loader) * data_loader.batch_size)), preds_list, targets_list

## Accuracy

In [None]:
test_loss, test_acc, preds_list, targets_list = eval_model(model, test_dataloader, criterion)
print(f'Test loss: {test_loss:.3f} | Test acc: {test_acc * 100:.2f}%')

Test loss: 0.388 | Test acc: 84.71%


We use torchmetrics to calculate the remaining metrics.

In [None]:
!pip install torchmetrics
clear_output()

In [None]:
from torchmetrics.classification import BinaryPrecision, BinaryRecall, BinaryF1Score

In [None]:
# generated prediction and target lists are correct in size
print(len(preds_list))
print(len(targets_list))

9984
9984


In [None]:
# torchmetrics uses tensors
preds_tensor = torch.tensor(preds_list)
targets_tensor = torch.tensor(targets_list)

In [None]:
# max for each prediction
preds_tensor_max = torch.max(preds_tensor, 1)

## Precision

In [None]:
precision_metric = BinaryPrecision()
precision = precision_metric(preds_tensor_max.indices, targets_tensor).item()
print(f'Precision: {precision * 100:.2f}%')

Precision: 73.19%


## Recall

In [None]:
recall_metric = BinaryRecall()
recall = recall_metric(preds_tensor_max.indices, targets_tensor).item()
print(f'Recall: {recall * 100:.2f}%')

Recall: 61.19%


## F-1 Score

In [None]:
f1score_metric = BinaryF1Score()
f1score = f1score_metric(preds_tensor_max.indices, targets_tensor).item()
print(f'F-1 Score: {f1score * 100:.2f}%')

F-1 Score: 66.65%
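
The torchmetrics results above can be cross-checked by computing the same quantities directly from counts of true positives, false positives, and false negatives. The function `binary_prf` below is a hypothetical helper, shown on a tiny hand-made example rather than on the real predictions. It treats label 1 as the positive class.

```python
import torch


def binary_prf(pred_labels, targets):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = ((pred_labels == 1) & (targets == 1)).sum().item()
    fp = ((pred_labels == 1) & (targets == 0)).sum().item()
    fn = ((pred_labels == 0) & (targets == 1)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative
toy_preds = torch.tensor([1, 1, 0, 0, 1, 0])
toy_targets = torch.tensor([1, 0, 1, 0, 1, 0])
precision, recall, f1 = binary_prf(toy_preds, toy_targets)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 0.667 0.667
```

On the real data, this would be called as `binary_prf(preds_tensor_max.indices, targets_tensor)`.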
