# Fine-tuning BERT for Sentiment Analysis

### Project Overview

In this notebook, we will demonstrate the process of fine-tuning a pre-trained language model, specifically `bert-base-uncased`, for sentiment analysis on the Stanford IMDB reviews dataset. Fine-tuning is a critical step in transferring the knowledge learned by a large pre-trained model, such as BERT, to a specific downstream task like sentiment classification. By applying fine-tuning techniques, we can adapt the model to our dataset and achieve higher performance with fewer labeled examples than training from scratch.

The BERT model (`bert-base-uncased`) we are using is a Transformer-based model that has been pre-trained on a large corpus of unlabeled English text in a self-supervised manner (masked language modeling and next-sentence prediction). This model has proven highly effective on a variety of NLP tasks. Our dataset, the Stanford IMDB reviews, is a well-known benchmark for sentiment analysis, consisting of movie reviews labeled as positive or negative.

We will work on a smaller version of the dataset (1000 examples) to speed up training.

In [1]:
# Importing necessary libraries
from datasets import load_dataset  # Huggingface datasets
from transformers import (  # Huggingface transformers
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer)
import torch  # Cuda access
import evaluate  # Evaluate the model
import numpy as np


In [2]:
"""Main settings"""
# Check if cuda is present
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Loading the IMDB movie reviews dataset
# This dataset is used for binary sentiment classification (positive/negative).
dataset = load_dataset("stanfordnlp/imdb")

# Set model
model_checkpoint = 'bert-base-uncased'

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, clean_up_tokenization_spaces=True)

# Import accuracy evaluation metric
accuracy = evaluate.load("accuracy")


### Tokenization

Before feeding text data into BERT, it must be tokenized into a format that the model can understand. Tokenization is the process of converting raw text into tokens, which are then mapped to integers that correspond to the model's vocabulary. BERT uses WordPiece tokenization, which splits words into subwords to handle the vast variety of words in natural language more efficiently. This allows BERT to handle out-of-vocabulary words by breaking them down into known subword tokens.
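To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies at inference time. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration; the real `bert-base-uncased` vocabulary has 30,522 entries and its tokenizer also handles lowercasing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary
toy_vocab = {"un", "##believ", "##able", "movie", "##s"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("movies", toy_vocab))        # ['movie', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A word that is not in the vocabulary as a whole is still represented, as long as it can be assembled from known pieces.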



In [3]:
def tokenize_function(examples):
    """Helper function to tokenize text from the Dataset"""
    text = examples["text"]

    # Tokenize and truncate
    tokenizer.truncation_side = 'left'
    tokenized_inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors="np",
        )
    
    return tokenized_inputs


In [4]:
# Tokenize the whole dataset in batches; the new columns are added to each split
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [5]:
# Create a smaller version of the dataset, to speed up the training
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))


In [6]:
# Check the Positive to total ratio for the shuffled smaller datasets
a = sum(small_train_dataset['label'])/len(small_train_dataset['label'])
b = sum(small_eval_dataset['label'])/len(small_eval_dataset['label'])

print(f"Positive to total ratio in train: {a*100}%")
print(f"Positive to total ratio in eval:  {b*100}%")

Positive to total ratio in train: 48.8%
Positive to total ratio in eval:  48.8%


In [7]:
# Define label maps, to be able to pass from labels to id and vice versa
id2label = {0: "NEG", 1: "POS"}
label2id = {"NEG":0, "POS":1}


### Data Collation with Padding

When working with batches of data, each input in a batch must have the same length. `DataCollatorWithPadding` is used to ensure that all sequences in a batch are padded to the same length, which is necessary because different reviews have varying lengths. This padding is dynamically applied based on the longest sequence in each batch, making the process more memory-efficient than static padding to a fixed length.
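The idea behind dynamic padding can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the actual `DataCollatorWithPadding` implementation: the helper `pad_batch` is a hypothetical name, and the token ids in the example are arbitrary apart from `101`/`102` (BERT's `[CLS]`/`[SEP]`) and the padding id `0`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tokenized reviews of different lengths (illustrative ids)
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the target length is recomputed for every batch, a batch of short reviews is never padded out to the 512-token maximum.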

In [8]:
# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


## Standard Fine-Tuning

In standard fine-tuning, all the parameters of the pre-trained model are updated during training: every layer in BERT is adjusted based on the gradients of the loss function. This is computationally expensive (`bert-base-uncased` has roughly 110 million parameters), and the model can easily overfit, especially on small datasets.

In [None]:
# Define model
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
    )


In [9]:
# Initialize model to device
model.to(device)


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [41]:
def test_model(model, data_slice):
    """Helper function to test the model on small set of sentences from data"""

    print("Model predictions:")
    print(f"|Text{' '*50}|Gold |Pred |")
    print(f"{'-'*50}")

    model.eval()  # Disable dropout so predictions are deterministic
    correct_count = 0

    for e in data_slice:
        text = e['text']
        label = id2label[e['label']]

        # Tokenize text
        inputs = tokenizer.encode(
            text,
            max_length=512,
            truncation=True,
            return_tensors="pt"
            ).to(device)

        # Compute logits without tracking gradients
        with torch.no_grad():
            logits = model(inputs).logits
        # Convert logits to a label id (batch of one, so argmax over the classes)
        prediction = torch.argmax(logits, dim=-1).item()

        l_pred = id2label[prediction]
        print(f"|{text[:50]}... |{label}  |{l_pred}  |")

        # Keep track of correct predictions for accuracy
        if label == l_pred:
            correct_count += 1

    print(f"{'-'*50}")
    # Divide by the actual number of examples rather than a hard-coded 10
    print(f"Accuracy: {(correct_count/len(data_slice))*100}%")


In [40]:
# Test model on ten random sentences
example_data = small_train_dataset.shuffle().select(range(10))
test_model(model, example_data)

Model predictions:
|Text                                                  |GOLD |PRED |
--------------------------------------------------
|This is a quirky little movie, and I have to agree... |POS  |NEG  |
|When thinking about Captivity many words come to m... |NEG  |NEG  |
|As other reviewers have noted the film dies in the... |NEG  |NEG  |
|What happens when an army of wetbacks, towelheads,... |NEG  |NEG  |
|if you like gangster type of movies, then this is ... |POS  |NEG  |
|This is part one of a short animation clip showing... |POS  |NEG  |
|Melissa Joan Hart shines! This show is amazing!! T... |POS  |NEG  |
|Well the reason for seeing it in the cinema was th... |NEG  |NEG  |
|Who me? No, I'm not kidding. That's what it really... |NEG  |NEG  |
|Having heard so much about the 1990s Cracker serie... |NEG  |NEG  |
--------------------------------------------------
Accuracy: 60.0%


In [12]:
def compute_metrics(p):
    """Helper function to evaluate the trainer"""
    logits, labels = p
    pred = np.argmax(logits, axis=1)

    # accuracy.compute already returns a dict of the form {"accuracy": value}
    return accuracy.compute(predictions=pred, references=labels)


### Training Hyperparameters

The performance of the fine-tuning process is influenced by several hyperparameters:

- **Learning Rate (`lr`)**: controls the step size of the optimizer, i.e., how much to update the model's weights during each iteration.
- **Batch Size (`batch_size`)**: the number of samples processed before the model's weights are updated. A smaller batch size lowers memory use but gives noisier gradient estimates; that noise can have a regularizing effect on small datasets.
- **Number of Epochs (`num_epochs`)**: the number of complete passes through the training dataset.
- **Optimizer (`optimizer`)**: we use `adamw_torch`, PyTorch's implementation of AdamW, a variant of Adam that applies weight decay directly to the weights (decoupled from the gradient update). Weight decay is a form of regularization: it penalizes large weights, which helps reduce overfitting and improve generalization.
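As a quick sanity check on how these hyperparameters interact, the number of optimizer updates can be worked out from the dataset size. The sketch below assumes a single device and no gradient accumulation, and uses the 1000-example training subset together with a batch size of 8 and 3 epochs.

```python
import math

num_examples = 1000  # size of the reduced training set
batch_size = 8
num_epochs = 3

# One optimizer step per batch; the last batch may be smaller, hence ceil
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Steps per epoch: {steps_per_epoch}")  # Steps per epoch: 125
print(f"Total steps:     {total_steps}")      # Total steps:     375
```

With only a few hundred updates in total, the learning rate has a large influence on whether training converges.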

In [25]:
# Training hyperparameters
lr = 1e-3  # Note: high for full fine-tuning; the BERT paper suggests 2e-5 to 5e-5
batch_size = 8
num_epochs = 3
optimizer = 'adamw_torch'

In [42]:
# Define training arguments
training_args = TrainingArguments(
    output_dir= "bert_finetuned",
    learning_rate=lr,
    optim=optimizer,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100
)


In [43]:
# Define trainer and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [44]:
# Empirical test after fine-tuning
test_model(model, example_data)

Model predictions:
|Text                                                  |Gold |Pred |
--------------------------------------------------
|This is a quirky little movie, and I have to agree... |POS  |NEG  |
|When thinking about Captivity many words come to m... |NEG  |NEG  |
|As other reviewers have noted the film dies in the... |NEG  |NEG  |
|What happens when an army of wetbacks, towelheads,... |NEG  |NEG  |
|if you like gangster type of movies, then this is ... |POS  |NEG  |
|This is part one of a short animation clip showing... |POS  |NEG  |
|Melissa Joan Hart shines! This show is amazing!! T... |POS  |NEG  |
|Well the reason for seeing it in the cinema was th... |NEG  |NEG  |
|Who me? No, I'm not kidding. That's what it really... |NEG  |NEG  |
|Having heard so much about the 1990s Cracker serie... |NEG  |NEG  |
--------------------------------------------------
Accuracy: 60.0%


### Optionally save model
The trained model can also be saved locally so that it can be reloaded and reused later.

In [None]:
# Optionally save model
#trainer.save_model('custom_bert')


In [None]:
# How to call the model back once saved
#model_2 = AutoModelForSequenceClassification.from_pretrained('custom_bert')
#model_2.to(device)

## PEFT and LoRA

PEFT (Parameter-Efficient Fine-Tuning) refers to methods that aim to fine-tune models with a smaller number of trainable parameters. This is particularly useful for large models like BERT, where full fine-tuning can be impractical due to the high computational cost.

LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique designed to reduce the computational and memory costs associated with standard fine-tuning. Instead of updating all the parameters of the pre-trained model, LoRA introduces **trainable low-rank matrices** into the layers of the model.

LoRA works well because the new trainable matrices are added alongside the frozen pre-trained weights, so the model can capture task-specific information without modifying them. For a weight matrix of shape d × k, LoRA trains two factors of shape d × r and r × k, so the number of trainable parameters shrinks as the rank `r` we pass to LoRA gets smaller.
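The low-rank update can be sketched in plain Python. This is a toy illustration under stated assumptions, not the `peft` implementation: the sizes are tiny, the helpers `matmul` and `lora_delta` are hypothetical names, and the update follows the formulation of the LoRA paper, where the effective weight is `W + (alpha / r) * B @ A`, with `A` initialized randomly and `B` initialized to zero.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_B, lora_A, alpha, r):
    """Low-rank update (alpha / r) * B @ A that is added to the frozen weight."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(lora_B, lora_A)]


d, k, r, alpha = 6, 6, 2, 4  # toy sizes; BERT's attention projections are 768 x 768
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]          # frozen weight
lora_A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
lora_B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zeros

delta = lora_delta(lora_B, lora_A, alpha, r)
W_eff = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

full_params = d * k        # parameters updated by full fine-tuning
lora_params = r * (d + k)  # parameters trained by LoRA

print(W_eff == W)                # True: with B = 0 the model starts unchanged
print(full_params, lora_params)  # 36 24
```

At these toy sizes the saving is small, but for a 768 x 768 projection with `r=4` the same formula gives 6,144 trainable parameters instead of 589,824.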

In [45]:
# Free up space
del model
del trainer
if device.type == 'cuda': torch.cuda.empty_cache()

In [46]:
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig


In [47]:
# Generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#model.to(device)

### LoRA Configuration

- `task_type`: "SEQ_CLS" specifies that the task is sequence classification, which is appropriate for sentiment analysis.
- `r`: the rank of the low-rank matrices added by LoRA.
- `lora_alpha`: a scaling factor for the low-rank matrices, controlling the impact of the LoRA weights on the model.
- `lora_dropout`: introduces dropout to the LoRA layers, helping to prevent overfitting.
- `target_modules`: specifies to which modules within the Transformers layers LoRA will be applied. In our case, the `query` and `key` modules, which are key components in the attention mechanism.

In [48]:
# PEFT configuration parameters with LoRA
peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4,
                        lora_alpha=32,
                        lora_dropout=0.01,
                        target_modules=['query', 'key']
                        )

#### Choice of `target_modules`

In the Transformer architecture, which underlies BERT, attention mechanisms are central to the model's ability to understand and process language. The attention mechanism is driven by three main components: `query`, `key`, and `value` vectors. These components are derived from the input embeddings and are used to calculate the attention scores that determine how much attention each token should pay to the other tokens in a sequence.

- **Query**: Represents the input's intent or question—essentially, "what am I looking for?"
- **Key**: Represents the properties of the other tokens—"what do I have that might be relevant?"
- **Value**: Represents the actual information or content that will be attended to, based on the similarity between the query and key.

The attention score for each pair of tokens is computed as the dot product between the `query` vector of one token and the `key` vector of another, scaled by the square root of the key dimension. A softmax over these scores gives the weights applied to the corresponding `value` vectors in the final output. Thus, the `query` and `key` components are critical in shaping how information flows through the model and how relationships between tokens are modeled.
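The computation described above can be written out as a small pure-Python sketch of single-head scaled dot-product attention. The function names and the two-dimensional example vectors are assumptions chosen for illustration; BERT itself uses 12 heads per layer and learned `query`, `key` and `value` projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Two tokens with made-up 2-dimensional vectors
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each token puts more weight on the key that matches its own query, which is the quantity LoRA adjusts when it targets the `query` and `key` projections.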

By focusing on the `query` and `key` modules in LoRA, we are effectively modifying the way the model calculates attention scores. For sentiment analysis, understanding the relationships between words (e.g., negations like "not" and sentiment-laden words like "bad") is crucial. By fine-tuning the `query` and `key` matrices, LoRA allows the model to learn how to better focus on these relationships in the context of sentiment classification.

Moreover, by restricting modifications to the `query` and `key` modules, LoRA leaves the `value` projections, and the general knowledge they encode, untouched, while still allowing the task-specific adjustments needed for sentiment analysis. This targeted approach keeps fine-tuning efficient, making it a good fit when computational resources are limited or when working with smaller datasets.


In [49]:
peft_config


LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='SEQ_CLS', inference_mode=False, r=4, target_modules={'key', 'query'}, lora_alpha=32, lora_dropout=0.01, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))

In [50]:
# Load the model with the PEFT configuration
model = get_peft_model(model, peft_config)

# We can see how much of the total parameters are trainable for the given model
model.print_trainable_parameters()

trainable params: 148,994 || all params: 109,632,772 || trainable%: 0.1359
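The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes that, besides the LoRA matrices on `query` and `key` in each of the 12 layers, the classification head (768 x 2 weights plus 2 biases) is also trainable; under that assumption the arithmetic matches the printed figure exactly.

```python
hidden_size = 768
num_layers = 12
r = 4
target_modules = ["query", "key"]
num_labels = 2
all_params = 109_632_772  # total reported by print_trainable_parameters()

# Each adapted module gets lora_A (r x 768) and lora_B (768 x r)
lora_per_module = r * hidden_size + hidden_size * r
lora_total = lora_per_module * len(target_modules) * num_layers

# Classification head: weight matrix plus bias
classifier = hidden_size * num_labels + num_labels

trainable = lora_total + classifier
print(f"LoRA parameters:       {lora_total:,}")   # 147,456
print(f"Classifier parameters: {classifier:,}")   # 1,538
print(f"Trainable parameters:  {trainable:,}")    # 148,994
print(f"Trainable share:       {100 * trainable / all_params:.4f}%")  # 0.1359%
```

Doubling `r` or adding `value` to `target_modules` scales the LoRA term proportionally, while the classifier term stays fixed.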


In [51]:
model.to(device)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.01, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (defaul

In [60]:
# Training hyperparameters (same as before)
lr = 1e-3
batch_size = 8
num_epochs = 3
optimizer = 'adamw_torch'

In [62]:
# Define training arguments
training_args = TrainingArguments(
    output_dir= "bert_finetuned_lora",
    learning_rate=lr,
    optim=optimizer,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100
)


In [59]:
# Create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train model
trainer.train()


Step,Training Loss
25,0.4485


KeyboardInterrupt: 

In [63]:
test_model(model, example_data)

Model predictions:
|Text                                                  |Gold |Pred |
--------------------------------------------------
|This is a quirky little movie, and I have to agree... |POS  |POS  |
|When thinking about Captivity many words come to m... |NEG  |NEG  |
|As other reviewers have noted the film dies in the... |NEG  |NEG  |
|What happens when an army of wetbacks, towelheads,... |NEG  |NEG  |
|if you like gangster type of movies, then this is ... |POS  |POS  |
|This is part one of a short animation clip showing... |POS  |POS  |
|Melissa Joan Hart shines! This show is amazing!! T... |POS  |POS  |
|Well the reason for seeing it in the cinema was th... |NEG  |NEG  |
|Who me? No, I'm not kidding. That's what it really... |NEG  |NEG  |
|Having heard so much about the 1990s Cracker serie... |NEG  |NEG  |
--------------------------------------------------
Accuracy: 100.0%


### Why LoRA Outperforms Standard Fine-Tuning
LoRA can perform better than standard fine-tuning, particularly on small datasets like ours, because it updates far fewer parameters and is therefore less prone to overfitting. By training only small low-rank matrices while the pre-trained weights stay frozen, LoRA keeps the general knowledge encoded in the pre-trained model while learning task-specific patterns. This is particularly advantageous with limited labeled data, since it makes it harder for the model to memorize the training set. Its smaller memory footprint and lower compute cost also make it a more scalable way to fine-tune large models like BERT.