[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vicvenet/GenAI_for_Innovative_Communications/blob/main/2025_S1/Week_7/sentiment_analysis_distilbert_finetune.ipynb)

# Objectives of today’s workshop

Understanding how to:
1. Identify a suitable pre-trained GenAI model
2. Adapt that model with a lightweight, inexpensive approach: fine-tuning with Low-Rank Adaptation (LoRA)
3. Apply the fine-tuned model to a simple example

# 1. Identify a suitable pre-trained GenAI model

a. Consider the actual use case and the data available

In the hypothetical context of this workshop, a company wants to monitor the sentiment of its customers. The model therefore needs to run fast at a low cost, and it probably only needs to identify basic emotions.

b. Do not reinvent the wheel: Aim to use data that you already have or use an open-source dataset and start with a pre-trained model that fits the actual use case and can be fine-tuned with the available data

While models with billions of parameters are available, companies are more likely to use a smaller model that suits the task and can be fine-tuned with the available data. DistilBERT base uncased is a good candidate for this task:
- It is a distilled version of the popular BERT model, which is known to be good at sentiment analysis
- It is very small (66M parameters), so it can run fast
- It is an encoder-only model pre-trained with masked language modelling, not a text generator, and it can be adapted to a classification task by replacing its pre-training head with a new classification layer that is trained for the specific task. This is what we will do in this workshop.

Many datasets are available for sentiment analysis. We will use a subset of the GoEmotions dataset (https://huggingface.co/datasets/google-research-datasets/go_emotions), a collection of Reddit comments labelled with 28 categories (27 emotions plus neutral). In real life, the company would have its own dataset of customer feedback and would use it to fine-tune the model.

c. Size matters: Aim to use a model that is not too big, so that it can be used at scale at a reasonable cost

# 2. Adapt that model with a lightweight, inexpensive approach: fine-tuning with Low-Rank Adaptation (LoRA)

a. Fine-tune the model with a lightweight, inexpensive approach: Low-Rank Adaptation (LoRA)

b. LoRA freezes the pre-trained weights and trains only small low-rank matrices added to selected layers. This leaves far fewer trainable parameters than full fine-tuning, which makes training cheaper and the resulting adapter easier to store and manage.
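
To get a feel for the savings, the arithmetic can be sketched in plain Python. The figures are assumptions for illustration: DistilBERT's hidden size of 768 and its 6 transformer layers, combined with rank 8, alpha 32 and the four attention projections per layer that the LoRA configuration later in this notebook targets. The count covers only the LoRA matrices, not the new classification head.

```python
# Rough count of the extra parameters LoRA trains (illustrative assumptions)
hidden_size = 768        # DistilBERT hidden dimension
num_layers = 6           # DistilBERT transformer layers
modules_per_layer = 4    # q_lin, k_lin, v_lin, out_lin
r = 8                    # LoRA rank
lora_alpha = 32

# Each adapted weight W (hidden x hidden) gets two small matrices:
# A with shape (r x hidden) and B with shape (hidden x r)
params_per_module = r * hidden_size + hidden_size * r
lora_params = params_per_module * modules_per_layer * num_layers

# Full fine-tuning would update every entry of the same weight matrices
full_params = hidden_size * hidden_size * modules_per_layer * num_layers

# The low-rank update B @ A is multiplied by this factor before being added to W
scaling = lora_alpha / r

print(f"LoRA parameters: {lora_params:,}")
print(f"Same matrices, full fine-tuning: {full_params:,}")
print(f"Ratio: {lora_params / full_params:.2%}")
print(f"Scaling factor: {scaling}")
```

Under these assumptions the adapter trains roughly 2% as many values as fully updating the same attention matrices would.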

Install the libraries with minimum versions (Google Colab already has PyTorch installed). The version specifiers are quoted so that the shell does not treat `>` as an output redirection.

In [None]:
!pip install -q \
    "transformers>=4.30.0" \
    "datasets>=2.12.0" \
    "peft>=0.4.0" \
    "tqdm>=4.65.0" \
    "scikit-learn>=1.2.2"

Import required libraries

In [None]:
from datasets import load_dataset
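from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup
)
# The AdamW class in transformers is deprecated and missing from recent releases,
# so use the PyTorch implementation instead
from torch.optim import AdamW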
from peft import get_peft_model, LoraConfig, TaskType
from tqdm.auto import tqdm
import torch
from torch.nn.utils import clip_grad_norm_
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader
import os
from pathlib import Path

## Directory Setup
Create necessary directories for saving model, dataset, and tokenized data

In [None]:
SAVE_DIR = Path("saved_data")
MODEL_DIR = SAVE_DIR / "model"
DATASET_DIR = SAVE_DIR / "dataset"
TOKENIZED_DIR = SAVE_DIR / "tokenized_dataset"
LORA_DIR = MODEL_DIR / "trained_LoRA"  # New directory for LoRA adapters

os.makedirs(SAVE_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(DATASET_DIR, exist_ok=True)
os.makedirs(TOKENIZED_DIR, exist_ok=True)
os.makedirs(LORA_DIR, exist_ok=True)  # Create LoRA directory
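
The same layout can also be created with `pathlib` alone. This is a minimal sketch, not a replacement for the cell above: it builds the folders under a temporary directory (a stand-in for `saved_data`) so that it can be run without touching the workshop files.

```python
import tempfile
from pathlib import Path

# Temporary root used only for this illustration (stand-in for "saved_data")
root = Path(tempfile.mkdtemp()) / "saved_data"
lora_dir = root / "model" / "trained_LoRA"

# parents=True creates missing parent folders; exist_ok=True makes re-running safe
for directory in (root / "dataset", root / "tokenized_dataset", lora_dir):
    directory.mkdir(parents=True, exist_ok=True)

# List the folders that now exist, relative to the root
print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*")))
```

Because `parents=True` creates intermediate folders, the `model` directory does not need its own call.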

## Dataset Loading
Load the Go Emotions dataset with the Hugging Face dataset library, using cached version if available

Check if dataset is already downloaded

In [None]:
from datasets import load_from_disk

dataset_path = DATASET_DIR / "go_emotions_simplified"
if os.path.exists(dataset_path):
    print("Loading cached dataset...")
    # The copy was written with save_to_disk, so it has to be read with load_from_disk
    dataset = load_from_disk(str(dataset_path))
else:
    print("Downloading dataset...")
    dataset = load_dataset("go_emotions", "simplified")
    dataset.save_to_disk(str(dataset_path))

# Get number of unique labels from the dataset
num_labels = len(set(
    label
    for example in dataset['train']
    for label in example['labels']
))

## Model Initialization
Initialize the DistilBERT model and tokenizer using the Hugging Face library transformers

Initialize the tokenizer

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

Initialize the base model

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels  # Use number of labels from dataset
)

## LoRA Configuration
Configure Low-Rank Adaptation parameters

Configure LoRA with target modules

In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"]
)

Apply LoRA to the model, reusing a previously saved adapter if one exists

In [None]:
from peft import PeftModel

model_path = LORA_DIR / "distilbert_lora_go_emotions"
if os.path.exists(model_path):
    print("Loading saved LoRA adapter...")
    # The saved directory holds only the adapter, so attach it to the base model
    model = PeftModel.from_pretrained(model, str(model_path), is_trainable=True)
else:
    print("Applying a new LoRA adapter...")
    model = get_peft_model(model, lora_config)

## Data Processing
Define tokenization function and prepare the dataset

In [None]:
def tokenize_function(example):
    tokenized = tokenizer(
        example['text'],
        padding='max_length',
        truncation=True,
        max_length=64,  # Set maximum sequence length
        # 64 is safe as the longest sentence in the dataset is 33 words
        # which is most likely less than 64 tokens
        return_tensors=None  # Don't return tensors yet
    )
    # For batched processing, labels will be a list of lists
    if isinstance(example['labels'], list) and isinstance(example['labels'][0], list):
        # Take first label for each example in batch
        tokenized['labels'] = [labels[0] if labels else 0 for labels in example['labels']]
    else:
        # Single example case
        tokenized['labels'] = example['labels'][0] if example['labels'] else 0
    return tokenized
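
GoEmotions comments can carry several emotion labels, while `tokenize_function` keeps only the first one. The effect can be sketched on a toy batch; the helper name `first_label` and the label ids below are made up for illustration.

```python
def first_label(label_lists, default=0):
    """Keep only the first label of each example; fall back to `default` if empty."""
    return [labels[0] if labels else default for labels in label_lists]

# Hypothetical annotations: each comment may carry several emotion ids
batch_labels = [[27], [2, 3], [17, 13, 0], []]
single_labels = first_label(batch_labels)
print(single_labels)  # [27, 2, 17, 0]
```

Every label after the first is discarded, which keeps the task single-label at the cost of ignoring secondary emotions.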

Set the batch size

In [None]:
batch_size = 64

## Dataset Tokenization
Tokenize the dataset and prepare it for training

Check if tokenized dataset exists

In [None]:
tokenized_path = TOKENIZED_DIR / "tokenized_dataset"
if os.path.exists(tokenized_path):
    print("Loading cached tokenized dataset...")
    tokenized_dataset = dataset.load_from_disk(str(tokenized_path))
else:
    print("Tokenizing dataset...")
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        batch_size=batch_size * 4,  # Process 4 training batches at once for efficiency
        remove_columns=dataset["train"].column_names
    )
    # Set format for PyTorch
    tokenized_dataset = tokenized_dataset.with_format(
        "torch",
        columns=[
            "input_ids",
            "attention_mask",
            "labels"
        ]
    )
    tokenized_dataset.save_to_disk(str(tokenized_path))
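
The `padding='max_length'` and `truncation=True` options used in `tokenize_function` can be pictured with a simplified sketch. The helper `pad_or_truncate`, the token ids and the padding id of 0 are assumptions for illustration; the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop tokens past the limit
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding

# A short sequence is padded, a long one is cut
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length rows are what allow the examples to be stacked into tensors of equal shape.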

## Data Loaders
Create DataLoaders for training and evaluation

Create DataLoader for training

In [None]:
train_loader = DataLoader(
    tokenized_dataset["train"],
    batch_size=batch_size,
    shuffle=True
)

Create DataLoader for evaluation

In [None]:
eval_loader = DataLoader(
    tokenized_dataset["validation"],
    batch_size=batch_size
)

## Training Setup
Configure the device, optimizer, and learning-rate scheduler

Select the device, then create the optimizer and the warmup schedule

In [None]:
# Set device to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=4e-5)

# Training constants
max_grad_norm = 1.0
num_epochs = 3

# Calculate number of training steps
num_training_steps = len(train_loader) * num_epochs
num_warmup_steps = num_training_steps // 10  # 10% of total steps for warmup

# Create scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)
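
The shape of this schedule can be reproduced in plain Python: the learning rate rises linearly from zero during warmup and then falls linearly back to zero. The function name `linear_warmup_lr` and the step counts are hypothetical; the base rate of 4e-5 matches the optimizer above.

```python
def linear_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

# Hypothetical run of 100 steps with 10 warmup steps
for step in (0, 5, 10, 55, 100):
    print(step, f"{linear_warmup_lr(step, 4e-5, 10, 100):.2e}")
```

The rate peaks at the end of warmup, so the largest updates happen once the first gradient estimates have settled.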

## Training Loop
Train the model for the specified number of epochs

In [None]:
# Training loop
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}")

    for step, batch in enumerate(progress_bar, start=1):
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients
        clip_grad_norm_(model.parameters(), max_grad_norm)

        # Update parameters
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # Track the running average loss over the batches seen so far
        # (progress_bar.n can lag behind between display refreshes, so count steps explicitly)
        epoch_loss += loss.item()
        current_loss = epoch_loss / step

        # Update progress bar
        progress_bar.set_postfix({
            "loss": f"{current_loss:.4f}",
            "lr": f"{scheduler.get_last_lr()[0]:.2e}"
        })

    # Print epoch summary
    avg_epoch_loss = epoch_loss / len(train_loader)
    print(f"\nEpoch {epoch+1} - Average Loss: {avg_epoch_loss:.4f}")
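
The gradient clipping step in the loop rescales all gradients together whenever their combined L2 norm exceeds `max_grad_norm`. A minimal sketch of that idea on a plain list of numbers, with the hypothetical helper `clip_by_global_norm` standing in for the PyTorch utility:

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Gradients with norm 5 are shrunk to norm 1; the direction is preserved
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below the threshold pass through unchanged.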

## Model Evaluation
Evaluate the model on validation data

In [None]:
model.eval()
all_predictions = []
all_labels = []

with torch.no_grad():
    for batch in eval_loader:
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        logits = outputs.logits

        # Get predictions
        predictions = torch.argmax(logits, dim=-1)
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

Calculate accuracy

In [None]:
accuracy = accuracy_score(all_labels, all_predictions)
print(f"Validation Accuracy: {accuracy:.4f}")
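
What the evaluation computes can be sketched without any library: take the index of the largest logit per example, then count how often it matches the label. The logits and labels below are made up for illustration.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(labels, predictions):
    """Fraction of predictions that equal the reference label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up logits for three examples and three classes
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.0, 3.0]]
labels = [1, 0, 1]
predictions = [argmax(row) for row in logits]
print(predictions, f"{accuracy(labels, predictions):.4f}")
```

Since only the first annotated emotion of each comment is kept as the label, the reported accuracy measures agreement with that first emotion only.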

# 3. Apply the fine-tuned model to a simple example
Try the model on a sample sentence to verify it works

In [None]:
# Emotion names in the label-id order used by GoEmotions
# (note that "neutral" is the last label, id 27, not in alphabetical position)
EMOTIONS = [
    "admiration", "amusement", "anger",
    "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire",
    "disappointment", "disapproval",
    "disgust", "embarrassment",
    "excitement", "fear", "gratitude",
    "grief", "joy", "love", "nervousness",
    "optimism", "pride",
    "realization", "relief", "remorse",
    "sadness", "surprise", "neutral"
]

def predict_sentiment(text):
    # Tokenize the input text
    inputs = tokenizer(
        text,
        padding='max_length',
        truncation=True,
        max_length=64,
        return_tensors="pt"
    )

    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1)

    # Map numerical prediction to emotion name
    emotion = EMOTIONS[prediction.item()]
    return emotion

# Test with an example sentence
example_text = "I feel excited to learn how to use GenAI!"
prediction = predict_sentiment(example_text)
print("\nExample Prediction:")
print(f"Text: {example_text}")
print(f"Predicted emotion: {prediction}")
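
`predict_sentiment` returns only the single most likely emotion. If a ranked list with probabilities would be more useful, the logits could be passed through a softmax first. The sketch below shows the idea in plain Python; the helpers `softmax` and `top_k` and the logit values are hypothetical and not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, names, k=3):
    """Return the k most probable (name, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits for four of the emotion names
names = ["joy", "excitement", "fear", "neutral"]
for name, prob in top_k([1.2, 2.5, -0.7, 0.3], names, k=2):
    print(f"{name}: {prob:.3f}")
```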

## Save Model
Save the trained LoRA adapter and tokenizer

In [None]:
# Save the LoRA adapter and tokenizer
print("Saving LoRA adapter and tokenizer...")
model.save_pretrained(str(LORA_DIR / "distilbert_lora_go_emotions"))
tokenizer.save_pretrained(str(LORA_DIR / "distilbert_lora_go_emotions"))
print("Training complete! LoRA adapter and tokenizer saved.")