# Transformer Models with LoRA Fine-tuning

This notebook provides a complete walkthrough of working with transformer models, from data preparation to evaluation, and shows how to use Low-Rank Adaptation (LoRA) to efficiently fine-tune pre-trained models.

We'll cover:
1. Loading and preparing a dataset
2. Tokenizing text data
3. Fine-tuning a pre-trained transformer model
4. Evaluating model performance
5. Implementing LoRA for efficient fine-tuning
6. Comparing performance between approaches

## 1. Setup and Installation

First, let's install the required libraries:

In [None]:
# Install necessary packages
!pip install transformers datasets peft evaluate scikit-learn matplotlib seaborn torch tqdm

In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    Trainer, 
    TrainingArguments,
    get_scheduler,
    set_seed
)

from datasets import load_dataset, Dataset as HFDataset
from peft import get_peft_model, LoraConfig, TaskType
import evaluate
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Set random seed for reproducibility
set_seed(42)

## 2. Loading and Preparing the Dataset

We'll use the IMDb movie reviews dataset for a sentiment classification task. This dataset contains 50,000 movie reviews labeled as positive or negative, split evenly into 25,000 training and 25,000 test examples.

In [None]:
# Load the IMDb dataset
imdb_dataset = load_dataset("imdb")
print(imdb_dataset)

In [None]:
# Preview some examples
print("Training example:")
print(imdb_dataset["train"][0])
print("\nTest example:")
print(imdb_dataset["test"][0])

In [None]:
# Check class distribution
train_labels = [example["label"] for example in imdb_dataset["train"]]
test_labels = [example["label"] for example in imdb_dataset["test"]]

# Plot the distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bin edges at -0.5, 0.5, 1.5 center the two bars on the tick labels 0 and 1
axes[0].hist(train_labels, bins=[-0.5, 0.5, 1.5], rwidth=0.8)
axes[0].set_title("Training Set Distribution")
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(["Negative", "Positive"])

axes[1].hist(test_labels, bins=[-0.5, 0.5, 1.5], rwidth=0.8)
axes[1].set_title("Test Set Distribution")
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(["Negative", "Positive"])

plt.tight_layout()
plt.show()

In [None]:
# Create a validation set from the training set
split_dataset = imdb_dataset["train"].train_test_split(test_size=0.1, seed=42)

# Create a new dataset dictionary with train, validation, and test splits
dataset = {}
dataset["train"] = split_dataset["train"]
dataset["validation"] = split_dataset["test"]
dataset["test"] = imdb_dataset["test"]

print("Dataset sizes:")
for split in dataset:
    print(f"{split}: {len(dataset[split])} examples")

### Optimize Dataset Size for Demonstration

To make the notebook run more quickly, let's use a smaller subset of the data for demonstration purposes.

In [None]:
# Smaller dataset for demonstration
train_sample_size = 5000  # Use 5,000 training examples
val_sample_size = 500     # Use 500 validation examples
test_sample_size = 1000   # Use 1,000 test examples

small_dataset = {}
small_dataset["train"] = dataset["train"].shuffle(seed=42).select(range(train_sample_size))
small_dataset["validation"] = dataset["validation"].shuffle(seed=42).select(range(val_sample_size))
small_dataset["test"] = dataset["test"].shuffle(seed=42).select(range(test_sample_size))

print("Small dataset sizes:")
for split in small_dataset:
    print(f"{split}: {len(small_dataset[split])} examples")

## 3. Tokenization

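We'll use a pre-trained tokenizer from the Hugging Face Transformers library. For this example, we'll use the tokenizer that matches the `bert-base-uncased` checkpoint, which lowercases text and splits it into WordPiece subword tokens.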

In [None]:
# Load the tokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Examine tokenization on a single example
example_text = small_dataset["train"][0]["text"]
print(f"Original text:\n{example_text[:200]}...\n")

# Tokenize the example
tokenized_example = tokenizer(example_text, truncation=True, padding="max_length", max_length=512)

# Print token IDs (first 20)
print(f"Token IDs (first 20): {tokenized_example['input_ids'][:20]}")

# Convert the IDs back to token strings to see what they correspond to
tokens = tokenizer.convert_ids_to_tokens(tokenized_example['input_ids'][:20])
print(f"Tokens: {tokens}")
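
To make the `truncation=True` and `padding="max_length"` arguments concrete, here is a minimal pure-Python sketch of their effect on a sequence of token IDs. The helper `pad_and_truncate` is hypothetical and not part of the `transformers` API, the input IDs are arbitrary, and the real tokenizer also performs WordPiece splitting. The special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
# Illustrative only: mimics the effect of truncation=True and padding="max_length".
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_and_truncate(token_ids, max_length):
    """Wrap token_ids in [CLS]/[SEP], then truncate or pad to exactly max_length."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short_example = pad_and_truncate([2023, 3185, 2001, 2307], max_length=8)
long_example = pad_and_truncate(list(range(1000, 1020)), max_length=8)

print(short_example["input_ids"])       # padded with [PAD] up to length 8
print(short_example["attention_mask"])  # zeros mark the padded positions
print(long_example["input_ids"])        # cut to 6 tokens plus [CLS] and [SEP]
```

Padding every review to 512 tokens keeps tensor shapes uniform, at the cost of extra computation on short reviews.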

In [None]:
# Create a tokenization function for applying to the entire dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Apply tokenization to all splits
tokenized_datasets = {}
for split in small_dataset:
    tokenized_datasets[split] = small_dataset[split].map(tokenize_function, batched=True)
    tokenized_datasets[split] = tokenized_datasets[split].remove_columns(["text"])
    tokenized_datasets[split] = tokenized_datasets[split].rename_column("label", "labels")
    tokenized_datasets[split].set_format("torch")

print("Tokenized dataset format:")
print(tokenized_datasets["train"].features)

## 4. Fine-tuning a Pre-trained Model

Now, let's load a pre-trained BERT model and fine-tune it for our sentiment classification task.

In [None]:
# Load pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, 
    num_labels=2  # Binary classification: positive or negative
)

# Print model architecture summary
print(f"Model type: {model.__class__.__name__}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Define evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="bert-imdb-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",  # Named `evaluation_strategy` in transformers releases before 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none"  # Disable reporting to avoid any external logging
)
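
As a quick sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and the 5,000-example training subset created earlier; with several GPUs the effective batch size, and therefore the step count, would differ.

```python
import math

train_examples = 5000        # size of the reduced training split
per_device_batch_size = 8    # per_device_train_batch_size
num_epochs = 3               # num_train_epochs
num_devices = 1              # assumption: one GPU (or CPU only)

# The last, possibly smaller, batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

Under these assumptions, evaluating and saving once per epoch means an evaluation pass and a checkpoint every 625 steps.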

In [None]:
# Initialize the Trainer
standard_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # transformers >= 4.46 prefers the name `processing_class` for this argument
    compute_metrics=compute_metrics
)

In [None]:
# Train the model
train_results = standard_trainer.train()
print(f"Training completed with metrics: {train_results.metrics}")

# Evaluate on the validation set
eval_results = standard_trainer.evaluate()
print(f"Validation metrics: {eval_results}")

## 5. Evaluating the Standard Model

Let's evaluate our fine-tuned model on the test set and examine the results in detail.

In [None]:
# Evaluate on the test set
test_results = standard_trainer.evaluate(tokenized_datasets["test"])
print(f"Test metrics: {test_results}")

In [None]:
# Get detailed predictions
test_predictions = standard_trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(test_predictions.predictions, axis=1)
labels = test_predictions.label_ids

# Create confusion matrix
cm = confusion_matrix(labels, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Standard Fine-tuning')
plt.show()

# Classification report
print("Classification Report - Standard Fine-tuning:")
print(classification_report(labels, predictions, target_names=["Negative", "Positive"]))
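
The numbers in the classification report can be derived directly from the confusion matrix. The following sketch shows the arithmetic for the positive class using made-up counts (they are not results from this notebook). Rows are true labels and columns are predicted labels, matching the layout that `confusion_matrix` returns.

```python
# Hypothetical confusion matrix, NOT a result from this notebook.
# Layout matches scikit-learn: rows = true label, columns = predicted label.
cm_example = [
    [450, 50],   # true Negative: 450 predicted Negative, 50 predicted Positive
    [30, 470],   # true Positive: 30 predicted Negative, 470 predicted Positive
]

tn, fp = cm_example[0]
fn, tp = cm_example[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything predicted Positive, how much was right
recall = tp / (tp + fn)      # of everything truly Positive, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The `f1` value returned by `compute_metrics` is this positive-class F1, since the metric defaults to binary averaging with label 1 as the positive class.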

In [None]:
# Analyze some example predictions
def analyze_examples(dataset, predictions, n_examples=5):
    examples = []
    for i in range(n_examples):
        example = {
            "text": dataset["test"][i]["text"],
            "true_label": "Positive" if dataset["test"][i]["label"] == 1 else "Negative",
            "predicted_label": "Positive" if predictions[i] == 1 else "Negative",
            "correct": dataset["test"][i]["label"] == predictions[i]
        }
        examples.append(example)
    return examples

standard_examples = analyze_examples(small_dataset, predictions)

for i, example in enumerate(standard_examples):
    print(f"Example {i+1}")
    print(f"Text: {example['text'][:200]}...")
    print(f"True label: {example['true_label']}")
    print(f"Predicted label: {example['predicted_label']}")
    print(f"Prediction correct: {example['correct']}")
    print("\n" + "-"*80 + "\n")

## 6. Fine-tuning with LoRA (Low-Rank Adaptation)

Now, let's implement LoRA to efficiently fine-tune our model while updating only a fraction of the parameters.

### What is LoRA?

Low-Rank Adaptation (LoRA) is a technique for efficiently fine-tuning large pre-trained models. Instead of updating all the parameters during fine-tuning, LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters.

The key idea is that the weight updates during adaptation have a low "intrinsic rank", meaning the weight changes can be approximated well by low-rank matrices.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA parameterizes its change with:

$$W = W_0 + \Delta W = W_0 + BA$$

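where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. Only $A$ and $B$ are trained; $W_0$ stays frozen. In practice the update $BA$ is additionally scaled by $\alpha / r$, where $\alpha$ corresponds to the `lora_alpha` setting.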
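
To see the decomposition in action, here is a small dependency-free sketch of a LoRA-style forward pass for a single weight matrix. It is an illustration under stated assumptions, not how the `peft` library is implemented: the helper names (`matvec`, `lora_forward`) and the toy dimensions are hypothetical. It follows the initialization described in the LoRA paper (random $A$, zero $B$) and applies the $\alpha / r$ scaling to the update.

```python
import random

random.seed(0)
d, k, r, alpha = 8, 6, 2, 4   # toy sizes; real layers are far larger


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                # trainable, d x r, starts at zero


def lora_forward(x):
    """Compute W0 x + (alpha / r) * B (A x) without forming the d x k update."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]
# With B = 0 the adapted layer reproduces the pre-trained layer exactly.
print(lora_forward(x) == matvec(W0, x))

full_params = d * k          # parameters updated by full fine-tuning
lora_params = r * (d + k)    # parameters in B and A combined
print(f"toy layer: full={full_params}, LoRA={lora_params}")

# The saving grows with layer size, e.g. a 768 x 768 projection with r = 8:
print(f"768 x 768 layer: full={768 * 768:,}, LoRA={8 * (768 + 768):,}")
```

Because $B$ starts at zero, training begins from exactly the pre-trained behavior, and only $A$ and $B$ receive gradient updates.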

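This approach provides several benefits:
1. Significantly fewer trainable parameters
2. Reduced memory requirements, since gradients and optimizer state are kept only for the adapter weights
3. Often faster training, although the forward pass still runs through the full model
4. Competitive results with limited data in many cases, though this is task-dependent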

In [None]:
# Load a fresh model for LoRA fine-tuning
lora_model_base = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2  # Binary classification: positive or negative
)

In [None]:
# Configure LoRA
lora_config = LoraConfig(
    r=8,                     # Rank of the update matrices
    lora_alpha=16,           # Scaling factor: the update is scaled by lora_alpha / r
    target_modules=["query", "key", "value"],  # Which modules to apply LoRA to
    lora_dropout=0.1,        # Dropout probability for LoRA layers
    bias="none",             # Don't train bias parameters
    task_type=TaskType.SEQ_CLS  # Sequence classification task
)

# Create LoRA model
lora_model = get_peft_model(lora_model_base, lora_config)

# Print trainable parameters
lora_model.print_trainable_parameters()
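
The count printed above can be cross-checked by hand. This sketch assumes the standard `bert-base-uncased` shape (12 encoder layers, hidden size 768) and that, for sequence classification, the freshly initialized classifier head stays trainable alongside the LoRA matrices. If your `peft` version behaves differently, the printed number will differ.

```python
num_layers = 12      # assumption: encoder layers in bert-base-uncased
hidden_size = 768    # assumption: hidden size in bert-base-uncased
r = 8                # LoRA rank from lora_config
num_targets = 3      # query, key, value
num_labels = 2

# Each adapted projection adds A (r x hidden) and B (hidden x r).
params_per_projection = r * hidden_size + hidden_size * r
lora_params = num_layers * num_targets * params_per_projection

# Classifier head: a hidden_size x num_labels weight matrix plus one bias per label.
classifier_params = hidden_size * num_labels + num_labels

trainable = lora_params + classifier_params
print(f"LoRA matrices: {lora_params:,}")
print(f"Classifier head: {classifier_params:,}")
print(f"Expected trainable parameters: {trainable:,}")
```

Against the roughly 110 million parameters of BERT-base, that is well under 1% of the model.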

In [None]:
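# Initialize the LoRA Trainer with the same settings as before.
# Note: reusing training_args means both runs write checkpoints to the same
# output_dir and share the 2e-5 learning rate, which is on the low side for
# LoRA; adapters are often trained with a larger rate.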
lora_trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# Train the LoRA model
lora_train_results = lora_trainer.train()
print(f"LoRA training completed with metrics: {lora_train_results.metrics}")

# Evaluate on the validation set
lora_eval_results = lora_trainer.evaluate()
print(f"LoRA validation metrics: {lora_eval_results}")

## 7. Evaluating the LoRA Model

Let's evaluate our LoRA-fine-tuned model on the test set and compare it to the standard fine-tuning approach.

In [None]:
# Evaluate LoRA model on the test set
lora_test_results = lora_trainer.evaluate(tokenized_datasets["test"])
print(f"LoRA test metrics: {lora_test_results}")

In [None]:
# Get detailed predictions
lora_test_predictions = lora_trainer.predict(tokenized_datasets["test"])
lora_predictions = np.argmax(lora_test_predictions.predictions, axis=1)
lora_labels = lora_test_predictions.label_ids

# Create confusion matrix
lora_cm = confusion_matrix(lora_labels, lora_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(lora_cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - LoRA Fine-tuning')
plt.show()

# Classification report
print("Classification Report - LoRA Fine-tuning:")
print(classification_report(lora_labels, lora_predictions, target_names=["Negative", "Positive"]))

In [None]:
# Analyze the same examples with LoRA for comparison
lora_examples = analyze_examples(small_dataset, lora_predictions)

for i, example in enumerate(lora_examples):
    print(f"Example {i+1}")
    print(f"Text: {example['text'][:200]}...")
    print(f"True label: {example['true_label']}")
    print(f"Predicted label: {example['predicted_label']}")
    print(f"Prediction correct: {example['correct']}")
    print("\n" + "-"*80 + "\n")

## 8. Comparative Analysis

Now, let's compare standard fine-tuning and LoRA fine-tuning in terms of test metrics, training time, and the number of trainable parameters.

In [None]:
# Create comparison tables and visualizations
comparison_data = {
    "Method": ["Standard Fine-tuning", "LoRA Fine-tuning"],
    "Test Accuracy": [test_results["eval_accuracy"], lora_test_results["eval_accuracy"]],
    "Test F1": [test_results["eval_f1"], lora_test_results["eval_f1"]],
    "Training Time": [train_results.metrics["train_runtime"], lora_train_results.metrics["train_runtime"]],
    "Trainable Parameters": [sum(p.numel() for p in model.parameters() if p.requires_grad), 
                       sum(p.numel() for p in lora_model.parameters() if p.requires_grad)]
}

# Create dataframe
comparison_df = pd.DataFrame(comparison_data)
print("Performance Comparison:")
display(comparison_df)

# Parameter reduction percentage
param_reduction = (1 - (comparison_data["Trainable Parameters"][1] / comparison_data["Trainable Parameters"][0])) * 100
print(f"Parameter reduction with LoRA: {param_reduction:.2f}%")

# Training time reduction percentage
time_reduction = (1 - (comparison_data["Training Time"][1] / comparison_data["Training Time"][0])) * 100
print(f"Training time reduction with LoRA: {time_reduction:.2f}%")

In [None]:
# Visualize comparison metrics
metrics = ["Test Accuracy", "Test F1"]
methods = comparison_data["Method"]

fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# Plot accuracy and F1 comparison
x = np.arange(len(methods))
width = 0.35

ax[0].bar(x, comparison_data["Test Accuracy"], width, label="Accuracy")
ax[0].set_ylim(min(0.8, min(comparison_data["Test Accuracy"]) - 0.05), 1.0)  # Start at 0.8 for better visibility of differences, or lower if a score falls below it
ax[0].set_xticks(x)
ax[0].set_xticklabels(methods)
ax[0].set_ylabel("Score")
ax[0].set_title("Test Accuracy Comparison")
for i, v in enumerate(comparison_data["Test Accuracy"]):
    ax[0].text(i, v + 0.01, f"{v:.4f}", ha="center")

ax[1].bar(x, comparison_data["Test F1"], width, label="F1", color="orange")
ax[1].set_ylim(min(0.8, min(comparison_data["Test F1"]) - 0.05), 1.0)  # Start at 0.8 for better visibility of differences, or lower if a score falls below it
ax[1].set_xticks(x)
ax[1].set_xticklabels(methods)
ax[1].set_ylabel("Score")
ax[1].set_title("Test F1 Score Comparison")
for i, v in enumerate(comparison_data["Test F1"]):
    ax[1].text(i, v + 0.01, f"{v:.4f}", ha="center")

plt.tight_layout()
plt.show()

In [None]:
# Visualize training time and parameter counts
fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# Plot training time comparison
ax[0].bar(x, comparison_data["Training Time"], width, color="green")
ax[0].set_xticks(x)
ax[0].set_xticklabels(methods)
ax[0].set_ylabel("Time (seconds)")
ax[0].set_title("Training Time Comparison")
for i, v in enumerate(comparison_data["Training Time"]):
    ax[0].text(i, v + 5, f"{v:.1f}s", ha="center")

# Plot parameter count comparison on a log scale
ax[1].bar(x, comparison_data["Trainable Parameters"], width, color="purple")
ax[1].set_xticks(x)
ax[1].set_xticklabels(methods)
ax[1].set_ylabel("Number of Parameters")
ax[1].set_title("Trainable Parameters Comparison")
ax[1].set_yscale("log")
for i, v in enumerate(comparison_data["Trainable Parameters"]):
    ax[1].text(i, v * 1.1, f"{v:,}", ha="center", va="bottom", rotation=0)

plt.tight_layout()
plt.show()

## 9. Examining Prediction Differences

Let's look at examples where the standard model and LoRA model made different predictions.

In [None]:
# Find examples where the models disagree
disagreement_indices = [i for i in range(len(predictions)) if predictions[i] != lora_predictions[i]]
print(f"Number of examples where models disagree: {len(disagreement_indices)}")

# Examine a few examples
for i in range(min(5, len(disagreement_indices))):
    idx = disagreement_indices[i]
    true_label = "Positive" if small_dataset["test"][idx]["label"] == 1 else "Negative"
    standard_pred = "Positive" if predictions[idx] == 1 else "Negative"
    lora_pred = "Positive" if lora_predictions[idx] == 1 else "Negative"
    
    print(f"Example {i+1} (Index: {idx})")
    print(f"Text: {small_dataset['test'][idx]['text'][:300]}...")
    print(f"True label: {true_label}")
    print(f"Standard model prediction: {standard_pred}")
    print(f"LoRA model prediction: {lora_pred}")
    print(f"Standard model correct: {predictions[idx] == small_dataset['test'][idx]['label']}")
    print(f"LoRA model correct: {lora_predictions[idx] == small_dataset['test'][idx]['label']}")
    print("\n" + "-"*80 + "\n")

## 10. Confidence Analysis

Let's compare the confidence levels of both models in their predictions.
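
Here, confidence means the softmax probability assigned to the predicted class. As a minimal sketch of that computation, using made-up logits rather than model outputs:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [Negative, Positive]; not taken from either model.
examples = {
    "confident": [-2.0, 3.0],
    "uncertain": [0.1, 0.3],
}

for name, logits in examples.items():
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print(f"{name}: predicted class {predicted}, confidence {max(probs):.4f}")
```

For a two-class problem the confidence can never fall below 0.5, so the histograms only span the range from 0.5 to 1.0.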

In [None]:
# Extract probabilities from predictions
standard_probs = F.softmax(torch.tensor(test_predictions.predictions), dim=1).numpy()
lora_probs = F.softmax(torch.tensor(lora_test_predictions.predictions), dim=1).numpy()

# Get confidence for the predicted class
standard_confidence = np.max(standard_probs, axis=1)
lora_confidence = np.max(lora_probs, axis=1)

# Plot confidence distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(standard_confidence, bins=20, alpha=0.7, color='blue')
plt.axvline(x=np.mean(standard_confidence), color='red', linestyle='--')
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.title(f'Standard Model Confidence Distribution\nMean: {np.mean(standard_confidence):.4f}')

plt.subplot(1, 2, 2)
plt.hist(lora_confidence, bins=20, alpha=0.7, color='green')
plt.axvline(x=np.mean(lora_confidence), color='red', linestyle='--')
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.title(f'LoRA Model Confidence Distribution\nMean: {np.mean(lora_confidence):.4f}')

plt.tight_layout()
plt.show()

In [None]:
# Compare confidence on correct and incorrect predictions
standard_correct = predictions == labels
standard_correct_conf = standard_confidence[standard_correct]
standard_incorrect_conf = standard_confidence[~standard_correct]

lora_correct = lora_predictions == lora_labels
lora_correct_conf = lora_confidence[lora_correct]
lora_incorrect_conf = lora_confidence[~lora_correct]

plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
plt.hist(standard_correct_conf, bins=20, alpha=0.7, color='blue', label='Correct')
plt.hist(standard_incorrect_conf, bins=20, alpha=0.7, color='red', label='Incorrect')
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.title('Standard Model: Confidence by Prediction Correctness')
plt.legend()

plt.subplot(1, 2, 2)
plt.hist(lora_correct_conf, bins=20, alpha=0.7, color='green', label='Correct')
plt.hist(lora_incorrect_conf, bins=20, alpha=0.7, color='orange', label='Incorrect')
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.title('LoRA Model: Confidence by Prediction Correctness')
plt.legend()

plt.tight_layout()
plt.show()

# Print average confidences
print(f"Standard model - Average confidence on correct predictions: {np.mean(standard_correct_conf):.4f}")
print(f"Standard model - Average confidence on incorrect predictions: {np.mean(standard_incorrect_conf):.4f}")
print(f"LoRA model - Average confidence on correct predictions: {np.mean(lora_correct_conf):.4f}")
print(f"LoRA model - Average confidence on incorrect predictions: {np.mean(lora_incorrect_conf):.4f}")

## 11. Conclusion

In this notebook, we've demonstrated how to use transformer models for text classification with both standard fine-tuning and LoRA fine-tuning. Here's a summary of what the comparison covers; the exact numbers come from the outputs of Sections 8 to 10 and will vary from run to run:

1. **Performance Comparison**:
   - Standard fine-tuning provides the accuracy and F1 baseline on the IMDb sentiment classification task.
   - LoRA fine-tuning typically reaches similar performance while updating only a small fraction of the parameters.

2. **Efficiency Benefits**:
   - LoRA significantly reduces the number of trainable parameters (more than 99% fewer with the rank-8 configuration used here).
   - Training time is usually lower with LoRA; the measured difference is reported in Section 8.
   - Memory usage is generally lower with LoRA, which is especially important for large models, although this notebook does not measure it.

3. **Prediction Behavior**:
   - The two models are expected to agree on most test examples; Section 9 counts and displays the cases where they differ.
   - Section 10 compares the confidence distributions of the two approaches, including confidence on correct versus incorrect predictions.
### Key Takeaways

1. LoRA is a powerful technique for efficient fine-tuning of transformer models, especially when computational resources are limited.
2. For many downstream tasks, LoRA can achieve comparable performance to full fine-tuning while being more efficient.
3. The tradeoff between performance and efficiency makes LoRA particularly attractive for production environments or when working with very large models.

### Next Steps

- Experiment with different LoRA configurations (rank, alpha, target modules)
- Try other PEFT methods like QLoRA (Quantized LoRA) for even more efficiency
- Apply these techniques to different NLP tasks and larger models
- Explore combining LoRA with other efficiency techniques like knowledge distillation