# Focused Learning: Low-Rank Adaptation (LoRA) Implementation

## Learning Objectives
- Understand the mathematical foundations of Low-Rank Adaptation (LoRA)
- Implement LoRA from scratch to gain deep insights into its mechanics
- Explore the trade-offs between rank, scaling factors, and performance
- Compare LoRA with other PEFT methods through hands-on implementation

## Paper Reference
This notebook explores concepts from the paper "Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications" (arXiv:2404.13506v2).

Specifically, we focus on Section 3, which highlights that LoRA is one of the most effective PEFT methods across various applications:

> "Our analysis reveals that Low-Rank Adaptation (LoRA) fine-tunes a minimal number of parameters, thus enabling the recalibration of training weights on a single GPU." (Section 5, Page 6)

The paper also notes that LoRA has been successfully applied in multiple domains:

> "In medical imaging, a combination of methods (e.g., BitFit + LoRA) may be effective" (Section 3.3, Page 3)

> "For PPI prediction, PEFT models even outperform traditional methods... Low-Rank Adaptation at 0.81 percent compared to the full model's 100 percent..." (Section 3.4, Page 4)

## 1. Introduction to LoRA

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method introduced by Hu et al. (2021). It keeps the pre-trained weights frozen and learns the weight update as the product of two low-rank matrices. The underlying hypothesis is that the weight updates needed during fine-tuning have a low "intrinsic rank".

### Mathematical Formulation

In LoRA, instead of directly updating a weight matrix $W \in \mathbb{R}^{d \times k}$, we freeze $W$ and introduce a low-rank decomposition:

$$W' = W + \Delta W = W + BA$$

where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d,k)$ is the rank

This results in significantly fewer trainable parameters, as we only train the matrices $A$ and $B$ while keeping the original weights $W$ frozen.
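To make the saving concrete, the two parameter counts can be worked out directly. The dimensions used here ($d = k = 4096$, $r = 8$) are illustrative assumptions, not values taken from the paper:

$$
\underbrace{d \cdot k}_{\text{full update}} = 4096 \cdot 4096 = 16{,}777{,}216,
\qquad
\underbrace{r\,(d + k)}_{\text{LoRA update}} = 8 \cdot (4096 + 4096) = 65{,}536
$$

Under these assumptions the LoRA update trains $65{,}536 / 16{,}777{,}216 \approx 0.39\%$ of the entries of the full matrix.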

During the forward pass, the LoRA-augmented weight computation is:

$$h = Wx + \Delta W x = Wx + BAx$$

To further control the contribution of the adaptation, a scaling factor $\alpha$ is introduced:

$$h = Wx + \frac{\alpha}{r}BAx$$

where $\alpha$ is a constant hyperparameter. Dividing by $r$ is intended to reduce the need to retune hyperparameters when the rank changes (Hu et al., 2021).
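The formulas above imply that the scaled update can be folded into a single matrix, $W' = W + \frac{\alpha}{r}BA$, and that $\Delta W$ has rank at most $r$. The following is a minimal numerical sketch of both properties. The sizes and the random tensors are arbitrary assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from the paper)
d, k, r, alpha = 6, 5, 2, 4.0

W = torch.randn(d, k)   # stands in for the frozen pre-trained weight
B = torch.randn(d, r)   # low-rank factor, shape [d, r]
A = torch.randn(r, k)   # low-rank factor, shape [r, k]
x = torch.randn(k)      # a single input vector

scaling = alpha / r
delta_W = scaling * (B @ A)                    # scaled low-rank update, shape [d, k]

h_separate = W @ x + scaling * (B @ (A @ x))   # h = Wx + (alpha / r) * B A x
h_merged = (W + delta_W) @ x                   # same computation with one merged matrix

update_rank = int(torch.linalg.matrix_rank(delta_W))

print("Outputs match:", torch.allclose(h_separate, h_merged, atol=1e-5))
print("Rank of the update:", update_rank, "(at most r =", r, ")")
```

Keeping the two branches separate is what makes training cheap, while the merged form shows why a trained adapter need not add inference cost.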

In [None]:
# Install necessary libraries
!pip install torch transformers datasets peft matplotlib numpy seaborn pandas

In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW

# Set the seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Implementing LoRA from Scratch

Let's implement LoRA from scratch to understand its inner workings. We'll start by creating a basic linear layer with LoRA.

In [None]:
class LoRALinear(nn.Module):
    """Linear layer with LoRA adaptation"""
    def __init__(self, in_features, out_features, rank=8, alpha=16, dropout=0.0):
        super().__init__()
        
        # Original linear layer (frozen)
        self.linear = nn.Linear(in_features, out_features)
        # Freeze the weights of the original linear layer
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False
        
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Scaling factor
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
        
        # Initialize LoRA matrices
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
    
    def forward(self, x):
        # Original forward pass
        orig_output = self.linear(x)
        
        # LoRA forward pass
        # Apply A matrix, then dropout, then B matrix.
        # F.linear(x, W) computes x @ W.T, so both matrices are passed untransposed.
        lora_output = self.dropout(F.linear(x, self.lora_A))  # [batch_size, rank]
        lora_output = F.linear(lora_output, self.lora_B)  # [batch_size, out_features]
        
        # Scale LoRA output
        lora_output = lora_output * self.scaling
        
        # Combine outputs
        return orig_output + lora_output
    
    def get_trainable_parameters(self):
        """Count the number of trainable parameters"""
        return self.lora_A.numel() + self.lora_B.numel()
    
    def get_total_parameters(self):
        """Count the total number of parameters (trainable + frozen, including the bias)"""
        return sum(p.numel() for p in self.parameters())
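The initialization used in the class above (random `lora_A`, zero `lora_B`) has two consequences that are easy to check numerically: the LoRA branch contributes nothing at the start of training, and on the first backward pass only `lora_B` receives a non-zero gradient. The sketch below reproduces just the LoRA branch with standalone tensors. The sizes are arbitrary assumptions, and the loss is a plain sum used only to trigger a backward pass.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions)
in_features, out_features, rank, alpha = 10, 4, 2, 4.0

# Same initialization scheme as the class above: A is random, B is zero
lora_A = nn.Parameter(torch.empty(rank, in_features))
lora_B = nn.Parameter(torch.zeros(out_features, rank))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))

x = torch.randn(3, in_features)

# LoRA branch only: (alpha / r) * x A^T B^T
lora_out = (alpha / rank) * (x @ lora_A.t() @ lora_B.t())
lora_out.sum().backward()

print("Max |LoRA output| at init:", lora_out.abs().max().item())
print("Max |grad A| at init:     ", lora_A.grad.abs().max().item())
print("Max |grad B| at init:     ", lora_B.grad.abs().max().item())
```

Starting from a zero update means the adapted layer initially behaves exactly like the frozen layer, and `lora_A` only begins to receive gradient once `lora_B` has moved away from zero.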

Now, let's create a simple neural network with and without LoRA to compare their parameter count and performance.

In [None]:
import math

class SimpleNN(nn.Module):
    """Simple neural network with two linear layers"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class SimpleNN_LoRA(nn.Module):
    """Simple neural network with LoRA-adapted linear layers"""
    def __init__(self, input_dim, hidden_dim, output_dim, rank=8, alpha=16, dropout=0.0):
        super().__init__()
        self.fc1 = LoRALinear(input_dim, hidden_dim, rank, alpha, dropout)
        self.fc2 = LoRALinear(hidden_dim, output_dim, rank, alpha, dropout)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    
    def get_trainable_parameters(self):
        """Count the number of trainable parameters"""
        return self.fc1.get_trainable_parameters() + self.fc2.get_trainable_parameters()
    
    def get_total_parameters(self):
        """Count the total number of parameters (trainable + frozen)"""
        return self.fc1.get_total_parameters() + self.fc2.get_total_parameters()

Let's create helper functions to count the trainable and total parameters in a model:

In [None]:
def count_parameters(model):
    """Count the number of trainable parameters in a model"""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def count_total_parameters(model):
    """Count the total number of parameters in a model"""
    return sum(p.numel() for p in model.parameters())

Now, let's create a synthetic classification dataset and compare the standard model with the LoRA model.

In [None]:
# Create a synthetic dataset
def create_synthetic_data(n_samples=1000, n_features=20, n_classes=2, random_state=42):
    np.random.seed(random_state)
    X = np.random.randn(n_samples, n_features)
    
    # Create a random weight matrix for classification
    W = np.random.randn(n_features, n_classes)
    
    # Compute logits and apply softmax
    logits = X @ W
    probs = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
    
    # Get the predicted class
    y = np.argmax(probs, axis=1)
    
    # Add some noise to make it more challenging
    noise_idx = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
    y[noise_idx] = 1 - y[noise_idx]  # Flip labels for these samples
    
    return X, y

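# Create one dataset and split it. Calling create_synthetic_data twice with the same seed
# but different sizes would draw a different labeling matrix W for the test set.
X_all, y_all = create_synthetic_data(n_samples=1000, n_features=100, n_classes=2)
X_train, y_train = X_all[:800], y_all[:800]
X_test, y_test = X_all[800:], y_all[800:]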

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.LongTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.LongTensor(y_test)

# Create datasets and dataloaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32)

## 3. Comparing Parameter Counts and Performance

In [None]:
# Define model parameters
input_dim = 100
hidden_dim = 256
output_dim = 2

# Create models
standard_model = SimpleNN(input_dim, hidden_dim, output_dim).to(device)
lora_model = SimpleNN_LoRA(input_dim, hidden_dim, output_dim, rank=8, alpha=16, dropout=0.1).to(device)

# Count parameters
standard_params = count_parameters(standard_model)
standard_total_params = count_total_parameters(standard_model)
lora_params = count_parameters(lora_model)
lora_total_params = count_total_parameters(lora_model)

# Print parameter counts
print(f"Standard model - Trainable parameters: {standard_params:,} (100.00%)")
print(f"LoRA model - Trainable parameters: {lora_params:,} ({lora_params / standard_params * 100:.2f}%)")
print(f"LoRA model - Total parameters (including frozen): {lora_total_params:,}")
print(f"Parameter reduction: {(1 - lora_params / standard_params) * 100:.2f}%")
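The counts printed by the cell above can be cross-checked by hand. The sketch below recomputes them from the layer shapes alone, assuming the same settings (`input_dim=100`, `hidden_dim=256`, `output_dim=2`, `rank=8`) and that the LoRA layers freeze both weights and biases. The helper names `dense_param_count` and `lora_param_count` are hypothetical and exist only for this illustration.

```python
def dense_param_count(in_features, out_features):
    """Weights plus bias of a fully trainable linear layer."""
    return in_features * out_features + out_features


def lora_param_count(in_features, out_features, rank):
    """Entries of A [rank, in_features] plus B [out_features, rank]."""
    return rank * in_features + out_features * rank


# Same sizes as the models in this section
in_dim, hid_dim, out_dim, lora_rank = 100, 256, 2, 8

expected_standard = dense_param_count(in_dim, hid_dim) + dense_param_count(hid_dim, out_dim)
expected_lora = lora_param_count(in_dim, hid_dim, lora_rank) + lora_param_count(hid_dim, out_dim, lora_rank)

print(f"Standard model: {expected_standard:,}")   # 26,370
print(f"LoRA model:     {expected_lora:,} ({expected_lora / expected_standard * 100:.2f}%)")  # 4,912 (18.63%)
```

The ratio is far above the sub-1% figures quoted for large language models because these layers are small: with `rank=8`, the factor $r(d+k)$ is not much smaller than $dk$ when one dimension is only 2 or 100.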

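Now, let's train both models and compare their performance. Keep in mind that the frozen base weights of the LoRA model are randomly initialized here rather than pre-trained, so this toy comparison illustrates the mechanics of LoRA rather than its behavior when fine-tuning a pre-trained model.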

In [None]:
def train_model(model, train_dataloader, test_dataloader, optimizer, num_epochs=10, model_name="Model"):
    """Train and evaluate a model"""
    # Loss function
    criterion = nn.CrossEntropyLoss()
    
    # Training stats
    train_losses = []
    test_accuracies = []
    
    # Training loop
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for batch_idx, (data, target) in enumerate(train_dataloader):
            data, target = data.to(device), target.to(device)
            
            # Forward pass
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(train_dataloader)
        train_losses.append(avg_loss)
        
        # Evaluate the model
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for data, target in test_dataloader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        
        accuracy = 100 * correct / total
        test_accuracies.append(accuracy)
        
        print(f"{model_name} - Epoch {epoch+1}/{num_epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.2f}%")
    
    return train_losses, test_accuracies

# Define optimizers (the LoRA optimizer only receives the trainable parameters)
standard_optimizer = AdamW(standard_model.parameters(), lr=0.001)
lora_optimizer = AdamW([p for p in lora_model.parameters() if p.requires_grad], lr=0.001)

# Train the standard model
print("Training the standard model...")
standard_losses, standard_accuracies = train_model(
    standard_model, 
    train_dataloader, 
    test_dataloader, 
    standard_optimizer, 
    num_epochs=10, 
    model_name="Standard Model"
)

# Train the LoRA model
print("\nTraining the LoRA model...")
lora_losses, lora_accuracies = train_model(
    lora_model, 
    train_dataloader, 
    test_dataloader, 
    lora_optimizer, 
    num_epochs=10, 
    model_name="LoRA Model"
)

Let's visualize the training progress and compare the two models.

In [None]:
# Plot training loss
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(standard_losses, label='Standard Model')
plt.plot(lora_losses, label='LoRA Model')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.legend()
plt.grid(True)

# Plot test accuracy
plt.subplot(1, 2, 2)
plt.plot(standard_accuracies, label='Standard Model')
plt.plot(lora_accuracies, label='LoRA Model')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.title('Test Accuracy')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

## 4. Exploring LoRA Hyperparameters

Let's explore how different LoRA hyperparameters (rank and alpha) affect the model's performance.
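Since the update is multiplied by $\alpha / r$, the two hyperparameters interact: the same $\alpha$ produces a very different effective scale at different ranks. The sketch below tabulates that multiplier for the rank and alpha values swept in this section. `scaling_grid` is a hypothetical name used only for this illustration.

```python
ranks = [1, 2, 4, 8, 16, 32]
alphas = [1, 4, 16, 32, 64]

# Effective multiplier alpha / r applied to the low-rank update
scaling_grid = {(r, a): a / r for r in ranks for a in alphas}

# Print one row per rank and one column per alpha
print("rank  " + "".join(f"a={a:<8}" for a in alphas))
for r in ranks:
    print(f"{r:<6}" + "".join(f"{scaling_grid[(r, a)]:<10.4f}" for a in alphas))
```

Across this grid the multiplier spans from 1/32 to 64, so part of the variation in the sweep reflects a change in effective step size on the update rather than a change in rank alone.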

In [None]:
def train_lora_with_hyperparams(input_dim, hidden_dim, output_dim, rank, alpha, train_dataloader, test_dataloader):
    """Train a LoRA model with specified hyperparameters"""
    model = SimpleNN_LoRA(input_dim, hidden_dim, output_dim, rank=rank, alpha=alpha, dropout=0.1).to(device)
    optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=0.001)
    
    # Get parameter counts
    trainable_params = count_parameters(model)
    total_params = count_total_parameters(model)
    param_percentage = trainable_params / count_parameters(standard_model) * 100
    
    print(f"LoRA(rank={rank}, alpha={alpha}) - Trainable parameters: {trainable_params:,} ({param_percentage:.2f}%)")
    
    # Train for fewer epochs to speed up experimentation
    _, accuracies = train_model(
        model, 
        train_dataloader, 
        test_dataloader, 
        optimizer, 
        num_epochs=5, 
        model_name=f"LoRA(r={rank}, α={alpha})"
    )
    
    # Return the final accuracy and parameter percentage
    return accuracies[-1], param_percentage

# Define a range of ranks and alphas to explore
ranks = [1, 2, 4, 8, 16, 32]
alphas = [1, 4, 16, 32, 64]

# Store results
results = []

# Train models with different hyperparameters
for rank in ranks:
    for alpha in alphas:
        print(f"\nTraining LoRA model with rank={rank}, alpha={alpha}...")
        accuracy, param_percentage = train_lora_with_hyperparams(
            input_dim, hidden_dim, output_dim, rank, alpha, train_dataloader, test_dataloader
        )
        
        results.append({
            'rank': rank,
            'alpha': alpha,
            'accuracy': accuracy,
            'param_percentage': param_percentage
        })

Let's visualize the results of our hyperparameter exploration.

In [None]:
import pandas as pd

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Create a pivot table for the heatmap
pivot_acc = results_df.pivot(index='rank', columns='alpha', values='accuracy')

# Plot heatmap of accuracy
plt.figure(figsize=(10, 8))
sns.heatmap(pivot_acc, annot=True, fmt='.1f', cmap='viridis')
plt.title('Test Accuracy (%) for Different LoRA Hyperparameters')
plt.xlabel('Alpha (α)')
plt.ylabel('Rank (r)')
plt.show()

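# Plot parameter percentage vs. accuracy for different ranks.
# Each rank has a fixed parameter percentage, so its alpha values appear as points on a vertical line.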
plt.figure(figsize=(12, 6))
for rank in ranks:
    rank_data = results_df[results_df['rank'] == rank]
    plt.plot(rank_data['param_percentage'], rank_data['accuracy'], marker='o', label=f'rank={rank}')

plt.axhline(y=standard_accuracies[-1], color='r', linestyle='--', label='Standard Model')
plt.xlabel('Parameter Percentage (%)')
plt.ylabel('Test Accuracy (%)')
plt.title('Parameter Efficiency vs. Accuracy for Different LoRA Ranks')
plt.grid(True)
plt.legend()
plt.show()

## 5. LoRA for Transformer Models

In practice, LoRA is most commonly applied to transformer models, particularly for fine-tuning large language models. Let's see how to apply LoRA to a transformer model using the `peft` library.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# Load a pre-trained BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# Count parameters in the full model
full_params = count_parameters(model)
print(f"Full BERT model - Trainable parameters: {full_params:,}")

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # rank of the LoRA matrices
    lora_alpha=16,  # scaling factor
    target_modules=["query", "key", "value"],  # which modules to apply LoRA to
    lora_dropout=0.1,  # dropout probability
    bias="none",  # whether to train bias parameters
    task_type=TaskType.SEQ_CLS  # task type (sequence classification)
)

# Create the LoRA model
lora_model = get_peft_model(model, lora_config)

# Count trainable parameters
lora_params = count_parameters(lora_model)
print(f"LoRA BERT model - Trainable parameters: {lora_params:,} ({lora_params/full_params*100:.2f}%)")
print(f"Parameter reduction: {(1 - lora_params/full_params)*100:.2f}%")

# Print a summary of trainable vs. total parameter counts
lora_model.print_trainable_parameters()
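The number of parameters in the LoRA matrices can be estimated without loading the model. The sketch below assumes the standard `bert-base-uncased` architecture (hidden size 768, 12 encoder layers) and the `r=8`, `query`/`key`/`value` configuration above. It counts only the LoRA matrices; the figure reported by `peft` may be larger, since a sequence-classification setup may also leave the classification head trainable.

```python
# Architecture sizes assumed for bert-base-uncased
hidden_size = 768
num_layers = 12

# Settings mirrored from the LoraConfig above
lora_rank = 8
target_modules = ["query", "key", "value"]

# Each targeted projection maps hidden_size -> hidden_size, so
# A has shape [lora_rank, hidden_size] and B has shape [hidden_size, lora_rank]
per_module = lora_rank * hidden_size + hidden_size * lora_rank
lora_matrix_params = num_layers * len(target_modules) * per_module

print(f"Parameters per adapted projection: {per_module:,}")    # 12,288
print(f"LoRA matrices in total:            {lora_matrix_params:,}")  # 442,368
```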

## 6. Analysis of LoRA across Different Applications

Based on the paper, let's analyze how LoRA performs across different application domains.

In [None]:
# Application domains and parameter percentages from the paper
applications = [
    'Commonsense Reasoning', 
    'Arithmetic Reasoning',
    'Video Text Generation',
    'Medical Imaging',
    'Protein Models',
    'Code Review',
    'Speech Synthesis'
]

# Parameter percentages for LoRA (from the paper)
lora_percentages = [0.83, 0.83, 0.81, 0.81, 0.81, 1.0, 0.8]

# Create a bar chart
plt.figure(figsize=(12, 6))
bars = plt.bar(applications, lora_percentages, color='skyblue')
plt.axhline(y=1.0, color='r', linestyle='--', label='1% threshold')
plt.title('LoRA Parameter Percentages Across Different Applications')
plt.xlabel('Application Domain')
plt.ylabel('Parameters (% of full model)')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 2.0)  # Set a reasonable y-axis limit
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Add data labels on the bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.05,
             f'{height:.2f}%', ha='center', va='bottom')

plt.legend()
plt.show()

## 7. Comparison with Other PEFT Methods

Let's visualize the comparison of LoRA with other PEFT methods mentioned in the paper.

In [None]:
# PEFT methods and their parameter percentages (from Table 1 and 2 in the paper)
methods = ['PrefT', 'AdapterS', 'AdapterP', 'LoRA', 'DoRA (half)', 'DoRA', 'LoReFT']
param_percentages_7b = [0.110, 0.990, 3.540, 0.830, 0.430, 0.840, 0.031]  # For LLaMA-7B
param_percentages_13b = [0.030, 0.800, 2.890, 0.670, 0.350, 0.680, 0.025]  # For LLaMA-13B

# Create a bar chart for parameter percentages
plt.figure(figsize=(14, 6))
x = np.arange(len(methods))
width = 0.35

plt.bar(x - width/2, param_percentages_7b, width, label='LLaMA-7B')
plt.bar(x + width/2, param_percentages_13b, width, label='LLaMA-13B')

plt.axhline(y=1.0, color='r', linestyle='--', label='1% threshold')
plt.title('Parameter Percentages of Different PEFT Methods')
plt.xlabel('PEFT Method')
plt.ylabel('Parameters (% of full model)')
plt.xticks(x, methods)
plt.yscale('log')  # Log scale to better visualize the differences
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

# Accuracies from Table 1 in the paper (Commonsense Reasoning)
accuracies_7b = [64.6, 70.8, 72.3, 74.7, 77.5, 78.1, 80.2]  # For LLaMA-7B
accuracies_13b = [68.4, 79.5, 81.5, 80.5, 80.8, 81.5, 83.3]  # For LLaMA-13B

# Create a bar chart for accuracies
plt.figure(figsize=(14, 6))
plt.bar(x - width/2, accuracies_7b, width, label='LLaMA-7B')
plt.bar(x + width/2, accuracies_13b, width, label='LLaMA-13B')

plt.axhline(y=77.0, color='g', linestyle='--', label='ChatGPT')
plt.title('Average Accuracy of Different PEFT Methods on Commonsense Reasoning')
plt.xlabel('PEFT Method')
plt.ylabel('Average Accuracy (%)')
plt.xticks(x, methods)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

# Create a scatter plot of parameter percentage vs. accuracy
plt.figure(figsize=(12, 8))
plt.scatter(param_percentages_7b, accuracies_7b, s=100, marker='o', label='LLaMA-7B')
plt.scatter(param_percentages_13b, accuracies_13b, s=100, marker='s', label='LLaMA-13B')

# Annotate each point with the method name
for i, method in enumerate(methods):
    plt.annotate(method, (param_percentages_7b[i], accuracies_7b[i]), 
                 textcoords="offset points", xytext=(0,10), ha='center')
    plt.annotate(method, (param_percentages_13b[i], accuracies_13b[i]), 
                 textcoords="offset points", xytext=(0,-15), ha='center')

plt.axhline(y=77.0, color='g', linestyle='--', label='ChatGPT')
plt.title('Parameter Efficiency vs. Accuracy for Different PEFT Methods')
plt.xlabel('Parameters (% of full model)')
plt.ylabel('Average Accuracy (%)')
plt.xscale('log')  # Log scale for better visualization
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

## 8. Conclusion and Key Takeaways

From our experiments and analysis, we can draw several key insights about LoRA and its applications:

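1. **Parameter Efficiency**: LoRA significantly reduces the number of trainable parameters. For large models this is typically less than 1% of the full model, which lowers memory usage (fewer gradients and optimizer states to store) and often speeds up training. The small network in Section 3 shows a much higher ratio (about 19%) because its layers are small relative to the rank.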

2. **Performance Trade-off**: Despite the drastic reduction in trainable parameters, LoRA can achieve performance comparable to full fine-tuning in many cases, especially when the rank and scaling factor are appropriately chosen.

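3. **Hyperparameter Sensitivity**: The performance of LoRA depends on the choice of hyperparameters, particularly the rank ($r$) and scaling factor ($\alpha$). A higher rank gives the update more capacity at the cost of more parameters, but it does not guarantee better accuracy, and because the update is scaled by $\alpha / r$ the two values are best tuned together.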

4. **Application Versatility**: As shown in the paper, LoRA has been successfully applied across various domains, including commonsense reasoning, medical imaging, protein modeling, and speech synthesis, with consistently low parameter overhead (around 0.8-1%).

5. **Comparative Advantage**: While newer methods like LoReFT may achieve better parameter efficiency in some tasks, LoRA offers a good balance between performance and parameter efficiency across a wide range of applications.

6. **Integration with Transformers**: LoRA is particularly well-suited for transformer-based models, as it can be selectively applied to specific attention modules (query, key, value), further optimizing the parameter-performance trade-off.

These insights align with the paper's conclusion that LoRA is one of the most effective PEFT methods, offering a good balance between computational efficiency and performance across diverse applications.

## References

1. Balne, C. C. S., Bhaduri, S., Roy, T., Jain, V., & Chadha, A. (2024). Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications. arXiv:2404.13506v2.

2. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

3. Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., & Potts, C. (2024). ReFT: Representation finetuning for language models. arXiv preprint arXiv:2404.03592.