# Parameter-Efficient Fine-Tuning (PEFT) Methods

Parameter-Efficient Fine-Tuning (PEFT) refers to a suite of techniques for fine-tuning models in more efficient, "scrappier" ways. The core insight behind these methods is that a model's behavior can usually be adjusted by updating a **much, much smaller** subset of parameters than traditional fine-tuning touches.

In this notebook, we'll dive deep into the most important parameter-efficient fine-tuning techniques for large transformer-based models. We'll explore LoRA, DoRA, and Adapter methods, understanding both their theoretical foundations and practical implementations.

------

#### Why do we need PEFT?
In a typical fine-tuning paradigm, we would update all parameters of a pre-trained model. For a 7B-parameter model like Llama-2, for instance, this would mean storing gradients and optimizer states for *all 7 billion parameters*, requiring massive GPU memory and computational resources that most practitioners simply don't have access to.
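
To put a rough number on that memory cost, here is a back-of-the-envelope sketch. The byte counts are assumptions rather than measurements: it assumes a common mixed-precision Adam setup (16-bit weights and gradients, plus an fp32 master copy of the weights and two fp32 moment buffers) and ignores activations and framework overhead entirely. `full_finetune_memory_gb` is a hypothetical helper written for this illustration.

```python
def full_finetune_memory_gb(n_params: float,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            optimizer_bytes: int = 12) -> float:
    """Rough memory (in GB) for weights + gradients + Adam optimizer states.

    Assumed bytes per parameter:
      - 2 for 16-bit weights, 2 for 16-bit gradients
      - 12 for Adam in mixed precision: fp32 master weights (4)
        + first moment (4) + second moment (4)
    Activations and framework overhead are NOT included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1e9

estimate_7b = full_finetune_memory_gb(7e9)
print(f"~{estimate_7b:.0f} GB before activations")
```

Under these assumptions, a 7B model needs on the order of 112 GB (7 billion parameters times 16 bytes) before a single activation is stored, which is more than any single consumer GPU offers.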

PEFT methods solve this by updating only a small subset of parameters (often <1% of the total!) - yet they still achieve performance comparable to full fine-tuning. These methods take advantage of the fact that most of a large model's knowledge and abilities **doesn't** need to be changed or adjusted to adapt to new tasks.

PEFT is an umbrella term that covers many specific methods, each a different way of doing parameter-efficient fine-tuning. We can't cover nearly all of them in this notebook, but we'll go in depth on the most widely used, including:

- **Low-Rank Adaptation (LoRA)** - The foundation of modern PEFT that uses matrix decomposition
- **Weight-Decomposed Low-Rank Adaptation (DoRA)** - A more sophisticated approach that separates magnitude and direction
- **Adapter Methods** - The original PEFT approach using small bottleneck layers

We'll also touch on some others, giving more brief descriptions.

---

To the best of my ability, this will include practical implementations with real code examples, along with guidance on when to use each method and how to choose the right approach for a given use case.

## 0. Basic Setup

Let's start by preparing everything we'll need to run the code in this notebook.

### Installs + Imports

We'll set this up to run on Google Colab, which will need some additional installs. Below, we also import all packages needed for this code.

In [1]:
# Core ML libraries
# Version specifiers are quoted so the shell doesn't treat ">" as an output redirect
!pip install -q "transformers>=4.35.0"
!pip install -q "accelerate>=0.24.0"
!pip install -q "datasets>=2.14.0"

# PEFT libraries - we'll use the official PEFT library and also implement from scratch
!pip install -q "peft>=0.7.0"
!pip install -q "bitsandbytes>=0.41.0"  # For efficient quantization if needed

# Visualization and analysis
!pip install -q matplotlib seaborn
!pip install -q plotly  # For interactive plots

# Optional: for more advanced examples
!pip install -q wandb  # For experiment tracking (optional)

In [4]:
# Now let's import everything we'll need
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel
)
from datasets import Dataset
import math
from typing import Optional, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')

In [5]:
# Check if we have CUDA available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🔧 Using device: {device}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


🔧 Using device: cpu


## 1. **Method 1 -** LoRA (Low-Rank Adaptation)

The first method that we'll cover is LoRA, which relies on applying low-rank updates to the weight matrices that make up a model.

### **The Background**
LoRA is based on a simple but powerful insight: when we fine-tune a pre-trained model, the weight updates often have low "intrinsic rank".

That is to say, the number of independent directions of change that actually matter in any given update matrix $\Delta W = W_{\text{fine-tuned}} - W_{\text{original}}$ is small. Even though $\Delta W$ might be a huge matrix (say 4096×4096), most of the meaningful change during fine-tuning is captured by far fewer dimensions.

----

So, where did this idea come from?



#### 1. **Prior Theoretical Work on Intrinsic Dimensionality**
Before LoRA, several papers established that neural networks often have much lower "intrinsic dimensionality" than their parameter count suggests. In particular, **Li et al. (2018)** and **Aghajanyan et al. (2020)** both investigated training within much smaller subspaces of these models and found that adaptation still succeeded there, showing that the *effective* number of parameters needed for adaptation might be much smaller than the total parameter count.

#### 2. **The Key Empirical Discovery**
The LoRA authors **(Hu et al., 2021)** took this insight and attempted to measure the intrinsic rank of the weight updates during fine-tuning.

To do this, they:
- Fine-tuned GPT-3 on various tasks using full fine-tuning
- Computed the weight update matrices $\Delta W = W_{\text{fine-tuned}} - W_{\text{original}}$
- Performed Singular Value Decomposition (SVD) on these update matrices and analyzed how the values were distributed

#### 3. **The Findings**
What they found was that the weight update matrices had **very low effective ranks.** In fact,

- Most singular values were tiny (close to zero)
- Only a small number of singular values (often < 100) contained most of the "signal"
- And it held across different model sizes, tasks, and layers

This result was neither mathematically guaranteed nor obvious. Fine-tuning could theoretically require complex, high-dimensional changes. And while over-parameterization suggested *some* redundancy, it was not known that fine-tuning updates would be *universally* low-rank in a task-, model-, and optimization-independent way.
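
The kind of singular-value analysis described above can be sketched on a synthetic matrix. This is a toy illustration, not a reproduction of any paper's experiment: `delta_w` is deliberately built as a rank-8 matrix plus small noise, and `effective_rank` is a hypothetical helper that counts how many singular values are needed to capture a chosen fraction of the squared spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "update": a rank-8 matrix plus small dense noise (assumed toy setup)
d, k, true_rank = 256, 256, 8
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
delta_w += 0.01 * rng.normal(size=(d, k))

def effective_rank(matrix: np.ndarray, energy: float = 0.99) -> int:
    """Smallest number of singular values whose squared sum reaches
    `energy` of the total squared spectrum (illustrative helper)."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(cumulative, energy) + 1)

print(f"Matrix shape: {delta_w.shape}")
print(f"Effective rank at 99% energy: {effective_rank(delta_w)}")
```

Even though the matrix has 256 × 256 entries and is technically full-rank because of the noise, almost all of its "signal" sits in a handful of singular values. That is the pattern the low-rank hypothesis predicts for real fine-tuning updates.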

#### **The Mathematical Intuition**

Of course, there are several fundamental reasons why this low-rank structure might emerge and make good mathematical sense.

##### **1. _The Over-Parameterization Hypothesis_**
It's been hypothesized that large language models are massively over-parameterized for any *single* task. This over-parameterization means that many parameters are redundant for the specific adaptation task, and that changes to different parameters become highly correlated.

##### **2. _Feature Reuse and Composition_**
Pre-trained models already contain rich, hierarchical feature representations. So fine-tuning typically can just rely on:

- **Reweighting existing features** rather than learning new ones from scratch
- **Combining existing patterns** in new ways rather than creating entirely new patterns
- **Adjusting decision boundaries** rather than learning new feature detectors

This also means adaptations can be expressed as linear combinations of *existing* feature directions.

##### **3. _Task Similarity_**
Most fine-tuning tasks share underlying structure with the pre-training objective. For instance, any new language task will involve:

- language understanding and generation
- use of common syntactic and semantic patterns

The main difference is often in *style* or *domain* rather than fundamental capabilities, so the adaptation primarily involves adjusting the "mixing weights" of existing capabilities rather than learning entirely new ones.

##### **4. _Gradient Flow and Optimization Dynamics_**
During fine-tuning, gradients tend to flow along the directions that were already "activated" during pre-training, meaning that:

1. Parameters that were important for pre-training are more likely to be updated
2. Updates tend to be correlated across layers (if one layer needs to change, related layers need complementary changes)

This type of optimization will naturally find low-dimensional paths through the parameter space.

### **The Core Mathematical Idea**

Now that we understand *why* updates might only need to be low-rank, let's talk about *how* we do low-rank updates in practice.

Instead of updating the full weight matrix $W \in \mathbb{R}^{d \times k}$ during fine-tuning, LoRA represents the update as:

$$h = W_0 x + \Delta W x = W_0 x + BA x$$

Here:
- $W_0$ is the original **frozen** pre-trained weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable **low-rank matrices**
- $r \ll \min(d,k)$ is the rank of these matrices (typically 1-256)
- $\Delta W = BA$ represents the low-rank update
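
A quick shape check of these definitions, as a minimal sketch. The dimensions are shrunk to 128 so it runs instantly, and both factors are filled with random values purely to inspect shape and rank (this says nothing about how LoRA initializes them).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 128, 128, 4          # small stand-ins for 4096, 4096, 4

B = rng.normal(size=(d, r))    # d x r
A = rng.normal(size=(r, k))    # r x k
delta_w = B @ A                # d x k: same shape as the frozen weight

print(f"ΔW shape: {delta_w.shape}")
print(f"ΔW rank: {np.linalg.matrix_rank(delta_w)}")
print(f"Values stored: {B.size + A.size:,} (B and A) vs {delta_w.size:,} (dense ΔW)")
```

The product has the full $d \times k$ shape, but its rank can never exceed $r$, and only the two thin factors need to be stored and trained.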

> Remember, if B is a {4096 x 4} matrix, and A is a {4 x 4096} matrix (rank 4), the *size of the update* to W is still {4096 x 4096} !

---

With this update, the number of trainable parameters drops from $d \times k$ to $(d + k) \times r$.

For a typical transformer layer with $d=k=4096$ and $r=64$, that means:
- **Full fine-tuning**: $4096 \times 4096 \approx 16.8\text{M}$ parameters
- **LoRA**: $(4096 + 4096) \times 64 \approx 524\text{K}$ parameters

That's about 97% fewer trainable parameters!
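
The arithmetic above is easy to reproduce for other ranks. A minimal sketch, where `lora_param_counts` is a hypothetical helper covering a single $d \times k$ weight matrix (biases and other layers are ignored):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    full = d * k
    lora = (d + k) * r   # B is d x r, A is r x k
    return full, lora

for r in (4, 16, 64, 256):
    full, lora = lora_param_counts(4096, 4096, r)
    reduction = 100 * (1 - lora / full)
    print(f"r={r:>3}: {lora:>9,} trainable vs {full:,} ({reduction:.1f}% fewer)")
```

Because the LoRA count grows linearly in $r$, doubling the rank doubles the number of trainable parameters, so the savings shrink as $r$ approaches $\min(d, k)$.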


### Initialization Strategy (Critical!)

LoRA uses a specific initialization strategy that's crucial for training stability:
- Matrix $A$ is initialized with small random Gaussian values: $A \sim \mathcal{N}(0, \sigma^2)$
- Matrix $B$ is initialized to **zero**: $B = 0$
- This ensures $\Delta W = BA = 0$ at initialization

This means training starts with the exact original pre-trained model behavior, then gradually learns the adaptation. A scaling factor $\alpha/r$ is often applied to control the magnitude of updates:

$$h = W_0 x + \frac{\alpha}{r} BA x$$

Where $\alpha$ is a hyperparameter (often 16 or 32) that controls how much the LoRA adaptation affects the original model.
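
One consequence of this initialization is worth seeing directly: with $B = 0$, the adapted layer starts out identical to the frozen one, and on the very first backward pass only $B$ receives a non-zero gradient (the gradient for $A$ is multiplied by $B$, which is zero). The sketch below uses tiny, arbitrary dimensions and an arbitrary squared-output loss; all names are illustrative.

```python
import torch

torch.manual_seed(0)
d, k, r, alpha = 8, 6, 2, 4.0

W0 = torch.randn(d, k)                             # frozen pre-trained weight
A = (0.01 * torch.randn(r, k)).requires_grad_()    # small Gaussian init
B = torch.zeros(d, r, requires_grad=True)          # zero init

x = torch.randn(3, k)                              # batch of 3 inputs
base = x @ W0.T                                    # W0 x
h = base + (alpha / r) * (x @ A.T @ B.T)           # W0 x + (alpha/r) B A x

# At initialization the adapted output matches the frozen layer
print(f"Matches frozen layer: {torch.allclose(h, base)}")

# Backpropagate an arbitrary scalar loss to inspect first-step gradients
h.pow(2).sum().backward()
print(f"grad norm for A: {A.grad.norm().item():.4f}")   # zero because B = 0
print(f"grad norm for B: {B.grad.norm().item():.4f}")   # non-zero
```

After the first optimizer step, $B$ is no longer zero, so $A$ starts receiving gradients too. If both matrices were initialized to zero, neither would ever get a gradient, which is why only one of them is zeroed.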

### The Mathematical Intuition

In linear algebra terms, if most of the "action" in $\Delta W$ happens along just $r$ directions (where $r$ is small), then we can write:

$$\Delta W = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

This is exactly a low-rank decomposition: the first $r$ terms of the singular value decomposition of $\Delta W$. The $u_i$ and $v_i$ are the key directions of change, and the $\sigma_i$ are their strengths. LoRA approximates this by learning $B$ and $A$ such that $\Delta W \approx BA$, where $B$ captures the output directions and $A$ captures the input directions of these changes.
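
To connect the sum of rank-1 terms to the $BA$ form, the sketch below factors a random stand-in matrix with NumPy's SVD and folds the singular values into $B$. This only illustrates the algebra: LoRA itself learns $B$ and $A$ by gradient descent and never computes an SVD, and the matrix here is random rather than a real fine-tuning update.

```python
import numpy as np

rng = np.random.default_rng(0)
delta_w = rng.normal(size=(64, 48))      # stand-in for a dense update matrix
r = 8

# Thin SVD: delta_w = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)

# Keep the top-r terms and fold the singular values into B
B = U[:, :r] * S[:r]      # (64, r): output directions scaled by their strengths
A = Vt[:r, :]             # (r, 48): input directions

approx = B @ A
rank1_sum = sum(S[i] * np.outer(U[:, i], Vt[i]) for i in range(r))

print(f"Sum of rank-1 terms equals B @ A: {np.allclose(approx, rank1_sum)}")
rel_error = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
print(f"Relative error of the rank-{r} approximation: {rel_error:.3f}")
```

For a random Gaussian matrix like this one the rank-8 approximation is poor, because its singular values decay slowly. The low-rank hypothesis is precisely the claim that real fine-tuning updates behave differently, with a few dominant $\sigma_i$ carrying most of the change.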

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional

class LoRALayer(nn.Module):
    """
    A Low-Rank Adaptation layer that can wrap any linear layer.

    This implementation shows the core LoRA concept:
    - Keep original weights frozen
    - Add low-rank adaptation via two smaller matrices
    - Apply scaling to control adaptation strength
    """

    def __init__(self,
                 original_layer: nn.Linear,
                 rank: int = 4,
                 alpha: float = 32.0,
                 dropout: float = 0.1):
        super().__init__()

        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank  # This is the α/r scaling factor

        # Get dimensions from the original layer
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # Freeze the original layer
        self.original_layer = original_layer
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # Create the low-rank matrices A and B
        # A: (rank, in_features) - initialized with small random values
        # B: (out_features, rank) - initialized to zero
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Optional dropout for regularization
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

    def forward(self, x):
        # Original computation: W₀x
        original_output = self.original_layer(x)

        # LoRA computation: (α/r) * B * A * x
        # We compute this as: (α/r) * B * (A * x) for efficiency
        lora_output = self.dropout(x) @ self.lora_A.T  # (batch, rank)
        lora_output = lora_output @ self.lora_B.T      # (batch, out_features)
        lora_output = lora_output * self.scaling       # Apply α/r scaling

        # Combine: W₀x + (α/r)BAx
        return original_output + lora_output

    def get_delta_weights(self):
        """
        Returns the actual ΔW = (α/r)BA matrix for analysis
        """
        return self.scaling * (self.lora_B @ self.lora_A)

# Let's test this with a simple example
print("🔧 Creating a test linear layer and its LoRA version...")

# Original linear layer
original = nn.Linear(512, 256)
print(f"Original layer parameters: {sum(p.numel() for p in original.parameters()):,}")

# LoRA version
lora_layer = LoRALayer(original, rank=16, alpha=32)
trainable_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in lora_layer.parameters())

print(f"LoRA trainable parameters: {trainable_params:,}")
print(f"LoRA total parameters: {total_params:,}")
print(f"Parameter reduction: {(1 - trainable_params/total_params)*100:.1f}%")

# Test forward pass
x = torch.randn(32, 512)  # batch_size=32, input_dim=512
output = lora_layer(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

# Show that ΔW starts at zero (due to B initialization)
delta_w = lora_layer.get_delta_weights()
print(f"Initial ΔW norm: {delta_w.norm().item():.6f} (should be ~0)")

🔧 Creating a test linear layer and its LoRA version...
Original layer parameters: 131,328
LoRA trainable parameters: 12,288
LoRA total parameters: 143,616
Parameter reduction: 91.4%
Input shape: torch.Size([32, 512])
Output shape: torch.Size([32, 256])
Initial ΔW norm: 0.000000 (should be ~0)
