# LoRA: Efficient Fine-Tuning

**Low-Rank Adaptation for parameter-efficient training**

## The Problem with Full Fine-Tuning

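Full fine-tuning updates every model parameter, so the full set of weights must sit in GPU memory along with their gradients and optimizer state. The weights alone already take:

| Model | Parameters | Weight memory (FP32) |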
|-------|------------|-------------------|
| GPT-2 | 124M | ~500 MB |
| GPT-2 Large | 774M | ~3 GB |
| LLaMA 7B | 7B | ~28 GB |
| LLaMA 70B | 70B | ~280 GB |

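Training with Adam adds a gradient and two moment estimates per parameter, roughly 4× the weight memory in the table before activations are counted. For large models this makes full fine-tuning impractical:
- It requires multiple expensive GPUs
- Every task needs its own full copy of the model
- It risks catastrophic forgetting of pre-trained capabilities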
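
The memory figures above follow from simple arithmetic. The sketch below reproduces them, assuming 4 bytes per FP32 value, rounded parameter counts, and an Adam-style optimizer that keeps two extra values per parameter; `fp32_memory_gb` is a helper name introduced here for illustration, and activations are ignored.

```python
def fp32_memory_gb(num_params: int, copies: int = 1) -> float:
    """Memory in GB (10^9 bytes) for `copies` FP32 tensors of `num_params` values."""
    bytes_per_value = 4
    return num_params * bytes_per_value * copies / 1e9


# Rounded parameter counts from the table above
models = {
    "GPT-2": 124_000_000,
    "GPT-2 Large": 774_000_000,
    "LLaMA 7B": 7_000_000_000,
    "LLaMA 70B": 70_000_000_000,
}

for name, n in models.items():
    weights = fp32_memory_gb(n)
    # Assumption: weights + gradients + two Adam moments = 4 copies
    training = fp32_memory_gb(n, copies=4)
    print(f"{name:<12} weights: {weights:7.1f} GB   full fine-tuning: {training:7.1f} GB")
```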

## LoRA: The Key Insight

**LoRA (Low-Rank Adaptation)** freezes the pre-trained model and adds small trainable matrices.

Instead of updating a weight matrix $W \in \mathbb{R}^{d \times k}$:

$$W_{\text{new}} = W + \Delta W$$

LoRA decomposes the update as a low-rank product:

$$W_{\text{new}} = W + BA$$

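where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d, k)$ (the rank is much smaller than either dimension)

Only $A$ and $B$ are trained, so the update costs $r(d + k)$ parameters instead of $dk$. Implementations usually scale the product by $\alpha / r$, where $\alpha$ is a fixed hyperparameter.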
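
To make the decomposition concrete, the following sketch builds a random $B$ and $A$ and checks that their product has the full $d \times k$ shape but rank at most $r$. The sizes are small illustrative values chosen here, not taken from any real model.

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4  # illustrative sizes only

# float64 keeps the numerical rank estimate well away from round-off noise
B = torch.randn(d, r, dtype=torch.float64)
A = torch.randn(r, k, dtype=torch.float64)

delta_W = B @ A  # same shape as W: (d, k)
rank = torch.linalg.matrix_rank(delta_W).item()

full_params = d * k
lora_params = r * (d + k)

print(f"delta_W shape: {tuple(delta_W.shape)}, rank: {rank}")
print(f"Full update: {full_params} values, low-rank factors: {lora_params} values")
```

The update matrix is as large as $W$, yet it is fully described by the two thin factors.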

In [None]:
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    """Low-Rank Adaptation layer."""
    
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Initialize A with Kaiming, B with zeros
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Compute the LoRA update: x @ A.T @ B.T * scaling."""
        # x: (batch, seq, in_features)
        # lora_A: (rank, in_features)
        # lora_B: (out_features, rank)
        
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return lora_out * self.scaling

# Example: Compare parameter counts
in_features, out_features = 4096, 4096
rank = 8

full_params = in_features * out_features
lora_params = (in_features * rank) + (out_features * rank)

print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (rank={rank}): {lora_params:,} parameters")
print(f"Reduction: {full_params / lora_params:.1f}x fewer parameters")
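
The choice to initialize `lora_B` with zeros means the LoRA branch contributes nothing at the start of training, so the adapted model begins identical to the pre-trained one. The standalone sketch below checks this with plain tensors rather than the class above (sizes are illustrative), and also shows that on the first backward pass only `lora_B` receives a non-zero gradient.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features, rank, alpha = 32, 16, 4, 8.0

# Same initialization scheme as LoRALayer: Kaiming for A, zeros for B
lora_A = torch.empty(rank, in_features)
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
lora_A.requires_grad_(True)
lora_B = torch.zeros(out_features, rank, requires_grad=True)

x = torch.randn(2, 5, in_features)
update = (x @ lora_A.T @ lora_B.T) * (alpha / rank)
print("Max |update| at init:", update.abs().max().item())

# The gradient w.r.t. A is multiplied by B (all zeros), so only B moves at first
update.sum().backward()
print("Sum |grad B|:", lora_B.grad.abs().sum().item())
print("Sum |grad A|:", lora_A.grad.abs().sum().item())
```

If both matrices started at zero, neither would receive a gradient and training would never leave the starting point, which is why one of the two is initialized randomly.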

## LoRA Linear Layer

In [None]:
class LoRALinear(nn.Module):
    """Linear layer with LoRA adaptation."""
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        
        # Freeze original layer
        self.original = original_layer
        for param in self.original.parameters():
            param.requires_grad = False
        
        # Add LoRA
        self.lora = LoRALayer(
            in_features=original_layer.in_features,
            out_features=original_layer.out_features,
            rank=rank,
            alpha=alpha,
            dropout=dropout
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward: original output + LoRA output."""
        return self.original(x) + self.lora(x)

# Example
original = nn.Linear(768, 768)
lora_layer = LoRALinear(original, rank=8)

# Count trainable parameters
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())

print(f"Trainable parameters: {trainable:,}")
print(f"Total parameters: {total:,}")
print(f"Trainable: {100 * trainable / total:.2f}%")
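
One way to apply a wrapper like `LoRALinear` across a whole model is to walk its modules and swap each `nn.Linear` for a wrapped version. The sketch below is a simplified illustration, not how PEFT does it internally: `TinyLoRALinear` is a hypothetical compact stand-in for the class above (no dropout), and `add_lora` is a helper name introduced here.

```python
import torch
import torch.nn as nn


class TinyLoRALinear(nn.Module):
    """Hypothetical compact stand-in for LoRALinear (no dropout)."""

    def __init__(self, original: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.original = original
        for p in self.original.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(rank, original.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(original.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.original(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


def add_lora(module: nn.Module, rank: int = 4) -> None:
    """Recursively replace every nn.Linear child with a LoRA-wrapped version."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, TinyLoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank)


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
x = torch.randn(4, 16)
before = model(x)

add_lora(model)
after = model(x)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print("Outputs unchanged at init:", torch.allclose(before, after))
print("Trainable tensors:", trainable)
```

Only the `lora_A` and `lora_B` tensors remain trainable, and the model's outputs are unchanged until training updates `lora_B`.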

## Using PEFT Library

In practice, use Hugging Face's PEFT library rather than a hand-written LoRA layer:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Base model parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Note: Requires `pip install peft`
try:
    from peft import LoraConfig, get_peft_model, TaskType
    
    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                    # Rank
        lora_alpha=16,          # Scaling factor
        lora_dropout=0.1,       # Dropout
        target_modules=["c_attn", "c_proj"],  # Which layers to adapt
        bias="none",            # Don't train biases
        fan_in_fan_out=True     # GPT-2's Conv1D stores weights as (in, out)
    )
    
    # Apply LoRA to model
    peft_model = get_peft_model(model, lora_config)
    
    # Print trainable parameters
    peft_model.print_trainable_parameters()
    
except ImportError:
    print("PEFT not installed. Install with: pip install peft")
    print("\nFor now, here's roughly what the output would look like:")
    print("trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475")
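
The trainable parameter count for this configuration can be estimated by hand. The sketch below assumes GPT-2 small's published dimensions (12 blocks, hidden size 768) and assumes that the `"c_proj"` target matches both the attention output projection and the MLP output projection in each block, since both modules carry that name. The dictionary keys are labels chosen for readability.

```python
# GPT-2 small dimensions (assumed from the public model config)
n_layers, hidden = 12, 768
rank = 8

# (in_features, out_features) of each adapted projection per transformer block
targets = {
    "attn.c_attn": (hidden, 3 * hidden),  # fused Q, K, V projection
    "attn.c_proj": (hidden, hidden),      # attention output projection
    "mlp.c_proj": (4 * hidden, hidden),   # MLP output projection
}


def lora_params(in_f: int, out_f: int, r: int) -> int:
    """Parameters in A (r x in_f) plus B (out_f x r)."""
    return r * (in_f + out_f)


per_block = {name: lora_params(i, o, rank) for name, (i, o) in targets.items()}
total = n_layers * sum(per_block.values())
c_attn_only = n_layers * per_block["attn.c_attn"]

for name, count in per_block.items():
    print(f"{name:<12} {count:>7,} per block")
print(f"c_attn only: {c_attn_only:,}")
print(f"c_attn + both c_proj layers: {total:,}")
```

Targeting `c_attn` alone gives 294,912 trainable parameters; adding both `c_proj` layers raises that to 811,008.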

## LoRA Hyperparameters

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
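| `r` (rank) | Inner dimension of the low-rank matrices | 4, 8, 16, 32 |
| `alpha` (`lora_alpha` in PEFT) | Scaling numerator; the update is multiplied by `alpha / r` | 16, 32 (often 2×r) |
| `dropout` (`lora_dropout` in PEFT) | Dropout applied to the input of the LoRA branch | 0.05-0.1 |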
| `target_modules` | Which layers to adapt | Query, Key, Value projections |

**Guidelines:**
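- Higher rank gives the update more capacity at the cost of more trainable parameters
- Start with r=8 and increase it if the model underfits
- Attention projections (Q, K, V) are the usual first targets; in GPT-2 they are fused into a single `c_attn` layer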
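
To see what the rank trade-off looks like in numbers, the sketch below tabulates the cost of adapting one square 4096×4096 projection at each typical rank, using the `alpha = 2 × r` convention from the table. `lora_cost` is a helper name introduced here, and the hidden size is an illustrative choice.

```python
def lora_cost(r: int, d: int, k: int) -> tuple[int, float]:
    """Return (trainable parameters, fraction of the full d x k matrix)."""
    params = r * (d + k)
    return params, params / (d * k)


hidden = 4096  # illustrative square projection

for r in (4, 8, 16, 32):
    alpha = 2 * r
    params, fraction = lora_cost(r, hidden, hidden)
    print(
        f"r={r:>2}  alpha={alpha:>2}  scaling={alpha / r:.1f}  "
        f"params={params:>7,}  ({fraction:.2%} of full)"
    )
```

Parameter count grows linearly with `r`, while tying `alpha` to `2 × r` keeps the effective scaling constant as the rank changes.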

## Merging LoRA Weights

After training, you can merge LoRA weights into the base model for inference:

In [None]:
def merge_lora_weights(original_weight, lora_A, lora_B, scaling):
    """
    Merge LoRA weights into original weight matrix.
    
    W_merged = W + (B @ A) * scaling
    """
    delta_W = (lora_B @ lora_A) * scaling
    return original_weight + delta_W

# Example
d, k, r = 768, 768, 8
alpha = 16
scaling = alpha / r

W = torch.randn(d, k)  # Original weights
A = torch.randn(r, k)  # LoRA A
B = torch.randn(d, r)  # LoRA B

W_merged = merge_lora_weights(W, A, B, scaling)

print(f"Original W shape: {W.shape}")
print(f"Merged W shape: {W_merged.shape}")
print("\nAfter merging, inference runs at the same speed as the original model.")
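
Merging is only safe if the merged weight produces the same outputs as the two-branch forward pass. The sketch below checks that equivalence on random tensors and shows that subtracting the same update recovers the original weight. Sizes and values are illustrative, and float64 is used so the comparison is not blurred by rounding.

```python
import torch

torch.manual_seed(0)
d, k, r = 32, 24, 4
scaling = 8.0 / r  # alpha / r with an illustrative alpha of 8

W = torch.randn(d, k, dtype=torch.float64)  # frozen base weight
A = torch.randn(r, k, dtype=torch.float64)  # LoRA A
B = torch.randn(d, r, dtype=torch.float64)  # LoRA B
x = torch.randn(5, k, dtype=torch.float64)  # a batch of 5 inputs

# Two-branch forward pass: base output plus scaled LoRA output
unmerged = x @ W.T + (x @ A.T @ B.T) * scaling

# Single matmul with the merged weight
W_merged = W + (B @ A) * scaling
merged = x @ W_merged.T

# Unmerging subtracts the same update
W_restored = W_merged - (B @ A) * scaling

print("Same outputs:", torch.allclose(unmerged, merged))
print("Base weight recovered:", torch.allclose(W, W_restored))
```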

## Benefits of LoRA

1. **Memory Efficient**: only the small matrices need gradients and optimizer state
2. **Fast Training**: far fewer weight gradients to compute and store
3. **Modular**: one base model can serve many tasks by swapping small adapters
4. **Preserves Base Model**: frozen weights reduce the risk of catastrophic forgetting
5. **Easy Deployment**: merged weights add zero inference overhead

## Next Steps

Now that we've covered SFT (including LoRA), let's move on to Reward Modeling — training models to predict human preferences.