# LoRA (Low-Rank Adaptation) Implementation from Scratch

## Educational Notebook - Minimal Resource Requirements

This notebook implements LoRA from first principles using the smallest GPT-2 checkpoint (GPT-2 small, 124M parameters).

**Resource Requirements:**
- RAM: 4GB minimum
- GPU: Optional (works on CPU)
- Storage: ~500MB

**What is LoRA?**

LoRA freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer. Instead of fine-tuning all parameters, we only train small matrices that are added to the original weights.

**Key Idea:**
```
Original: W ∈ R^(d×k)
LoRA: W' = W + BA where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)
```

This reduces trainable parameters from d×k to (d+k)×r.

In [None]:
# Install required packages
!pip install -q transformers torch datasets accelerate

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import math
import numpy as np
from typing import Optional, List
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

## Part 1: LoRA Layer Implementation

We'll implement LoRA as a wrapper around existing Linear layers.
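
As a quick sanity check of this design, here is a minimal sketch showing that a zero-initialized `B` makes the LoRA update exactly zero before training, so a wrapped layer initially reproduces the frozen layer's output. The sizes and variable names are illustrative assumptions, not values used later in the notebook.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch only)
in_features, out_features, rank, alpha = 32, 48, 4, 8

base = nn.Linear(in_features, out_features)

# A gets a random Kaiming-uniform init, B starts at zero
lora_A = torch.empty(rank, in_features)
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
lora_B = torch.zeros(out_features, rank)

x = torch.randn(3, in_features)
with torch.no_grad():
    base_out = base(x)
    # LoRA update: scaling * x @ A^T @ B^T
    delta = (x @ lora_A.T @ lora_B.T) * (alpha / rank)
    adapted_out = base_out + delta

print(delta.abs().max().item())            # 0.0
print(torch.equal(adapted_out, base_out))  # True
```

Training then moves `B` away from zero, which is what lets the update grow from the pre-trained behavior rather than from a random perturbation.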

In [None]:
class LoRALayer(nn.Module):
    """
    LoRA implementation for a linear layer.
    
    Args:
        in_features: Input dimension
        out_features: Output dimension  
        rank: Rank of the low-rank decomposition (r)
        alpha: Scaling numerator; the update is multiplied by alpha / rank
               (commonly alpha = rank or alpha = 2 * rank)
        dropout: Dropout probability
    """
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16,
        dropout: float = 0.1
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # LoRA matrices: W' = W + BA
        # B: (out_features, rank) - initialized to zeros
        # A: (rank, in_features) - initialized with Kaiming uniform
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Dropout for regularization
        self.dropout = nn.Dropout(p=dropout) if dropout > 0 else nn.Identity()
        
        # Initialize A with Kaiming uniform (same as nn.Linear)
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B is initialized to zero, so initially W' = W (no change)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Compute only the LoRA update: scaling * dropout(x) @ A^T @ B^T.
        The frozen base output (x @ W^T + bias) is added by LinearWithLoRA.
        
        Args:
            x: Input tensor of shape (..., in_features)
            
        Returns:
            Output tensor of shape (..., out_features)
        """
        # Compute LoRA adaptation: (B @ A) with scaling
        lora_output = (self.dropout(x) @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return lora_output
    
    def extra_repr(self) -> str:
        return f'in_features={self.in_features}, out_features={self.out_features}, rank={self.rank}, alpha={self.alpha}'


class LinearWithLoRA(nn.Module):
    """
    Wrapper that combines a frozen Linear layer with LoRA adaptation.
    """
    def __init__(
        self,
        linear: nn.Linear,
        rank: int = 8,
        alpha: float = 16,
        dropout: float = 0.1
    ):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features,
            linear.out_features,
            rank=rank,
            alpha=alpha,
            dropout=dropout
        )
        
        # Freeze the original linear layer
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output + LoRA adaptation
        return self.linear(x) + self.lora(x)


# Test the LoRA layer
print("Testing LoRA Layer:")
print("=" * 50)

# Create a simple linear layer
linear = nn.Linear(768, 768)
print(f"Original Linear layer parameters: {sum(p.numel() for p in linear.parameters()):,}")

# Wrap with LoRA
lora_linear = LinearWithLoRA(linear, rank=8, alpha=16)
trainable_params = sum(p.numel() for p in lora_linear.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in lora_linear.parameters())

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters (LoRA): {trainable_params:,}")
print(f"Percentage trainable: {100 * trainable_params / total_params:.2f}%")
print(f"Parameter reduction: {total_params / trainable_params:.1f}x fewer trainable params")

# Test forward pass
x = torch.randn(2, 10, 768)  # (batch, seq_len, hidden_dim)
output = lora_linear(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")

## Part 2: Apply LoRA to Pre-trained Model

Now we'll apply LoRA to specific layers in a GPT-2 model. Typically, we target:
- Query and Value projection matrices in attention
- Sometimes Key projections
- Optionally, feed-forward layers
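
One way to select target layers is to match on the last component of each module's dotted name. The sketch below uses a hypothetical toy model (`ToyAttention`, `ToyBlock`) whose attribute names merely mimic GPT-2's `c_attn` and `c_proj`; it is not the real GPT-2 module tree.

```python
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical attention block with a fused QKV projection."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.c_attn = nn.Linear(dim, 3 * dim)  # fused query/key/value
        self.c_proj = nn.Linear(dim, dim)      # output projection


class ToyBlock(nn.Module):
    """Hypothetical transformer block: attention plus a small MLP."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.attn = ToyAttention(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))


def find_targets(model: nn.Module, target_names: set) -> list:
    """Return dotted names of modules whose final name component is a target."""
    return [name for name, _ in model.named_modules()
            if name.split('.')[-1] in target_names]


toy = nn.ModuleList([ToyBlock() for _ in range(2)])
print(find_targets(toy, {'c_attn'}))
# ['0.attn.c_attn', '1.attn.c_attn']
print(find_targets(toy, {'c_attn', 'c_proj'}))
# ['0.attn.c_attn', '0.attn.c_proj', '1.attn.c_attn', '1.attn.c_proj']
```

Matching on the final component avoids accidental hits on nested children, which a plain substring test on the full path could produce.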

In [None]:
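from transformers.pytorch_utils import Conv1D  # GPT-2 stores its projections as Conv1D, not nn.Linear


def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    """Build an nn.Linear equivalent to a Hugging Face Conv1D layer."""
    in_features, out_features = conv.weight.shape  # Conv1D weight is (in, out)
    linear = nn.Linear(in_features, out_features,
                       device=conv.weight.device, dtype=conv.weight.dtype)
    with torch.no_grad():
        linear.weight.copy_(conv.weight.T)  # nn.Linear weight is (out, in)
        linear.bias.copy_(conv.bias)
    return linear


def apply_lora_to_model(
    model: nn.Module,
    target_modules: List[str] = ['c_attn'],  # GPT-2 uses 'c_attn' for the fused QKV projection
    rank: int = 8,
    alpha: float = 16,
    dropout: float = 0.1
) -> nn.Module:
    """
    Freeze the model and apply LoRA to the specified modules.
    
    Args:
        model: Pre-trained model
        target_modules: Attribute names of the modules to wrap (e.g. 'c_attn')
        rank: LoRA rank
        alpha: LoRA alpha
        dropout: Dropout probability
    """
    # Freeze every base parameter; only the LoRA matrices added below are trained
    for param in model.parameters():
        param.requires_grad = False
    
    # Snapshot the module list so replacing children does not affect the traversal
    for name, module in list(model.named_modules()):
        attr_name = name.split('.')[-1]
        if attr_name not in target_modules:
            continue
        if isinstance(module, Conv1D):
            module = conv1d_to_linear(module)
        if isinstance(module, nn.Linear):
            # Get parent module and replace the child with its LoRA version
            parent_name = '.'.join(name.split('.')[:-1])
            parent = model.get_submodule(parent_name) if parent_name else model
            lora_layer = LinearWithLoRA(module, rank=rank, alpha=alpha, dropout=dropout)
            setattr(parent, attr_name, lora_layer)
            print(f"Applied LoRA to: {name}")
    
    return model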


def count_parameters(model: nn.Module) -> dict:
    """
    Count total and trainable parameters in the model.
    """
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    return {
        'total': total_params,
        'trainable': trainable_params,
        'frozen': total_params - trainable_params,
        'trainable_pct': 100 * trainable_params / total_params
    }


# Load GPT-2 small (the smallest GPT-2 checkpoint)
print("Loading GPT-2 small model...")
model_name = "gpt2"  # 124M parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

# Print original model stats
print("\nOriginal Model:")
print("=" * 50)
stats = count_parameters(model)
print(f"Total parameters: {stats['total']:,}")

# Apply LoRA to attention layers
print("\nApplying LoRA to model...")
print("=" * 50)
model = apply_lora_to_model(
    model,
    target_modules=['c_attn'],  # Apply to QKV projections in GPT-2
    rank=8,
    alpha=16,
    dropout=0.1
)

# Print LoRA model stats
print("\nLoRA Model:")
print("=" * 50)
stats = count_parameters(model)
print(f"Total parameters: {stats['total']:,}")
print(f"Trainable parameters: {stats['trainable']:,}")
print(f"Frozen parameters: {stats['frozen']:,}")
print(f"Trainable percentage: {stats['trainable_pct']:.2f}%")
print(f"Parameter reduction: {stats['total'] / stats['trainable']:.1f}x")

model = model.to(device)
print(f"\nModel moved to: {device}")

## Part 3: Prepare Training Data

We'll create a simple dataset for demonstration. For real applications, use larger datasets like WikiText or your custom data.

In [None]:
class SimpleTextDataset(Dataset):
    """
    Simple text dataset for causal language modeling.
    """
    def __init__(self, texts: List[str], tokenizer, max_length: int = 128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.encodings = []
        
        for text in texts:
            # Tokenize and truncate
            encoded = tokenizer(
                text,
                max_length=max_length,
                truncation=True,
                padding='max_length',
                return_tensors='pt'
            )
            self.encodings.append(encoded)
    
    def __len__(self):
        return len(self.encodings)
    
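    def __getitem__(self, idx):
        item = {key: val[0] for key, val in self.encodings[idx].items()}
        # For causal LM, labels are the input_ids (the model shifts them internally)
        labels = item['input_ids'].clone()
        # Exclude padding from the loss; -100 is the ignore index used by the LM loss.
        # The pad token equals the EOS token here, so the attention mask is the reliable signal.
        labels[item['attention_mask'] == 0] = -100
        item['labels'] = labels
        return item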


# Create a small synthetic dataset for demonstration
# In practice, use a real dataset like WikiText, C4, or your domain-specific data
train_texts = [
    "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
    "Deep learning uses neural networks with multiple layers to learn complex patterns.",
    "Natural language processing helps computers understand and generate human language.",
    "Transformers revolutionized NLP with their attention mechanism and parallel processing.",
    "Transfer learning allows models to leverage knowledge from pre-training on large datasets.",
    "Fine-tuning adapts pre-trained models to specific downstream tasks efficiently.",
    "LoRA reduces the number of trainable parameters during fine-tuning significantly.",
    "Parameter-efficient fine-tuning methods enable training on limited computational resources.",
    "Attention mechanisms allow models to focus on relevant parts of the input sequence.",
    "Self-supervised learning has become crucial for training large language models.",
    "The GPT architecture uses decoder-only transformers for autoregressive text generation.",
    "Embedding layers convert discrete tokens into continuous vector representations.",
    "Positional encodings help transformers understand the order of tokens in sequences.",
    "Layer normalization stabilizes training in deep neural networks.",
    "Dropout prevents overfitting by randomly disabling neurons during training.",
    "The Adam optimizer adapts learning rates for each parameter independently.",
    "Gradient clipping prevents exploding gradients in deep networks.",
    "Batch normalization normalizes activations across mini-batches.",
    "Residual connections allow gradients to flow through very deep networks.",
    "Multi-head attention enables models to attend to different representation subspaces.",
] * 5  # Repeat to get more training examples

# Create dataset and dataloader
train_dataset = SimpleTextDataset(train_texts, tokenizer, max_length=64)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

print(f"Training samples: {len(train_dataset)}")
print(f"Batches per epoch: {len(train_loader)}")

# Show a sample
sample = train_dataset[0]
print(f"\nSample input shape: {sample['input_ids'].shape}")
print(f"Decoded text: {tokenizer.decode(sample['input_ids'], skip_special_tokens=True)[:100]}...")

## Part 4: Training Loop

Implement a simple training loop to fine-tune the model with LoRA.

In [None]:
def train_lora(
    model: nn.Module,
    train_loader: DataLoader,
    num_epochs: int = 3,
    learning_rate: float = 3e-4,
    weight_decay: float = 0.01
):
    """
    Train the model with LoRA.
    """
    # Only optimize LoRA parameters
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad],
        lr=learning_rate,
        weight_decay=weight_decay
    )
    
    model.train()
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
        
        for batch_idx, batch in enumerate(train_loader):
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            # Update weights
            optimizer.step()
            
            total_loss += loss.item()
            num_batches += 1
            
            # Print progress
            if (batch_idx + 1) % 5 == 0:
                avg_loss = total_loss / num_batches
                print(f"Epoch [{epoch+1}/{num_epochs}], "
                      f"Batch [{batch_idx+1}/{len(train_loader)}], "
                      f"Loss: {loss.item():.4f}, "
                      f"Avg Loss: {avg_loss:.4f}")
        
        epoch_loss = total_loss / num_batches
        print(f"\nEpoch {epoch+1} completed. Average Loss: {epoch_loss:.4f}\n")
        print("=" * 50)


# Train the model
print("Starting LoRA training...")
print("=" * 50)
train_lora(model, train_loader, num_epochs=3, learning_rate=3e-4)

## Part 5: Text Generation and Evaluation

Test the fine-tuned model by generating text.

In [None]:
def generate_text(
    model: nn.Module,
    tokenizer,
    prompt: str,
    max_length: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9
) -> str:
    """
    Generate text from a prompt.
    """
    model.eval()
    
    # Tokenize prompt
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text


# Test generation with different prompts
prompts = [
    "Machine learning is",
    "Deep learning uses",
    "Transfer learning allows",
    "LoRA reduces",
    "Attention mechanisms"
]

print("\nGenerating text with fine-tuned model:")
print("=" * 50)

for prompt in prompts:
    generated = generate_text(model, tokenizer, prompt, max_length=80)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated}")
    print("-" * 50)

## Part 6: Save and Load LoRA Weights

Save only the LoRA weights (not the entire model) for efficient storage.
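
The idea can be tried on a toy module first. This is a minimal sketch with a hypothetical `TinyAdapted` class (not part of the notebook's model): it filters the state dict down to the LoRA tensors, compares serialized sizes, and restores the adapter into a fresh instance with `strict=False`.

```python
import io
import torch
import torch.nn as nn


class TinyAdapted(nn.Module):
    """Hypothetical toy module: a frozen base layer plus two LoRA-style matrices."""
    def __init__(self, dim: int = 64, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.lora_A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))
        for p in self.base.parameters():
            p.requires_grad = False


def serialized_bytes(state: dict) -> int:
    """Size in bytes of a state dict written with torch.save."""
    buffer = io.BytesIO()
    torch.save(state, buffer)
    return buffer.getbuffer().nbytes


trained = TinyAdapted()
# Keep only the adapter tensors
adapter_state = {k: v.detach().cpu() for k, v in trained.state_dict().items() if 'lora' in k}

# Restore into a fresh instance; strict=False tolerates the absent base weights
fresh = TinyAdapted()
result = fresh.load_state_dict(adapter_state, strict=False)

print(sorted(adapter_state))        # ['lora_A', 'lora_B']
print(sorted(result.missing_keys))  # ['base.bias', 'base.weight']
print(serialized_bytes(adapter_state) < serialized_bytes(trained.state_dict()))  # True
```

In a real workflow the base weights come from the pre-trained checkpoint, so only the adapter file needs to be stored per task.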

In [None]:
def save_lora_weights(model: nn.Module, save_path: str):
    """
    Save only the LoRA weights.
    """
    lora_state_dict = {}
    
    for name, param in model.named_parameters():
        if 'lora' in name and param.requires_grad:
            # Detach so plain tensors (without autograd state) are serialized
            lora_state_dict[name] = param.detach().cpu()
    
    torch.save(lora_state_dict, save_path)
    print(f"LoRA weights saved to: {save_path}")
    print(f"Number of LoRA parameters: {len(lora_state_dict)}")
    
    # Print size
    import os
    size_mb = os.path.getsize(save_path) / (1024 * 1024)
    print(f"File size: {size_mb:.2f} MB")


def load_lora_weights(model: nn.Module, load_path: str):
    """
    Load LoRA weights into the model.
    """
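    lora_state_dict = torch.load(load_path, map_location=device)
    
    # strict=False because the file holds only the LoRA tensors, not the base weights
    result = model.load_state_dict(lora_state_dict, strict=False)
    if result.unexpected_keys:
        # Usually means LoRA has not been applied to this model yet
        print(f"Warning: {len(result.unexpected_keys)} saved keys did not match the model")
    print(f"LoRA weights loaded from: {load_path}")
    print(f"Number of LoRA parameters loaded: {len(lora_state_dict)}")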


# Save LoRA weights
print("\nSaving LoRA weights...")
print("=" * 50)
save_lora_weights(model, 'lora_weights.pt')  # relative path, so the cell runs on any machine

# Compare with full model size
print("\nFor comparison:")
print(f"Full GPT-2 model would be ~500 MB")
print(f"LoRA weights are only a small fraction!")

## Part 7: Merge LoRA Weights (Optional)

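For inference efficiency, we can merge LoRA weights back into the original weights: W' = W + (α/r)·BA. After merging, the layer is a single matrix multiply again, so the adapter adds no inference latency.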
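
Before merging a real model, the equivalence can be checked on a small layer. This is a minimal sketch with illustrative sizes and randomly filled `A` and `B` (standing in for trained adapters); it includes the α/r scaling used by the LoRA layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch only)
in_features, out_features, rank, alpha = 16, 32, 4, 8
scaling = alpha / rank

base = nn.Linear(in_features, out_features)
A = torch.randn(rank, in_features) * 0.1   # stands in for a trained lora_A
B = torch.randn(out_features, rank) * 0.1  # stands in for a trained lora_B

x = torch.randn(5, in_features)
with torch.no_grad():
    # Unmerged: frozen layer output plus the separate LoRA branch
    unmerged_out = base(x) + (x @ A.T @ B.T) * scaling

    # Merged: fold the update into a single weight matrix
    merged = nn.Linear(in_features, out_features)
    merged.weight.copy_(base.weight + scaling * (B @ A))
    merged.bias.copy_(base.bias)
    merged_out = merged(x)

print(torch.allclose(unmerged_out, merged_out, atol=1e-5))  # True
```

The two paths agree up to floating-point rounding, which is why merging is safe for inference but should be done with dropout disabled (evaluation mode).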

In [None]:
def merge_lora_weights(model: nn.Module) -> nn.Module:
    """
    Merge LoRA weights into the base model for inference.
    After merging, LoRA layers can be removed.
    """
    for name, module in model.named_modules():
        if isinstance(module, LinearWithLoRA):
            # Get the parent module
            parent_name = '.'.join(name.split('.')[:-1])
            attr_name = name.split('.')[-1]
            parent = model.get_submodule(parent_name) if parent_name else model
            
            # Merge: W' = W + scaling * B @ A
            with torch.no_grad():
                merged_weight = module.linear.weight.clone()
                lora_weight = (module.lora.lora_B @ module.lora.lora_A) * module.lora.scaling
                merged_weight += lora_weight
                
                # Create new linear layer with merged weights
                new_linear = nn.Linear(
                    module.linear.in_features,
                    module.linear.out_features,
                    bias=module.linear.bias is not None
                )
                new_linear.weight.data = merged_weight
                if module.linear.bias is not None:
                    new_linear.bias.data = module.linear.bias.clone()
                
                # Replace module
                setattr(parent, attr_name, new_linear)
                print(f"Merged LoRA weights for: {name}")
    
    return model


# Merge on a copy so the original LoRA model stays available (optional)
import copy

print("\nMerging LoRA weights into a copy of the model...")
print("=" * 50)
merged_model = merge_lora_weights(copy.deepcopy(model))
print("Note: This is optional and only for inference optimization.")
print("After merging, the copy has no separate LoRA parameters.")

## Summary and Key Insights

### What We Implemented:

1. **LoRA Layer**: Low-rank decomposition matrices (A and B) that adapt pre-trained weights
2. **Model Integration**: Applied LoRA to attention layers in GPT-2
3. **Training**: Fine-tuned only LoRA parameters (about 0.24% of total parameters at rank 8 on `c_attn`)
4. **Inference**: Generated text with the adapted model
5. **Weight Management**: Saved/loaded only LoRA weights (much smaller files)

### Key Advantages of LoRA:

1. **Memory Efficient**: Only train ~0.1-1% of parameters
2. **Storage Efficient**: LoRA weights are tiny (few MBs vs GBs)
3. **Base Weights Preserved**: The base model stays frozen, so removing the adapter restores the original behavior
4. **Modular**: Easy to swap different LoRA adapters
5. **Cheaper Training**: Far fewer gradients and optimizer states to store and update per step

### Hyperparameters to Tune:

- **Rank (r)**: 8-64 (lower = fewer params, higher = more capacity)
- **Alpha**: Usually 2×rank or 1×rank
- **Target Modules**: Which layers to apply LoRA to
- **Dropout**: Regularization (0.05-0.1)
- **Learning Rate**: 1e-4 to 5e-4 for LoRA

### Extensions:

1. Apply LoRA to more layers (feed-forward networks)
2. Use QLoRA (quantized LoRA) for even lower memory
3. Implement multi-task LoRA (multiple adapters)
4. Add evaluation metrics (perplexity, downstream tasks)
5. Use larger models (Llama, Mistral) with same approach

## Mathematical Deep Dive

### LoRA Formulation:

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

Where:
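- $B \in \mathbb{R}^{d \times r}$: Up-projection matrix (maps the rank-$r$ bottleneck back to $d$ dimensions), initialized to zero
- $A \in \mathbb{R}^{r \times k}$: Down-projection matrix (maps the $k$-dimensional input to the rank-$r$ bottleneck)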
- $r \ll \min(d, k)$: Rank (bottleneck dimension)

### Parameter Count:

- Original: $d \times k$ parameters
- LoRA: $(d + k) \times r$ parameters
- Reduction factor: $\frac{d \times k}{(d+k) \times r}$

Example: For $d=k=768$, $r=8$:
- Original: 589,824 parameters
- LoRA: 12,288 parameters (48× reduction!)
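
The arithmetic above can be reproduced with a few lines of plain Python. The helper name `lora_param_counts` is hypothetical, and the second call assumes the shape of GPT-2 small's fused QKV projection (768 inputs, 3 × 768 outputs).

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Parameter counts for a d x k weight and its rank-r LoRA update."""
    full = d * k          # parameters in W
    lora = (d + k) * r    # parameters in B (d x r) plus A (r x k)
    return {'full': full, 'lora': lora, 'reduction': full / lora}


square = lora_param_counts(d=768, k=768, r=8)
print(square)     # {'full': 589824, 'lora': 12288, 'reduction': 48.0}

# Assumed shape of GPT-2 small's fused QKV projection
fused_qkv = lora_param_counts(d=2304, k=768, r=8)
print(fused_qkv)  # {'full': 1769472, 'lora': 24576, 'reduction': 72.0}
```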

### Scaling Factor:

$$\Delta W = \frac{\alpha}{r} BA$$

The scaling factor $\frac{\alpha}{r}$ helps:
- Stabilize training across different ranks
- Control the magnitude of updates

A common choice is $\alpha = r$ or $\alpha = 2r$.
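
To see how the two conventions behave, the short sketch below prints the scaling for several ranks. The helper `lora_scaling` is a hypothetical name used only here.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update BA."""
    return alpha / rank


ranks = (4, 8, 16, 64)

# Fixed alpha: the multiplier shrinks as the rank grows
fixed_alpha = {r: lora_scaling(16, r) for r in ranks}
print(fixed_alpha)  # {4: 4.0, 8: 2.0, 16: 1.0, 64: 0.25}

# alpha tied to the rank (alpha = 2r): the multiplier stays constant
tied_alpha = {r: lora_scaling(2 * r, r) for r in ranks}
print(tied_alpha)   # {4: 2.0, 8: 2.0, 16: 2.0, 64: 2.0}
```

With a fixed `alpha`, changing the rank also changes the effective update size, so the learning rate may need retuning; tying `alpha` to the rank keeps the multiplier constant.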