# Memory Optimization

**Advanced techniques for training large models on limited hardware**

## The Memory Challenge

Training large language models requires substantial GPU memory. Understanding and optimizing memory usage is crucial for:

- **Fitting larger models** on your hardware
- **Using larger batch sizes** for more stable training
- **Faster training** through better GPU utilization
- **Cost reduction** by using smaller/cheaper GPUs

### Memory Breakdown

```
Approximate share of GPU memory (FP32 full fine-tuning with AdamW):
+-- Model Weights        (~20%)
+-- Optimizer State      (~40%)     <-- Largest component!
+-- Gradients            (~20%)
+-- Activations          (~15-20%)  <-- Depends on batch size and sequence length
+-- Framework Overhead   (~2-5%)
```

## Memory Examples

**GPT-2 (124M parameters) full fine-tuning:**

```
Model weights (fp32):     124M x 4 bytes = 496 MB
Optimizer (AdamW):        124M x 8 bytes = 992 MB  (momentum + variance)
Gradients (fp32):         124M x 4 bytes = 496 MB
Activations (batch=8):                    ~500 MB
Framework overhead:                       ~100 MB
----------------------------------------------------
Total:                                    ~2.6 GB
```

**Llama 7B full fine-tuning:**

```
Model weights (fp32):     7B x 4 bytes = 28 GB
Optimizer (AdamW):        7B x 8 bytes = 56 GB
Gradients (fp32):         7B x 4 bytes = 28 GB
Activations (batch=8):                   ~20 GB
Framework overhead:                      ~2 GB
----------------------------------------------------
Total:                                   ~134 GB  <-- Won't fit on consumer GPUs!
```

With AdamW, the optimizer state is typically the largest memory consumer: it stores two FP32 values (momentum and variance) per parameter, which is 2x the size of the FP32 weights.
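
The per-parameter arithmetic in the two examples above can be reproduced with a few lines of Python. This is a minimal sketch: `estimate_static_memory` is a hypothetical helper name, it uses decimal units (1 GB = 1e9 bytes) like the tables above, and it deliberately leaves out activations and framework overhead, which depend on batch size and sequence length.

```python
def estimate_static_memory(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Bytes needed for weights, gradients, and AdamW state (no activations)."""
    breakdown = {
        "weights": n_params * weight_bytes,
        "gradients": n_params * grad_bytes,
        "optimizer": n_params * optimizer_bytes,  # AdamW: momentum + variance in FP32
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


for name, n_params in [("GPT-2 (124M)", 124_000_000), ("Llama 7B", 7_000_000_000)]:
    est = estimate_static_memory(n_params)
    print(name)
    for part, size in est.items():
        print(f"  {part:<10} {size / 1e9:6.2f} GB")
```

For Llama 7B this gives 112 GB before activations and overhead, which matches the 28 + 56 + 28 GB rows above.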

## Technique 1: Mixed Precision Training

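**Usually the first technique to apply** - 16-bit tensors take half the memory of FP32 ones, and the code changes are minimal. With `autocast` the weights and optimizer state stay in FP32, so most of the savings come from activations.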

### FP16 vs BF16 vs FP32

| Format | Bits | Range | Precision | Memory |
|--------|------|-------|-----------|--------|
| FP32 | 32 | +/-3.4e38 | ~7 decimal digits | 4 bytes |
| FP16 | 16 | +/-65,504 | ~3 decimal digits | 2 bytes |
| BF16 | 16 | +/-3.4e38 | ~2 decimal digits | 2 bytes |

**BF16** is preferred on GPUs that support it (Ampere and newer): it has the same range as FP32, so no loss scaling is needed.
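
The range and precision columns follow directly from how many exponent and mantissa bits each format has. The sketch below derives them in plain Python from the standard bit layouts (FP32: 8 exponent / 23 mantissa bits, FP16: 5 / 10, BF16: 8 / 7). The helper name `float_format_stats` is made up for illustration.

```python
import math

def float_format_stats(exponent_bits, mantissa_bits):
    """Return (largest finite value, approximate decimal digits) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: all mantissa bits set, largest non-reserved exponent
    max_value = (2 - 2 ** -mantissa_bits) * 2.0 ** bias
    # Precision: mantissa bits plus the implicit leading 1, converted to decimal digits
    decimal_digits = (mantissa_bits + 1) * math.log10(2)
    return max_value, decimal_digits

formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    max_value, digits = float_format_stats(e_bits, m_bits)
    print(f"{name}: max ~ {max_value:.4g}, ~{digits:.1f} decimal digits")
```

BF16 keeps FP32's 8 exponent bits and gives up mantissa bits instead, which is why it trades precision for range.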

In [None]:
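import torch

# Mixed precision training example
use_fp16 = False  # True for FP16; False for BF16 (Ampere or newer)
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16

# Loss scaling is only needed for FP16; for BF16 the scaler is a no-op.
# (PyTorch >= 2.3; on older versions use torch.cuda.amp.GradScaler(enabled=use_fp16).)
scaler = torch.amp.GradScaler("cuda", enabled=use_fp16)

def train_step_mixed_precision(model, batch, optimizer):
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        # Hugging Face models only return a loss when labels are passed
        outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
        loss = outputs.loss

    # Backward pass (the loss is scaled to prevent FP16 gradient underflow)
    scaler.scale(loss).backward()

    # Optimizer step: unscales gradients and skips the step on inf/NaN
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

print("Memory Savings with Mixed Precision:")
print("  FP32 -> BF16: ~50% reduction for each tensor stored in 16 bits")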
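
To see why FP16 needs loss scaling while BF16 does not, it helps to look at what happens to a very small gradient value. The sketch below uses Python's `struct` module (format code `e` is IEEE half precision) to round values to FP16. The gradient value `1e-8` and the scale factor `65536` are arbitrary illustrative choices, and `to_fp16` is a helper defined here, not a library function.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-8   # below FP16's smallest positive value (~6e-8)
scale = 65536.0    # illustrative loss scale (2**16)

unscaled = to_fp16(tiny_grad)        # underflows to zero in FP16
scaled = to_fp16(tiny_grad * scale)  # large enough to be representable
recovered = scaled / scale           # unscale in higher precision

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling:    {recovered:.3e}")
```

`GradScaler` applies the same idea to the loss before `backward()` and divides the gradients by the scale before the optimizer step.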

## Technique 2: LoRA (Low-Rank Adaptation)

**Dramatic memory reduction** by training only a tiny fraction of parameters.

```
Full Fine-Tuning:
  Trainable params: 7,000,000,000
  Optimizer state:  56 GB

LoRA (r=16):
  Trainable params: 16,777,216  (0.24% of model!)
  Optimizer state:  134 MB      (418x reduction!)
```
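
The trainable-parameter count above can be reproduced from the model dimensions. The sketch below assumes a Llama-7B-like shape (hidden size 4096, 32 layers) with adapters on the four attention projections, each a 4096 x 4096 matrix. That placement is an assumption chosen because it reproduces the 16,777,216 figure; the numbers above do not state it explicitly.

```python
def lora_trainable_params(hidden_size=4096, num_layers=32, rank=16, modules_per_layer=4):
    """Count LoRA parameters when each adapted matrix is hidden_size x hidden_size."""
    # Each adapter adds A (rank x hidden) and B (hidden x rank)
    params_per_module = rank * (hidden_size + hidden_size)
    return params_per_module * modules_per_layer * num_layers


for rank in (4, 8, 16, 32):
    n = lora_trainable_params(rank=rank)
    adamw_bytes = n * 8  # two FP32 values per trainable parameter
    print(f"r={rank:<2}  params={n:>11,}  AdamW state={adamw_bytes / 1e6:6.1f} MB")
```

Both the parameter count and the optimizer state grow linearly with the rank, so doubling `r` doubles the adapter memory.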

In [None]:
# LoRA memory savings
print("LoRA Memory Savings (Llama 7B):")
print()
print("Full Fine-Tuning (BF16):")
print("  Model:      14 GB (trainable)")
print("  Optimizer:  56 GB")
print("  Gradients:  14 GB")
print("  Total:      84 GB + activations")
print()
print("LoRA (BF16 base model, r=16, FP32 adapters):")
print("  Model:      14 GB (frozen, can be quantized)")
print("  LoRA:       67 MB (trainable)")
print("  Optimizer:  134 MB (only for LoRA)")
print("  Gradients:  67 MB (only for LoRA)")
print("  Total:      ~14.3 GB + activations (5.9x reduction!)")
print()
print("Rank selection (AdamW state for the adapters):")
print("  r=4:   ~34 MB (minimum, may underfit)")
print("  r=8:   ~67 MB (good for simple tasks)")
print("  r=16:  ~134 MB (default, recommended)")
print("  r=32:  ~268 MB (high capacity)")

## Technique 3: Gradient Accumulation

**Simulate larger batch sizes** without additional memory.

```
Effective batch size = batch_size x gradient_accumulation_steps
Activation memory scales with batch_size (per step), not with the effective batch size!
```

In [None]:
# Gradient accumulation implementation
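def train_with_gradient_accumulation(model, dataloader, optimizer, accumulation_steps=4):
    optimizer.zero_grad()

    for i, batch in enumerate(dataloader):
        # Forward pass (Hugging Face models only return a loss when labels are passed)
        outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])

        # Scale the loss so the accumulated gradient is an average, not a sum
        loss = outputs.loss / accumulation_steps

        # Backward pass (adds to the gradients already stored in .grad)
        loss.backward()

        # Update weights every accumulation_steps, and on the last batch
        # so a trailing partial group is not silently dropped
        is_last_batch = (i + 1) == len(dataloader)
        if (i + 1) % accumulation_steps == 0 or is_last_batch:
            optimizer.step()
            optimizer.zero_grad()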

print("Gradient Accumulation:")
print("  batch_size=4, accumulation_steps=8")
print("  Effective batch size: 32")
print("  Memory: Only 4 samples at a time")
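
Dividing the loss by `accumulation_steps` is what makes the accumulated gradient equal to the gradient of one large batch. The following framework-free sketch checks this for a one-parameter model `y = w * x` with mean squared error. The data values are made up, and the equivalence holds here because all micro-batches have the same size.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# One step on the full batch of 8
full_batch_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2, each contribution divided by accumulation_steps
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    accumulated_grad += mse_grad(w, xs[start:end], ys[start:end]) / accumulation_steps

print(f"full batch:  {full_batch_grad:.6f}")
print(f"accumulated: {accumulated_grad:.6f}")
```

The two values agree up to floating-point rounding, while only two samples are ever processed at a time.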

## Technique 4: Gradient Checkpointing

**Trade computation for memory** by recomputing activations during backward pass.

```
Without Gradient Checkpointing:
  Forward:  Save all activations -> High memory
  Backward: Use saved activations -> Fast

With Gradient Checkpointing:
  Forward:  Save only checkpoint activations -> Low memory
  Backward: Recompute from checkpoints -> Slower, low memory
```
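
A rough way to see the trade-off is to count how many layer activations are alive at once. The sketch below uses a deliberately simplified model: every layer produces one equally sized activation, a checkpoint is saved at the start of each fixed-size segment, and one segment at a time is recomputed during the backward pass. The function name and the numbers (32 layers, 8 layers per segment) are illustrative assumptions.

```python
import math

def peak_saved_activations(num_layers, segment_size=None):
    """Peak number of layer activations held in memory (simplified model)."""
    if segment_size is None:
        # No checkpointing: every layer's activation is kept for backward
        return num_layers
    num_checkpoints = math.ceil(num_layers / segment_size)
    # Checkpoints stay resident; one segment at a time is recomputed
    return num_checkpoints + segment_size

layers = 32
baseline = peak_saved_activations(layers)
checkpointed = peak_saved_activations(layers, segment_size=8)
print(f"without checkpointing: {baseline} activations")
print(f"with checkpointing:    {checkpointed} activations")
print(f"reduction:             {1 - checkpointed / baseline:.0%}")
```

The price is that each segment's forward pass runs a second time during backward, which is where the slowdown comes from.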

In [None]:
from transformers import AutoModelForCausalLM

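# Enable gradient checkpointing
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the generation KV cache is not needed for training and conflicts with checkpointing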

print("Gradient Checkpointing Memory Savings:")
print()
print("Llama 7B training (batch_size=8, seq_length=2048):")
print("  Without checkpointing: ~20 GB activations")
print("  With checkpointing:    ~5 GB activations")
print("  Savings: 75% reduction")
print()
print("Trade-off: 20-30% slower training")

## Technique 5: Model Quantization

**Load models in reduced precision** (4-bit or 8-bit) to dramatically reduce memory.

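| Precision | Weight memory (7B model) | Approx. quality retained (varies by task) |
|-----------|--------------------------|-------------------------------------------|
| FP32 | 28 GB | 100% (baseline) |
| BF16 | 14 GB | ~99.9% |
| 8-bit | 7 GB | ~99% |
| 4-bit | 3.5 GB | ~95-98% |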
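
The memory column is simply the parameter count times the bits per weight. A small sketch follows; the helper name is hypothetical, and real quantized checkpoints add a little overhead for scaling constants, which is ignored here.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:<6} {weight_memory_gb(7_000_000_000, bits):5.1f} GB")
```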

In [None]:
# Quantization with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 (better than standard)
    bnb_4bit_use_double_quant=True,   # Double quantization for more savings
    bnb_4bit_compute_dtype=torch.bfloat16,  # Computation dtype
)

# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-hf",  # Llama 3.2 has no 7B variant
#     quantization_config=quantization_config,
#     device_map="auto",
# )

print("QLoRA (4-bit + LoRA) Setup:")
print("  Model (4-bit):          3.5 GB")
print("  LoRA adapters (FP32):   67 MB")
print("  Optimizer state:        134 MB")
print("  Gradients:              67 MB")
print("  Activations (bs=8):     5 GB (with checkpointing)")
print("  ------------------------------------")
print("  Total:                  ~9 GB (fits on RTX 3080!)")

## Memory Profiling

In [None]:
import torch
import gc

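def profile_memory(fn, label=""):
    """Run fn() and report CUDA memory before, after, and at peak."""
    if not torch.cuda.is_available():
        print("CUDA not available for profiling")
        return fn()

    # Free unreferenced tensors first, then release cached blocks,
    # then reset the peak counter so it only reflects fn()
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    start_mem = torch.cuda.memory_allocated()

    result = fn()
    torch.cuda.synchronize()  # wait for queued kernels before reading stats

    end_mem = torch.cuda.memory_allocated()
    peak_mem = torch.cuda.max_memory_allocated()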
    
    print(f"\n{label}")
    print(f"  Start: {start_mem / 1e9:.2f} GB")
    print(f"  End:   {end_mem / 1e9:.2f} GB")
    print(f"  Delta: {(end_mem - start_mem) / 1e9:.2f} GB")
    print(f"  Peak:  {peak_mem / 1e9:.2f} GB")
    
    return result

# Memory monitoring
if torch.cuda.is_available():
    print(f"Current memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
else:
    print("CUDA not available")

## Debugging OOM Errors

**OOM Debugging Checklist:**

1. **Reduce batch size by 50%**
2. **Enable gradient checkpointing**
3. **Use gradient accumulation**
4. **Check for memory leaks** (storing tensors accidentally)
5. **Clear cache** with `torch.cuda.empty_cache()` (this releases cached blocks; it does not free tensors that are still referenced)

**Common OOM Causes:**

| Cause | Solution |
|-------|----------|
| Batch size too large | Reduce by 50%, use gradient accumulation |
| Sequence length too long | Truncate to 512 or 1024 tokens |
| Accumulating tensors | Use `.item()` or `.detach()` |
| Fragmented memory | `torch.cuda.empty_cache()`, or set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Multiple models | Delete unused models |
| Full precision | Use BF16/FP16 |
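
The "accumulating tensors" row is the easiest leak to introduce by accident, typically when logging losses. The CPU-only sketch below uses a toy `torch.nn.Linear` model and random data, both illustrative stand-ins, to show the difference between storing the loss tensor and storing its value.

```python
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

leaky_history = []  # holds tensors that still reference the autograd graph
safe_history = []   # holds plain Python floats

for _ in range(3):
    loss = torch.nn.functional.mse_loss(model(x), y)
    leaky_history.append(loss)        # keeps this step's graph alive
    safe_history.append(loss.item())  # copies the value out; the graph can be freed

print(leaky_history[0].grad_fn is not None)  # graph is still attached
print(type(safe_history[0]).__name__)        # plain float
print(loss.detach().grad_fn is None)         # detach() drops the graph reference
```

On a GPU, every entry in `leaky_history` would pin that step's activations in memory, so usage grows with each iteration.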

## Memory Optimization Strategy

**Recommended approach:**

### Step 1: Essential Optimizations (Always Apply)
1. Mixed precision (BF16/FP16)
2. LoRA (if training large models)
3. Find max batch size

### Step 2: Add If Still OOM
4. Gradient checkpointing
5. Gradient accumulation

### Step 3: Extreme Constraints
6. 4-bit quantization (QLoRA)
7. CPU offloading (DeepSpeed)

In [None]:
# Complete optimization example
print("""Full Optimization Example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTConfig

# 4-bit quantization + LoRA
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, ...)
model = get_peft_model(model, lora_config)

# Gradient checkpointing
model.gradient_checkpointing_enable()

# Training config
config = SFTConfig(
    per_device_train_batch_size=4,   # Small batch
    gradient_accumulation_steps=8,   # Effective batch = 32
    learning_rate=3e-4,
)

# Result: 7B model on 12 GB GPU!
""")

## Summary

**Memory Optimization Techniques Ranked:**

| Technique | Memory Savings | Speed Impact | When to Use |
|-----------|----------------|--------------|-------------|
| Mixed Precision | ~50% per 16-bit tensor | ~20% faster | Always |
| LoRA | >99% of optimizer state | None | Large models |
| Gradient Accumulation | 0% (enables larger effective batch) | Small overhead | Memory-limited |
| Gradient Checkpointing | 50-80% of activations | 20-30% slower | Long sequences |
| Quantization (4-bit) | 75% of weights (vs BF16) | 10-20% slower | Extreme constraints |
| CPU Offloading | 50-70% of optimizer state | 60-80% slower | Last resort |

## Next Steps

Now let's explore hyperparameter tuning for optimal training.