# Lab 6: Quantization & Optimization

**Module 6 - Advanced Optimization Techniques**

| Duration | Difficulty | Framework | Exercises |
|----------|------------|-----------|----------|
| 120 min | Advanced | BitsAndBytes | 4 |

## Learning Objectives

- Understand quantization theory and trade-offs
- Implement INT8 and INT4 quantization
- Apply QLoRA for memory-efficient fine-tuning
- Benchmark and compare model performance

## Setup

In [None]:
# !pip install transformers accelerate bitsandbytes peft scipy

In [None]:
import torch
import gc
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "microsoft/phi-2"

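def get_memory_usage():
    """Return allocated and reserved CUDA memory in GB (zeros without a GPU)."""
    if torch.cuda.is_available():
        return {
            "allocated": torch.cuda.memory_allocated() / 1e9,
            "reserved": torch.cuda.memory_reserved() / 1e9,
        }
    return {"allocated": 0.0, "reserved": 0.0}

def clear_memory():
    """Run garbage collection and release cached CUDA memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()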

---

## Exercise 1: Understanding Quantization

Explore how quantization works at the tensor level.

**Your Task:** Implement manual quantization and analyze error.
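As a warm-up, the absmax scheme hinted at in the cell that follows can be tried on a plain Python list before moving to tensors. This is a minimal sketch, not a reference solution: the helper names (`quantize_absmax_int8`, `dequantize`, `mse`) and the toy values are assumptions made for illustration.

```python
def quantize_absmax_int8(values):
    """Symmetric INT8 quantization: the largest magnitude maps to 127."""
    abs_max = max(abs(v) for v in values)
    scale = abs_max / 127 if abs_max > 0 else 1.0  # guard against all-zero input
    # Round to the nearest integer level and clip to the symmetric INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integer levels back to approximate real values."""
    return [q * scale for q in quantized]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.5, -1.27, 0.034, 1.0]  # toy values, chosen for illustration
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
error = mse(weights, restored)
print(q, scale, error)
```

Each element is off by at most `scale / 2` after the round trip, so the MSE is bounded by `(scale / 2) ** 2`. A single large outlier inflates `scale` and therefore the error on every other element, which is worth keeping in mind when analyzing the results.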

In [None]:
import numpy as np

def manual_quantize_int8(tensor: torch.Tensor) -> tuple:
    """Manually quantize a tensor to INT8; return (quantized, scale)."""
    # TODO: Find the scale factor and quantize
    # scale = tensor.abs().max() / 127
    # quantized = torch.round(tensor / scale).clamp(-127, 127).to(torch.int8)
    pass

def analyze_quantization_error(original: torch.Tensor, bits: int = 8):
    """Analyze quantization error for different bit widths."""
    # TODO: Quantize, dequantize, and measure MSE
    pass

---

## Exercise 2: INT8 Quantization

Load and compare models with INT8 quantization.

**Your Task:** Create quantization configs and load models.
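Before loading anything, it can help to estimate how much memory the weights alone should take at each precision, so the numbers reported by `get_memory_usage()` have something to be compared against. The sketch below is back-of-the-envelope arithmetic only. The parameter count is an assumed approximate figure for phi-2, and `estimate_weight_memory_gb` is a hypothetical helper, not part of any library.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def estimate_weight_memory_gb(num_params, dtype):
    """Estimate weight storage in GB (1e9 bytes), ignoring all overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 2.7e9  # assumption: approximate parameter count of phi-2

for dtype in ("fp16", "int8", "4bit"):
    print(f"{dtype}: ~{estimate_weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

Measured values will typically be somewhat higher than these estimates, since quantization constants, layers kept in higher precision, and activations are not counted here.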

In [None]:
def load_model_fp16():
    """Load model in FP16."""
    # TODO: Load with torch_dtype=torch.float16
    pass

def load_model_int8():
    """Load model with INT8 quantization."""
    # TODO: Create BitsAndBytesConfig with load_in_8bit=True
    pass

---

## Exercise 3: INT4 and NF4 Quantization

Implement 4-bit quantization with NormalFloat4.

**Your Task:** Configure NF4 quantization for optimal quality.
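To build intuition for why a normal-shaped codebook can beat uniform 4-bit levels on bell-shaped weights, the following sketch compares the two on synthetic Gaussian data. It is a simplified illustration under stated assumptions: the codebook is built from standard-normal quantiles and is not the exact NF4 table used by bitsandbytes, the whole list is treated as a single quantization block, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

def uniform_int4_roundtrip(values):
    """Symmetric uniform 4-bit: integer levels -7..7 scaled by absmax."""
    scale = max(abs(v) for v in values) / 7
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

def normal_codebook(levels=16):
    """Illustrative codebook: standard-normal quantiles rescaled to [-1, 1]."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    return [x / quantiles[-1] for x in quantiles]

def codebook_roundtrip(values, codebook):
    """Normalize by absmax, then snap each value to the nearest codebook entry."""
    abs_max = max(abs(v) for v in values)
    return [
        min(codebook, key=lambda c: abs(v / abs_max - c)) * abs_max
        for v in values
    ]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed so the comparison is repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic "weights"

uniform_err = mse(weights, uniform_int4_roundtrip(weights))
normal_err = mse(weights, codebook_roundtrip(weights, normal_codebook()))
print(f"uniform 4-bit MSE: {uniform_err:.3e}")
print(f"normal-quantile 4-bit MSE: {normal_err:.3e}")
```

The quantile-based levels cluster near zero, where most Gaussian samples fall, so the average error is lower even though both schemes use 4 bits per value.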

In [None]:
def load_model_nf4():
    """Load model with NF4 quantization."""
    # TODO: Create config with:
    # - load_in_4bit=True
    # - bnb_4bit_quant_type="nf4"
    # - bnb_4bit_compute_dtype=torch.float16
    # - bnb_4bit_use_double_quant=True
    pass

def benchmark_generation(model, tokenizer, prompts: list):
    """Benchmark generation speed."""
    # TODO: Measure tokens/second for generation
    pass
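One possible shape for the throughput measurement is sketched below, using a stand-in generator so it runs without a model. `tokens_per_second` and `fake_generate` are hypothetical names, and the convention that the callable returns the number of newly generated tokens is an assumption made for this sketch.

```python
import time

def tokens_per_second(generate_fn, prompts, warmup=1):
    """Time generate_fn over prompts and return new tokens per second.

    generate_fn(prompt) must return the number of newly generated tokens.
    """
    for prompt in prompts[:warmup]:
        generate_fn(prompt)  # warm-up calls are not timed
    start = time.perf_counter()
    total_tokens = sum(generate_fn(prompt) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

def fake_generate(prompt):
    """Stand-in for a model call: pretend to produce 20 tokens in about 10 ms."""
    time.sleep(0.01)
    return 20

rate = tokens_per_second(fake_generate, ["first prompt", "second prompt"])
print(f"{rate:.0f} tokens/s")
```

With a real model, the callable would wrap `model.generate` and return the output length minus the prompt length. On a GPU, calling `torch.cuda.synchronize()` before reading the clock avoids timing only the kernel launch.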

---

## Exercise 4: QLoRA Implementation

Combine 4-bit quantization with LoRA for efficient fine-tuning.

**Your Task:** Set up QLoRA training configuration.
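The efficiency of this setup comes from how few parameters a low-rank adapter adds: a rank-r adapter on a linear layer of shape `d_out x d_in` introduces two small matrices with `r * (d_in + d_out)` parameters in total. The arithmetic can be sketched as follows. The model dimensions, rank, and number of adapted layers are hypothetical values chosen for illustration, not the settings of any particular model.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical model shape, chosen only for illustration.
HIDDEN = 4096
NUM_LAYERS = 32
ADAPTED_PER_LAYER = 2      # e.g. two attention projections per layer
TOTAL_BASE_PARAMS = 7e9    # frozen, quantized base weights
RANK = 16

trainable = NUM_LAYERS * ADAPTED_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
fraction = trainable / (TOTAL_BASE_PARAMS + trainable)
print(f"trainable: {trainable:,} ({100 * fraction:.3f}% of all parameters)")
```

Under these assumptions only a fraction of a percent of the parameters receive gradients. After wrapping a model with `get_peft_model`, its `print_trainable_parameters()` method should report the corresponding measured figures.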

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def setup_qlora_model():
    """Set up a model for QLoRA fine-tuning."""
    # TODO:
    # 1. Create 4-bit quantization config
    # 2. Load quantized model
    # 3. Prepare for k-bit training
    # 4. Apply LoRA config
    pass

def analyze_qlora_efficiency(model):
    """Analyze QLoRA parameter efficiency."""
    # TODO: Count trainable vs total params
    pass

---

## Checkpoint

You've completed Lab 6! Key concepts:

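- INT8 roughly halves weight memory relative to FP16, usually with minimal quality loss
- NF4 cuts weight memory to roughly a quarter of FP16, with quantization levels chosen for normally distributed weights
- QLoRA (a 4-bit base model plus trainable LoRA adapters) enables fine-tuning 7B+ models on consumer GPUs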

**Next:** Lab 7 - Ethical AI & Guardrails