# Module E: Quantization Lab - Memory, Speed, and Quality Tradeoffs

**Goal:** Understand how quantization affects model size, inference speed, and output quality.

**Persona:** ML Engineer optimizing for production deployment

## Workshop Overview

In this hands-on lab, you will:

1. **Learn the fundamentals** of model quantization
2. **Load the same model** at different precision levels (FP16, INT8, INT4)
3. **Benchmark performance** - memory, latency, and perplexity
4. **Compare quality** in an interactive side-by-side arena

## Why Quantization Matters

| Scenario | Challenge | Quantization Helps By |
|----------|-----------|----------------------|
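| Edge deployment | Limited GPU memory | A 4-bit model is roughly 4x smaller than FP16, so it fits on smaller devices |
| Cost optimization | Expensive GPU hours | Smaller models run on cheaper GPUs or share a single GPU |
| Latency requirements | Real-time responses | Less weight data to read per token can cut latency, depending on kernel support |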
| Multi-model serving | Many models, one GPU | Each model uses less VRAM |

## Model: Qwen2.5-1.5B-Instruct

We'll use **Qwen2.5-1.5B-Instruct** - a 1.5 billion parameter instruction-tuned model that:
- Is small enough to experiment with quickly
- Is large enough to show meaningful differences
- Has strong performance for its size


In [None]:
# Debug: Check bitsandbytes installation
import sys
import subprocess

print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if bitsandbytes is installed
try:
    import bitsandbytes as bnb
    print(f"✅ bitsandbytes {bnb.__version__} is installed")
    print(f"   Location: {bnb.__file__}")
except ImportError as e:
    print(f"❌ bitsandbytes NOT found: {e}")
    print("\n🔧 Installing bitsandbytes...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "bitsandbytes"])
    print("✅ Installation complete. Please restart the kernel!")

# Check CUDA availability
import torch
print(f"\n🖥️ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   CUDA version: {torch.version.cuda}")
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

## ✅ FIXED: bitsandbytes Installation Complete!

**Good news**: All quantization libraries are now installed in your `.venv`:
- ✅ bitsandbytes 0.49.0
- ✅ auto-gptq 0.7.1  
- ✅ optimum 2.0.0
- ✅ PyTorch 2.9.1+cu128 with CUDA support

### To use them:

**1. Restart your Jupyter kernel**: 
   - Go to **Kernel → Restart Kernel** (or press `0` twice in command mode)

**2. Run the debug cell above** to verify the installation

**3. Continue with the lab!**

---

### Still seeing import errors after restart?

**Check kernel matches venv**: In Jupyter, go to **Kernel → Change Kernel** and ensure it says `Python (fico)` or uses `.venv/bin/python`

**Manual install if needed**:
```bash
cd /home/shadeform/workshop-v1/fico
./scripts/install_bitsandbytes.sh
```

---

## Section 1: Quantization Fundamentals

### What is Quantization?

Quantization reduces the **precision** of model weights from higher bit-widths to lower ones.

```
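Original weight (FP32):  0.12345679   (~7 significant decimal digits)
FP16 weight:             0.1235       (~3 significant decimal digits)
INT8 weight:             one of 256 integer levels, times a scale factor
INT4 weight:             one of 16 levels, times a scale factor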
```
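
To make the idea concrete, here is a minimal sketch of symmetric "absmax" quantization to 8-bit integers in plain Python. It is a simplified illustration, not what bitsandbytes does internally: real libraries typically use per-block or per-channel scales and treat outliers separately. The function names are made up for this example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.1234, -0.5678, 0.9012, -0.0042]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

Fewer bits mean fewer available levels, so the step size (`scale`) grows and so does the worst-case rounding error.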

### The Precision Ladder

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Precision    │ Bits │ Memory (1.5B) │ Use Case                         │
├──────────────┼──────┼───────────────┼───────────────────────────────────┤
│ FP32         │  32  │    ~6 GB      │ Training, highest accuracy       │
│ FP16/BF16    │  16  │    ~3 GB      │ Standard inference               │
│ INT8         │   8  │   ~1.5 GB     │ Good quality, 2x compression     │
│ INT4 (NF4)   │   4  │   ~0.8 GB     │ Acceptable quality, 4x compress  │
│ INT2/1.58b   │  1-2 │   ~0.4 GB     │ Experimental, research           │
└─────────────────────────────────────────────────────────────────────────┘
```

### Memory Formula

```
Model Size (GB) ≈ (Parameters × Bits per Weight) / (8 × 10^9)

For 1.5B parameters:
  FP16: 1.5B × 16 / 8B = 3.0 GB
  INT8: 1.5B ×  8 / 8B = 1.5 GB
  INT4: 1.5B ×  4 / 8B = 0.75 GB
```
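
The formula translates directly into a small helper. This is a weights-only estimate under the formula's assumptions; expect measured GPU memory to be higher, since quantized models usually keep some layers in higher precision, store scaling factors, and need room for activations. The function name is illustrative.

```python
def estimate_weight_memory_gb(num_parameters: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes): parameters x bits / (8 x 10^9)."""
    return num_parameters * bits_per_weight / (8 * 1e9)


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size_gb = estimate_weight_memory_gb(1.5e9, bits)
    print(f"{name}: {size_gb:.2f} GB")
```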

### Quantization Methods

| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| **Post-Training (PTQ)** | Quantize after training | Fast, no retraining | Some accuracy loss |
| **Quantization-Aware (QAT)** | Train with quantization | Better accuracy | Requires training |
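| **GPTQ** | Post-training, layer-by-layer weight quantization using calibration data | Good quality at 4-bit | Slower to quantize |
| **bitsandbytes** | On-the-fly 8/4-bit quantization at load time | Easy to use, no calibration step | Slightly slower inference |
| **AWQ** | Activation-aware weight quantization | Strong 4-bit quality | Needs calibration data; slower quantization |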

---

## Section 2: Environment Setup

In [None]:
# Environment Check
import sys
import os
from pathlib import Path

print("Python:", sys.executable)
print("\n=== Package Versions ===")

packages = [
    ("torch", "PyTorch"),
    ("transformers", "Transformers"),
    ("accelerate", "Accelerate"),
    ("bitsandbytes", "BitsAndBytes"),
    ("plotly", "Plotly"),
    ("ipywidgets", "Widgets"),
]

missing = []
for pkg, name in packages:
    try:
        mod = __import__(pkg)
        ver = getattr(mod, "__version__", "?")
        print(f"✅ {name}: {ver}")
    except ImportError:
        print(f"❌ {name}: NOT INSTALLED")
        missing.append(pkg)

if missing:
    print(f"\n⚠️ Install missing: pip install {' '.join(missing)}")

print("\n=== GPU Check ===")
import torch
if torch.cuda.is_available():
    print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
    print(f"   Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ CUDA not available - quantization benchmarks require GPU!")

In [None]:
# Core Imports
import time
import gc
import re
from pathlib import Path

import torch
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output, Markdown

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Our utility functions
from quant_utils import (
    clear_gpu_memory,
    get_gpu_memory_mb,
    get_model_size_mb,
    count_parameters,
    calculate_perplexity_simple,
    benchmark_generation,
    load_model_fp16,
    load_model_int8,
    load_model_int4,
    compare_outputs,
    QuantizationBenchmark,
    create_memory_chart,
    create_speed_chart,
    create_perplexity_chart,
    create_tradeoff_chart,
    create_summary_dashboard,
)

print("✅ All imports successful!")

In [None]:
# Configuration

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
CALIBRATION_FILE = Path("calibration_texts/fico_calibration.txt")

# Load calibration text
if CALIBRATION_FILE.exists():
    CALIBRATION_TEXT = CALIBRATION_FILE.read_text(encoding="utf-8")
    print(f"✅ Loaded calibration text: {len(CALIBRATION_TEXT)} chars")
else:
    CALIBRATION_TEXT = """
    Credit scoring is a statistical analysis performed by lenders to determine creditworthiness.
    The FICO Score ranges from 300 to 850 and is based on payment history, amounts owed,
    length of credit history, credit mix, and new credit inquiries.
    """
    print("⚠️ Using default calibration text")

# HARDER benchmark prompts that reveal quality differences
BENCHMARK_PROMPTS = [
    # Math reasoning (requires precision)
    "If a credit card has a $5,000 balance at 18.5% APR and you pay $200/month, how much interest will you pay in the first month? Show your calculation.",
    
    # Multi-step reasoning
    "Alice has a credit score of 680. She pays off a $3,000 credit card balance and her utilization drops from 80% to 15%. Her payment history is perfect. Will her score likely go up or down, and by approximately how much?",
    
    # Factual precision
    "List the exact 5 factors that make up a FICO Score and their precise percentage weights. Format: Factor (XX%).",
    
    # Complex financial calculation
    "Calculate the effective annual rate for a loan with 12% nominal rate compounded monthly. Show the formula and final percentage.",
    
    # Code generation (requires syntax precision)
    "Write a Python function called calculate_fico_factors that takes payment_history, credit_utilization, credit_age, credit_mix, and new_credit as inputs (all 0-100) and returns a weighted score using the standard FICO percentages (35%, 30%, 15%, 10%, 10%). Include input validation.",
    
    # Long context reasoning
    "A borrower has: 1) 10 years of credit history, 2) never missed a payment, 3) 5 credit cards with total limit $50k and balance $5k, 4) 1 mortgage, 5) no new inquiries in 2 years. Analyze each FICO factor and predict if this profile would likely score above 750, between 670-750, or below 670. Explain your reasoning for each factor.",
]

# Quality arena prompts with difficulty levels
ARENA_PROMPTS = {
    "Math_Easy": "What is 25% of 800?",
    
    "Math_Medium": "A loan of $10,000 at 5% annual interest compounded monthly for 3 years. What's the total amount paid?",
    
    "Math_Hard": "Calculate the internal rate of return (IRR) for an investment: pay $1000 today, receive $300/year for 4 years. Show calculation steps.",
    
    "Reasoning_Easy": "Which is better for your credit: paying off a credit card or opening a new one?",
    
    "Reasoning_Medium": "You have two credit cards: Card A ($5k limit, $4k balance) and Card B ($10k limit, $1k balance). You have $3k to pay down debt. Should you pay Card A or split between both to minimize utilization impact? Explain numerically.",
    
    "Reasoning_Hard": "Three scenarios: A) Pay minimum on all cards, B) Pay off smallest balance first (snowball), C) Pay highest interest first (avalanche). For someone with $20k debt across 4 cards (rates: 24%, 18%, 15%, 12%), calculate total interest paid over 2 years for each strategy. Which is optimal?",
    
    "Code_Easy": "Write a function to check if a credit score is 'excellent' (740+), 'good' (670-739), or 'poor' (<670).",
    
    "Code_Medium": "Write a Python function that calculates credit utilization ratio given a list of tuples (balance, limit) for multiple cards. Handle edge cases like zero limits.",
    
    "Code_Hard": "Write a complete Python class CreditScoreSimulator with methods to: 1) add accounts, 2) record payments, 3) calculate utilization, 4) estimate score impact. Include proper error handling and docstrings.",
    
    "Factual_Precision": "What are the exact numeric ranges for FICO score categories: Exceptional, Very Good, Good, Fair, and Poor?",
}

print(f"\n📋 Configuration:")
print(f"   Model: {MODEL_NAME}")
print(f"   Benchmark prompts: {len(BENCHMARK_PROMPTS)} (HARDER TESTS)")
print(f"   Arena prompts: {len(ARENA_PROMPTS)} (DIFFICULTY LEVELS)")
print(f"\n💡 These prompts test: math precision, multi-step reasoning, code generation, factual accuracy")

---

## Section 3: Loading Models at Different Precisions

We'll load the same model three times:
1. **FP16** (16-bit floating point) - Baseline
2. **INT8** (8-bit integer) - 2x smaller
3. **INT4** (4-bit integer) - 4x smaller

For each, we'll measure:
- Load time
- GPU memory usage
- Peak memory during loading
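
The `load_model_*` helpers come from the lab's `quant_utils` module, whose source is not shown here. As a rough sketch of what such a loader might do with the `BitsAndBytesConfig` class imported earlier, the settings below use the documented bitsandbytes options in Transformers. The names `INT8_KWARGS`, `INT4_NF4_KWARGS`, and `load_quantized` are hypothetical, and the actual helpers may choose different options.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, kept as plain dicts
# so they can be inspected without a GPU or the libraries installed.
INT8_KWARGS = {"load_in_8bit": True}

INT4_NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",         # NormalFloat4; "fp4" is the alternative
    "bnb_4bit_compute_dtype": "float16",  # dtype used for the matrix multiplies
    "bnb_4bit_use_double_quant": True,    # also quantize the scaling constants
}


def load_quantized(model_name: str, bnb_kwargs: dict):
    """Hypothetical helper: load a causal LM with a bitsandbytes config."""
    # Imported lazily so the dicts above stay usable without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    config = BitsAndBytesConfig(**bnb_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=config,
        device_map="auto",  # place the weights on the available GPU
    )
    return model, tokenizer
```

With a CUDA GPU and bitsandbytes installed, `load_quantized("Qwen/Qwen2.5-1.5B-Instruct", INT4_NF4_KWARGS)` would return a 4-bit model and its tokenizer.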

In [None]:
# Storage for models and results
models = {}  # {precision: (model, tokenizer)}
load_results = {}  # {precision: ModelLoadResult}
benchmarks = []  # List of QuantizationBenchmark

print("🔄 Will load models one at a time to accurately measure memory.")
print("   This takes a few minutes per model.")

In [None]:
# Load FP16 Model (Baseline)
print("=" * 60)
print("Loading FP16 Model (Baseline)")
print("=" * 60)

clear_gpu_memory()
print(f"GPU Memory before: {get_gpu_memory_mb():.1f} MB")

result_fp16 = load_model_fp16(MODEL_NAME)

print(f"\n✅ FP16 Model Loaded!")
print(f"   Load time: {result_fp16.load_time_seconds:.2f}s")
print(f"   Memory used: {result_fp16.memory_mb:.1f} MB")
print(f"   Peak memory: {result_fp16.peak_memory_mb:.1f} MB")

# Store
models["FP16"] = (result_fp16.model, result_fp16.tokenizer)
load_results["FP16"] = result_fp16

# Quick test
test_input = result_fp16.tokenizer("Hello", return_tensors="pt").to(result_fp16.model.device)
with torch.no_grad():
    test_output = result_fp16.model.generate(**test_input, max_new_tokens=5)
print(f"   Test output: {result_fp16.tokenizer.decode(test_output[0], skip_special_tokens=True)}")

In [None]:
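# Free FP16 model before loading INT8 (for accurate memory measurement)
print("\n🧹 Clearing FP16 model to load INT8...")
del models["FP16"]  # Drop the dict entry
# result_fp16 (also stored in load_results["FP16"]) still references the model,
# so drop that reference too; otherwise the weights stay on the GPU.
# Assumes the load result allows attribute assignment (e.g. a non-frozen dataclass).
result_fp16.model = None
clear_gpu_memory()
print(f"GPU Memory after clearing: {get_gpu_memory_mb():.1f} MB")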

In [None]:
# Load INT8 Model
print("=" * 60)
print("Loading INT8 Model (8-bit quantized)")
print("=" * 60)

clear_gpu_memory()
print(f"GPU Memory before: {get_gpu_memory_mb():.1f} MB")

result_int8 = load_model_int8(MODEL_NAME)

print(f"\n✅ INT8 Model Loaded!")
print(f"   Load time: {result_int8.load_time_seconds:.2f}s")
print(f"   Memory used: {result_int8.memory_mb:.1f} MB")
print(f"   Peak memory: {result_int8.peak_memory_mb:.1f} MB")
print(f"   Compression: {load_results['FP16'].memory_mb / result_int8.memory_mb:.1f}x smaller")

# Store
models["INT8"] = (result_int8.model, result_int8.tokenizer)
load_results["INT8"] = result_int8

# Quick test
test_input = result_int8.tokenizer("Hello", return_tensors="pt").to(result_int8.model.device)
with torch.no_grad():
    test_output = result_int8.model.generate(**test_input, max_new_tokens=5)
print(f"   Test output: {result_int8.tokenizer.decode(test_output[0], skip_special_tokens=True)}")

In [None]:
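# Free INT8 model before loading INT4
print("\n🧹 Clearing INT8 model to load INT4...")
del models["INT8"]  # Drop the dict entry
# result_int8 (also stored in load_results["INT8"]) still references the model,
# so drop that reference too; otherwise the weights stay on the GPU.
# Assumes the load result allows attribute assignment (e.g. a non-frozen dataclass).
result_int8.model = None
clear_gpu_memory()
print(f"GPU Memory after clearing: {get_gpu_memory_mb():.1f} MB")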

In [None]:
# Load INT4 Model (NF4 quantization)
print("=" * 60)
print("Loading INT4-NF4 Model (4-bit quantized)")
print("=" * 60)

clear_gpu_memory()
print(f"GPU Memory before: {get_gpu_memory_mb():.1f} MB")

result_int4 = load_model_int4(MODEL_NAME, quant_type="nf4")

print(f"\n✅ INT4-NF4 Model Loaded!")
print(f"   Load time: {result_int4.load_time_seconds:.2f}s")
print(f"   Memory used: {result_int4.memory_mb:.1f} MB")
print(f"   Peak memory: {result_int4.peak_memory_mb:.1f} MB")
print(f"   Compression: {load_results['FP16'].memory_mb / result_int4.memory_mb:.1f}x smaller")

# Store
models["INT4-NF4"] = (result_int4.model, result_int4.tokenizer)
load_results["INT4-NF4"] = result_int4

# Quick test
test_input = result_int4.tokenizer("Hello", return_tensors="pt").to(result_int4.model.device)
with torch.no_grad():
    test_output = result_int4.model.generate(**test_input, max_new_tokens=5)
print(f"   Test output: {result_int4.tokenizer.decode(test_output[0], skip_special_tokens=True)}")

In [None]:
# Summary: Memory Comparison
print("=" * 60)
print("Memory Comparison Summary")
print("=" * 60)

fp16_mem = load_results["FP16"].memory_mb

summary_data = []
for name, result in load_results.items():
    bits = {"FP16": 16, "INT8": 8, "INT4-NF4": 4}.get(name, 16)
    summary_data.append({
        "Precision": name,
        "Bits": bits,
        "Memory (MB)": result.memory_mb,
        "Load Time (s)": result.load_time_seconds,
        "Compression": f"{fp16_mem / result.memory_mb:.1f}x",
    })

df_memory = pd.DataFrame(summary_data)
display(df_memory)

# Bar chart
fig = px.bar(
    df_memory,
    x="Precision",
    y="Memory (MB)",
    title="GPU Memory Usage by Quantization Level",
    text="Memory (MB)",
    color="Precision",
)
fig.update_traces(texttemplate="%{text:.0f} MB", textposition="outside")
fig.update_layout(showlegend=False)
fig.show()

---

## Section 4: Benchmarking - Latency and Perplexity

Now we'll measure:
1. **Generation Speed** - tokens per second
2. **Perplexity** - objective quality metric (lower = better)

We need to reload all models for fair comparison.
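
Perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a reference text. The lab's `calculate_perplexity_simple` lives in `quant_utils` and is not shown here; the sketch below only illustrates the standard definition, given per-token log-probabilities that are assumed to be already computed.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_guess = [math.log(0.25)] * 10
print(f"{perplexity_from_logprobs(uniform_guess):.3f}")
```

A quantized model that assigns slightly lower probability to the correct tokens has a higher mean negative log-likelihood, and therefore a higher perplexity than the FP16 baseline.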

In [None]:
# Reload all models for benchmarking
# (We deleted them earlier to measure memory accurately)

print("🔄 Reloading all models for benchmarking...")
print("   This may take a few minutes.\n")

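# Clear first
models = {}
# The load results from Section 3 still reference their models; drop those
# references so the weights can be freed (assumes attribute assignment works).
for _res in load_results.values():
    _res.model = None
clear_gpu_memory()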

# We'll keep only INT4 loaded (smallest) and load others as needed
# Or reload all if you have enough VRAM

# For machines with limited VRAM, we benchmark one at a time
BENCHMARK_ONE_AT_A_TIME = True

if not BENCHMARK_ONE_AT_A_TIME:
    # Load all (needs ~5GB VRAM)
    print("Loading FP16...")
    result = load_model_fp16(MODEL_NAME)
    models["FP16"] = (result.model, result.tokenizer)
    
    print("Loading INT8...")
    result = load_model_int8(MODEL_NAME)
    models["INT8"] = (result.model, result.tokenizer)
    
    print("Loading INT4-NF4...")
    result = load_model_int4(MODEL_NAME)
    models["INT4-NF4"] = (result.model, result.tokenizer)
    
    print("\n✅ All models loaded!")
else:
    print("Will benchmark models one at a time (memory-efficient mode)")

In [None]:
# Benchmark Function

def run_full_benchmark(model, tokenizer, precision: str, bits: float) -> QuantizationBenchmark:
    """
    Run complete benchmark for a model.
    
    Returns QuantizationBenchmark with all metrics.
    """
    print(f"\n{'='*50}")
    print(f"Benchmarking {precision}")
    print(f"{'='*50}")
    
    memory_mb = get_gpu_memory_mb()
    print(f"Current memory: {memory_mb:.1f} MB")
    
    # 1. Perplexity
    print("\n📊 Calculating perplexity...")
    ppl = calculate_perplexity_simple(model, tokenizer, CALIBRATION_TEXT)
    print(f"   Perplexity: {ppl:.2f}")
    
    # 2. Generation speed
    print("\n⚡ Measuring generation speed...")
    speeds = []
    sample_outputs = {}
    
    for prompt in BENCHMARK_PROMPTS[:3]:  # Use 3 prompts for speed
        result = benchmark_generation(
            model, tokenizer, prompt,
            max_new_tokens=50,
            num_runs=2,
            warmup_runs=1,
        )
        speeds.append(result.tokens_per_second)
        sample_outputs[prompt[:30]] = result.output[:100]
    
    avg_speed = sum(speeds) / len(speeds)
    print(f"   Avg speed: {avg_speed:.1f} tokens/sec")
    
    return QuantizationBenchmark(
        precision=precision,
        bits_per_weight=bits,
        memory_mb=memory_mb,
        load_time_s=load_results.get(precision, load_results.get("FP16")).load_time_seconds,
        perplexity=ppl,
        tokens_per_second=avg_speed,
        sample_outputs=sample_outputs,
    )

In [None]:
# Run benchmarks one at a time

benchmarks = []

# FP16
print("\n" + "="*60)
print("Loading and benchmarking FP16...")
print("="*60)
clear_gpu_memory()
result = load_model_fp16(MODEL_NAME)
benchmark = run_full_benchmark(result.model, result.tokenizer, "FP16", 16.0)
benchmark.memory_mb = result.memory_mb  # Use accurate load-time memory
benchmark.load_time_s = result.load_time_seconds
benchmarks.append(benchmark)

# Free FP16 and load INT8
print("\n" + "="*60)
print("Loading and benchmarking INT8...")
print("="*60)
# Drop every reference to the FP16 model before loading INT8; `result` still
# points at it, so clear_gpu_memory() alone would not release the weights.
models.pop("FP16", None)
del result
clear_gpu_memory()
result = load_model_int8(MODEL_NAME)
benchmark = run_full_benchmark(result.model, result.tokenizer, "INT8", 8.0)
benchmark.memory_mb = result.memory_mb
benchmark.load_time_s = result.load_time_seconds
benchmarks.append(benchmark)

# Free and load INT4
print("\n" + "="*60)
print("Loading and benchmarking INT4-NF4...")
print("="*60)
# Free the previous model and its reference
del result
clear_gpu_memory()
result = load_model_int4(MODEL_NAME, quant_type="nf4")
benchmark = run_full_benchmark(result.model, result.tokenizer, "INT4-NF4", 4.0)
benchmark.memory_mb = result.memory_mb
benchmark.load_time_s = result.load_time_seconds
benchmarks.append(benchmark)

# Keep INT4 loaded (smallest)
models["INT4-NF4"] = (result.model, result.tokenizer)

print("\n" + "="*60)
print("✅ All benchmarks complete!")
print("="*60)

In [None]:
# Results Summary

print("\n" + "="*70)
print("                    BENCHMARK RESULTS SUMMARY")
print("="*70)

# Create DataFrame
results_data = []
for b in benchmarks:
    results_data.append({
        "Precision": b.precision,
        "Bits": b.bits_per_weight,
        "Memory (MB)": f"{b.memory_mb:.0f}",
        "Load Time (s)": f"{b.load_time_s:.1f}",
        "Perplexity": f"{b.perplexity:.2f}",
        "Speed (tok/s)": f"{b.tokens_per_second:.1f}",
    })

df_results = pd.DataFrame(results_data)
display(df_results)

# Calculate relative metrics
fp16_ppl = benchmarks[0].perplexity
fp16_speed = benchmarks[0].tokens_per_second
fp16_mem = benchmarks[0].memory_mb

print("\n📊 Relative to FP16 Baseline:")
for b in benchmarks:
    ppl_change = ((b.perplexity - fp16_ppl) / fp16_ppl) * 100
    speed_change = ((b.tokens_per_second - fp16_speed) / fp16_speed) * 100
    mem_ratio = fp16_mem / b.memory_mb
    
    print(f"\n{b.precision}:")
    print(f"   Memory: {mem_ratio:.1f}x smaller")
    print(f"   Speed: {speed_change:+.1f}% {'faster' if speed_change > 0 else 'slower'}")
    print(f"   Quality: {ppl_change:+.1f}% perplexity change")

In [None]:
# Visualization: Summary Dashboard

fig = create_summary_dashboard(benchmarks)
fig.show()

In [None]:
# Individual Charts

# Memory Chart
fig_mem = create_memory_chart(benchmarks)
fig_mem.show()

# Speed Chart
fig_speed = create_speed_chart(benchmarks)
fig_speed.show()

# Perplexity Chart
fig_ppl = create_perplexity_chart(benchmarks)
fig_ppl.show()

# Tradeoff Chart
fig_trade = create_tradeoff_chart(benchmarks)
fig_trade.show()

---

## Section 5: Quality Arena - Side-by-Side Comparison

Numbers are great, but can you actually *see* the quality difference?

In this section, we'll:
1. Generate outputs from all quantization levels for the same prompts
2. Display them side-by-side
3. Vote on which is best (blind voting available)

In [None]:
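# Reload all three models side by side for the quality arena

print("🔄 Loading all models for quality arena...")

# Clear everything, including the `result` left over from benchmarking,
# which still references the last model loaded
models = {}
result = None
clear_gpu_memory()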

# Load all three
print("Loading FP16...")
result = load_model_fp16(MODEL_NAME)
models["FP16"] = (result.model, result.tokenizer)

print("Loading INT8...")
result = load_model_int8(MODEL_NAME)
models["INT8"] = (result.model, result.tokenizer)

print("Loading INT4-NF4...")
result = load_model_int4(MODEL_NAME)
models["INT4-NF4"] = (result.model, result.tokenizer)

print(f"\n✅ All models loaded! Current GPU memory: {get_gpu_memory_mb():.0f} MB")

In [None]:
# Quality Arena Class

class QualityArena:
    """Interactive side-by-side quality comparison."""
    
    def __init__(self, models_dict: dict, prompts: dict):
        self.models = models_dict
        self.prompts = prompts
        self.responses = {}  # {prompt: {precision: response}}
        self.votes = {p: 0 for p in models_dict.keys()}  # Vote counts
        self.current_prompt_idx = 0
        self.prompt_names = list(prompts.keys())
        self.blind_mode = False
        
    def generate_all_responses(self, prompt: str, max_new_tokens: int = 150):
        """Generate responses from all models for a prompt."""
        responses = {}
        
        for precision, (model, tokenizer) in self.models.items():
            device = next(model.parameters()).device
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            
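            # Re-seed before each model so every precision samples from the
            # same random stream and differences reflect quantization
            torch.manual_seed(0)
            with torch.no_grad():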
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
                )
            
            response = tokenizer.decode(
                outputs[0][inputs.input_ids.shape[1]:],
                skip_special_tokens=True
            )
            responses[precision] = response.strip()
        
        return responses
    
    def display_comparison(self, prompt_name: str):
        """Display side-by-side comparison for a prompt."""
        prompt = self.prompts[prompt_name]
        
        if prompt not in self.responses:
            print(f"\n⏳ Generating responses for: {prompt_name}")
            self.responses[prompt] = self.generate_all_responses(prompt)
        
        responses = self.responses[prompt]
        
        # Display
        print("\n" + "="*70)
        print(f"Task: {prompt_name}")
        print(f"Prompt: {prompt}")
        print("="*70)
        
        if self.blind_mode:
            # Shuffle for blind comparison
            import random
            items = list(responses.items())
            random.shuffle(items)
            labels = ["Option A", "Option B", "Option C"]
            
            for label, (precision, response) in zip(labels, items):
                print(f"\n📝 {label}:")
                print("-" * 40)
                print(response[:500])
                print()
            
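            # Store the mapping instead of printing it, so the comparison stays blind
            self.blind_label_map = {l: p for l, (p, _) in zip(labels, items)}
            print("\n🔒 Labels hidden. Print arena.blind_label_map to reveal them.")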
        else:
            for precision, response in responses.items():
                print(f"\n📝 {precision}:")
                print("-" * 40)
                print(response[:500])
                print()
    
    def record_vote(self, precision: str):
        """Record a vote for a precision level."""
        if precision in self.votes:
            self.votes[precision] += 1
            print(f"✅ Vote recorded for {precision}")
    
    def show_results(self):
        """Display voting results."""
        total = sum(self.votes.values())
        
        print("\n" + "="*50)
        print("        QUALITY ARENA RESULTS")
        print("="*50)
        
        if total == 0:
            print("No votes recorded yet.")
            return
        
        sorted_votes = sorted(self.votes.items(), key=lambda x: x[1], reverse=True)
        
        for i, (precision, count) in enumerate(sorted_votes, 1):
            pct = (count / total) * 100
            bar = "█" * int(pct / 5)
            medal = ["🥇", "🥈", "🥉"][i-1] if i <= 3 else "  "
            print(f"{medal} {precision}: {bar} {count} votes ({pct:.1f}%)")
        
        print(f"\nTotal votes: {total}")


# Create arena
arena = QualityArena(models, ARENA_PROMPTS)
print("✅ Quality Arena ready!")
print(f"   Prompts: {list(ARENA_PROMPTS.keys())}")

In [None]:
# Interactive Arena Widget

# Prompt selector
w_prompt = widgets.Dropdown(
    options=list(ARENA_PROMPTS.keys()),
    description="Task:",
    style={"description_width": "initial"}
)

# Blind mode toggle
w_blind = widgets.Checkbox(
    value=False,
    description="Blind Mode (shuffle labels)",
)

# Generate button
w_generate = widgets.Button(
    description="Generate Comparisons",
    button_style="primary",
)

# Vote buttons
w_vote_fp16 = widgets.Button(description="Vote FP16", button_style="info")
w_vote_int8 = widgets.Button(description="Vote INT8", button_style="warning")
w_vote_int4 = widgets.Button(description="Vote INT4", button_style="success")

# Results button
w_results = widgets.Button(description="Show Results", button_style="")

# Output
w_output = widgets.Output()

def on_generate(b):
    arena.blind_mode = w_blind.value
    with w_output:
        clear_output()
        arena.display_comparison(w_prompt.value)

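# Route vote confirmations into the output widget; prints from button
# callbacks are otherwise not guaranteed to appear under the cell
def on_vote_fp16(b):
    with w_output:
        arena.record_vote("FP16")

def on_vote_int8(b):
    with w_output:
        arena.record_vote("INT8")

def on_vote_int4(b):
    with w_output:
        arena.record_vote("INT4-NF4")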

def on_results(b):
    with w_output:
        arena.show_results()

w_generate.on_click(on_generate)
w_vote_fp16.on_click(on_vote_fp16)
w_vote_int8.on_click(on_vote_int8)
w_vote_int4.on_click(on_vote_int4)
w_results.on_click(on_results)

# Display
display(widgets.VBox([
    widgets.HTML("<h3>🏆 Quality Arena</h3>"),
    widgets.HBox([w_prompt, w_blind]),
    w_generate,
    widgets.HTML("<br><b>Vote for the best response:</b>"),
    widgets.HBox([w_vote_fp16, w_vote_int8, w_vote_int4, w_results]),
    w_output,
]))

In [None]:
# Manual comparison (non-widget fallback)

print("\n" + "="*70)
print("           SIDE-BY-SIDE QUALITY COMPARISON")
print("="*70)

for task_name, prompt in ARENA_PROMPTS.items():
    print(f"\n\n{'#'*70}")
    print(f"# Task: {task_name}")
    print(f"# Prompt: {prompt}")
    print(f"{'#'*70}")
    
    responses = arena.generate_all_responses(prompt, max_new_tokens=100)
    arena.responses[prompt] = responses
    
    for precision, response in responses.items():
        print(f"\n📝 [{precision}]:")
        print("-" * 50)
        print(response[:400])

print("\n" + "="*70)
print("✅ All comparisons generated!")
print("   Use the voting widget above or call arena.record_vote('FP16')")

---

## Section 6: Exercises

Test your understanding with these challenges!

### Exercise 1: Memory Budget Challenge

**Scenario:** You have a GPU with only **2GB** of available VRAM. 

**Question:** Which quantization level(s) can fit, and what's the tradeoff?

In [None]:
# Exercise 1: Check which models fit in 2GB

VRAM_BUDGET_MB = 2000  # 2GB

print(f"VRAM Budget: {VRAM_BUDGET_MB} MB\n")

for b in benchmarks:
    fits = "✅ FITS" if b.memory_mb <= VRAM_BUDGET_MB else "❌ TOO BIG"
    print(f"{b.precision}: {b.memory_mb:.0f} MB - {fits}")

# Your analysis:
# TODO: Which would you choose and why?

### Exercise 2: Quality Threshold

**Scenario:** Your application requires perplexity to stay within **10%** of the FP16 baseline.

**Question:** What's the most aggressive quantization you can use?

In [None]:
# Exercise 2: Find lowest precision meeting quality threshold

MAX_PPL_INCREASE = 0.10  # 10% max increase

fp16_ppl = benchmarks[0].perplexity
threshold = fp16_ppl * (1 + MAX_PPL_INCREASE)

print(f"FP16 Baseline Perplexity: {fp16_ppl:.2f}")
print(f"Max Allowed Perplexity: {threshold:.2f} (+{MAX_PPL_INCREASE*100:.0f}%)\n")

for b in benchmarks:
    ppl_change = ((b.perplexity - fp16_ppl) / fp16_ppl) * 100
    meets = "✅ MEETS" if b.perplexity <= threshold else "❌ EXCEEDS"
    print(f"{b.precision}: PPL={b.perplexity:.2f} ({ppl_change:+.1f}%) - {meets}")

# Your analysis:
# TODO: Which precision would you recommend?

### Exercise 3: Pareto Frontier

**Question:** Looking at the tradeoff chart, which quantization levels are on the "Pareto frontier", meaning no other level is at least as good on both speed and quality and strictly better on one?
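
A point is on the Pareto frontier when no other point dominates it. The sketch below checks that condition for speed (higher is better) and perplexity (lower is better). The function name and the example numbers are made up for illustration; substitute the values from your own `benchmarks` list.

```python
def pareto_frontier(points):
    """Return the names of points that no other point dominates.

    points: list of (name, tokens_per_second, perplexity) tuples.
    A point dominates another if it is at least as fast AND at least as
    accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, speed, ppl in points:
        dominated = any(
            o_speed >= speed and o_ppl <= ppl and (o_speed > speed or o_ppl < ppl)
            for _, o_speed, o_ppl in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Made-up numbers for illustration only, not measured results.
example = [("A", 30.0, 12.0), ("B", 20.0, 12.1), ("C", 45.0, 12.8)]
print(pareto_frontier(example))
```

In this made-up example, B is dominated by A (slower and less accurate), while A and C each win on one axis, so both stay on the frontier.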

In [None]:
# Exercise 3: Analyze the Pareto frontier

# Recreate tradeoff chart for analysis
fig = create_tradeoff_chart(benchmarks)
fig.show()

# Analysis helper
print("\nSpeed vs Quality Analysis:")
print("-" * 50)
for b in benchmarks:
    # Heuristic "efficiency score" (tokens/sec per unit of perplexity):
    # a rough ranking aid only, not a test for Pareto optimality
    efficiency = b.tokens_per_second / b.perplexity
    print(f"{b.precision}: Speed={b.tokens_per_second:.1f} tok/s, PPL={b.perplexity:.2f}, Efficiency={efficiency:.2f}")

# Your analysis:
# TODO: Which point(s) are on the Pareto frontier?

### Exercise 4: Blind Test

**Challenge:** Can you tell the difference between FP16 and INT4 outputs?

Enable blind mode in the Quality Arena and try to guess which is which!

In [None]:
# Exercise 4: Blind comparison

import random

print("🔒 BLIND TEST MODE")
print("="*50)
print("Can you tell which output is FP16 vs INT4?\n")

test_prompt = "Explain compound interest in simple terms."

# Generate responses
responses = {}
for precision, (model, tokenizer) in models.items():
    device = next(model.parameters()).device
    inputs = tokenizer(test_prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    responses[precision] = response.strip()

# Shuffle
items = list(responses.items())
random.shuffle(items)
labels = ["Option A", "Option B", "Option C"]
label_map = {l: p for l, (p, _) in zip(labels, items)}

print(f"Prompt: {test_prompt}\n")

for label, (precision, response) in zip(labels, items):
    print(f"\n📝 {label}:")
    print("-" * 40)
    print(response[:500])

print("\n" + "="*50)
print("Which is FP16? Which is INT4? Make your guess!")
print("\nRun the next cell to reveal the answer...")

In [None]:
# Reveal blind test answers
print("🔓 REVEAL")
print("="*50)
for label, precision in label_map.items():
    print(f"{label} = {precision}")

---

## Summary & Key Takeaways

### What We Learned

1. **Quantization trades precision for efficiency**
   - FP16 → INT8: ~2x memory reduction, minimal quality loss
   - FP16 → INT4: ~4x memory reduction, some quality degradation

2. **The right choice depends on your constraints**
   - Limited VRAM? → INT4 or INT8
   - Quality-critical? → FP16 or INT8
   - Cost-sensitive? → INT4 with careful validation

3. **Perplexity is a good objective metric**
   - But always validate with task-specific evaluation
   - Human preference matters for subjective tasks

4. **Modern quantization is surprisingly good**
   - NF4 (4-bit NormalFloat) preserves quality well
   - For many tasks, users can't tell the difference
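
The selection logic from takeaway 2 can be written down as a small rule: among the precisions that fit the memory budget and stay within the allowed perplexity increase, take the smallest. This is only a sketch; `pick_precision` is a hypothetical helper and the numbers are placeholders, not results from this lab.

```python
def pick_precision(results, vram_budget_mb, max_ppl_increase, baseline="FP16"):
    """Smallest-memory precision that fits the budget and stays within the
    allowed relative perplexity increase over the baseline.

    results: dict mapping precision name -> (memory_mb, perplexity).
    Returns None if nothing qualifies.
    """
    baseline_ppl = results[baseline][1]
    limit = baseline_ppl * (1 + max_ppl_increase)
    candidates = [
        (memory_mb, name)
        for name, (memory_mb, ppl) in results.items()
        if memory_mb <= vram_budget_mb and ppl <= limit
    ]
    return min(candidates)[1] if candidates else None


# Placeholder numbers for illustration only; substitute your measured values.
example = {"FP16": (3000, 10.0), "INT8": (1800, 10.2), "INT4-NF4": (1200, 11.5)}
print(pick_precision(example, vram_budget_mb=2000, max_ppl_increase=0.10))
```

Loosening the quality threshold or tightening the memory budget changes the answer, which is why both constraints need to be stated before choosing a precision.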

### Quick Reference

| Precision | When to Use |
|-----------|-------------|
| FP16 | Development, quality-critical production |
| INT8 | Production with quality requirements |
| INT4-NF4 | Edge deployment, cost optimization |

### Next Steps

- Explore **GPTQ** and **AWQ**, which use calibration data and can give better 4-bit quality
- Try **quantization-aware training (QAT)** for critical applications
- Investigate **speculative decoding** for faster inference
- Consider **mixed precision** for optimal tradeoffs

In [None]:
# Final Summary

print("\n" + "="*70)
print("           QUANTIZATION LAB - FINAL SUMMARY")
print("="*70)

# Show final benchmark table
display(df_results)

# Show voting results if any
if sum(arena.votes.values()) > 0:
    print("\n📊 Quality Arena Voting Results:")
    arena.show_results()

print("\n✅ Lab complete! You now understand:")
print("   • How quantization affects model size")
print("   • The memory-quality-speed tradeoffs")
print("   • How to measure quality with perplexity")
print("   • How to choose the right precision for your use case")

In [None]:
# Cleanup

# Uncomment to free GPU memory when done:
# for name in list(models.keys()):
#     del models[name]
# clear_gpu_memory()
# print(f"GPU memory after cleanup: {get_gpu_memory_mb():.0f} MB")