# Model Quantization with HuggingFace Transformers

Quantization reduces memory usage and can improve throughput at a modest accuracy cost.

## Why Quantization?
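- **Memory**: 4-bit weights take roughly 1/4 the memory of fp16 weights (activations and the KV cache are not reduced)
- **Speed**: Lower precision can enable faster inference when optimized low-bit kernels are available; some methods mainly save memory and may run slower than fp16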
- **Deployment**: Fit larger models on smaller GPUs
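
The memory claim above follows from simple arithmetic on bits per parameter. The sketch below estimates weight memory only; the helper name `weight_memory_gb` and the round figure of 4 billion parameters are assumptions made for illustration, and real usage will be higher once activations, the KV cache, and quantization metadata are included.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed round figure for a "4B" model; the true parameter count differs slightly.
num_params = 4e9

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Halving the bits per parameter halves the weight footprint, which is where the "~1/4 of fp16" figure for 4-bit comes from.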

## Quantization Methods
- **8-bit (INT8)**: Good balance, minimal accuracy loss
- **4-bit (NF4)**: Maximum compression, some quality tradeoff
- **GPTQ/AWQ**: Post-training quantization optimized for inference
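
To make the idea of 8-bit quantization concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. This is a deliberately simplified illustration, not what bitsandbytes, GPTQ, or AWQ implement: those libraries use more elaborate schemes (for example block-wise scaling, outlier handling, or non-uniform 4-bit levels). The function names and sample weights are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.12, -0.5, 0.33, 1.27, -0.04]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, so a single large outlier (which inflates the scale) degrades the precision of every other value. That sensitivity is one reason practical methods quantize in small blocks or treat outliers separately.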



In [None]:
# Set up torch for optimal performance
import gc
import torch

if torch.cuda.is_available():
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.conv.fp32_precision = "tf32"

torch.manual_seed(42)

# Model identifiers
MODEL_INSTRUCT = "Qwen/Qwen3-4B-Instruct-2507"
MODEL_THINKING = "Qwen/Qwen3-4B-Thinking-2507"

def report_memory(tag):
    if not torch.cuda.is_available():
        print("CUDA not available.")
        return
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag}: allocated {allocated:.2f} GB, reserved {reserved:.2f} GB")

gc.collect()
torch.cuda.empty_cache()

report_memory("Initial")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model(model_id, quantization_config=None):
    """Load a model and tokenizer with optional quantization."""
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        use_fast=True,
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        dtype=torch.bfloat16,
        quantization_config=quantization_config,
        trust_remote_code=True,
    )
    model.eval()
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer, model


In [None]:
def format_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Format messages for chat models."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_chat(
    model,
    tokenizer,
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    **kwargs,
):
    """Generate a chat response."""
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    gen_kwargs = dict(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
        **kwargs,
    )
    if do_sample:
        gen_kwargs.update({"temperature": temperature, "top_p": top_p})

    with torch.inference_mode():
        output_ids = model.generate(**gen_kwargs)

    gen_ids = output_ids[0, input_ids.shape[-1] :]
    return tokenizer.decode(gen_ids, skip_special_tokens=True)


comparison_prompts = {
    "math": "Solve: If a train travels 120 km in 1.5 hours, what is its average speed?",
    "reasoning": "You have 3 boxes: apples, oranges, and mixed. All labels are wrong. "
    "Pick one fruit to identify all boxes. Explain.",
    "analysis": "Compare pros and cons of deploying an LLM on-prem vs in the cloud.",
}


## Three-Way Comparison: Baseline vs INT8 vs NF4

Run the same test prompts across all three quantization levels to compare outputs and memory usage.
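
The comparison cell repeats the same load, generate, and collect pattern once per configuration. As a sketch, that pattern could be factored into a single loop. The helper `run_comparison` and the stand-in callables below are hypothetical and exist only so the example runs without a GPU.

```python
def run_comparison(configs, prompts, load_fn, generate_fn, cleanup_fn=None):
    """Run every prompt against every configuration, one model at a time.

    configs: dict mapping a label to a quantization config (None for the baseline).
    load_fn(config) -> (tokenizer, model)
    generate_fn(model, tokenizer, prompt) -> str
    cleanup_fn(): optional hook called after each model is released.
    """
    results = {}
    for label, config in configs.items():
        tokenizer, model = load_fn(config)
        results[label] = {
            name: generate_fn(model, tokenizer, prompt)
            for name, prompt in prompts.items()
        }
        # Drop references so the next variant does not share memory with this one.
        del model, tokenizer
        if cleanup_fn is not None:
            cleanup_fn()
    return results


# Stand-in callables (assumptions for illustration, not real model loaders).
def fake_load(config):
    return "tokenizer", f"model<{config}>"


def fake_generate(model, tokenizer, prompt):
    return f"{model} answered: {prompt}"


demo = run_comparison(
    {"baseline": None, "int8": "8bit-config"},
    {"math": "2+2?"},
    fake_load,
    fake_generate,
)
print(demo)
```

In this notebook, `load_fn` could wrap `load_model(MODEL_INSTRUCT, quantization_config=config)`, `generate_fn` could wrap `format_messages` and `generate_chat`, and `cleanup_fn` could call `gc.collect()` followed by `torch.cuda.empty_cache()`.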

In [None]:
# Dictionary to store results from all three quantization methods
all_results = {}

# 1. BASELINE (unquantized BF16)
print("\n" + "="*60)
print("LOADING: Unquantized Baseline (BF16)")
print("="*60)

if 'model' in locals() and model is not None:
    del model
gc.collect()
torch.cuda.empty_cache()

tokenizer, model = load_model(MODEL_INSTRUCT)
report_memory("Baseline (BF16)")

baseline_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a precise assistant.")
    baseline_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=200,
        do_sample=False,
    )

all_results['Baseline (BF16)'] = baseline_results
print("✓ Baseline results collected.\n")

# 2. INT8 QUANTIZATION
print("="*60)
print("LOADING: 8-bit Quantization (INT8)")
print("="*60)

if 'model' in locals() and model is not None:
    del model
gc.collect()
torch.cuda.empty_cache()

bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)
tokenizer, model = load_model(MODEL_INSTRUCT, quantization_config=bnb_8bit)
report_memory("INT8 (8-bit)")

int8_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a precise assistant.")
    int8_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=200,
        do_sample=False,
    )

all_results['INT8 (8-bit)'] = int8_results
print("✓ INT8 results collected.\n")

# 3. NF4 QUANTIZATION
print("="*60)
print("LOADING: 4-bit Quantization (NF4)")
print("="*60)

if 'model' in locals() and model is not None:
    del model
gc.collect()
torch.cuda.empty_cache()

bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer, model = load_model(MODEL_INSTRUCT, quantization_config=bnb_4bit)
report_memory("NF4 (4-bit)")

nf4_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a precise assistant.")
    nf4_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=200,
        do_sample=False,
    )

all_results['NF4 (4-bit)'] = nf4_results
print("✓ NF4 results collected.\n")


### Display Results

Compare outputs across all three quantization levels side-by-side.

In [None]:
# Display comprehensive three-way comparison
for task_name in comparison_prompts:
    print(f"\n{'='*70}")
    print(f"TASK: {task_name.upper()}")
    print(f"{'='*70}")
    
    for quant_type, results in all_results.items():
        print(f"\n[{quant_type}]")
        print("-" * 70)
        print(results[task_name])
        print()


### Memory Usage Summary

Review the memory statistics printed during model loading:
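- **Baseline (BF16)**: Reference quality, highest memory usage
- **INT8**: Weights take roughly 50% less memory than BF16, with minimal quality loss
- **NF4 (4-bit)**: Weights take roughly 75% less memory than BF16, with an acceptable quality tradeoff

Measured savings are typically somewhat smaller than these figures because some layers (such as embeddings) stay in higher precision.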
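
To turn the printed memory statistics into percentage reductions, a small helper like the one below could be used. The function name `summarize_memory` is an assumption, and the numbers in `example` are hypothetical placeholders, not measurements; replace them with the values reported during loading (or with `model.get_memory_footprint()` converted from bytes to GB).

```python
def summarize_memory(footprints_gb, baseline_label):
    """Return {label: percent reduction relative to the baseline entry}."""
    baseline = footprints_gb[baseline_label]
    return {
        label: 100.0 * (1.0 - gb / baseline)
        for label, gb in footprints_gb.items()
    }


# Hypothetical placeholder values for illustration only.
example = {"BF16": 8.0, "INT8": 4.4, "NF4": 2.6}

for label, pct in summarize_memory(example, "BF16").items():
    print(f"{label:>5}: {example[label]:.1f} GB ({pct:.0f}% smaller than BF16)")
```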

## Other Quantization Formats

Beyond bitsandbytes, several other quantization methods are available:

### GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
- Pre-quantized models available on HuggingFace Hub
- Optimized for fast inference on GPU
- Loading: Use the `AutoGPTQForCausalLM` class from the `auto-gptq` package; pre-quantized checkpoints often carry a `-GPTQ` suffix

### AWQ (Activation-Aware Weight Quantization)
- Preserves important weights based on activation magnitudes
- Better accuracy than naive 4-bit quantization
- Loading: Use the `AutoAWQForCausalLM` class from the `autoawq` package; pre-quantized checkpoints often carry an `-AWQ` suffix

### GGUF (llama.cpp format)
- Designed for llama.cpp; runs well on CPU, with optional GPU offload
- Various quantization levels (Q4_K_M, Q5_K_S, Q8_0, etc.)
- Use with `llama-cpp-python` or similar libraries

**Example Hub Search**:
- GPTQ models: Search for "gptq" in model names
- AWQ models: Search for "awq" in model names

Refer to model cards on HuggingFace Hub for supported quantization variants and detailed loading instructions.
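
As a rough sketch of how the naming conventions above might be used, the helper below guesses a format from a repository name and, for GPTQ or AWQ checkpoints, loads the model through `AutoModelForCausalLM`. Both function names and the repository id are hypothetical. The loader assumes the matching backend package is installed alongside `transformers`; always check the model card for the exact requirements.

```python
def guess_quant_format(repo_id: str) -> str:
    """Guess the quantization format from a Hub repository name (heuristic only)."""
    name = repo_id.lower()
    for marker in ("gptq", "awq", "gguf"):
        if marker in name:
            return marker
    return "unknown"


def load_prequantized(repo_id: str):
    """Load a pre-quantized GPTQ or AWQ checkpoint.

    Assumption: the quantization settings are stored in the checkpoint's config,
    so no explicit quantization_config is passed here.
    """
    # Imported lazily so the name heuristic works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model


# Hypothetical repository name, used only to exercise the heuristic.
print(guess_quant_format("some-org/some-model-AWQ"))  # prints "awq"
```

GGUF files are not covered by `load_prequantized`; they are typically loaded with `llama-cpp-python` instead.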

## Key Takeaways

**When to Use Each Quantization Method:**

1. **Unquantized Baseline (BF16/FP16)**:
   - When you have sufficient GPU memory
   - Maximum accuracy is critical
   - Baseline for comparison

2. **8-bit (INT8)**:
   - Good balance between memory and quality
   - Minimal accuracy degradation
   - ~50% memory savings

3. **4-bit (NF4)**:
   - Maximum memory efficiency
   - Acceptable quality for most tasks
   - ~75% memory savings
   - Enables larger models on smaller GPUs

4. **GPTQ/AWQ**:
   - Pre-quantized models for production deployment
   - Fast inference with optimized kernels
   - Available on HuggingFace Hub
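
The guidance above can be condensed into a rule-of-thumb helper. This is only a sketch under stated assumptions: it considers weight memory alone, reserves a hypothetical 30% of GPU memory as headroom for activations and the KV cache, and ignores GPTQ/AWQ. The function name `suggest_precision` is made up for this example.

```python
def suggest_precision(num_params, gpu_memory_gb, headroom=0.3):
    """Return the highest-precision option whose weights fit in the memory budget.

    headroom: fraction of GPU memory kept free (an assumed default, not a measured value).
    Returns None when even 4-bit weights do not fit.
    """
    budget_gb = gpu_memory_gb * (1.0 - headroom)
    for label, bits in [("BF16/FP16", 16), ("INT8", 8), ("NF4", 4)]:
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb <= budget_gb:
            return label
    return None


# Assumed round figure of 4 billion parameters on GPUs of different sizes.
for gpu_gb in (24, 8, 4):
    print(f"{gpu_gb} GB GPU -> {suggest_precision(4e9, gpu_gb)}")
```

Quality requirements still matter: if a task is sensitive to small accuracy losses, prefer the higher-precision option even when a smaller one would fit.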
