# Llama Instruct 4-bit GPTQ with vLLM Compressor

This notebook walks through 4-bit weight-only GPTQ quantization with llm-compressor (the vLLM project's compression library), serves the quantized model with vLLM, makes a sample call, and benchmarks accuracy on 100 samples. Run the cells in order; any line that starts with `!` runs as a shell command.

## Prerequisites
- GPU with compute capability >= 8.0 (Ampere or newer) for fast W4A16 inference
- Enough VRAM to load the chosen model unquantized during calibration (the quantized model still needs ~40-50% of BF16 size)
- Set `HF_TOKEN` in the environment if the model is gated

If you are rerunning, restart the kernel after installation to ensure the right CUDA/toolkit versions are picked up.
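
As a rough guide to the VRAM requirement above, the sketch below estimates weight storage only. It assumes one 16-bit scale per group of 128 weights and a hypothetical parameter count; `estimate_weight_gib` is an illustrative helper, not part of vLLM or llm-compressor. Real usage is higher because embeddings and `lm_head` stay unquantized and the KV cache and activations need memory too, which is why the practical figure sits well above the raw 4-bit ratio.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: float = 4.0,
                        group_size: int = 128, scale_bits: int = 16) -> float:
    """Approximate weight storage in GiB for group-quantized linear layers."""
    # Each group of `group_size` weights shares one scale of `scale_bits` bits.
    effective_bits = bits_per_weight + scale_bits / group_size
    return num_params * effective_bits / 8 / 1024**3


params = 1.0e9  # hypothetical parameter count, not read from any checkpoint
bf16 = estimate_weight_gib(params, bits_per_weight=16, scale_bits=0)
w4a16 = estimate_weight_gib(params)
print(f"BF16:  {bf16:.2f} GiB")
print(f"W4A16: {w4a16:.2f} GiB ({w4a16 / bf16:.0%} of BF16, quantized layers only)")
```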

In [2]:
# Install vLLM + llm-compressor and friends (pin llmcompressor 0.6.0.1: py3.12-friendly and numpy<2/vLLM compatible)
!pip install -q --upgrade \
  'vllm>=0.5.5' \
  'llmcompressor==0.6.0.1' \
  'transformers>=4.43' \
  datasets accelerate

## Configure model + paths
Pick any Llama Instruct variant that you have access to. Calibration uses a small slice of UltraChat by default to keep the run quick.

In [3]:
from pathlib import Path
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # change if you prefer another model
CALIBRATION_DATASET = "HuggingFaceH4/ultrachat_200k"
CALIBRATION_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 128  # increase to 512+ for better quality
MAX_SEQUENCE_LENGTH = 2048
QUANTIZED_DIR = Path("llama-gptq-w4a16")

QUANTIZED_DIR.mkdir(exist_ok=True)
print(f"Saving quantized model to: {QUANTIZED_DIR.resolve()}")


Saving quantized model to: /home/ubuntu/vllm/llama-gptq-w4a16


## Load model + tokenizer
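Using `torch_dtype='auto'` loads the weights in the checkpoint's own dtype (typically BF16) instead of upcasting to FP32, which keeps memory reasonable during quantization.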

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


## Build calibration set
We keep preprocessing simple: take a small slice, shuffle it, apply the chat template, then tokenize without adding extra special tokens (the template already includes BOS).

In [None]:
# Load and shuffle a small calibration slice
raw_ds = load_dataset(
    CALIBRATION_DATASET,
    split=f"{CALIBRATION_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
).shuffle(seed=42)

def format_example(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False
        )
    }

# Apply chat template then tokenize
formatted = raw_ds.map(format_example)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

calibration_ds = formatted.map(tokenize, remove_columns=formatted.column_names)
print(calibration_ds)
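
One way to confirm that the template and the tokenizer did not both add BOS is to count the BOS tokens at the start of a sample. The sketch below is illustrative only: `count_leading_bos` is a hypothetical helper, and the token ids are placeholders rather than real Llama ids.

```python
def count_leading_bos(input_ids, bos_token_id):
    """Return how many consecutive BOS tokens start the sequence."""
    count = 0
    for token_id in input_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


BOS = 1  # placeholder id; in the notebook use tokenizer.bos_token_id
templated_once = [BOS, 42, 43, 44]        # BOS added by the chat template only
templated_twice = [BOS, BOS, 42, 43, 44]  # BOS added by template and tokenizer
print(count_leading_bos(templated_once, BOS))
print(count_leading_bos(templated_twice, BOS))
```

In the notebook, `count_leading_bos(calibration_ds[0]["input_ids"], tokenizer.bos_token_id)` should report a single BOS if the preprocessing behaves as described.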


## Quantize to 4-bit (W4A16) with GPTQ
This uses the llm-compressor one-shot pipeline with the built-in GPTQ recipe for 4-bit weights (group size 128). All `Linear` layers are quantized except `lm_head`, which stays in its original precision. Increase `NUM_CALIBRATION_SAMPLES` for better accuracy.

In [None]:
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=calibration_ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(QUANTIZED_DIR, save_compressed=True)
tokenizer.save_pretrained(QUANTIZED_DIR)
print(f"Quantized model saved to {QUANTIZED_DIR}")
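
To make the "4-bit weights, group size 128" storage format concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest group quantization. This is not GPTQ: GPTQ stores weights in the same kind of format but chooses the quantized values using the calibration data to compensate for rounding error. The function names and the random weights are illustrative assumptions, and a symmetric scheme is assumed.

```python
import random


def quantize_group(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map stored integers back to approximate float weights."""
    return [v * scale for v in q]


def quantize_row(row, group_size=128):
    """Quantize a row group by group; returns a list of (ints, scale) pairs."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weights, not from a model
groups = quantize_row(row)
restored = [v for q, s in groups for v in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Smaller groups give each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales.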


## Serve with vLLM
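Start the OpenAI-compatible server in a terminal (or a background cell). Update `--tensor-parallel-size` if using multiple GPUs. `--served-model-name` sets the model id that clients pass in requests; without it, vLLM exposes the model under the path given to `vllm serve`.

In [None]:
# Example: launch the server
# !CUDA_VISIBLE_DEVICES=0 vllm serve "$QUANTIZED_DIR" \
#   --served-model-name gptq-w4a16-demo \
#   --max-model-len 4096 \
#   --tensor-parallel-size 1 \
#   --port 8000 \
#   --api-key dummy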
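
The server takes a while to load the model, so requests sent right after launch can fail. A minimal sketch of a readiness check follows; it assumes the server exposes a `/health` endpoint on port 8000 that answers without an API key, and `wait_for_server` is a hypothetical helper rather than a vLLM function.

```python
import time
import urllib.request


def wait_for_server(url="http://localhost:8000/health", timeout_s=300.0, interval_s=2.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError, refused connections, and socket timeouts are all
            # OSError subclasses: the server is not ready yet.
            pass
        time.sleep(interval_s)
    return False


# Usage (after launching `vllm serve` in another process):
# if not wait_for_server():
#     raise RuntimeError("vLLM server did not become ready in time")
```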


## Call the served model
This cell uses the OpenAI client pointed at the local vLLM server. The `api_key` must match the value passed to `--api-key` when the server was started.

In [None]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="gptq-w4a16-demo",  # the model id exposed by vLLM (use GET /v1/models to inspect)
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me two bullet points on why quantization helps inference."},
    ],
    max_tokens=64,
)
print(resp.choices[0].message.content)
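
If the request fails because the model id is not found, the ids the server exposes can be read from `GET /v1/models`. The sketch below only parses a response body of the OpenAI list shape; the sample payload is hand-written for illustration, and `served_model_ids` is a hypothetical helper.

```python
import json


def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Hand-written sample body, not captured from a real server.
sample = json.loads(
    '{"object": "list", "data": [{"id": "gptq-w4a16-demo", "object": "model"}]}'
)
print(served_model_ids(sample))
```

With the client from the cell above, `[m.id for m in client.models.list()]` should give the same information directly from the running server.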


## Local inference without the server (optional)
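You can also load the quantized checkpoint directly with the Python API for quick experiments. If a vLLM server is still running on the same GPU, stop it first or lower `gpu_memory_utilization`, since each vLLM instance reserves 90% of GPU memory by default.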

In [None]:
from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=0.0, max_tokens=64)
local_llm = LLM(model=str(QUANTIZED_DIR), tensor_parallel_size=1)
outputs = local_llm.generate([
    "Summarize quantization in one sentence.",
], sampling)
print(outputs[0].outputs[0].text.strip())


## Benchmark on a small dataset (100 samples)
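We use the first 100 rows of the `tweet_eval/sentiment` training split (3-way classification). The prompt asks for a one-word label; each generation is mapped to a label by substring match, falling back to `neutral` when nothing matches, and we report accuracy against the gold labels.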

In [None]:
from collections import Counter
from datasets import load_dataset
import numpy as np

LABELS = {0: "negative", 1: "neutral", 2: "positive"}
EVAL_SPLIT = "train[:100]"

eval_ds = load_dataset("tweet_eval", "sentiment", split=EVAL_SPLIT)

prompts = [
    f"""Classify the sentiment of the tweet as negative, neutral, or positive. Respond with a single word.

Tweet: {row['text']}

Label:"""
    for row in eval_ds
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=3, stop=["\n"])
llm_for_eval = local_llm  # reuse the in-memory quantized model

generations = llm_for_eval.generate(prompts, sampling_params)

def normalize(text: str) -> str:
    t = text.strip().lower()
    for lbl in LABELS.values():
        if lbl in t:
            return lbl
    if t.startswith("pos"):
        return "positive"
    if t.startswith("neg"):
        return "negative"
    return "neutral"

preds = [normalize(out.outputs[0].text) for out in generations]
gold = [LABELS[int(row["label"])] for row in eval_ds]

acc = np.mean([p == g for p, g in zip(preds, gold)])
counts = Counter(preds)
print(f"Accuracy on {len(gold)} samples: {acc:.3f}")
print("Prediction distribution:", counts)
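
With only 100 samples the accuracy estimate is noisy, so small differences between the quantized and original model may not be meaningful. A minimal sketch of a normal-approximation 95% confidence interval follows; the accuracy value used is a placeholder, not a measured result, and `accuracy_interval` is an illustrative helper.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


low, high = accuracy_interval(0.70, 100)  # 0.70 is a placeholder accuracy
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

In the notebook, pass `acc` and `len(gold)` instead of the placeholder values. At this sample size the interval spans up to about ten percentage points on either side, so a larger evaluation slice is needed to detect small regressions.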
