# Llama Instruct 4-bit AWQ with vLLM Compressor

This notebook applies 4-bit Activation-aware Weight Quantization (AWQ) using llm-compressor (the vLLM project's compression library), serves the quantized model with vLLM, makes a sample call, and benchmarks it on 1,000 XSum samples against the unquantized base model.

## Prerequisites
- GPU with compute capability >= 8.0 for fast W4A16 inference
- Set `HUGGINGFACE_HUB_TOKEN` if the model is gated
- Restart the kernel after installation if CUDA libraries change
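
The compute-capability requirement above can be checked before running anything heavy. The sketch below is only an illustration: `supports_fast_w4a16` and `detect_capability` are hypothetical helper names, and it assumes PyTorch may not be installed yet, in which case it reports that nothing was detected.

```python
def supports_fast_w4a16(capability, minimum=(8, 0)):
    """Return True when a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def detect_capability():
    """Return the first GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # may not be installed until requirements.txt has been applied
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed yet).")
else:
    verdict = "OK" if supports_fast_w4a16(capability) else "below 8.0"
    print(f"Compute capability {capability[0]}.{capability[1]}: {verdict}")
```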

In [9]:
# Install dependencies from requirements.txt (run once per environment)
import sys
!{sys.executable} -m pip install -q --upgrade -r requirements.txt


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## 1. Set up Hugging Face access


In [10]:
import os

# Optional: set HUGGINGFACE_HUB_TOKEN in your environment or paste it here.
HUGGINGFACE_HUB_TOKEN = os.getenv("HUGGINGFACE_HUB_TOKEN")
# HUGGINGFACE_HUB_TOKEN = "hf_..."  # Uncomment to hardcode for this notebook session

if not HUGGINGFACE_HUB_TOKEN:
    print("HUGGINGFACE_HUB_TOKEN not set; gated models may fail to load.")


## 2. Configure model + paths
AWQ needs a small calibration dataset to collect the activation statistics from which it computes per-channel scales.


In [11]:
from pathlib import Path
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
CALIBRATION_DATASET = "HuggingFaceH4/ultrachat_200k"
CALIBRATION_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 256  # AWQ benefits from a bit more data
MAX_SEQUENCE_LENGTH = 1024
QUANTIZED_DIR = Path("llama-awq-w4a16")
BASE_MODEL_PATH = None  # populated by the download step below

QUANTIZED_DIR.mkdir(exist_ok=True)
print(f"Saving quantized model to: {QUANTIZED_DIR.resolve()}")


Saving quantized model to: /home/ubuntu/vllm-compression-workshop/llama-awq-w4a16


## 2.1 Download the base model (cache snapshot)


In [None]:
import os
from huggingface_hub import snapshot_download

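# HF_HUB_OFFLINE is a string, so compare against truthy values ("0" must not enable offline mode).
offline = os.getenv("HF_HUB_OFFLINE", "").lower() in {"1", "on", "yes", "true"}
# Without a token a gated repo cannot be downloaded, so fall back to the local cache.
local_files_only = offline or not HUGGINGFACE_HUB_TOKEN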
BASE_MODEL_PATH = snapshot_download(
    MODEL_ID,
    token=HUGGINGFACE_HUB_TOKEN,
    local_files_only=local_files_only,
)
print(f"Base model snapshot: {BASE_MODEL_PATH}")


## 3. Load model + tokenizer


In [12]:
MODEL_SOURCE = BASE_MODEL_PATH or MODEL_ID

model = AutoModelForCausalLM.from_pretrained(
    MODEL_SOURCE,
    torch_dtype="auto",
    device_map="auto",
    token=HUGGINGFACE_HUB_TOKEN,
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_SOURCE,
    trust_remote_code=True,
    token=HUGGINGFACE_HUB_TOKEN,
)


## 3.1 Build calibration set


In [13]:
raw_ds = load_dataset(
    CALIBRATION_DATASET,
    split=f"{CALIBRATION_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    token=HUGGINGFACE_HUB_TOKEN,
).shuffle(seed=42)

def format_example(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False
        )
    }

formatted = raw_ds.map(format_example)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

calibration_ds = formatted.map(tokenize, remove_columns=formatted.column_names)
print(calibration_ds)


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 256
})
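
Before quantizing, it can help to see how long the calibration sequences are and how many hit the truncation limit. The helper below is a hypothetical sketch (`summarize_lengths` is not part of any library). It runs on toy token lists here; in the notebook it could be called as `summarize_lengths(calibration_ds["input_ids"], MAX_SEQUENCE_LENGTH)`.

```python
from statistics import median


def summarize_lengths(token_lists, max_length):
    """Summarize sequence lengths and count sequences that reach max_length."""
    lengths = [len(ids) for ids in token_lists]
    if not lengths:
        raise ValueError("token_lists is empty")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
        # Sequences at the limit were most likely truncated by the tokenizer.
        "at_max_length": sum(1 for n in lengths if n >= max_length),
    }


# Toy stand-in for calibration_ds["input_ids"]
toy_input_ids = [[1, 2, 3], [1] * 8, [1] * 5]
print(summarize_lengths(toy_input_ids, max_length=8))
```

If most samples sit at the limit, the calibration activations come mostly from truncated conversations; whether that matters for a given model is an empirical question.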


## 3.2 Quantize to 4-bit (W4A16) with AWQ
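AWQ uses activation statistics from the calibration data to scale up salient weight channels before the weights are quantized, and applies the inverse scale on the activation side so the layer output is unchanged. When no `mappings` are passed, llm-compressor infers them from the model, which works for Llama-family layers.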
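
To make the idea concrete, here is a toy, pure-Python sketch of activation-aware scaling followed by asymmetric 4-bit quantization. It is not llm-compressor's implementation: the fixed exponent `alpha=0.5`, the per-row quantization, and all helper names are assumptions chosen for illustration. The actual AWQ method searches for the scaling exponent per layer and typically quantizes in groups.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def channel_scales(activations, alpha=0.5):
    """Per-input-channel scale: (mean |activation|) ** alpha."""
    n, dim = len(activations), len(activations[0])
    means = [sum(abs(a[j]) for a in activations) / n for j in range(dim)]
    return [max(m, 1e-8) ** alpha for m in means]


def quantize_row_asym(row, bits=4):
    """Asymmetric uniform quantization; returns (codes, scale, zero_point)."""
    qmax = 2 ** bits - 1
    lo, hi = min(min(row), 0.0), max(max(row), 0.0)  # range always contains 0
    scale = (hi - lo) / qmax or 1.0
    zero = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in row]
    return codes, scale, zero


def dequantize_row(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(q - zero) * scale for q in codes]


W = [[0.2, -0.5, 0.1, 0.9], [-0.3, 0.4, 0.05, -0.7]]
calib = [[0.1, 8.0, 0.2, 0.3], [0.2, 6.0, 0.1, 0.4]]  # channel 1 is salient
x = calib[0]

s = channel_scales(calib)
W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]  # scale weights up
x_scaled = [v / sj for v, sj in zip(x, s)]                   # inverse on activations

reference = matvec(W, x)
rescaled = matvec(W_scaled, x_scaled)  # same result up to float rounding

W_deq = [dequantize_row(*quantize_row_asym(row)) for row in W_scaled]
quantized = matvec(W_deq, x_scaled)
print("reference:", reference)
print("rescaled: ", rescaled)
print("quantized:", quantized)
```

The rescaled output matches the reference because the scale and its inverse cancel; only the rounding step introduces error.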


In [14]:
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
        duo_scaling=True,
    )
]

oneshot(
    model=model,
    dataset=calibration_ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(QUANTIZED_DIR, save_compressed=True)
tokenizer.save_pretrained(QUANTIZED_DIR)
print(f"Quantized model saved to {QUANTIZED_DIR}")


2026-01-14T19:38:01.976305+0000 | reset | INFO - Compression lifecycle reset
2026-01-14T19:38:01.980036+0000 | from_modifiers | INFO - Creating recipe from modifiers
2026-01-14T19:38:02.032368+0000 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...


Resolving mapping 1/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1144.50it/s]
Resolving mapping 2/4 (15 skipped): 100%|██████████| 16/16 [00:00<00:00, 2249.41it/s]
Resolving mapping 3/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1426.09it/s]
Resolving mapping 4/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 2396.49it/s]

2026-01-14T19:38:02.092476+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2026-01-14T19:38:02.093107+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`



Preparing cache: 100%|██████████| 256/256 [00:00<00:00, 1047.98it/s]
(1/17): Calibrating: 100%|██████████| 256/256 [00:02<00:00, 115.18it/s]
Smoothing: 100%|██████████| 3/3 [00:19<00:00,  6.49s/it]
(1/17): Propagating: 100%|██████████| 256/256 [00:01<00:00, 199.79it/s]
(2/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 143.31it/s]
Smoothing: 100%|██████████| 3/3 [00:19<00:00,  6.54s/it]
(2/17): Propagating: 100%|██████████| 256/256 [00:01<00:00, 237.13it/s]
(3/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 139.92it/s]
Smoothing: 100%|██████████| 3/3 [00:19<00:00,  6.58s/it]
(3/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 287.04it/s]
(4/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 140.06it/s]
Smoothing: 100%|██████████| 3/3 [00:19<00:00,  6.62s/it]
(4/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 294.24it/s]
(5/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 140.62it/s]
Smoothing: 100%|██████████| 3/3 [00:19<00:00,  6.66s/it]

2026-01-14T19:44:24.528673+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers





2026-01-14T19:44:24.996605+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 215it [00:03, 56.32it/s]


Quantized model saved to llama-awq-w4a16
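
A quick way to confirm that compression took effect is to compare weight sizes on disk. The sketch below uses a hypothetical helper, `dir_size_bytes`, and assumes the weights are stored as `*.safetensors` files. The same call on `BASE_MODEL_PATH` would give the base model's size for comparison. Restricting the pattern matters because a Hub snapshot may contain files other than the weights that get loaded.

```python
from pathlib import Path


def dir_size_bytes(root, pattern="*"):
    """Sum the sizes of regular files under root that match pattern.

    Symlinks are followed, so Hugging Face cache snapshots are measured
    by the size of the blobs they point to.
    """
    return sum(p.stat().st_size for p in Path(root).rglob(pattern) if p.is_file())


quantized_dir = Path("llama-awq-w4a16")  # same directory as QUANTIZED_DIR
if quantized_dir.exists():
    size_mb = dir_size_bytes(quantized_dir, "*.safetensors") / 1e6
    print(f"{quantized_dir}: {size_mb:.1f} MB of safetensors weights")
else:
    print(f"{quantized_dir} not found; run the quantization cell first.")
```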


## 4. Serve the quantized model


Run this in a separate terminal so the notebook can continue:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/serve_quantized.py \
  --port 8000 --api-key dummy --quantization none \
  --gpu-memory-utilization 0.7 --max-model-len 2048 --max-num-seqs 32
```
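
The server can take a while to load the model, so it may be worth polling until it responds before running the smoke test. This sketch assumes `scripts/serve_quantized.py` exposes vLLM's OpenAI-compatible `/v1/models` endpoint and accepts the API key as a bearer token. `fetch_models` and `wait_for_server` are hypothetical helper names.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_models(base_url, api_key, timeout=5.0):
    """GET {base_url}/models and return the decoded JSON payload."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def wait_for_server(base_url, api_key="dummy", attempts=60, delay=5.0, fetch=fetch_models):
    """Poll until the server answers; return the list of served model names."""
    for _ in range(attempts):
        try:
            payload = fetch(base_url, api_key)
            return [model["id"] for model in payload.get("data", [])]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            time.sleep(delay)  # server not up yet; retry after a pause
    raise TimeoutError(f"Server at {base_url} did not become ready")


# Usage from the notebook once the server has been started:
# print(wait_for_server("http://localhost:8000/v1"))
```

The returned names are also the values the server accepts in the `model` field of a request, which helps if `--served-model-name` was changed.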


## 4.5 Smoke test (streaming hello world)


In [None]:
from openai import OpenAI

QUANT_BASE_URL = "http://localhost:8000/v1"
QUANT_SERVED_MODEL = "llama-awq-w4a16"  # update if you used --served-model-name

client = OpenAI(base_url=QUANT_BASE_URL, api_key="dummy")
stream = client.chat.completions.create(
    model=QUANT_SERVED_MODEL,
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Hello world in one short sentence."},
    ],
    max_tokens=32,
    stream=True,
)
for event in stream:
    delta = event.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()


## 5. Run 1000 jobs on the quantized model


In [None]:
import sys

!{sys.executable} scripts/run_batch.py \
  --base-url http://localhost:8000/v1 \
  --model llama-awq-w4a16 \
  --task xsum --tokenizer llama-awq-w4a16 --max-context-tokens 2048 --context-buffer 128 \
  --output results/awq.jsonl --max-samples 1000


## 6. Serve the non-quantized (base) model


In [None]:
base_cmd = (
    "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\\n"
    "python scripts/serve_unquantized.py \\\n"
    f"  --model {BASE_MODEL_PATH} \\\n"
    "  --port 8001 --api-key dummy \\\n"
    "  --gpu-memory-utilization 0.7 --max-model-len 2048 --max-num-seqs 16 \\\n"
    "  --served-model-name base-llama"
)
print(base_cmd)


Stop the quantized server first, then run one of the commands below:

```bash
# Use the local snapshot to avoid extra HF downloads
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/serve_unquantized.py \
  --model /path/to/local/snapshot \
  --port 8001 --api-key dummy \
  --gpu-memory-utilization 0.7 --max-model-len 2048 --max-num-seqs 16 \
  --served-model-name base-llama

# Or pull from HF (requires access + token)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/serve_unquantized.py \
  --port 8001 --api-key dummy \
  --gpu-memory-utilization 0.7 --max-model-len 2048 --max-num-seqs 16 \
  --served-model-name base-llama
```


## 6.5 Smoke test (streaming hello world)


In [None]:
from openai import OpenAI

BASE_BASE_URL = "http://localhost:8001/v1"
BASE_SERVED_MODEL = "base-llama"

client = OpenAI(base_url=BASE_BASE_URL, api_key="dummy")
stream = client.chat.completions.create(
    model=BASE_SERVED_MODEL,
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Hello world in one short sentence."},
    ],
    max_tokens=32,
    stream=True,
)
for event in stream:
    delta = event.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()


## 7. Run 1000 jobs on the base model


In [None]:
import sys

!{sys.executable} scripts/run_batch.py \
  --base-url http://localhost:8001/v1 \
  --model base-llama \
  --task xsum --tokenizer meta-llama/Llama-3.2-1B-Instruct --max-context-tokens 2048 --context-buffer 128 \
  --output results/base.jsonl --max-samples 1000


## 8. Measure accuracy (ROUGE-L)


In [2]:
from pathlib import Path
import json
from statistics import mean
from rouge_score import rouge_scorer

def load_rows(path: Path):
    if not path.exists():
        raise FileNotFoundError(f"Missing results file: {path}")
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def rouge_l(rows):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(r["reference"], r["prediction"])["rougeL"].fmeasure
        for r in rows
    ]
    return mean(scores)

awq_rows = load_rows(Path("results/awq.jsonl"))
base_rows = load_rows(Path("results/base.jsonl"))

awq_score = rouge_l(awq_rows)
base_score = rouge_l(base_rows)

print(f"AWQ Rouge-L:  {awq_score:.4f} ({len(awq_rows)} samples)")
print(f"Base Rouge-L: {base_score:.4f} ({len(base_rows)} samples)")
print(f"Delta (AWQ - Base): {awq_score - base_score:+.4f}")


AWQ Rouge-L:  0.1552 (1000 samples)
Base Rouge-L: 0.1513 (1000 samples)
Delta (AWQ - Base): +0.0039
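
A delta this small is hard to interpret from the means alone. One option is a paired bootstrap over per-sample score differences, sketched below with toy numbers. It assumes per-sample ROUGE-L scores are available as two lists aligned by example (the `rouge_l` function above returns only the mean, so it would need to expose the list). `paired_bootstrap_ci` is a hypothetical helper.

```python
import random


def paired_bootstrap_ci(a_scores, b_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean difference, lower, upper) for a percentile bootstrap CI."""
    if len(a_scores) != len(b_scores) or not a_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(a_scores, b_scores)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy per-sample scores, not results from this notebook
awq_toy = [0.15, 0.22, 0.10, 0.18, 0.12, 0.20]
base_toy = [0.14, 0.23, 0.09, 0.17, 0.13, 0.19]
delta, lower, upper = paired_bootstrap_ci(awq_toy, base_toy)
print(f"Delta: {delta:+.4f} (95% CI {lower:+.4f} to {upper:+.4f})")
```

If the interval contains zero, the two models are not distinguishable on this metric at that sample size.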
