In [1]:
import os
import time
import math
import shutil
import torch
import numpy as np
import evaluate
from datasets import load_dataset, DownloadConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,
    EvalPrediction,
)
from optimum.intel import INCTrainer, INCModelForCausalLM
from neural_compressor import QuantizationAwareTrainingConfig


# Quantization-Aware Training with Intel Neural Compressor

This notebook demonstrates how to implement Quantization-Aware Training (QAT) using Intel's Neural Compressor framework.
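
As a rough illustration of the idea behind QAT (not of Neural Compressor's internals), the forward pass rounds values to an integer grid and maps them back to floats, so training is exposed to the rounding error. The sketch below assumes symmetric per-tensor int8 quantization; `fake_quantize` is a hypothetical helper and does not appear elsewhere in this notebook.

```python
def fake_quantize(values, num_bits=8):
    """Quantize-dequantize a list of floats (symmetric, per-tensor).

    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                    # snap to the integer grid
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        out.append(q * scale)                   # map back to float
    return out, scale


weights = [0.5, -1.27, 0.003, 1.0]
dequantized, scale = fake_quantize(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(scale, dequantized, max(errors))
```

Each value moves by at most half a quantization step (`scale / 2`), and values much smaller than one step, such as `0.003` here, collapse to zero.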

## Required Dependencies
Setting up essential libraries for:
- Core Python utilities: os, time, math, shutil
- Deep Learning: PyTorch
- Numerical operations: NumPy
- Model & Data handling: 
  - Hugging Face Transformers
  - Datasets library
  - Evaluation metrics
- Intel optimization tools:
  - Neural Compressor
  - Optimum Intel integration

In [2]:
# ─── SETTINGS ────────────────────────────────────────────────────────────────
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
PROMPT = (
    "Over the next decade, sustainable energy solutions will revolutionize "
    "global power grids, reducing carbon footprints and fostering resilient "
    "communities through innovative storage and distribution technologies."
)
PERP_TEXT = (
    "Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast "
    "to the natural intelligence displayed by humans and animals. Leading AI textbooks "
    "define the field as the study of intelligent agents: any system that perceives "
    "its environment and takes actions that maximize its chance of achieving its goals."
)
MAX_NEW_TOKENS = 50
OUTPUT_DIR = "qat_tinyllama"

# Global Configuration Settings

Defines the constants used throughout the QAT experiment:

1. Model Configuration:
   - MODEL_NAME: the TinyLlama-1.1B-Chat-v1.0 checkpoint
   - Chosen for its balance of size and capability

2. Test Data:
   - PROMPT: a sentence about sustainable energy
     * Input for the latency and throughput benchmark
   
   - PERP_TEXT: a short definition of AI
     * Input for the single-passage perplexity check

3. Generation and Output:
   - MAX_NEW_TOKENS: number of new tokens generated in the benchmark (50)
   - OUTPUT_DIR: directory where the trained model is saved

In [3]:
def load_model_and_tokenizer(model_id: str):
    # 1. Load the model
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # (Optional) save memory during training
    model.gradient_checkpointing_enable()

    # 2. Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    # 3. Make sure we have a pad token
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

    return model, tokenizer


# Model and Tokenizer Initialization

Defines a function that loads the model and tokenizer and aligns their padding settings:

1. Model Loading:
   - Loads the checkpoint with AutoModelForCausalLM
   - Enables gradient checkpointing, which trades extra compute for lower activation memory during training

2. Tokenizer Setup:
   - Uses the fast tokenizer implementation
   - Configures padding:
     * Sets the EOS token as the padding token (unconditionally, overriding any existing pad token)
     * Copies the pad token id into the model config
   
3. Notes:
   - Because padding and EOS share one id, any code that ignores padding by id also ignores genuine EOS tokens
   - The recorded run shows `use_cache` being switched off automatically because it is incompatible with gradient checkpointing

In [4]:
def prepare_datasets(tokenizer, block_size=8, train_size=1000, eval_size=500):
    # 1) Load the raw Wikitext-2 dataset
    raw = load_dataset("wikitext", "wikitext-2-raw-v1")

    # 2) Tokenization + label prep
    def _tokenize(examples):
        out = tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=block_size,
        )
        out["labels"] = out["input_ids"].copy()
        return out

    tokenized = raw.map(
        _tokenize,
        batched=True,
        remove_columns=["text"],
    )

    # 3) Select the subsets
    train_ds = tokenized["train"].select(range(train_size))
    eval_ds  = tokenized["validation"].select(range(eval_size))

    return train_ds, eval_ds


# Dataset Preparation for QAT

Builds the training and evaluation sets:

1. Data Loading:
   - Loads the raw Wikitext-2 dataset (`wikitext-2-raw-v1`)
   - Relies on the Datasets library for download and caching

2. Text Processing:
   - Tokenizes in batches:
     * Truncates each line to `block_size` tokens (default: 8)
     * Pads shorter lines to `block_size` with the pad token
     * Copies `input_ids` to `labels`, so padded positions are included in the labels
   - Drops the raw `text` column after tokenization

3. Dataset Management:
   - Takes the first 1000 training examples by default
   - Takes the first 500 validation examples by default

4. Caveats:
   - Each line is tokenized independently, so blank lines in Wikitext-2 become examples that are almost entirely padding
   - A block size of 8 keeps the run short but gives the model very little context per example
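
The effect of `truncation=True` with `padding="max_length"` can be sketched on plain lists of token ids. This is an illustration only: `pad_or_truncate` is a hypothetical helper, the pad id of 2 is an assumed value, and right-side padding is an assumption (the real tokenizer's `padding_side` may differ).

```python
def pad_or_truncate(token_ids, block_size, pad_id):
    """Mimic truncation plus max-length padding for one example.

    Hypothetical helper; assumes padding is appended on the right.
    """
    ids = token_ids[:block_size]                 # truncate long inputs
    n_pad = block_size - len(ids)                # how many pad tokens are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


PAD_ID = 2  # assumed id for illustration

short = pad_or_truncate([1, 15, 27], block_size=8, pad_id=PAD_ID)
long = pad_or_truncate(list(range(1, 13)), block_size=8, pad_id=PAD_ID)

# Labels are a plain copy of input_ids, so pad positions appear in them too
short["labels"] = list(short["input_ids"])

print(short)
print(long["input_ids"])
```

In the short example five of the eight label positions are padding, which is why the evaluation metric later excludes the pad id.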

In [None]:
def create_inc_trainer(model, tokenizer, train_ds, quant_config, output_dir):
    # 1) Define the Hugging Face TrainingArguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        fp16=True,
        optim="adamw_bnb_8bit",
        eval_strategy="no",
        save_strategy="no", 
        logging_steps=50,
        save_total_limit=1,
    )

    # 2) Instantiate the INCTrainer
    trainer = INCTrainer(
        model=model,
        quantization_config=quant_config,
        args=training_args,
        train_dataset=train_ds,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    return trainer

# Intel Neural Compressor Trainer Setup

Creates the trainer used for Quantization-Aware Training:

1. Training Configuration:
   - Sets up Hugging Face TrainingArguments:
     * Single epoch
     * Batch size 1 with 4 gradient accumulation steps (effective batch size 4 per device)
     * Mixed precision (`fp16=True`), which requires a supported accelerator such as a CUDA GPU
     * 8-bit AdamW (`adamw_bnb_8bit`), which requires the bitsandbytes package
     * Logging every 50 steps
     * No evaluation or checkpointing during training; `save_total_limit=1` has no effect while `save_strategy="no"`

2. QAT Trainer Initialization:
   - Passes the quantization configuration to INCTrainer
   - Attaches the training dataset and tokenizer
   - Uses `default_data_collator`, which works because every example is already padded to the same length
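
The batch settings determine how many optimizer steps one epoch takes. The arithmetic below is a sketch that assumes a single device and the default `train_size=1000`; the variable names are illustrative.

```python
import math

train_examples = 1000      # default train_size in prepare_datasets
per_device_batch = 1       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_devices = 1            # assumption: single-device run
logging_steps = 50

effective_batch = per_device_batch * grad_accum * num_devices
batches_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices))
optimizer_steps = batches_per_epoch // grad_accum
logged_steps = list(range(logging_steps, optimizer_steps + 1, logging_steps))

print(effective_batch, optimizer_steps, logged_steps)
```

Under these assumptions the epoch has 250 optimizer steps, which agrees with the last logged step in the recorded training output.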

In [6]:
def make_compute_ppl_fn(pad_token_id: int):
    """
    Returns a compute_metrics function that knows the pad_token_id.
    """
    def compute_ppl(pred: EvalPrediction):
        # 1) Unpack
        logits = pred.predictions         # np array (batch, seq_len, vocab_size)
        labels = pred.label_ids           # np array (batch, seq_len)

        # 2) Shift so each token predicts the next one
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]

        # 3) Flatten
        flat_logits = shift_logits.reshape(-1, shift_logits.shape[-1])
        flat_labels = shift_labels.reshape(-1)

        # 4) To torch
        logits_t = torch.from_numpy(flat_logits)
        labels_t = torch.from_numpy(flat_labels)

        # 5) CE loss ignoring pad_token_id
        loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)
        loss = loss_fct(logits_t, labels_t)

        # 6) Return perplexity
        return {"eval_perplexity": torch.exp(loss).item()}
    return compute_ppl

# Perplexity Calculation Function Generator

Returns a `compute_metrics` function that closes over the pad token id:

1. Prediction Processing:
   - Unpacks logits and labels from the EvalPrediction
   - Shifts them so the logits at position t are scored against the label at position t+1
   - Flattens both to one row per token

2. Perplexity Computation:
   - Converts the NumPy arrays to PyTorch tensors
   - Computes the mean cross-entropy with `ignore_index=pad_token_id`
   - Returns the exponential of that loss

3. Notes:
   - The pad token is also the EOS token here, so genuine EOS targets are skipped as well
   - The computation runs on CPU over the logits of the whole evaluation set, so memory grows with examples × sequence length × vocabulary size
   - The result is returned under the key `eval_perplexity`, which is the key read later in `start()`
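
The same shift, mask, and exponentiate steps can be written without PyTorch for a single sequence. This is a minimal sketch for checking the logic by hand; the function name and the toy inputs are illustrative, not part of the notebook.

```python
import math


def perplexity(logits, labels, ignore_index):
    """Perplexity of one sequence.

    logits: list of per-position score lists (seq_len x vocab_size)
    labels: list of token ids (seq_len)
    """
    shift_logits = logits[:-1]       # position t predicts token t+1
    shift_labels = labels[1:]
    total_nll, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:   # skip padded targets
            continue
        m = max(scores)              # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total_nll += log_z - scores[target]
        count += 1
    if count == 0:
        return float("nan")          # nothing to score
    return math.exp(total_nll / count)


# Uniform scores over a 4-token vocabulary: every target has probability 1/4
uniform = [[0.0] * 4 for _ in range(5)]
ppl = perplexity(uniform, labels=[1, 2, 3, 0, 0], ignore_index=0)
print(ppl)
```

With uniform scores the perplexity equals the vocabulary size, which is a convenient sanity check for any implementation of this metric.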

In [7]:
def run_qat_and_evaluate(trainer, eval_ds, tokenizer):
    # 1) Train with QAT
    trainer.train()

    # 2) Small slice for eval
    small_eval = eval_ds.select(range(min(len(eval_ds), 100)))

    # 3) Build a ppl_fn that closes over pad_token_id
    ppl_fn = make_compute_ppl_fn(tokenizer.pad_token_id)

    # 4) Set up evaluation INCTrainer
    eval_args = TrainingArguments(
        output_dir=trainer.args.output_dir,
        per_device_eval_batch_size=1,
        fp16=True,
        eval_strategy="no",
        save_strategy="no",
        logging_steps=50,
    )
    eval_trainer = INCTrainer(
        model=trainer.model,
        args=eval_args,
        eval_dataset=small_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=ppl_fn,  # << use the closure here
    )

    # 5) Evaluate and return metrics
    return eval_trainer.evaluate()

# QAT Training and Evaluation Pipeline

Trains the model and then measures perplexity on a slice of the validation set:

1. Training Phase:
   - Calls `trainer.train()` to run QAT

2. Evaluation Preparation:
   - Keeps at most the first 100 evaluation examples
   - Builds the perplexity function from the tokenizer's pad token id

3. Evaluation Configuration:
   - Creates a second INCTrainer around the trained model:
     * Evaluation batch size of 1
     * FP16 precision
     * `compute_metrics` set to the perplexity function

4. Execution:
   - Runs `evaluate()` and returns the metrics dictionary, including `eval_perplexity`

In [8]:
def save_and_load_qat_model(trainer, output_dir):
    # 1) Save to the specified directory
    trainer.save_model(output_dir)

    # 2) Load it back as an INCModelForCausalLM
    loaded_model = INCModelForCausalLM.from_pretrained(output_dir)

    return loaded_model

# Model Persistence Functions

Saves the trained model and loads it back for inference:

1. Model Saving:
   - Calls `trainer.save_model(output_dir)` on the INCTrainer

2. Model Loading:
   - Reloads the saved directory with `INCModelForCausalLM.from_pretrained`
   - Returns the loaded model for benchmarking

In [9]:
def measure_latency_and_throughput(model, tokenizer, prompt: str, device, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_len = inputs["input_ids"].size(1)

    # warm-up
    _ = model.generate(**inputs, max_new_tokens=5)
    if device.type == "cuda":
        torch.cuda.synchronize()

    # timed generation (perf_counter is a monotonic, high-resolution clock)
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.perf_counter()

    latency = end - start
    gen_tokens = outputs.size(1) - input_len
    return latency, gen_tokens / latency

def measure_peak_mem_and_perplexity(model, tokenizer, text: str, device):
    inputs = tokenizer(text, return_tensors="pt").to(device)

    # CUDA memory statistics only exist on GPU; skip them on CPU
    use_cuda = device.type == "cuda"
    if use_cuda:
        torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])

    peak_mem_mib = None  # stays None when no GPU is used
    if use_cuda:
        torch.cuda.synchronize()
        peak_mem_mib = torch.cuda.max_memory_allocated(device) / 1024**2
    perplexity = math.exp(out.loss.item())
    return peak_mem_mib, perplexity

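# Performance Measurement Functions

Collects the benchmark numbers reported at the end of the run:

1. Latency and Throughput Measurement:
   - Tokenizes the prompt and moves it to the target device
   - Runs a short warm-up generation (5 tokens) before timing
   - Times a single `generate()` call
   - Synchronizes the GPU before stopping the timer so queued kernels are included
   - Reports throughput as generated tokens divided by latency
   
2. Memory and Perplexity Analysis:
   - Resets and reads PyTorch's peak allocated GPU memory around one forward pass
   - Converts bytes to MiB (division by `1024**2`)
   - Computes perplexity as the exponential of the model's loss on the passage
   
3. Notes:
   - Latency comes from one timed run, so it is a single sample rather than an average
   - Peak memory is the highest total allocation during the forward pass on `PERP_TEXT`, which includes tensors already resident on the GPU (such as model weights)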
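
The warm-up-then-time pattern generalizes to any callable. The sketch below averages several runs instead of timing one; `time_call` and `tokens_per_second` are hypothetical helpers, and the workload is a stand-in for `model.generate`. On a GPU the callable would also need to synchronize before returning, as the benchmark function above does.

```python
import time


def time_call(fn, warmup=1, repeats=3):
    """Mean wall-clock seconds of fn() after some untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def tokens_per_second(new_tokens, seconds):
    """Throughput as generated tokens divided by elapsed time."""
    return new_tokens / seconds


# Stand-in workload; in the notebook this would wrap model.generate(...)
mean_s = time_call(lambda: sum(range(100_000)))

# Using the figures from the recorded run: 50 new tokens in 1.084 s
throughput = tokens_per_second(50, 1.084)
print(f"{mean_s:.6f} s, {throughput:.1f} tokens/s")
```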

In [10]:
def start():
    # 1) Load model & tokenizer
    model, tokenizer = load_model_and_tokenizer(MODEL_NAME)

    # 2) Prepare datasets
    train_ds, eval_ds = prepare_datasets(tokenizer)

    # 3) QAT configuration
    quant_config = QuantizationAwareTrainingConfig()

    # 4) Create and run QAT trainer
    qat_trainer = create_inc_trainer(model, tokenizer, train_ds, quant_config, OUTPUT_DIR)
    metrics = run_qat_and_evaluate(qat_trainer, eval_ds, tokenizer)
    print(f"Final perplexity: {metrics['eval_perplexity']:.2f}")

    # 5) Save & load quantized model
    qat_model = save_and_load_qat_model(qat_trainer, OUTPUT_DIR)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    qat_model.to(device)

    # 6) Benchmarks
    latency, throughput = measure_latency_and_throughput(
        qat_model, tokenizer, PROMPT, device, max_new_tokens=MAX_NEW_TOKENS
    )
    print(f"Latency     : {latency:.3f} s")
    print(f"Throughput  : {throughput:.1f} tokens/s")

    peak_mem, ppl = measure_peak_mem_and_perplexity(qat_model, tokenizer, PERP_TEXT, device)
    if peak_mem is not None:  # only meaningful when a GPU was used
        print(f"Peak GPU memory     : {peak_mem:.1f} MiB")
    print(f"Next-token perplexity: {ppl:.3f}")
    size_mb = sum(os.path.getsize(os.path.join(OUTPUT_DIR, f)) for f in os.listdir(OUTPUT_DIR)) / 1024**2

    print(f"Quantized model size : {size_mb:.2f} MB")
    print(f"Average latency      : {latency*1000:.1f} ms")
    print(f"Throughput           : {throughput:.1f} tokens/sec")
    if peak_mem is not None:
        print(f"Peak GPU memory      : {peak_mem:.1f} MB")


# Main Execution Function

Orchestrates the complete QAT workflow:

1. Initialization (Steps 1-2):
   - Loads model and tokenizer
   - Prepares training/evaluation datasets
   
2. QAT Setup and Training (Steps 3-4):
   - Creates a QuantizationAwareTrainingConfig with default settings
   - Creates and runs the QAT trainer
   - Evaluates on the validation slice and prints the perplexity

3. Model Management (Step 5):
   - Saves the trained model
   - Reloads it for inference
   - Moves it to the GPU when available, otherwise the CPU

4. Performance Analysis (Step 6):
   - Measures latency, throughput, and peak GPU memory
   - Computes perplexity on `PERP_TEXT`
   - Sums the sizes of the entries in `OUTPUT_DIR` (divided by `1024**2`, so the value labeled MB is in MiB)
   - Prints latency, throughput, and peak memory a second time in the final summary
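
The size calculation in `start()` lists only the top level of `OUTPUT_DIR`, and `os.path.getsize` on a subdirectory returns the size of the directory entry rather than of its contents. If the output directory ever contains nested folders, a recursive walk is one alternative. The sketch below uses a temporary directory so it can run on its own; `dir_size_mib` is a hypothetical helper.

```python
import os
import tempfile


def dir_size_mib(path):
    """Total size of all regular files under path, in MiB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024**2


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"\0" * 1024 * 1024)          # 1 MiB at the top level
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"\0" * 512 * 1024)           # 0.5 MiB in a subdirectory
    size = dir_size_mib(tmp)

print(f"{size:.2f} MiB")
```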

In [11]:
start()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
50,3.7871
100,2.3485
150,2.4155
200,2.1259
250,2.1398


Final perplexity: 49.10


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Latency     : 1.084 s
Throughput  : 46.1 tokens/s
Peak GPU memory     : 10592.2 MiB
Next-token perplexity: 10.600
Quantized model size : 4200.37 MB
Average latency      : 1084.0 ms
Throughput           : 46.1 tokens/sec
Peak GPU memory      : 10592.2 MB


# Execute QAT Pipeline

Runs the complete pipeline by calling `start()`:
- Loads the model, trains with QAT, and evaluates
- Saves and reloads the model
- Prints the benchmark results

The output above reports two perplexity values: 49.10 on the 100-example Wikitext-2 validation slice and 10.600 on the single `PERP_TEXT` passage. They are computed on different text and sequence lengths and are not directly comparable.
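
A quick sanity check on the reported directory size is to compare it with the expected weight size at different precisions. The parameter count below is an assumption taken from the "1.1B" in the model name, and the estimate ignores tokenizer and config files.

```python
PARAMS = 1.1e9  # assumption: approximate count implied by "TinyLlama-1.1B"
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Expected weight size in MiB for each storage precision
expected_mib = {
    name: PARAMS * nbytes / 1024**2 for name, nbytes in BYTES_PER_PARAM.items()
}

reported_mib = 4200.37  # "Quantized model size" from the recorded output
closest = min(expected_mib, key=lambda k: abs(expected_mib[k] - reported_mib))

for name, mib in expected_mib.items():
    print(f"{name}: {mib:.1f} MiB")
print("closest to reported size:", closest)
```

Under this assumption the saved directory is close to the FP32 estimate, so the recorded output on its own does not show a reduction in on-disk size; it may be worth inspecting the saved files to see how the weights are stored.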