## Compress the base model by using LLM Compressor

After you establish **accuracy and performance baselines** for the base model, you can compress it. This notebook uses quantization as the compression technique. Quantizing the model reduces its memory footprint, which results in a smaller model that is more efficient to deploy.

**Note**: This notebook uses **data-aware quantization**, which relies on representative calibration data to preserve model quality.

**Goal**: Reduce the model size while retaining its accuracy and inference performance.

**Key Actions**:

- Load the base model.

- Measure its size and memory usage.

- Use a calibration dataset (for example, WikiText or UltraChat) to collect activation statistics.

- Apply a quantization recipe (SmoothQuant + GPTQ modifier).

- Save the compressed model and verify the reduction in its size.

**Outcome**:

- The compressed model saved on disk.

- The model size is reduced, typically by about 50%, depending on the quantization scheme.
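
To make the idea of quantization concrete, the following minimal sketch rounds a few floating-point weights to 8-bit integers and converts them back. It uses plain symmetric per-tensor rounding, which is an illustrative assumption and much simpler than the SmoothQuant and GPTQ recipe applied later in this notebook. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization of a list of floats."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor maps the largest magnitude to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print(q_weights)
print(restored)
```

Each weight is now stored in one byte instead of two (for 16-bit weights), at the cost of a rounding error of at most half the scale per value.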

### Install dependencies

In [None]:
!pip install -qqU .
!pip install -qqU torch==2.9.0

**NOTE:** You can safely ignore the following error:

`ERROR: pip's dependency resolver does not currently take into account... llmcompressor 0.8.1 requires torch<=2.8.0,>=2.7.0, but you have torch 2.9.0 which is incompatible.`

In [None]:
import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils import model_size_gb, tokenize_for_calibration

In [None]:
# check available device
device = "cuda" if torch.cuda.is_available() else "cpu"
device

### Check GPU memory

To make sure that you have enough GPU memory to run this notebook:

1. In a terminal window, run the `nvidia-smi` command.

2. If other processes are using GPU memory that this notebook requires, run the `kill -9 <pid>` command for each process to stop it.
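
If you prefer to check GPU memory from Python instead of a terminal, the steps above can be sketched as follows. The helper names `parse_gpu_memory` and `gpu_memory` are hypothetical. The sketch assumes that `nvidia-smi` is on the `PATH` and reports numeric values for every GPU.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "free_mib": total - used})
    return rows


def gpu_memory():
    """Return per-GPU memory usage, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `gpu_memory()` before you load the model shows how much memory is free on each GPU. To find the process IDs that hold GPU memory, you still need the process table that `nvidia-smi` prints in a terminal.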

### Load the base model

To load the model in the data type specified in its configuration, call **from_pretrained** on the **AutoModelForCausalLM** class and set the **torch_dtype** parameter to **auto**. Otherwise, the weights are loaded in **full precision (fp32)**, which doubles the memory required for a model that is stored in a 16-bit format.

In [None]:
# set up variables
base_model_path = "../base_model/RedHatAI-Llama-3.1-8B-Instruct"
compressed_model_path = (
    "../compressed_model/RedHatAI-Llama-3.1-8B-Instruct-int8-dynamic"
)

In [None]:
# load the model and tokenizer from the local base model directory
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype="auto", device_map="auto"
)
print("Base model loaded")

In [None]:
# check model size
# !du -sh {base_model_path}
model_size = model_size_gb(model)
print(f"The size of the base model is: {model_size:.4f}GB")
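
The `model_size_gb` helper is imported from the notebook's `utils` module, and its implementation is not shown here. As a rough guide to what such a helper might compute, the following sketch sums the bytes held by a model's parameters and buffers. The name `model_size_gb_sketch` and the choice of 1024³ bytes per GB are assumptions, not the actual `utils` code.

```python
def model_size_gb_sketch(model):
    """Estimate in-memory model size from its parameters and buffers.

    Works with any object that exposes parameters() and buffers() returning
    tensors with numel() and element_size(), such as a torch.nn.Module.
    """
    tensors = list(model.parameters()) + list(model.buffers())
    total_bytes = sum(t.numel() * t.element_size() for t in tensors)
    # Assumption: report binary gigabytes (1024**3 bytes).
    return total_bytes / 1024**3
```

Under this assumption, a model with roughly 8 billion 16-bit parameters comes to about 15 GB, which is in line with the 14.9GB figure reported later in this notebook.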

### Prepare a calibration dataset

Because this example uses data-aware quantization to compress the base model, you must calibrate the model on a dataset of real or representative inputs. For this example, use the `wikitext-2-raw-v1` version of the WikiText dataset. It is a small, general-purpose dataset, which keeps processing fast. For more information about calibration datasets, see [Compress a Large Language Model by using LLM Compressor](../docs/Compression.md).

In [None]:
# Define the dataset to use for calibration
dataset_id = "wikitext"

# Specify the configuration / version of the dataset
config = "wikitext-2-raw-v1"  # Small version (~2M tokens), raw text format

# Set the number of calibration samples based on available device
# - On GPU: use more samples to get more accurate activation statistics
# - On CPU: reduce samples to prevent memory issues and keep demo fast
num_calibration_samples = 512 if device == "cuda" else 16

# Set the maximum sequence length for calibration
max_sequence_length = 1024 if device == "cuda" else 16

# Load the dataset using Hugging Face Datasets API
# This downloads train split of the dataset
ds = load_dataset(dataset_id, config, split="train")
# Shuffle and grab only the number of samples we need
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))

Inspect the dataset by extracting the contents of the columns to determine which column contains the content that you want to use. For the `wikitext-2-raw-v1` dataset in this example, the `text` column contains the content that is passed as input for calibrating the model:

In [None]:
# inspect the dataset
print(f"columns in the {dataset_id}: {ds.column_names}\n")
print(ds[0])
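
When you inspect the raw WikiText data, you might notice that some rows in the `text` column are empty or contain only a section heading such as ` = Title = `. Such rows contribute little to calibration. The following sketch shows one possible way to screen them out. The predicate name `is_usable` and the sample rows are made up for illustration.

```python
def is_usable(example, input_column="text"):
    """Keep rows that contain body text rather than blanks or headings."""
    text = example[input_column].strip()
    return bool(text) and not text.startswith("=")


# Made-up rows in the style of the raw WikiText format.
rows = [
    {"text": ""},
    {"text": " = Example Heading = \n"},
    {"text": " The game was released in 2011 and received positive reviews . \n"},
]

usable_rows = [row for row in rows if is_usable(row)]
print(len(usable_rows))
```

With a Hugging Face dataset, the same predicate can be passed to `ds.filter(is_usable)` before you call `select`, so that the selected calibration samples all contain text.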

### (Optional) Use a custom template for calibration

Use a **custom template** when you want the calibration text to closely mimic the input format that the model will encounter in production.  

For example, if your model is **instruction-following** or **chat-based**, by providing the template the model was originally trained on or the template that will be used during inference, you can ensure that the activation statistics collected during calibration reflect realistic usage patterns. 

By using a custom template for calibration, you can improve the accuracy of quantization and compression.

If your model can handle raw text and does not require a specific format, you can rely on the default template instead.

To specify a custom template, use the `tokenize_for_calibration` function with the `custom_template` argument. The `custom_template` argument accepts the following format:

```python
custom_template = {
    "template_text": "Instruction: {content}\nOutput:",
    "placeholder": "content",
}
```

In [None]:
# to get activations for the calibration dataset, we need to:
# 1. extract the samples from the dataset
# 2. tokenize samples in the dataset
input_column = "text"

# Call tokenize_for_calibration using dataset.map
tokenized_dataset = ds.map(
    lambda batch: tokenize_for_calibration(
        examples=batch,  # batch from Hugging Face dataset
        input_column=input_column,  # the column containing text to calibrate
        tokenizer=tokenizer,  # your Hugging Face tokenizer
        max_length=max_sequence_length,  # maximum sequence length
        model_type="chat",  # use chat template if no custom template
        custom_template=None,  # optional, provide a dict if you want a custom template
    ),
    batched=True,
)

In [None]:
tokenized_dataset
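
To illustrate how a custom template of the form described earlier could be turned into calibration text, the following sketch substitutes a sample into the placeholder. The implementation of `tokenize_for_calibration` is not shown in this notebook, so the `render_template` helper is a hypothetical stand-in that assumes the placeholder is filled with standard Python string formatting.

```python
def render_template(custom_template, content):
    """Insert content into template_text at the named placeholder."""
    placeholder = custom_template["placeholder"]
    return custom_template["template_text"].format(**{placeholder: content})


custom_template = {
    "template_text": "Instruction: {content}\nOutput:",
    "placeholder": "content",
}

prompt = render_template(custom_template, "Summarize the article.")
print(prompt)
```

The rendered string is what the tokenizer would then receive, so the activation statistics reflect the prompt format instead of bare text.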

### Quantize the base model to INT8

After preparing the dataset for calibration, define a recipe for quantization. For quantization scheme `W8A8-INT8`, use `SmoothQuantModifier` followed by `GPTQModifier`.

For details on the SmoothQuant and GPTQ algorithms, see [Compress a Large Language Model by using LLM Compressor](../docs/Compression.md).

**NOTE:** The process of running the next cell to compress the model might take a significant amount of time (approximately one hour), as the model is executed on a calibration dataset to collect and calibrate activation statistics prior to applying quantization.

In [None]:
# Define the quantization scheme
scheme = "W8A8"  # W8A8 means 8-bit weights and 8-bit activations

# Strength for SmoothQuant smoothing
# This controls how much the activation values are smoothed to reduce outliers
smoothing_strength = 0.8

# Create SmoothQuant modifier
# - smooths activations before quantization to improve stability and reduce degradation
smooth_quant = SmoothQuantModifier(smoothing_strength=smoothing_strength)

# Create GPTQ modifier
# - targets="Linear" quantizes all Linear layers (attention and feed-forward projections)
# - scheme=scheme uses the W8A8 quantization scheme
# - ignore=["lm_head"] preserves the LM head to avoid generation quality loss
quantizer = GPTQModifier(targets="Linear", scheme=scheme, ignore=["lm_head"])

# Combine the modifiers into a recipe list
# The order matters: first apply SmoothQuant, then GPTQ
recipe = [smooth_quant, quantizer]

# Perform quantization
oneshot(
    model=base_model_path,  # Model to quantize
    dataset=tokenized_dataset,  # Calibration dataset, used for both SmoothQuant & GPTQ
    recipe=recipe,  # List of quantization modifiers to apply
    output_dir=compressed_model_path,  # Directory to save the quantized model
    max_seq_length=max_sequence_length,  # Match the length used for tokenization
    num_calibration_samples=num_calibration_samples,  # Match the number of selected samples
)
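
To give an intuition for what the smoothing step does, the following sketch computes per-channel smoothing scales on a tiny example, using the formula from the SmoothQuant paper: `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. It assumes that `smoothing_strength` plays the role of `alpha` and that no channel is all zeros. The matrices and helper names are made up, and the code is not the LLM Compressor implementation.

```python
def smoothing_scales(activations, weights, alpha):
    """Compute one scale per input channel from activation and weight maxima."""
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in activations)
        w_max = max(abs(w) for w in weights[j])
        scales.append(act_max**alpha / w_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[0.5, 40.0], [-1.0, -64.0]]  # activations: channel 1 has large outliers
W = [[0.2, -0.4], [0.1, 0.3]]  # weights: one row per input channel

s = smoothing_scales(X, W, alpha=0.8)
X_smooth = [[x / sj for x, sj in zip(row, s)] for row in X]  # divide activations
W_smooth = [[w * sj for w in row] for row, sj in zip(W, s)]  # multiply weights

print(matmul(X, W))
print(matmul(X_smooth, W_smooth))
```

Both products are the same up to floating-point error, but the outlier channel in `X_smooth` has a much smaller range, which makes the activations easier to represent in 8 bits. The difficulty is shifted to the weights, which GPTQ then quantizes.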

### Check the size of the compressed model

In [None]:
# Load quantized model
model_quant = AutoModelForCausalLM.from_pretrained(compressed_model_path)
model_quant.config.dtype = model.config.torch_dtype
model_quant.save_pretrained(compressed_model_path)
model_size = model_size_gb(model_quant)
print(f"Model size (GB): {model_size}")

In this example, quantization reduces the model size from 14.9GB to 8GB.
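
As a quick check of the reduction, the following sketch converts the before and after sizes into a percentage. The function name is hypothetical, and the two sizes are the figures quoted above.

```python
def size_reduction_percent(before_gb, after_gb):
    """Return the size reduction as a percentage of the original size."""
    return (1 - after_gb / before_gb) * 100


reduction = size_reduction_percent(14.9, 8.0)
print(f"{reduction:.1f}%")  # prints 46.3%
```

The result is slightly below 50%. One plausible reason is that the recipe excludes `lm_head` and targets only `Linear` layers, so some tensors remain in their original precision.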

### Alternate compression technique: FP8 quantization

LLM Compressor also supports FP8 quantization. The `FP8_DYNAMIC` scheme does not require a calibration dataset. To quantize the model to FP8, uncomment and then run the following cell:

In [None]:
# from llmcompressor.modifiers.quantization import QuantizationModifier

# recipe = QuantizationModifier(
#   targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# oneshot(model=base_model_path, recipe=recipe, output_dir=compressed_model_path)

# # Load quantized model
# model_quant = AutoModelForCausalLM.from_pretrained(compressed_model_path)
# model_quant.config.dtype = model.config.torch_dtype
# model_quant.save_pretrained(compressed_model_path)
# model_size = model_size_gb(model_quant)
# print(f"Model size (GB): {model_size}")