# Quantization of `granite-3.3-2b-instruct` model

Recall that our overall solution uses the quantized version of the model `granite-3.3-2b-instruct`. In this lab, we take the base model `granite-3.3-2b-instruct` and quantize it to `W4A16`, a scheme that stores weights as 4-bit integers (INT4) and keeps activations in 16-bit floating point (BF16). This provides both memory savings (INT4 weights) and inference acceleration with `vLLM` (BF16 activations).
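
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes about 2.5 billion weights (an illustrative round figure, not a measured parameter count) and one 16-bit scale per group of 128 weights, and it ignores layers that stay unquantized, such as `lm_head`. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB.

    If group_size is given, one scale of `scale_bits` bits is stored per group.
    """
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1024**3


NUM_PARAMS = 2.5e9  # assumption: illustrative round number

bf16_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=16)
int4_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=4, group_size=128)

print(f"BF16 weights: {bf16_gib:.2f} GiB")
print(f"INT4 weights: {int4_gib:.2f} GiB")
print(f"Reduction:    {bf16_gib / int4_gib:.2f}x")
```

The reduction is a little under 4x because the per-group scales add a small overhead on top of the 4-bit codes.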

**Note**: `W4A16` computation is supported on Nvidia GPUs with compute capability 7.5 or higher (Turing, Ampere, Ada Lovelace, Hopper).

**Note**: The steps here take around 20-30 minutes, depending on connectivity. The most time-consuming steps are installing llmcompressor (up to 5 minutes) and the quantization step (10-15 minutes).

## Setting up llm-compressor

Installing `llmcompressor` may take a few minutes, depending on the available bandwidth. Note the version of the `transformers` library we will be using. There is a known issue (*torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow*) when the latest `transformers` release (version `4.53.2` as of July 17, 2025) is combined with llmcompressor version `0.6.0`, the latest at that time. The cell below installs llmcompressor `0.7.0`.

In [None]:
!pip install llmcompressor==0.7.0

Let's make sure the right versions are installed.

In [None]:
!pip list | grep llmcompressor

In [None]:
!pip list | grep transformer
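
The same check can be done from Python, which is handy if you want to assert on versions inside the notebook. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("llmcompressor", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```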

## Let's start with the quantization of the model

There are 6 steps:
1. Loading the model
2. Choosing the quantization scheme and method
3. Preparing the calibration data
4. Applying quantization
5. Saving the model
6. Evaluation of accuracy in vLLM

### Loading the model

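First, let's download the model from the local S3 storage and then load it with `AutoModelForCausalLM`, which also handles saving and loading of quantized models. The S3 download cells below are commented out; uncomment them if the model is available in your S3 bucket.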

In [None]:
# import os
# from boto3 import client

# def download_model_from_s3(s3_path, local_model_dir):
#     print('Starting download of model from S3')
    
#     # Get S3 credentials from environment
#     s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
#     s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
#     s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
#     s3_bucket_name = os.environ["AWS_S3_BUCKET"]

#     print(f'Downloading model from bucket {s3_bucket_name} '
#           f'path {s3_path} from S3 storage at {s3_endpoint_url}')
#     print(f'Target local directory: {local_model_dir}')

#     s3_client = client(
#         's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
#         aws_secret_access_key=s3_secret_key, verify=False
#     )

#     os.makedirs(local_model_dir, exist_ok=True)

#     print(f'Listing objects with prefix: {s3_path}')
#     paginator = s3_client.get_paginator('list_objects_v2')
#     pages = paginator.paginate(Bucket=s3_bucket_name, Prefix=s3_path)

#     downloaded_files = []
#     total_objects = 0
    
#     for page in pages:
#         if 'Contents' in page:
#             for obj in page['Contents']:
#                 total_objects += 1
#                 s3_key = obj['Key']
#                 print(f'Found S3 object: {s3_key}')
                
#                 # Skip if it's just a directory marker
#                 if s3_key.endswith('/'):
#                     print(f'Skipping directory marker: {s3_key}')
#                     continue
                
#                 relative_path = s3_key[len(s3_path):].lstrip('/')
#                 local_file_path = os.path.join(local_model_dir, relative_path)
                
#                 print(f'S3 key: {s3_key}')
#                 print(f'Local file path: {local_file_path}')
                
#                 local_dir = os.path.dirname(local_file_path)
#                 if local_dir:
#                     os.makedirs(local_dir, exist_ok=True)
#                     print(f'Created directory: {local_dir}')
                
#                 try:
#                     s3_client.download_file(s3_bucket_name, s3_key, local_file_path)
#                     downloaded_files.append(local_file_path)
#                     print(f'Successfully downloaded {s3_key} to {local_file_path}')
#                 except Exception as e:
#                     print(f'Error downloading {s3_key}: {e}')
#                     raise

#     print(f'Total objects found: {total_objects}')
#     print(f'Files downloaded: {len(downloaded_files)}')
    
#     config_path = os.path.join(local_model_dir, 'config.json')
#     if os.path.exists(config_path):
#         print(f'Verified: config.json found at {config_path}')
#     else:
#         print(f'Warning: config.json not found at {config_path}')
    
#     print('Finished downloading model from S3')
#     return local_model_dir

In [None]:
# current_dir = os.getcwd()
# BASE_MODEL_DIR = current_dir + "/granite-3.3-2b-instruct"
# S3_MODEL_PATH = "ibm-granite/granite-3.3-2b-instruct/"  

# model_path = download_model_from_s3(S3_MODEL_PATH, BASE_MODEL_DIR)

# # use the downloaded model path for tokenizer and model loading
# from transformers import AutoTokenizer, AutoModelForCausalLM

# tokenizer = AutoTokenizer.from_pretrained(model_path)
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, device_map="auto", torch_dtype="auto"
# )

Alternatively, the model can be loaded directly from Hugging Face as follows:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

### Prepare calibration data

Prepare the calibration data. When quantizing the weights of a model to INT4 using GPTQ, we need sample data to run the GPTQ algorithm. It is therefore very important to use calibration data that closely matches the type of data seen in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.

In our case, we are quantizing a generic instruction-tuned model, so we will use a general-purpose calibration dataset, `neuralmagic/LLM_compression_calibration`. Some best practices include:
- 512 samples is a good place to start (increase if accuracy drops). The cell below uses 512; lower it (for example to 256) to speed up the process.
- 2048 sequence length is a good place to start
- Use the chat template or instruction template that the model was trained with


In [None]:
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # 1024
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

# Load dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. For W4A16, we want to:
- Run SmoothQuant to move activation outliers into the weights
- Quantize the weights to 4 bits with GPTQ, using one scale per group of weights
- Leave the activations in 16-bit floating point (dynamic per-token activation quantization applies only to schemes that also quantize activations, such as W8A8)

**Note**: The quantization step takes a long time to complete because of the calibration requirements: around 10-15 minutes, depending on the GPU.

### Imports and definitions

**GPTQModifier**: Applies GPTQ, a post-training quantization algorithm that quantizes weights layer by layer while compensating for the rounding error.

**SmoothQuantModifier**: Prepares model activations for smoother quantization by scaling internal activations and weights.

**oneshot**: High-level API that applies your quantization recipe in one go.

In [None]:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

### Hyperparameters

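Rationale
- **DAMPENING_FRAC=0.1** adds a small fraction of the mean Hessian diagonal back onto the diagonal, which stabilizes the Hessian inverse that GPTQ uses and prevents extreme weight updates.
- **OBSERVER="mse"** chooses quantization scales by minimizing the mean squared error between original and quantized values, rather than using the raw min/max range.
- **GROUP_SIZE=128** is the number of consecutive weights that share one quantization scale (group-wise quantization); 128 is a typical default.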

In [None]:
DAMPENING_FRAC = 0.1  # fraction of the mean Hessian diagonal added for numerical stability
OBSERVER = "mse"  # pick scales by minimizing mean squared quantization error (not plain min/max)
GROUP_SIZE = 128  # number of weights that share one quantization scale
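
To illustrate what the group size controls, the sketch below quantizes a toy weight row to INT4 with one scale per group. It uses plain symmetric round-to-nearest quantization, not GPTQ (which additionally compensates rounding error using Hessian information), and the function names are hypothetical.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric round-to-nearest INT4 quantization with one scale per group.

    Returns (codes, scales): integer codes in [-8, 7] and one float scale per group.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto the code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Reconstruct approximate weights from codes and per-group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


# Toy row: four small values followed by a group that contains an outlier.
row = [0.01, -0.02, 0.03, -0.04, 0.5, -0.1, 0.2, 4.0]

mse_by_group_size = {}
for group_size in (8, 4):
    codes, scales = quantize_groupwise_int4(row, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    mse = sum((a - b) ** 2 for a, b in zip(row, restored)) / len(row)
    mse_by_group_size[group_size] = mse
    print(f"group_size={group_size}: scales={scales}, mse={mse:.6f}")
```

With a single group, the outlier forces a coarse scale onto the small values; splitting the row into two groups lets the small values keep a finer scale, at the cost of storing one more scale.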

### Layer Mappings & Ignoring Heads

Logic

- **ignore=["lm_head"]** skips quantization on the output layer to preserve final logits and maintain accuracy.
- **mappings** link groups of linear projections (q, k, v, gate, up/down projections) to the layer that feeds them, such as a layernorm. SmoothQuant uses these pairs to shift activation scale from the preceding layer's output into the projection weights, which evens out the distribution to be quantized.
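
The `re:` prefix marks an entry as a regular expression that is matched against module names. The sketch below mimics that selection with the standard `re` module. The module names are assumed examples in the style of Hugging Face decoder layers, and `select_modules` is a hypothetical helper, not part of llmcompressor.

```python
import re

# Assumed example module names; real names come from model.named_modules().
module_names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.post_attention_layernorm",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]


def select_modules(pattern, names):
    """Return names matching a pattern; a 're:' prefix means regular expression."""
    if pattern.startswith("re:"):
        regex = re.compile(pattern[len("re:"):])
        return [name for name in names if regex.match(name)]
    return [name for name in names if name == pattern]


print(select_modules("re:.*q_proj", module_names))
print(select_modules("re:.*input_layernorm", module_names))
print(select_modules("lm_head", module_names))
```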

In [None]:
ignore = ["lm_head"]
mappings = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
    [["re:.*down_proj"], "re:.*up_proj"],
]

### Recipe Definition

**Workflow**

- **SmoothQuantModifier**: Re-scales activations across paired layers before quantization to reduce outliers (smoothing_strength=0.7, high smoothing but not extreme).
- **GPTQModifier**: Performs weight-only quantization (4-bit weights, 16-bit activations) on all `Linear` layers except those ignored, applying the dampening and observer settings defined above. The `W4A16` scheme reduces model size while maintaining decent accuracy.
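
The rescaling idea behind SmoothQuant can be shown on a toy layer. The sketch below follows the published SmoothQuant formula, with a per-input-channel scale of `max|x_j|**alpha / max|w_j|**(1 - alpha)` where `alpha` plays the role of `smoothing_strength`. It is a simplified pure-Python illustration with made-up numbers, not llmcompressor's implementation.

```python
ALPHA = 0.7  # plays the role of smoothing_strength

# Toy activations: 2 tokens x 3 input channels; channel 1 holds an outlier.
X = [[0.5, 40.0, -1.0],
     [-0.2, -60.0, 0.8]]
# Toy weights: 3 input channels x 2 output channels.
W = [[0.3, -0.1],
     [0.02, 0.01],
     [-0.4, 0.2]]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smoothing_scales(x, w, alpha):
    """Per-input-channel scale: max|x_j|**alpha / max|w_j|**(1 - alpha)."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        weight_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


s = smoothing_scales(X, W, ALPHA)
# Divide activations by the scale and multiply the matching weight rows by it.
X_smooth = [[row[j] / s[j] for j in range(len(s))] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(len(s))]

print("activation max before:", max(abs(v) for row in X for v in row))
print("activation max after: ", max(abs(v) for row in X_smooth for v in row))
print("output before:", matmul(X, W))
print("output after: ", matmul(X_smooth, W_smooth))
```

The layer output is unchanged up to floating-point rounding, while the activation outlier shrinks because part of its magnitude has moved into the weights.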

In [None]:
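# W4A16 recipe as described in this lab (kept commented out in this notebook):
# recipe = [
#     SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
#     GPTQModifier(
#         targets=["Linear"],
#         ignore=ignore,
#         scheme="W4A16",
#         dampening_frac=DAMPENING_FRAC,
#         observer=OBSERVER,
#     )
# ]

# NOTE: the active recipe below uses scheme="W8A8" (8-bit weights and activations),
# not W4A16. Use the commented recipe above instead if you want the W4A16 scheme
# that this lab describes.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"])
]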

# If the above fails, the last-resort fallback is to skip the failing layers:
# ignore.append("model.layers.31")
# ignore.append("model.layers.32")
# ignore.append("model.layers.33")
# ignore.append("model.layers.34")
# ignore.append("model.layers.35")
# ignore.append("model.layers.36")
# ignore.append("model.layers.37")
# ignore.append("model.layers.38")
# ignore.append("model.layers.39")
# ignore.append("model.layers.40")
# ignore.append("model.layers.41")

### Quantize in One Shot

**How It Works**

- Feeds dataset (calibration set) into your model to gather activation statistics.
- Applies SmoothQuant rescaling followed by GPTQ quantization in a sequential per-layer manner.
- **max_seq_length=2048** caps the length of each calibration sequence, in line with the recommended starting point for calibration.

In [None]:
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=2048,
)

### Save the Compressed Model

**Explanation**

- Naming: appends -W4A16 to distinguish the quantized checkpoint.
- **save_compressed=True** stores weights in compact safetensors format for deployment via vLLM.

In [None]:
# Save to disk compressed.
MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
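
A quick sanity check after saving is to look at the size of the output directory and at whether `config.json` records quantization settings. The sketch below assumes the checkpoint stores them under a `quantization_config` key, the usual Hugging Face convention; `summarize_checkpoint` is a hypothetical helper, and the directory name is assumed to match `SAVE_DIR` above.

```python
import json
import os


def summarize_checkpoint(model_dir):
    """Return (total size in bytes, quantization_config dict or None)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))

    quant_config = None
    config_path = os.path.join(model_dir, "config.json")
    if os.path.exists(config_path):
        with open(config_path, encoding="utf-8") as f:
            quant_config = json.load(f).get("quantization_config")
    return total_bytes, quant_config


# Assumed directory name, matching SAVE_DIR in the cell above.
checkpoint_dir = "granite-3.2-2b-instruct-W4A16"
if os.path.isdir(checkpoint_dir):
    size_bytes, quant_config = summarize_checkpoint(checkpoint_dir)
    print(f"On-disk size: {size_bytes / 1024**3:.2f} GiB")
    print(f"quantization_config present: {quant_config is not None}")
else:
    print(f"{checkpoint_dir} not found; run the save cell first.")
```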

### Evaluate accuracy in vLLM

We can evaluate accuracy with `lm_eval`.

#### Check GPU memory leftovers

In [None]:
!nvidia-smi

**IMPORTANT**: After quantizing the model the GPU memory may not be freed (see the above output). You need to **restart the kernel** before evaluating the model to ensure you have enough GPU RAM available.

#### Install lm_eval

In [None]:
!pip install -q lm_eval==v0.4.3

#### Install vLLM for evaluation

Install vLLM, which `lm_eval` uses as the inference backend:

In [None]:
!pip install -q vllm

### Evaluation Command

The evaluation command below runs the GSM8K task with these options:

- `--model vllm`: uses the vLLM backend for fast, memory-efficient inference on large models.
- `--model_args`: comma-separated arguments passed to the backend:
  - `pretrained=$MODEL_ID`: specifies which model to load.
  - `add_bos_token=true`: ensures a beginning-of-sequence token is added; required for consistent results on math and reasoning tasks.
  - `max_model_len=4096`: sets the context window the model uses for evaluation.
  - `gpu_memory_utilization=0.5`: limits vLLM to 50% of GPU memory, which helps avoid out-of-memory (OOM) errors.
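
Because `--model_args` is a single comma-separated `key=value` string, it can be convenient to build it from a dictionary. This is a small sketch; `build_model_args` is a hypothetical helper and the path is a placeholder.

```python
def build_model_args(args):
    """Join a dict into a comma-separated key=value string for --model_args."""
    parts = []
    for key, value in args.items():
        # Booleans are written in lowercase, as in the command used in this lab.
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return ",".join(parts)


model_args = build_model_args({
    "pretrained": "/path/to/granite-3.2-2b-instruct-W4A16",  # placeholder path
    "add_bos_token": True,
    "max_model_len": 4096,
    "gpu_memory_utilization": 0.5,
})
print(model_args)
```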

In [None]:
import os

current_dir = os.getcwd()

MODEL_ID = current_dir + "/granite-3.2-2b-instruct-W4A16"

!lm_eval --model vllm \
  --model_args "pretrained=$MODEL_ID,add_bos_token=true,max_model_len=4096,gpu_memory_utilization=0.5" \
  --trust_remote_code \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'

With more powerful GPU(s), you could also run the broader `openllm` task group through vLLM, using higher GPU memory utilization and chunked prefill:
```bash
!lm_eval \
  --model vllm \
  --model_args pretrained=$SAVE_DIR,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --trust_remote_code \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

### Upload the optimized model to MinIO

In [None]:
%pip install -q boto3
import os
from boto3 import client

current_dir = os.getcwd()
OPTIMIZED_MODEL_DIR = current_dir + "/granite-3.2-2b-instruct-W4A16"
S3_PATH = "granite-int4-notebook"

print('Starting upload of quantized model')
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

print(f'Uploading quantized model to bucket {s3_bucket_name} '
      f'in S3 storage at {s3_endpoint_url}')

s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False
)

# Walk through the local folder and upload every file, keeping the relative layout
for root, dirs, files in os.walk(OPTIMIZED_MODEL_DIR):
    for file in files:
        local_file_path = os.path.join(root, file)
        relative_path = os.path.relpath(local_file_path, OPTIMIZED_MODEL_DIR)
        s3_file_path = os.path.join(S3_PATH, relative_path)
        s3_client.upload_file(local_file_path, s3_bucket_name, s3_file_path)
        print(f'Uploaded {local_file_path}')

print('Finished uploading quantized model')
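
Before uploading, it can help to preview which object keys would be created. The sketch below only walks the local directory and never contacts S3. `plan_uploads` is a hypothetical helper, and the directory and prefix are assumed to match the values used in the upload cell.

```python
import os
import posixpath


def plan_uploads(local_dir, s3_prefix):
    """Return sorted (local_path, s3_key) pairs for every file under local_dir."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, whatever the local OS uses.
            key = posixpath.join(s3_prefix, *relative.split(os.sep))
            pairs.append((local_path, key))
    return sorted(pairs)


# Prints nothing if the directory does not exist yet.
for local_path, key in plan_uploads("granite-3.2-2b-instruct-W4A16", "granite-int4-notebook"):
    print(f"{local_path} -> {key}")
```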

### Bonus labs
- Experiment with different quantization schemes and methods to further improve accuracy
- Prepare a new calibration dataset tailored to a specific use case by collecting and mixing data from relevant sources