# Quantization of `granite-3.3-2b-instruct` model

Recall that our overall solution uses the quantized version of the model `granite-3.3-2b-instruct`. In this lab, we will be taking in the base model `granite-3.3-2b-instruct` and quantizing it to `W4A16` - which is fixed-point integer (INT) quantization scheme for weights and floating‑point for activations - to provide both memory savings (weight - INT4) and inference acceleration (activations - BF16) with `vLLM`

**Note**: `W4A16` computation is supported on Nvidia GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).

**Note**: The steps here will take around 20-30 minutes, depending on the connectivity. The most time consuming steps are the installation of llmcompressor (up to 5 mins) and the quantization step (which can take more anywhere between 10-15 mins)

## Setting up llm-compressor first

Installing `llmcompressor` may take a minute, depending on the bandwith available. Do note the versions of `transformer` library we would be using. There is a known issue (*torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow*) with the usage of the latest transformer library (version `4.53.2` as of July 17, 2025) in combination with the latest version of llmcompressor (version `0.6.0`).

In [9]:
!pip install -q llmcompressor==0.6.0 transformers==4.52.2


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Let's make sure we have installed the right versions installed

In [10]:
!pip list | grep llmcompressor

llmcompressor                     0.6.0


In [11]:
!pip list | grep transformer

transformers                      4.52.2


## Let' start with the quantization of the model

There are 5 steps:
1. Loading the model
2. Preparing the calibration data
3. Applying quantization
4. Evaluation of accuracy in vLLM
5. Uploading the model to S3 (MinIO)

### Loading the model

Load the model using AutoModelForCausalLM for handling quantized saving and loading. The model can be loaded from HuggingFace directly as follows:

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Prepare calibration data

Prepare the calibration data. When quantizing weigths of a model to int4 using GPTQ, we need some sample data to run the GPTQ algorithms. As a result, it is very important to use calibration data that closely matches the type of data used in our deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.

In our case, we are quantizing an Instruction tuned generic model, so we will use the ultrachat dataset. Some best practices include:
- 512 samples is a good place to start (increase if accuracy drops). We are going to use 256 to speed up the process.
- 2048 sequence length is a good place to start
- Use the chat template or instrucion template that the model is trained with


In [23]:
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # 1024
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

# Load dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. For W4A16, we want to:
- Run SmoothQuant to make the activations easier to quantize
- Quantize the weights to 4 bits with channelwise scales using GPTQ
- Quantize the activations with dynamic per token strategy

**Note**: The quantization step takes a long time to complete due to the callibration requirements -- around 10 - 15 mins, depending on the GPU.

### Imports and definitions

**GPTQModifier**: Applies Gentle Quantization (GPTQ) for weight-only quantization.

**SmoothQuantModifier**: Prepares model activations for smoother quantization by scaling internal activations and weights.

**oneshot**: High-level API that applies your quantization recipe in one go.

In [24]:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

### Hyperparameters

Rationale
- **DAMPENING_FRAC=0.1** gently prevents large Hessian-derived updates during quantization.
- **OBSERVER="mse"** measures quantization error by squared deviations, yielding well-rounded scales.
- **GROUP_SIZE=128** determines group size for per-channel quantization; typical default usage.

In [25]:
DAMPENING_FRAC = 0.1  # tapering adjustment to prevent extreme weight updates
OBSERVER = "mse"  # denotes minmax - quantization layout based on mean‐squared‐error
GROUP_SIZE = 128  # # per-channel grouping width for quantization

### Layer Mappings & Ignoring Heads

Logic

- **ignore=["lm_head"]** skips quantization on the output layer to preserve final logits and maintain accuracy.
- mappings link groups of linear projections (q, k, v, gating, up/down projections) with layernorm blocks—SmoothQuant uses these to shift and normalize activations across paired layers for better quant distribution.

In [26]:
ignore=["lm_head"]
mappings=[
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
    [["re:.*down_proj"], "re:.*up_proj"]
]

### Recipe Definition

**Workflow**

- **SmoothQuantModifier**: Re-scales activations across paired layers before quantization to reduce outliers (smoothing_strength=0.7, high smoothing but not extreme).
- **GPTQModifier**: Performs Weight-Only quantization (4-bit weights, 16-bit activations) on all Linear layers except those ignored, applying your dampening and observer settings. Scheme "W4A16" reduces model size while maintaining decent accuracy. 

In [27]:
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
    GPTQModifier(
        targets=["Linear"],
        ignore=ignore,
        scheme="W4A16",
        dampening_frac=DAMPENING_FRAC,
        observer=OBSERVER,
    )
]

### Quantize in One Shot

**How It Works**

- Feeds dataset (calibration set) into your model to gather activation statistics.
- Applies SmoothQuant rescaling followed by GPTQ quantization in a sequential per-layer manner.
- **max_seq_length=8196** ensures large context coverage for calibration.

In [28]:
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=8196,
)

  oneshot(


2025-07-17T04:25:03.079329+0000 | reset | INFO - Compression lifecycle reset
2025-07-17T04:25:03.082644+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-07-17T04:25:04.425764+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-07-17T04:25:04.426533+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `SmoothQuantModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 1942.94it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 303.10it/s]

2025-07-17T04:25:07.961550+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.input_layernorm
2025-07-17T04:25:08.002475+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.post_attention_layernorm
2025-07-17T04:25:08.004240+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.mlp.up_proj



(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 289.82it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.90it/s]

2025-07-17T04:25:11.013296+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.input_layernorm
2025-07-17T04:25:11.014804+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.post_attention_layernorm
2025-07-17T04:25:11.016318+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.mlp.up_proj



(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.78it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.29it/s]

2025-07-17T04:25:13.272626+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.input_layernorm
2025-07-17T04:25:13.274251+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.post_attention_layernorm
2025-07-17T04:25:13.275625+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.mlp.up_proj



(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.05it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 466.20it/s]

2025-07-17T04:25:15.501407+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.input_layernorm
2025-07-17T04:25:15.503257+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.post_attention_layernorm
2025-07-17T04:25:15.504743+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.mlp.up_proj



(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.76it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.54it/s]

2025-07-17T04:25:17.742049+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.input_layernorm
2025-07-17T04:25:17.743592+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.post_attention_layernorm
2025-07-17T04:25:17.744826+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.mlp.up_proj



(5/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.94it/s]
(6/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 461.57it/s]

2025-07-17T04:25:19.983044+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.input_layernorm
2025-07-17T04:25:19.984626+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.post_attention_layernorm
2025-07-17T04:25:19.986004+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.mlp.up_proj



(6/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.31it/s]
(7/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 460.95it/s]

2025-07-17T04:25:22.230511+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.input_layernorm
2025-07-17T04:25:22.232120+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.post_attention_layernorm





2025-07-17T04:25:22.233585+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.mlp.up_proj


(7/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 458.34it/s]
(8/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 452.71it/s]

2025-07-17T04:25:24.513949+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.input_layernorm
2025-07-17T04:25:24.515558+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.post_attention_layernorm
2025-07-17T04:25:24.516937+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.mlp.up_proj



(8/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.86it/s]
(9/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 460.47it/s]

2025-07-17T04:25:26.771510+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.input_layernorm
2025-07-17T04:25:26.772853+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.post_attention_layernorm
2025-07-17T04:25:26.774284+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.mlp.up_proj



(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.47it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.24it/s]

2025-07-17T04:25:29.034235+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.input_layernorm
2025-07-17T04:25:29.035810+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.post_attention_layernorm
2025-07-17T04:25:29.037238+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.mlp.up_proj



(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.35it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.48it/s]

2025-07-17T04:25:31.270857+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.input_layernorm
2025-07-17T04:25:31.272577+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.post_attention_layernorm
2025-07-17T04:25:31.273959+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.mlp.up_proj



(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 461.43it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 453.88it/s]

2025-07-17T04:25:33.555912+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.input_layernorm
2025-07-17T04:25:33.557337+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.post_attention_layernorm
2025-07-17T04:25:33.558636+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.mlp.up_proj



(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.09it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 458.06it/s]

2025-07-17T04:25:35.817930+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.input_layernorm
2025-07-17T04:25:35.819371+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.post_attention_layernorm
2025-07-17T04:25:35.820744+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.mlp.up_proj



(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.12it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 457.50it/s]

2025-07-17T04:25:38.080379+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.input_layernorm
2025-07-17T04:25:38.082140+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.post_attention_layernorm
2025-07-17T04:25:38.083569+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.mlp.up_proj



(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.57it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.86it/s]

2025-07-17T04:25:40.323204+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.input_layernorm
2025-07-17T04:25:40.324655+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.post_attention_layernorm
2025-07-17T04:25:40.325999+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.mlp.up_proj



(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.11it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.91it/s]

2025-07-17T04:25:42.573695+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.input_layernorm
2025-07-17T04:25:42.575233+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.post_attention_layernorm





2025-07-17T04:25:42.576570+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.mlp.up_proj


(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.90it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.64it/s]

2025-07-17T04:25:44.829186+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.input_layernorm
2025-07-17T04:25:44.830681+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.post_attention_layernorm
2025-07-17T04:25:44.832026+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.mlp.up_proj



(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.39it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.74it/s]

2025-07-17T04:25:47.068090+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.input_layernorm
2025-07-17T04:25:47.069725+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.post_attention_layernorm
2025-07-17T04:25:47.071054+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.mlp.up_proj



(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.56it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 461.13it/s]

2025-07-17T04:25:49.311649+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.input_layernorm
2025-07-17T04:25:49.313101+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.post_attention_layernorm
2025-07-17T04:25:49.314486+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.mlp.up_proj



(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.77it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 456.91it/s]

2025-07-17T04:25:51.575625+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.input_layernorm
2025-07-17T04:25:51.577109+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.post_attention_layernorm
2025-07-17T04:25:51.578320+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.mlp.up_proj



(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.07it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.68it/s]

2025-07-17T04:25:53.820805+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.input_layernorm
2025-07-17T04:25:53.822377+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.post_attention_layernorm
2025-07-17T04:25:53.823740+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.mlp.up_proj



(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.16it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 458.98it/s]

2025-07-17T04:25:56.076752+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.input_layernorm
2025-07-17T04:25:56.078256+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.post_attention_layernorm





2025-07-17T04:25:56.079786+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.mlp.up_proj


(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.57it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 454.82it/s]

2025-07-17T04:25:58.361157+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.input_layernorm
2025-07-17T04:25:58.362514+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.post_attention_layernorm
2025-07-17T04:25:58.363830+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.mlp.up_proj



(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 459.57it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.57it/s]

2025-07-17T04:26:00.651653+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.input_layernorm
2025-07-17T04:26:00.653174+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.post_attention_layernorm





2025-07-17T04:26:00.654772+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.mlp.up_proj


(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.71it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.62it/s]

2025-07-17T04:26:02.942068+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.input_layernorm
2025-07-17T04:26:02.943685+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.post_attention_layernorm





2025-07-17T04:26:02.945338+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.mlp.up_proj


(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.85it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 458.38it/s]

2025-07-17T04:26:05.231860+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.input_layernorm
2025-07-17T04:26:05.233379+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.post_attention_layernorm





2025-07-17T04:26:05.234706+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.mlp.up_proj


(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.18it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 458.27it/s]

2025-07-17T04:26:07.508018+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.input_layernorm
2025-07-17T04:26:07.509813+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.post_attention_layernorm
2025-07-17T04:26:07.511253+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.mlp.up_proj



(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 457.28it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 452.69it/s]

2025-07-17T04:26:09.807651+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.input_layernorm
2025-07-17T04:26:09.809145+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.post_attention_layernorm
2025-07-17T04:26:09.810546+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.mlp.up_proj



(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 458.93it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 457.17it/s]

2025-07-17T04:26:12.095488+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.input_layernorm
2025-07-17T04:26:12.097044+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.post_attention_layernorm
2025-07-17T04:26:12.098435+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.mlp.up_proj



(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.24it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 443.77it/s]

2025-07-17T04:26:14.411755+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.input_layernorm
2025-07-17T04:26:14.413241+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.post_attention_layernorm
2025-07-17T04:26:14.414771+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.mlp.up_proj



(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 451.33it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 455.22it/s]

2025-07-17T04:26:16.727209+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.input_layernorm
2025-07-17T04:26:16.728812+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.post_attention_layernorm
2025-07-17T04:26:16.730292+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.mlp.up_proj



(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 460.19it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 457.11it/s]

2025-07-17T04:26:19.013659+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.input_layernorm
2025-07-17T04:26:19.015346+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.post_attention_layernorm
2025-07-17T04:26:19.016828+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.mlp.up_proj



(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.00it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 456.65it/s]

2025-07-17T04:26:21.310605+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.input_layernorm
2025-07-17T04:26:21.312063+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.post_attention_layernorm
2025-07-17T04:26:21.313359+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.mlp.up_proj



(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 461.78it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 448.01it/s]

2025-07-17T04:26:23.611169+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.input_layernorm
2025-07-17T04:26:23.612884+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.post_attention_layernorm
2025-07-17T04:26:23.614362+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.mlp.up_proj



(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 460.20it/s]
(35/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 457.64it/s]

2025-07-17T04:26:25.887825+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.input_layernorm
2025-07-17T04:26:25.889487+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.post_attention_layernorm
2025-07-17T04:26:25.890927+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.mlp.up_proj



(35/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.11it/s]
(36/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 452.74it/s]

2025-07-17T04:26:28.170739+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.input_layernorm
2025-07-17T04:26:28.172270+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.post_attention_layernorm
2025-07-17T04:26:28.173678+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.mlp.up_proj



(36/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.08it/s]
(37/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.93it/s]

2025-07-17T04:26:30.430765+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.input_layernorm
2025-07-17T04:26:30.432557+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.post_attention_layernorm
2025-07-17T04:26:30.433942+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.mlp.up_proj



(37/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 461.74it/s]
(38/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 449.39it/s]

2025-07-17T04:26:32.724296+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.input_layernorm
2025-07-17T04:26:32.725845+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.post_attention_layernorm
2025-07-17T04:26:32.727141+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.mlp.up_proj



(38/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 460.04it/s]
(39/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 448.91it/s]

2025-07-17T04:26:35.017931+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.input_layernorm
2025-07-17T04:26:35.019401+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.post_attention_layernorm
2025-07-17T04:26:35.020791+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.mlp.up_proj



(39/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.15it/s]
(40/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 450.27it/s]

2025-07-17T04:26:37.305060+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.input_layernorm
2025-07-17T04:26:37.306352+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.post_attention_layernorm
2025-07-17T04:26:37.307667+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.mlp.up_proj



(40/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 459.73it/s]
(41/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 299.37it/s]
(41/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 299.98it/s]


2025-07-17T04:26:42.006259+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 2037.18it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:08<00:00, 63.65it/s]

2025-07-17T04:26:52.051964+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples





2025-07-17T04:26:52.895198+0000 | compress | METRIC - time 0.84s
2025-07-17T04:26:52.896233+0000 | compress | METRIC - error 25357.14
2025-07-17T04:26:52.897140+0000 | compress | METRIC - GPU 0 | usage: 25.06% | total memory: 24 GB
2025-07-17T04:26:52.897646+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:26:52.898366+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2025-07-17T04:26:53.709686+0000 | compress | METRIC - time 0.81s
2025-07-17T04:26:53.710477+0000 | compress | METRIC - error 8634.46
2025-07-17T04:26:53.711323+0000 | compress | METRIC - GPU 0 | usage: 25.06% | total memory: 24 GB
2025-07-17T04:26:53.711890+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:26:53.712755+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
2025-07-17T04:26:54.519493+0000 | compress | METRIC - time 0.81s
2025-07-17T04:26:54.520398+0000 | compress | METRI

(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 384.94it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.69it/s]

2025-07-17T04:27:09.636938+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples





2025-07-17T04:27:10.456580+0000 | compress | METRIC - time 0.82s
2025-07-17T04:27:10.457656+0000 | compress | METRIC - error 15228.57
2025-07-17T04:27:10.458292+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:10.458838+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:27:10.459721+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
2025-07-17T04:27:11.266058+0000 | compress | METRIC - time 0.81s
2025-07-17T04:27:11.267063+0000 | compress | METRIC - error 17480.67
2025-07-17T04:27:11.267884+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:11.268383+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:27:11.269337+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 512 samples
2025-07-17T04:27:12.087528+0000 | compress | METRIC - time 0.82s
2025-07-17T04:27:12.088510+0000 | compress | METR

(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.33it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.60it/s]

2025-07-17T04:27:26.988769+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 512 samples





2025-07-17T04:27:27.841158+0000 | compress | METRIC - time 0.85s
2025-07-17T04:27:27.842371+0000 | compress | METRIC - error 16670.27
2025-07-17T04:27:27.843054+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:27.843495+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:27:27.844281+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 512 samples
2025-07-17T04:27:28.675341+0000 | compress | METRIC - time 0.83s
2025-07-17T04:27:28.676397+0000 | compress | METRIC - error 8536.95
2025-07-17T04:27:28.677264+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:28.677730+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:27:28.678460+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 512 samples
2025-07-17T04:27:29.513408+0000 | compress | METRIC - time 0.83s
2025-07-17T04:27:29.514586+0000 | compress | METRI

(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 460.56it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.80it/s]

2025-07-17T04:27:44.445871+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 512 samples





2025-07-17T04:27:45.301407+0000 | compress | METRIC - time 0.85s
2025-07-17T04:27:45.302466+0000 | compress | METRIC - error 21549.38
2025-07-17T04:27:45.303292+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:45.303781+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:27:45.304639+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 512 samples
2025-07-17T04:27:46.113156+0000 | compress | METRIC - time 0.81s
2025-07-17T04:27:46.114284+0000 | compress | METRIC - error 9593.58
2025-07-17T04:27:46.115446+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:27:46.116003+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:27:46.116884+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 512 samples
2025-07-17T04:27:46.923210+0000 | compress | METRIC - time 0.81s
2025-07-17T04:27:46.924239+0000 | compress | METRI

(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 458.04it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.65it/s]

2025-07-17T04:28:01.935530+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 512 samples





2025-07-17T04:28:02.776991+0000 | compress | METRIC - time 0.84s
2025-07-17T04:28:02.778220+0000 | compress | METRIC - error 27785.49
2025-07-17T04:28:02.778932+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:02.779527+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:28:02.780141+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 512 samples
2025-07-17T04:28:03.613894+0000 | compress | METRIC - time 0.83s
2025-07-17T04:28:03.615113+0000 | compress | METRIC - error 12993.50
2025-07-17T04:28:03.615815+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:03.616278+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:28:03.617046+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 512 samples
2025-07-17T04:28:04.444856+0000 | compress | METRIC - time 0.83s
2025-07-17T04:28:04.446229+0000 | compress | METR

(5/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 460.89it/s]
(6/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.80it/s]

2025-07-17T04:28:19.423117+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.q_proj using 512 samples





2025-07-17T04:28:20.249923+0000 | compress | METRIC - time 0.82s
2025-07-17T04:28:20.251178+0000 | compress | METRIC - error 28138.86
2025-07-17T04:28:20.251871+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:20.252382+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:28:20.253171+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.k_proj using 512 samples
2025-07-17T04:28:21.061378+0000 | compress | METRIC - time 0.81s
2025-07-17T04:28:21.062458+0000 | compress | METRIC - error 10117.54
2025-07-17T04:28:21.063372+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:21.063861+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:28:21.064843+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.v_proj using 512 samples
2025-07-17T04:28:21.868247+0000 | compress | METRIC - time 0.80s
2025-07-17T04:28:21.869560+0000 | compress | METR

(6/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 457.90it/s]
(7/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.72it/s]

2025-07-17T04:28:36.862482+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.q_proj using 512 samples





2025-07-17T04:28:37.727374+0000 | compress | METRIC - time 0.86s
2025-07-17T04:28:37.728499+0000 | compress | METRIC - error 31424.29
2025-07-17T04:28:37.729362+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:28:37.729865+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:28:37.730843+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.k_proj using 512 samples
2025-07-17T04:28:38.572632+0000 | compress | METRIC - time 0.84s
2025-07-17T04:28:38.573814+0000 | compress | METRIC - error 10798.92
2025-07-17T04:28:38.574722+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:28:38.575189+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:28:38.576219+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.v_proj using 512 samples
2025-07-17T04:28:39.415701+0000 | compress | METRIC - time 0.84s
2025-07-17T04:28:39.416954+0000 | compress | METR

(7/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.57it/s]
(8/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.98it/s]

2025-07-17T04:28:54.391185+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.q_proj using 512 samples





2025-07-17T04:28:55.217944+0000 | compress | METRIC - time 0.82s
2025-07-17T04:28:55.219189+0000 | compress | METRIC - error 31715.43
2025-07-17T04:28:55.219948+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:55.220448+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:28:55.221186+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.k_proj using 512 samples
2025-07-17T04:28:56.026360+0000 | compress | METRIC - time 0.80s
2025-07-17T04:28:56.027509+0000 | compress | METRIC - error 10376.40
2025-07-17T04:28:56.028157+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:28:56.028910+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:28:56.029564+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.v_proj using 512 samples
2025-07-17T04:28:56.841180+0000 | compress | METRIC - time 0.81s
2025-07-17T04:28:56.842257+0000 | compress | METR

(8/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 453.85it/s]
(9/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.70it/s]

2025-07-17T04:29:11.949691+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.q_proj using 512 samples





2025-07-17T04:29:12.817098+0000 | compress | METRIC - time 0.87s
2025-07-17T04:29:12.818164+0000 | compress | METRIC - error 26619.54
2025-07-17T04:29:12.819131+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:12.819709+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:29:12.820454+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.k_proj using 512 samples
2025-07-17T04:29:13.660209+0000 | compress | METRIC - time 0.84s
2025-07-17T04:29:13.661337+0000 | compress | METRIC - error 10806.31
2025-07-17T04:29:13.662525+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:13.662957+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:29:13.663702+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.v_proj using 512 samples
2025-07-17T04:29:14.500145+0000 | compress | METRIC - time 0.84s
2025-07-17T04:29:14.501228+0000 | compress | METR

(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.63it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.69it/s]

2025-07-17T04:29:29.547159+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 512 samples





2025-07-17T04:29:30.395879+0000 | compress | METRIC - time 0.85s
2025-07-17T04:29:30.396946+0000 | compress | METRIC - error 36707.75
2025-07-17T04:29:30.397887+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:30.398373+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:29:30.399303+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 512 samples
2025-07-17T04:29:31.230916+0000 | compress | METRIC - time 0.83s
2025-07-17T04:29:31.231975+0000 | compress | METRIC - error 13270.29
2025-07-17T04:29:31.232821+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:31.233261+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:29:31.234197+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 512 samples
2025-07-17T04:29:32.068814+0000 | compress | METRIC - time 0.83s
2025-07-17T04:29:32.069883+0000 | compress | METR

(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.70it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.70it/s]

2025-07-17T04:29:47.143683+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 512 samples





2025-07-17T04:29:47.968795+0000 | compress | METRIC - time 0.82s
2025-07-17T04:29:47.969959+0000 | compress | METRIC - error 32072.78
2025-07-17T04:29:47.970820+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:47.971429+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:29:47.972147+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 512 samples
2025-07-17T04:29:48.783714+0000 | compress | METRIC - time 0.81s
2025-07-17T04:29:48.784804+0000 | compress | METRIC - error 11672.78
2025-07-17T04:29:48.785378+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:29:48.786022+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:29:48.786659+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 512 samples
2025-07-17T04:29:49.599772+0000 | compress | METRIC - time 0.81s
2025-07-17T04:29:49.601026+0000 | compress | ME

(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 456.40it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.87it/s]

2025-07-17T04:30:04.579915+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 512 samples





2025-07-17T04:30:05.419654+0000 | compress | METRIC - time 0.84s
2025-07-17T04:30:05.420790+0000 | compress | METRIC - error 31032.87
2025-07-17T04:30:05.421703+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:30:05.422172+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:30:05.422910+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 512 samples
2025-07-17T04:30:06.238655+0000 | compress | METRIC - time 0.82s
2025-07-17T04:30:06.239568+0000 | compress | METRIC - error 12427.48
2025-07-17T04:30:06.240212+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:30:06.240747+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:30:06.241358+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 512 samples
2025-07-17T04:30:07.054716+0000 | compress | METRIC - time 0.81s
2025-07-17T04:30:07.056067+0000 | compress | ME

(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 458.86it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.80it/s]

2025-07-17T04:30:21.971653+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 512 samples





2025-07-17T04:30:22.787305+0000 | compress | METRIC - time 0.81s
2025-07-17T04:30:22.788469+0000 | compress | METRIC - error 46894.30
2025-07-17T04:30:22.789379+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:22.789887+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:30:22.790680+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 512 samples
2025-07-17T04:30:23.588008+0000 | compress | METRIC - time 0.80s
2025-07-17T04:30:23.589302+0000 | compress | METRIC - error 23200.22
2025-07-17T04:30:23.590203+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:23.590735+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:30:23.591760+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 512 samples
2025-07-17T04:30:24.402627+0000 | compress | METRIC - time 0.81s
2025-07-17T04:30:24.403710+0000 | compress | ME

(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 459.87it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.67it/s]

2025-07-17T04:30:39.484367+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples





2025-07-17T04:30:40.298482+0000 | compress | METRIC - time 0.81s
2025-07-17T04:30:40.299615+0000 | compress | METRIC - error 46320.20
2025-07-17T04:30:40.301128+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:40.301593+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:30:40.302281+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2025-07-17T04:30:41.095363+0000 | compress | METRIC - time 0.79s
2025-07-17T04:30:41.096506+0000 | compress | METRIC - error 23099.65
2025-07-17T04:30:41.097206+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:41.097671+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:30:41.098338+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 512 samples
2025-07-17T04:30:41.894459+0000 | compress | METRIC - time 0.80s
2025-07-17T04:30:41.895523+0000 | compress | ME

(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 461.23it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.30it/s]

2025-07-17T04:30:56.669044+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 512 samples





2025-07-17T04:30:57.537166+0000 | compress | METRIC - time 0.87s
2025-07-17T04:30:57.538264+0000 | compress | METRIC - error 48043.87
2025-07-17T04:30:57.538984+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:57.539445+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:30:57.540337+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 512 samples
2025-07-17T04:30:58.354673+0000 | compress | METRIC - time 0.81s
2025-07-17T04:30:58.355822+0000 | compress | METRIC - error 24123.07
2025-07-17T04:30:58.356713+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:30:58.357229+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:30:58.358204+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 512 samples
2025-07-17T04:30:59.199179+0000 | compress | METRIC - time 0.84s
2025-07-17T04:30:59.200252+0000 | compress | ME

(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 458.42it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.03it/s]

2025-07-17T04:31:14.120215+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 512 samples





2025-07-17T04:31:14.976074+0000 | compress | METRIC - time 0.85s
2025-07-17T04:31:14.977201+0000 | compress | METRIC - error 46178.38
2025-07-17T04:31:14.977925+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:31:14.978352+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:31:14.979154+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 512 samples
2025-07-17T04:31:15.824838+0000 | compress | METRIC - time 0.85s
2025-07-17T04:31:15.826096+0000 | compress | METRIC - error 19151.73
2025-07-17T04:31:15.826825+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:31:15.827263+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:31:15.827962+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 512 samples
2025-07-17T04:31:16.671298+0000 | compress | METRIC - time 0.84s
2025-07-17T04:31:16.672617+0000 | compress | ME

(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 451.82it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.67it/s]

2025-07-17T04:31:31.772166+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 512 samples





2025-07-17T04:31:32.598223+0000 | compress | METRIC - time 0.82s
2025-07-17T04:31:32.599423+0000 | compress | METRIC - error 51326.73
2025-07-17T04:31:32.600299+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:31:32.600856+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:31:32.601689+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 512 samples
2025-07-17T04:31:33.430093+0000 | compress | METRIC - time 0.83s
2025-07-17T04:31:33.431257+0000 | compress | METRIC - error 18061.91
2025-07-17T04:31:33.432064+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:31:33.432612+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:31:33.433360+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 512 samples
2025-07-17T04:31:34.280301+0000 | compress | METRIC - time 0.85s
2025-07-17T04:31:34.281435+0000 | compress | ME

(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 440.95it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.59it/s]

2025-07-17T04:31:49.452079+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 512 samples





2025-07-17T04:31:50.305951+0000 | compress | METRIC - time 0.85s
2025-07-17T04:31:50.307166+0000 | compress | METRIC - error 45944.11
2025-07-17T04:31:50.308426+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:31:50.308894+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:31:50.309765+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 512 samples
2025-07-17T04:31:51.154462+0000 | compress | METRIC - time 0.84s
2025-07-17T04:31:51.155621+0000 | compress | METRIC - error 22809.62
2025-07-17T04:31:51.156369+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:31:51.156851+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:31:51.157669+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 512 samples
2025-07-17T04:31:52.000326+0000 | compress | METRIC - time 0.84s
2025-07-17T04:31:52.001688+0000 | compress | ME

(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 436.60it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.51it/s]

2025-07-17T04:32:07.204629+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 512 samples





2025-07-17T04:32:08.056064+0000 | compress | METRIC - time 0.85s
2025-07-17T04:32:08.057244+0000 | compress | METRIC - error 45081.05
2025-07-17T04:32:08.058067+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:08.058549+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:32:08.059329+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 512 samples
2025-07-17T04:32:08.873402+0000 | compress | METRIC - time 0.81s
2025-07-17T04:32:08.874838+0000 | compress | METRIC - error 22542.04
2025-07-17T04:32:08.875707+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:08.876344+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:32:08.877091+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 512 samples
2025-07-17T04:32:09.692157+0000 | compress | METRIC - time 0.81s
2025-07-17T04:32:09.693346+0000 | compress | ME

(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 431.29it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.30it/s]

2025-07-17T04:32:24.857146+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 512 samples





2025-07-17T04:32:25.707313+0000 | compress | METRIC - time 0.85s
2025-07-17T04:32:25.708507+0000 | compress | METRIC - error 50180.57
2025-07-17T04:32:25.709371+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:25.709981+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:32:25.710808+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 512 samples
2025-07-17T04:32:26.539391+0000 | compress | METRIC - time 0.83s
2025-07-17T04:32:26.540448+0000 | compress | METRIC - error 23515.94
2025-07-17T04:32:26.541296+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:26.541865+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:32:26.542764+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 512 samples
2025-07-17T04:32:27.364706+0000 | compress | METRIC - time 0.82s
2025-07-17T04:32:27.365852+0000 | compress | ME

(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 437.51it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 64.97it/s]

2025-07-17T04:32:42.638021+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 512 samples





2025-07-17T04:32:43.480513+0000 | compress | METRIC - time 0.84s
2025-07-17T04:32:43.481659+0000 | compress | METRIC - error 47922.67
2025-07-17T04:32:43.482449+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:43.482905+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:32:43.483812+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 512 samples
2025-07-17T04:32:44.305911+0000 | compress | METRIC - time 0.82s
2025-07-17T04:32:44.307119+0000 | compress | METRIC - error 21310.68
2025-07-17T04:32:44.308006+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:32:44.308668+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:32:44.309377+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 512 samples
2025-07-17T04:32:45.154864+0000 | compress | METRIC - time 0.85s
2025-07-17T04:32:45.155920+0000 | compress | ME

(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 432.59it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.17it/s]

2025-07-17T04:33:00.396097+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 512 samples





2025-07-17T04:33:01.239784+0000 | compress | METRIC - time 0.84s
2025-07-17T04:33:01.241009+0000 | compress | METRIC - error 44397.91
2025-07-17T04:33:01.241851+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:33:01.242477+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:33:01.243245+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 512 samples
2025-07-17T04:33:02.070702+0000 | compress | METRIC - time 0.83s
2025-07-17T04:33:02.071905+0000 | compress | METRIC - error 20565.67
2025-07-17T04:33:02.073107+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:33:02.073676+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:33:02.074685+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 512 samples
2025-07-17T04:33:02.904567+0000 | compress | METRIC - time 0.83s
2025-07-17T04:33:02.905791+0000 | compress | ME

(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 439.66it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.45it/s]

2025-07-17T04:33:18.020539+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 512 samples





2025-07-17T04:33:18.862773+0000 | compress | METRIC - time 0.84s
2025-07-17T04:33:18.863974+0000 | compress | METRIC - error 43669.35
2025-07-17T04:33:18.864754+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:18.865248+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:33:18.866026+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 512 samples
2025-07-17T04:33:19.696790+0000 | compress | METRIC - time 0.83s
2025-07-17T04:33:19.697951+0000 | compress | METRIC - error 16668.33
2025-07-17T04:33:19.698791+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:19.699430+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:33:19.700176+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 512 samples
2025-07-17T04:33:20.533852+0000 | compress | METRIC - time 0.83s
2025-07-17T04:33:20.535000+0000 | compress | ME

(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 433.59it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.50it/s]

2025-07-17T04:33:35.616509+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 512 samples





2025-07-17T04:33:36.432237+0000 | compress | METRIC - time 0.81s
2025-07-17T04:33:36.433430+0000 | compress | METRIC - error 55479.35
2025-07-17T04:33:36.434170+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:36.434643+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:33:36.435330+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 512 samples
2025-07-17T04:33:37.241602+0000 | compress | METRIC - time 0.81s
2025-07-17T04:33:37.242710+0000 | compress | METRIC - error 25977.51
2025-07-17T04:33:37.243277+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:37.243766+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:33:37.244761+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 512 samples
2025-07-17T04:33:38.052292+0000 | compress | METRIC - time 0.81s
2025-07-17T04:33:38.053512+0000 | compress | ME

(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 441.27it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.50it/s]

2025-07-17T04:33:52.928864+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples





2025-07-17T04:33:53.749625+0000 | compress | METRIC - time 0.82s
2025-07-17T04:33:53.750866+0000 | compress | METRIC - error 44112.20
2025-07-17T04:33:53.751664+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:53.752131+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:33:53.752894+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 512 samples
2025-07-17T04:33:54.551797+0000 | compress | METRIC - time 0.80s
2025-07-17T04:33:54.553060+0000 | compress | METRIC - error 19161.67
2025-07-17T04:33:54.553756+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:33:54.554180+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:33:54.554862+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 512 samples
2025-07-17T04:33:55.353521+0000 | compress | METRIC - time 0.80s
2025-07-17T04:33:55.354608+0000 | compress | ME

(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 437.41it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.56it/s]

2025-07-17T04:34:10.250078+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 512 samples





2025-07-17T04:34:11.083813+0000 | compress | METRIC - time 0.83s
2025-07-17T04:34:11.085008+0000 | compress | METRIC - error 48480.65
2025-07-17T04:34:11.086063+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:34:11.086589+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:34:11.087609+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 512 samples
2025-07-17T04:34:11.895359+0000 | compress | METRIC - time 0.81s
2025-07-17T04:34:11.896760+0000 | compress | METRIC - error 19464.48
2025-07-17T04:34:11.897434+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:34:11.897910+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:34:11.898739+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 512 samples
2025-07-17T04:34:12.705707+0000 | compress | METRIC - time 0.81s
2025-07-17T04:34:12.706923+0000 | compress | ME

(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 442.71it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.37it/s]

2025-07-17T04:34:27.598917+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 512 samples





2025-07-17T04:34:28.427528+0000 | compress | METRIC - time 0.83s
2025-07-17T04:34:28.428767+0000 | compress | METRIC - error 42320.27
2025-07-17T04:34:28.429710+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:34:28.430338+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:34:28.431434+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 512 samples
2025-07-17T04:34:29.248557+0000 | compress | METRIC - time 0.82s
2025-07-17T04:34:29.249999+0000 | compress | METRIC - error 16964.61
2025-07-17T04:34:29.251131+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:34:29.251692+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:34:29.252558+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 512 samples
2025-07-17T04:34:30.082471+0000 | compress | METRIC - time 0.83s
2025-07-17T04:34:30.083661+0000 | compress | ME

(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 438.19it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.10it/s]

2025-07-17T04:34:45.187672+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 512 samples





2025-07-17T04:34:46.013430+0000 | compress | METRIC - time 0.82s
2025-07-17T04:34:46.014737+0000 | compress | METRIC - error 45695.69
2025-07-17T04:34:46.015539+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:34:46.016134+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:34:46.016983+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 512 samples
2025-07-17T04:34:46.830012+0000 | compress | METRIC - time 0.81s
2025-07-17T04:34:46.831210+0000 | compress | METRIC - error 18027.95
2025-07-17T04:34:46.831912+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:34:46.832667+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:34:46.833487+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 512 samples
2025-07-17T04:34:47.653723+0000 | compress | METRIC - time 0.82s
2025-07-17T04:34:47.654955+0000 | compress | ME

(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 437.17it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 65.86it/s]

2025-07-17T04:35:02.794180+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.q_proj using 512 samples





2025-07-17T04:35:03.620978+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:03.622138+0000 | compress | METRIC - error 50989.55
2025-07-17T04:35:03.623034+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:03.623608+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:35:03.624354+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.k_proj using 512 samples
2025-07-17T04:35:04.440531+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:04.441817+0000 | compress | METRIC - error 19043.12
2025-07-17T04:35:04.442732+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:04.443161+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:35:04.444196+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.v_proj using 512 samples
2025-07-17T04:35:05.261094+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:05.262337+0000 | compress | ME

(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 434.89it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.27it/s]

2025-07-17T04:35:20.322578+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.q_proj using 512 samples





2025-07-17T04:35:21.153050+0000 | compress | METRIC - time 0.83s
2025-07-17T04:35:21.154219+0000 | compress | METRIC - error 45168.05
2025-07-17T04:35:21.155019+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:21.155597+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:35:21.156564+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.k_proj using 512 samples
2025-07-17T04:35:21.970962+0000 | compress | METRIC - time 0.81s
2025-07-17T04:35:21.972378+0000 | compress | METRIC - error 16230.17
2025-07-17T04:35:21.973152+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:21.973659+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:35:21.974488+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.v_proj using 512 samples
2025-07-17T04:35:22.793278+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:22.794394+0000 | compress | ME

(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 437.74it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.52it/s]

2025-07-17T04:35:37.821685+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.q_proj using 512 samples





2025-07-17T04:35:38.658687+0000 | compress | METRIC - time 0.84s
2025-07-17T04:35:38.659788+0000 | compress | METRIC - error 59249.29
2025-07-17T04:35:38.660653+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:38.661095+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:35:38.661936+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.k_proj using 512 samples
2025-07-17T04:35:39.484482+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:39.485900+0000 | compress | METRIC - error 21245.49
2025-07-17T04:35:39.486632+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:39.487299+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:35:39.488327+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.v_proj using 512 samples
2025-07-17T04:35:40.301850+0000 | compress | METRIC - time 0.81s
2025-07-17T04:35:40.303014+0000 | compress | ME

(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 439.02it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.64it/s]

2025-07-17T04:35:55.274771+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.q_proj using 512 samples





2025-07-17T04:35:56.112074+0000 | compress | METRIC - time 0.84s
2025-07-17T04:35:56.113263+0000 | compress | METRIC - error 74129.67
2025-07-17T04:35:56.114093+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:56.114584+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:35:56.115455+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.k_proj using 512 samples
2025-07-17T04:35:56.935224+0000 | compress | METRIC - time 0.82s
2025-07-17T04:35:56.936496+0000 | compress | METRIC - error 27229.66
2025-07-17T04:35:56.937679+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:35:56.938279+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:35:56.938954+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.v_proj using 512 samples
2025-07-17T04:35:57.775908+0000 | compress | METRIC - time 0.84s
2025-07-17T04:35:57.777087+0000 | compress | ME

(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 438.25it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.80it/s]

2025-07-17T04:36:12.731246+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.q_proj using 512 samples





2025-07-17T04:36:13.580628+0000 | compress | METRIC - time 0.85s
2025-07-17T04:36:13.582012+0000 | compress | METRIC - error 57412.60
2025-07-17T04:36:13.582861+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:13.583518+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:36:13.584357+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.k_proj using 512 samples
2025-07-17T04:36:14.412377+0000 | compress | METRIC - time 0.83s
2025-07-17T04:36:14.413591+0000 | compress | METRIC - error 23030.44
2025-07-17T04:36:14.414441+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:14.414915+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:36:14.415734+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.v_proj using 512 samples
2025-07-17T04:36:15.242725+0000 | compress | METRIC - time 0.83s
2025-07-17T04:36:15.243916+0000 | compress | ME

(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 436.76it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.92it/s]

2025-07-17T04:36:30.304585+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.q_proj using 512 samples





2025-07-17T04:36:31.144069+0000 | compress | METRIC - time 0.84s
2025-07-17T04:36:31.145135+0000 | compress | METRIC - error 76004.03
2025-07-17T04:36:31.145891+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:31.146313+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:36:31.147000+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.k_proj using 512 samples
2025-07-17T04:36:31.971317+0000 | compress | METRIC - time 0.82s
2025-07-17T04:36:31.972566+0000 | compress | METRIC - error 30944.51
2025-07-17T04:36:31.973295+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:31.973933+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:36:31.974625+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.v_proj using 512 samples
2025-07-17T04:36:32.789124+0000 | compress | METRIC - time 0.81s
2025-07-17T04:36:32.790365+0000 | compress | ME

(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 440.66it/s]
(35/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.99it/s]

2025-07-17T04:36:47.717109+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.q_proj using 512 samples





2025-07-17T04:36:48.540675+0000 | compress | METRIC - time 0.82s
2025-07-17T04:36:48.541828+0000 | compress | METRIC - error 96078.50
2025-07-17T04:36:48.542624+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:48.543068+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:36:48.543978+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.k_proj using 512 samples
2025-07-17T04:36:49.395534+0000 | compress | METRIC - time 0.85s
2025-07-17T04:36:49.396707+0000 | compress | METRIC - error 33798.20
2025-07-17T04:36:49.397401+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:36:49.398022+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:36:49.398702+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.v_proj using 512 samples
2025-07-17T04:36:50.247226+0000 | compress | METRIC - time 0.85s
2025-07-17T04:36:50.248381+0000 | compress | ME

(35/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 436.09it/s]
(36/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.89it/s]

2025-07-17T04:37:05.392329+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.q_proj using 512 samples





2025-07-17T04:37:06.236682+0000 | compress | METRIC - time 0.84s
2025-07-17T04:37:06.237953+0000 | compress | METRIC - error 90035.16
2025-07-17T04:37:06.238924+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:37:06.239470+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:37:06.240346+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.k_proj using 512 samples
2025-07-17T04:37:07.069608+0000 | compress | METRIC - time 0.83s
2025-07-17T04:37:07.070838+0000 | compress | METRIC - error 26605.23
2025-07-17T04:37:07.072024+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T04:37:07.072548+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:37:07.073347+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.v_proj using 512 samples
2025-07-17T04:37:07.903151+0000 | compress | METRIC - time 0.83s
2025-07-17T04:37:07.904375+0000 | compress | ME

(36/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 434.05it/s]
(37/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.75it/s]

2025-07-17T04:37:22.842038+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.q_proj using 512 samples





2025-07-17T04:37:23.705391+0000 | compress | METRIC - time 0.86s
2025-07-17T04:37:23.706642+0000 | compress | METRIC - error 83689.77
2025-07-17T04:37:23.707367+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:37:23.707945+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:37:23.708645+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.k_proj using 512 samples
2025-07-17T04:37:24.555723+0000 | compress | METRIC - time 0.85s
2025-07-17T04:37:24.556919+0000 | compress | METRIC - error 22709.64
2025-07-17T04:37:24.557510+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:37:24.557887+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:37:24.558775+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.v_proj using 512 samples
2025-07-17T04:37:25.409665+0000 | compress | METRIC - time 0.85s
2025-07-17T04:37:25.410787+0000 | compress | ME

(37/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 435.00it/s]
(38/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.65it/s]

2025-07-17T04:37:40.590184+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.q_proj using 512 samples





2025-07-17T04:37:41.444771+0000 | compress | METRIC - time 0.85s
2025-07-17T04:37:41.446061+0000 | compress | METRIC - error 78542.81
2025-07-17T04:37:41.446891+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:37:41.447515+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:37:41.448220+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.k_proj using 512 samples
2025-07-17T04:37:42.302196+0000 | compress | METRIC - time 0.85s
2025-07-17T04:37:42.303427+0000 | compress | METRIC - error 22171.32
2025-07-17T04:37:42.304250+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:37:42.304758+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:37:42.305593+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.v_proj using 512 samples
2025-07-17T04:37:43.162663+0000 | compress | METRIC - time 0.86s
2025-07-17T04:37:43.163758+0000 | compress | ME

(38/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 434.56it/s]
(39/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.54it/s]

2025-07-17T04:37:58.376841+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.q_proj using 512 samples





2025-07-17T04:37:59.234275+0000 | compress | METRIC - time 0.86s
2025-07-17T04:37:59.235520+0000 | compress | METRIC - error 93588.78
2025-07-17T04:37:59.236312+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:37:59.236924+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:37:59.237656+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.k_proj using 512 samples
2025-07-17T04:38:00.084368+0000 | compress | METRIC - time 0.85s
2025-07-17T04:38:00.085746+0000 | compress | METRIC - error 24124.31
2025-07-17T04:38:00.086540+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:38:00.086951+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:38:00.087772+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.v_proj using 512 samples
2025-07-17T04:38:00.934518+0000 | compress | METRIC - time 0.85s
2025-07-17T04:38:00.935772+0000 | compress | ME

(39/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 434.71it/s]
(40/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.77it/s]

2025-07-17T04:38:16.092048+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.q_proj using 512 samples





2025-07-17T04:38:16.948643+0000 | compress | METRIC - time 0.85s
2025-07-17T04:38:16.949703+0000 | compress | METRIC - error 68646.34
2025-07-17T04:38:16.950535+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:38:16.951046+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T04:38:16.951953+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.k_proj using 512 samples
2025-07-17T04:38:17.808140+0000 | compress | METRIC - time 0.86s
2025-07-17T04:38:17.809307+0000 | compress | METRIC - error 18900.18
2025-07-17T04:38:17.810247+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T04:38:17.810687+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T04:38:17.811460+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.v_proj using 512 samples
2025-07-17T04:38:18.653772+0000 | compress | METRIC - time 0.84s
2025-07-17T04:38:18.654980+0000 | compress | ME

(40/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 434.33it/s]
(41/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 297.54it/s]
(41/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 297.61it/s]


2025-07-17T04:38:29.664843+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers


### Save the Compressed Model

**Explanation**

- Naming: appends -W4A16 to distinguish the quantized checkpoint.
- **save_compressed=True** stores weights in compact safetensors format for deployment via vLLM.

In [29]:
# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

2025-07-17T04:41:16.812481+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 527it [00:12, 43.02it/s]


('granite-3.2-2b-instruct-W4A16/tokenizer_config.json',
 'granite-3.2-2b-instruct-W4A16/special_tokens_map.json',
 'granite-3.2-2b-instruct-W4A16/chat_template.jinja',
 'granite-3.2-2b-instruct-W4A16/vocab.json',
 'granite-3.2-2b-instruct-W4A16/merges.txt',
 'granite-3.2-2b-instruct-W4A16/added_tokens.json',
 'granite-3.2-2b-instruct-W4A16/tokenizer.json')

### Evaluate accuracy in vLLM

We can evaluate accuracy with lm_eval

##### Check GPU memory leftovers:

In [1]:
!nvidia-smi

Thu Jul 17 04:42:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   50C    P8             17W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

**IMPORTANT**: After quantizing the model the GPU memory may not be freed (see the above output). You need to **restart the kernel** before evaluating the model to ensure you have enough GPU RAM available.

#### Install lm_eval

In [2]:
!pip install -q lm_eval==v0.4.3


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


#### Install vLLM for evaluation

Run the following to test accuracy on GSM-8K:

In [3]:
pip install -q vllm

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.6.0 requires numpy<2.0,>=1.17.0, but you have numpy 2.2.6 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Evaluation Command

- `--model vllm` - Uses vLLM backend for fast, memory-efficient inference on large models 
- `--model_args` - pretrained=$MODEL_ID: specifies which model to load.
- `add_bos_token=true`: ensures a beginning-of-sequence token is added; required for consistent results on math and reasoning tasks 
- `max_model_len=4096`: sets the context window the model uses for evaluation.
- `gpu_memory_utilization=0.5`: limits vLLM to use 50% of GPU memory, allowing to avoid OOM.

In [9]:
import os

current_dir = os.getcwd()

MODEL_ID = current_dir + "/granite-3.2-2b-instruct-W4A16"

!lm_eval --model vllm \
  --model_args "pretrained=$MODEL_ID,add_bos_token=true,max_model_len=4096,gpu_memory_utilization=0.5" \
  --trust_remote_code \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'

INFO 07-17 04:50:30 [__init__.py:244] Automatically detected platform cuda.
2025-07-17:04:50:32,416 INFO     [__main__.py:272] Verbosity set to INFO
2025-07-17:04:50:36,511 INFO     [__main__.py:357] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
2025-07-17:04:50:36,511 INFO     [__main__.py:369] Selected Tasks: ['gsm8k']
2025-07-17:04:50:36,513 INFO     [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-07-17:04:50:36,513 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': '/opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16', 'add_bos_token': True, 'max_model_len': 4096, 'gpu_memory_utilization': 0.5, 'trust_remote_code': True}
INFO 07-17 04:50:43 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 07-17 04:50:43 [c

With powerful GPU(s), you could also run the vLLM based evals with the following - using higher GPU memory utilization and chunked prefill. 
```bash
!lm_eval \
  --model vllm \
  --model_args pretrained=$SAVE_DIR,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --trust_remote_code \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

**Next Steps**: 
- How would you futher improve the accuacy of the model?
- How would you go about preparing the right data set for a different use case?

### Upload Optimized Model to MinIO

In [2]:
import os
from boto3 import client

current_dir = os.getcwd()
OPTIMIZED_MODEL_DIR = current_dir + "/granite-3.2-2b-instruct-W4A16"
S3_PATH = "granite-int4-notebook"

print('Starting upload of quantizied model')
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

print(f'Uploading predictions to bucket {s3_bucket_name} '
        f'to S3 storage at {s3_endpoint_url}')

s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False
)

# Walk through the local folder and upload files
for root, dirs, files in os.walk(OPTIMIZED_MODEL_DIR):
    for file in files:
        local_file_path = os.path.join(root, file)
        s3_file_path = os.path.join(S3_PATH, local_file_path[len(OPTIMIZED_MODEL_DIR)+1:])
        s3_client.upload_file(local_file_path, s3_bucket_name, s3_file_path)
        print(f'Uploaded {local_file_path}')

print('Finished uploading of quantizied model')

Starting results upload.
Uploading predictions to bucket models to S3 storage at http://minio-service.minio.svc.cluster.local:9000
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/merges.txt
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/model.safetensors
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/added_tokens.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/tokenizer_config.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/chat_template.jinja
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/config.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/