# Quantization of `granite-3.3-2b-instruct` model

Recall that our overall solution uses the quantized version of the model `granite-3.3-2b-instruct`. In this lab, we will be taking in the base model `granite-3.3-2b-instruct` and quantizing it to `W4A16` - which is fixed-point integer (INT) quantization scheme for weights and floating‑point for activations - to provide both memory savings (weight - INT4) and inference acceleration (activations - BF16) with `vLLM`

**Note**: `W4A16` computation is supported on Nvidia GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).

**Note**: The steps here will take around 20-30 minutes, depending on the connectivity. The most time consuming steps are the installation of llmcompressor (up to 5 mins) and the quantization step (which can take more anywhere between 10-15 mins)

## Setting up llm-compressor

Installing `llmcompressor` may take a minute, depending on the bandwith available. Do note the versions of `transformer` library we would be using. There is a known issue (*torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow*) with the usage of the latest transformer library (version `4.53.2` as of July 17, 2025) in combination with the latest version of llmcompressor (version `0.6.0`).

In [3]:
!pip install -q llmcompressor==0.6.0 transformers==4.52.2

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
codeflare-sdk 0.27.0 requires pydantic<2, but you have pydantic 2.11.7 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Let's make sure we have installed the right versions installed

In [6]:
!pip list | grep llmcompressor

llmcompressor             0.6.0


In [7]:
!pip list | grep transformer

transformers              4.52.2


## Let' start with the quantization of the model

There are 6 steps:
1. Loading the model
2. Choosing the quantization scheme and method
3. Preparing the calibration data
4. Applying quantization
5. Saving the model
6. Evaluation of accuracy in vLLM

### Loading the model

Load the model using AutoModelForCausalLM for handling quantized saving and loading. The model can be loaded from HuggingFace directly as follows:

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Prepare calibration data

Prepare the calibration data. When quantizing weigths of a model to int4 using GPTQ, we need some sample data to run the GPTQ algorithms. As a result, it is very important to use calibration data that closely matches the type of data used in our deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.

In our case, we are quantizing an Instruction tuned generic model, so we will use the ultrachat dataset. Some best practices include:
- 512 samples is a good place to start (increase if accuracy drops). We are going to use 256 to speed up the process.
- 2048 sequence length is a good place to start
- Use the chat template or instrucion template that the model is trained with


In [10]:
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # 1024
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

# Load dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. For W4A16, we want to:
- Run SmoothQuant to make the activations easier to quantize
- Quantize the weights to 4 bits with channelwise scales using GPTQ
- Quantize the activations with dynamic per token strategy

**Note**: The quantization step takes a long time to complete due to the callibration requirements -- around 10 - 15 mins, depending on the GPU.

### Imports and definitions

**GPTQModifier**: Applies Gentle Quantization (GPTQ) for weight-only quantization.

**SmoothQuantModifier**: Prepares model activations for smoother quantization by scaling internal activations and weights.

**oneshot**: High-level API that applies your quantization recipe in one go.

In [11]:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

### Hyperparameters

Rationale
- **DAMPENING_FRAC=0.1** gently prevents large Hessian-derived updates during quantization.
- **OBSERVER="mse"** measures quantization error by squared deviations, yielding well-rounded scales.
- **GROUP_SIZE=128** determines group size for per-channel quantization; typical default usage.

In [12]:
DAMPENING_FRAC = 0.1  # tapering adjustment to prevent extreme weight updates
OBSERVER = "mse"  # denotes minmax - quantization layout based on mean‐squared‐error
GROUP_SIZE = 128  # # per-channel grouping width for quantization

### Layer Mappings & Ignoring Heads

Logic

- **ignore=["lm_head"]** skips quantization on the output layer to preserve final logits and maintain accuracy.
- mappings link groups of linear projections (q, k, v, gating, up/down projections) with layernorm blocks—SmoothQuant uses these to shift and normalize activations across paired layers for better quant distribution.

In [13]:
ignore=["lm_head"]
mappings=[
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
    [["re:.*down_proj"], "re:.*up_proj"]
]

### Recipe Definition

**Workflow**

- **SmoothQuantModifier**: Re-scales activations across paired layers before quantization to reduce outliers (smoothing_strength=0.7, high smoothing but not extreme).
- **GPTQModifier**: Performs Weight-Only quantization (4-bit weights, 16-bit activations) on all Linear layers except those ignored, applying your dampening and observer settings. Scheme "W4A16" reduces model size while maintaining decent accuracy. 

In [14]:
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
    GPTQModifier(
        targets=["Linear"],
        ignore=ignore,
        scheme="W4A16",
        dampening_frac=DAMPENING_FRAC,
        observer=OBSERVER,
    )
]

### Quantize in One Shot

**How It Works**

- Feeds dataset (calibration set) into your model to gather activation statistics.
- Applies SmoothQuant rescaling followed by GPTQ quantization in a sequential per-layer manner.
- **max_seq_length=8196** ensures large context coverage for calibration.

In [15]:
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=8196,
)

  oneshot(


2025-07-17T09:10:48.228683+0000 | reset | INFO - Compression lifecycle reset
2025-07-17T09:10:48.231650+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-07-17T09:10:49.561648+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-07-17T09:10:49.562336+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `SmoothQuantModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 2115.70it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 343.40it/s]

2025-07-17T09:10:53.857901+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.input_layernorm
2025-07-17T09:10:53.875479+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.post_attention_layernorm
2025-07-17T09:10:53.877259+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.mlp.up_proj



(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 300.24it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 437.34it/s]

2025-07-17T09:10:56.953348+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.input_layernorm
2025-07-17T09:10:56.954853+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.post_attention_layernorm
2025-07-17T09:10:56.956287+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.mlp.up_proj



(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.31it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 441.41it/s]

2025-07-17T09:10:59.248953+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.input_layernorm
2025-07-17T09:10:59.250376+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.post_attention_layernorm
2025-07-17T09:10:59.251826+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.mlp.up_proj



(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.07it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 442.41it/s]

2025-07-17T09:11:01.538848+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.input_layernorm
2025-07-17T09:11:01.540379+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.post_attention_layernorm
2025-07-17T09:11:01.541775+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.mlp.up_proj



(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.91it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 451.59it/s]

2025-07-17T09:11:03.820924+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.input_layernorm
2025-07-17T09:11:03.822178+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.post_attention_layernorm
2025-07-17T09:11:03.823551+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.mlp.up_proj



(5/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.12it/s]
(6/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 441.27it/s]

2025-07-17T09:11:06.117085+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.input_layernorm
2025-07-17T09:11:06.118349+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.post_attention_layernorm
2025-07-17T09:11:06.119738+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.mlp.up_proj



(6/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.82it/s]
(7/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 440.38it/s]

2025-07-17T09:11:08.408562+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.input_layernorm
2025-07-17T09:11:08.409992+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.post_attention_layernorm
2025-07-17T09:11:08.411390+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.mlp.up_proj



(7/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.20it/s]
(8/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 466.48it/s]

2025-07-17T09:11:10.643711+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.input_layernorm
2025-07-17T09:11:10.645082+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.post_attention_layernorm
2025-07-17T09:11:10.646389+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.mlp.up_proj



(8/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.75it/s]
(9/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.95it/s]

2025-07-17T09:11:12.897019+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.input_layernorm
2025-07-17T09:11:12.898520+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.post_attention_layernorm
2025-07-17T09:11:12.899852+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.mlp.up_proj



(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.78it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 467.68it/s]

2025-07-17T09:11:15.143099+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.input_layernorm
2025-07-17T09:11:15.144331+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.post_attention_layernorm
2025-07-17T09:11:15.145665+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.mlp.up_proj



(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.43it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 464.67it/s]

2025-07-17T09:11:17.391588+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.input_layernorm
2025-07-17T09:11:17.392926+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.post_attention_layernorm
2025-07-17T09:11:17.394324+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.mlp.up_proj



(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.82it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.94it/s]

2025-07-17T09:11:19.612820+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.input_layernorm
2025-07-17T09:11:19.614339+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.post_attention_layernorm
2025-07-17T09:11:19.615680+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.mlp.up_proj



(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.80it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 466.42it/s]

2025-07-17T09:11:21.834771+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.input_layernorm
2025-07-17T09:11:21.836081+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.post_attention_layernorm
2025-07-17T09:11:21.837403+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.mlp.up_proj



(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.42it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.34it/s]

2025-07-17T09:11:24.085712+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.input_layernorm
2025-07-17T09:11:24.087061+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.post_attention_layernorm
2025-07-17T09:11:24.088498+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.mlp.up_proj



(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.53it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 468.32it/s]

2025-07-17T09:11:26.308589+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.input_layernorm
2025-07-17T09:11:26.310011+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.post_attention_layernorm
2025-07-17T09:11:26.311375+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.mlp.up_proj



(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.07it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 470.34it/s]

2025-07-17T09:11:28.584965+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.input_layernorm
2025-07-17T09:11:28.586377+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.post_attention_layernorm
2025-07-17T09:11:28.587720+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.mlp.up_proj



(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.16it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 472.09it/s]

2025-07-17T09:11:30.812575+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.input_layernorm
2025-07-17T09:11:30.814084+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.post_attention_layernorm
2025-07-17T09:11:30.815313+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.mlp.up_proj



(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.54it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 470.03it/s]

2025-07-17T09:11:33.041094+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.input_layernorm
2025-07-17T09:11:33.042497+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.post_attention_layernorm
2025-07-17T09:11:33.043710+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.mlp.up_proj



(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.77it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.41it/s]

2025-07-17T09:11:35.250779+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.input_layernorm
2025-07-17T09:11:35.252048+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.post_attention_layernorm
2025-07-17T09:11:35.253381+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.mlp.up_proj



(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.16it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.47it/s]

2025-07-17T09:11:37.488693+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.input_layernorm
2025-07-17T09:11:37.490197+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.post_attention_layernorm
2025-07-17T09:11:37.491564+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.mlp.up_proj



(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.76it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.85it/s]

2025-07-17T09:11:39.729003+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.input_layernorm
2025-07-17T09:11:39.730592+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.post_attention_layernorm
2025-07-17T09:11:39.732024+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.mlp.up_proj



(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 442.98it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 442.35it/s]

2025-07-17T09:11:42.094612+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.input_layernorm
2025-07-17T09:11:42.096243+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.post_attention_layernorm
2025-07-17T09:11:42.097643+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.mlp.up_proj



(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.34it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 443.15it/s]

2025-07-17T09:11:44.384259+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.input_layernorm
2025-07-17T09:11:44.385576+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.post_attention_layernorm





2025-07-17T09:11:44.386859+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.mlp.up_proj


(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.54it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 442.73it/s]

2025-07-17T09:11:46.673256+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.input_layernorm





2025-07-17T09:11:46.674869+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.post_attention_layernorm
2025-07-17T09:11:46.676078+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.mlp.up_proj


(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.57it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 441.64it/s]

2025-07-17T09:11:48.975029+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.input_layernorm
2025-07-17T09:11:48.976383+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.post_attention_layernorm
2025-07-17T09:11:48.977710+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.mlp.up_proj



(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.72it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 444.21it/s]

2025-07-17T09:11:51.266723+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.input_layernorm
2025-07-17T09:11:51.268072+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.post_attention_layernorm
2025-07-17T09:11:51.269437+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.mlp.up_proj



(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.21it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 441.50it/s]

2025-07-17T09:11:53.576883+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.input_layernorm
2025-07-17T09:11:53.578364+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.post_attention_layernorm
2025-07-17T09:11:53.579759+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.mlp.up_proj



(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.31it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 443.21it/s]

2025-07-17T09:11:55.867887+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.input_layernorm
2025-07-17T09:11:55.869267+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.post_attention_layernorm
2025-07-17T09:11:55.870516+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.mlp.up_proj



(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.81it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 436.31it/s]

2025-07-17T09:11:58.193478+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.input_layernorm
2025-07-17T09:11:58.194736+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.post_attention_layernorm
2025-07-17T09:11:58.196060+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.mlp.up_proj



(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.42it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 442.32it/s]

2025-07-17T09:12:00.493838+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.input_layernorm
2025-07-17T09:12:00.495190+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.post_attention_layernorm
2025-07-17T09:12:00.496546+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.mlp.up_proj



(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.10it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 439.11it/s]

2025-07-17T09:12:02.810139+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.input_layernorm
2025-07-17T09:12:02.811540+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.post_attention_layernorm
2025-07-17T09:12:02.812758+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.mlp.up_proj



(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.49it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 440.76it/s]

2025-07-17T09:12:05.120826+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.input_layernorm
2025-07-17T09:12:05.122231+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.post_attention_layernorm
2025-07-17T09:12:05.123496+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.mlp.up_proj



(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.29it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 438.05it/s]

2025-07-17T09:12:07.443336+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.input_layernorm
2025-07-17T09:12:07.444688+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.post_attention_layernorm
2025-07-17T09:12:07.446032+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.mlp.up_proj



(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.18it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 438.62it/s]

2025-07-17T09:12:09.763132+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.input_layernorm
2025-07-17T09:12:09.764647+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.post_attention_layernorm
2025-07-17T09:12:09.765842+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.mlp.up_proj



(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.56it/s]
(35/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 413.98it/s]

2025-07-17T09:12:12.158309+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.input_layernorm
2025-07-17T09:12:12.159854+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.post_attention_layernorm
2025-07-17T09:12:12.161081+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.mlp.up_proj



(35/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.87it/s]
(36/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.23it/s]

2025-07-17T09:12:14.419904+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.input_layernorm
2025-07-17T09:12:14.421593+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.post_attention_layernorm
2025-07-17T09:12:14.422906+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.mlp.up_proj



(36/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.88it/s]
(37/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.60it/s]

2025-07-17T09:12:16.671714+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.input_layernorm
2025-07-17T09:12:16.673249+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.post_attention_layernorm
2025-07-17T09:12:16.674507+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.mlp.up_proj



(37/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.36it/s]
(38/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.57it/s]

2025-07-17T09:12:18.943263+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.input_layernorm
2025-07-17T09:12:18.944795+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.post_attention_layernorm
2025-07-17T09:12:18.946137+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.mlp.up_proj



(38/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.46it/s]
(39/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.90it/s]

2025-07-17T09:12:21.208639+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.input_layernorm
2025-07-17T09:12:21.210082+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.post_attention_layernorm
2025-07-17T09:12:21.211463+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.mlp.up_proj



(39/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.91it/s]
(40/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.37it/s]

2025-07-17T09:12:23.460818+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.input_layernorm
2025-07-17T09:12:23.462240+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.post_attention_layernorm
2025-07-17T09:12:23.463593+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.mlp.up_proj



(40/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.22it/s]
(41/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 304.07it/s]
(41/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 304.19it/s]


2025-07-17T09:12:28.093779+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 2169.90it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.17it/s]

2025-07-17T09:12:38.183255+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples





2025-07-17T09:12:39.040645+0000 | compress | METRIC - time 0.86s
2025-07-17T09:12:39.041504+0000 | compress | METRIC - error 25369.22
2025-07-17T09:12:39.042480+0000 | compress | METRIC - GPU 0 | usage: 25.90% | total memory: 24 GB
2025-07-17T09:12:39.043142+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:12:39.043840+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2025-07-17T09:12:39.851400+0000 | compress | METRIC - time 0.81s
2025-07-17T09:12:39.852115+0000 | compress | METRIC - error 8632.69
2025-07-17T09:12:39.852901+0000 | compress | METRIC - GPU 0 | usage: 25.90% | total memory: 24 GB
2025-07-17T09:12:39.853438+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:12:39.854161+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
2025-07-17T09:12:40.657667+0000 | compress | METRIC - time 0.80s
2025-07-17T09:12:40.658578+0000 | compress | METRI

(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 393.04it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.88it/s]

2025-07-17T09:12:55.554970+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples





2025-07-17T09:12:56.369038+0000 | compress | METRIC - time 0.81s
2025-07-17T09:12:56.369967+0000 | compress | METRIC - error 15232.86
2025-07-17T09:12:56.370539+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:12:56.371152+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:12:56.371798+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
2025-07-17T09:12:57.171643+0000 | compress | METRIC - time 0.80s
2025-07-17T09:12:57.172587+0000 | compress | METRIC - error 17377.51
2025-07-17T09:12:57.173394+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:12:57.173864+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:12:57.174701+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 512 samples
2025-07-17T09:12:57.993096+0000 | compress | METRIC - time 0.82s
2025-07-17T09:12:57.993966+0000 | compress | METR

(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.10it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.35it/s]

2025-07-17T09:13:12.538205+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 512 samples





2025-07-17T09:13:13.346470+0000 | compress | METRIC - time 0.81s
2025-07-17T09:13:13.347439+0000 | compress | METRIC - error 16670.19
2025-07-17T09:13:13.348243+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:13.348725+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:13:13.349555+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 512 samples
2025-07-17T09:13:14.136070+0000 | compress | METRIC - time 0.79s
2025-07-17T09:13:14.137312+0000 | compress | METRIC - error 8583.50
2025-07-17T09:13:14.138087+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:14.138585+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:13:14.139453+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 512 samples
2025-07-17T09:13:14.923982+0000 | compress | METRIC - time 0.78s
2025-07-17T09:13:14.924881+0000 | compress | METRI

(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.18it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.48it/s]

2025-07-17T09:13:29.398301+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 512 samples





2025-07-17T09:13:30.208726+0000 | compress | METRIC - time 0.81s
2025-07-17T09:13:30.209743+0000 | compress | METRIC - error 21516.20
2025-07-17T09:13:30.210608+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:30.211085+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:13:30.211881+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 512 samples
2025-07-17T09:13:31.001804+0000 | compress | METRIC - time 0.79s
2025-07-17T09:13:31.002756+0000 | compress | METRIC - error 9601.42
2025-07-17T09:13:31.003687+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:31.004150+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:13:31.005044+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 512 samples
2025-07-17T09:13:31.814985+0000 | compress | METRIC - time 0.81s
2025-07-17T09:13:31.815972+0000 | compress | METRI

(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.23it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.76it/s]

2025-07-17T09:13:46.430781+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 512 samples





2025-07-17T09:13:47.253459+0000 | compress | METRIC - time 0.82s
2025-07-17T09:13:47.254425+0000 | compress | METRIC - error 27894.97
2025-07-17T09:13:47.255150+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:47.255760+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:13:47.256388+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 512 samples
2025-07-17T09:13:48.059260+0000 | compress | METRIC - time 0.80s
2025-07-17T09:13:48.060490+0000 | compress | METRIC - error 12959.94
2025-07-17T09:13:48.061073+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:13:48.061570+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:13:48.062334+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 512 samples
2025-07-17T09:13:48.867227+0000 | compress | METRIC - time 0.80s
2025-07-17T09:13:48.868186+0000 | compress | METR

(5/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.45it/s]
(6/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.74it/s]

2025-07-17T09:14:03.404273+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.q_proj using 512 samples





2025-07-17T09:14:04.233150+0000 | compress | METRIC - time 0.83s
2025-07-17T09:14:04.234195+0000 | compress | METRIC - error 28034.49
2025-07-17T09:14:04.234970+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:04.235584+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:14:04.236206+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.k_proj using 512 samples
2025-07-17T09:14:05.039455+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:05.040525+0000 | compress | METRIC - error 10033.96
2025-07-17T09:14:05.041220+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:05.041612+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:14:05.042282+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.v_proj using 512 samples
2025-07-17T09:14:05.849610+0000 | compress | METRIC - time 0.81s
2025-07-17T09:14:05.850622+0000 | compress | METR

(6/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.77it/s]
(7/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.73it/s]

2025-07-17T09:14:20.494887+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.q_proj using 512 samples





2025-07-17T09:14:21.318709+0000 | compress | METRIC - time 0.82s
2025-07-17T09:14:21.319709+0000 | compress | METRIC - error 31429.44
2025-07-17T09:14:21.320539+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:21.321116+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:14:21.321767+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.k_proj using 512 samples
2025-07-17T09:14:22.119177+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:22.120159+0000 | compress | METRIC - error 10746.79
2025-07-17T09:14:22.120931+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:22.121379+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:14:22.122147+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.v_proj using 512 samples
2025-07-17T09:14:22.919274+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:22.920267+0000 | compress | METR

(7/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.92it/s]
(8/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.60it/s]

2025-07-17T09:14:37.622722+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.q_proj using 512 samples





2025-07-17T09:14:38.447828+0000 | compress | METRIC - time 0.82s
2025-07-17T09:14:38.448815+0000 | compress | METRIC - error 32248.95
2025-07-17T09:14:38.449802+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:38.450322+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:14:38.451203+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.k_proj using 512 samples
2025-07-17T09:14:39.254256+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:39.255239+0000 | compress | METRIC - error 10452.95
2025-07-17T09:14:39.256094+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:39.256542+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:14:39.257500+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.v_proj using 512 samples
2025-07-17T09:14:40.060874+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:40.061915+0000 | compress | METR

(8/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.27it/s]
(9/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.54it/s]

2025-07-17T09:14:54.791891+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.q_proj using 512 samples





2025-07-17T09:14:55.614717+0000 | compress | METRIC - time 0.82s
2025-07-17T09:14:55.615793+0000 | compress | METRIC - error 26889.49
2025-07-17T09:14:55.616710+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:55.617191+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:14:55.617995+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.k_proj using 512 samples
2025-07-17T09:14:56.428056+0000 | compress | METRIC - time 0.81s
2025-07-17T09:14:56.429103+0000 | compress | METRIC - error 10940.04
2025-07-17T09:14:56.430394+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:14:56.430817+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:14:56.431550+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.v_proj using 512 samples
2025-07-17T09:14:57.234893+0000 | compress | METRIC - time 0.80s
2025-07-17T09:14:57.235880+0000 | compress | METR

(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.18it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.16it/s]

2025-07-17T09:15:11.821833+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 512 samples





2025-07-17T09:15:12.651138+0000 | compress | METRIC - time 0.83s
2025-07-17T09:15:12.652165+0000 | compress | METRIC - error 36968.02
2025-07-17T09:15:12.653021+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:12.653499+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:15:12.654263+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 512 samples
2025-07-17T09:15:13.456170+0000 | compress | METRIC - time 0.80s
2025-07-17T09:15:13.457325+0000 | compress | METRIC - error 13299.96
2025-07-17T09:15:13.458196+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:13.458694+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:15:13.459679+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 512 samples
2025-07-17T09:15:14.258400+0000 | compress | METRIC - time 0.80s
2025-07-17T09:15:14.259475+0000 | compress | METR

(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.15it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.95it/s]

2025-07-17T09:15:28.968847+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 512 samples





2025-07-17T09:15:29.806764+0000 | compress | METRIC - time 0.84s
2025-07-17T09:15:29.807751+0000 | compress | METRIC - error 32465.08
2025-07-17T09:15:29.808508+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:29.808942+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:15:29.809719+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 512 samples
2025-07-17T09:15:30.623535+0000 | compress | METRIC - time 0.81s
2025-07-17T09:15:30.624564+0000 | compress | METRIC - error 11748.45
2025-07-17T09:15:30.625258+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:30.625710+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:15:30.626501+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 512 samples
2025-07-17T09:15:31.437996+0000 | compress | METRIC - time 0.81s
2025-07-17T09:15:31.438953+0000 | compress | ME

(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.66it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.70it/s]

2025-07-17T09:15:46.059122+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 512 samples





2025-07-17T09:15:46.871560+0000 | compress | METRIC - time 0.81s
2025-07-17T09:15:46.872605+0000 | compress | METRIC - error 31204.64
2025-07-17T09:15:46.873103+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:46.873483+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:15:46.874396+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 512 samples
2025-07-17T09:15:47.660329+0000 | compress | METRIC - time 0.79s
2025-07-17T09:15:47.661304+0000 | compress | METRIC - error 12479.69
2025-07-17T09:15:47.662052+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:15:47.662461+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:15:47.663157+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 512 samples
2025-07-17T09:15:48.448815+0000 | compress | METRIC - time 0.79s
2025-07-17T09:15:48.449781+0000 | compress | ME

(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.42it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.20it/s]

2025-07-17T09:16:03.100483+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 512 samples





2025-07-17T09:16:03.927443+0000 | compress | METRIC - time 0.83s
2025-07-17T09:16:03.928554+0000 | compress | METRIC - error 46924.11
2025-07-17T09:16:03.929377+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:03.929866+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:16:03.930753+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 512 samples
2025-07-17T09:16:04.741484+0000 | compress | METRIC - time 0.81s
2025-07-17T09:16:04.742562+0000 | compress | METRIC - error 23056.03
2025-07-17T09:16:04.743182+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:04.743827+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:16:04.744565+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 512 samples
2025-07-17T09:16:05.553726+0000 | compress | METRIC - time 0.81s
2025-07-17T09:16:05.554752+0000 | compress | ME

(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.63it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.25it/s]

2025-07-17T09:16:20.309969+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples





2025-07-17T09:16:21.139125+0000 | compress | METRIC - time 0.83s
2025-07-17T09:16:21.140127+0000 | compress | METRIC - error 46197.75
2025-07-17T09:16:21.140867+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:21.141324+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:16:21.142104+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2025-07-17T09:16:21.946739+0000 | compress | METRIC - time 0.80s
2025-07-17T09:16:21.947703+0000 | compress | METRIC - error 22849.33
2025-07-17T09:16:21.948589+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:21.949041+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:16:21.949851+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 512 samples
2025-07-17T09:16:22.760330+0000 | compress | METRIC - time 0.81s
2025-07-17T09:16:22.761328+0000 | compress | ME

(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.69it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.32it/s]

2025-07-17T09:16:37.537280+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 512 samples





2025-07-17T09:16:38.364287+0000 | compress | METRIC - time 0.83s
2025-07-17T09:16:38.365407+0000 | compress | METRIC - error 47951.80
2025-07-17T09:16:38.366091+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:38.366700+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:16:38.367389+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 512 samples
2025-07-17T09:16:39.171378+0000 | compress | METRIC - time 0.80s
2025-07-17T09:16:39.172599+0000 | compress | METRIC - error 24082.19
2025-07-17T09:16:39.173266+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:16:39.173747+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:16:39.174638+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 512 samples
2025-07-17T09:16:39.978109+0000 | compress | METRIC - time 0.80s
2025-07-17T09:16:39.979050+0000 | compress | ME

(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.93it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.34it/s]

2025-07-17T09:16:54.745500+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 512 samples





2025-07-17T09:16:55.570058+0000 | compress | METRIC - time 0.82s
2025-07-17T09:16:55.571065+0000 | compress | METRIC - error 46071.93
2025-07-17T09:16:55.571705+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:16:55.572145+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:16:55.573264+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 512 samples
2025-07-17T09:16:56.384586+0000 | compress | METRIC - time 0.81s
2025-07-17T09:16:56.385595+0000 | compress | METRIC - error 19065.91
2025-07-17T09:16:56.386399+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:16:56.386914+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:16:56.387742+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 512 samples
2025-07-17T09:16:57.199033+0000 | compress | METRIC - time 0.81s
2025-07-17T09:16:57.200077+0000 | compress | ME

(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.78it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.14it/s]

2025-07-17T09:17:11.994019+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 512 samples





2025-07-17T09:17:12.813570+0000 | compress | METRIC - time 0.82s
2025-07-17T09:17:12.814779+0000 | compress | METRIC - error 51090.16
2025-07-17T09:17:12.815368+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:17:12.815810+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:17:12.816552+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 512 samples
2025-07-17T09:17:13.623112+0000 | compress | METRIC - time 0.81s
2025-07-17T09:17:13.624180+0000 | compress | METRIC - error 17951.00
2025-07-17T09:17:13.625047+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:17:13.625559+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:17:13.626372+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 512 samples
2025-07-17T09:17:14.434505+0000 | compress | METRIC - time 0.81s
2025-07-17T09:17:14.435175+0000 | compress | ME

(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.29it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.31it/s]

2025-07-17T09:17:29.187640+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 512 samples





2025-07-17T09:17:30.021360+0000 | compress | METRIC - time 0.83s
2025-07-17T09:17:30.022352+0000 | compress | METRIC - error 45799.10
2025-07-17T09:17:30.023181+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:17:30.023683+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:17:30.024605+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 512 samples
2025-07-17T09:17:30.835755+0000 | compress | METRIC - time 0.81s
2025-07-17T09:17:30.836746+0000 | compress | METRIC - error 22858.98
2025-07-17T09:17:30.837617+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:17:30.838336+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:17:30.839056+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 512 samples
2025-07-17T09:17:31.645211+0000 | compress | METRIC - time 0.81s
2025-07-17T09:17:31.646249+0000 | compress | ME

(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.97it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.38it/s]

2025-07-17T09:17:46.498751+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 512 samples





2025-07-17T09:17:47.322612+0000 | compress | METRIC - time 0.82s
2025-07-17T09:17:47.323629+0000 | compress | METRIC - error 44894.59
2025-07-17T09:17:47.324479+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:17:47.325005+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:17:47.325899+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 512 samples
2025-07-17T09:17:48.128083+0000 | compress | METRIC - time 0.80s
2025-07-17T09:17:48.129064+0000 | compress | METRIC - error 22514.70
2025-07-17T09:17:48.129739+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:17:48.130128+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:17:48.130911+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 512 samples
2025-07-17T09:17:48.939127+0000 | compress | METRIC - time 0.81s
2025-07-17T09:17:48.940199+0000 | compress | ME

(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.75it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.24it/s]


2025-07-17T09:18:03.790706+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 512 samples
2025-07-17T09:18:04.609501+0000 | compress | METRIC - time 0.82s
2025-07-17T09:18:04.610474+0000 | compress | METRIC - error 49939.93
2025-07-17T09:18:04.611162+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:18:04.611580+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:18:04.612315+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 512 samples
2025-07-17T09:18:05.414218+0000 | compress | METRIC - time 0.80s
2025-07-17T09:18:05.415191+0000 | compress | METRIC - error 23367.27
2025-07-17T09:18:05.415907+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:18:05.416296+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:18:05.417040+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 51

(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.45it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.04it/s]

2025-07-17T09:18:20.984184+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 512 samples





2025-07-17T09:18:21.826158+0000 | compress | METRIC - time 0.84s
2025-07-17T09:18:21.827145+0000 | compress | METRIC - error 47636.85
2025-07-17T09:18:21.827849+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:21.828355+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:18:21.829029+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 512 samples
2025-07-17T09:18:22.657893+0000 | compress | METRIC - time 0.83s
2025-07-17T09:18:22.658920+0000 | compress | METRIC - error 21290.70
2025-07-17T09:18:22.659775+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:22.660285+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:18:22.661144+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 512 samples
2025-07-17T09:18:23.493282+0000 | compress | METRIC - time 0.83s
2025-07-17T09:18:23.494256+0000 | compress | ME

(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.67it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.53it/s]

2025-07-17T09:18:38.395963+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 512 samples





2025-07-17T09:18:39.219948+0000 | compress | METRIC - time 0.82s
2025-07-17T09:18:39.221031+0000 | compress | METRIC - error 44198.46
2025-07-17T09:18:39.221837+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:39.222510+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:18:39.223126+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 512 samples
2025-07-17T09:18:40.029739+0000 | compress | METRIC - time 0.81s
2025-07-17T09:18:40.030792+0000 | compress | METRIC - error 20548.51
2025-07-17T09:18:40.031887+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:40.032359+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:18:40.033132+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 512 samples
2025-07-17T09:18:40.838880+0000 | compress | METRIC - time 0.81s
2025-07-17T09:18:40.839907+0000 | compress | ME

(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.83it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.37it/s]

2025-07-17T09:18:55.813495+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 512 samples





2025-07-17T09:18:56.647302+0000 | compress | METRIC - time 0.83s
2025-07-17T09:18:56.648360+0000 | compress | METRIC - error 43450.63
2025-07-17T09:18:56.648945+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:56.649325+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:18:56.650204+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 512 samples
2025-07-17T09:18:57.458695+0000 | compress | METRIC - time 0.81s
2025-07-17T09:18:57.459801+0000 | compress | METRIC - error 16607.99
2025-07-17T09:18:57.460522+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:18:57.460908+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:18:57.461633+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 512 samples
2025-07-17T09:18:58.265592+0000 | compress | METRIC - time 0.80s
2025-07-17T09:18:58.266576+0000 | compress | ME

(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.26it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.49it/s]

2025-07-17T09:19:13.122800+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 512 samples





2025-07-17T09:19:13.949065+0000 | compress | METRIC - time 0.82s
2025-07-17T09:19:13.950109+0000 | compress | METRIC - error 55148.71
2025-07-17T09:19:13.950885+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:19:13.951315+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:19:13.952103+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 512 samples
2025-07-17T09:19:14.764098+0000 | compress | METRIC - time 0.81s
2025-07-17T09:19:14.765044+0000 | compress | METRIC - error 25836.77
2025-07-17T09:19:14.765792+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:19:14.766248+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:19:14.767045+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 512 samples
2025-07-17T09:19:15.578714+0000 | compress | METRIC - time 0.81s
2025-07-17T09:19:15.579734+0000 | compress | ME

(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.27it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.59it/s]


2025-07-17T09:19:30.428389+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples
2025-07-17T09:19:31.326280+0000 | compress | METRIC - time 0.90s
2025-07-17T09:19:31.327251+0000 | compress | METRIC - error 43845.05
2025-07-17T09:19:31.328039+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:19:31.328657+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:19:31.329280+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 512 samples
2025-07-17T09:19:32.139832+0000 | compress | METRIC - time 0.81s
2025-07-17T09:19:32.140874+0000 | compress | METRIC - error 19017.76
2025-07-17T09:19:32.141589+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:19:32.142112+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:19:32.142879+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 51

(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.19it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.46it/s]

2025-07-17T09:19:47.936064+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 512 samples





2025-07-17T09:19:48.747274+0000 | compress | METRIC - time 0.81s
2025-07-17T09:19:48.748312+0000 | compress | METRIC - error 48272.56
2025-07-17T09:19:48.749072+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:19:48.749506+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:19:48.750209+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 512 samples
2025-07-17T09:19:49.579017+0000 | compress | METRIC - time 0.83s
2025-07-17T09:19:49.580017+0000 | compress | METRIC - error 19325.72
2025-07-17T09:19:49.580766+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:19:49.581137+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:19:49.581879+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 512 samples
2025-07-17T09:19:50.409785+0000 | compress | METRIC - time 0.83s
2025-07-17T09:19:50.410847+0000 | compress | ME

(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.36it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.47it/s]

2025-07-17T09:20:05.418051+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 512 samples





2025-07-17T09:20:06.249227+0000 | compress | METRIC - time 0.83s
2025-07-17T09:20:06.250280+0000 | compress | METRIC - error 42097.77
2025-07-17T09:20:06.251513+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:06.251937+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:20:06.252761+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 512 samples
2025-07-17T09:20:07.064027+0000 | compress | METRIC - time 0.81s
2025-07-17T09:20:07.065048+0000 | compress | METRIC - error 16851.89
2025-07-17T09:20:07.065814+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:07.066172+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:20:07.066917+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 512 samples
2025-07-17T09:20:07.876043+0000 | compress | METRIC - time 0.81s
2025-07-17T09:20:07.877103+0000 | compress | ME

(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.63it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.46it/s]

2025-07-17T09:20:22.699485+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 512 samples





2025-07-17T09:20:23.530916+0000 | compress | METRIC - time 0.83s
2025-07-17T09:20:23.531891+0000 | compress | METRIC - error 45329.41
2025-07-17T09:20:23.532659+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:23.533146+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:20:23.534064+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 512 samples
2025-07-17T09:20:24.336473+0000 | compress | METRIC - time 0.80s
2025-07-17T09:20:24.337642+0000 | compress | METRIC - error 17873.83
2025-07-17T09:20:24.338395+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:24.338897+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:20:24.339836+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 512 samples
2025-07-17T09:20:25.141974+0000 | compress | METRIC - time 0.80s
2025-07-17T09:20:25.142956+0000 | compress | ME

(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.84it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.55it/s]

2025-07-17T09:20:40.139567+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.q_proj using 512 samples





2025-07-17T09:20:40.968021+0000 | compress | METRIC - time 0.83s
2025-07-17T09:20:40.969032+0000 | compress | METRIC - error 50712.64
2025-07-17T09:20:40.969715+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:40.970273+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:20:40.970880+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.k_proj using 512 samples
2025-07-17T09:20:41.778819+0000 | compress | METRIC - time 0.81s
2025-07-17T09:20:41.779903+0000 | compress | METRIC - error 18863.51
2025-07-17T09:20:41.835905+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:20:41.836444+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:20:41.837235+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.v_proj using 512 samples
2025-07-17T09:20:42.651368+0000 | compress | METRIC - time 0.81s
2025-07-17T09:20:42.652394+0000 | compress | ME

(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.48it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.42it/s]

2025-07-17T09:20:57.537124+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.q_proj using 512 samples





2025-07-17T09:20:58.383054+0000 | compress | METRIC - time 0.84s
2025-07-17T09:20:58.384062+0000 | compress | METRIC - error 44708.07
2025-07-17T09:20:58.384925+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:20:58.385354+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:20:58.386255+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.k_proj using 512 samples
2025-07-17T09:20:59.218741+0000 | compress | METRIC - time 0.83s
2025-07-17T09:20:59.219813+0000 | compress | METRIC - error 16112.13
2025-07-17T09:20:59.220621+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:20:59.221203+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:20:59.221967+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.v_proj using 512 samples
2025-07-17T09:21:00.048054+0000 | compress | METRIC - time 0.83s
2025-07-17T09:21:00.048978+0000 | compress | ME

(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 462.23it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.53it/s]

2025-07-17T09:21:15.103407+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.q_proj using 512 samples





2025-07-17T09:21:15.924720+0000 | compress | METRIC - time 0.82s
2025-07-17T09:21:15.925751+0000 | compress | METRIC - error 58578.48
2025-07-17T09:21:15.926497+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:21:15.926943+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:21:15.927641+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.k_proj using 512 samples
2025-07-17T09:21:16.754310+0000 | compress | METRIC - time 0.83s
2025-07-17T09:21:16.755338+0000 | compress | METRIC - error 21071.55
2025-07-17T09:21:16.756005+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:21:16.756394+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:21:16.757353+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.v_proj using 512 samples
2025-07-17T09:21:17.570608+0000 | compress | METRIC - time 0.81s
2025-07-17T09:21:17.571661+0000 | compress | ME

(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.75it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.65it/s]

2025-07-17T09:21:32.438722+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.q_proj using 512 samples





2025-07-17T09:21:33.271001+0000 | compress | METRIC - time 0.83s
2025-07-17T09:21:33.272014+0000 | compress | METRIC - error 73469.30
2025-07-17T09:21:33.272737+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:21:33.273117+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:21:33.273919+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.k_proj using 512 samples
2025-07-17T09:21:34.083015+0000 | compress | METRIC - time 0.81s
2025-07-17T09:21:34.084047+0000 | compress | METRIC - error 26961.98
2025-07-17T09:21:34.084760+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:21:34.085139+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:21:34.086068+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.v_proj using 512 samples
2025-07-17T09:21:34.894990+0000 | compress | METRIC - time 0.81s
2025-07-17T09:21:34.896006+0000 | compress | ME

(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.19it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.46it/s]

2025-07-17T09:21:49.793598+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.q_proj using 512 samples





2025-07-17T09:21:50.630531+0000 | compress | METRIC - time 0.84s
2025-07-17T09:21:50.631618+0000 | compress | METRIC - error 57086.66
2025-07-17T09:21:50.632462+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:21:50.632892+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:21:50.633625+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.k_proj using 512 samples
2025-07-17T09:21:51.439022+0000 | compress | METRIC - time 0.81s
2025-07-17T09:21:51.439999+0000 | compress | METRIC - error 22851.25
2025-07-17T09:21:51.440771+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:21:51.441315+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:21:51.441967+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.v_proj using 512 samples
2025-07-17T09:21:52.250077+0000 | compress | METRIC - time 0.81s
2025-07-17T09:21:52.251012+0000 | compress | ME

(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.47it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.62it/s]

2025-07-17T09:22:07.100814+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.q_proj using 512 samples





2025-07-17T09:22:07.923936+0000 | compress | METRIC - time 0.82s
2025-07-17T09:22:07.924917+0000 | compress | METRIC - error 75548.86
2025-07-17T09:22:07.925663+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:22:07.926084+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:22:07.926797+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.k_proj using 512 samples
2025-07-17T09:22:08.725566+0000 | compress | METRIC - time 0.80s
2025-07-17T09:22:08.726752+0000 | compress | METRIC - error 30731.62
2025-07-17T09:22:08.727719+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:22:08.728219+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:22:08.728808+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.v_proj using 512 samples
2025-07-17T09:22:09.529207+0000 | compress | METRIC - time 0.80s
2025-07-17T09:22:09.530214+0000 | compress | ME

(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.76it/s]
(35/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.45it/s]

2025-07-17T09:22:24.421549+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.q_proj using 512 samples





2025-07-17T09:22:25.245891+0000 | compress | METRIC - time 0.82s
2025-07-17T09:22:25.246853+0000 | compress | METRIC - error 95343.88
2025-07-17T09:22:25.247607+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:22:25.248234+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:22:25.248960+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.k_proj using 512 samples
2025-07-17T09:22:26.057992+0000 | compress | METRIC - time 0.81s
2025-07-17T09:22:26.058970+0000 | compress | METRIC - error 33731.77
2025-07-17T09:22:26.059760+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:22:26.060193+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:22:26.061042+0000 | compress_modules | INFO - Quantizing model.layers.34.self_attn.v_proj using 512 samples
2025-07-17T09:22:26.874201+0000 | compress | METRIC - time 0.81s
2025-07-17T09:22:26.875178+0000 | compress | ME

(35/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.50it/s]
(36/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.42it/s]

2025-07-17T09:22:41.776737+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.q_proj using 512 samples





2025-07-17T09:22:42.619000+0000 | compress | METRIC - time 0.84s
2025-07-17T09:22:42.620072+0000 | compress | METRIC - error 89639.70
2025-07-17T09:22:42.620829+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:22:42.621277+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:22:42.621981+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.k_proj using 512 samples
2025-07-17T09:22:43.422963+0000 | compress | METRIC - time 0.80s
2025-07-17T09:22:43.424002+0000 | compress | METRIC - error 26552.77
2025-07-17T09:22:43.424764+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:22:43.425147+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:22:43.425957+0000 | compress_modules | INFO - Quantizing model.layers.35.self_attn.v_proj using 512 samples
2025-07-17T09:22:44.228712+0000 | compress | METRIC - time 0.80s
2025-07-17T09:22:44.229835+0000 | compress | ME

(36/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 465.34it/s]
(37/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.47it/s]

2025-07-17T09:22:59.135519+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.q_proj using 512 samples





2025-07-17T09:22:59.956276+0000 | compress | METRIC - time 0.82s
2025-07-17T09:22:59.957285+0000 | compress | METRIC - error 83624.81
2025-07-17T09:22:59.958001+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:22:59.958361+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:22:59.959250+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.k_proj using 512 samples
2025-07-17T09:23:00.763615+0000 | compress | METRIC - time 0.80s
2025-07-17T09:23:00.764667+0000 | compress | METRIC - error 22628.89
2025-07-17T09:23:00.765384+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:23:00.765988+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:23:00.766663+0000 | compress_modules | INFO - Quantizing model.layers.36.self_attn.v_proj using 512 samples
2025-07-17T09:23:01.560200+0000 | compress | METRIC - time 0.79s
2025-07-17T09:23:01.561236+0000 | compress | ME

(37/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.68it/s]
(38/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.37it/s]

2025-07-17T09:23:16.500517+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.q_proj using 512 samples





2025-07-17T09:23:17.328354+0000 | compress | METRIC - time 0.83s
2025-07-17T09:23:17.329083+0000 | compress | METRIC - error 78419.88
2025-07-17T09:23:17.329745+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:23:17.330141+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:23:17.330916+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.k_proj using 512 samples
2025-07-17T09:23:18.159611+0000 | compress | METRIC - time 0.83s
2025-07-17T09:23:18.160609+0000 | compress | METRIC - error 22057.50
2025-07-17T09:23:18.161308+0000 | compress | METRIC - GPU 0 | usage: 25.07% | total memory: 24 GB
2025-07-17T09:23:18.161733+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:23:18.162493+0000 | compress_modules | INFO - Quantizing model.layers.37.self_attn.v_proj using 512 samples
2025-07-17T09:23:18.992199+0000 | compress | METRIC - time 0.83s
2025-07-17T09:23:18.993282+0000 | compress | ME

(38/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.35it/s]
(39/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.83it/s]

2025-07-17T09:23:33.823929+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.q_proj using 512 samples





2025-07-17T09:23:34.656995+0000 | compress | METRIC - time 0.83s
2025-07-17T09:23:34.658030+0000 | compress | METRIC - error 93430.33
2025-07-17T09:23:34.658814+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:23:34.659444+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:23:34.660077+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.k_proj using 512 samples
2025-07-17T09:23:35.474771+0000 | compress | METRIC - time 0.81s
2025-07-17T09:23:35.475832+0000 | compress | METRIC - error 24041.77
2025-07-17T09:23:35.476719+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:23:35.477227+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:23:35.477960+0000 | compress_modules | INFO - Quantizing model.layers.38.self_attn.v_proj using 512 samples
2025-07-17T09:23:36.292487+0000 | compress | METRIC - time 0.81s
2025-07-17T09:23:36.293514+0000 | compress | ME

(39/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.85it/s]
(40/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.35it/s]

2025-07-17T09:23:51.171739+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.q_proj using 512 samples





2025-07-17T09:23:52.012056+0000 | compress | METRIC - time 0.84s
2025-07-17T09:23:52.013042+0000 | compress | METRIC - error 68211.16
2025-07-17T09:23:52.013800+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:23:52.014351+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-07-17T09:23:52.014977+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.k_proj using 512 samples
2025-07-17T09:23:52.818778+0000 | compress | METRIC - time 0.80s
2025-07-17T09:23:52.819843+0000 | compress | METRIC - error 18803.32
2025-07-17T09:23:52.820602+0000 | compress | METRIC - GPU 0 | usage: 25.08% | total memory: 24 GB
2025-07-17T09:23:52.821033+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-07-17T09:23:52.821892+0000 | compress_modules | INFO - Quantizing model.layers.39.self_attn.v_proj using 512 samples
2025-07-17T09:23:53.635544+0000 | compress | METRIC - time 0.81s
2025-07-17T09:23:53.636610+0000 | compress | ME

(40/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 464.68it/s]
(41/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 302.82it/s]
(41/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 302.48it/s]


2025-07-17T09:24:04.449763+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers


### Save the Compressed Model

**Explanation**

- Naming: appends -W4A16 to distinguish the quantized checkpoint.
- **save_compressed=True** stores weights in compact safetensors format for deployment via vLLM.

In [29]:
# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

2025-07-17T04:41:16.812481+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 527it [00:12, 43.02it/s]


('granite-3.2-2b-instruct-W4A16/tokenizer_config.json',
 'granite-3.2-2b-instruct-W4A16/special_tokens_map.json',
 'granite-3.2-2b-instruct-W4A16/chat_template.jinja',
 'granite-3.2-2b-instruct-W4A16/vocab.json',
 'granite-3.2-2b-instruct-W4A16/merges.txt',
 'granite-3.2-2b-instruct-W4A16/added_tokens.json',
 'granite-3.2-2b-instruct-W4A16/tokenizer.json')

### Evaluate accuracy in vLLM

We can evaluate accuracy with lm_eval

##### Check GPU memory leftovers:

In [16]:
!nvidia-smi

Thu Jul 17 09:24:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   72C    P0             51W /   72W |    5552MiB /  23034MiB |     34%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

**IMPORTANT**: After quantizing the model the GPU memory may not be freed (see the above output). You need to **restart the kernel** before evaluating the model to ensure you have enough GPU RAM available.

#### Install lm_eval

In [2]:
!pip install -q lm_eval==v0.4.3


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


#### Install vLLM for evaluation

Run the following to test accuracy on GSM-8K:

In [3]:
pip install -q vllm

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.6.0 requires numpy<2.0,>=1.17.0, but you have numpy 2.2.6 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Evaluation Command

- `--model vllm` - Uses vLLM backend for fast, memory-efficient inference on large models 
- `--model_args` - pretrained=$MODEL_ID: specifies which model to load.
- `add_bos_token=true`: ensures a beginning-of-sequence token is added; required for consistent results on math and reasoning tasks 
- `max_model_len=4096`: sets the context window the model uses for evaluation.
- `gpu_memory_utilization=0.5`: limits vLLM to use 50% of GPU memory, allowing to avoid OOM.

In [9]:
import os

current_dir = os.getcwd()

MODEL_ID = current_dir + "/granite-3.2-2b-instruct-W4A16"

!lm_eval --model vllm \
  --model_args "pretrained=$MODEL_ID,add_bos_token=true,max_model_len=4096,gpu_memory_utilization=0.5" \
  --trust_remote_code \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'

INFO 07-17 04:50:30 [__init__.py:244] Automatically detected platform cuda.
2025-07-17:04:50:32,416 INFO     [__main__.py:272] Verbosity set to INFO
2025-07-17:04:50:36,511 INFO     [__main__.py:357] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
2025-07-17:04:50:36,511 INFO     [__main__.py:369] Selected Tasks: ['gsm8k']
2025-07-17:04:50:36,513 INFO     [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-07-17:04:50:36,513 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': '/opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16', 'add_bos_token': True, 'max_model_len': 4096, 'gpu_memory_utilization': 0.5, 'trust_remote_code': True}
INFO 07-17 04:50:43 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 07-17 04:50:43 [c

With powerful GPU(s), you could also run the vLLM based evals with the following - using higher GPU memory utilization and chunked prefill. 
```bash
!lm_eval \
  --model vllm \
  --model_args pretrained=$SAVE_DIR,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --trust_remote_code \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

**Next Steps**: 
- How would you futher improve the accuacy of the model?
- How would you go about preparing the right data set for a different use case?

### Optionally, upload the optimized model to MinIO

In [2]:
import os
from boto3 import client

current_dir = os.getcwd()
OPTIMIZED_MODEL_DIR = current_dir + "/granite-3.2-2b-instruct-W4A16"
S3_PATH = "granite-int4-notebook"

print('Starting upload of quantizied model')
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

print(f'Uploading predictions to bucket {s3_bucket_name} '
        f'to S3 storage at {s3_endpoint_url}')

s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False
)

# Walk through the local folder and upload files
for root, dirs, files in os.walk(OPTIMIZED_MODEL_DIR):
    for file in files:
        local_file_path = os.path.join(root, file)
        s3_file_path = os.path.join(S3_PATH, local_file_path[len(OPTIMIZED_MODEL_DIR)+1:])
        s3_client.upload_file(local_file_path, s3_bucket_name, s3_file_path)
        print(f'Uploaded {local_file_path}')

print('Finished uploading of quantizied model')

Starting results upload.
Uploading predictions to bucket models to S3 storage at http://minio-service.minio.svc.cluster.local:9000
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/merges.txt
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/model.safetensors
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/added_tokens.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/tokenizer_config.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/chat_template.jinja
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/granite-3.2-2b-instruct-W4A16/config.json
Uploaded /opt/app-root/src/showroom-summit2025-lb2959-neural-magic/lab-materials/03/