# Quantization of `granite-3.3-2b-instruct` model

Recall that our overall solution uses a quantized version of the model `granite-3.3-2b-instruct`. In this lab, we take the base model `granite-3.3-2b-instruct` and quantize it to `W4A16`, a scheme that stores weights as 4-bit integers (INT4) and keeps activations in 16-bit floating point (BF16). The smaller weights provide memory savings, and `vLLM` can use them to accelerate inference.
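
As a rough illustration of the memory savings, the sketch below estimates weight storage at 16-bit and at 4-bit precision. The parameter count (2.5 billion) and the assumption of one 16-bit scale per group of 128 weights are illustrative values, not measurements of this model.

```python
from typing import Optional


def weight_memory_gib(num_params: float, bits_per_weight: float,
                      group_size: Optional[int] = None,
                      scale_bits: int = 16) -> float:
    """Approximate weight storage in GiB, optionally adding one scale per group."""
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        # Each group of weights shares one scale value.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1024**3


NUM_PARAMS = 2.5e9  # assumed parameter count, for illustration only

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4, group_size=128)

print(f"BF16 weights: {bf16_gib:.2f} GiB")
print(f"INT4 weights (group size 128): {int4_gib:.2f} GiB")
print(f"Reduction: {bf16_gib / int4_gib:.1f}x")
```

In practice the saving is smaller than this estimate, because layers that are excluded from quantization (such as `lm_head`) stay in 16-bit precision.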

**Note**: `W4A16` computation is supported on NVIDIA GPUs with compute capability 7.5 or higher (Turing, Ampere, Ada Lovelace, Hopper).
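
A quick way to check the GPU before starting is sketched below. It treats 7.5 (Turing) as the minimum capability, as the architecture list in the note suggests; `supports_w4a16` is a hypothetical helper name, and the PyTorch query only runs if `torch` is installed and a CUDA device is visible.

```python
from typing import Tuple


def supports_w4a16(capability: Tuple[int, int],
                   minimum: Tuple[int, int] = (7, 5)) -> bool:
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= minimum


try:
    import torch

    if torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple, e.g. (8, 6)
        cap = torch.cuda.get_device_capability(0)
        print(f"GPU 0 compute capability {cap}: supported = {supports_w4a16(cap)}")
    else:
        print("No CUDA device is visible.")
except ImportError:
    print("PyTorch is not installed; skipping the GPU check.")
```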

**Note**: The steps here take around 20-30 minutes, depending on connectivity. The most time-consuming steps are the installation of `llmcompressor` (up to 5 minutes) and the quantization itself (10-15 minutes).

## Setting up llm-compressor

Installing `llmcompressor` may take a few minutes, depending on the available bandwidth. Note the version of the `transformers` library we pin. There is a known issue (*torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow*) when the latest `transformers` release (version `4.53.2` as of July 17, 2025) is combined with the latest version of `llmcompressor` (version `0.6.0`), so we install `transformers==4.52.2` instead.

In [1]:
!pip install -q llmcompressor==0.6.0 transformers==4.52.2 boto3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
vllm 0.10.0 requires transformers>=4.53.2, but you have transformers 4.52.2 which is incompatible.

[notice] A new release of pip is available: 24.2 -> 25.2
[notice] To update, run: pip install --upgrade pip


Let's make sure the right versions are installed.

In [2]:
!pip list | grep llmcompressor

llmcompressor                     0.6.0


In [3]:
!pip list | grep transformer

transformers                      4.52.2
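
The same check can be done from Python instead of `pip list | grep`. This is a minimal sketch using the standard library; `find_version_mismatches` is a hypothetical helper, and the pinned versions are the ones installed above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

EXPECTED = {"llmcompressor": "0.6.0", "transformers": "4.52.2"}


def find_version_mismatches(expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {package: installed version or None} for packages that differ."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


print(find_version_mismatches(EXPECTED) or "All pinned versions match.")
```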


## Let's start with the quantization of the model

There are 6 steps:
1. Loading the model
2. Choosing the quantization scheme and method
3. Preparing the calibration data
4. Applying quantization
5. Saving the model
6. Evaluation of accuracy in vLLM

### Loading the model

First, let's download the model from the local S3 storage and then load it with `AutoModelForCausalLM`, which handles saving and loading of quantized models.

In [4]:
import os
from boto3 import client

def download_model_from_s3(s3_path, local_model_dir):
    print('Starting download of model from S3')
    
    # Get S3 credentials from environment
    s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
    s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
    s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
    s3_bucket_name = os.environ["AWS_S3_BUCKET"]

    print(f'Downloading model from bucket {s3_bucket_name} '
          f'path {s3_path} from S3 storage at {s3_endpoint_url}')
    print(f'Target local directory: {local_model_dir}')

    s3_client = client(
        's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
        aws_secret_access_key=s3_secret_key, verify=False
    )

    os.makedirs(local_model_dir, exist_ok=True)

    print(f'Listing objects with prefix: {s3_path}')
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=s3_bucket_name, Prefix=s3_path)

    downloaded_files = []
    total_objects = 0
    
    for page in pages:
        if 'Contents' in page:
            for obj in page['Contents']:
                total_objects += 1
                s3_key = obj['Key']
                print(f'Found S3 object: {s3_key}')
                
                # Skip if it's just a directory marker
                if s3_key.endswith('/'):
                    print(f'Skipping directory marker: {s3_key}')
                    continue
                
                relative_path = s3_key[len(s3_path):].lstrip('/')
                local_file_path = os.path.join(local_model_dir, relative_path)
                
                print(f'S3 key: {s3_key}')
                print(f'Local file path: {local_file_path}')
                
                local_dir = os.path.dirname(local_file_path)
                if local_dir:
                    os.makedirs(local_dir, exist_ok=True)
                    print(f'Created directory: {local_dir}')
                
                try:
                    s3_client.download_file(s3_bucket_name, s3_key, local_file_path)
                    downloaded_files.append(local_file_path)
                    print(f'Successfully downloaded {s3_key} to {local_file_path}')
                except Exception as e:
                    print(f'Error downloading {s3_key}: {e}')
                    raise

    print(f'Total objects found: {total_objects}')
    print(f'Files downloaded: {len(downloaded_files)}')
    
    config_path = os.path.join(local_model_dir, 'config.json')
    if os.path.exists(config_path):
        print(f'Verified: config.json found at {config_path}')
    else:
        print(f'Warning: config.json not found at {config_path}')
    
    print('Finished downloading model from S3')
    return local_model_dir
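
The step in `download_model_from_s3` that turns an S3 key into a local file path can be looked at in isolation. The sketch below repeats that logic in a hypothetical helper, `s3_key_to_local_path`; the example key is illustrative.

```python
import os


def s3_key_to_local_path(s3_key: str, s3_prefix: str, local_dir: str) -> str:
    """Strip the S3 prefix from the key and join the remainder onto local_dir."""
    relative_path = s3_key[len(s3_prefix):].lstrip("/")
    return os.path.join(local_dir, relative_path)


print(s3_key_to_local_path(
    "ibm-granite/granite-3.3-2b-instruct/config.json",
    "ibm-granite/granite-3.3-2b-instruct/",
    "/opt/app-root/src/granite-3.3-2b-instruct",
))
```

Because of the `lstrip("/")`, the result is the same whether or not the prefix ends with a slash.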

In [5]:
current_dir = os.getcwd()
BASE_MODEL_DIR = current_dir + "/granite-3.3-2b-instruct"
S3_MODEL_PATH = "ibm-granite/granite-3.3-2b-instruct/"  

model_path = download_model_from_s3(S3_MODEL_PATH, BASE_MODEL_DIR)

# use the downloaded model path for tokenizer and model loading
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto"
)

Starting download of model from S3
Downloading model from bucket models path ibm-granite/granite-3.3-2b-instruct/ from S3 storage at http://minio-service.minio.svc.cluster.local:9000
Target local directory: /opt/app-root/src/granite-3.3-2b-instruct
Listing objects with prefix: ibm-granite/granite-3.3-2b-instruct/
Found S3 object: ibm-granite/granite-3.3-2b-instruct/.gitattributes
S3 key: ibm-granite/granite-3.3-2b-instruct/.gitattributes
Local file path: /opt/app-root/src/granite-3.3-2b-instruct/.gitattributes
Created directory: /opt/app-root/src/granite-3.3-2b-instruct
Successfully downloaded ibm-granite/granite-3.3-2b-instruct/.gitattributes to /opt/app-root/src/granite-3.3-2b-instruct/.gitattributes
Found S3 object: ibm-granite/granite-3.3-2b-instruct/README.md
S3 key: ibm-granite/granite-3.3-2b-instruct/README.md
Local file path: /opt/app-root/src/granite-3.3-2b-instruct/README.md
Created directory: /opt/app-root/src/granite-3.3-2b-instruct
Successfully downloaded ibm-granite/grani

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.44it/s]


Alternatively, the model can be loaded directly from Hugging Face as follows:

In [6]:
# from transformers import AutoTokenizer, AutoModelForCausalLM

# MODEL_ID = "ibm-granite/granite-3.3-2b-instruct"
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID, device_map="auto", torch_dtype="auto",
# )
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

### Prepare calibration data

Prepare the calibration data. When quantizing the weights of a model to INT4 using GPTQ, we need sample data to run the GPTQ algorithm. It is therefore very important to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.

In our case, we are quantizing a generic instruction-tuned model, so we will use the `neuralmagic/LLM_compression_calibration` dataset. Some best practices include:
- 512 samples is a good place to start (increase if accuracy drops). Fewer samples, such as 256, speed up the process; we use 512 here.
- 2048 sequence length is a good place to start
- Use the chat template or instruction template that the model is trained with
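
The last recommendation can be sketched as follows for a dataset that stores conversations as lists of role/content messages. The `messages` column name is an assumption for illustration, and `preprocess_with_chat_template` is a hypothetical helper. `apply_chat_template` is the tokenizer method in `transformers` that renders such a list with the model's own template.

```python
def preprocess_with_chat_template(example: dict, tokenizer) -> dict:
    """Render a list of chat messages into one string using the model's template."""
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}


# Usage with the objects from this notebook (assumes a "messages" column exists):
# ds = ds.map(lambda ex: preprocess_with_chat_template(ex, tokenizer))
```

When the dataset already holds ready-to-use text, as in the cell below, this step is not needed.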


In [7]:
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # increase (e.g. to 1024) if accuracy drops
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

# Load dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. For W4A16, we want to:
- Run SmoothQuant to rescale activations and weights so that outliers are easier to handle
- Quantize the weights to 4 bits with group-wise scales using GPTQ
- Keep the activations in 16-bit floating point (they are not quantized)

**Note**: The quantization step takes a long time to complete due to the calibration requirements, around 10-15 minutes, depending on the GPU.

### Imports and definitions

**GPTQModifier**: Applies GPTQ, a post-training quantization method for generative pre-trained transformers, for weight-only quantization.

**SmoothQuantModifier**: Prepares model activations for smoother quantization by scaling internal activations and weights.

**oneshot**: High-level API that applies your quantization recipe in one go.

In [8]:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

### Hyperparameters

Rationale
- **DAMPENING_FRAC=0.1** adds damping to the Hessian used by GPTQ, which prevents extreme weight updates during quantization.
- **OBSERVER="mse"** selects quantization scales by minimizing the mean-squared error between original and quantized values.
- **GROUP_SIZE=128** is the number of weights that share one scale in group-wise quantization; 128 is a typical default.
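
To see why the group size matters, the sketch below compares one scale for a whole row of weights with one scale per small group. It uses plain round-to-nearest symmetric INT4 quantization, which is a simplification: GPTQ additionally compensates for rounding error using the Hessian. The weight values and the group size of 4 are made up for illustration.

```python
def quantize_group_int4(weights):
    """Symmetric INT4 quantization of one group: returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size):
    """Quantize a row group by group and return the dequantized values."""
    dequantized = []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group_int4(row[start:start + group_size])
        dequantized.extend(code * scale for code in codes)
    return dequantized


def mse(a, b):
    """Mean-squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Small weights followed by large ones: a single scale wipes out the small values.
row = [0.02, -0.01, 0.03, -0.04, 1.5, -2.0, 0.9, 1.1]
single_scale = quantize_row(row, group_size=len(row))
grouped = quantize_row(row, group_size=4)

print(f"MSE with one scale for the row: {mse(row, single_scale):.6f}")
print(f"MSE with group size 4:          {mse(row, grouped):.6f}")
```

Smaller groups track local weight magnitudes more closely, at the cost of storing more scales.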

In [9]:
DAMPENING_FRAC = 0.1  # Hessian damping fraction; prevents extreme weight updates
OBSERVER = "mse"  # choose scales by minimizing mean-squared quantization error
GROUP_SIZE = 128  # weights per quantization group (not passed to the recipe below)
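
The effect of dampening can be shown on a tiny example. In GPTQ, a fraction of the mean Hessian diagonal is added to the diagonal before the Hessian is inverted. The sketch below assumes that `dampening_frac` plays this role; the 2x2 matrix is made up, and the exact `llmcompressor` implementation may differ in detail.

```python
def dampen_hessian(hessian, dampening_frac):
    """Add dampening_frac * mean(diag(H)) to every diagonal entry of H."""
    n = len(hessian)
    damp = dampening_frac * sum(hessian[i][i] for i in range(n)) / n
    return [[hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# A nearly singular Hessian: the two input channels are almost identical.
H = [[4.0, 3.999], [3.999, 4.0]]
H_damped = dampen_hessian(H, dampening_frac=0.1)

print(f"determinant before dampening: {det2(H):.4f}")
print(f"determinant after dampening:  {det2(H_damped):.4f}")
```

A determinant close to zero means the inverse contains very large entries, which translate into extreme weight updates; dampening moves the matrix away from that regime.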

### Layer Mappings & Ignoring Heads

Logic

- **ignore=["lm_head"]** skips quantization on the output layer to preserve final logits and maintain accuracy.
- **mappings** pair groups of linear projections (q, k, v, gate, up/down projections) with the layer that feeds them (a layernorm block or, for `down_proj`, `up_proj`). SmoothQuant uses these pairs to shift scale from activations into weights, which gives a distribution that is easier to quantize.

In [10]:
ignore = ["lm_head"]
mappings = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
    [["re:.*down_proj"], "re:.*up_proj"]
]
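
The `re:` entries above are regular expressions over module names. The sketch below shows which names from this model such patterns would select. The matching rule (regular expression after the `re:` prefix, exact name otherwise) is an assumption about how the library interprets these strings, and `matches_target` is a hypothetical helper.

```python
import re


def matches_target(target: str, module_name: str) -> bool:
    """Treat 're:'-prefixed targets as regular expressions, others as exact names."""
    if target.startswith("re:"):
        return re.match(target[len("re:"):], module_name) is not None
    return target == module_name


patterns = ["re:.*q_proj", "re:.*up_proj", "re:.*input_layernorm", "lm_head"]
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]

for name in module_names:
    hits = [p for p in patterns if matches_target(p, name)]
    print(f"{name}: {hits}")
```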

### Recipe Definition

**Workflow**

- **SmoothQuantModifier**: Rescales activations across paired layers before quantization to reduce outliers (`smoothing_strength=0.7` is strong smoothing, but not extreme).
- **GPTQModifier**: Performs weight-only quantization (4-bit weights, 16-bit activations) on all `Linear` layers except those ignored, applying the dampening and observer settings. The `"W4A16"` scheme reduces model size while maintaining decent accuracy.
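
The rescaling behind SmoothQuant can be sketched on a toy layer. Following the SmoothQuant formulation, each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by it and the matching weights are multiplied by it, so the layer output is unchanged. The matrices are made up, and treating `smoothing_strength` as this alpha is an assumption.

```python
def smoothquant_scales(x, w, alpha):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        weight_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


X = [[0.5, 40.0], [-0.3, -35.0]]  # tokens x channels; channel 1 has outliers
W = [[0.8, -0.6], [0.02, 0.01]]   # W[j] holds the weights of input channel j

s = smoothquant_scales(X, W, alpha=0.7)
X_smooth = [[row[j] / s[j] for j in range(len(s))] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(len(s))]

print("scales:", [round(v, 3) for v in s])
print("largest activation before:", max(abs(v) for row in X for v in row))
print("largest activation after: ", round(max(abs(v) for row in X_smooth for v in row), 3))
print("output before:", matmul(X, W))
print("output after: ", matmul(X_smooth, W_smooth))
```

In a real model the division of the activations is not done at run time; it is folded into the preceding layer, which is why each mapping names a layer to smooth alongside the projections it feeds.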

In [11]:
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
    GPTQModifier(
        targets=["Linear"],
        ignore=ignore,
        scheme="W4A16",
        dampening_frac=DAMPENING_FRAC,
        observer=OBSERVER,
    )
]

### Quantize in One Shot

**How It Works**

- Feeds the calibration dataset through the model to gather activation statistics.
- Applies SmoothQuant rescaling and then GPTQ quantization, each processed sequentially, layer by layer.
- **max_seq_length=8196** caps the length of each calibration sample, so that long contexts are still covered during calibration.

In [12]:
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=8196,
)

2025-08-16T09:15:44.561080+0000 | reset | INFO - Compression lifecycle reset
2025-08-16T09:15:44.564115+0000 | from_modifiers | INFO - Creating recipe from modifiers


2025-08-16T09:15:45.933448+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-08-16T09:15:45.934151+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `SmoothQuantModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 1501.20it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 283.13it/s]

2025-08-16T09:15:50.352219+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.input_layernorm





2025-08-16T09:15:50.400204+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.post_attention_layernorm
2025-08-16T09:15:50.402220+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.mlp.up_proj


(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 355.18it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 473.82it/s]

2025-08-16T09:15:53.045000+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.input_layernorm
2025-08-16T09:15:53.046173+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.post_attention_layernorm
2025-08-16T09:15:53.047534+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.mlp.up_proj



(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.40it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 472.21it/s]

2025-08-16T09:15:55.250345+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.input_layernorm
2025-08-16T09:15:55.251621+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.post_attention_layernorm
2025-08-16T09:15:55.252929+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.mlp.up_proj



(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.04it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 472.32it/s]

2025-08-16T09:15:57.460846+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.input_layernorm
2025-08-16T09:15:57.462022+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.post_attention_layernorm
2025-08-16T09:15:57.463400+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.mlp.up_proj



(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.48it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 471.05it/s]

2025-08-16T09:15:59.662461+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.input_layernorm
2025-08-16T09:15:59.663684+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.post_attention_layernorm
2025-08-16T09:15:59.665042+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.mlp.up_proj



(5/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.05it/s]
(6/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.70it/s]

2025-08-16T09:16:01.861968+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.input_layernorm
2025-08-16T09:16:01.863318+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.post_attention_layernorm
2025-08-16T09:16:01.864670+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.mlp.up_proj



(6/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.36it/s]
(7/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 470.71it/s]

2025-08-16T09:16:04.063151+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.input_layernorm
2025-08-16T09:16:04.064236+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.post_attention_layernorm
2025-08-16T09:16:04.065615+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.mlp.up_proj



(7/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.92it/s]
(8/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.83it/s]

2025-08-16T09:16:06.262265+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.input_layernorm
2025-08-16T09:16:06.263296+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.post_attention_layernorm
2025-08-16T09:16:06.264622+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.mlp.up_proj



(8/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 476.75it/s]
(9/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.78it/s]

2025-08-16T09:16:08.457422+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.input_layernorm
2025-08-16T09:16:08.458720+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.post_attention_layernorm
2025-08-16T09:16:08.460111+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.mlp.up_proj



(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.04it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 471.60it/s]

2025-08-16T09:16:10.650769+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.input_layernorm
2025-08-16T09:16:10.651996+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.post_attention_layernorm
2025-08-16T09:16:10.653310+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.mlp.up_proj



(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.45it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 471.68it/s]

2025-08-16T09:16:12.849016+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.input_layernorm
2025-08-16T09:16:12.850135+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.post_attention_layernorm
2025-08-16T09:16:12.851434+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.mlp.up_proj



(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.63it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 474.65it/s]

2025-08-16T09:16:15.037823+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.input_layernorm
2025-08-16T09:16:15.039108+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.post_attention_layernorm
2025-08-16T09:16:15.040495+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.mlp.up_proj



(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.76it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 468.67it/s]

2025-08-16T09:16:17.248231+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.input_layernorm
2025-08-16T09:16:17.249620+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.post_attention_layernorm
2025-08-16T09:16:17.250963+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.mlp.up_proj



(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.11it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 473.51it/s]

2025-08-16T09:16:19.450255+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.input_layernorm
2025-08-16T09:16:19.451321+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.post_attention_layernorm
2025-08-16T09:16:19.452614+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.mlp.up_proj



(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.67it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 461.73it/s]

2025-08-16T09:16:21.710554+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.input_layernorm
2025-08-16T09:16:21.711753+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.post_attention_layernorm
2025-08-16T09:16:21.713227+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.mlp.up_proj



(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 476.96it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 462.35it/s]

2025-08-16T09:16:23.946279+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.input_layernorm
2025-08-16T09:16:23.947703+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.post_attention_layernorm
2025-08-16T09:16:23.949068+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.mlp.up_proj



(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.52it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.54it/s]

2025-08-16T09:16:26.176759+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.input_layernorm
2025-08-16T09:16:26.178015+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.post_attention_layernorm
2025-08-16T09:16:26.179391+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.mlp.up_proj



(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 477.35it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.33it/s]

2025-08-16T09:16:28.400208+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.input_layernorm
2025-08-16T09:16:28.401563+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.post_attention_layernorm
2025-08-16T09:16:28.402944+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.mlp.up_proj



(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.55it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 463.13it/s]

2025-08-16T09:16:30.649543+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.input_layernorm
2025-08-16T09:16:30.650997+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.post_attention_layernorm
2025-08-16T09:16:30.652471+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.mlp.up_proj



(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.57it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 465.47it/s]

2025-08-16T09:16:32.885470+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.input_layernorm
2025-08-16T09:16:32.886541+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.post_attention_layernorm
2025-08-16T09:16:32.887935+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.mlp.up_proj



(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 476.38it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 458.79it/s]

2025-08-16T09:16:35.142315+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.input_layernorm
2025-08-16T09:16:35.143549+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.post_attention_layernorm
2025-08-16T09:16:35.144930+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.mlp.up_proj



(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.14it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 470.52it/s]

2025-08-16T09:16:37.374262+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.input_layernorm
2025-08-16T09:16:37.375677+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.post_attention_layernorm
2025-08-16T09:16:37.377125+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.mlp.up_proj



(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.84it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.86it/s]

2025-08-16T09:16:39.612846+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.input_layernorm
2025-08-16T09:16:39.613965+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.post_attention_layernorm
2025-08-16T09:16:39.615283+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.mlp.up_proj



(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.49it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 468.69it/s]

2025-08-16T09:16:41.861126+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.input_layernorm
2025-08-16T09:16:41.862369+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.post_attention_layernorm
2025-08-16T09:16:41.863732+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.mlp.up_proj



(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.85it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 467.70it/s]

2025-08-16T09:16:44.100733+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.input_layernorm
2025-08-16T09:16:44.102112+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.post_attention_layernorm
2025-08-16T09:16:44.103315+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.mlp.up_proj



(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 475.99it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.40it/s]

2025-08-16T09:16:46.336915+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.input_layernorm
2025-08-16T09:16:46.338228+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.post_attention_layernorm
2025-08-16T09:16:46.339417+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.mlp.up_proj



(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.31it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 468.34it/s]

2025-08-16T09:16:48.576246+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.input_layernorm
2025-08-16T09:16:48.577727+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.post_attention_layernorm
2025-08-16T09:16:48.579138+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.mlp.up_proj



(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 431.95it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 467.83it/s]

2025-08-16T09:16:50.923264+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.input_layernorm
2025-08-16T09:16:50.924544+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.post_attention_layernorm
2025-08-16T09:16:50.925779+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.mlp.up_proj



(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 474.73it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.15it/s]

2025-08-16T09:16:53.159316+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.input_layernorm
2025-08-16T09:16:53.160392+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.post_attention_layernorm
2025-08-16T09:16:53.161843+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.mlp.up_proj



(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.49it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 470.30it/s]

2025-08-16T09:16:55.403942+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.input_layernorm
2025-08-16T09:16:55.405137+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.post_attention_layernorm
2025-08-16T09:16:55.406469+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.mlp.up_proj



(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.45it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.63it/s]

2025-08-16T09:16:57.647297+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.input_layernorm
2025-08-16T09:16:57.648744+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.post_attention_layernorm
2025-08-16T09:16:57.650139+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.mlp.up_proj



(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.30it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.16it/s]

2025-08-16T09:16:59.905626+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.input_layernorm
2025-08-16T09:16:59.907094+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.post_attention_layernorm
2025-08-16T09:16:59.908428+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.mlp.up_proj



(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.33it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 469.50it/s]

2025-08-16T09:17:02.144471+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.input_layernorm
2025-08-16T09:17:02.145525+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.post_attention_layernorm
2025-08-16T09:17:02.146893+0000 | _apply_smoothing | INFO - Smoothing with model.layers.32.mlp.up_proj



(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.37it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 467.30it/s]

2025-08-16T09:17:04.391271+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.input_layernorm
2025-08-16T09:17:04.392705+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.post_attention_layernorm
2025-08-16T09:17:04.394087+0000 | _apply_smoothing | INFO - Smoothing with model.layers.33.mlp.up_proj



(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.91it/s]
(35/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 466.90it/s]

2025-08-16T09:17:06.653328+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.input_layernorm
2025-08-16T09:17:06.654813+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.post_attention_layernorm
2025-08-16T09:17:06.656299+0000 | _apply_smoothing | INFO - Smoothing with model.layers.34.mlp.up_proj



(35/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 457.82it/s]
(36/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 456.10it/s]

2025-08-16T09:17:08.951607+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.input_layernorm
2025-08-16T09:17:08.952877+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.post_attention_layernorm
2025-08-16T09:17:08.954191+0000 | _apply_smoothing | INFO - Smoothing with model.layers.35.mlp.up_proj



(36/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.22it/s]
(37/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 459.15it/s]

2025-08-16T09:17:11.231150+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.input_layernorm
2025-08-16T09:17:11.232583+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.post_attention_layernorm
2025-08-16T09:17:11.233946+0000 | _apply_smoothing | INFO - Smoothing with model.layers.36.mlp.up_proj



(37/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.61it/s]
(38/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 450.56it/s]

2025-08-16T09:17:13.519312+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.input_layernorm
2025-08-16T09:17:13.520406+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.post_attention_layernorm
2025-08-16T09:17:13.521846+0000 | _apply_smoothing | INFO - Smoothing with model.layers.37.mlp.up_proj



(38/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.05it/s]
(39/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 451.78it/s]

2025-08-16T09:17:15.793193+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.input_layernorm
2025-08-16T09:17:15.794547+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.post_attention_layernorm
2025-08-16T09:17:15.796050+0000 | _apply_smoothing | INFO - Smoothing with model.layers.38.mlp.up_proj



(39/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.34it/s]
(40/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 450.85it/s]

2025-08-16T09:17:18.100023+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.input_layernorm
2025-08-16T09:17:18.101378+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.post_attention_layernorm
2025-08-16T09:17:18.102753+0000 | _apply_smoothing | INFO - Smoothing with model.layers.39.mlp.up_proj



(40/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 473.77it/s]
(41/41): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 298.52it/s]
(41/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 298.50it/s]


2025-08-16T09:17:22.786710+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 2146.16it/s]
(1/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.32it/s]

2025-08-16T09:17:33.260223+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-08-16T09:17:34.504364+0000 | compress | METRIC - time 1.24s
2025-08-16T09:17:34.505432+0000 | compress | METRIC - error 22241.08
2025-08-16T09:17:34.506363+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:17:34.506816+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:17:34.507720+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2025-08-16T09:17:35.318101+0000 | compress | METRIC - time 0.81s
2025-08-16T09:17:35.318818+0000 | compress | METRIC - error 7729.28
2025-08-16T09:17:35.319623+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:17:35.320105+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:17:35.321097+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
2025-08-16T09:17:36.109252+0000 | compress | METRIC - time 0.79s
2025-08-16T09:17:36.110176+0000 | compress | METRI

(1/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 398.27it/s]
(2/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.98it/s]

2025-08-16T09:17:51.019436+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples
2025-08-16T09:17:51.839451+0000 | compress | METRIC - time 0.82s
2025-08-16T09:17:51.840305+0000 | compress | METRIC - error 13780.00
2025-08-16T09:17:51.840999+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:17:51.841615+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:17:51.842303+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
2025-08-16T09:17:52.705861+0000 | compress | METRIC - time 0.86s
2025-08-16T09:17:52.706699+0000 | compress | METRIC - error 12779.66
2025-08-16T09:17:52.707499+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:17:52.707968+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:17:52.708825+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 512 samples
2025-08-16T09:17:53.503001+0000 | compress | METRIC - time 0.79s
2025-08-16T09:17:53.503856+0000 | compress | METR

(2/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 471.88it/s]
(3/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.90it/s]

2025-08-16T09:18:08.116535+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 512 samples
2025-08-16T09:18:08.926492+0000 | compress | METRIC - time 0.81s
2025-08-16T09:18:08.927464+0000 | compress | METRIC - error 16020.69
2025-08-16T09:18:08.928162+0000 | compress | METRIC - GPU 0 | usage: 25.47% | total memory: 24 GB
2025-08-16T09:18:08.928592+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:18:08.929292+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 512 samples
2025-08-16T09:18:09.710154+0000 | compress | METRIC - time 0.78s
2025-08-16T09:18:09.711089+0000 | compress | METRIC - error 8008.41
2025-08-16T09:18:09.711644+0000 | compress | METRIC - GPU 0 | usage: 25.47% | total memory: 24 GB
2025-08-16T09:18:09.712059+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:18:09.712854+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 512 samples
2025-08-16T09:18:10.491865+0000 | compress | METRIC - time 0.78s
2025-08-16T09:18:10.492901+0000 | compress | METRI

(3/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.91it/s]
(4/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.15it/s]

2025-08-16T09:18:24.981316+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 512 samples
2025-08-16T09:18:25.787536+0000 | compress | METRIC - time 0.81s
2025-08-16T09:18:25.788628+0000 | compress | METRIC - error 21088.79
2025-08-16T09:18:25.789534+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:18:25.790071+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:18:25.790992+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 512 samples
2025-08-16T09:18:26.571161+0000 | compress | METRIC - time 0.78s
2025-08-16T09:18:26.572129+0000 | compress | METRIC - error 9300.08
2025-08-16T09:18:26.573041+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:18:26.573497+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:18:26.574410+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 512 samples
2025-08-16T09:18:27.363596+0000 | compress | METRIC - time 0.79s
2025-08-16T09:18:27.364581+0000 | compress | METRI

(4/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.28it/s]
(5/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.88it/s]

2025-08-16T09:18:41.973126+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 512 samples
2025-08-16T09:18:42.796914+0000 | compress | METRIC - time 0.82s
2025-08-16T09:18:42.797861+0000 | compress | METRIC - error 26743.89
2025-08-16T09:18:42.798404+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:18:42.799033+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:18:42.799667+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 512 samples
2025-08-16T09:18:43.582834+0000 | compress | METRIC - time 0.78s
2025-08-16T09:18:43.583837+0000 | compress | METRIC - error 12433.05
2025-08-16T09:18:43.584574+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:18:43.585084+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:18:43.585761+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 512 samples
2025-08-16T09:18:44.371251+0000 | compress | METRIC - time 0.79s
2025-08-16T09:18:44.372219+0000 | compress | METR

(9/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.73it/s]
(10/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.66it/s]

2025-08-16T09:20:07.202289+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 512 samples
2025-08-16T09:20:08.007489+0000 | compress | METRIC - time 0.80s
2025-08-16T09:20:08.008422+0000 | compress | METRIC - error 35187.00
2025-08-16T09:20:08.009168+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:20:08.009537+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:20:08.010251+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 512 samples
2025-08-16T09:20:08.782733+0000 | compress | METRIC - time 0.77s
2025-08-16T09:20:08.783701+0000 | compress | METRIC - error 12908.41
2025-08-16T09:20:08.784345+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:20:08.784721+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:20:08.785392+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 512 samples
2025-08-16T09:20:09.558123+0000 | compress | METRIC - time 0.77s
2025-08-16T09:20:09.559100+0000 | compress | METR

(10/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.34it/s]
(11/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.49it/s]

2025-08-16T09:20:24.164930+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 512 samples
2025-08-16T09:20:24.982603+0000 | compress | METRIC - time 0.82s
2025-08-16T09:20:24.983529+0000 | compress | METRIC - error 30414.69
2025-08-16T09:20:24.984250+0000 | compress | METRIC - GPU 0 | usage: 25.55% | total memory: 24 GB
2025-08-16T09:20:24.984625+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:20:24.985286+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 512 samples
2025-08-16T09:20:25.770865+0000 | compress | METRIC - time 0.79s
2025-08-16T09:20:25.771834+0000 | compress | METRIC - error 11236.39
2025-08-16T09:20:25.772562+0000 | compress | METRIC - GPU 0 | usage: 25.55% | total memory: 24 GB
2025-08-16T09:20:25.772948+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:20:25.773667+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 512 samples
2025-08-16T09:20:26.554333+0000 | compress | METRIC - time 0.78s
2025-08-16T09:20:26.555256+0000 | compress | ME

(11/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.71it/s]
(12/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.42it/s]

2025-08-16T09:20:41.269110+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 512 samples
2025-08-16T09:20:42.091831+0000 | compress | METRIC - time 0.82s
2025-08-16T09:20:42.092812+0000 | compress | METRIC - error 29898.63
2025-08-16T09:20:42.093636+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:20:42.094198+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:20:42.094855+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 512 samples
2025-08-16T09:20:42.881229+0000 | compress | METRIC - time 0.79s
2025-08-16T09:20:42.882225+0000 | compress | METRIC - error 12083.06
2025-08-16T09:20:42.883072+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:20:42.883508+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:20:42.884365+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 512 samples
2025-08-16T09:20:43.669034+0000 | compress | METRIC - time 0.78s
2025-08-16T09:20:43.669979+0000 | compress | ME

(12/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.61it/s]
(13/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.30it/s]

2025-08-16T09:20:58.383503+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 512 samples
2025-08-16T09:20:59.198113+0000 | compress | METRIC - time 0.81s
2025-08-16T09:20:59.199089+0000 | compress | METRIC - error 45231.54
2025-08-16T09:20:59.199697+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:20:59.200075+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:20:59.200712+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 512 samples
2025-08-16T09:20:59.984556+0000 | compress | METRIC - time 0.78s
2025-08-16T09:20:59.985570+0000 | compress | METRIC - error 22084.02
2025-08-16T09:20:59.986355+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:20:59.986716+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:20:59.987373+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 512 samples
2025-08-16T09:21:00.779884+0000 | compress | METRIC - time 0.79s
2025-08-16T09:21:00.780928+0000 | compress | ME

(13/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.46it/s]
(14/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.00it/s]

2025-08-16T09:21:15.532568+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples
2025-08-16T09:21:16.347391+0000 | compress | METRIC - time 0.81s
2025-08-16T09:21:16.348387+0000 | compress | METRIC - error 44073.83
2025-08-16T09:21:16.349086+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:21:16.349658+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:21:16.350230+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2025-08-16T09:21:17.134383+0000 | compress | METRIC - time 0.78s
2025-08-16T09:21:17.135359+0000 | compress | METRIC - error 21831.96
2025-08-16T09:21:17.136134+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:21:17.136577+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:21:17.137437+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 512 samples
2025-08-16T09:21:17.919102+0000 | compress | METRIC - time 0.78s
2025-08-16T09:21:17.920042+0000 | compress | ME

(14/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.78it/s]
(15/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.43it/s]

2025-08-16T09:21:32.572384+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 512 samples
2025-08-16T09:21:33.389502+0000 | compress | METRIC - time 0.82s
2025-08-16T09:21:33.390539+0000 | compress | METRIC - error 46560.04
2025-08-16T09:21:33.391303+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:21:33.391667+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:21:33.392333+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 512 samples
2025-08-16T09:21:34.175233+0000 | compress | METRIC - time 0.78s
2025-08-16T09:21:34.176194+0000 | compress | METRIC - error 23445.62
2025-08-16T09:21:34.177221+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:21:34.177600+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:21:34.178273+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 512 samples
2025-08-16T09:21:34.956990+0000 | compress | METRIC - time 0.78s
2025-08-16T09:21:34.957960+0000 | compress | ME

(15/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.43it/s]
(16/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.74it/s]

2025-08-16T09:21:49.551322+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 512 samples
2025-08-16T09:21:50.413707+0000 | compress | METRIC - time 0.86s
2025-08-16T09:21:50.414647+0000 | compress | METRIC - error 44732.01
2025-08-16T09:21:50.415358+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:21:50.415709+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:21:50.416439+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 512 samples
2025-08-16T09:21:51.201907+0000 | compress | METRIC - time 0.79s
2025-08-16T09:21:51.202873+0000 | compress | METRIC - error 18848.61
2025-08-16T09:21:51.203381+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:21:51.203750+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:21:51.204571+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 512 samples
2025-08-16T09:21:51.983860+0000 | compress | METRIC - time 0.78s
2025-08-16T09:21:51.984832+0000 | compress | ME

(16/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.67it/s]
(17/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.81it/s]

2025-08-16T09:22:06.562028+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 512 samples
2025-08-16T09:22:07.368460+0000 | compress | METRIC - time 0.81s
2025-08-16T09:22:07.369404+0000 | compress | METRIC - error 50821.28
2025-08-16T09:22:07.370008+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:22:07.370372+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:22:07.371007+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 512 samples
2025-08-16T09:22:08.151284+0000 | compress | METRIC - time 0.78s
2025-08-16T09:22:08.152226+0000 | compress | METRIC - error 18177.61
2025-08-16T09:22:08.152869+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:22:08.153236+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:22:08.153919+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 512 samples
2025-08-16T09:22:08.931940+0000 | compress | METRIC - time 0.78s
2025-08-16T09:22:08.932932+0000 | compress | ME

(17/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 472.70it/s]
(18/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.97it/s]

2025-08-16T09:22:23.473195+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 512 samples
2025-08-16T09:22:24.307419+0000 | compress | METRIC - time 0.83s
2025-08-16T09:22:24.308429+0000 | compress | METRIC - error 45939.96
2025-08-16T09:22:24.309158+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:22:24.309535+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:22:24.310211+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 512 samples
2025-08-16T09:22:25.097242+0000 | compress | METRIC - time 0.79s
2025-08-16T09:22:25.098221+0000 | compress | METRIC - error 22757.23
2025-08-16T09:22:25.098968+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:22:25.099334+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:22:25.100022+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 512 samples
2025-08-16T09:22:25.922070+0000 | compress | METRIC - time 0.82s
2025-08-16T09:22:25.923012+0000 | compress | ME

(18/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 470.80it/s]
(19/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.80it/s]

2025-08-16T09:22:40.485376+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 512 samples
2025-08-16T09:22:41.306885+0000 | compress | METRIC - time 0.82s
2025-08-16T09:22:41.307862+0000 | compress | METRIC - error 45650.38
2025-08-16T09:22:41.308573+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:22:41.308954+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:22:41.309607+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 512 samples
2025-08-16T09:22:42.088756+0000 | compress | METRIC - time 0.78s
2025-08-16T09:22:42.089691+0000 | compress | METRIC - error 22239.74
2025-08-16T09:22:42.090442+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:22:42.090863+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:22:42.091519+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 512 samples
2025-08-16T09:22:42.878364+0000 | compress | METRIC - time 0.79s
2025-08-16T09:22:42.879321+0000 | compress | ME

(19/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.84it/s]
(20/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.70it/s]

2025-08-16T09:22:57.482368+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 512 samples
2025-08-16T09:22:58.347440+0000 | compress | METRIC - time 0.86s
2025-08-16T09:22:58.348466+0000 | compress | METRIC - error 49685.23
2025-08-16T09:22:58.349236+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:22:58.349602+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:22:58.350316+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 512 samples
2025-08-16T09:22:59.140507+0000 | compress | METRIC - time 0.79s
2025-08-16T09:22:59.141496+0000 | compress | METRIC - error 23585.09
2025-08-16T09:22:59.142272+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:22:59.142630+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:22:59.143351+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 512 samples
2025-08-16T09:22:59.941466+0000 | compress | METRIC - time 0.80s
2025-08-16T09:22:59.942470+0000 | compress | ME

(20/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 469.68it/s]
(21/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.29it/s]

2025-08-16T09:23:14.715557+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 512 samples
2025-08-16T09:23:15.536212+0000 | compress | METRIC - time 0.82s
2025-08-16T09:23:15.537194+0000 | compress | METRIC - error 48290.51
2025-08-16T09:23:15.538089+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:23:15.538544+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:23:15.539402+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 512 samples
2025-08-16T09:23:16.320681+0000 | compress | METRIC - time 0.78s
2025-08-16T09:23:16.321677+0000 | compress | METRIC - error 21581.77
2025-08-16T09:23:16.322453+0000 | compress | METRIC - GPU 0 | usage: 25.48% | total memory: 24 GB
2025-08-16T09:23:16.322839+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:23:16.323533+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 512 samples
2025-08-16T09:23:17.111863+0000 | compress | METRIC - time 0.79s
2025-08-16T09:23:17.112924+0000 | compress | ME

(21/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.36it/s]
(22/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.44it/s]

2025-08-16T09:23:31.757435+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 512 samples
2025-08-16T09:23:32.568261+0000 | compress | METRIC - time 0.81s
2025-08-16T09:23:32.569282+0000 | compress | METRIC - error 45971.88
2025-08-16T09:23:32.570045+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:23:32.570456+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:23:32.571194+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 512 samples
2025-08-16T09:23:33.370634+0000 | compress | METRIC - time 0.80s
2025-08-16T09:23:33.371622+0000 | compress | METRIC - error 21119.10
2025-08-16T09:23:33.372333+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:23:33.372878+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:23:33.373439+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 512 samples
2025-08-16T09:23:34.164835+0000 | compress | METRIC - time 0.79s
2025-08-16T09:23:34.165846+0000 | compress | ME

(22/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.77it/s]
(23/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.42it/s]

2025-08-16T09:23:48.830389+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 512 samples
2025-08-16T09:23:49.652274+0000 | compress | METRIC - time 0.82s
2025-08-16T09:23:49.653314+0000 | compress | METRIC - error 46517.13
2025-08-16T09:23:49.654094+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:23:49.654546+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:23:49.655407+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 512 samples
2025-08-16T09:23:50.446006+0000 | compress | METRIC - time 0.79s
2025-08-16T09:23:50.447004+0000 | compress | METRIC - error 17908.47
2025-08-16T09:23:50.447916+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:23:50.448504+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:23:50.449152+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 512 samples
2025-08-16T09:23:51.232467+0000 | compress | METRIC - time 0.78s
2025-08-16T09:23:51.233501+0000 | compress | ME

(23/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.85it/s]
(24/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.40it/s]

2025-08-16T09:24:06.036225+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 512 samples
2025-08-16T09:24:06.853759+0000 | compress | METRIC - time 0.82s
2025-08-16T09:24:06.854725+0000 | compress | METRIC - error 58195.68
2025-08-16T09:24:06.855557+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:24:06.855993+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:24:06.856859+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 512 samples
2025-08-16T09:24:07.640042+0000 | compress | METRIC - time 0.78s
2025-08-16T09:24:07.641094+0000 | compress | METRIC - error 27485.05
2025-08-16T09:24:07.641914+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:24:07.642370+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:24:07.643237+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 512 samples
2025-08-16T09:24:08.427463+0000 | compress | METRIC - time 0.78s
2025-08-16T09:24:08.428389+0000 | compress | ME

(24/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.22it/s]
(25/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.51it/s]

2025-08-16T09:24:23.110930+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples
2025-08-16T09:24:23.951403+0000 | compress | METRIC - time 0.84s
2025-08-16T09:24:23.952438+0000 | compress | METRIC - error 47724.11
2025-08-16T09:24:23.953035+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:24:23.953448+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:24:23.954357+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 512 samples
2025-08-16T09:24:24.735055+0000 | compress | METRIC - time 0.78s
2025-08-16T09:24:24.736050+0000 | compress | METRIC - error 20773.05
2025-08-16T09:24:24.736875+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:24:24.737295+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:24:24.738066+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 512 samples
2025-08-16T09:24:25.521487+0000 | compress | METRIC - time 0.78s
2025-08-16T09:24:25.522487+0000 | compress | ME

(25/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.43it/s]
(26/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.49it/s]

2025-08-16T09:24:40.209442+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 512 samples
2025-08-16T09:24:41.032682+0000 | compress | METRIC - time 0.82s
2025-08-16T09:24:41.033656+0000 | compress | METRIC - error 52032.14
2025-08-16T09:24:41.034381+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:24:41.034739+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:24:41.035446+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 512 samples
2025-08-16T09:24:41.833926+0000 | compress | METRIC - time 0.80s
2025-08-16T09:24:41.834937+0000 | compress | METRIC - error 21003.83
2025-08-16T09:24:41.835611+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:24:41.836040+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:24:41.836728+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 512 samples
2025-08-16T09:24:42.624252+0000 | compress | METRIC - time 0.79s
2025-08-16T09:24:42.625136+0000 | compress | ME

(26/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.48it/s]
(27/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 65.95it/s]

2025-08-16T09:24:57.367255+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 512 samples
2025-08-16T09:24:58.219930+0000 | compress | METRIC - time 0.85s
2025-08-16T09:24:58.220966+0000 | compress | METRIC - error 47385.82
2025-08-16T09:24:58.221870+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:24:58.222303+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:24:58.223161+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 512 samples
2025-08-16T09:24:59.012175+0000 | compress | METRIC - time 0.79s
2025-08-16T09:24:59.013173+0000 | compress | METRIC - error 18852.43
2025-08-16T09:24:59.014363+0000 | compress | METRIC - GPU 0 | usage: 25.53% | total memory: 24 GB
2025-08-16T09:24:59.014772+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:24:59.015635+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 512 samples
2025-08-16T09:24:59.804075+0000 | compress | METRIC - time 0.79s
2025-08-16T09:24:59.805071+0000 | compress | ME

(27/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.26it/s]
(28/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.21it/s]

2025-08-16T09:25:14.507126+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 512 samples
2025-08-16T09:25:15.328140+0000 | compress | METRIC - time 0.82s
2025-08-16T09:25:15.329145+0000 | compress | METRIC - error 50514.16
2025-08-16T09:25:15.329911+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:25:15.330277+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:25:15.331033+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 512 samples
2025-08-16T09:25:16.120458+0000 | compress | METRIC - time 0.79s
2025-08-16T09:25:16.121423+0000 | compress | METRIC - error 20067.66
2025-08-16T09:25:16.122268+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:25:16.122872+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:25:16.123562+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 512 samples
2025-08-16T09:25:16.907405+0000 | compress | METRIC - time 0.78s
2025-08-16T09:25:16.908348+0000 | compress | ME

(28/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.95it/s]
(29/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.61it/s]

2025-08-16T09:25:31.542130+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.q_proj using 512 samples
2025-08-16T09:25:32.357214+0000 | compress | METRIC - time 0.81s
2025-08-16T09:25:32.358168+0000 | compress | METRIC - error 56589.19
2025-08-16T09:25:32.359003+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:25:32.359581+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:25:32.360284+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.k_proj using 512 samples
2025-08-16T09:25:33.142207+0000 | compress | METRIC - time 0.78s
2025-08-16T09:25:33.143217+0000 | compress | METRIC - error 21200.02
2025-08-16T09:25:33.144346+0000 | compress | METRIC - GPU 0 | usage: 25.54% | total memory: 24 GB
2025-08-16T09:25:33.144769+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:25:33.145604+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.v_proj using 512 samples
2025-08-16T09:25:33.926302+0000 | compress | METRIC - time 0.78s
2025-08-16T09:25:33.927321+0000 | compress | ME

(29/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.60it/s]
(30/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.62it/s]

2025-08-16T09:25:48.577480+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.q_proj using 512 samples





2025-08-16T09:25:49.391009+0000 | compress | METRIC - time 0.81s
2025-08-16T09:25:49.392054+0000 | compress | METRIC - error 49479.14
2025-08-16T09:25:49.392635+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:25:49.393060+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:25:49.393970+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.k_proj using 512 samples
2025-08-16T09:25:50.178380+0000 | compress | METRIC - time 0.78s
2025-08-16T09:25:50.179371+0000 | compress | METRIC - error 17688.41
2025-08-16T09:25:50.180212+0000 | compress | METRIC - GPU 0 | usage: 25.49% | total memory: 24 GB
2025-08-16T09:25:50.180638+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:25:50.181460+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.v_proj using 512 samples
2025-08-16T09:25:50.963058+0000 | compress | METRIC - time 0.78s
2025-08-16T09:25:50.964102+0000 | compress | ME

(30/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 463.56it/s]
(31/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.55it/s]

2025-08-16T09:26:05.651840+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.q_proj using 512 samples





2025-08-16T09:26:06.464826+0000 | compress | METRIC - time 0.81s
2025-08-16T09:26:06.465753+0000 | compress | METRIC - error 63471.51
2025-08-16T09:26:06.466468+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:26:06.466844+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:26:06.467532+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.k_proj using 512 samples
2025-08-16T09:26:07.244017+0000 | compress | METRIC - time 0.78s
2025-08-16T09:26:07.245142+0000 | compress | METRIC - error 23335.79
2025-08-16T09:26:07.245902+0000 | compress | METRIC - GPU 0 | usage: 25.50% | total memory: 24 GB
2025-08-16T09:26:07.246261+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:26:07.246987+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.v_proj using 512 samples
2025-08-16T09:26:08.026717+0000 | compress | METRIC - time 0.78s
2025-08-16T09:26:08.027675+0000 | compress | ME

(31/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.11it/s]
(32/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.70it/s]

2025-08-16T09:26:22.695191+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.q_proj using 512 samples





2025-08-16T09:26:23.511911+0000 | compress | METRIC - time 0.82s
2025-08-16T09:26:23.512948+0000 | compress | METRIC - error 77332.41
2025-08-16T09:26:23.513724+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:26:23.514173+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:26:23.515046+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.k_proj using 512 samples
2025-08-16T09:26:24.293809+0000 | compress | METRIC - time 0.78s
2025-08-16T09:26:24.294817+0000 | compress | METRIC - error 28440.25
2025-08-16T09:26:24.295628+0000 | compress | METRIC - GPU 0 | usage: 25.51% | total memory: 24 GB
2025-08-16T09:26:24.296103+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:26:24.296948+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.v_proj using 512 samples
2025-08-16T09:26:25.073098+0000 | compress | METRIC - time 0.78s
2025-08-16T09:26:25.074328+0000 | compress | ME

(32/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 468.33it/s]
(33/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.74it/s]

2025-08-16T09:26:39.814300+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.q_proj using 512 samples





2025-08-16T09:26:40.642091+0000 | compress | METRIC - time 0.83s
2025-08-16T09:26:40.643165+0000 | compress | METRIC - error 59816.77
2025-08-16T09:26:40.643885+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:26:40.644245+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:26:40.644962+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.k_proj using 512 samples
2025-08-16T09:26:41.432884+0000 | compress | METRIC - time 0.79s
2025-08-16T09:26:41.433904+0000 | compress | METRIC - error 23898.27
2025-08-16T09:26:41.434668+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:26:41.435041+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:26:41.435714+0000 | compress_modules | INFO - Quantizing model.layers.32.self_attn.v_proj using 512 samples
2025-08-16T09:26:42.228843+0000 | compress | METRIC - time 0.79s
2025-08-16T09:26:42.229864+0000 | compress | ME

(33/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 466.70it/s]
(34/41): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 66.49it/s]

2025-08-16T09:26:56.940557+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.q_proj using 512 samples





2025-08-16T09:26:57.765238+0000 | compress | METRIC - time 0.82s
2025-08-16T09:26:57.766257+0000 | compress | METRIC - error 79826.00
2025-08-16T09:26:57.766992+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:26:57.767388+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-08-16T09:26:57.768158+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.k_proj using 512 samples
2025-08-16T09:26:58.556898+0000 | compress | METRIC - time 0.79s
2025-08-16T09:26:58.557890+0000 | compress | METRIC - error 32549.31
2025-08-16T09:26:58.558550+0000 | compress | METRIC - GPU 0 | usage: 25.52% | total memory: 24 GB
2025-08-16T09:26:58.558938+0000 | compress | METRIC - Compressed module size: 2.121728 MB
2025-08-16T09:26:58.559573+0000 | compress_modules | INFO - Quantizing model.layers.33.self_attn.v_proj using 512 samples
2025-08-16T09:26:59.341812+0000 | compress | METRIC - time 0.78s
2025-08-16T09:26:59.342730+0000 | compress | ME

(34/41): Propagating: 100%|██████████| 512/512 [00:01<00:00, 467.21it/s]
(35/41): Calibrating:  10%|█         | 53/512 [00:00<00:06, 70.69it/s]



### Save the Compressed Model

**Explanation**

- **Naming**: appends `-W4A16` to the base model name so the quantized checkpoint is easy to tell apart from the original.
- **`save_compressed=True`**: stores the quantized weights in a compact `safetensors` checkpoint that can be deployed with vLLM.

In [13]:
# Save to disk compressed.
MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

2025-08-16T09:28:53.884850+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 527it [00:10, 48.07it/s]


('granite-3.2-2b-instruct-W4A16/tokenizer_config.json',
 'granite-3.2-2b-instruct-W4A16/special_tokens_map.json',
 'granite-3.2-2b-instruct-W4A16/chat_template.jinja',
 'granite-3.2-2b-instruct-W4A16/vocab.json',
 'granite-3.2-2b-instruct-W4A16/merges.txt',
 'granite-3.2-2b-instruct-W4A16/added_tokens.json',
 'granite-3.2-2b-instruct-W4A16/tokenizer.json')
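
As a quick sanity check after saving, the size of the checkpoint directory can be compared with the original model. The sketch below is a hypothetical helper (`checkpoint_summary` is not part of `llmcompressor`); it assumes the compressed checkpoint records its scheme under a `quantization_config` key in `config.json`, which may differ between library versions.

```python
import json
import os


def checkpoint_summary(save_dir):
    """Return the total size in MB and the quantization config (if any) of a checkpoint."""
    total_bytes = 0
    for root, _dirs, files in os.walk(save_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))

    quant_config = None
    config_path = os.path.join(save_dir, "config.json")
    if os.path.exists(config_path):
        with open(config_path) as f:
            # Assumption: the quantization scheme is stored under this key.
            quant_config = json.load(f).get("quantization_config")

    return total_bytes / 1e6, quant_config


# Only runs when the checkpoint directory from the cell above exists.
if os.path.isdir("granite-3.2-2b-instruct-W4A16"):
    size_mb, quant_config = checkpoint_summary("granite-3.2-2b-instruct-W4A16")
    print(f"Checkpoint size: {size_mb:.1f} MB")
    print("Quantization config found:", quant_config is not None)
```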

### Evaluate accuracy in vLLM

We can evaluate the accuracy of the quantized model with `lm_eval` (the LM Evaluation Harness), using vLLM as the inference backend.

#### Check leftover GPU memory

In [14]:
!nvidia-smi

Sat Aug 16 09:29:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   72C    P0             39W /   72W |    5579MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

**IMPORTANT**: After quantizing the model, the GPU memory may not be freed (in the output above, 5579MiB is still in use). You need to **restart the kernel** before evaluating the model to ensure enough GPU memory is available.
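
After the restart, the same check can be done programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_memory` and `gpu_memory_mib` are illustrative and not part of any library.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB rows from nvidia-smi CSV output into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            used, total = (int(field) for field in line.split(","))
        except ValueError:
            continue  # skip rows such as "[N/A]" that are not numeric
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Query used and total memory per GPU; returns [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


for index, (used, total) in enumerate(gpu_memory_mib()):
    print(f"GPU {index}: {used} MiB used of {total} MiB ({used / total:.0%})")
```

If the reported usage is still in the gigabyte range before evaluation starts, the previous kernel is most likely still holding the model.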

#### Install lm_eval

In [15]:
!pip install -q lm_eval==v0.4.3


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


#### Install vLLM for evaluation

Install vLLM, which `lm_eval` uses as the inference backend for the GSM8K evaluation:

In [16]:
pip install -q vllm

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.6.0 requires numpy<2.0,>=1.17.0, but you have numpy 2.2.6 which is incompatible.

[notice] A new release of pip is available: 24.2 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


### Evaluation Command

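- `--model vllm`: uses the vLLM backend for fast, memory-efficient inference on large models.
- `--model_args`: comma-separated `key=value` arguments passed to the vLLM backend:
  - `pretrained=$MODEL_ID`: specifies which model to load, here the local path of the quantized checkpoint.
  - `add_bos_token=true`: ensures a beginning-of-sequence token is added; required for consistent results on math and reasoning tasks.
  - `max_model_len=4096`: sets the context window the model uses for evaluation.
  - `gpu_memory_utilization=0.5`: limits vLLM to 50% of GPU memory to avoid out-of-memory (OOM) errors.
- `--tasks gsm8k`, `--num_fewshot 5`, `--limit 250`: evaluate on GSM8K with 5 few-shot examples, using only 250 examples from the task.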

In [17]:
import os

current_dir = os.getcwd()

MODEL_ID = current_dir + "/granite-3.2-2b-instruct-W4A16"

!lm_eval --model vllm \
  --model_args "pretrained=$MODEL_ID,add_bos_token=true,max_model_len=4096,gpu_memory_utilization=0.5" \
  --trust_remote_code \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'

INFO 08-16 09:29:23 [__init__.py:235] Automatically detected platform cuda.
2025-08-16:09:29:24,368 INFO     [__main__.py:272] Verbosity set to INFO
2025-08-16:09:29:28,940 INFO     [__main__.py:357] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
2025-08-16:09:29:28,940 INFO     [__main__.py:369] Selected Tasks: ['gsm8k']
2025-08-16:09:29:28,942 INFO     [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-08-16:09:29:28,942 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': '/opt/app-root/src/granite-3.2-2b-instruct-W4A16', 'add_bos_token': True, 'max_model_len': 4096, 'gpu_memory_utilization': 0.5, 'trust_remote_code': True}
INFO 08-16 09:29:35 [config.py:1604] Using max model len 4096
INFO 08-16 09:29:36 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-16 09:29:36 [core.py:572] Waiting for init message fro
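
The `--model_args` value used in the command above is a plain comma-separated `key=value` string. When trying several settings, it can be easier to assemble it from Python values. The helper below, `build_model_args`, is a hypothetical convenience function and not part of `lm_eval`.

```python
def build_model_args(**kwargs):
    """Join keyword arguments into a comma-separated key=value string."""
    parts = []
    for key, value in kwargs.items():
        if isinstance(value, bool):
            value = str(value).lower()  # True -> "true", as in the command above
        parts.append(f"{key}={value}")
    return ",".join(parts)


model_args = build_model_args(
    pretrained="/opt/app-root/src/granite-3.2-2b-instruct-W4A16",
    add_bos_token=True,
    max_model_len=4096,
    gpu_memory_utilization=0.5,
)
print(model_args)
```

The resulting string can be passed to the shell command as `--model_args "$model_args"` from a notebook cell.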

With more powerful GPUs, you could also run a broader vLLM-based evaluation, such as the `openllm` task group, with higher GPU memory utilization and chunked prefill enabled:
```bash
!lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_ID,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --trust_remote_code \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

### Upload the optimized model to MinIO

In [19]:
import os
from boto3 import client

current_dir = os.getcwd()
OPTIMIZED_MODEL_DIR = current_dir + "/granite-3.2-2b-instruct-W4A16"
S3_PATH = "granite-int4-notebook"

print('Starting upload of quantized model')
# Connection details are read from environment variables.
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

print(f'Uploading model to bucket {s3_bucket_name} '
      f'to S3 storage at {s3_endpoint_url}')

# verify=False disables TLS certificate verification.
s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False
)

# Walk through the local folder and upload each file under S3_PATH,
# keeping its path relative to the model directory.
for root, dirs, files in os.walk(OPTIMIZED_MODEL_DIR):
    for file in files:
        local_file_path = os.path.join(root, file)
        relative_path = os.path.relpath(local_file_path, OPTIMIZED_MODEL_DIR)
        # S3 keys always use forward slashes, whatever the local OS separator is.
        s3_file_path = f"{S3_PATH}/{relative_path.replace(os.sep, '/')}"
        s3_client.upload_file(local_file_path, s3_bucket_name, s3_file_path)
        print(f'Uploaded {local_file_path}')

print('Finished uploading of quantized model')

Starting upload of quantized model
Uploading model to bucket models to S3 storage at http://minio-service.minio.svc.cluster.local:9000
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/added_tokens.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/model.safetensors
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/special_tokens_map.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/generation_config.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/tokenizer_config.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/vocab.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/merges.txt
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/recipe.yaml
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/chat_template.jinja
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/tokenizer.json
Uploaded /opt/app-root/src/granite-3.2-2b-instruct-W4A16/config.json
Finished uploading of quantized model
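
To confirm that every file reached the bucket, the local directory can be compared with the keys stored under the upload prefix. This is a minimal sketch: `missing_uploads` is a hypothetical helper, and it relies on the boto3 S3 client's `list_objects_v2` call, which returns at most 1000 keys per request (more than enough for this checkpoint).

```python
import os


def missing_uploads(s3_client, bucket, prefix, local_dir):
    """Return local files (as relative paths) with no matching key under `prefix`."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix + "/")
    remote_keys = {obj["Key"] for obj in response.get("Contents", [])}

    missing = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            relative = os.path.relpath(os.path.join(root, name), local_dir)
            key = f"{prefix}/{relative.replace(os.sep, '/')}"
            if key not in remote_keys:
                missing.append(relative)
    return sorted(missing)


# Usage in the notebook, reusing the variables from the upload cell:
# missing_uploads(s3_client, s3_bucket_name, S3_PATH, OPTIMIZED_MODEL_DIR)
```

An empty list means every local file has a corresponding object in the bucket.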


### Bonus labs
- Experiment with different quantization schemes and methods to further improve the accuracy of the quantized model.
- Prepare a new calibration dataset tailored to a specific use case by collecting data and mixing it from several sources.
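
As a starting point for the data-mixing bonus lab, the sketch below draws a fixed number of calibration samples from several text sources in chosen proportions. Everything here is illustrative: `mix_calibration_samples`, the source names, and the 75/25 split are assumptions, and the total of 512 simply mirrors the number of calibration samples shown in the quantization log.

```python
import random


def mix_calibration_samples(sources, weights, total, seed=42):
    """Draw about `total` samples from several text sources according to `weights`."""
    rng = random.Random(seed)
    weight_sum = sum(weights.values())
    mixed = []
    for name, texts in sources.items():
        # Share contributed by this source, capped by the number of texts it holds.
        # Rounding means the final count can differ slightly from `total`.
        count = min(len(texts), round(total * weights[name] / weight_sum))
        mixed.extend(rng.sample(texts, count))
    rng.shuffle(mixed)
    return mixed


# Hypothetical sources; in practice these would be real domain texts.
sources = {
    "support_chats": [f"chat {i}" for i in range(1000)],
    "product_docs": [f"doc {i}" for i in range(1000)],
}
calibration = mix_calibration_samples(
    sources, {"support_chats": 0.75, "product_docs": 0.25}, total=512
)
print(len(calibration))
```

The mixed list would then need to be tokenized in the same way as the calibration dataset used earlier in this lab before being passed to the quantization step.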