## Compress the Base LLM using LLM Compressor
Now that we have established **accuracy and performance baselines** for the base model (`Llama-3.1-8B-Instruct`), the next step is to apply **model compression**. Specifically, we will quantize the base model and later evaluate the compressed version for both accuracy and system-level performance.

This will allow us to assess whether model compression improves deployment metrics, such as **Time to First Token (TTFT)**, while maintaining accuracy comparable to the base model.


This step focuses on **reducing the model’s memory footprint** through quantization, resulting in a smaller and more deployment-efficient model.
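
To make the idea of quantization concrete, here is a minimal sketch of a symmetric INT8 round trip on a handful of values. It assumes a single per-tensor scale derived from the absolute maximum, which is only one of several possible choices; the helper names `quantize_int8` and `dequantize_int8` are hypothetical and are not part of LLM Compressor, which applies its own schemes later in this notebook.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (illustrative only)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.03, 0.44]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as one byte plus a shared scale instead of two bytes, which is where the memory saving comes from; the price is the small rounding error measured by `max_error`.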

**Note**: We use **data-aware quantization**, which relies on representative calibration data to preserve model quality.

**Goal**: Reduce model size (e.g., FP16 → INT8 / INT4) while retaining accuracy and inference performance.

**Key Actions**:

- Load the base model.

- Measure its size and memory usage.

- Use a calibration dataset (e.g., WikiText, UltraChat) to collect activation statistics.

- Apply a quantization recipe (e.g., SmoothQuant + GPTQ modifier).

- Save the compressed model and verify size reduction.

**Outcome**:

- Compressed model saved on disk.

- Model size reduced, typically by about 50% for FP16 → INT8 (the exact figure depends on the quantization scheme and on which layers are left unquantized).

### Install Dependencies

In [2]:
!pip install .
!pip install torch==2.9.0

Running the cell above might return an error such as `ERROR: pip's dependency resolver does not currently take into account... llmcompressor 0.8.1 requires torch<=2.8.0,>=2.7.0, but you have torch 2.9.0 which is incompatible.`, which can be safely ignored.

In [1]:
import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils import model_size_gb, tokenize_for_calibration



In [2]:
# check available device
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### Loading Base Model

**Before loading the model, make sure no other processes are consuming GPU memory: list them with `nvidia-smi` and stop them with `kill -9 <pid>`.**

When loading the model with **from_pretrained** from the transformers **AutoModelForCausalLM** class, we set the **torch_dtype** parameter to **auto** so that the weights are loaded in the data type specified in the model's config.
Otherwise, PyTorch loads the weights in **full precision (fp32)**. Note that recent transformers releases report **torch_dtype** as deprecated in favor of **dtype**, as the warning in the cell output below shows.

In [3]:
# set up variables
base_model_path = "../base_model"
compressed_model_path = "../Llama_3.1_8B_Instruct_int8_dynamic"

In [4]:
# load the model and tokenizer from the local base model directory
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype="auto", device_map="auto"
)
print("Base model loaded")

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.50it/s]

Base model loaded





In [5]:
# check model size
# !du -sh {base_model_path}
model_size = model_size_gb(model)
print(f"The size of the base model is: {model_size:.4f}GB")

The size of the base model is: 14.9575GB
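
As a sanity check on this number, the weight footprint can be estimated from the parameter count and the bits stored per parameter. The sketch below assumes roughly 8.03 billion parameters for the base model and assumes that the reported "GB" figure is computed in binary gigabytes (GiB); both are assumptions, and `estimated_size_gib` is a hypothetical helper, not the `model_size_gb` function from `utils`.

```python
def estimated_size_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate weight storage in GiB, ignoring scales, zero points, and buffers."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 8.03e9  # assumption: approximate parameter count of the base model

fp16_size = estimated_size_gib(num_params, 16)
int8_size = estimated_size_gib(num_params, 8)
int4_size = estimated_size_gib(num_params, 4)

print(f"16-bit: {fp16_size:.2f} GiB")
print(f" 8-bit: {int8_size:.2f} GiB")
print(f" 4-bit: {int4_size:.2f} GiB")
```

The 8-bit estimate is an idealized lower bound: layers that are left in higher precision, together with the stored quantization scales, make the real compressed model somewhat larger than half the 16-bit size.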


### Preparing Calibration Dataset

Since we are using data-aware quantization to compress the base model, we need a dataset to calibrate the model with real or representative inputs. For this example, we use a small, general-purpose dataset to keep processing fast: the `wikitext-2-raw-v1` configuration of the WikiText dataset, which is the smaller of its variants. More information on why a calibration dataset is needed is provided in [Compression.md](../docs/Compression.md).

In [6]:
# Define the dataset to use for calibration
dataset_id = "wikitext"

# Specify the configuration / version of the dataset
config = "wikitext-2-raw-v1"  # Small version (~2M tokens), raw text format

# Set the number of calibration samples based on available device
# - On GPU: use more samples to get more accurate activation statistics
# - On CPU: reduce samples to prevent memory issues and keep demo fast
num_calibration_samples = 512 if device == "cuda" else 16

# Set the maximum sequence length for calibration
max_sequence_length = 1024 if device == "cuda" else 16

# Load the dataset using Hugging Face Datasets API
# This downloads train split of the dataset
ds = load_dataset(dataset_id, config, split="train")
# Shuffle and grab only the number of samples we need
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))

In [7]:
# inspect the dataset
print(f"columns in the {dataset_id}: {ds.column_names}\n")
print(ds[0])

columns in the wikitext: ['text']

{'text': ' Continuous , short @-@ arc , high pressure xenon arc lamps have a color temperature closely approximating noon sunlight and are used in solar simulators . That is , the chromaticity of these lamps closely approximates a heated black body radiator that has a temperature close to that observed from the Sun . After they were first introduced during the 1940s , these lamps began replacing the shorter @-@ lived carbon arc lamps in movie projectors . They are employed in typical 35mm , IMAX and the new digital projectors film projection systems , automotive HID headlights , high @-@ end " tactical " flashlights and other specialized uses . These arc lamps are an excellent source of short wavelength ultraviolet radiation and they have intense emissions in the near infrared , which is used in some night vision systems . \n'}


**Dataset inspection shows that we need to extract the `text` column and pass it as input to the model.**
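
One thing worth checking at this stage is that raw WikiText rows also include empty lines and section-heading rows (such as ` = Title = `), which carry little signal for calibration. This notebook does not filter them, so the following is only an optional sketch. The helper `is_informative` and the `min_chars` threshold are hypothetical, and the sample rows are shown as plain Python dictionaries so the snippet runs without downloading the dataset.

```python
def is_informative(example, min_chars=20):
    """Keep rows with enough text that are not section headings (illustrative rule)."""
    text = example["text"].strip()
    return len(text) >= min_chars and not text.startswith("=")


# Stand-in rows with the same shape as the dataset's records
rows = [
    {"text": ""},
    {"text": " = Xenon = \n"},
    {"text": " These arc lamps are an excellent source of short wavelength ultraviolet radiation . \n"},
]

kept = [row for row in rows if is_informative(row)]
print(f"kept {len(kept)} of {len(rows)} rows")

# With a Hugging Face dataset, the same predicate could be applied before
# sampling, for example: ds = ds.filter(is_informative)
```

If such a filter is used, it should run before `select(range(num_calibration_samples))` so that the requested number of samples is still available afterwards.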

### When to Use a Custom Template for Calibration

Use a **custom template** when you want the calibration text to closely mimic the input format your model will see in production.  

For example, if your model is **instruction-following** or **chat-based**, providing the template the model was originally trained on or the template that will be used during inference ensures that the activation statistics collected during calibration reflect realistic usage patterns. 

This can improve the accuracy of quantization and compression.

If your model can handle raw text and doesn’t require a specific format, you can rely on the default template instead.
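
To illustrate the difference, the sketch below wraps a raw sample in a simple instruction-style prompt using Python's built-in string formatting. The variable names and the sample sentence are hypothetical, and how `tokenize_for_calibration` applies a template internally is not shown in this notebook, so this only demonstrates the general idea.

```python
# Hypothetical template with a single named placeholder
template_text = "Instruction: {content}\nOutput:"

raw_sample = "Summarize the history of xenon arc lamps."

# The calibration text the model would see with the template applied
templated_sample = template_text.format(content=raw_sample)

print("Raw:      ", repr(raw_sample))
print("Templated:", repr(templated_sample))
```

The templated text contains the same prompt scaffolding the model would encounter at inference time, so the activations recorded during calibration are closer to those seen in production.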

A custom template can be provided to the `tokenize_for_calibration` function using the `custom_template` argument. It accepts the following format:

```python
custom_template = {
    "template_text": "Instruction: {content}\nOutput:",
    "placeholder": "content",
}
```

In [8]:
# to get activations for the calibration dataset, we need to:
# 1. extract the samples from the dataset
# 2. tokenize samples in the dataset
input_column = "text"

# Call tokenize_for_calibration using dataset.map
tokenized_dataset = ds.map(
    lambda batch: tokenize_for_calibration(
        examples=batch,  # batch from Hugging Face dataset
        input_column=input_column,  # the column containing text to calibrate
        tokenizer=tokenizer,  # your Hugging Face tokenizer
        max_length=max_sequence_length,  # maximum sequence length
        model_type="chat",  # use chat template if no custom template
        custom_template=None,  # optional, provide a dict if you want a custom template
    ),
    batched=True,
)

In [9]:
tokenized_dataset

Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 512
})

### Quantizing/Compressing Base Model to INT8
After preparing the dataset for calibration, we define a recipe for quantization. For the `W8A8-INT8` quantization scheme, we use `SmoothQuantModifier` followed by `GPTQModifier`.

More details on the SmoothQuant and GPTQ algorithms are provided in [Compression.md](Compression.md).
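
As a rough intuition for the `smoothing_strength` value used in the recipe, SmoothQuant rescales each input channel `j` by a factor `s_j = max|X_j|^α / max|W_j|^(1-α)`, dividing the activations by `s_j` and multiplying the matching weights by it. The sketch below works on plain Python lists with made-up per-channel maxima. Treating `smoothing_strength` as the exponent `α` is the assumption here, the function name is hypothetical, and the library's actual implementation operates on tensors and layer mappings rather than on lists.

```python
def smoothquant_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel smoothing factors: s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


# Made-up per-channel absolute maxima: channel 0 has an activation outlier
act_absmax = [50.0, 1.0]
weight_absmax = [0.5, 0.5]

scales = smoothquant_scales(act_absmax, weight_absmax, alpha=0.8)

# Activations are divided by the scale, weights are multiplied by it,
# so the product of each activation/weight pair is unchanged.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]

print("scales:          ", scales)
print("smoothed act:    ", smoothed_act)
print("smoothed weights:", smoothed_weight)
```

A larger `alpha` moves more of the quantization difficulty from the activations to the weights: the outlier channel's activation range shrinks while its weight range grows by the same factor.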

In [10]:
# Define the quantization scheme
scheme = "W8A8"  # W8A8 means 8-bit weights and 8-bit activations

# Strength for SmoothQuant smoothing
# This controls how much the activation values are smoothed to reduce outliers
smoothing_strength = 0.8

# Create SmoothQuant modifier
# - smooths activations before quantization to improve stability and reduce degradation
smooth_quant = SmoothQuantModifier(smoothing_strength=smoothing_strength)

# Create GPTQ modifier
# - targets="Linear" quantizes only Linear layers (e.g., attention and feed-forward projections)
# - scheme=scheme uses the W8A8 quantization scheme
# - ignore=["lm_head"] preserves the LM head to avoid generation quality loss
quantizer = GPTQModifier(targets="Linear", scheme=scheme, ignore=["lm_head"])

# Combine the modifiers into a recipe list
# The order matters: first apply SmoothQuant, then GPTQ
recipe = [smooth_quant, quantizer]

# Perform quantization
oneshot(
    model=base_model_path,  # Model to quantize
    dataset=tokenized_dataset,  # Calibration dataset, used for both SmoothQuant & GPTQ
    recipe=recipe,  # List of quantization modifiers to apply
    output_dir=compressed_model_path,  # Directory to save the quantized model
    max_seq_length=max_sequence_length,  # Maximum sequence length for calibration (matches tokenization above)
    num_calibration_samples=num_calibration_samples,  # Number of samples used for calibration (matches dataset size)
)

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 127.41it/s]


2026-01-13T14:11:50.033633+0000 | reset | INFO - Compression lifecycle reset
2026-01-13T14:11:50.043707+0000 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/13-01-2026_14.11.50.log
2026-01-13T14:11:50.044993+0000 | from_modifiers | INFO - Creating recipe from modifiers
2026-01-13T14:11:50.045455+0000 | _infer_mappings_from_model | INFO - No SmoothQuantModifier.mappings provided, inferring from model...
2026-01-13T14:11:50.815363+0000 | initialize | INFO - Compression lifecycle initialized for 2 modifiers
2026-01-13T14:11:50.816990+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `SmoothQuantModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 676.40it/s]
(1/33): Calibrating: 100%|██████████| 512/512 [00:03<00:00, 146.55it/s]

2026-01-13T14:11:55.526979+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.input_layernorm
2026-01-13T14:11:55.533888+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.post_attention_layernorm



(1/33): Propagating: 100%|██████████| 512/512 [00:04<00:00, 127.97it/s]
(2/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 249.60it/s]

2026-01-13T14:12:01.809945+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.input_layernorm
2026-01-13T14:12:01.811656+0000 | _apply_smoothing | INFO - Smoothing with model.layers.1.post_attention_layernorm



(2/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 176.97it/s]
(3/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 244.27it/s]

2026-01-13T14:12:06.911825+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.input_layernorm
2026-01-13T14:12:06.914245+0000 | _apply_smoothing | INFO - Smoothing with model.layers.2.post_attention_layernorm



(3/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.80it/s]
(4/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 245.30it/s]

2026-01-13T14:12:12.055982+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.input_layernorm
2026-01-13T14:12:12.058026+0000 | _apply_smoothing | INFO - Smoothing with model.layers.3.post_attention_layernorm



(4/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 170.99it/s]
(5/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 243.97it/s]

2026-01-13T14:12:17.221489+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.input_layernorm
2026-01-13T14:12:17.223245+0000 | _apply_smoothing | INFO - Smoothing with model.layers.4.post_attention_layernorm



(5/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 168.69it/s]
(6/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 243.20it/s]

2026-01-13T14:12:22.436349+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.input_layernorm
2026-01-13T14:12:22.438659+0000 | _apply_smoothing | INFO - Smoothing with model.layers.5.post_attention_layernorm



(6/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 170.45it/s]
(7/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 246.57it/s]

2026-01-13T14:12:27.589420+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.input_layernorm
2026-01-13T14:12:27.591509+0000 | _apply_smoothing | INFO - Smoothing with model.layers.6.post_attention_layernorm



(7/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 169.73it/s]
(8/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 246.09it/s]

2026-01-13T14:12:32.760153+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.input_layernorm
2026-01-13T14:12:32.761803+0000 | _apply_smoothing | INFO - Smoothing with model.layers.7.post_attention_layernorm



(8/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 170.13it/s]
(9/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 237.72it/s]

2026-01-13T14:12:37.995764+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.input_layernorm
2026-01-13T14:12:37.997565+0000 | _apply_smoothing | INFO - Smoothing with model.layers.8.post_attention_layernorm



(9/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 170.80it/s]
(10/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 239.85it/s]

2026-01-13T14:12:43.340857+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.input_layernorm





2026-01-13T14:12:43.460974+0000 | _apply_smoothing | INFO - Smoothing with model.layers.9.post_attention_layernorm


(10/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.80it/s]
(11/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 244.67it/s]

2026-01-13T14:12:48.587994+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.input_layernorm





2026-01-13T14:12:48.730005+0000 | _apply_smoothing | INFO - Smoothing with model.layers.10.post_attention_layernorm


(11/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.68it/s]
(12/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 239.05it/s]

2026-01-13T14:12:53.908080+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.input_layernorm
2026-01-13T14:12:53.909916+0000 | _apply_smoothing | INFO - Smoothing with model.layers.11.post_attention_layernorm



(12/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.06it/s]
(13/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 244.22it/s]

2026-01-13T14:12:59.053436+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.input_layernorm
2026-01-13T14:12:59.055693+0000 | _apply_smoothing | INFO - Smoothing with model.layers.12.post_attention_layernorm



(13/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 167.89it/s]
(14/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 240.81it/s]

2026-01-13T14:13:04.303913+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.input_layernorm
2026-01-13T14:13:04.305686+0000 | _apply_smoothing | INFO - Smoothing with model.layers.13.post_attention_layernorm



(14/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 170.62it/s]
(15/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 239.12it/s]


2026-01-13T14:13:09.520387+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.input_layernorm
2026-01-13T14:13:09.522723+0000 | _apply_smoothing | INFO - Smoothing with model.layers.14.post_attention_layernorm


(15/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.16it/s]
(16/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 243.72it/s]

2026-01-13T14:13:14.670420+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.input_layernorm
2026-01-13T14:13:14.672482+0000 | _apply_smoothing | INFO - Smoothing with model.layers.15.post_attention_layernorm



(16/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 171.50it/s]
(17/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 242.97it/s]

2026-01-13T14:13:19.836468+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.input_layernorm
2026-01-13T14:13:19.838141+0000 | _apply_smoothing | INFO - Smoothing with model.layers.16.post_attention_layernorm



(17/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 171.44it/s]
(18/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 237.03it/s]

2026-01-13T14:13:25.057284+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.input_layernorm
2026-01-13T14:13:25.058968+0000 | _apply_smoothing | INFO - Smoothing with model.layers.17.post_attention_layernorm



(18/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.04it/s]
(19/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 242.45it/s]

2026-01-13T14:13:30.216322+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.input_layernorm
2026-01-13T14:13:30.218094+0000 | _apply_smoothing | INFO - Smoothing with model.layers.18.post_attention_layernorm



(19/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 169.44it/s]
(20/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 235.44it/s]

2026-01-13T14:13:35.485457+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.input_layernorm
2026-01-13T14:13:35.487486+0000 | _apply_smoothing | INFO - Smoothing with model.layers.19.post_attention_layernorm



(20/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 171.67it/s]
(21/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 242.21it/s]

2026-01-13T14:13:40.655436+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.input_layernorm
2026-01-13T14:13:40.657165+0000 | _apply_smoothing | INFO - Smoothing with model.layers.20.post_attention_layernorm



(21/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 171.42it/s]
(22/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 240.28it/s]

2026-01-13T14:13:45.986388+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.input_layernorm
2026-01-13T14:13:45.988141+0000 | _apply_smoothing | INFO - Smoothing with model.layers.21.post_attention_layernorm



(22/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 170.21it/s]
(23/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 235.97it/s]

2026-01-13T14:13:51.672572+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.input_layernorm
2026-01-13T14:13:51.674522+0000 | _apply_smoothing | INFO - Smoothing with model.layers.22.post_attention_layernorm



(23/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 168.39it/s]
(24/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 238.10it/s]

2026-01-13T14:13:56.972653+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.input_layernorm
2026-01-13T14:13:57.004561+0000 | _apply_smoothing | INFO - Smoothing with model.layers.23.post_attention_layernorm



(24/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 170.75it/s]
(25/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 240.11it/s]

2026-01-13T14:14:02.206475+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.input_layernorm
2026-01-13T14:14:02.208185+0000 | _apply_smoothing | INFO - Smoothing with model.layers.24.post_attention_layernorm



(25/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 170.41it/s]
(26/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 237.26it/s]

2026-01-13T14:14:07.443103+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.input_layernorm
2026-01-13T14:14:07.444836+0000 | _apply_smoothing | INFO - Smoothing with model.layers.25.post_attention_layernorm



(26/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 169.19it/s]
(27/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 241.52it/s]

2026-01-13T14:14:12.743593+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.input_layernorm





2026-01-13T14:14:12.810591+0000 | _apply_smoothing | INFO - Smoothing with model.layers.26.post_attention_layernorm


(27/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 169.07it/s]
(28/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 233.22it/s]

2026-01-13T14:14:18.107611+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.input_layernorm
2026-01-13T14:14:18.109436+0000 | _apply_smoothing | INFO - Smoothing with model.layers.27.post_attention_layernorm



(28/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 170.93it/s]
(29/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 241.52it/s]

2026-01-13T14:14:23.296933+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.input_layernorm





2026-01-13T14:14:23.298915+0000 | _apply_smoothing | INFO - Smoothing with model.layers.28.post_attention_layernorm


(29/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 170.92it/s]
(30/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 239.42it/s]

2026-01-13T14:14:28.504257+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.input_layernorm
2026-01-13T14:14:28.506093+0000 | _apply_smoothing | INFO - Smoothing with model.layers.29.post_attention_layernorm



(30/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 169.97it/s]
(31/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 237.61it/s]

2026-01-13T14:14:33.745931+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.input_layernorm
2026-01-13T14:14:33.747983+0000 | _apply_smoothing | INFO - Smoothing with model.layers.30.post_attention_layernorm



(31/33): Propagating: 100%|██████████| 512/512 [00:02<00:00, 171.07it/s]
(32/33): Calibrating: 100%|██████████| 512/512 [00:02<00:00, 241.93it/s]

2026-01-13T14:14:38.929564+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.input_layernorm
2026-01-13T14:14:38.931757+0000 | _apply_smoothing | INFO - Smoothing with model.layers.31.post_attention_layernorm



(32/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 168.08it/s]
(33/33): Calibrating: 100%|██████████| 512/512 [00:03<00:00, 153.45it/s]
(33/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 152.84it/s]


2026-01-13T14:14:49.160597+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 512/512 [00:00<00:00, 584.55it/s]
(1/33): Calibrating: 100%|██████████| 512/512 [00:11<00:00, 43.27it/s]

2026-01-13T14:15:02.540467+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples





2026-01-13T14:15:04.220161+0000 | compress | METRIC - time 1.68s
2026-01-13T14:15:04.221318+0000 | compress | METRIC - error 4.76
2026-01-13T14:15:04.222749+0000 | compress | METRIC - GPU 0 | usage: 43.79% | total memory: 48 GB
2026-01-13T14:15:04.223381+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:15:04.224142+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2026-01-13T14:15:05.728384+0000 | compress | METRIC - time 1.50s
2026-01-13T14:15:05.729235+0000 | compress | METRIC - error 3.37
2026-01-13T14:15:05.730278+0000 | compress | METRIC - GPU 0 | usage: 43.79% | total memory: 48 GB
2026-01-13T14:15:05.731004+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:15:05.731828+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
2026-01-13T14:15:07.246608+0000 | compress | METRIC - time 1.51s
2026-01-13T14:15:07.247739+0000 | compress | METRIC - erro

(1/33): Propagating: 100%|██████████| 512/512 [00:04<00:00, 115.75it/s]
(2/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.70it/s]

2026-01-13T14:15:32.213394+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples





2026-01-13T14:15:33.801186+0000 | compress | METRIC - time 1.59s
2026-01-13T14:15:33.803077+0000 | compress | METRIC - error 5.74
2026-01-13T14:15:33.804539+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:15:33.805167+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:15:33.805805+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
2026-01-13T14:15:35.304874+0000 | compress | METRIC - time 1.50s
2026-01-13T14:15:35.305955+0000 | compress | METRIC - error 3.29
2026-01-13T14:15:35.306896+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:15:35.307348+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:15:35.308028+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 512 samples
2026-01-13T14:15:36.779215+0000 | compress | METRIC - time 1.47s
2026-01-13T14:15:36.780524+0000 | compress | METRIC - erro

(2/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.42it/s]
(3/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.79it/s]

2026-01-13T14:16:00.320628+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 512 samples





2026-01-13T14:16:01.821610+0000 | compress | METRIC - time 1.50s
2026-01-13T14:16:01.823102+0000 | compress | METRIC - error 16.08
2026-01-13T14:16:01.823785+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:16:01.824297+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:16:01.824996+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 512 samples
2026-01-13T14:16:03.295843+0000 | compress | METRIC - time 1.47s
2026-01-13T14:16:03.296921+0000 | compress | METRIC - error 11.18
2026-01-13T14:16:03.297669+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:16:03.298118+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:16:03.298800+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 512 samples
2026-01-13T14:16:04.779357+0000 | compress | METRIC - time 1.48s
2026-01-13T14:16:04.780381+0000 | compress | METRIC - er

(3/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 159.02it/s]
(4/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.88it/s]

2026-01-13T14:16:28.343546+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 512 samples





2026-01-13T14:16:29.879580+0000 | compress | METRIC - time 1.54s
2026-01-13T14:16:29.880710+0000 | compress | METRIC - error 9.89
2026-01-13T14:16:29.881606+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:16:29.882213+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:16:29.882836+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 512 samples
2026-01-13T14:16:31.388141+0000 | compress | METRIC - time 1.50s
2026-01-13T14:16:31.389013+0000 | compress | METRIC - error 6.22
2026-01-13T14:16:31.389965+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:16:31.390619+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:16:31.391248+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 512 samples
2026-01-13T14:16:32.880858+0000 | compress | METRIC - time 1.49s
2026-01-13T14:16:32.881800+0000 | compress | METRIC - erro

(4/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.90it/s]
(5/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.40it/s]

2026-01-13T14:16:56.561891+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 512 samples





2026-01-13T14:16:58.121898+0000 | compress | METRIC - time 1.56s
2026-01-13T14:16:58.123166+0000 | compress | METRIC - error 11.52
2026-01-13T14:16:58.124081+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:16:58.124715+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:16:58.125483+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 512 samples
2026-01-13T14:16:59.649933+0000 | compress | METRIC - time 1.52s
2026-01-13T14:16:59.651045+0000 | compress | METRIC - error 7.68
2026-01-13T14:16:59.651956+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:16:59.652446+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:16:59.653190+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 512 samples
2026-01-13T14:17:01.174651+0000 | compress | METRIC - time 1.52s
2026-01-13T14:17:01.176211+0000 | compress | METRIC - err

(5/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.22it/s]
(6/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.13it/s]

2026-01-13T14:17:25.085342+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.q_proj using 512 samples





2026-01-13T14:17:26.635421+0000 | compress | METRIC - time 1.55s
2026-01-13T14:17:26.636530+0000 | compress | METRIC - error 18.76
2026-01-13T14:17:26.637155+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:17:26.637692+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:17:26.638346+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.k_proj using 512 samples
2026-01-13T14:17:28.159185+0000 | compress | METRIC - time 1.52s
2026-01-13T14:17:28.160237+0000 | compress | METRIC - error 12.35
2026-01-13T14:17:28.301128+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:17:28.302211+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:17:28.303107+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.v_proj using 512 samples
2026-01-13T14:17:29.826444+0000 | compress | METRIC - time 1.52s
2026-01-13T14:17:29.828408+0000 | compress | METRIC - er

(6/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 160.90it/s]
(7/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.85it/s]

2026-01-13T14:17:53.719749+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.q_proj using 512 samples





2026-01-13T14:17:55.260076+0000 | compress | METRIC - time 1.54s
2026-01-13T14:17:55.261020+0000 | compress | METRIC - error 16.58
2026-01-13T14:17:55.339216+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:17:55.340156+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:17:55.341134+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.k_proj using 512 samples
2026-01-13T14:17:56.869485+0000 | compress | METRIC - time 1.53s
2026-01-13T14:17:56.870497+0000 | compress | METRIC - error 11.86
2026-01-13T14:17:56.871273+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:17:56.871868+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:17:56.872600+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.v_proj using 512 samples
2026-01-13T14:17:58.395874+0000 | compress | METRIC - time 1.52s
2026-01-13T14:17:58.396968+0000 | compress | METRIC - er

(7/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 162.02it/s]
(8/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.99it/s]

2026-01-13T14:18:22.165439+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.q_proj using 512 samples





2026-01-13T14:18:23.709590+0000 | compress | METRIC - time 1.54s
2026-01-13T14:18:23.710694+0000 | compress | METRIC - error 15.58
2026-01-13T14:18:23.711413+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:18:23.711854+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:18:23.712565+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.k_proj using 512 samples
2026-01-13T14:18:25.278637+0000 | compress | METRIC - time 1.57s
2026-01-13T14:18:25.279773+0000 | compress | METRIC - error 10.83
2026-01-13T14:18:25.280470+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:18:25.281022+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:18:25.281683+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.v_proj using 512 samples
2026-01-13T14:18:26.801429+0000 | compress | METRIC - time 1.52s
2026-01-13T14:18:26.802653+0000 | compress | METRIC - er

(8/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.07it/s]
(9/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.94it/s]

2026-01-13T14:18:50.597555+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.q_proj using 512 samples





2026-01-13T14:18:52.147587+0000 | compress | METRIC - time 1.55s
2026-01-13T14:18:52.148489+0000 | compress | METRIC - error 22.34
2026-01-13T14:18:52.149324+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:18:52.149766+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:18:52.150480+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.k_proj using 512 samples
2026-01-13T14:18:53.675900+0000 | compress | METRIC - time 1.52s
2026-01-13T14:18:53.677154+0000 | compress | METRIC - error 19.50
2026-01-13T14:18:53.677859+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:18:53.678338+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:18:53.679075+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.v_proj using 512 samples
2026-01-13T14:18:55.203073+0000 | compress | METRIC - time 1.52s
2026-01-13T14:18:55.204187+0000 | compress | METRIC - er

(9/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 163.40it/s]
(10/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.31it/s]

2026-01-13T14:19:18.906919+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 512 samples





2026-01-13T14:19:20.445921+0000 | compress | METRIC - time 1.54s
2026-01-13T14:19:20.447019+0000 | compress | METRIC - error 23.36
2026-01-13T14:19:20.447797+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:19:20.448271+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:19:20.448969+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 512 samples
2026-01-13T14:19:22.019512+0000 | compress | METRIC - time 1.57s
2026-01-13T14:19:22.020568+0000 | compress | METRIC - error 22.11
2026-01-13T14:19:22.021688+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:19:22.022418+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:19:22.023050+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 512 samples
2026-01-13T14:19:23.540944+0000 | compress | METRIC - time 1.52s
2026-01-13T14:19:23.541718+0000 | compress | METRIC - er

(10/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 158.33it/s]
(11/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.15it/s]

2026-01-13T14:19:47.447943+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 512 samples





2026-01-13T14:19:48.993281+0000 | compress | METRIC - time 1.54s
2026-01-13T14:19:48.994156+0000 | compress | METRIC - error 28.13
2026-01-13T14:19:48.994908+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:19:48.995395+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:19:48.996249+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 512 samples
2026-01-13T14:19:50.520810+0000 | compress | METRIC - time 1.52s
2026-01-13T14:19:50.521651+0000 | compress | METRIC - error 27.47
2026-01-13T14:19:50.522432+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:19:50.522917+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:19:50.523702+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 512 samples
2026-01-13T14:19:52.043750+0000 | compress | METRIC - time 1.52s
2026-01-13T14:19:52.044859+0000 | compress | METRIC - 

(11/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 160.33it/s]
(12/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.63it/s]

2026-01-13T14:20:15.965375+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 512 samples





2026-01-13T14:20:17.460266+0000 | compress | METRIC - time 1.49s
2026-01-13T14:20:17.461431+0000 | compress | METRIC - error 23.05
2026-01-13T14:20:17.462296+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:20:17.462862+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:20:17.463508+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 512 samples
2026-01-13T14:20:19.092053+0000 | compress | METRIC - time 1.63s
2026-01-13T14:20:19.093071+0000 | compress | METRIC - error 21.89
2026-01-13T14:20:19.094012+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:20:19.094457+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:20:19.095166+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 512 samples
2026-01-13T14:20:20.589504+0000 | compress | METRIC - time 1.49s
2026-01-13T14:20:20.590598+0000 | compress | METRIC - 

(12/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 164.16it/s]
(13/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.27it/s]

2026-01-13T14:20:44.094635+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 512 samples





2026-01-13T14:20:45.574156+0000 | compress | METRIC - time 1.48s
2026-01-13T14:20:45.575248+0000 | compress | METRIC - error 25.15
2026-01-13T14:20:45.575919+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:20:45.576346+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:20:45.577002+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 512 samples
2026-01-13T14:20:47.069825+0000 | compress | METRIC - time 1.49s
2026-01-13T14:20:47.070912+0000 | compress | METRIC - error 17.34
2026-01-13T14:20:47.071749+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:20:47.072315+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:20:47.072898+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 512 samples
2026-01-13T14:20:48.566785+0000 | compress | METRIC - time 1.49s
2026-01-13T14:20:48.567764+0000 | compress | METRIC - 

(13/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 164.21it/s]
(14/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.18it/s]

2026-01-13T14:21:12.114840+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples





2026-01-13T14:21:13.605837+0000 | compress | METRIC - time 1.49s
2026-01-13T14:21:13.606955+0000 | compress | METRIC - error 28.69
2026-01-13T14:21:13.619521+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:21:13.620436+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:21:13.621513+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2026-01-13T14:21:15.106360+0000 | compress | METRIC - time 1.48s
2026-01-13T14:21:15.107254+0000 | compress | METRIC - error 42.65
2026-01-13T14:21:15.108274+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:21:15.108761+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:21:15.109485+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 512 samples
2026-01-13T14:21:16.590012+0000 | compress | METRIC - time 1.48s
2026-01-13T14:21:16.591047+0000 | compress | METRIC - 

(14/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 163.60it/s]
(15/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.15it/s]

2026-01-13T14:21:39.989992+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 512 samples





2026-01-13T14:21:41.490298+0000 | compress | METRIC - time 1.50s
2026-01-13T14:21:41.491427+0000 | compress | METRIC - error 30.81
2026-01-13T14:21:41.492278+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:21:41.492866+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:21:41.493525+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 512 samples
2026-01-13T14:21:43.014091+0000 | compress | METRIC - time 1.52s
2026-01-13T14:21:43.015177+0000 | compress | METRIC - error 43.81
2026-01-13T14:21:43.016419+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:21:43.016857+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:21:43.017573+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 512 samples
2026-01-13T14:21:44.467199+0000 | compress | METRIC - time 1.45s
2026-01-13T14:21:44.468167+0000 | compress | METRIC - 

(15/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 163.45it/s]
(16/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.54it/s]

2026-01-13T14:22:07.971529+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 512 samples





2026-01-13T14:22:09.440282+0000 | compress | METRIC - time 1.47s
2026-01-13T14:22:09.441404+0000 | compress | METRIC - error 42.27
2026-01-13T14:22:09.442164+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:22:09.442690+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:22:09.443304+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 512 samples
2026-01-13T14:22:10.888784+0000 | compress | METRIC - time 1.45s
2026-01-13T14:22:10.889787+0000 | compress | METRIC - error 38.37
2026-01-13T14:22:10.890447+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:22:10.890867+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:22:10.891531+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 512 samples
2026-01-13T14:22:12.360139+0000 | compress | METRIC - time 1.47s
2026-01-13T14:22:12.361010+0000 | compress | METRIC - 

(16/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 158.39it/s]
(17/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.47it/s]

2026-01-13T14:22:35.772912+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 512 samples





2026-01-13T14:22:37.254318+0000 | compress | METRIC - time 1.48s
2026-01-13T14:22:37.255456+0000 | compress | METRIC - error 37.26
2026-01-13T14:22:37.256151+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:22:37.256684+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:22:37.257265+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 512 samples
2026-01-13T14:22:38.704495+0000 | compress | METRIC - time 1.45s
2026-01-13T14:22:38.705572+0000 | compress | METRIC - error 41.83
2026-01-13T14:22:38.706147+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:22:38.706558+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:22:38.707208+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 512 samples
2026-01-13T14:22:40.155108+0000 | compress | METRIC - time 1.45s
2026-01-13T14:22:40.156261+0000 | compress | METRIC - 

(17/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.85it/s]
(18/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 53.10it/s]

2026-01-13T14:23:03.223850+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 512 samples





2026-01-13T14:23:04.692286+0000 | compress | METRIC - time 1.47s
2026-01-13T14:23:04.693303+0000 | compress | METRIC - error 32.22
2026-01-13T14:23:04.694254+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:23:04.694789+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:23:04.695389+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 512 samples
2026-01-13T14:23:06.182765+0000 | compress | METRIC - time 1.49s
2026-01-13T14:23:06.183782+0000 | compress | METRIC - error 39.66
2026-01-13T14:23:06.184492+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:23:06.184931+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:23:06.185579+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 512 samples
2026-01-13T14:23:07.660996+0000 | compress | METRIC - time 1.47s
2026-01-13T14:23:07.662131+0000 | compress | METRIC - 

(18/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 162.71it/s]
(19/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.50it/s]

2026-01-13T14:23:30.826326+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 512 samples





2026-01-13T14:23:32.330602+0000 | compress | METRIC - time 1.50s
2026-01-13T14:23:32.331945+0000 | compress | METRIC - error 34.64
2026-01-13T14:23:32.332777+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:23:32.333229+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:23:32.333900+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 512 samples
2026-01-13T14:23:33.809149+0000 | compress | METRIC - time 1.47s
2026-01-13T14:23:33.810020+0000 | compress | METRIC - error 40.87
2026-01-13T14:23:33.810919+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:23:33.811352+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:23:33.812075+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 512 samples
2026-01-13T14:23:35.285600+0000 | compress | METRIC - time 1.47s
2026-01-13T14:23:35.286573+0000 | compress | METRIC - 

(19/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 164.07it/s]
(20/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.82it/s]

2026-01-13T14:23:58.430337+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 512 samples





2026-01-13T14:23:59.928799+0000 | compress | METRIC - time 1.50s
2026-01-13T14:23:59.929776+0000 | compress | METRIC - error 32.26
2026-01-13T14:23:59.930762+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:23:59.931252+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:23:59.931931+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 512 samples
2026-01-13T14:24:01.535980+0000 | compress | METRIC - time 1.60s
2026-01-13T14:24:01.536718+0000 | compress | METRIC - error 30.64
2026-01-13T14:24:01.537502+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:24:01.537952+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:24:01.538623+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 512 samples
2026-01-13T14:24:03.050659+0000 | compress | METRIC - time 1.51s
2026-01-13T14:24:03.051392+0000 | compress | METRIC - 

(20/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 160.51it/s]
(21/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.67it/s]

2026-01-13T14:24:26.525249+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 512 samples





2026-01-13T14:24:28.139902+0000 | compress | METRIC - time 1.61s
2026-01-13T14:24:28.140726+0000 | compress | METRIC - error 34.64
2026-01-13T14:24:28.141586+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:24:28.142120+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:24:28.142769+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 512 samples
2026-01-13T14:24:29.623430+0000 | compress | METRIC - time 1.48s
2026-01-13T14:24:29.624438+0000 | compress | METRIC - error 36.53
2026-01-13T14:24:29.625122+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:24:29.625537+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:24:29.626196+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 512 samples
2026-01-13T14:24:31.108961+0000 | compress | METRIC - time 1.48s
2026-01-13T14:24:31.109977+0000 | compress | METRIC - 

(21/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.93it/s]
(22/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.31it/s]

2026-01-13T14:24:54.498362+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 512 samples





2026-01-13T14:24:56.031830+0000 | compress | METRIC - time 1.53s
2026-01-13T14:24:56.033648+0000 | compress | METRIC - error 34.79
2026-01-13T14:24:56.034313+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:24:56.034857+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:24:56.035475+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 512 samples
2026-01-13T14:24:57.523746+0000 | compress | METRIC - time 1.49s
2026-01-13T14:24:57.524742+0000 | compress | METRIC - error 34.17
2026-01-13T14:24:57.525458+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:24:57.525983+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:24:57.526568+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 512 samples
2026-01-13T14:24:59.029502+0000 | compress | METRIC - time 1.50s
2026-01-13T14:24:59.030458+0000 | compress | METRIC - 

(22/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.42it/s]
(23/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.96it/s]

2026-01-13T14:25:22.357385+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 512 samples





2026-01-13T14:25:23.876821+0000 | compress | METRIC - time 1.52s
2026-01-13T14:25:23.877764+0000 | compress | METRIC - error 37.21
2026-01-13T14:25:23.878433+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:25:23.878956+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:25:23.879714+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 512 samples
2026-01-13T14:25:25.381908+0000 | compress | METRIC - time 1.50s
2026-01-13T14:25:25.382932+0000 | compress | METRIC - error 38.76
2026-01-13T14:25:25.383850+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:25:25.384411+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:25:25.385049+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 512 samples
2026-01-13T14:25:26.862929+0000 | compress | METRIC - time 1.48s
2026-01-13T14:25:26.863937+0000 | compress | METRIC - 

(23/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 156.23it/s]
(24/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.85it/s]

2026-01-13T14:25:50.646918+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 512 samples





2026-01-13T14:25:52.349864+0000 | compress | METRIC - time 1.70s
2026-01-13T14:25:52.350790+0000 | compress | METRIC - error 36.60
2026-01-13T14:25:52.351628+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:25:52.352223+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:25:52.352814+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 512 samples
2026-01-13T14:25:53.876149+0000 | compress | METRIC - time 1.52s
2026-01-13T14:25:53.877420+0000 | compress | METRIC - error 33.63
2026-01-13T14:25:53.878002+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:25:53.878574+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:25:53.879170+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 512 samples
2026-01-13T14:25:55.395231+0000 | compress | METRIC - time 1.52s
2026-01-13T14:25:55.396259+0000 | compress | METRIC - 

(24/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 155.98it/s]
(25/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.90it/s]

2026-01-13T14:26:19.077206+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples





2026-01-13T14:26:20.599872+0000 | compress | METRIC - time 1.52s
2026-01-13T14:26:20.601331+0000 | compress | METRIC - error 36.05
2026-01-13T14:26:20.602072+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:26:20.602659+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:26:20.603327+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 512 samples
2026-01-13T14:26:22.115556+0000 | compress | METRIC - time 1.51s
2026-01-13T14:26:22.117549+0000 | compress | METRIC - error 33.83
2026-01-13T14:26:22.118292+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:26:22.118828+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:26:22.119466+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 512 samples
2026-01-13T14:26:23.618969+0000 | compress | METRIC - time 1.50s
2026-01-13T14:26:23.620125+0000 | compress | METRIC - 

(25/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 156.98it/s]
(26/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.35it/s]

2026-01-13T14:26:46.994157+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 512 samples





2026-01-13T14:26:48.550751+0000 | compress | METRIC - time 1.56s
2026-01-13T14:26:48.551650+0000 | compress | METRIC - error 35.72
2026-01-13T14:26:48.552460+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:26:48.552990+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:26:48.553607+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 512 samples
2026-01-13T14:26:50.118242+0000 | compress | METRIC - time 1.56s
2026-01-13T14:26:50.119397+0000 | compress | METRIC - error 27.53
2026-01-13T14:26:50.120258+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:26:50.120756+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:26:50.121463+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 512 samples
2026-01-13T14:26:51.671776+0000 | compress | METRIC - time 1.55s
2026-01-13T14:26:51.672880+0000 | compress | METRIC - 

(26/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 158.41it/s]
(27/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.28it/s]

2026-01-13T14:27:15.535376+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 512 samples





2026-01-13T14:27:17.086794+0000 | compress | METRIC - time 1.55s
2026-01-13T14:27:17.087822+0000 | compress | METRIC - error 40.26
2026-01-13T14:27:17.089223+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:27:17.089959+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:27:17.090713+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 512 samples
2026-01-13T14:27:18.621358+0000 | compress | METRIC - time 1.53s
2026-01-13T14:27:18.622415+0000 | compress | METRIC - error 37.27
2026-01-13T14:27:18.623296+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:27:18.623804+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:27:18.624744+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 512 samples
2026-01-13T14:27:20.155463+0000 | compress | METRIC - time 1.53s
2026-01-13T14:27:20.156708+0000 | compress | METRIC - 

(27/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 162.39it/s]
(28/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 51.67it/s]

2026-01-13T14:27:44.077810+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 512 samples





2026-01-13T14:27:45.547577+0000 | compress | METRIC - time 1.47s
2026-01-13T14:27:45.548492+0000 | compress | METRIC - error 37.03
2026-01-13T14:27:45.549526+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:27:45.550009+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:27:45.550703+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 512 samples
2026-01-13T14:27:47.072676+0000 | compress | METRIC - time 1.52s
2026-01-13T14:27:47.073752+0000 | compress | METRIC - error 33.70
2026-01-13T14:27:47.075177+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:27:47.075594+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:27:47.076243+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 512 samples
2026-01-13T14:27:48.512121+0000 | compress | METRIC - time 1.44s
2026-01-13T14:27:48.513137+0000 | compress | METRIC - 

(28/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.02it/s]
(29/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.58it/s]

2026-01-13T14:28:11.674081+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.q_proj using 512 samples





2026-01-13T14:28:13.177115+0000 | compress | METRIC - time 1.50s
2026-01-13T14:28:13.178024+0000 | compress | METRIC - error 37.34
2026-01-13T14:28:13.178679+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:28:13.179094+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:28:13.179738+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.k_proj using 512 samples
2026-01-13T14:28:14.673652+0000 | compress | METRIC - time 1.49s
2026-01-13T14:28:14.674501+0000 | compress | METRIC - error 29.91
2026-01-13T14:28:14.675194+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:28:14.675759+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:28:14.676507+0000 | compress_modules | INFO - Quantizing model.layers.28.self_attn.v_proj using 512 samples
2026-01-13T14:28:16.151124+0000 | compress | METRIC - time 1.47s
2026-01-13T14:28:16.152123+0000 | compress | METRIC - 

(29/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 163.46it/s]
(30/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.42it/s]

2026-01-13T14:28:39.536448+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.q_proj using 512 samples





2026-01-13T14:28:41.045500+0000 | compress | METRIC - time 1.51s
2026-01-13T14:28:41.046755+0000 | compress | METRIC - error 31.70
2026-01-13T14:28:41.170865+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:28:41.172248+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:28:41.173393+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.k_proj using 512 samples
2026-01-13T14:28:42.665565+0000 | compress | METRIC - time 1.49s
2026-01-13T14:28:42.666290+0000 | compress | METRIC - error 24.29
2026-01-13T14:28:42.666894+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:28:42.667366+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:28:42.667845+0000 | compress_modules | INFO - Quantizing model.layers.29.self_attn.v_proj using 512 samples
2026-01-13T14:28:44.157966+0000 | compress | METRIC - time 1.49s
2026-01-13T14:28:44.158743+0000 | compress | METRIC - 

(30/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.24it/s]
(31/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.39it/s]

2026-01-13T14:29:07.532211+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.q_proj using 512 samples





2026-01-13T14:29:09.044547+0000 | compress | METRIC - time 1.51s
2026-01-13T14:29:09.045936+0000 | compress | METRIC - error 42.84
2026-01-13T14:29:09.046680+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:29:09.047091+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:29:09.047702+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.k_proj using 512 samples
2026-01-13T14:29:10.758845+0000 | compress | METRIC - time 1.71s
2026-01-13T14:29:10.759613+0000 | compress | METRIC - error 39.53
2026-01-13T14:29:10.851100+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:29:10.851764+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:29:10.852577+0000 | compress_modules | INFO - Quantizing model.layers.30.self_attn.v_proj using 512 samples
2026-01-13T14:29:12.379990+0000 | compress | METRIC - time 1.53s
2026-01-13T14:29:12.380794+0000 | compress | METRIC - 

(31/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 160.61it/s]
(32/33): Calibrating: 100%|██████████| 512/512 [00:09<00:00, 52.37it/s]

2026-01-13T14:29:36.086079+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.q_proj using 512 samples





2026-01-13T14:29:37.597004+0000 | compress | METRIC - time 1.51s
2026-01-13T14:29:37.597874+0000 | compress | METRIC - error 28.22
2026-01-13T14:29:37.598446+0000 | compress | METRIC - GPU 0 | usage: 41.61% | total memory: 48 GB
2026-01-13T14:29:37.598823+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2026-01-13T14:29:37.599456+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.k_proj using 512 samples
2026-01-13T14:29:39.113656+0000 | compress | METRIC - time 1.51s
2026-01-13T14:29:39.114529+0000 | compress | METRIC - error 19.26
2026-01-13T14:29:39.115143+0000 | compress | METRIC - GPU 0 | usage: 41.62% | total memory: 48 GB
2026-01-13T14:29:39.115507+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2026-01-13T14:29:39.116154+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.v_proj using 512 samples
2026-01-13T14:29:40.620702+0000 | compress | METRIC - time 1.50s
2026-01-13T14:29:40.621569+0000 | compress | METRIC - 

(32/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 161.56it/s]
(33/33): Calibrating: 100%|██████████| 512/512 [00:03<00:00, 157.29it/s]
(33/33): Propagating: 100%|██████████| 512/512 [00:03<00:00, 157.76it/s]


2026-01-13T14:30:01.210386+0000 | finalize | INFO - Compression lifecycle finalized for 2 modifiers
2026-01-13T14:30:01.255584+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 224it [00:20, 10.78it/s]


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_
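
The per-module `error` values in the log above are easier to compare once they are collected into one table. The following is a minimal sketch, assuming the log text has been captured as a string; the helper name `parse_errors` and the regular expressions are illustrative and not part of LLM Compressor. It pairs each `Quantizing ...` line with the next `error` line, using a few lines copied from the output above.

```python
import re

# A few lines copied verbatim from the log output above.
log_text = """\
2026-01-13T14:29:36.086079+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.q_proj using 512 samples
2026-01-13T14:29:37.597004+0000 | compress | METRIC - time 1.51s
2026-01-13T14:29:37.597874+0000 | compress | METRIC - error 28.22
2026-01-13T14:29:37.599456+0000 | compress_modules | INFO - Quantizing model.layers.31.self_attn.k_proj using 512 samples
2026-01-13T14:29:39.113656+0000 | compress | METRIC - time 1.51s
2026-01-13T14:29:39.114529+0000 | compress | METRIC - error 19.26
"""

# Hypothetical patterns matching the log format shown in this notebook.
module_re = re.compile(r"Quantizing (\S+) using (\d+) samples")
error_re = re.compile(r"METRIC - error ([0-9.]+)")


def parse_errors(text):
    """Map each module name to the error value logged right after it."""
    errors, current = {}, None
    for line in text.splitlines():
        m = module_re.search(line)
        if m:
            current = m.group(1)
            continue
        e = error_re.search(line)
        if e and current is not None:
            errors[current] = float(e.group(1))
            current = None
    return errors


errors = parse_errors(log_text)
# List modules from highest to lowest logged error.
for name, err in sorted(errors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {err:.2f}")
```

Sorting by error makes it quick to spot which layers were hardest to quantize, which can help when deciding whether a module should be excluded from the recipe.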

### Checking model size

In [11]:
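# Load the quantized model back from disk
model_quant = AutoModelForCausalLM.from_pretrained(compressed_model_path)
# Record the base model's dtype in the quantized model's config, then re-save
model_quant.config.dtype = model.config.torch_dtype
model_quant.save_pretrained(compressed_model_path)
# Measure the in-memory size of the quantized model
model_size = model_size_gb(model_quant)
print(f"Model size (GB): {model_size}")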

Compressing model: 224it [00:00, 1292.67it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  7.45it/s]
`torch_dtype` is deprecated! Use `dtype` instead!


Model size (GB): 8.460090637207031


### Observation
After quantization, the model size drops from 14.9 GB to about 8.46 GB, a reduction of roughly 43%.
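
The measured size can be sanity-checked from the layer shapes in the model printout above. The sketch below is a back-of-the-envelope estimate under several assumptions that are not confirmed by the notebook: the `Linear` layers inside the decoder blocks are stored at 1 byte per weight after quantization, the embeddings, `lm_head`, and norms stay at 2 bytes per weight, `lm_head` has the same shape as `embed_tokens` (the printout is truncated before it), and `model_size_gb` reports binary gigabytes. Quantization scales are ignored because they add only a few megabytes.

```python
# Shapes read from the LlamaForCausalLM printout above.
HIDDEN, INTERMEDIATE, KV_DIM, VOCAB, LAYERS = 4096, 14336, 1024, 128256, 32

# Weights in the Linear layers of all decoder blocks.
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * INTERMEDIATE                   # gate_proj, up_proj, down_proj
block_linear = LAYERS * (attn + mlp)

# Assumption: lm_head mirrors embed_tokens (vocab x hidden).
embed_and_head = 2 * VOCAB * HIDDEN
# Two RMSNorm weights per block plus the final norm.
norms = (2 * LAYERS + 1) * HIDDEN

GIB = 2**30  # assumption: sizes are reported in binary gigabytes

# Base model: every parameter at 2 bytes (16-bit).
base_gib = 2 * (block_linear + embed_and_head + norms) / GIB
# Quantized model: block Linear weights at 1 byte, everything else at 2 bytes.
quant_gib = (block_linear + 2 * (embed_and_head + norms)) / GIB
reduction = 1 - quant_gib / base_gib

print(f"Estimated base size:      {base_gib:.2f} GiB")
print(f"Estimated quantized size: {quant_gib:.2f} GiB")
print(f"Estimated reduction:      {reduction:.1%}")
```

Under these assumptions the estimate lands close to the measured 14.9 GB and 8.46 GB. It also suggests why the reduction is below 50%: the embedding and output layers, which are not quantized in this estimate, account for about 2 GB on their own.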


**ALTERNATIVELY**, LLM Compressor also supports FP8 quantization. The `FP8_DYNAMIC` scheme used below does not require a calibration dataset. Uncomment the cell below to quantize the model to FP8.

In [None]:
# from llmcompressor.modifiers.quantization import QuantizationModifier

# recipe = QuantizationModifier(
#   targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# oneshot(model=model_name, recipe=recipe, output_dir=compressed_model_path)

# # Load quantized model
# model_quant = AutoModelForCausalLM.from_pretrained(compressed_model_path)
# model_quant.config.dtype = model.config.torch_dtype
# model_quant.save_pretrained(compressed_model_path)
# model_size = model_size_gb(model_quant)
# print(f"Model size (GB): {model_size}")
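
To illustrate why a dynamic FP8 scheme can skip calibration, here is a minimal pure-Python sketch. It assumes that `FP8_DYNAMIC` uses one static scale per weight output channel and one activation scale per token computed at inference time; the function names are hypothetical and the actual cast to FP8 is omitted. Weight scales depend only on the weights, and activation scales are derived from whatever input arrives, so neither needs statistics collected in advance.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format


def per_channel_weight_scales(weight):
    """One scale per output channel (row), computed once from the weights alone."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight]


def per_token_activation_scales(activations):
    """One scale per token (row), recomputed for each incoming batch at runtime."""
    return [max(abs(a) for a in row) / FP8_E4M3_MAX for row in activations]


# Toy values, chosen only for illustration.
weight = [[0.5, -2.24, 1.0], [0.1, 0.2, -0.448]]
activations = [[4.48, -1.0, 0.3]]

w_scales = per_channel_weight_scales(weight)         # known at quantization time
a_scales = per_token_activation_scales(activations)  # known only at inference time

print("weight scales:", w_scales)
print("activation scales:", a_scales)
```

By contrast, the GPTQ-based recipe used earlier in this notebook relies on calibration samples, which is why that run processes 512 samples per layer.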

Now that we have reduced the model size, the next step is to evaluate the compressed model to confirm that accuracy has been retained after compression. This is done in step `04_Compressed_Accuracy_Benchmarking`.