## W4A16 Quantization and Compression ##

Using compressed-tensors, we can compress a quantized model to store it more efficiently on disk.

In this example, we run post-training quantization (PTQ) to quantize the weights of an example model to 4 bits. We then save a compressed version of the model to disk, packing each group of eight 4-bit weights into a single `int32`.
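To make the packing idea concrete, here is a minimal pure-Python sketch of how eight 4-bit values can share one 32-bit word. The helper names `pack_int4` and `unpack_int4` are hypothetical, and the bit layout (value `i` in bits `4*i` to `4*i + 3`, values treated as unsigned) is an assumption chosen for illustration; it is not necessarily the layout compressed-tensors uses internally.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit integers (0..15) into one 32-bit word."""
    if len(values) != 8:
        raise ValueError("expected exactly eight values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError(f"value {v} does not fit in 4 bits")
        # assumed layout: value i occupies bits [4*i, 4*i + 3]
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


nibbles = [1, 2, 3, 4, 5, 6, 7, 15]
packed = pack_int4(nibbles)
print(hex(packed))
# the round trip is lossless
assert unpack_int4(packed) == nibbles
```

Signed 4-bit values (-8 to 7) would first need to be shifted into the unsigned range, for example by adding 8, before being packed this way.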

In [12]:
import torch
import os
from tqdm import tqdm
from compressed_tensors.quantization import (
    QuantizationConfig,
    QuantizationStatus,
    apply_quantization_config
)
from compressed_tensors.compressors import ModelCompressor
from transformers import AutoModelForCausalLM, AutoTokenizer, DefaultDataCollator
from datasets import load_dataset
from torch.utils.data import RandomSampler
from torch.utils.data import DataLoader

In [13]:
# load a dense, unquantized tiny llama model
device = "cuda:0"
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype=torch.bfloat16)
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): 

The following quantization config quantizes all of the `Linear` layers to 4 bits, excluding the `lm_head` layer.

The `format` argument is set to `pack-quantized`, indicating that when the model is saved we should use the `PackedQuantizationCompressor`, which packs every eight 4-bit weights into a single `int32`.

This gives a compression ratio of 8x on each Linear layer compared to an unquantized `float32` representation, or 4x compared to the `bfloat16` weights loaded above.
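The ratio can be checked with simple arithmetic. The sketch below uses the `q_proj` shape from the model printout above; the helper `linear_weight_bytes` is a hypothetical name, and the calculation ignores the small overhead of the scale, zero point, and shape tensors.

```python
def linear_weight_bytes(out_features, in_features, bits_per_weight):
    """Bytes needed to store a weight matrix at a given bit width."""
    return out_features * in_features * bits_per_weight // 8


shape = (2048, 2048)  # q_proj: in_features=2048, out_features=2048

fp32_bytes = linear_weight_bytes(*shape, 32)
bf16_bytes = linear_weight_bytes(*shape, 16)
int4_bytes = linear_weight_bytes(*shape, 4)

print(f"float32:  {fp32_bytes / 2**20:.0f} MiB")
print(f"bfloat16: {bf16_bytes / 2**20:.0f} MiB")
print(f"int4:     {int4_bytes / 2**20:.0f} MiB")
print(f"ratio vs float32:  {fp32_bytes // int4_bytes}x")
print(f"ratio vs bfloat16: {bf16_bytes // int4_bytes}x")
```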

In [14]:
quantization_config_dict = {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "global_compression_ratio": None,
    "config_groups": {
        "group_1": {
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "strategy": "tensor"
            },
            "targets": ["Linear"]
        }
    },
    "ignore": ["lm_head"]
}

In [15]:
# setup the loaded model for quantization calibration

config = QuantizationConfig(**quantization_config_dict)
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

In [16]:
# create a dataloader of calibration data

dataset = load_dataset("ptb_text_only", trust_remote_code=True)["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

data_loader = DataLoader(
    tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator(), sampler=RandomSampler(tokenized_dataset)
)

In [17]:
# calibrate scale and zero points for quantization using a small amount of train data
num_calibration_samples = 512

with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        # stop before the forward pass so exactly num_calibration_samples are used
        if idx >= num_calibration_samples:
            break

        sample = {key: value.to(device) for key, value in sample.items()}
        _ = model(**sample)

# freeze scale and zero points after calibration
# model.apply(freeze_module_quantization)

Running calibration: 512it [00:58,  8.82it/s]


After running calibration, each quantized layer has new `scale` and `zero_point` parameters, as shown below.

Notice that at this point the weight itself is still a floating-point tensor and has not been quantized.

To convert the weights to an integer type, we need to apply the `compress_model` function. After compressing the weights, a forward pass of the model can no longer be run in PyTorch.

After compressing the quantized model with the `pack-quantized` format, weights are represented as logical int4 values packed into `int32` containers (`weight_packed`), with the original shape recorded in `weight_shape`.

This packed representation is what gets saved to disk after calling `ModelCompressor.compress_model(model)`.
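To illustrate what a scale and zero point do, here is a minimal sketch of the standard asymmetric (affine) quantization scheme over a signed 4-bit range, with both parameters derived from the observed minimum and maximum. The function names are hypothetical, and this is the textbook formulation rather than a copy of the observer logic in compressed-tensors.

```python
def calibrate_min_max(values, num_bits=4):
    """Derive a per-tensor scale and zero point from observed min/max."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8..7
    lo = min(min(values), 0.0)  # keep 0.0 exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (q_max - q_min)
    zero_point = q_min - round(lo / scale)
    return scale, max(q_min, min(q_max, zero_point))


def quantize(x, scale, zero_point, num_bits=4):
    """Map a float to an integer level, clamped to the signed range."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float."""
    return (q - zero_point) * scale


weights = [-1.0, -0.3, 0.0, 0.6, 2.75]
scale, zero_point = calibrate_min_max(weights)
quantized = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in quantized]

print(scale, zero_point)  # 0.25 -4
print(quantized)          # [-8, -5, -4, -2, 7]
print(restored)           # [-1.0, -0.25, 0.0, 0.5, 2.75]
```

In this scheme, each restored value differs from the original by at most half a scale step, as long as the original lies inside the calibrated range.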

In [18]:
state_dict = model.state_dict()
example_layer = "model.layers.0.self_attn.q_proj.weight"
scale = state_dict[example_layer + "_scale"]
zero_point = state_dict[example_layer + "_zero_point"]
weight = state_dict[example_layer]
print(f"Scale: {scale}, Zero Point: {zero_point}")
print(f"Weight min: {torch.min(weight)} max: {torch.max(weight)} dtype: {weight.dtype}")

Scale: tensor([-3.0465e+26], device='cuda:0', dtype=torch.bfloat16), Zero Point: tensor([0], device='cuda:0', dtype=torch.int8)
Weight min: -1.5859375 max: 1.03125 dtype: torch.bfloat16


In [19]:
# convert quantized weights to integers
compressor = ModelCompressor(quantization_config=config)
compressor.compress_model(model)

state_dict = model.state_dict()
example_layer = "model.layers.0.self_attn.q_proj.weight"
scale = state_dict[example_layer + "_scale"]
zero_point = state_dict[example_layer + "_zero_point"]
weight = state_dict[example_layer + "_packed"]
shape = state_dict[example_layer + "_shape"]
print(f"Compressed weight scale: {scale}, zero point: {zero_point}")
print(f"Compressed weight  dtype: {weight.dtype}")
print(f"Compressed weight shape: {weight.shape}")
print(f"Uncompressed weight shape: {shape}")

Compressing model: 154it [00:02, 59.75it/s]

Compressed weight scale: tensor([-3.0465e+26], device='cuda:0', dtype=torch.bfloat16), zero point: tensor([0], device='cuda:0', dtype=torch.int8)
Compressed weight  dtype: torch.int32
Compressed weight shape: torch.Size([2048, 256])
Uncompressed weight shape: tensor([2048, 2048], device='cuda:0')
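The packed shape printed above follows from the original shape: the 2048 columns shrink by a factor of eight to 256, while the row count is unchanged. The sketch below reproduces that relationship. The helper `packed_shape` is hypothetical, and it assumes packing happens along the last dimension with the column count rounded up, which is consistent with the output shown but is not taken from the library source.

```python
import math


def packed_shape(rows, cols, num_bits=4, container_bits=32):
    """Shape of a weight matrix after packing along its last dimension."""
    values_per_word = container_bits // num_bits  # 8 for int4 in int32
    return (rows, math.ceil(cols / values_per_word))


# q_proj weight is [out_features, in_features] = [2048, 2048]
print(packed_shape(2048, 2048))
# k_proj weight is [256, 2048]
print(packed_shape(256, 2048))
```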





In [20]:
# apply compression and save the model to disk

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compression_format = config.format
print(f"Compression format: {compression_format}")


model.save_pretrained(output_dir, state_dict=model.state_dict())
compressor.update_config(output_dir)

compressed_size_on_disk_mb = os.path.getsize(os.path.join(output_dir, "model.safetensors")) / 1024 / 1024
print(f"Size of the model's weights on disk using safetensors: {compressed_size_on_disk_mb:.2f} MB")

Compression format: pack-quantized
Size of the model's weights on disk using safetensors: 712.25 MB
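As a sanity check on the reported file size, the following back-of-the-envelope estimate uses the layer shapes from the model printout. It assumes that `lm_head` is a separate 2048-to-32000 `Linear` layer stored in `bfloat16` alongside the embedding (the printout above is truncated before that layer), and it ignores the norm weights, scales, zero points, shape tensors, and the safetensors header.

```python
MIB = 1024 * 1024

hidden, intermediate, kv_dim = 2048, 5632, 256
vocab, num_layers = 32000, 22

# parameters in the seven quantized Linear layers of one decoder layer
linear_params_per_layer = (
    2 * hidden * hidden          # q_proj, o_proj
    + 2 * hidden * kv_dim        # k_proj, v_proj
    + 3 * hidden * intermediate  # gate_proj, up_proj, down_proj
)

# 4 bits per quantized weight
packed_bytes = num_layers * linear_params_per_layer * 4 // 8

# assumption: embed_tokens and lm_head both stay in bfloat16 (2 bytes each)
unquantized_bytes = 2 * vocab * hidden * 2

estimate_mib = (packed_bytes + unquantized_bytes) / MIB
print(f"Estimated size: {estimate_mib:.2f} MiB")
```

The estimate comes to 712.00 MiB, close to the 712.25 MB measured above; the small remainder is plausibly the tensors left out of the calculation.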
