## W4A16 Quantization and Compression ##

Using compressed-tensors, we can compress a quantized model to store it more efficiently on disk.

In this example, we run post-training quantization (PTQ) to quantize the weights of an example model to 4 bits. We then save a compressed version of the model on disk by packing each group of eight 4-bit weights into a single int32.

In [None]:
import torch
import os
from tqdm import tqdm
from compressed_tensors.quantization import (
    QuantizationConfig,
    QuantizationStatus,
    apply_quantization_config,
    freeze_module_quantization,
    compress_quantized_weights
)
from compressed_tensors.compressors import ModelCompressor
from transformers import AutoModelForCausalLM, AutoTokenizer, DefaultDataCollator
from datasets import load_dataset
from torch.utils.data import DataLoader, RandomSampler

In [None]:
# load a dense, unquantized tiny llama model
device = "cuda:0"
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")
model

The following quantization config will be used to quantize all of the Linear layers to 4 bits, excluding the lm_head layer. 

The `format` argument is set to `pack-quantized`, indicating that when the model is saved we should use the `PackedQuantizationCompressor` which will pack every eight 4-bit weights into an `int32`. 

This gives a compression ratio of roughly 4x on each Linear layer compared to the unquantized `float16` representation (16 bits per weight down to 4), ignoring the small overhead of the stored scale and zero point.
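
The 4x figure follows from simple arithmetic on storage sizes. The sketch below works it through for a single weight matrix. The 2048 x 2048 shape is an illustrative assumption rather than a value read from the model, and the per-tensor scale and zero point are assumed to cost about 4 bytes each.

```python
# Illustrative shape for one Linear weight (an assumption, not read from the model).
rows, cols = 2048, 2048
num_weights = rows * cols

# float16 stores each weight in 2 bytes.
fp16_bytes = num_weights * 2

# pack-quantized stores eight 4-bit weights in each 4-byte int32.
packed_bytes = (num_weights // 8) * 4

# Per-tensor quantization adds one scale and one zero point (assumed ~4 bytes each).
overhead_bytes = 2 * 4

ratio = fp16_bytes / (packed_bytes + overhead_bytes)
print(f"float16: {fp16_bytes} bytes, packed: {packed_bytes} bytes, ratio: {ratio:.3f}x")
```

With a per-tensor strategy the overhead is a handful of bytes per layer, so the ratio stays very close to 4x.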

In [None]:
quantization_config_dict = {
    "quant_method": "sparseml",
    "format": "pack-quantized",
    "global_compression_ratio": None,
    "config_groups": {
        "group_1": {
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "strategy": "tensor"
            },
            "targets": ["Linear"]
        }
    },
    "ignore": ["lm_head"]
}

In [None]:
# setup the loaded model for quantization calibration

config = QuantizationConfig(**quantization_config_dict)
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

In [None]:
# create a dataloader of calibration data

dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

data_loader = DataLoader(
    tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator(), sampler=RandomSampler(tokenized_dataset)
)

In [None]:
# calibrate scale and zero points for quantization using a small amount of train data
num_calibration_samples = 512

with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), total=num_calibration_samples, desc="Running calibration"):
        # stop before the forward pass so exactly num_calibration_samples are used
        if idx >= num_calibration_samples:
            break

        sample = {key: value.to(device) for key, value in sample.items()}
        _ = model(**sample)

# freeze scale and zero points after calibration
model.apply(freeze_module_quantization)

After running calibration, each quantized layer has new scale and zero_point parameters, as shown below.

Notice that at this point the weight itself is still a floating-point tensor and has not been quantized.

To convert the weights to an integer type, we need to apply the `compress_quantized_weights` function. After compressing the weights, a forward pass of the model can no longer be run in PyTorch.

In [None]:
state_dict = model.state_dict()
example_layer = "model.layers.0.self_attn.q_proj.weight"
scale = state_dict[example_layer + "_scale"]
zero_point = state_dict[example_layer + "_zero_point"]
weight = state_dict[example_layer]
print(f"Scale: {scale}, Zero Point: {zero_point}")
print(f"Weight min: {torch.min(weight)} max: {torch.max(weight)} dtype: {weight.dtype}")
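
To make the roles of the scale and zero point concrete, here is a minimal pure-Python sketch of asymmetric per-tensor quantization to a signed 4-bit range. It is a textbook min/max scheme under stated assumptions, not the exact observer used by compressed-tensors; the helper names `asymmetric_qparams`, `quantize`, and `dequantize` are hypothetical.

```python
def asymmetric_qparams(values, num_bits=4):
    """Derive a per-tensor scale and zero point from the observed min/max."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8..7 for 4 bits
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # guard for an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-0.5, -0.1, 0.0, 0.3, 1.0]
scale, zero_point, qmin, qmax = asymmetric_qparams(weights)
q = quantize(weights, scale, zero_point, qmin, qmax)
restored = dequantize(q, scale, zero_point)
print(f"scale={scale}, zero_point={zero_point}")
print(f"quantized={q}")
print(f"restored={restored}")
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the precision traded away for the smaller representation.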

In [None]:
# convert quantized weights to integers
model.apply(compress_quantized_weights)

state_dict = model.state_dict()
example_layer = "model.layers.0.self_attn.q_proj.weight"
scale = state_dict[example_layer + "_scale"]
zero_point = state_dict[example_layer + "_zero_point"]
weight = state_dict[example_layer]
print(f"Scale: {scale}, Zero Point: {zero_point}")
print(f"Weight min: {torch.min(weight)} max: {torch.max(weight)} dtype: {weight.dtype}")

After compressing the quantized model, the weight values fall within the int4 range but are stored as int8.

We can further compress the model on disk using the `pack-quantized` format we specified in the config. This compression format packs the int4 weights into int32 values, eight weights per value.
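
The packing idea can be sketched in a few lines of plain Python. This is an illustration only: the slot order and the handling of signed values are assumptions made for the example and are not claimed to match the bit layout of `PackedQuantizationCompressor`. The helpers `pack_int4` and `unpack_int4` are hypothetical names.

```python
def pack_int4(values):
    """Pack eight signed 4-bit integers (-8..7) into one unsigned 32-bit word."""
    if len(values) != 8:
        raise ValueError("expected exactly eight values")
    word = 0
    for i, v in enumerate(values):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} is outside the int4 range")
        # Keep the low 4 bits (two's complement) and place them in slot i.
        word |= (v & 0xF) << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight signed 4-bit integers from a packed 32-bit word."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        # Sign-extend: nibbles 8..15 represent -8..-1.
        out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


values = [-8, -3, -1, 0, 1, 2, 5, 7]
packed = pack_int4(values)
assert 0 <= packed < 2 ** 32          # fits in 32 bits
assert unpack_int4(packed) == values  # round trip is lossless
```

Stored one per int8, these eight weights take 8 bytes; packed, they take 4, which is where the extra factor of two over the int8 representation comes from.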

In [None]:
# apply compression and save the model to disk

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compression_format = config.format
print(f"Compression format: {compression_format}")

compressor = ModelCompressor(quantization_config=config)
compressed_state_dict = compressor.compress(model)
model.save_pretrained(output_dir, state_dict=compressed_state_dict)
compressor.update_config(output_dir)

compressed_size_on_disk_mb = os.path.getsize(os.path.join(output_dir, "model.safetensors")) / 1024 / 1024
print(f"Size of the model's weights on disk using safetensors: {compressed_size_on_disk_mb:.2f} MB")
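
The size check above assumes the weights land in a single `model.safetensors` file. For larger models, `save_pretrained` may split the weights across several shard files, so a more general check can sum every `*.safetensors` file in the output directory. The helper `safetensors_size_mb` below is a hypothetical name, and the demo uses dummy files in a temporary directory rather than a real checkpoint.

```python
import glob
import os
import tempfile


def safetensors_size_mb(directory):
    """Total size in MB of every *.safetensors file directly inside `directory`."""
    paths = glob.glob(os.path.join(directory, "*.safetensors"))
    return sum(os.path.getsize(p) for p in paths) / 1024 / 1024


# Demo with dummy files: two fake shards plus an unrelated file that is ignored.
with tempfile.TemporaryDirectory() as tmp:
    dummy_files = [
        ("a.safetensors", 1024 * 1024),
        ("b.safetensors", 512 * 1024),
        ("config.json", 10),
    ]
    for name, size in dummy_files:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"\0" * size)
    demo_size_mb = safetensors_size_mb(tmp)

print(f"Total safetensors size in demo directory: {demo_size_mb:.2f} MB")
```

In the notebook, calling `safetensors_size_mb(output_dir)` would report the combined size regardless of how many shard files were written.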