# 4-bit LLM Quantization with GPTQ

This notebook demonstrates GPTQ, a post-training quantization method for generative pre-trained transformers. GPTQ quantizes weights layer by layer and uses approximate second-order (Hessian) information, estimated from calibration data, to minimize accuracy loss.

## Overview
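- **Quantization Method**: GPTQ (post-training quantization using approximate second-order information)
- **Precision**: 4-bit weights, quantized in groups that each share a scale and zero point
- **Benefits**: Weights take about a quarter of their 16-bit size, usually with only a small loss in accuracy
- **Use Case**: Production deployments that need to balance model size and quality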
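
The size benefit can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one 4-bit zero point alongside its 4-bit weights; the exact storage layout depends on the library version, and layers that are left unquantized (typically embeddings and the output head) are ignored. The function names are illustrative.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight, including per-group metadata (assumed layout)."""
    return bits + (scale_bits + zero_bits) / group_size


def compression_ratio(bits=4, group_size=128, baseline_bits=16):
    """How many times smaller the quantized weights are than a 16-bit baseline."""
    return baseline_bits / effective_bits_per_weight(bits, group_size)


print(effective_bits_per_weight())    # 4.15625
print(round(compression_ratio(), 2))  # 3.85
```

Smaller groups raise the per-weight overhead: with `group_size=32` the same assumptions give 4.625 bits per weight.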

## Key Parameters
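- **bits**: Number of bits per quantized weight (typically 4)
- **group_size**: Number of weights that share one scale and zero point (128 is common)
- **damp_percent**: Damping added to the Hessian diagonal, as a fraction of its mean value, for numerical stability
- **desc_act**: Whether to quantize columns in order of decreasing activation size ("act-order")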
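
To make `bits` and `group_size` concrete, here is a minimal sketch of group-wise quantization in plain Python. It uses simple round-to-nearest, not GPTQ's error-compensating update, so it only illustrates how groups, scales, and offsets work. All function names are illustrative and not part of AutoGPTQ.

```python
import random


def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0  # constant group: any scale works
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize and dequantize a row, one group of `group_size` weights at a time."""
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        out.extend(dequantize_group(*quantize_group(group, bits)))
    return out


def mean_abs_error(row, group_size):
    """Average absolute reconstruction error for a given group size."""
    approx = quantize_row(row, group_size=group_size)
    return sum(abs(a - b) for a, b in zip(row, approx)) / len(row)


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(512)]
row[7] = 0.5  # a single outlier stretches the range of whichever group contains it

for g in (512, 128, 32):
    print(g, mean_abs_error(row, g))
```

With smaller groups the outlier only degrades the precision of its own group, so the average error drops, at the cost of storing more scales and offsets.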


## Step 1: Install Dependencies

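Install AutoGPTQ, transformers, and datasets (used to load the calibration data in Step 4).


In [None]:
# Install AutoGPTQ and required libraries
# BUILD_CUDA_EXT=0 skips compiling AutoGPTQ's CUDA extension during installation
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers datasets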


## Step 2: Import Libraries and Configure Model


In [None]:
import random
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer

# Define model configuration
model_id = "gpt2"
out_dir = model_id + "-GPTQ"

print(f"Model: {model_id}")
print(f"Output directory: {out_dir}")


## Step 3: Load Model and Configure Quantization

Set up the quantization configuration and load the model.


In [None]:
# Configure quantization parameters
quantize_config = BaseQuantizeConfig(
    bits=4,                # 4-bit weights
    group_size=128,        # Weights per group sharing one scale and zero point
    damp_percent=0.01,     # Damping as a fraction of the mean Hessian diagonal
    desc_act=False,        # Do not reorder columns by activation size
)

# Load model and tokenizer
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("✓ Model and tokenizer loaded")
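
The `damp_percent` value configured above controls how much damping is added to the diagonal of each layer's Hessian estimate before it is inverted. The sketch below shows the usual rule (add `damp_percent` times the mean diagonal entry to every diagonal entry) on a tiny 2x2 matrix. The helpers `dampen` and `det2` are illustrative and not part of AutoGPTQ.

```python
def dampen(hessian, damp_percent=0.01):
    """Return a copy of a square matrix with damping added to its diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


H = [[1.0, 1.0], [1.0, 1.0]]  # singular: cannot be inverted as-is

print(det2(H))               # 0.0
print(det2(dampen(H)) > 0)   # True
```

A larger `damp_percent` makes the inversion more stable but moves the result further from the true Hessian.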


## Step 4: Prepare Calibration Data

Load one shard of the C4 dataset, join its documents into a single text, and tokenize it as one long sequence for calibration.


In [None]:
# Load calibration dataset
n_samples = 1024
print(f"Loading {n_samples} samples from C4 dataset...")

data = load_dataset(
    "allenai/c4", 
    data_files="en/c4-train.00001-of-01024.json.gz", 
    split=f"train[:{n_samples*5}]"
)

# Join all documents and tokenize them as one long sequence
tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')
print(f"✓ Dataset loaded and tokenized ({tokenized_data.input_ids.shape[1]} tokens)")


## Step 5: Format Calibration Examples

Sample random fixed-length windows from the tokenized text. Each calibration example is a dictionary with `input_ids` and `attention_mask`.


In [None]:
# Sample n_samples random windows of model_max_length tokens for calibration
seq_len = tokenizer.model_max_length
n_tokens = tokenized_data.input_ids.shape[1]
examples_ids = []
for _ in range(n_samples):
    # random.randint is inclusive on both ends, so j never exceeds n_tokens
    i = random.randint(0, n_tokens - seq_len - 1)
    j = i + seq_len
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)  # no padding, attend to every token
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

print(f"✓ Prepared {len(examples_ids)} calibration examples")
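
The window sampling in this step can be checked in isolation, without a tokenizer or tensors. The sketch below applies the same idea to a plain list of integers; `sample_windows` and its `seed` parameter are hypothetical additions for illustration and do not appear in the notebook.

```python
import random


def sample_windows(token_ids, window, n_samples, seed=0):
    """Draw n_samples random contiguous windows of length `window` from token_ids."""
    if len(token_ids) < window:
        raise ValueError("need at least `window` tokens to sample from")
    rng = random.Random(seed)             # local generator, reproducible per seed
    last_start = len(token_ids) - window  # largest start index that still fits
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, last_start)  # inclusive on both ends
        windows.append(token_ids[start:start + window])
    return windows


tokens = list(range(10_000))  # stand-in for real token ids
windows = sample_windows(tokens, window=1024, n_samples=8)
print(len(windows), {len(w) for w in windows})  # 8 {1024}
```

Using a seeded generator makes the calibration set reproducible between runs, which helps when comparing quantization settings.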


## Step 6: Quantize the Model

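Run GPTQ quantization and save the result. The cell is commented out because quantization can take a long time; uncomment it to execute. The later steps load the model from `out_dir`, so they only work after this cell has been run.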


In [None]:
# Quantize the model with GPTQ
# Uncomment the following lines to run quantization
# This can take significant time depending on model size

# %%time
# model.quantize(
#     examples_ids,
#     batch_size=1,
#     use_triton=True,  # needs the triton package and a CUDA GPU; set to False otherwise
# )
# 
# # Save quantized model and tokenizer
# model.save_quantized(out_dir, use_safetensors=True)
# tokenizer.save_pretrained(out_dir)
# print(f"✓ Model quantized and saved to {out_dir}")
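
Because the quantization cell is commented out by default, the loading step that follows fails if `out_dir` has not been written yet. A minimal guard is sketched below. It assumes only that saving creates a non-empty directory and makes no assumption about the file names inside it. `has_saved_model` is a hypothetical helper.

```python
import os


def has_saved_model(path):
    """Return True if `path` is an existing, non-empty directory."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


out_dir = "gpt2-GPTQ"  # same value as defined in Step 2
if not has_saved_model(out_dir):
    print(f"'{out_dir}' not found or empty: run the quantization cell first")
```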


## Step 7: Load and Test Quantized Model

Load the quantized model and tokenizer saved in Step 6. This requires `out_dir` to exist, so run the quantization cell first.


In [None]:
# Determine device
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the quantized model
model = AutoGPTQForCausalLM.from_quantized(
    out_dir,
    device=device,
    use_triton=True,  # Triton kernels need a CUDA GPU; set to False when device is "cpu"
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)

print("✓ Quantized model loaded successfully")


## Step 8: Run Inference

Generate text using the quantized model.


In [None]:
from transformers import pipeline

# Create text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate text
prompt = "I have a dream"
result = generator(prompt, do_sample=True, max_length=50)[0]['generated_text']

print("Generated text:")
print("-" * 50)
print(result)
