# Demo: Post-Training Quantization (PTQ) with GPT-2 & bitsandbytes

**Welcome!**

In this demo, we'll explore one of the most powerful and easy-to-use techniques for making large models more efficient: **Post-Training Quantization (PTQ)**.

**Our Goal:** We'll take a standard `GPT-2` model and load it in three different ways:
1.  **Baseline:** In half precision (FP16), a common starting point on GPUs.
2.  **8-bit Quantized:** A significantly smaller version.
3.  **4-bit Quantized:** An even more aggressively compressed version.

We will measure the memory footprint at each step to see the dramatic savings firsthand. The focus is on the simplicity offered by modern tools like Hugging Face `transformers` and `bitsandbytes`.

## 1. Environment Setup

First, let's install the necessary libraries. We need:
- `transformers`: For loading our pre-trained GPT-2 model.
- `torch`: The core deep learning framework.
- `bitsandbytes`: The magic library that enables easy quantization on-the-fly.
- `accelerate`: A helper library from Hugging Face that simplifies running PyTorch on different hardware (like multiple GPUs).
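
A quick way to see which of these packages are already present, and at which version, is the standard-library `importlib.metadata` module. The sketch below is optional and can be run before or after the install cell; the helper name `installed_versions` and the `REQUIRED` tuple are illustrative and not part of the original notebook.

```python
from importlib import metadata

# Package names as they would be passed to pip (illustrative constant).
REQUIRED = ("transformers", "torch", "bitsandbytes", "accelerate")

def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Recording the versions alongside the results makes the memory numbers easier to reproduce later, since quantization defaults can change between releases.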

In [1]:
!pip install transformers torch bitsandbytes accelerate

Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting accelerate
  Downloading accelerate-1.8.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.33.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting hf-xet<2.0.0,>=1.1.2 (from huggingface-hub<1.0,>=0.30.0->transformers)
  Downloading hf_xet-1.1.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (879 bytes)
Downloading transformers-4.52.4-py3-none-any.whl (10.5 MB)

## 2. Imports and Configuration

Now, we'll import the required modules and set up some basic configurations, like the model name we want to use. We'll also check if a CUDA-enabled GPU is available, as `bitsandbytes` is optimized for this hardware.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model we want to use
MODEL_NAME = "gpt2"

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if device.type == 'cpu':
    print("\nWARNING: bitsandbytes 8-bit and 4-bit quantization is primarily designed for CUDA GPUs. Memory savings and performance gains will not be apparent on a CPU.")

Using device: cuda


## 3. Helper Function: Measure Model Memory

To compare the models fairly, we need a consistent way to measure how much memory each one occupies. This helper sums the bytes used by all of the model's parameters and buffers and converts the total to megabytes (strictly speaking mebibytes, since it divides by 1024²).

In [3]:
def get_model_memory_footprint(model):
    """Calculates and returns the model's memory footprint in MB."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024 ** 2) # Convert bytes to MB
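
To make the formula concrete, here is a minimal sketch in plain Python of the same arithmetic for a single tensor: number of elements times bytes per element. The helper name `tensor_mib` and the `BYTES_PER_ELEMENT` table are illustrative, and the 768 × 3072 shape is just an example matrix.

```python
import math

# Bytes per element for a few common storage types (illustrative table).
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}

def tensor_mib(shape, dtype):
    """Size of one dense tensor in MiB: element count times element size."""
    num_elements = math.prod(shape)
    return num_elements * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {tensor_mib((768, 3072), dtype):.2f} MiB")
# float32: 9.00 MiB
# float16: 4.50 MiB
# int8: 2.25 MiB
```

Hugging Face models also provide a built-in `get_memory_footprint()` method that returns a byte count; the hand-written helper is kept here because it shows exactly what is being summed.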

## 4. The Quantization Experiment

Let's begin the experiment! We'll start by initializing our tokenizer and a dictionary to store the memory footprint results for comparison later.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

memory_footprints = {}


### Step 4.1: Load the Baseline Model (FP16)

First, we load the standard GPT-2 model. This will be our **baseline**. On a GPU, it's common to use `float16` (FP16) precision, which already halves the weight storage relative to `float32`. This gives us a realistic starting point to compare against.

In [5]:
print("--- 1. Loading Baseline Model (FP16) ---")
baseline_name = "FP16 (Baseline)"

# Load the model in float16 and move it to the GPU
model_baseline = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, 
    torch_dtype=torch.float16 # Use half-precision
).to(device)

memory_baseline = get_model_memory_footprint(model_baseline)
memory_footprints[baseline_name] = f"{memory_baseline:.2f} MB"
print(f"Loaded '{baseline_name}' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")

--- 1. Loading Baseline Model (FP16) ---


Loaded 'FP16 (Baseline)' model.
Memory Footprint: 249.35 MB


### Step 4.2: Load the 8-bit Quantized Model

Now for our first optimization. Setting **`load_in_8bit=True`** instructs `bitsandbytes` to quantize the weights of the model's linear layers to 8-bit integers as the model is being loaded; embeddings, biases, and layer norms stay in 16-bit. Recent `transformers` releases deprecate passing this flag directly to `from_pretrained`, so we wrap it in a `BitsAndBytesConfig` and pass that as `quantization_config`. We also use **`device_map="auto"`**, which is highly recommended to let the library handle placing the model on the correct device(s) efficiently.

In [6]:
from transformers import BitsAndBytesConfig

print("\n--- 2. Loading Model with 8-bit Quantization ---")
quant_8bit_name = "INT8 (bitsandbytes)"

if device.type == "cuda":
    model_8bit = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, 
        quantization_config=BitsAndBytesConfig(load_in_8bit=True), 
        device_map="auto" # Recommended for bitsandbytes
    )
    memory_8bit = get_model_memory_footprint(model_8bit)
    memory_footprints[quant_8bit_name] = f"{memory_8bit:.2f} MB"
    print(f"Loaded '{quant_8bit_name}' model.")
    print(f"Memory Footprint: {memory_8bit:.2f} MB")
else:
    print("Skipping 8-bit quantization as CUDA is not available.")
    memory_footprints[quant_8bit_name] = "N/A (CPU)"



--- 2. Loading Model with 8-bit Quantization ---
Loaded 'INT8 (bitsandbytes)' model.
Memory Footprint: 168.35 MB
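
To give a feel for what "quantizing to 8-bit integers" means, here is a simplified sketch of absmax quantization for one row of weights in plain Python. It is an illustration under stated assumptions, not the `bitsandbytes` kernel: the real LLM.int8() scheme additionally keeps outlier features in 16-bit. The function names are hypothetical.

```python
def quantize_absmax_int8(row):
    """Quantize one row of floats to int8 codes plus a single float scale."""
    absmax = max(abs(x) for x in row)
    if absmax == 0.0:
        return [0] * len(row), 0.0
    # Map the largest magnitude to 127 and round everything else onto that grid.
    codes = [round(x / absmax * 127) for x in row]
    return codes, absmax

def dequantize_absmax_int8(codes, absmax):
    """Recover approximate floats from the int8 codes and the stored scale."""
    return [c * absmax / 127 for c in codes]

weights = [0.1, -0.4, 0.25, 1.0]
codes, absmax = quantize_absmax_int8(weights)
restored = dequantize_absmax_int8(codes, absmax)

print(codes)  # [13, -51, 32, 127]
print([round(w, 4) for w in restored])
```

Each weight now costs one byte instead of two, at the price of a rounding error of at most half a grid step (`absmax / 254`) per value.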


### Step 4.3: Load the 4-bit Quantized Model (NF4)

Let's get even more aggressive. Setting **`load_in_4bit=True`** stores each quantized weight in just 4 bits. Note that `BitsAndBytesConfig` defaults to the `fp4` data type, so we request **NF4** (NormalFloat 4-bit) explicitly with `bnb_4bit_quant_type="nf4"`. NF4 places its 16 levels according to the quantiles of a normal distribution, which suits the roughly bell-shaped distribution of weights typically found in neural networks.

In [7]:
from transformers import BitsAndBytesConfig

print("\n--- 3. Loading Model with 4-bit Quantization (NF4) ---")
quant_4bit_nf4_name = "NF4 (4-bit bitsandbytes)"

if device.type == "cuda":
    model_4bit_nf4 = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, 
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True, 
            bnb_4bit_quant_type="nf4" # The default would be "fp4"
        ), 
        device_map="auto"
    )
    memory_4bit_nf4 = get_model_memory_footprint(model_4bit_nf4)
    memory_footprints[quant_4bit_nf4_name] = f"{memory_4bit_nf4:.2f} MB"
    print(f"Loaded '{quant_4bit_nf4_name}' model.")
    print(f"Memory Footprint: {memory_4bit_nf4:.2f} MB")
else:
    print("Skipping 4-bit NF4 quantization as CUDA is not available.")
    memory_footprints[quant_4bit_nf4_name] = "N/A (CPU)"



--- 3. Loading Model with 4-bit Quantization (NF4) ---
Loaded 'NF4 (4-bit bitsandbytes)' model.
Memory Footprint: 127.85 MB
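
The idea behind a 4-bit "normal quantile" data type can be sketched in a few lines of plain Python. The codebook below is built from evenly spaced quantiles of a standard normal distribution and is only an approximation for illustration: it is not the exact NF4 table used by `bitsandbytes` (which, among other differences, contains an exact zero and works on larger blocks). All function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_codebook(levels=16):
    """Build `levels` values from evenly spaced normal quantiles, scaled to [-1, 1]."""
    nd = NormalDist()
    # Probabilities strictly inside (0, 1) so the inverse CDF stays finite.
    probs = [(i + 0.5) / levels for i in range(levels)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(block, codebook):
    """Return one 4-bit index per value plus the block's absmax scale."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x / absmax))
        for x in block
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Look each index up in the codebook and rescale by the block's absmax."""
    return [codebook[i] * absmax for i in indices]

codebook = normal_quantile_codebook()
block = [0.02, -0.5, 0.31, 0.0, 0.9, -0.07, 0.12, -0.9]
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)

print(indices)
print([round(x, 3) for x in restored])
```

Because there are only 16 levels, each index fits in 4 bits, so two weights can share one byte. That is where the half byte per quantized weight comes from.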


## 5. Analyze the Results: Memory Footprint Summary

Now for the moment of truth! Let's print the memory footprints we recorded for each model. 

In [8]:
print("--- Memory Footprint Summary ---")
for name, mem in memory_footprints.items():
    print(f"{name}: {mem}")

--- Memory Footprint Summary ---
FP16 (Baseline): 249.35 MB
INT8 (bitsandbytes): 168.35 MB
NF4 (4-bit bitsandbytes): 127.85 MB
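
The savings are smaller than a naive 2x and 4x, and a short back-of-the-envelope calculation suggests why. The sketch below assumes the standard GPT-2 small dimensions (12 layers, hidden size 768, vocabulary 50257, 1024 positions), assumes that only the four projection matrices in each transformer block are quantized while everything else stays in FP16, and assumes each attention layer carries a 1024 × 1024 boolean causal-mask buffer. These are assumptions for illustration, not values read from the loaded models, but under them the arithmetic reproduces the three measured numbers.

```python
MIB = 1024 ** 2

# Assumed GPT-2 small dimensions (constants chosen for this sketch).
VOCAB, POSITIONS, HIDDEN, LAYERS = 50257, 1024, 768, 12

# Weight matrices of the four projections in each transformer block.
block_matrices = (
    HIDDEN * 3 * HIDDEN      # attention c_attn (query, key, value)
    + HIDDEN * HIDDEN        # attention c_proj
    + HIDDEN * 4 * HIDDEN    # MLP c_fc
    + 4 * HIDDEN * HIDDEN    # MLP c_proj
)
block_biases = 3 * HIDDEN + HIDDEN + 4 * HIDDEN + HIDDEN
block_layernorms = 2 * 2 * HIDDEN  # two LayerNorms, weight and bias each

quantizable = LAYERS * block_matrices
other = (
    VOCAB * HIDDEN + POSITIONS * HIDDEN           # token and position embeddings
    + LAYERS * (block_biases + block_layernorms)
    + 2 * HIDDEN                                  # final LayerNorm
)
mask_buffer_bytes = LAYERS * POSITIONS * POSITIONS  # one byte per boolean entry

def footprint_mib(bytes_per_quantizable_weight):
    """Estimated footprint when only the block matrices change width."""
    total = (
        quantizable * bytes_per_quantizable_weight
        + other * 2            # everything else assumed to stay in FP16
        + mask_buffer_bytes
    )
    return total / MIB

print(f"Quantizable share of parameters: {quantizable / (quantizable + other):.1%}")
for label, width in (("FP16", 2), ("INT8", 1), ("4-bit", 0.5)):
    print(f"{label}: {footprint_mib(width):.2f} MiB")
# FP16: 249.35 MiB
# INT8: 168.35 MiB
# 4-bit: 127.85 MiB
```

For a model as small as GPT-2, the embedding table alone is a large fixed cost that quantization leaves untouched; in larger models the transformer blocks dominate, so the relative savings move closer to the ideal ratios.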


## 6. Sanity Check: Does the Quantized Model Still Work?

A smaller model is useless if it can't generate text. Let's do a quick sanity check with our most compressed model (the 4-bit NF4 version) to confirm that it's still functional and can complete a prompt.

In [9]:
print("\n--- Sanity Check: Text Generation with 4-bit NF4 Model ---")

prompt = "The future of artificial intelligence is"
MAX_NEW_TOKENS_DEMO = 25

if 'model_4bit_nf4' in locals() and model_4bit_nf4 is not None:
    inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit_nf4.device)
    
    # Generate text
    with torch.no_grad():
        outputs = model_4bit_nf4.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS_DEMO, pad_token_id=tokenizer.eos_token_id)
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(f"Prompt: {prompt}")
    print(f"Generated by NF4 model: {generated_text}")
else:
    print("4-bit NF4 model was not loaded, skipping generation sanity check.")



--- Sanity Check: Text Generation with 4-bit NF4 Model ---




Prompt: The future of artificial intelligence is
Generated by NF4 model: The future of artificial intelligence is uncertain.

"I don't know if we're going to see it, but I think it's going to be


## 7. Cleanup and Conclusion

Finally, it's good practice to explicitly delete the models and clear the GPU cache to free up memory.

In [10]:
# Clean up models from memory
del model_baseline
if 'model_8bit' in locals(): del model_8bit
if 'model_4bit_nf4' in locals(): del model_4bit_nf4

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("\nCleaned up models and emptied CUDA cache.")


Cleaned up models and emptied CUDA cache.
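
Note that `del` only removes a name; memory is returned once nothing else references the model. A slightly more defensive variant, sketched below, runs the garbage collector first and then reports how much CUDA memory PyTorch still holds. The helper name `release_cuda_memory` is hypothetical, and the function deliberately returns `None` when `torch` or a CUDA device is unavailable.

```python
import gc

def release_cuda_memory():
    """Collect garbage, empty the CUDA cache, and return bytes still allocated.

    Returns None when torch is not installed or no CUDA device is available.
    """
    gc.collect()  # free unreachable objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated()

remaining = release_cuda_memory()
if remaining is None:
    print("CUDA unavailable, nothing to release.")
else:
    print(f"Still allocated on GPU: {remaining / 1024**2:.2f} MiB")
```

A non-zero result after cleanup usually means some tensor (for example `inputs` or `outputs` from the sanity check) is still referenced.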


### Key Takeaway from Demo:

This demo shows how little code Post-Training Quantization takes with modern libraries. With a single quantization setting at load time (`load_in_8bit=True` or `load_in_4bit=True`), the measured footprint of GPT-2 dropped from 249.35 MB in FP16 to 168.35 MB in 8-bit (about 32% smaller) and 127.85 MB in 4-bit (about 49% smaller). The same technique is what makes it possible to run larger models on consumer-grade hardware.

While we focused on memory here, the trade-offs with inference speed and output quality are critical real-world considerations that will be explored in the subsequent exercise.