## **Notebook 6.4: Practical Guide to LLM Quantization Techniques and Frameworks**

## **Introduction**

🎉 Welcome to **Notebook 6.4**, where we dive deeper into **quantization techniques** specifically for **large language models (LLMs)**! 🚀 In the previous notebook (**Notebook 6.3**), we implemented a **model-agnostic linear 8-bit quantizer** and observed the performance improvements for **transformer models**. But that was just the beginning! **Linear quantization** is a great first step, but as models scale up we encounter **outliers**, which present new challenges.

💡 **The Challenge?**  
As **transformer models** scale up, **outliers** become a major concern. In this context, outliers are values that are much larger or smaller in magnitude than the rest of the data. Because linear quantization sets its range from the extreme values, a single outlier stretches that range and leaves only a few integer levels for the typical values, which leads to **poor approximation** during quantization. As a result, plain **linear quantization** delivers **limited performance** as model size increases.  
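
To make the effect concrete, here is a minimal sketch in plain Python. It assumes symmetric absmax quantization (the scale is taken from the largest magnitude), and the values and the helper names `absmax_quantize` and `mean_abs_error` are hypothetical, chosen only for illustration.

```python
def absmax_quantize(values, bits=8):
    """Symmetric linear quantization: scale by the largest magnitude, then round."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def mean_abs_error(values, bits=8):
    """Average round-trip error after quantize -> dequantize."""
    quantized, scale = absmax_quantize(values, bits)
    restored = [q * scale for q in quantized]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Hypothetical values: small and evenly spread in [-1, 1].
normal_values = [i / 50 - 1.0 for i in range(101)]
# The same values plus a single large outlier.
with_outlier = normal_values + [60.0]

error_without = mean_abs_error(normal_values)
error_with = mean_abs_error(with_outlier)
print(f"mean abs error without outlier: {error_without:.5f}")
print(f"mean abs error with outlier:    {error_with:.5f}")
```

With the outlier present, the scale grows by a factor of 60, so the small values collapse onto a handful of integer levels and the average error rises sharply.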

💡 **The Solution?**  
To address these issues, we need more advanced quantization techniques that can better handle the presence of **outliers**. This is where **Post-Training Quantization (PTQ)** techniques come into play. PTQ optimizes a model after it has been trained, without requiring retraining. Well-designed PTQ methods handle outliers explicitly and can largely **maintain model accuracy**, even for larger and more complex models.

### **Why This Notebook?**  
In this notebook, we will explore **advanced quantization techniques** that tackle the **outlier problem** and utilize **Post-Training Quantization (PTQ)**. Specifically, we will look into techniques and frameworks that focus on **dynamic** and **static quantization** methods, both of which are highly effective for **large-scale models** like LLMs. 

<p align="center">
  <img src="images/Q_TECH.png" alt="Quantization Overview">
</p>


## **What’s Inside?**

🔍 Here’s what we will cover in this notebook:  

### **1️⃣ Understanding the Outlier Problem in Quantization**  
✅ What are **outliers**, and why do they affect **large model quantization**?  
✅ How do **outliers** impact performance when using **linear quantization**?  

### **2️⃣ Overview of Quantization Techniques for LLMs**  
✅ **QAT (Quantization-Aware Training)** vs **PTQ (Post-Training Quantization)**  
✅ When to use **PTQ** and why it’s well-suited for large models  

### **3️⃣ Static vs Dynamic Quantization**  
✅ Understanding the difference between **static** and **dynamic quantization**  
✅ How these methods work with **Post-Training Quantization**  

### **4️⃣ Exploring PTQ Frameworks and Tools**  
✅ Overview of popular **PTQ frameworks** and **tools**  
✅ How these frameworks help manage outliers and improve model efficiency  

### **5️⃣ Implementing PTQ in Large Language Models**  
✅ Practical guide to applying PTQ to **transformer models**  
✅ Step-by-step walkthrough using a **PTQ framework**  

## **Why This Notebook Matters**

Large language models are becoming increasingly powerful, but their deployment in resource-constrained environments is a challenge. **Quantization** is key to **making these models smaller, faster, and more efficient**, but as models scale, the **outlier problem** becomes more prominent. In this notebook, you will:

✅ Learn about the **outlier problem** in large-scale model quantization  
✅ Understand the distinction between **QAT** and **PTQ**  
✅ Gain hands-on experience with **static** and **dynamic quantization** techniques  
✅ Explore leading **PTQ frameworks** and how they handle outliers  
✅ Apply PTQ techniques to **real-world transformer models**  

## **Looking Ahead**

With the techniques and frameworks covered in this notebook, you’ll be able to optimize **large language models** with **Post-Training Quantization** and make them more deployable. But we’re not stopping here—there are still many advanced quantization strategies to explore in future notebooks, as the field is rapidly evolving.

Ready to tackle the **outlier problem** and unlock the true potential of **large language model quantization**? Let’s dive in! 🚀


## **Difference Between QAT and PTQ, Static and Dynamic Quantization**

### **QAT (Quantization-Aware Training) vs PTQ (Post-Training Quantization)**

- **QAT (Quantization-Aware Training)**:
  - **Definition**: QAT involves simulating quantization during the training process. The model learns to adapt to the quantization noise by training with quantized weights and activations.
  - **Use Case**: Best suited when training a model from scratch or fine-tuning it. The model becomes aware of the quantization effects, leading to better accuracy after quantization.
  - **Pros**: Usually retains more accuracy after quantization than PTQ, especially at low bit widths.
  - **Cons**: Requires retraining, which can be computationally expensive and time-consuming.

- **PTQ (Post-Training Quantization)**:
  - **Definition**: PTQ quantizes a pre-trained model without the need for retraining. This is done after the model has been fine-tuned, typically by converting the weights to lower precision (e.g., INT8).
  - **Use Case**: Ideal for deploying pre-trained models efficiently without needing retraining.
  - **Pros**: Faster and less computationally expensive than QAT.
  - **Cons**: May lead to a slight drop in accuracy due to the lack of fine-tuning for quantization.
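
The difference between the two can be sketched with a "fake quantization" helper, which rounds a value to the integer grid and immediately converts it back to a float. This is a minimal illustration in plain Python, not how any particular framework implements it. The weights, the function names `fake_quantize` and `qat_forward`, and the absmax scale are assumptions made for the example. In real QAT, gradients are commonly passed through the rounding step with a straight-through estimator, which is omitted here.

```python
def fake_quantize(weight, scale, bits=8):
    """Quantize then immediately dequantize, so the value stays a float
    but carries the rounding error that integer storage would introduce."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(weight / scale)))  # round and clamp
    return q * scale


weights = [0.31, -0.74, 0.05, 0.98]         # hypothetical trained weights
scale = max(abs(w) for w in weights) / 127  # absmax scale for 8 bits

# PTQ: training is finished; the weights are rounded once and deployed as-is.
ptq_weights = [fake_quantize(w, scale) for w in weights]


# QAT: the same rounding is applied inside every training forward pass, so the
# loss is computed on rounded weights and the optimizer can compensate for it.
def qat_forward(inputs, weights, scale):
    rounded = [fake_quantize(w, scale) for w in weights]
    return sum(x * w for x, w in zip(inputs, rounded))


print(ptq_weights)
print(qat_forward([1.0, 1.0, 1.0, 1.0], weights, scale))
```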


<p align="center">
  <img src="images/QAT_PTQ.png" alt="Quantization Overview">
</p>


### **Static Quantization vs Dynamic Quantization**

- **Static Quantization**:
  - **Definition**: Static quantization calibrates the model with a representative dataset to determine fixed quantization parameters (like scaling factors) for weights and activations before deployment.
  - **Use Case**: Suited to deployments where the input distribution is known in advance and inference latency matters, since no quantization parameters have to be computed at runtime.
  - **Pros**: Lowest runtime overhead, because activation ranges are precomputed.
  - **Cons**: Requires calibration data, which may not always be available, and accuracy can suffer if real inputs differ from the calibration set.

- **Dynamic Quantization**:
  - **Definition**: Dynamic quantization quantizes the weights ahead of time and computes the quantization parameters for activations on the fly during inference, from the values actually observed.
  - **Use Case**: Ideal when you want quick deployment without requiring a calibration step.
  - **Pros**: Easier to set up than static quantization, and it adapts to the range of each input.
  - **Cons**: Adds runtime overhead for computing activation ranges, so it can be slower than static quantization.
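
The following sketch contrasts the two approaches for activations, assuming symmetric absmax scaling. The calibration batches, the runtime input, and the helper names `absmax_scale` and `quantize` are hypothetical and serve only to illustrate where the scale comes from in each case.

```python
def absmax_scale(values, bits=8):
    """Scale that maps the largest magnitude onto the top integer level."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


def quantize(values, scale, bits=8):
    """Round to integers and clamp to the representable range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical calibration batches collected before deployment.
calibration_batches = [[0.2, -1.5, 0.7], [1.1, -0.4, 2.0]]

# Static: one activation scale, fixed ahead of time from calibration data.
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])

# An input seen at inference time, with a larger range than the calibration data.
runtime_activations = [0.5, -3.0, 1.0]

# Dynamic: the activation scale is recomputed from each incoming input.
dynamic_scale = absmax_scale(runtime_activations)

# Quantize, then dequantize, to see what each approach can represent.
static_restored = [q * static_scale for q in quantize(runtime_activations, static_scale)]
dynamic_restored = [q * dynamic_scale for q in quantize(runtime_activations, dynamic_scale)]
print("static: ", static_restored)   # -3.0 is clipped to the calibrated range
print("dynamic:", dynamic_restored)  # -3.0 is representable
```

The static path clips the unexpected value because its range was fixed during calibration, while the dynamic path pays for its flexibility by computing a new scale for every input.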

### **Our Focus: PTQ for LLM Deployment**

In this notebook, we will focus on **Post-Training Quantization (PTQ)**, as our goal is to optimize **large language models (LLMs)** for **deployment after fine-tuning**. PTQ allows us to efficiently reduce model size and improve inference speed without retraining the model, which is key for deploying large models in resource-constrained environments.

In [1]:
import torch 
from transformers import AutoTokenizer, AutoModelForCausalLM 



## **1- BitsAndBytes**

- **Definition**: `BitsAndBytes` (the `bitsandbytes` package) is a library for efficient model quantization in PyTorch. It quantizes pre-trained weights at load time, which makes it a **Post-Training Quantization (PTQ)** tool, and it also supports memory-efficient fine-tuning through 8-bit optimizers and **QLoRA**.
  
- **Core Functionality**:
  - **Quantization**: Provides 8-bit (LLM.int8()) and 4-bit weight quantization for large models.
  - **Efficient Memory Usage**: Reduces model size and memory footprint, typically with only a small loss in accuracy.
  - **Performance Optimization**: Lets larger models fit on a given GPU. Inference speed depends on the model and hardware and is not guaranteed to improve over 16-bit inference.

- **Under the Hood**:
  - **Vector-wise Quantization**: In the 8-bit path, each row of the activations and each column of the weights gets its own scaling constant, so every dot product in a matrix multiplication is quantized with scales that fit its own vectors.
  - **Mixed-Precision Decomposition**: Feature dimensions that contain outliers are multiplied in 16-bit precision, while all remaining dimensions use INT8. This is how the library addresses the outlier problem.
  - **8-bit Optimizers**: Stores optimizer states in 8 bits using block-wise quantization, reducing memory use during training and fine-tuning.
  - **NF4 (4-bit NormalFloat)**: A 4-bit data type designed for normally distributed weights, introduced with **QLoRA** (Quantized Low-Rank Adaptation). The weights are quantized once at load time without calibration data and stay frozen while low-rank adapters are trained.
  
- **Framework Support**: Built for PyTorch and integrated into Hugging Face Transformers.

The simplest way to use it is through the Hugging Face Transformers library, as follows:

In [21]:
from transformers import BitsAndBytesConfig


model_name = "facebook/opt-125m"

# Configure 8-bit quantization
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_8bit,
    device_map="auto",  # device placement belongs here, not in the config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 8-bit quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 8-bit quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
          (fc2): L
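
The `Linear8bitLt` layers in the printout above perform their matrix multiplication with vector-wise INT8 quantization plus a higher-precision path for outlier dimensions. The sketch below imitates that idea in plain Python on tiny hypothetical matrices. The function name `int8_matmul_with_outliers` is made up for illustration, the threshold of 6.0 is assumed from the default outlier threshold of LLM.int8(), and the real library runs this on the GPU with dedicated kernels rather than Python loops.

```python
def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Sketch of vector-wise INT8 matmul with mixed-precision decomposition.

    X is a list of activation rows and W a list of weight rows; returns X @ W.
    """
    k, m = len(W), len(W[0])
    # Feature dimensions containing at least one large activation stay in float.
    outlier_dims = {j for j in range(k) if any(abs(row[j]) >= threshold for row in X)}
    regular_dims = [j for j in range(k) if j not in outlier_dims]

    # Vector-wise scales: one per row of X and one per column of W.
    # `or 1.0` avoids division by zero for an all-zero vector.
    x_scales = [(max((abs(row[j]) for j in regular_dims), default=0.0) / 127) or 1.0
                for row in X]
    w_scales = [(max((abs(W[j][c]) for j in regular_dims), default=0.0) / 127) or 1.0
                for c in range(m)]

    out = []
    for i, row in enumerate(X):
        out_row = []
        for c in range(m):
            # Integer dot product over the regular dimensions, then rescale.
            acc = sum(round(row[j] / x_scales[i]) * round(W[j][c] / w_scales[c])
                      for j in regular_dims)
            value = acc * x_scales[i] * w_scales[c]
            # Outlier dimensions are multiplied in full precision and added back.
            value += sum(row[j] * W[j][c] for j in outlier_dims)
            out_row.append(value)
        out.append(out_row)
    return out


# Hypothetical activations: the third feature dimension holds an outlier (40.0).
X = [[0.5, -1.0, 40.0], [1.5, 0.25, -0.5]]
W = [[0.2, -0.3], [0.7, 0.1], [-0.4, 0.6]]
print(int8_matmul_with_outliers(X, W))
```

Because the outlier dimension never enters the INT8 path, it cannot inflate the scales used for the remaining values.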

Let's load the original model without BitsAndBytes and compare the size.

In [22]:


# Load the original model without any quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully without quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully without quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_fe

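As you can see, there is a big difference in size: the `Linear8bitLt` layers store their weights in 8 bits, while the original `Linear` layers keep full-precision weights.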
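
A rough back-of-the-envelope estimate shows how the bit width drives the weight storage. This is only a sketch: the parameter count is an assumed round figure, the helper name `weight_memory_gb` is hypothetical, and the value reported by `get_memory_footprint()` will differ because embeddings, layer norms, and quantization constants are not stored at the reduced bit width.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage: parameters x bits, converted to bytes,
    then divided by 1024**3 as in the notebook's own print statements."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 125_000_000  # assumed round figure for a 125M-parameter model

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.3f} GB")
```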

Additionally, BitsAndBytes lets you load the model in 4-bit precision.

In [23]:

# Configure 4-bit quantization
quantization_config_4bit = BitsAndBytesConfig(load_in_4bit=True)  # load in 4 bits

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_4bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 4-bit quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 4-bit quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (q_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
          (fc2): Linear4bit(

## **Advanced Usage**  

You can experiment with different **4-bit quantization** variants, such as:  

- **NF4 (4-bit NormalFloat)** – Recommended for better performance based on theoretical and empirical results. Select it with `bnb_4bit_quant_type="nf4"`.  
- **FP4 (4-bit floating point)** – The alternative option, and the default value of `bnb_4bit_quant_type` in `BitsAndBytesConfig`.  

### **Additional Optimization Options**  
- **`bnb_4bit_use_double_quant`**: Enables nested quantization, a second pass that quantizes the quantization constants produced by the first pass, reducing memory usage by roughly an additional **0.4 bits per parameter**.  
- **Compute Precision**: While weights are stored in **4-bit**, computations can still be performed in higher precision:
  - **`float16`** (Faster training, commonly used)  
  - **`bfloat16`**  
  - **`float32`** (Default)  

Using a 16-bit compute type such as **`float16`** or **`bfloat16`** speeds up matrix multiplication and training compared with the `float32` default.
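
The saving from double quantization can be checked with a short calculation. The sketch below assumes the block sizes described in the QLoRA paper (one constant per 64 weights, and one second-level constant per 256 first-level constants). The function name `overhead_bits_per_param` is hypothetical, and the library's actual defaults may differ.

```python
def overhead_bits_per_param(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # One float32 scaling constant per block of weights.
        return 32 / block_size
    # The constants are themselves stored in 8 bits, with one float32
    # second-level constant per group of `second_block_size` constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = overhead_bits_per_param()
nested = overhead_bits_per_param(double_quant=True)
print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {nested:.3f} bits/param")
print(f"saving:                               {plain - nested:.3f} bits/param")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, a saving of about 0.373 bits, which is where the rounded figure of 0.4 comes from.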

### **Example: Load a Model in 4-bit Using NF4 & Double Quantization**  

To customize these parameters, we use `BitsAndBytesConfig` from **Hugging Face Transformers**:




In [19]:

# Configure 4-bit quantization with NF4 and double quantization
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                      # load in 4 bits
    bnb_4bit_quant_type="nf4",              # the type of quantization
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 for computation during inference
)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_4bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 4-bit NF4 quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 4-bit NF4 quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (q_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
          (fc2): Linear4bit(

### **Finding the Optimal Configuration**  
💡 **There’s no one-size-fits-all setup!** You should experiment with:  
✅ **Quantization type** (`nf4` vs `fp4`)  
✅ **Nested quantization** (`bnb_4bit_use_double_quant=True/False`)  
✅ **Compute dtype** (`float16`, `bfloat16`, or `float32`)  

By tuning these parameters, you can **find the sweet spot** that balances **accuracy, speed, and memory usage**, aligning with your **hardware capabilities** and **deployment goals**.
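
One way to organize such an experiment is to enumerate every combination up front. The sketch below only builds the keyword arguments and does not load any model. The variable names are hypothetical, and the compute dtypes are kept as strings so the snippet runs without PyTorch installed.

```python
from itertools import product

quant_types = ["nf4", "fp4"]
double_quant_options = [True, False]
compute_dtypes = ["float16", "bfloat16", "float32"]  # names of torch dtypes

# One dictionary of BitsAndBytesConfig keyword arguments per combination.
candidate_configs = [
    {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": quant_type,
        "bnb_4bit_use_double_quant": double_quant,
        "bnb_4bit_compute_dtype": dtype_name,
    }
    for quant_type, double_quant, dtype_name in product(
        quant_types, double_quant_options, compute_dtypes
    )
]

for config in candidate_configs:
    print(config)
```

Each dictionary could then be turned into a configuration with `BitsAndBytesConfig(**config)`, after replacing the dtype name with the matching `torch` dtype (for example via `getattr(torch, name)`), and the resulting models compared on memory footprint, speed, and output quality.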
