## **Notebook 6.4: Practical Guide to LLM Quantization Techniques and Frameworks**

## **Introduction**

🎉 Welcome to **Notebook 6.4**, where we dive deeper into **quantization techniques** for **large language models (LLMs)**! 🚀 In the previous notebook (**Notebook 6.3**), we implemented a **model-agnostic linear 8-bit quantizer** and observed the performance improvements for **transformer models**. But that was just the beginning! **Linear quantization** is a great first step, but as models keep scaling up, we encounter **outliers**, which present new challenges.

💡 **The Challenge?**  
As **transformer models** continue to scale up, **outliers** become a major concern. In this context, outliers are values whose magnitude is much larger than that of the rest of a tensor. **Linear quantization** maps the full value range onto a fixed number of levels, so a single extreme value stretches that range and leaves only a few levels for the bulk of the values. The result is a **poor approximation** of most values, and the loss in model quality tends to grow with model size.  
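
To make the effect concrete, here is a minimal sketch using plain NumPy rather than any quantization library. The helper `absmax_int8_roundtrip` and the outlier value `60.0` are illustrative assumptions, not part of the previous notebook's quantizer.

```python
import numpy as np

def absmax_int8_roundtrip(x):
    """Quantize to int8 with a single absmax scale, then dequantize."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
values = rng.normal(0.0, 1.0, size=4096).astype(np.float32)

# Same tensor, but with one extreme value appended (hypothetical outlier).
with_outlier = np.append(values, np.float32(60.0))

mse_clean = float(np.mean((values - absmax_int8_roundtrip(values)) ** 2))
# Measure the error only on the original 4096 "normal" values.
mse_outlier = float(
    np.mean((values - absmax_int8_roundtrip(with_outlier)[:-1]) ** 2)
)

print(f"MSE without outlier: {mse_clean:.6f}")
print(f"MSE with outlier:    {mse_outlier:.6f}")
```

The single outlier sets the scale for the whole tensor, so every other value is rounded on a much coarser grid.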

💡 **The Solution?**  
To address these issues, we need more advanced quantization techniques that can better handle the presence of **outliers**. This is where **Post-Training Quantization (PTQ)** techniques come into play. PTQ optimizes a model after it has been trained, without requiring retraining. Well-designed PTQ methods treat outliers explicitly and can **preserve most of the model's accuracy** even for larger and more complex models.

### **Why This Notebook?**  
In this notebook, we will explore **advanced quantization techniques** that tackle the **outlier problem** and utilize **Post-Training Quantization (PTQ)**. Specifically, we will look into techniques and frameworks that focus on **dynamic** and **static quantization** methods, both of which are highly effective for **large-scale models** like LLMs. 

<p align="center">
  <img src="images/Q_TECH.png" alt="Quantization Overview">
</p>


## **What’s Inside?**

🔍 Here’s what we will cover in this notebook:  

### **1️⃣ Understanding the Outlier Problem in Quantization**  
✅ What are **outliers**, and why do they affect **large model quantization**?  
✅ How do **outliers** impact performance when using **linear quantization**?  

### **2️⃣ Overview of Quantization Techniques for LLMs**  
✅ **QAT (Quantization-Aware Training)** vs **PTQ (Post-Training Quantization)**  
✅ When to use **PTQ** and why it’s well-suited for large models  

### **3️⃣ Static vs Dynamic Quantization**  
✅ Understanding the difference between **static** and **dynamic quantization**  
✅ How these methods work with **Post-Training Quantization**  

### **4️⃣ Exploring PTQ Frameworks and Tools**  
✅ Overview of popular **PTQ frameworks** and **tools**  
✅ How these frameworks help manage outliers and improve model efficiency  

### **5️⃣ Implementing PTQ in Large Language Models**  
✅ Practical guide to applying PTQ to **transformer models**  
✅ Step-by-step walkthrough using a **PTQ framework**  

## **Why This Notebook Matters**

Large language models are becoming increasingly powerful, but their deployment in resource-constrained environments is a challenge. **Quantization** is key to **making these models smaller, faster, and more efficient**, but as models scale, the **outlier problem** becomes more prominent. In this notebook, you will:

✅ Learn about the **outlier problem** in large-scale model quantization  
✅ Understand the distinction between **QAT** and **PTQ**  
✅ Gain hands-on experience with **static** and **dynamic quantization** techniques  
✅ Explore leading **PTQ frameworks** and how they handle outliers  
✅ Apply PTQ techniques to **real-world transformer models**  

## **Looking Ahead**

With the techniques and frameworks covered in this notebook, you’ll be able to optimize **large language models** with **Post-Training Quantization** and make them more deployable. But we’re not stopping here—there are still many advanced quantization strategies to explore in future notebooks, as the field is rapidly evolving.

Ready to tackle the **outlier problem** and unlock the true potential of **large language model quantization**? Let’s dive in! 🚀


## **Difference Between QAT and PTQ, Static and Dynamic Quantization**

### **QAT (Quantization-Aware Training) vs PTQ (Post-Training Quantization)**

- **QAT (Quantization-Aware Training)**:
  - **Definition**: QAT involves simulating quantization during the training process. The model learns to adapt to the quantization noise by training with quantized weights and activations.
  - **Use Case**: Best suited when training a model from scratch or fine-tuning it. The model becomes aware of the quantization effects, leading to better accuracy after quantization.
  - **Pros**: Typically retains more accuracy after quantization than PTQ, especially at low bit widths.
  - **Cons**: Requires retraining, which can be computationally expensive and time-consuming.

- **PTQ (Post-Training Quantization)**:
  - **Definition**: PTQ quantizes a pre-trained model without the need for retraining. This is done after the model has been fine-tuned, typically by converting the weights to lower precision (e.g., INT8).
  - **Use Case**: Ideal for deploying pre-trained models efficiently without needing retraining.
  - **Pros**: Faster and less computationally expensive than QAT.
  - **Cons**: May lead to a slight drop in accuracy due to the lack of fine-tuning for quantization.
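
Both approaches rest on the same quantize-dequantize ("fake quantization") operation; they differ in when it is applied. The sketch below is an illustration in NumPy, and `fake_quantize` is a hypothetical helper rather than an API from any framework.

```python
import numpy as np

def fake_quantize(x, n_bits=8):
    """Quantize-dequantize: the output stays float but only takes grid values."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))

# PTQ: applied once to the already trained weights.
# QAT: the same operation is inserted into every forward pass during training,
# so the optimizer sees the rounding error and can adapt to it.
weights_fq, scale = fake_quantize(weights)
max_error = float(np.max(np.abs(weights - weights_fq)))
print(f"scale={scale:.5f}, max rounding error={max_error:.5f}")
```

The rounding error is bounded by half a quantization step, which is the "quantization noise" that QAT lets the model adapt to.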


<p align="center">
  <img src="images/QAT_PTQ.png" alt="Quantization Overview">
</p>


### **Static Quantization vs Dynamic Quantization**

- **Static Quantization**:
  - **Definition**: Static quantization involves calibrating the model with a representative dataset to determine the optimal quantization parameters (like scaling factors) for weights and activations.
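  - **Use Case**: Suited to deployments where a representative calibration set is available and inference latency matters, since all quantization parameters are fixed ahead of time.
  - **Pros**: No runtime overhead for computing activation ranges, which usually gives the fastest inference.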
  - **Cons**: Requires calibration data, which may not always be available.

- **Dynamic Quantization**:
  - **Definition**: In dynamic quantization, weights are quantized ahead of time, while the quantization parameters for activations are computed on the fly during inference from the values actually observed.
  - **Use Case**: Ideal when you want quick deployment without requiring a calibration step.
  - **Pros**: Easier to apply than static quantization, since no calibration data is needed.
  - **Cons**: Computing activation ranges at runtime adds overhead, so inference can be slower than with static quantization.
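
The difference comes down to where the activation scale comes from. The following sketch assumes simple symmetric absmax quantization; the helper names and the synthetic data are illustrative only.

```python
import numpy as np

def absmax_scale(x, n_bits=8):
    """Scale that maps the largest magnitude in x onto the int grid."""
    qmax = 2 ** (n_bits - 1) - 1
    return float(np.max(np.abs(x))) / qmax

def quantize(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

rng = np.random.default_rng(0)

# Static: the scale is computed once, offline, from calibration data.
calibration = rng.normal(size=(512, 64))
static_scale = absmax_scale(calibration)

# Dynamic: the scale is recomputed from each batch at inference time.
batch = 3.0 * rng.normal(size=(8, 64))  # wider range than the calibration set
dynamic_scale = absmax_scale(batch)

q_static = quantize(batch, static_scale)
q_dynamic = quantize(batch, dynamic_scale)

# Values outside the calibrated range are clipped under static quantization.
clipped = int(np.sum(np.abs(batch) > static_scale * 127))
print(f"static scale={static_scale:.4f}, dynamic scale={dynamic_scale:.4f}")
print(f"values clipped by the static scale: {clipped}")
```

When inference data drifts away from the calibration data, the static scale clips values, whereas the dynamic scale adapts at the cost of extra work per batch.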

### **Our Focus: PTQ for LLM Deployment**

In this notebook, we will focus on **Post-Training Quantization (PTQ)**, as our goal is to optimize **large language models (LLMs)** for **deployment after fine-tuning**. PTQ allows us to efficiently reduce model size and improve inference speed without retraining the model, which is key for deploying large models in resource-constrained environments.

In [1]:
import torch 
from transformers import AutoTokenizer, AutoModelForCausalLM 



## **1- BitsAndBytes**

- **Definition**: `BitsAndBytes` is a library for efficient quantization of large models. It is used for **Post-Training Quantization (PTQ)** at load time and for memory-efficient fine-tuning such as **QLoRA**; it does not implement **Quantization-Aware Training (QAT)**.
  
- **Core Functionality**:
  - **Quantization**: Provides 8-bit (LLM.int8()) and 4-bit (NF4/FP4) quantization for large models.
  - **Efficient Memory Usage**: Reduces model size and memory footprint with little loss in accuracy.
  - **Performance**: The main gain is memory. Inference speed depends on hardware and kernels, and 8-bit inference can be slower than FP16.

  <p align="center">
  <img src="images/bnb.jpeg" alt="Quantization Overview">
</p>


- **Under the Hood**:
  - **Vector-wise Quantization**: For each matrix multiplication, every row of the input and every column of the weight matrix gets its own scaling constant, which is more precise than a single scale per tensor.
  - **Mixed-Precision Decomposition**: Feature dimensions that contain **outliers** are multiplied in 16-bit precision, while all remaining dimensions use INT8. The two partial results are then added.
  - **8-bit Optimizers**: Stores optimizer states in 8 bits to reduce memory during training and fine-tuning.
  - **NF4 (4-bit NormalFloat for QLoRA Fine-Tuning)**: Implements the **NF4** data type, which stores the frozen base-model weights in 4 bits while **QLoRA** (Quantized Low-Rank Adaptation) trains small adapter weights on top.
- **Framework Support**: Built for PyTorch and integrated with Hugging Face Transformers.
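
The outlier handling described under the hood can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the library's kernel code; the function names are hypothetical, and the threshold of `6.0` is an assumption taken from the default `llm_int8_threshold` in Transformers.

```python
import numpy as np

def int8_matmul_vectorwise(X, W):
    """Simulated INT8 matmul with one scale per row of X and per column of W."""
    sx = np.max(np.abs(X), axis=1, keepdims=True) / 127.0
    sw = np.max(np.abs(W), axis=0, keepdims=True) / 127.0
    Xq = np.round(X / sx).astype(np.int32)
    Wq = np.round(W / sw).astype(np.int32)
    return (Xq @ Wq) * sx * sw  # dequantize the integer result

def mixed_precision_matmul(X, W, threshold=6.0):
    """Keep outlier feature dimensions in float, quantize the rest."""
    outlier_cols = np.any(np.abs(X) > threshold, axis=0)
    out = X[:, outlier_cols] @ W[outlier_cols, :]  # high-precision part
    if (~outlier_cols).any():
        out = out + int8_matmul_vectorwise(X[:, ~outlier_cols], W[~outlier_cols, :])
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
X[:, 5] *= 20.0  # one feature dimension with systematic outliers
W = rng.normal(size=(64, 32))

reference = X @ W
err_plain = float(np.mean((reference - int8_matmul_vectorwise(X, W)) ** 2))
err_mixed = float(np.mean((reference - mixed_precision_matmul(X, W)) ** 2))
print(f"plain vector-wise INT8 error: {err_plain:.5f}")
print(f"with outlier decomposition:   {err_mixed:.5f}")
```

Routing the outlier dimension around the INT8 path keeps the scales of the remaining dimensions small, which is why the decomposed version reconstructs the output more accurately.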

The simplest way to use it is through the Hugging Face `transformers` library, as follows:

In [21]:
from transformers import BitsAndBytesConfig


model_name = "facebook/opt-125m"

# Configure 8-bit quantization.
# Device placement belongs to from_pretrained, not to BitsAndBytesConfig.
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_8bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 8-bit quantization!")
print(model)

# Memory footprint of the loaded model
print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 8-bit quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
            (out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
          (fc2): L

Let's load the original model without BitsAndBytes and compare its size.

In [22]:


# Load the original model without any quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully without quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully without quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_fe

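As you can see, the layers are plain `Linear` modules again instead of `Linear8bitLt`, and there is a big difference in size between the two versions.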
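
As a rough sanity check on the size difference, the weight memory can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch: the parameter count of 125 million is an approximation, and the real footprint reported by `get_memory_footprint()` will differ because embeddings and layer norms are not quantized.

```python
def estimated_weight_gib(n_params, bits_per_param):
    """Rough weight-only memory estimate in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n_params = 125_000_000  # approximate parameter count of opt-125m (assumption)
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {estimated_weight_gib(n_params, bits):.3f} GiB")
```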

Additionally, BitsAndBytes even lets you load the model in 4-bit:

In [23]:

# Configure 4-bit quantization
quantization_config_4bit = BitsAndBytesConfig(load_in_4bit=True)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_4bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 4-bit quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 4-bit quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (q_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
          (fc2): Linear4bit(

## **Advanced Usage**  

You can experiment with different **4-bit quantization** variants, such as:  

- **NF4 (4-bit NormalFloat)** – Recommended for better performance based on theoretical and empirical results. Note that `BitsAndBytesConfig` defaults to `bnb_4bit_quant_type="fp4"`, so NF4 has to be selected explicitly.  
- **FP4 (pure 4-bit floating point)** – An alternative option.  

### **Additional Optimization Options**  
- **`bnb_4bit_use_double_quant`**: Also quantizes the quantization constants produced by the first pass, reducing memory usage by roughly an additional **0.4 bits per parameter**.  
- **Compute Precision**: While weights are stored in **4-bit**, computations can still be performed in higher precision:
  - **`float16`** (Faster training, commonly used)  
  - **`bfloat16`**  
  - **`float32`** (Default)  

Using **`float16`** as the compute type speeds up matrix multiplication and training.
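
The figure of about 0.4 bits per parameter for double quantization can be checked with a short calculation. The block sizes below (64 weights per block, 256 constants per second-level block) are assumptions based on the setup described in the QLoRA paper.

```python
BLOCK_SIZE = 64      # weights per quantization block (assumed, as in QLoRA)
SECOND_LEVEL = 256   # first-level constants per second-level block (assumed)

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_single = 32 / BLOCK_SIZE

# With double quantization: the constants themselves are stored in 8 bits,
# plus one 32-bit constant per 256 first-level constants.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL)

saving = overhead_single - overhead_double
print(f"overhead without double quantization: {overhead_single:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
print(f"saving:                               {saving:.3f} bits/param")
```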

### **Example: Load a Model in 4-bit Using NF4 & Double Quantization**  

To customize these parameters, we use `BitsAndBytesConfig` from **Hugging Face Transformers**:




In [19]:

# Configure 4-bit NF4 quantization with double quantization
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit data type (NF4 instead of FP4)
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_4bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model {model_name} loaded successfully with 4-bit NF4 quantization!")
print(model)

print(model.get_memory_footprint() / 1024**3, "GB")


Model facebook/opt-125m loaded successfully with 4-bit NF4 quantization!
OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (q_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
          (fc2): Linear4bit(

## **2- GPTQ**

GPTQ (a post-training quantization method for generative pre-trained transformers) is an advanced quantization technique designed to efficiently reduce the precision of large language models (LLMs) while maintaining high accuracy. It enables **fast inference** and **lower memory usage** by using **4-bit weight quantization with error correction**.


  <p align="center">
  <img src="images/gptq.png" alt="Quantization Overview">
</p>

## **The Problem**
Traditional quantization methods (like uniform or naïve rounding) can introduce significant errors when converting high-precision weights (e.g., FP16 or FP32) into low-bit representations (e.g., INT4). These errors can accumulate and cause a noticeable drop in performance.

## **The Solution**
GPTQ solves this issue using Optimal Brain Surgeon (OBS) principles, which adjust other weights after quantizing each weight to minimize the overall error. The key idea is:

- **Quantize One Weight at a Time**: Instead of naive rounding, GPTQ processes weights sequentially to minimize overall error.
- **Error Correction via Hessian Approximation**: Adjusts remaining weights dynamically to compensate for quantization error.
- **Optimal Brain Surgeon (OBS) Principles**: Uses second-order optimization for best quantization results.

## **Mathematical Formulation**

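GPTQ minimizes the activation reconstruction error:
```python
import torch

# X: Input activations
# W: Original weight matrix
# W_tilde: Quantized version

def activation_reconstruction_error(X, W, W_tilde):
    return torch.norm(X @ W - X @ W_tilde, p='fro')**2
```
Instead of a naive quantization:
```python
# Naive quantization (pseudocode): round every weight independently
W_tilde[i, j] = quantize(W[i, j])
```
GPTQ finds the best quantized weight while adjusting the other weights using a **Hessian-based update**:
```python
# Hessian-based (OBS) update after quantizing weight j to q_j (pseudocode).
# H_inv is the inverse Hessian; the scaled column H_inv[:, j] spreads the
# rounding error of weight j over the weights that are not yet quantized.
w_new = w_old - (w_old[j] - q_j) / H_inv[j, j] * H_inv[:, j]
```
where `H` is the Hessian approximation capturing sensitivity and `H_inv` is its inverse. After this update, `w_new[j]` equals `q_j` exactly.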

## **Algorithm Steps**
1. **Compute Activations**: Use a small dataset to estimate activations `X`.
2. **Estimate Hessian**: Compute `H = (X.T @ X) / N`.
3. **Iterate Over Weights**:
   - Find the best quantized value `q_j` for each weight.
   - Update remaining weights using Hessian inverse correction.
4. **Repeat Until All Weights Are Quantized**.
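
The steps above can be sketched as a small NumPy loop. This is a simplified illustration of the error-compensation idea on a toy layer, not the AutoGPTQ implementation: real GPTQ works on blocks of columns, uses a Cholesky factorization of the inverse Hessian, and quantizes to an integer grid with per-group scales. The function names, the fixed grid `step`, and the synthetic data are assumptions.

```python
import numpy as np

def to_grid(x, step):
    """Round to the nearest multiple of step (stand-in for a real quantizer)."""
    return np.round(x / step) * step

def obs_quantize(W, X, step, damp=0.01):
    """Quantize the rows of W one at a time, compensating the remaining rows."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[0]
    H = X.T @ X / X.shape[0]                        # step 2: Hessian estimate
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # dampening for stability
    H_inv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(d_in):                           # step 3: iterate over weights
        Q[j] = to_grid(W[j], step)
        col = H_inv[:, j].copy()
        err = (W[j] - Q[j]) / col[j]
        W -= np.outer(col, err)                     # push the error onto other rows
        W[j] = Q[j]
        # Remove index j from the inverse Hessian so that already
        # quantized rows are no longer touched by later updates.
        H_inv -= np.outer(col, H_inv[j, :].copy()) / col[j]
    return Q

rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 4))
# Correlated input features, so that weights can compensate for each other.
X = latent @ rng.normal(size=(4, 16)) + 0.3 * rng.normal(size=(256, 16))
W = rng.normal(size=(16, 8))
step = 0.25

Q_naive = to_grid(W, step)
Q_obs = obs_quantize(W, X, step)

err_naive = float(np.linalg.norm(X @ W - X @ Q_naive) ** 2)
err_obs = float(np.linalg.norm(X @ W - X @ Q_obs) ** 2)
print(f"reconstruction error, naive rounding:   {err_naive:.2f}")
print(f"reconstruction error, with compensation: {err_obs:.2f}")
```

Both results live on the same grid; the difference is that the compensated version chooses the grid points jointly, so the layer output stays closer to the original.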


## **Benefits of GPTQ**
✅ **4-bit Quantization** with minimal accuracy loss  
✅ **Faster Inference** due to smaller models  
✅ **Memory Efficient** for edge and server deployment  
✅ **Better Than Naive Rounding** through Hessian correction  


Enough theory, let's dive into the code:


In [9]:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Let’s implement the GPTQ algorithm using the AutoGPTQ library and quantize a GPT-2 model. This requires a GPU, but a free T4 on Google Colab will do. We start by loading the libraries and defining the model we want to quantize (in this case, GPT-2).

In [10]:
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer


# Define base model and output directory
model_id = "gpt2"
out_dir = model_id + "-GPTQ"

We now want to load the model and the tokenizer. The tokenizer is loaded using the classic AutoTokenizer class from the transformers library. On the other hand, we need to pass a specific configuration (BaseQuantizeConfig) to load the model.

In this configuration, we can specify the number of bits to quantize to (here, `bits=4`) and the group size (the number of consecutive weights that share one set of quantization parameters). Note that this group size is optional: we could also use one set of parameters for the entire weight matrix. In practice, these groups generally improve the quality of the quantization at a very low cost (especially with `group_size=1024`). The `damp_percent` value is here to help the Cholesky reformulation and should not be changed.

Finally, the desc_act (also called act order) is a tricky parameter. It allows you to process rows based on decreasing activation, meaning the most important or impactful rows (determined by sampled inputs and outputs) are processed first. This method aims to place most of the quantization error (inevitably introduced during quantization) on less significant weights. This approach improves the overall accuracy of the quantization process by ensuring the most significant weights are processed with greater precision. However, when used alongside group size, desc_act can lead to performance slowdowns due to the need to frequently reload quantization parameters. For this reason, we won’t use it here (it will probably be fixed in the future, however).
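
To see why grouping helps, here is a minimal sketch of group-wise quantization in NumPy. It uses simple symmetric absmax rounding instead of the GPTQ algorithm, and both the helper `quantize_groupwise` and the synthetic weights are illustrative assumptions.

```python
import numpy as np

def quantize_groupwise(w, n_bits=4, group_size=128):
    """Symmetric absmax quantization with one scale per group of weights."""
    qmax = 2 ** (n_bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(w.shape)  # dequantized weights

rng = np.random.default_rng(0)
# 1024 weights whose typical magnitude differs from group to group.
magnitudes = np.repeat(np.linspace(0.1, 2.0, 8), 128)
w = rng.normal(size=1024) * magnitudes

mse_single = float(np.mean((w - quantize_groupwise(w, group_size=1024)) ** 2))
mse_grouped = float(np.mean((w - quantize_groupwise(w, group_size=128)) ** 2))
print(f"one scale for all 1024 weights: MSE={mse_single:.5f}")
print(f"one scale per 128 weights:      MSE={mse_grouped:.5f}")
```

Smaller groups track local weight magnitudes more closely, at the cost of storing more scales.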

In [4]:
# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

The quantization process relies heavily on samples to evaluate and enhance the quality of the quantization. They provide a means of comparison between the outputs produced by the original and the newly quantized model. The larger the number of samples provided, the greater the potential for more accurate and effective comparisons, leading to improved quantization quality.

In this notebook, we use the C4 (Colossal Clean Crawled Corpus) dataset to generate our samples. The C4 dataset is a large-scale, multilingual collection of web text gathered from the Common Crawl project. This expansive dataset has been cleaned and prepared specifically for training large-scale language models, making it a great resource for tasks such as this. The WikiText dataset is another popular option.

In the following code block, we load 1024 samples from the C4 dataset, tokenize them, and format them.

In [5]:
# Load data and tokenize examples
n_samples = 1024
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")
tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')

# Format tokenized examples
examples_ids = []
for _ in range(n_samples):
    i = random.randint(0, tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1)
    j = i + tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

README.md:   0%|          | 0.00/41.1k [00:00<?, ?B/s]

c4-train.00001-of-01024.json.gz:   0%|          | 0.00/318M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2441065 > 1024). Running this sequence through the model will result in indexing errors


Now that the dataset is ready, we can start the quantization process with a batch size of 1. Optionally, we also use OpenAI Triton, a CUDA alternative, to communicate with the GPU. Once this is done, we save the tokenizer and the model in the safetensors format.

In [6]:
%%time

# Quantize with GPTQ
model.quantize(
    examples_ids,
    batch_size=1,
    use_triton=True,
)

# Save model and tokenizer
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)

INFO - Start quantizing layer 1/12
INFO - Quantizing attn.c_attn in layer 1/12...
INFO - Quantizing attn.c_proj in layer 1/12...
INFO - Quantizing mlp.c_fc in layer 1/12...
INFO - Quantizing mlp.c_proj in layer 1/12...
INFO - Start quantizing layer 2/12
INFO - Quantizing attn.c_attn in layer 2/12...
INFO - Quantizing attn.c_proj in layer 2/12...
INFO - Quantizing mlp.c_fc in layer 2/12...
INFO - Quantizing mlp.c_proj in layer 2/12...
INFO - Start quantizing layer 3/12
INFO - Quantizing attn.c_attn in layer 3/12...
INFO - Quantizing attn.c_proj in layer 3/12...
INFO - Quantizing mlp.c_fc in layer 3/12...
INFO - Quantizing mlp.c_proj in layer 3/12...
INFO - Start quantizing layer 4/12
INFO - Quantizing attn.c_attn in layer 4/12...
INFO - Quantizing attn.c_proj in layer 4/12...
INFO - Quantizing mlp.c_fc in layer 4/12...
INFO - Quantizing mlp.c_proj in layer 4/12...
INFO - Start quantizing layer 5/12
INFO - Quantizing attn.c_attn in layer 5/12...
INFO - Quantizing attn.c_proj in layer 5/1

CPU times: user 12min 44s, sys: 29min 34s, total: 42min 19s
Wall time: 42min 24s


('gpt2-GPTQ/tokenizer_config.json',
 'gpt2-GPTQ/special_tokens_map.json',
 'gpt2-GPTQ/vocab.json',
 'gpt2-GPTQ/merges.txt',
 'gpt2-GPTQ/added_tokens.json',
 'gpt2-GPTQ/tokenizer.json')

Afterwards, the model can be loaded from the output directory using the AutoGPTQForCausalLM and AutoTokenizer classes.

In [12]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
model = AutoGPTQForCausalLM.from_quantized(
    out_dir,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.


Let's check that the model works properly after the quantization process.

In [14]:
from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = generator("I have a dream", do_sample=True, max_length=100)[0]['generated_text']
print(result)

Device set to use cuda:0
The model 'GPT2GPTQForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GotOcr2ForConditionalGeneration', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausal

I have a dream to make it to one of the first theaters in the country. They are not even close and I have such great plans. I never met anyone here before and it is as good as my dreams. People just need to learn how to be the biggest stars that they can."

"It took seven years to come together to create an audience that many will like. It is what I have always dreamed of," says his brother, Eric, of "Avengers: Infinity


---

# **Quantization Techniques: A Review**  

Before diving deeper into **GGML & Llama.cpp**, let's review the three fundamental **quantization techniques** used to optimize Large Language Models (LLMs) for efficient inference and fine-tuning.  

### **1. NF4 (Normal Float 4-bit)**  
NF4 is a **static quantization method** primarily used by **QLoRA** to load a model in **4-bit precision** for efficient fine-tuning.  

- Utilized in **LoRA + PEFT** workflows.  
- Reduces memory consumption while maintaining model quality.  
- Commonly used for training or fine-tuning rather than direct inference.  

### **2. GPTQ**  
GPTQ is a **post-training quantization (PTQ) method** designed to **compress LLMs without retraining** while minimizing performance degradation.  

- Optimized for GPU-based inference.  
- Uses an **approximate weight reconstruction algorithm** to reduce precision with minimal loss.  
- Provides significant **speed and memory efficiency improvements** while maintaining accuracy.  

### **3. GGML & GGUF (Next Section Preview)**  
In the next section, we'll explore **GGML** and its successor **GGUF**, two quantization formats designed for efficient **CPU & GPU inference** using **Llama.cpp**.  

---

# **GGML & Llama.cpp**  

### **What is GGML?**  
**GGML** (Georgi Gerganov Machine Learning) is a **high-performance C library** for machine learning, named after its creator **Georgi Gerganov**.  

- Provides fundamental ML structures, including **tensor operations**.  
- Introduces a specialized **binary format** to distribute LLMs efficiently.  
- Originally focused on **CPU-based inference** for large models.  


  <p align="center">
  <img src="images/llama_cpp.png" alt="Quantization Overview">
</p>

### **GGUF: The Next Evolution**  
GGML models previously used a **.ggml** format, but they have now transitioned to **GGUF (GGML Unified Format)**:  

- **Extensible design**: New features can be added without breaking compatibility.  
- **Centralized metadata**: Stores special tokens, **RoPE scaling**, quantization details, etc.  
- **Improved model compatibility** with tools like **llama.cpp** and other runtimes.  
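
To illustrate the idea of a self-describing file, the sketch below parses the fixed-size fields at the start of a GGUF file. The layout used here (4-byte magic `GGUF`, little-endian 32-bit version, then 64-bit tensor and metadata counts) is an assumption based on the published GGUF specification for version 2 and later, and the header is built in memory with made-up counts rather than read from a real model.

```python
import io
import struct

def read_gguf_header(stream):
    """Read the fixed-size fields at the start of a GGUF file (v2+ layout assumed)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Build a tiny synthetic header in memory (hypothetical counts).
fake_header = b"GGUF" + struct.pack("<I", 3) + struct.pack("<QQ", 291, 24)
header = read_gguf_header(io.BytesIO(fake_header))
print(header)
```

The metadata key-value pairs that follow this header are where special tokens, RoPE scaling, and quantization details are stored.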

### **Llama.cpp: Efficient LLM Inference**  
[Llama.cpp](https://github.com/ggerganov/llama.cpp) is an **optimized C++ library** for running LLaMA models efficiently.  

- Originally designed for **CPU inference**, making LLMs accessible on lower-end hardware.  
- Now supports **layer offloading to GPU**, allowing hybrid CPU+GPU execution.  
- Highly optimized with **AVX, FMA, and ARM NEON** instructions for fast computation.  

### **CPU + GPU Offloading: A Game Changer**  
By **offloading specific layers** to the GPU, **llama.cpp** significantly accelerates inference while reducing memory constraints.  

For example, on a **7B parameter model with 35 layers**, you can:  
- Run **critical layers on the GPU** for acceleration.  
- Keep **less demanding layers on the CPU** to conserve VRAM.  
- Achieve **faster inference speeds** while running **larger models** on limited hardware.  
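
A quick way to reason about offloading is to estimate how many layers fit into a given VRAM budget. The helper below is a hypothetical back-of-the-envelope sketch: it assumes equally sized layers and ignores the KV cache and other buffers, and the 4 GiB model size and 3 GiB budget are made-up numbers. In llama.cpp, the resulting count would be passed through the GPU-layers option (`-ngl` / `--n-gpu-layers`).

```python
def layers_on_gpu(n_layers, model_gib, vram_budget_gib):
    """Estimate how many equally sized layers fit into a VRAM budget."""
    per_layer_gib = model_gib / n_layers
    return min(n_layers, int(vram_budget_gib // per_layer_gib))

# Hypothetical example: a quantized 35-layer model of about 4 GiB
# and 3 GiB of VRAM available for weights.
n_gpu_layers = layers_on_gpu(n_layers=35, model_gib=4.0, vram_budget_gib=3.0)
print(f"layers to offload to the GPU: {n_gpu_layers} of 35")
```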

So if the command line is your thing, GGML and llama.cpp are for you.


In [None]:
# Step 1: Define variables
MODEL_ID = "mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Step 2: Extract model name
MODEL_NAME = MODEL_ID.split('/')[-1]

# Step 3: Install dependencies (remove sudo if running in container)
!apt-get install -y git-lfs cmake  # Removed sudo

# Step 4: Clean build llama.cpp with CUDA
!rm -rf llama.cpp  # Force remove existing directory
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!mkdir -p build 
%cd build
!cmake .. -DGGML_CUDA=ON
!cmake --build . --config Release -j 4
%cd ../..

# Step 5: Install Python requirements with correct path
%cd llama.cpp
!pip install -r requirements.txt  # Now in correct directory
%cd ..

# Step 6: Download model with proper cleanup
!rm -rf {MODEL_NAME}  # Remove existing model dir
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

# Step 7: Convert to GGUF (in current llama.cpp the script sits at the repo root)
fp16 = f"{MODEL_NAME}/model.f16.gguf"
!python llama.cpp/convert_hf_to_gguf.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Step 8: Quantize (current builds name the binary llama-quantize)
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/model.{method}.gguf"
    !./llama.cpp/build/bin/llama-quantize {fp16} {qtype} {method}

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Cloning into 'llama.cpp'...
remote: Enumerating objects: 45118, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 45118 (delta 157), reused 80 (delta 79), pack-reused 44876 (from 3)
Receiving objects: 100% (45118/45118), 94.03 MiB | 1.00 MiB/s, done.
Resolving deltas: 100% (32494/32494), done.
/home/silva/SILVA.AI/Projects/MyLLM101/notebooks/llama.cpp
/home/silva/SILVA.AI/Projects/MyLLM101/notebooks/llama.cpp/build
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI

# Model Conversion and Quantization Pipeline

This process sets up and converts a Hugging Face model to the GGUF format and applies quantization for optimized inference.

### **Steps Summary:**
1. **Define Variables**  
   - Specifies the model to use (`EvolCodeLlama-7b`) and the recommended quantization methods (`q4_k_m`, `q5_k_m`).

2. **Install Dependencies**  
   - Installs necessary packages: `git-lfs` (for handling large files) and `cmake` (for compiling dependencies).

3. **Clone and Build `llama.cpp` with CUDA Support**  
   - Retrieves the `llama.cpp` repository and compiles it with CUDA for GPU acceleration.

4. **Install Python Requirements**  
   - Installs required Python libraries from `requirements.txt`.

5. **Download the Model from Hugging Face**  
   - Uses `git lfs` to clone the model repository.

6. **Convert the Model to GGUF Format**  
   - Converts the original Hugging Face model into GGUF format for compatibility with `llama.cpp`.

7. **Apply Quantization**  
   - Uses `llama.cpp`'s `quantize` tool to generate optimized versions of the model.
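
Once the pipeline above has finished, a quick sanity check is to compare the sizes of the generated files. The sketch below lists every `.gguf` file in the model directory and reports each size relative to the FP16 baseline. The helper names are illustrative. The directory and file names follow the variables defined in the pipeline cell.

```python
from pathlib import Path


def gguf_sizes(model_dir: str) -> dict:
    """Map each .gguf file name in model_dir to its size in GiB."""
    return {p.name: p.stat().st_size / 1024**3
            for p in sorted(Path(model_dir).glob("*.gguf"))}


def print_report(sizes: dict, baseline: str = "model.f16.gguf") -> None:
    """Print each file size and its ratio to the FP16 baseline, if present."""
    base = sizes.get(baseline)
    for name, size in sizes.items():
        ratio = f"{size / base:.0%} of f16" if base else "n/a"
        print(f"{name:<24} {size:6.2f} GiB  {ratio}")


# Only runs if the model directory from the pipeline exists
if Path("EvolCodeLlama-7b").is_dir():
    print_report(gguf_sizes("EvolCodeLlama-7b"))
```

A quantized file that is not clearly smaller than the FP16 file usually means the quantization step did not run.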

# 🚀 4- ExLlamaV2: The Fastest Library to Run LLMs  

GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost **3× less VRAM** while maintaining similar accuracy and faster generation. It has become so popular that it is now directly integrated into the **transformers** library.  

## ⚡ What is ExLlamaV2?  
ExLlamaV2 is a library designed to **maximize performance** for GPTQ models. It features:  
- **Optimized kernels** for faster inference  
- **EXL2 quantization format**, offering greater flexibility in weight storage  
- **Lower memory usage** with high performance  

## 🔧 Installation  

To install **ExLlamaV2**, run the following commands:  

```bash
# Clone the repository
git clone https://github.com/turboderp/exllamav2

# Install the library
pip install exllamav2
```

First, we download zephyr-7B-beta (this can take a while since the model is about 15 GB).

As mentioned earlier, the GPTQ algorithm requires a calibration dataset, which is used to measure the impact of quantization by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and download its test file directly. The two cells below perform both downloads:

In [None]:
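! git lfs install
! git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
# Rename the folder so later cells can refer to the model as base_model
! mv zephyr-7b-beta base_model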

In [None]:
! wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet 

# 🔄 Converting Models with ExLlamaV2  

Once ExLlamaV2 is installed, we can use the **`convert.py`** script to convert a model into the optimized EXL2 format.  

## 🎯 Required Arguments for Conversion  

The script requires **four main parameters**:  

- **`-i`** → Path to the base model in **Hugging Face (HF) format (FP16)**.  
- **`-o`** → Path to the working directory where **temporary files and final output** will be stored.  
- **`-c`** → Path to the **calibration dataset** in **Parquet format**.  
- **`-b`** → Target **average bits per weight (bpw)** (e.g., `4.0` for 4-bit precision).  
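
To get a feel for what the `-b` target means in practice, the following back-of-the-envelope sketch converts bits per weight into an approximate size for the stored weights. It assumes a round 7 billion parameters and ignores metadata and any layers kept at higher precision, so the numbers are only indicative. The function name is illustrative.

```python
def estimated_size_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a bits-per-weight target."""
    return n_params * bpw / 8 / 1e9


# Assumed parameter count: 7e9 (a round figure for a "7B" model)
for bpw in (16.0, 5.0, 4.0):
    print(f"{bpw:>4} bpw -> {estimated_size_gb(7e9, bpw):.2f} GB")
```

Comparing the estimate with your available VRAM is a simple way to pick a first value for `-b`.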

## 🛠️ Example Command  

```bash
python exllamav2/convert.py \
    -i path/to/base_model \
    -o path/to/output_directory \
    -c path/to/calibration_dataset.parquet \
    -b 4.0
```

In [None]:
! mkdir quant
! python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0

# 🖥️ GPU Requirements for Quantization  

- **7B models** need **~8GB VRAM**, while **70B models** require **~24GB VRAM**.  
- On **Google Colab (T4 GPU)**, quantizing **zephyr-7b-beta** took **~2h 10min**.  

# 🔍 Why Use EXL2 Over GPTQ?  

**EXL2** improves **GPTQ** by offering:  
✅ Support for **2, 3, 4, 5, 6, and 8-bit quantization** (not just 4-bit).  
✅ **Mixed-precision layers**, preserving critical weights with higher bits.  
✅ **Adaptive error minimization**, optimizing quantization for accuracy & efficiency.  

# 📊 How Does ExLlamaV2 Optimize Quantization?  

ExLlamaV2 **benchmarks multiple quantization parameters**, tracking errors & precision levels.  
- Example: A **layer can mix 5% 3-bit & 95% 2-bit precision** for an **average 2.188 bpw**.  
- Results are stored in **`measurement.json`**, aiding in optimal quantization selection.  
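
The mixed-precision example can be checked with a small weighted average. The sketch below computes the raw average bit width of a mix of precisions. The helper name is illustrative. Note that the raw average for the 5% / 95% mix is lower than the 2.188 bpw quoted above. The gap presumably comes from per-group metadata such as scales, which is an assumption on my part rather than something stated here.

```python
def average_bits(mix: dict) -> float:
    """Weighted average bit width for a {bits: fraction_of_weights} mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * fraction for bits, fraction in mix.items())


# The mix from the example: 5% of weights at 3 bits, 95% at 2 bits
raw = average_bits({3: 0.05, 2: 0.95})
print(f"raw weighted average: {raw:.3f} bits")
print(f"gap to the quoted 2.188 bpw: {2.188 - raw:.3f} bits")
```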


## 🦙 Running ExLlamaV2 for Inference
Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we don’t need the out_tensor directory that was created by ExLlamaV2 during quantization.

In bash, you can implement this as follows:

In [None]:
!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
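
If `rsync` is not available (for example on Windows), the same filtering can be sketched in pure Python. This version is a simplified stand-in rather than an exact equivalent: it copies only top-level files, while `rsync -av` also recurses into subdirectories. The function name is illustrative.

```python
import shutil
from pathlib import Path


def copy_config_files(src: str, dst: str) -> list:
    """Copy top-level files from src to dst, skipping hidden files and *.safetensors."""
    dst_path = Path(dst)
    dst_path.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(Path(src).iterdir()):
        # Skip hidden entries, weight shards, and anything that is not a regular file
        if item.name.startswith(".") or item.suffix == ".safetensors" or not item.is_file():
            continue
        shutil.copy2(item, dst_path / item.name)
        copied.append(item.name)
    return copied


# Example call for this notebook's layout (uncomment once base_model exists):
# print(copy_config_files("base_model", "quant"))
```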

# 🚀 Running Our EXL2 Model  

Our **EXL2 model** is ready! The most straightforward way to run it is using the `test_inference.py` script from the **ExLlamaV2** repo.  

### 🏃 Quick Inference Command:  
```bash
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
```
✅ **`-m quant/`**: Specifies the directory containing the quantized model.  
✅ **`-p "I have a dream"`**: Provides the input prompt for inference.  

⚡ **Performance**:  
- **56.44 tokens/second** on a **T4 GPU**, which is faster than other formats and backends such as **GGUF/llama.cpp** or **GPTQ**.  
- For a detailed comparison, check out this excellent [article by oobabooga](https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/).  
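
To measure throughput on your own hardware, a generic timing helper is enough. The sketch below times any callable that generates a fixed number of tokens and reports tokens per second. `fake_generate` is a stand-in used only to keep the example self-contained. In practice you would wrap your actual generation call, and the helper assumes that call really produces `n_tokens` tokens.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str, int], object],
                      prompt: str, n_tokens: int) -> float:
    """Time one call to generate(prompt, n_tokens) and return tokens per second."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def fake_generate(prompt: str, n_tokens: int) -> str:
    """Hypothetical generator that pretends each token takes about 1 ms."""
    time.sleep(0.001 * n_tokens)
    return prompt


print(f"{tokens_per_second(fake_generate, 'I have a dream', 50):.1f} tokens/s")
```

For a fair comparison between backends, use the same prompt, the same number of generated tokens, and discard the first run, which often includes model loading or warm-up.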

### 🔍 NF4 vs. GGML vs. GPTQ  
<p align="center">
  <img src="images/compare.png" alt="Quantization Overview">
</p>

# 🗨️ Running EXL2 in Chat Mode  

For a more **interactive** experience, use the `chat.py` script:  
```bash
python exllamav2/examples/chat.py -m quant -mode llama
```
✅ **`-m quant`**: Specifies the quantized model directory.  
✅ **`-mode llama`**: Selects the Llama-style prompt format for the chat session.  

### 🔗 Integrations & Requirements  
If you plan to use EXL2 models **regularly**, ExLlamaV2 is integrated into multiple backends, including:  
- **oobabooga’s text generation web UI** 🖥️  

⚠️ **Performance Tip**:  
To maximize efficiency, **FlashAttention 2** is required. This currently needs **CUDA 12.1** on Windows, which can be configured during installation.  


# 📤 Uploading to Hugging Face Hub  
Now that we’ve tested the model, we’re ready to **upload it to Hugging Face**!  


In [None]:
from huggingface_hub import HfApi, notebook_login

# Replace the namespace with your own Hugging Face username
REPO_ID = "mlabonne/zephyr-7b-beta-5.0bpw-exl2"

notebook_login()
api = HfApi()
# exist_ok=True avoids an error if the repo was already created
api.create_repo(repo_id=REPO_ID, repo_type="model", exist_ok=True)
# Upload the whole quantized model folder
api.upload_folder(repo_id=REPO_ID, folder_path="quant")

# 🎉 Model Successfully Uploaded to Hugging Face Hub  

The model is now available on **Hugging Face**! 🚀  

📍 **Find it here**:  
🔗 [Zephyr-7B-Beta-5.0bpw-EXL2](https://huggingface.co/mlabonne/zephyr-7b-beta-5.0bpw-exl2)  

---

## 🛠️ Flexible & Hardware-Friendly Quantization  

The notebook provides a **generalized approach** to quantization, allowing you to:  
✅ **Quantize different models** 🧠  
✅ **Experiment with various `bpw` values** 🎛️  
✅ **Optimize for specific hardware** 💻  

This makes it ideal for **tailoring models** to your **device’s capabilities**!  


# 🎯 **Conclusion: Mastering LLM Quantization in Practice**  

Congratulations! 🎉 You've successfully explored **practical quantization techniques** for **LLMs**, following up on the **foundations from Notebook 6.3**.  

In this notebook, we implemented **Post-Training Quantization (PTQ)** using various frameworks:  
✅ **bitsandbytes** 🏗️  
✅ **GPTQ** 🚀  
✅ **GGML & llama.cpp** 🦙  
✅ **ExLlamaV2** ⚡ (the fastest!)  

You've learned how to **optimize models for efficiency**, balancing memory savings with performance, and even experimented with **cutting-edge EXL2 quantization**.  

🔹 **Next Steps?** Try different quantization settings, test models on your hardware, and explore deployment options.  

👏 **Well done on making it this far!** You’re now equipped to apply quantization in real-world projects. 🚀  