### This notebook was created by [Harpreet Sahota](https://twitter.com/DataScienceHarp), DevRel Manager at [Deci AI](https://deci.ai/). Keep in touch by joining me in our community, [Deep Learning Daily](https://www.deeplearningdaily.community/)

GPTQ is the Marie Kondo for Language Models - it helps tidy up LLMs, making them more efficient and faster without losing their smarts.

What is GPTQ?

It's a post-training quantization technique for large language models that uses second-order information to quantize the weight matrices more accurately.

Second-order information means second derivatives, collected in the Hessian matrix, which GPTQ uses to guide weight quantization.

These second derivatives describe the curvature of the loss landscape, which lets GPTQ capture how sensitive the output is to each weight and improve the quantization process.

The end result is that GPTQ reduces the compute and memory requirements of large language models without significant loss of accuracy, making them smaller and more efficient to deploy.
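
To make "quantization" concrete, here is a minimal sketch of plain round-to-nearest quantization to 4 bits. This is the simple baseline that GPTQ improves on, not GPTQ itself; the function names and sample weights are made up for illustration.

```python
def quantize_rtn(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and a zero point."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 if all weights are equal
    zero = round(-w_min / scale)             # integer code that represents 0.0
    codes = [min(levels, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize(codes, scale, zero):
    """Recover approximate float weights from integer codes."""
    return [(c - zero) * scale for c in codes]


weights = [-0.42, -0.10, 0.03, 0.27, 0.55]
codes, scale, zero = quantize_rtn(weights)
restored = dequantize(codes, scale, zero)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:2d} -> {r:+.4f}")
```

Each weight is stored as a small integer plus a shared scale and zero point. Round-to-nearest treats every weight independently; GPTQ instead uses the Hessian to adjust the weights that are still unquantized so they absorb the rounding error.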

# 🔍 **GPTQ Algorithm**

1. **Arbitrary Order Insight**

   - [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) is an algorithm that compresses transformer-based language models to a few bits per weight with minimal performance degradation.

   - Traditional methods quantize weights in a specific, carefully chosen order to minimize quantization error.

   - GPTQ's cool insight: just mix it up. For large, heavily parametrized layers, quantizing weights in an arbitrary order is nearly as effective.

   - This keeps error levels similar while simplifying the process.

2. **Quantizing All Rows in Same Order**

   - Instead of treating each row like it's special, GPTQ quantizes the weights of every row in the same order, a departure from choosing a separate order per row.

   - This significantly reduces runtime, especially for larger models.

   - The quantization updates use inverse Hessian information.

3. **Lazy Batch-Updates for Efficiency**

   - GPTQ groups columns, updates them together, and BAM! Much faster, especially on GPUs.

   - Applying the algorithm to a group of columns at a time and batching the updates improves GPU utilization.

   - This batching doesn't reduce the amount of compute, but it alleviates memory-bandwidth bottlenecks, offering substantial speedups.

4. **Cholesky Reformulation for Numerical Stability**

   - Addresses numerical inaccuracies, a significant issue at larger scales.

   - Uses a [Cholesky decomposition approach](https://www.seas.ucla.edu/~vandenbe/133A/lectures/chol.pdf) to precompute necessary information, ensuring robustness on huge models.

   - Mild damping is applied to the diagonal elements of the Hessian matrix to avoid numerical issues.
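
The damping and Cholesky steps can be illustrated with a small sketch. It is a pure-Python toy, not the auto-gptq implementation: the helper names, the 3x3 matrix, and the 1% damping factor are assumptions chosen for illustration. The matrix is built from a single input vector, so it is singular and an undamped Cholesky factorization breaks down.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L @ L.T == a for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def add_damping(h, percent=0.01):
    """Add percent * mean(diag(h)) to every diagonal entry of h."""
    n = len(h)
    damp = percent * sum(h[i][i] for i in range(n)) / n
    return [[h[i][j] + (damp if i == j else 0.0) for j in range(n)] for i in range(n)]


# Outer product of x = [1, 2, 3] with itself: symmetric but rank 1, hence singular.
hessian = [[1.0, 2.0, 3.0],
           [2.0, 4.0, 6.0],
           [3.0, 6.0, 9.0]]

try:
    cholesky(hessian)
    undamped_ok = True
except (ZeroDivisionError, ValueError):
    undamped_ok = False  # the factorization hits a zero pivot

damped = add_damping(hessian)
factor = cholesky(damped)  # succeeds: the damped matrix is positive definite

print("undamped factorization succeeded:", undamped_ok)
print("damped factor diagonal:", [round(factor[i][i], 4) for i in range(3)])
```

Adding a small positive value to the diagonal makes every pivot strictly positive, which is why a mild amount of damping is enough to keep the factorization stable.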

### The end result?

🚀 Reduces model size significantly, enabling use on less powerful hardware, like fewer GPUs.

🤖 Ideal for giant models with billions of parameters.

⏱ Can quantize models with up to 175 billion parameters in about 4 GPU hours.

🔍 Reduces bitwidth to 3-4 bits per weight with little accuracy loss.

💡 Makes using and accessing large language models more practical and accessible.
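
A quick back-of-the-envelope calculation shows where the size reduction comes from. This sketch counts only the bits needed for the weights themselves; it ignores the per-group scales and zero points, activations, and any layers left unquantized, so real checkpoints will be somewhat larger. The function name is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), counting only the weights."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in [("7B model", 7e9), ("175B model", 175e9)]:
    for bits in (16, 4, 3):
        size = weight_memory_gb(params, bits)
        print(f"{name} at {bits:2d} bits per weight: {size:6.1f} GB")
```

Going from 16 to 4 bits per weight cuts weight storage by a factor of four, which is what lets a large model fit on fewer GPUs.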


## Load required libraries

First, install the required libraries, including 🤗 transformers, optimum, and auto-gptq.

In [None]:
%%capture
!pip install transformers optimum accelerate peft trl auto-gptq bitsandbytes datasets

In [None]:
#  optional to push your model to hub
# from huggingface_hub import notebook_login

# notebook_login()

## 🚀 **Two Ways to Use a Quantized Model**:

1. **From Scratch**: Quantize a new language model.

2. **Pre-Quantized Models**: Load an already quantized model from 🤗 Hub.

🔬 **GPTQ Needs Calibration**: Quantization requires running calibration data through the model so the quantized weights can be chosen to preserve the model's outputs.

📖 **Auto-GPTQ Process**:

- **Dataset Requirement**: For quantization, provide a dataset.

- **Choices**: Use default datasets like 'wikitext2', 'c4', etc., or your own list of strings.

🧪 **How It Works**: Detailed algorithm in the [original paper](https://arxiv.org/pdf/2210.17323.pdf).

## 🔧 **Quantizing and Precision**:

- **Precision Level**: Opting for 4-bit precision.

- **Dataset Choice**: Using the `"wikitext2"` dataset for this example.

- **Supported Precisions**: Supported bit-widths are 2, 3, 4, and 8.

⏳ **Patience is Key**: The process can take a long time.
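
To get a feel for what each supported bit-width buys you, the sketch below computes how many distinct values a weight can take and the worst-case round-to-nearest error on a uniform grid. The unit value range and the function name are assumptions for illustration; this measures grid resolution only and says nothing about the error compensation GPTQ applies on top.

```python
def max_rounding_error(value_range, bits):
    """Worst-case round-to-nearest error on a uniform grid spanning value_range."""
    levels = 2 ** bits                  # number of representable values
    step = value_range / (levels - 1)   # spacing between neighbouring values
    return step / 2


for bits in (2, 3, 4, 8):
    error = max_rounding_error(1.0, bits)
    print(f"{bits} bits: {2 ** bits:3d} levels, max rounding error {error:.4f}")
```

Each extra bit doubles the number of levels, so the grid gets finer quickly. At 2 bits there are only four values per group of weights, which is why very low precisions are much harder to make work well.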

## 📚 **Calibration Datasets in GPTQ**

To keep the difference between the outputs of the original model and the quantized model as small as possible, it's important to use a representative set of data that the model is expected to encounter during actual usage. An adequate amount of data is also crucial.

- **Critical Role**: Calibration data guides how weights are quantized so that the quantized model matches the original model's performance.

- **Inference Only**: GPTQ runs inference on the calibration data to adjust the quantized weights; it does not fine-tune the model with gradient updates.

- **Quality Counts**: The effectiveness of quantization heavily depends on the dataset's representativeness.

- **For Custom Datasets**: Check the next section for quantizing with a custom dataset.

🤔 **Dataset Calibration Intuition**

- **Mimicking Real-World Data**: Calibration ensures the quantized model behaves like it would on actual data.

- **Adjusting Quantized Weights**: Helps adjust the quantized weights so the model closely resembles the original, full-precision model.

- **Balancing Performance and Size**: Aims to maintain performance while benefiting from reduced model size due to quantization.
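
To connect calibration back to the Hessian: in the GPTQ paper, the inputs X that reach a layer during calibration define the layer-wise Hessian H = 2 X Xᵀ. The sketch below computes that matrix for a few made-up input vectors. It is a pure-Python illustration with a hypothetical function name, not the code path used by auto-gptq.

```python
def layer_hessian(inputs):
    """Return H = 2 * X @ X.T, where each calibration sample is one column of X.

    inputs: list of samples, each a list of d input features for the layer.
    """
    d = len(inputs[0])
    h = [[0.0] * d for _ in range(d)]
    for x in inputs:
        for i in range(d):
            for j in range(d):
                h[i][j] += 2.0 * x[i] * x[j]
    return h


# Three made-up calibration samples with three input features each.
calibration_inputs = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
    [0.0, 3.0, 1.0],
]

hessian = layer_hessian(calibration_inputs)
for row in hessian:
    print(row)
```

Input features that are large or frequently active in the calibration data produce large diagonal entries, so errors on the corresponding weights count for more. If the calibration data does not resemble real usage, the Hessian emphasizes the wrong weights, which is why representativeness matters.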



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "Deci/DeciLM-7B-instruct"

quantization_config = GPTQConfig(
     bits=4,
     group_size=128, # The group size to use for quantization; default value
     dataset="wikitext2", # The dataset used for calibration.
     desc_act=False,# Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference but the perplexity may become slightly worse.

)

tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          trust_remote_code = True)

quant_model = AutoModelForCausalLM.from_pretrained(model_id,
                                                   trust_remote_code = True,
                                                   quantization_config=quantization_config,
                                                   device_map='auto'
                                                   )

In [None]:
quant_model.push_to_hub("DeciLM-7B-Instruct-gptq-4bit-wikitext")

tokenizer.push_to_hub("DeciLM-7B-Instruct-gptq-4bit-wikitext")

# 🧩 **Custom Dataset for Quantization**

- **Personal Touch**: Use your own dataset by providing a list of strings.

  - DeciLM-7B-Instruct was instruction-tuned on SlimOrca, so we will use that!

- **Data Matters**: Insufficient data can impact model performance.

In [None]:
from datasets import load_dataset

slim_orca = load_dataset("Open-Orca/SlimOrca-Dedup", split='train')

skinny_orca = slim_orca.shuffle(seed=42).select(range(1000))



In [None]:
def change_keys(example):
    """Convert SlimOrca's from/value messages to the role/content format used by chat templates."""
    for message in example['conversations']:
        # Rename the keys: 'from' -> 'role' and 'value' -> 'content'
        message['role'] = message.pop('from')
        message['content'] = message.pop('value')

        # Map SlimOrca speaker names to chat roles: 'human' -> 'user', 'gpt' -> 'assistant'
        if message['role'] == 'human':
            message['role'] = 'user'
        elif message['role'] == 'gpt':
            message['role'] = 'assistant'
    return example

def apply_chat_template(example):
    # Render the messages in 'conversations' as one prompt string using the tokenizer's chat template
    modified_text = tokenizer.apply_chat_template(example['conversations'], tokenize=False)
    return {"modified_text": modified_text}

In [None]:
import os
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

model_id = "Deci/DeciLM-7B-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

skinny_orca = skinny_orca.map(change_keys, num_proc=os.cpu_count())

skinny_orca = skinny_orca.map(apply_chat_template, num_proc=os.cpu_count())

# Collect the rendered prompts into a plain list of strings for calibration
skinny_orca_texts = [example['modified_text'] for example in skinny_orca]

Map (num_proc=12):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
import torch
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

model_id = "Deci/DeciLM-7B-instruct"

quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    use_cuda_fp16=True,
    desc_act=False,
    dataset=skinny_orca_texts,
    model_seqlen=8192
)

quant_model = AutoModelForCausalLM.from_pretrained(model_id,
                                                   quantization_config=quantization_config,
                                                   trust_remote_code=True,
                                                   use_cache=False,
                                                   torch_dtype="auto",
                                                   low_cpu_mem_usage=True
                                                   )

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]




In [None]:
quant_model.push_to_hub("harpreetsahota/DeciLM-7B-Instruct-gptq-4bit-slim-orca")

tokenizer.push_to_hub("harpreetsahota/DeciLM-7B-Instruct-gptq-4bit-slim-orca")

🔍 **Checking Quantization Success**:

- **Key Attributes**: Look for `qweight` and `qzeros` in the linear layers.

- **Data Type**: These attributes should be in `torch.int32` dtype.

👀 **Validation Step**: This check ensures the model is correctly quantized.
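
Why `torch.int32` for 4-bit weights? Several low-bit codes are packed into each 32-bit integer. The sketch below shows the idea with plain Python integers: eight 4-bit codes fit into one 32-bit word. The function names and the lowest-code-first ordering are assumptions for illustration; the exact packing layout inside auto-gptq may differ.

```python
def pack_4bit(codes):
    """Pack eight 4-bit codes (0..15) into one unsigned 32-bit integer, lowest code first."""
    assert len(codes) == 8 and all(0 <= c <= 15 for c in codes)
    packed = 0
    for position, code in enumerate(codes):
        packed |= code << (4 * position)
    return packed


def unpack_4bit(packed):
    """Recover the eight 4-bit codes from a packed 32-bit integer."""
    return [(packed >> (4 * position)) & 0xF for position in range(8)]


codes = [3, 15, 0, 7, 9, 1, 12, 4]
packed = pack_4bit(codes)

print(f"packed word: 0x{packed:08X}")
print("unpacked:", unpack_4bit(packed))
```

This is why a packed tensor such as `qweight` has far fewer elements than the original weight matrix even though its dtype is a 32-bit integer.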

🔧 **Linear Layers & Quantization**

- **Linear to QuantLinear**: Linear layers are replaced by `QuantLinear` modules from the auto-gptq library.

- **Exllama Kernel Use**: The exllama backend is enabled by default (`disable_exllama=False`). To use a model on a configuration with limited VRAM split across multiple devices with `device_map`, disable exllama by setting `disable_exllama=True`. The quantization config printed further down lists a `use_exllama` key, which suggests newer library versions expose this switch under that name.

- **Bit Specificity**: The exllama kernel is compatible only with 4-bit models (`bits=4`).

In [None]:
quant_model.model.layers[0].self_attn

DeciLMAttention(
  (rotary_emb): LlamaDynamicNTKScalingRotaryEmbedding()
  (k_proj): QuantLinear()
  (o_proj): QuantLinear()
  (q_proj): QuantLinear()
  (v_proj): QuantLinear()
)

In [None]:
quant_model.model.layers[0].self_attn.q_proj.__dict__

In [None]:
quant_dict = quant_model.config.quantization_config.to_dict()

In [None]:
quant_dict.keys()

dict_keys(['quant_method', 'bits', 'tokenizer', 'dataset', 'group_size', 'damp_percent', 'desc_act', 'sym', 'true_sequential', 'use_cuda_fp16', 'model_seqlen', 'block_name_to_quantize', 'module_name_preceding_first_block', 'batch_size', 'pad_token_id', 'use_exllama', 'max_input_length', 'exllama_config', 'cache_block_outputs'])

🚀 **Running Inference on Quantized Model**

- **Same Old, New Twist**: Use the familiar transformers API for inference.

- **Seamless Integration**: The quantized model works just like the regular transformers models.

In [None]:
decilm_4_bit_model = AutoModelForCausalLM.from_pretrained(
    'harpreetsahota/DeciLM-7B-Instruct-gptq-4bit-slim-orca',
    trust_remote_code=True,
    use_cache=False,
    device_map="auto",
    low_cpu_mem_usage=True
    )

tokenizer = AutoTokenizer.from_pretrained(
    'harpreetsahota/DeciLM-7B-Instruct-gptq-4bit-slim-orca',
    trust_remote_code = True
    )

In [None]:
from transformers import pipeline, TextStreamer

system_prompt = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

user_prompt = """You are the rapper Eazy-E. Write a rap about how you just ate the most amazing pancakes ever.\
Use lots of details to describe your experience.
"""

prompt = ([
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
    ])

streamer = TextStreamer(tokenizer)

pipe = pipeline("conversational",
                model=decilm_4_bit_model,
                tokenizer=tokenizer,
                temperature=0.1,
                max_length = 1024,
                streamer=streamer
                )

pipe(prompt)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.
### User:
You are the rapper Eazy-E. Write a rap about how you just ate the most amazing pancakes ever.Use lots of details to describe your experience. 

### Assistant:
Yo, listen up, I got a story to tell
'Bout the most amazing pancakes I ever did eat
I was hungry, so I went to the kitchen
And whipped up a batch, just for me

I mixed the batter, just like I do
And poured it into the pan, nice and smooth
The sizzle and the smell, it was heavenly
I knew I was in for a treat, oh so tasty

The pancakes were golden, just like the sun
And the syrup was sweet, like a summer fun
I took a bite, and my taste buds danced
I was in pancake heaven, no need to expand

The fluffy texture, it was so soft
And the flavor, it was like a gift
I savored every bite, with a smile on my face
I knew I had to share this with my friends, no need to waste

So I called them up, and they came right away
We sat down

Conversation id: 01fde20e-5f1c-47a6-91bb-4fa6b0fd2f3b
system: You are an AI assistant that follows instruction extremely well. Help as much as you can.
user: You are the rapper Eazy-E. Write a rap about how you just ate the most amazing pancakes ever.Use lots of details to describe your experience. 

assistant: Yo, listen up, I got a story to tell
'Bout the most amazing pancakes I ever did eat
I was hungry, so I went to the kitchen
And whipped up a batch, just for me

I mixed the batter, just like I do
And poured it into the pan, nice and smooth
The sizzle and the smell, it was heavenly
I knew I was in for a treat, oh so tasty

The pancakes were golden, just like the sun
And the syrup was sweet, like a summer fun
I took a bite, and my taste buds danced
I was in pancake heaven, no need to expand

The fluffy texture, it was so soft
And the flavor, it was like a gift
I savored every bite, with a smile on my face
I knew I had to share this with my friends, no need to waste

So I called the

# Quantize to 2 bits

In [None]:
model_id = "Deci/DeciLM-7B-instruct"

quantization_config = GPTQConfig(
    bits=2,
    group_size=128,
    use_cuda_fp16=True,
    desc_act=False,
    dataset=skinny_orca_texts,
    model_seqlen=8192
)

quant_model = AutoModelForCausalLM.from_pretrained(model_id,
                                                   quantization_config=quantization_config,
                                                   trust_remote_code=True,
                                                   use_cache=False,
                                                   torch_dtype="auto",
                                                   low_cpu_mem_usage=True
                                                   )

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]


In [None]:
quant_model.push_to_hub("harpreetsahota/DeciLM-7B-Instruct-gptq-2bit-slim-orca")

tokenizer.push_to_hub("harpreetsahota/DeciLM-7B-Instruct-gptq-2bit-slim-orca")

model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/harpreetsahota/DeciLM-7B-Instruct-gptq-2bit-slim-orca/commit/006229e7dcaeb0ef4518d0ae759ae7a86341acd7', commit_message='Upload tokenizer', commit_description='', oid='006229e7dcaeb0ef4518d0ae759ae7a86341acd7', pr_url=None, pr_revision=None, pr_num=None)