## Model Quantization for Inference

LLMs are very large. To load them efficiently for real-time inference in low-resource settings, such as a single GPU, their size must be reduced significantly without losing much performance. This process is known as *quantization*.

In this notebook, I have compiled the popular quantization algorithms and strategies available through Hugging Face `transformers`.

**NOTE:** A *GPU* is required to run this notebook.

### Model Selection

In [None]:
model_id = "mistralai/Mistral-7B-v0.1"

## Full Precision

IEEE 754 32-bit (4-byte) floating point, also known as `float32` or `FP32` precision.

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=True, device_map="auto"
        )

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

In [None]:
print(model)

* The model uses standard `Linear` layers.
* Model weights are stored in FP32.
* Computation dtype is also FP32.

In [None]:
# clean up: drop the reference, collect garbage, then release cached GPU memory
import gc
import torch

del model
gc.collect()
torch.cuda.empty_cache()

## Half Precision
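Half precision uses 2 bytes (16 bits) per value. There are two variants:
1. `FP16` is the IEEE 16-bit floating-point format (5 exponent bits, 10 mantissa bits).
2. `BF16` (bfloat16) keeps the 8 exponent bits of `FP32` and shortens the mantissa to 7 bits, trading precision for a much wider range.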
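
The difference between the two 16-bit formats can be checked without a GPU. The sketch below is only an illustration that uses the Python standard library: `to_fp16` and `to_bf16` are hypothetical helper names (they are not part of `torch` or `transformers`) that round a Python float to each format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round x to bfloat16 (1 sign, 8 exponent, 7 mantissa bits).

    bfloat16 is the upper 16 bits of a float32, so the float32 bit pattern
    is rounded to nearest-even and the lower 16 bits are zeroed.
    Special values (inf, NaN) are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the upper 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision: FP16 has more mantissa bits, so 0.1 is represented more closely.
print("0.1 in FP16:", to_fp16(0.1))
print("0.1 in BF16:", to_bf16(0.1))

# Range: 70000 exceeds the FP16 maximum (65504) but fits easily in BF16.
try:
    to_fp16(70000.0)
except OverflowError as err:
    print("FP16 overflow:", err)
print("70000 in BF16:", to_bf16(70000.0))
```

FP16 represents small values more accurately, while BF16 tolerates the large magnitudes that make FP16 overflow.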

In [None]:
model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
        )

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

The model size is halved, from around **28 GB** in `FP32` precision to around **14 GB**.
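
These numbers follow from a simple estimate: number of parameters times bytes per parameter. The sketch below assumes roughly 7.24 billion parameters for Mistral-7B (an assumption, not a value measured in this notebook) and uses a hypothetical helper, `weight_memory_gib`, that ignores activations and any runtime overhead.

```python
# Assumption: Mistral-7B has roughly 7.24 billion parameters.
NUM_PARAMS = 7.24e9

def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring activations and overhead."""
    return num_params * bits_per_param / 8 / 1024**3

# Compare the storage needed at each precision covered in this notebook.
for name, bits in [("FP32", 32), ("BF16/FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gib(NUM_PARAMS, bits):6.2f} GiB")
```

The footprint reported by `get_memory_footprint()` for quantized models is usually somewhat higher than this estimate, likely because some modules (for example the embeddings) are kept in higher precision.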

In [None]:
print(model)

* Model weights are stored in `BF16` precision.
* Computation also uses `BF16` precision.

In [None]:
del model
torch.cuda.empty_cache()

## INT8 Quantization

### Using transformers parameters

In [None]:
model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto",
            load_in_8bit=True
        )

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

* The model size is roughly halved again, to about **7.5 GB**.
* `8bit` precision is used only for storing the model weights.
* Computation uses `BF16` precision, as set by the `torch_dtype` parameter. This can be `FP32` or `FP16` as well.
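
To make the idea of storing weights as 8-bit integers concrete, here is a minimal absmax quantization round trip in plain Python. It is a simplified sketch, not the bitsandbytes implementation: the actual LLM.int8() scheme uses vector-wise scales and treats outlier features separately in 16-bit. The function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.50, 0.33, 1.27, -0.04]
q, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(q, scale)

print("stored int8 values:", q)
print("restored floats:   ", restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that quantization accepts in exchange for the smaller footprint.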

In [None]:
print(model)

The standard `Linear` layer is replaced with **Linear8bitLt**.

In [None]:
del model
torch.cuda.empty_cache()

The model `dtype` shows `torch.bfloat16`: the weights are stored in **8bit**, while the computation happens in `torch.bfloat16`.

### Using bitsandbytes quantization config

In [None]:
from transformers import BitsAndBytesConfig

In [None]:
# llm_int8_threshold sets the outlier threshold used by LLM.int8() (library default: 6.0)
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_threshold=200.0)

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto",
    quantization_config=quantization_config
)

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

In [None]:
print(model)

In [None]:
del model
torch.cuda.empty_cache()

## 4bit Model Quantization

### Using transformers parameters

In [None]:
model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto",
            load_in_4bit=True
        )

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

* The model `dtype` shows `torch.bfloat16`: the weights are stored in **4bit**, while the computation happens in `torch.bfloat16`.
* The model size is further reduced to **4.24 GB**.
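
The sketch below illustrates why 4-bit storage takes half a byte per weight: each block of weights shares one scale, and two 4-bit codes are packed into a single byte. It is a simplified illustration on a uniform integer grid, not the FP4 or NF4 data types that bitsandbytes actually uses, and all function names are hypothetical.

```python
def quantize_block_4bit(block):
    """Quantize one block of floats to signed 4-bit codes in [-7, 7]."""
    absmax = max(abs(v) for v in block) or 1.0
    codes = [round(v / absmax * 7) for v in block]
    return codes, absmax

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into single bytes (two weights per byte)."""
    assert len(codes) % 2 == 0
    unsigned = [c + 8 for c in codes]  # shift [-7, 7] to [1, 15]
    return bytes((hi << 4) | lo for hi, lo in zip(unsigned[0::2], unsigned[1::2]))

def unpack_and_dequantize(packed, absmax):
    """Undo the packing and scaling to get approximate floats back."""
    codes = []
    for byte in packed:
        codes.extend([(byte >> 4) - 8, (byte & 0x0F) - 8])
    return [c / 7 * absmax for c in codes]

block = [0.7, -0.3, 0.1, -0.7, 0.0, 0.2, 0.5, -0.1]
codes, absmax = quantize_block_4bit(block)
packed = pack_nibbles(codes)

print("4-bit codes:", codes)
print("bytes used: ", len(packed), "for", len(block), "weights")
print("restored:   ", unpack_and_dequantize(packed, absmax))
```

Besides the packed codes, one scale per block must also be stored, which is the overhead that nested quantization later targets.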

In [None]:
print(model)

The standard `Linear` layer is replaced with **Linear4bit**.

In [None]:
del model
torch.cuda.empty_cache()

### Using bitsandbytes quantization config

#### QLoRA NF4

In [None]:
bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=torch.bfloat16
        )
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")
print(f"GPU memory: {model.get_memory_footprint():.2f} B")

In [None]:
print(model)

In [None]:
del model
torch.cuda.empty_cache()

#### QLoRA NF4-double-quantization (Nested Quantization)

The `bnb_4bit_use_double_quant` argument enables double (nested) quantization. It applies a second quantization to the quantization constants produced by the first one, saving roughly 0.4 additional bits per parameter.
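
The 0.4-bit figure can be reproduced with a short calculation. The block sizes below are assumptions taken from the QLoRA paper rather than from this notebook: weights are quantized in blocks of 64 with one FP32 constant per block, and double quantization stores those constants in 8 bits, in groups of 256 with one FP32 constant per group. The parameter count is likewise an approximation.

```python
# Assumptions (from the QLoRA paper, not measured in this notebook).
BLOCK_SIZE = 64          # weights sharing one quantization constant
SCALE_GROUP_SIZE = 256   # constants sharing one second-level FP32 constant

def overhead_bits_per_param(double_quant: bool) -> float:
    """Bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / BLOCK_SIZE
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SCALE_GROUP_SIZE)

saving = overhead_bits_per_param(False) - overhead_bits_per_param(True)
print(f"saving: {saving:.3f} bits per parameter")

# Assumption: roughly 7.24e9 parameters for Mistral-7B.
num_params = 7.24e9
print(f"approx. saving: {saving * num_params / 8 / 1024**3:.2f} GiB")
```

Since the saving scales linearly with the number of quantized parameters, it is small in absolute terms for a 7B model and grows for larger ones.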

In [None]:
bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16
        )
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")
print(f"GPU memory: {model.get_memory_footprint():.2f} B")

Here, the impact of double quantization is not clearly evident. However, for larger models, there is an observable difference.

In [None]:
print(model)

In [None]:
del model
torch.cuda.empty_cache()

## GPTQ

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

The GPTQ algorithm calibrates the quantized weights of the model by running inference on sample data. To quantize a model with auto-gptq, a calibration dataset must therefore be passed to the quantizer. The datasets supported by default are `['wikitext2', 'c4', 'c4-new', 'ptb', 'ptb-new']`. You can also pass your own dataset as a `list of strings`.
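
As a rough sketch of the custom-dataset option, the snippet below builds a tiny list of calibration strings and the keyword arguments that would be handed to `GPTQConfig`. The sample sentences are hypothetical placeholders, and a real calibration set would normally be much larger. The `GPTQConfig` call itself is left as a comment so that the snippet runs without `transformers` installed.

```python
# Hypothetical calibration samples; real calibration data should resemble
# the text the model will see at inference time.
calibration_texts = [
    "Quantization reduces the memory footprint of large language models.",
    "GPTQ calibrates quantized weights using sample inputs.",
    "The quick brown fox jumps over the lazy dog.",
]

# Same settings as the named-dataset configuration in this notebook,
# with the dataset name swapped for the custom list of strings.
gptq_kwargs = dict(
    bits=4,
    group_size=128,
    dataset=calibration_texts,
    desc_act=False,
)

# With transformers installed and `tokenizer` loaded as above:
# quantization_config = GPTQConfig(**gptq_kwargs, tokenizer=tokenizer)
print(len(gptq_kwargs["dataset"]), "calibration samples")
```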

In [None]:
# Set quantization configuration
quantization_config = GPTQConfig(
     bits=4,
     group_size=128,
     dataset="c4",
     desc_act=False,
     tokenizer=tokenizer
)

### Loading the model and quantizing

In [None]:
# Load the model from HF
quant_model = AutoModelForCausalLM.from_pretrained(model_id, 
                quantization_config=quantization_config, trust_remote_code=True, device_map='auto')

This process can take a few hours to generate a quantized model. Hence, it is recommended to save the model locally or push it to the Hugging Face Hub for later use.

### Saving the quantized model and tokenizer

In [None]:
# save the quantized model to disk
save_folder = "./models/quantized-shearedllama-2.7b"
quant_model.save_pretrained(save_folder, safe_serialization=True)

In [None]:
# save the tokenizer
tokenizer.save_pretrained(save_folder)

In [None]:
del quant_model
torch.cuda.empty_cache()

### Loading the Quantized Model

In [None]:
model_path = "./models/quantized-shearedllama-2.7b/"
gptq_config = GPTQConfig(bits=4, disable_exllama=False)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", 
                                             quantization_config = gptq_config)

In [None]:
print(model.dtype)
print(f"GPU memory: {model.get_memory_footprint() / 1024**3:.2f} GB")

In [None]:
print(model)

The `Linear` layers are replaced by the **QuantLinear** layer from auto-gptq.

In [None]:
print(model.config.quantization_config.to_dict())

In [None]:
del model
torch.cuda.empty_cache()

## Important References
* HuggingFace Blogs
    1. https://huggingface.co/blog/hf-bitsandbytes-integration
    2. https://huggingface.co/blog/4bit-transformers-bitsandbytes
    3. https://huggingface.co/blog/gptq-integration
    4. https://huggingface.co/blog/merve/quantization
    5. https://huggingface.co/blog/overview-quantization-transformers
    6. https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/quantization
* Research Papers
    1. [LLM.int8](https://arxiv.org/abs/2208.07339)
    2. [QLoRA](https://arxiv.org/abs/2305.14314)
    3. [GPTQ](https://arxiv.org/pdf/2210.17323.pdf)
* GitHub Repos
    1. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
    2. [auto-gptq](https://github.com/PanQiWei/AutoGPTQ)