# LLM Quantization
This notebook presents two different methods to quantize LLMs: GPTQ and AWQ. We recommend using AWQ.

## GPTQ
GPTQ is a post-training quantization algorithm that reduces an LLM's weights to 8, 4, 3, or 2 bits of precision using an iterative second-order optimization process and a small calibration dataset.

More information at: https://arxiv.org/pdf/2210.17323.pdf

This tutorial uses the AutoGPTQ implementation: https://github.com/AutoGPTQ/AutoGPTQ
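
For intuition about what "quantizing weights to 4 bits with a group size" means, here is a minimal sketch of plain round-to-nearest (RTN) group quantization in pure Python. This is not GPTQ itself: GPTQ additionally uses second-order information gathered from the calibration data to compensate for rounding error, which is not reproduced here. The function names and the toy weight row are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Asymmetric min-max round-to-nearest quantization of one group of weights."""
    qmax = 2 ** bits - 1
    # Make sure the representable range contains 0 so the zero-point fits in [0, qmax].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0   # fall back to 1.0 for an all-zero group
    zero = round(-w_min / scale)
    codes = [min(max(round(w / scale) + zero, 0), qmax) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(code - zero) * scale for code in codes]


def quantize_row(row, bits=4, group_size=4):
    """Quantize and dequantize a weight row, one group at a time."""
    approx = []
    for start in range(0, len(row), group_size):
        codes, scale, zero = quantize_group(row[start:start + group_size], bits)
        approx.extend(dequantize_group(codes, scale, zero))
    return approx


def mean_abs_error(reference, approx):
    return sum(abs(a - b) for a, b in zip(reference, approx)) / len(reference)


# Toy row: the first half has a wide value range, the second half a narrow one.
row = [0.14, -0.8, 0.7, 0.32, 0.01, -0.02, 0.03, 0.015]
per_group = quantize_row(row, bits=4, group_size=4)
whole_row = quantize_row(row, bits=4, group_size=len(row))

print("mean error, group size 4:", mean_abs_error(row, per_group))
print("mean error, single group:", mean_abs_error(row, whole_row))
```

Smaller groups give each region of the row its own scale and zero-point, which is why the grouped version has a lower mean error on this toy row, at the cost of storing more scales.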

In [None]:
# First, let's install the required libraries - this might take several minutes

!pip install torch transformers datasets optimum accelerate
!pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git -vvv --no-build-isolation 

In [None]:
# Now let's import the libraries we need

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer

In [None]:
# Define quantization parameters

DEVICE = 'cuda:0'                                                       # device where to load the model and perform the quantization
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"                         # model to quantize, can be a Hugging Face id or a local path
QUANTIZED_MODEL_ID = "mistral-7b-instruct-v0.2-GPTQ-Q4-GS128-AOT-TST"   # name given to the quantized model
BITS = 4                                                                # number of bits to quantize to, recommended 4
GROUP_SIZE = 128                                                        # size used for grouping in quantization algorithm, recommended 128, -1 quantizes per column
ACT_ORDER = True                                                        # whether to quantize columns in order of decreasing activation size, recommended True
TRUE_SEQUENTIAL = True                                                  # whether to perform sequential quantization even within a single transformer block, recommended True
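
The choice of `BITS` and `GROUP_SIZE` drives the storage cost of the quantized model. The sketch below gives a rough estimate under stated assumptions: each group stores one 16-bit scale and one zero-point of `BITS` bits, unquantized layers (such as embeddings) are ignored, and the parameter count of about 7.24 billion is an approximation. The helper names are hypothetical, and real checkpoint sizes will differ somewhat.

```python
def bits_per_weight(bits, group_size, scale_bits=16):
    """Effective bits per weight, counting per-group scale and zero-point overhead.

    Assumes one scale of `scale_bits` bits and one zero-point of `bits` bits per group.
    Does not handle group_size = -1 (per-column quantization).
    """
    zero_bits = bits
    return bits + (scale_bits + zero_bits) / group_size


def estimated_size_gb(n_params, bits, group_size):
    """Approximate size in gigabytes of the quantized weights alone."""
    return n_params * bits_per_weight(bits, group_size) / 8 / 1e9


N_PARAMS = 7.24e9                      # approximate, assumed parameter count
FP16_SIZE_GB = N_PARAMS * 16 / 8 / 1e9  # 16-bit baseline for comparison

print(f"fp16 baseline: {FP16_SIZE_GB:.2f} GB")
for group_size in (32, 64, 128):
    print(
        f"4-bit, group size {group_size}: "
        f"{bits_per_weight(4, group_size):.3f} bits/weight, "
        f"about {estimated_size_gb(N_PARAMS, 4, group_size):.2f} GB"
    )
```

Larger groups amortize the per-group metadata over more weights, so they are smaller on disk but give the quantizer less freedom to fit each region of the weight matrix.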

In [None]:
# Let's load the model with the quantization configuration defined above

quantization_config = BaseQuantizeConfig(bits = BITS,
                                group_size = GROUP_SIZE,
                                desc_act = ACT_ORDER,
                                true_sequential = TRUE_SEQUENTIAL,
                                )
model = AutoGPTQForCausalLM.from_pretrained(MODEL_ID, quantization_config).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

In [None]:
# Load the dataset that will be used for quantization (the calibration dataset)
# Note: this dataset can be adapted to a specific fine-tune, but as a general rule it should resemble the model's training data
# Thus, for a foundation model such as Mistral-7B, an extract from a general-knowledge pre-training dataset works well
# Here we use a 128-sample extract from RedPajama2, with each sample truncated to at most 2048 tokens

dataset = load_dataset('sade-adrien/quantization_samples', split='train')
quantization_samples = [tokenizer(sample, truncation=True, max_length=2048) for sample in dataset['raw_content']]
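
Since quantization takes a while, it can be worth sanity-checking the calibration samples first. The helper below is a hypothetical sketch (it is not part of AutoGPTQ) that only assumes each sample can be indexed with `"input_ids"`, as the tokenizer outputs above can. It is shown here on toy data so that it runs on its own.

```python
def summarize_calibration(samples, max_length=2048):
    """Summarize token counts for tokenized samples (mappings with an 'input_ids' entry)."""
    lengths = [len(sample["input_ids"]) for sample in samples]
    return {
        "num_samples": len(lengths),
        "num_empty": sum(1 for n in lengths if n == 0),
        # Samples that reached max_length were most likely truncated.
        "num_at_max_length": sum(1 for n in lengths if n >= max_length),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


# Toy stand-ins for tokenizer outputs.
toy_samples = [
    {"input_ids": [1, 5, 9], "attention_mask": [1, 1, 1]},
    {"input_ids": [], "attention_mask": []},
    {"input_ids": list(range(8)), "attention_mask": [1] * 8},
]
summary = summarize_calibration(toy_samples, max_length=8)
print(summary)
```

In the notebook, the same check would be `summarize_calibration(quantization_samples)`; empty or very short samples contribute little calibration signal and may be worth filtering out.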

In [None]:
# Perform the quantization - takes roughly 15-25 minutes for a 7B-parameter model on a GPU with the dataset above

model.quantize(quantization_samples)

In [None]:
# Save the quantized model

model.save_quantized(QUANTIZED_MODEL_ID)
tokenizer.save_pretrained(QUANTIZED_MODEL_ID)

In [None]:
# Load the quantized model

model = AutoGPTQForCausalLM.from_quantized(QUANTIZED_MODEL_ID,
                                            use_marlin = False,         # whether to use the marlin kernel (disabled here)
                                            disable_exllama = False,    # False keeps the exllama-1 kernel enabled
                                            disable_exllamav2 = True,   # True disables the exllama-2 kernel
                                            device=DEVICE,
                                            )
tokenizer = AutoTokenizer.from_pretrained(QUANTIZED_MODEL_ID)

In [None]:
# Alternatively, load the quantized model with the Hugging Face framework (not recommended)

from transformers import AutoModelForCausalLM
from optimum.gptq import load_quantized_model
from accelerate import init_empty_weights
import torch

with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map='cpu')
empty_model.tie_weights()
model = load_quantized_model(empty_model, save_folder=QUANTIZED_MODEL_ID, device_map={'': DEVICE})
tokenizer = AutoTokenizer.from_pretrained(QUANTIZED_MODEL_ID)


## AWQ
AWQ (Activation-aware Weight Quantization) is based on the observation that most weights have only a small impact on the model's output, while a small fraction of important (salient) weights, identified from activation statistics, must be represented more precisely to keep performance high. To this end, the salient weights are scaled right before quantization so that they suffer less rounding error.

More information at: https://arxiv.org/pdf/2306.00978.pdf

This tutorial uses the AutoAWQ implementation: https://github.com/casper-hansen/AutoAWQ

Interestingly, AWQ is not a quantization method in itself but rather an improvement applied on top of one: the scaling step has to be paired with an actual quantization procedure, and the two are orthogonal. GPTQ is one candidate that can be combined with AWQ. Note, however, that to our knowledge the AutoAWQ implementation used here applies group-wise round-to-nearest quantization after scaling rather than GPTQ.
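
The scaling idea can be illustrated on a single group of weights. The sketch below is a toy demonstration, not the AutoAWQ code: it uses plain round-to-nearest as the quantizer, a hand-picked scale `s`, and hand-picked numbers, all of which are assumptions made for illustration. In the real algorithm the scales are searched using calibration activations. Multiplying a salient weight by `s` and dividing the matching activation by `s` leaves the exact output unchanged, but the salient weight is rounded on a relatively finer grid.

```python
def fake_quantize(weights, bits=4):
    """Round-to-nearest min-max quantization of one group, returned dequantized."""
    qmax = 2 ** bits - 1
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0   # fall back to 1.0 for an all-zero group
    zero = round(-w_min / scale)
    codes = [min(max(round(w / scale) + zero, 0), qmax) for w in weights]
    return [(code - zero) * scale for code in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.14, -0.8, 0.7, 0.32]     # one group taken from one weight row
activations = [10.0, 0.1, 0.1, 0.1]   # channel 0 sees large activations: it is salient
s = 2.0                               # hand-picked scale for the salient channel

reference = dot(weights, activations)

# Baseline: quantize the weights as they are.
plain = dot(fake_quantize(weights), activations)

# AWQ-style: scale the salient weight up and the matching activation down.
scaled_weights = [weights[0] * s] + weights[1:]
scaled_activations = [activations[0] / s] + activations[1:]
awq_like = dot(fake_quantize(scaled_weights), scaled_activations)

err_plain = abs(reference - plain)
err_scaled = abs(reference - awq_like)
print(f"output error without scaling: {err_plain:.4f}")
print(f"output error with scaling:    {err_scaled:.4f}")
```

The improvement here relies on the scaled weight not changing the group's min-max range, so the quantization step stays the same; choosing scales that are too large would widen the range and hurt the other weights in the group.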

In [None]:
# First, let's install the required libraries - this might take several minutes

!pip install torch transformers datasets optimum accelerate
!pip install git+https://github.com/casper-hansen/AutoAWQ_kernels.git -vvv
!pip install git+https://github.com/casper-hansen/AutoAWQ.git -vvv

In [None]:
# Now let's import the libraries we need

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import json

In [None]:
# Define quantization parameters

DEVICE = 'cuda:0'                                                       # device where to load the model and perform the quantization
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"                         # model to quantize, can be a Hugging Face id or a local path
QUANTIZED_MODEL_ID = "mistral-7b-instruct-v0.2-AWQ-Q4-GS128-GEMM"       # name given to the quantized model
BITS = 4                                                                # number of bits to quantize to, only 4-bit is currently supported
GROUP_SIZE = 128                                                        # size used for grouping in quantization algorithm, recommended 128, -1 quantizes per column
VERSION = 'GEMM'                                                        # version to quantize to ['GEMM', 'GEMV', 'GEMV_fast', 'marlin'], recommended GEMM
ZERO_POINT = True                                                       # whether to use zero-point quantization, recommended True, must be False for the marlin kernel
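
The constraints noted in the comments above can be checked before starting a long quantization run. The helper below is a hypothetical sketch (it is not part of AutoAWQ) that encodes only the rules stated in this notebook: 4-bit weights, one of the four listed versions, and no zero-point with the marlin kernel. The dictionary keys mirror the quantization configuration used later in this notebook.

```python
SUPPORTED_VERSIONS = {"gemm", "gemv", "gemv_fast", "marlin"}


def check_awq_config(config):
    """Return a list of human-readable problems found in an AWQ config dict."""
    problems = []
    if config.get("w_bit") != 4:
        problems.append("w_bit must be 4 (only 4-bit is currently supported)")
    version = str(config.get("version", "")).lower()
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"unknown version: {config.get('version')!r}")
    if version == "marlin" and config.get("zero_point", True):
        problems.append("zero_point must be False for the marlin kernel")
    return problems


good = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
bad = {"zero_point": True, "q_group_size": 128, "w_bit": 3, "version": "marlin"}

print(check_awq_config(good))
print(check_awq_config(bad))
```

An empty list means none of the rules above were violated; it does not guarantee that the installed library version accepts the configuration.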

In [None]:
# Let's load the model with the quantization configuration defined above

model = AutoAWQForCausalLM.from_pretrained(MODEL_ID, device_map={'': DEVICE})
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

In [None]:
# Perform the quantization - takes roughly 15-25 minutes for a 7B-parameter model on a GPU

quantization_config = {"zero_point": ZERO_POINT, 
                    "q_group_size": GROUP_SIZE, 
                    "w_bit": BITS, 
                    "version": VERSION, 
                    "modules_to_not_convert": [],
                    }

model.quantize(tokenizer, quant_config=quantization_config)

In [None]:
# Save the quantized model

model.save_quantized(QUANTIZED_MODEL_ID)
tokenizer.save_pretrained(QUANTIZED_MODEL_ID)
with open(f"{QUANTIZED_MODEL_ID}/quant_config.json", "w") as file:
    json.dump(quantization_config, file, indent=4)

In [None]:
# Load the quantized model

model = AutoAWQForCausalLM.from_quantized(QUANTIZED_MODEL_ID, 
                                            fuse_layers=True,               # fusing layers accelerates inference greatly
                                            use_exllama=False,              # use exllama-1 kernel
                                            use_exllama_v2=True,            # use exllama-2 kernel
                                            max_seq_len=512,                # max sequence length, used when fusing layer to allocate memory
                                            device_map={'': DEVICE},
                                            )
tokenizer = AutoTokenizer.from_pretrained(QUANTIZED_MODEL_ID)

In [None]:
# Alternatively, load the quantized model with the Hugging Face framework (not recommended)
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="gemm",                                                            # version of model to be loaded
                                exllama_config={"version": 2, "max_input_len": 2048, "max_batch_size": 8}, # use exllama kernel, can be omitted
                                do_fuse = True,                                                            # fusing layers accelerates inference greatly
                                fuse_max_seq_len = 512,                                                    # max sequence length, used when fusing layer to allocate memory
                                )

model = AutoModelForCausalLM.from_pretrained(QUANTIZED_MODEL_ID,
                                            quantization_config=quantization_config,
                                            device_map={'': DEVICE},
                                        )
tokenizer = AutoTokenizer.from_pretrained(QUANTIZED_MODEL_ID)