# Quantization 
A Large Language Model is represented by a large set of weights and activations. These values are usually stored in the 32-bit floating point (float32) datatype.

![](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*KLPKaFB9_f6jKcyl.png)

As you might expect, if we choose a lower bit width, the model becomes less accurate, but each value takes up less space, which reduces the model's size and memory requirements.
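As a rough back-of-the-envelope illustration of this trade-off, the sketch below estimates the memory needed just to store the weights at different bit widths. The 7-billion parameter count is an assumed example, and the estimate ignores activations, quantization scales and any layers that stay in higher precision.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7_000_000_000  # assumed example size, roughly a "7B" model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(num_params, bits):.1f} GB")
```

Halving the bit width halves the storage, which is why 4-bit methods make 7B models fit on a single consumer GPU.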

There are three popular quantization methods for LLMs. We will briefly discuss each of them and then look at how to use each technique in code.

- GPTQ
- AWQ
- Bitsandbytes NF4

## GPTQ Quantization
GPTQ is a post-training quantization (PTQ) method that focuses primarily on GPU inference and performance.

Post-training quantization is a form of quantization in which the weights of an already trained model are converted to a lower precision without any retraining. It is a reliable and easy-to-implement approach.

GPTQ is a one-shot weight quantization method based on approximate second-order information, and it is both highly accurate and highly efficient.

The idea behind the method is to compress all weights to 4-bit while minimizing the mean squared error between the outputs of the original layer and those of the new quantized layer.
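To make the error being minimized concrete, here is a minimal sketch that quantizes a toy layer with plain round-to-nearest and measures the mean squared error of the layer outputs. This is not GPTQ itself: GPTQ additionally uses second-order information to compensate for the error it introduces. The helper names and the random toy data are assumptions for illustration only.

```python
import random


def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest: snap each weight to an integer level, then rescale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes], scale


def layer_output(rows, x):
    """Output of a bias-free linear layer: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def output_mse(rows, quantized_rows, inputs):
    """Mean squared error between original and quantized layer outputs."""
    errors = []
    for x in inputs:
        y = layer_output(rows, x)
        y_q = layer_output(quantized_rows, x)
        errors.extend((a - b) ** 2 for a, b in zip(y, y_q))
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
rows = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
inputs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]

for bits in (8, 4):
    quantized_rows = [quantize_rtn(row, bits)[0] for row in rows]
    print(bits, "bits -> output MSE:", output_mse(rows, quantized_rows, inputs))
```

The 4-bit error is noticeably larger than the 8-bit one, and that gap is what a method like GPTQ tries to shrink without retraining.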

In [1]:
%%capture
!pip install transformers optimum accelerate auto-gptq bitsandbytes

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
raw_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

In [3]:
total_params = sum(p.numel() for p in raw_model.parameters())
print("Total Parameters:", total_params)

Total Parameters: 125239296


In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ, calibrated on samples from the C4 dataset
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Passing the config quantizes the model while it is loaded
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)


In [5]:
total_params = sum(p.numel() for p in model.parameters())
print("Total Parameters:", total_params)

Total Parameters: 40221696
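The drop from 125,239,296 to 40,221,696 does not mean that weights were removed. A plausible explanation (an assumption about how AutoGPTQ stores weights, not something this notebook verifies) is that the quantized linear layers keep their packed 4-bit weights, scales and zero points as buffers rather than as parameters, so `model.parameters()` only sees the embeddings and layer norms. The arithmetic below, using architecture values assumed from the public `facebook/opt-125m` config, reproduces both printed counts.

```python
# Architecture values assumed from the facebook/opt-125m config (not read from the model here).
vocab_size, hidden, ffn_dim, num_layers = 50272, 768, 3072, 12
num_positions = 2048 + 2  # OPT's learned position embedding uses an offset of 2

# The output head is tied to the token embedding, so it is counted once.
embeddings = vocab_size * hidden + num_positions * hidden
# Two LayerNorms per decoder layer plus a final one, each with a weight and a bias.
layer_norms = (2 * num_layers + 1) * 2 * hidden

attention = 4 * (hidden * hidden + hidden)  # q, k, v and output projections with biases
feed_forward = (hidden * ffn_dim + ffn_dim) + (ffn_dim * hidden + hidden)
linear_layers = num_layers * (attention + feed_forward)

full_count = embeddings + layer_norms + linear_layers
without_linear = embeddings + layer_norms
print("All parameters:", full_count)
print("Excluding linear layers:", without_linear)
```

Because of this, parameter counts are a poor proxy for the size of a quantized model. Comparing `model.get_memory_footprint()` before and after quantization is likely a more direct measurement.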


## AWQ Quantization
AWQ (Activation-aware Weight Quantization) is a quantization method similar to GPTQ. There are several differences between the two methods, but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance.

In other words, a small fraction of salient weights, identified from the activations they interact with, is protected during quantization, which reduces the quantization loss.
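One way to picture how important weights can be protected is to scale them up before quantization and fold the inverse scale back in afterwards. The toy sketch below does this for a single hand-picked channel. The numbers and the scale factor of 2 are assumptions chosen for illustration; a real AWQ implementation derives per-channel scales from activation statistics gathered on calibration data.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.7, 0.24, -0.5, 0.1]       # one row of a toy linear layer
activations = [1.0, 10.0, 1.0, 1.0]    # channel 1 carries much larger activations
channel_scales = [1.0, 2.0, 1.0, 1.0]  # hand-picked: scale up only the salient channel

reference = dot(weights, activations)

# Plain quantization treats every weight the same.
plain_error = abs(dot(quantize_rtn(weights), activations) - reference)

# Activation-aware variant: quantize w * s, then fold 1 / s back in
# (equivalent to dividing the matching input channel by s).
scaled = quantize_rtn([w * s for w, s in zip(weights, channel_scales)])
effective = [w / s for w, s in zip(scaled, channel_scales)]
aware_error = abs(dot(effective, activations) - reference)

print("plain error:", plain_error)
print("activation-aware error:", aware_error)
```

In this example the rounding error on the salient weight shrinks by the scale factor, and since that weight is multiplied by the largest activation, the output error shrinks with it.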

As a result, their paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance.

### Quantize Model to AWQ
AWQ performs zero point quantization down to a precision of 4-bit integers.
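As a minimal sketch of what zero-point (asymmetric) quantization means, the code below maps a handful of toy weights onto the unsigned 4-bit integers 0 to 15 and back. The function names and values are illustrative assumptions, and real implementations apply this per group of weights rather than to a whole tensor.

```python
def quantize_zero_point(weights, bits=4):
    """Asymmetric quantization: map [min, max] onto the integers 0 .. 2**bits - 1."""
    qmax = 2 ** bits - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # step size between neighbouring codes
    # Integer code that represents the real value 0.0.
    zero_point = max(0, min(qmax, round(-w_min / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate real values from the integer codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.62, -0.11, 0.0, 0.27, 0.93]  # toy values
codes, scale, zero_point = quantize_zero_point(weights)
restored = dequantize(codes, scale, zero_point)
print("codes:", codes, "zero point:", zero_point)
print("restored:", [round(r, 3) for r in restored])
```

The zero point lets the integer range cover an asymmetric interval, so no codes are wasted when the weights are not centred on zero.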

### AutoAWQ integration with Transformers

Note:
- Some models, like Falcon, are only compatible with group size 64.
- To use Marlin, you must specify zero point as False and version as Marlin.


In [None]:
!pip install autoawq

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
# quant_path = 'mistral-instruct-v0.2-fbc-awq'
quant_config = { "zero_point": False, "q_group_size": 64, "w_bit": 4, "version": "GEMM" }

# Unquantized reference model, used only to compare parameter counts
raw_model = AutoModelForCausalLM.from_pretrained(model_path)

# Load model
q_model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": False, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
q_model.quantize(tokenizer, quant_config=quant_config)

# # Save quantized model
# q_model.save_quantized(quant_path)
# tokenizer.save_pretrained(quant_path)

total_params_raw = sum(p.numel() for p in raw_model.parameters())
total_params_q = sum(p.numel() for p in q_model.parameters())
print("Total Parameters (raw):", total_params_raw)
print("Total Parameters (AWQ):", total_params_q)

## Bitsandbytes NF4
bitsandbytes is the easiest way to quantize a model to 8-bit and 4-bit.

It works in three steps:

- Normalization: The weights of the model are normalized so that they fall within a fixed range. This allows for a more efficient representation of the most common values.
- Quantization: The weights are quantized to 4-bit. In NF4, the quantization levels are placed at the quantiles of a normal distribution rather than at equal intervals, so more levels sit near zero, where most normalized weights lie.
- Dequantization: Although the weights are stored in 4-bit, they are dequantized during computation, so the arithmetic itself runs in higher precision.

In short, it represents the weights with 4-bit quantization but does the inference in 16-bit.
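The three steps above can be sketched in a few lines of plain Python. The level table here is built from evenly spaced quantiles of a standard normal distribution and is only an illustrative stand-in: it is not the exact NF4 table used by bitsandbytes, and the function names and toy block are assumptions.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Illustrative levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
    This is NOT the exact NF4 table used by bitsandbytes."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_block(weights, levels):
    """Steps 1 and 2: normalize by the block's absolute maximum, then pick the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Step 3: look up each level and rescale it before computing with it."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.15, 0.31, -0.07, 0.11, -0.48]  # toy weight block
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print("codes:", codes)
print("restored:", [round(w, 3) for w in restored])
```

Each weight is stored as a 4-bit index plus one shared `absmax` per block, and the levels are packed more tightly around zero than near the edges of the range.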

This is a wonderful technique, but it seems rather wasteful to have to apply it every time we load the model.

### Quantize model with Bitsandbytes
Using this quantization is straightforward with HuggingFace. We just need to define a configuration for the quantization with Bitsandbytes:

In [6]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

This configuration allows us to specify which quantization levels we are going for. Generally, we want to represent the weights with 4-bit quantization but do the inference in 16-bit.

Loading the model in a pipeline is then straightforward:

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Zephyr with BitsAndBytes Configuration
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-alpha",
    quantization_config=bnb_config,
    device_map='auto',
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

prompt = '''
<|system|> \
You are a friendly chatbot who always responds in the style of a pirate.\
</s>\
<|user|>\
Write a tweet on future of AI. </s>\
<|assistant|>
'''

outputs = pipe(
    prompt, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=0.7, 
    top_p=0.95
)
print(outputs[0]["generated_text"])



<|system|> You are a friendly chatbot who always responds in the style of a pirate.</s><|user|>Write a tweet on future of AI. </s><|assistant|>
"Ahoy there, matey! The future of AI be lookin' mighty fine, with advancements in machine learning, natural language processing, and neural networks. Brace yerself for a new era of intelligent automation and augmented intelligence!" #AI #FutureofAI #IntelligentAutomation #AugmentedIntelligence #MachineLearning #NaturalLanguageProcessing


In [8]:
raw_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-alpha",
    device_map='auto',
)

In [9]:
total_params_raw = sum(p.numel() for p in raw_model.parameters())
total_params_q = sum(p.numel() for p in model.parameters())
print("Total Parameters:", total_params_raw)
print("Total Parameters:", total_params_q)

Total Parameters: 7241732096
Total Parameters: 3752071168
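The first number is the raw model and the second is the 4-bit model, which reports roughly half as many elements. A plausible reading (an assumption about the bitsandbytes storage format rather than something this notebook checks) is that each 4-bit linear weight is packed two per byte into an 8-bit tensor, so `numel()` counts half as many elements for those layers, while the embeddings, output head and norms are left unquantized. The arithmetic below, with architecture values assumed from the Mistral-7B config that Zephyr is based on, reproduces both printed numbers.

```python
# Architecture values assumed from the Mistral-7B config (not read from the model here).
vocab_size, hidden, ffn_dim, num_layers = 32000, 4096, 14336, 32
kv_dim = 8 * 128  # 8 key/value heads of size 128 (grouped-query attention)

# q and o projections are hidden x hidden; k and v are hidden x kv_dim (no biases).
attention = 2 * hidden * hidden + 2 * hidden * kv_dim
mlp = 3 * hidden * ffn_dim  # gate, up and down projections
linear_weights = num_layers * (attention + mlp)

# Token embedding, untied output head, and one weight vector per RMSNorm.
unquantized = 2 * vocab_size * hidden + (2 * num_layers + 1) * hidden

raw_count = linear_weights + unquantized
packed_count = linear_weights // 2 + unquantized  # two 4-bit values per stored byte
print("Raw:", raw_count)
print("4-bit packed:", packed_count)
```

So the halved count reflects how the packed tensors are stored, not a model with half the weights.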


# Bitsandbytes vs GPTQ vs AWQ
A quick comparison of Bitsandbytes, GPTQ and AWQ quantization, so you can choose which method to use for your use case.

## Bitsandbytes
- This method quantizes the model directly from the HF weights, so it is very easy to implement.
- Slower than the other quantization methods, and slower than the 16-bit LLM.
- This is a wonderful technique, but it seems rather wasteful to have to apply it every time we load the model.
- Takes longer to load the model weights.

## GPTQ
- Much faster than Bitsandbytes
- New model architectures are promptly supported in AutoGPTQ

### Challenges
- The model weights need to be quantized to GPTQ weights beforehand in order to use the model in production.
- Quantizing the model is computationally expensive.
- Around 16GB of GPU memory is required to quantize a 7B parameter model.

## AWQ
- Paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance

### Limitations
- Newer architectures like Gemma or DeciLM are not yet supported by AWQ quantization.