# Model Quantization #

Depending on the size of the model and how much GPU memory is available, an LLM may need to be quantized so that it fits in GPU memory. As a rough calculation, an LLM needs about 4 bytes of memory per parameter, because each parameter is typically stored as a 32-bit (4-byte) float. For example, a 7-billion ("7B") parameter model (e.g. Mistral 7B) requires around 28 GB of memory to run.

Model quantization reduces the amount of memory required by each model parameter. For example, 8-bit quantization reduces the amount of memory required for a 7-billion parameter model to 7 GB, since each parameter uses one byte instead of four. 4-bit quantization further reduces the amount of memory required to around 3.5 GB.
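
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate of the memory taken by the weights themselves; the function name `estimate_weight_memory_gb` is made up for this illustration, and it assumes 1 GB = 10^9 bytes and ignores any extra memory needed at inference time (activations, caches, quantization metadata).

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Hypothetical 7-billion-parameter model at the precisions discussed above
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {estimate_weight_memory_gb(7e9, bits):.1f} GB")
```

For a 7B model this reproduces the figures quoted in the text: roughly 28 GB at 32 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits.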

It's generally better to use a larger (in terms of parameters), quantized model than a smaller, unquantized model of the same family, as long as the quantization uses at least 2 bits per parameter. The following repository shows the drop-off in quality of model responses as fewer bits are used per model parameter, for the Llama 3 8B and 70B models.

https://github.com/matt-c1/llama-3-quant-comparison

In this notebook, we download full-sized models (no quantization) from Hugging Face, apply 4-bit or 8-bit quantization via the bitsandbytes library, and store the quantized model for later use. This speeds up model loading, since the stored models don't need to be quantized again at load time.

References:
https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-transformers-and-bitsandbytes-integration-b4c9983e8996
https://towardsai.net/p/artificial-intelligence/llm-quantization-quantize-model-with-gptq-awq-and-bitsandbytes

In [None]:
!pip install --quiet --upgrade torch

In [None]:
# !pip install --quiet --upgrade transformers
!pip install --quiet --upgrade git+https://github.com/huggingface/transformers

In [None]:
!pip install --quiet --upgrade bitsandbytes accelerate

In [None]:
%%time

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import BitsAndBytesConfig

# model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
model_id = "google/gemma-2-27b-it"

## 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True
)

## 4-bit quantization
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    # quantization_config=bnb_config_4bit,
    # torch_dtype=torch.bfloat16,
    torch_dtype="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="sequential" ## using sequential instead of auto/balanced since otherwise lm_head can get put on CPU
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

In [None]:
quantized_model_dir = "models/" + model_id.replace("/", "__")

model.save_pretrained(quantized_model_dir)
_ = tokenizer.save_pretrained(quantized_model_dir)

If quantization reduced the model to a size that fits in GPU memory, every item in the printed device map should map to one of the GPU devices listed by print_device_info. None of the layers should map to the CPU. For example, if two GPU devices are available and listed (0 and 1), each item listed by print_device_map, including all layers, lm_head, and norm, should have the value 0 or 1.

In [None]:
import utils
utils.print_device_info()
utils.print_model_info(model)
utils.print_device_map(model)
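
The device-map check described above can also be done programmatically rather than by eye. The following is a minimal sketch, not part of the `utils` module used in this notebook: `find_offloaded_modules` is a hypothetical helper, and `example_map` is an invented device map used only to show the idea.

```python
def find_offloaded_modules(device_map: dict) -> list:
    """Return the names of modules placed on CPU or disk instead of a GPU."""
    return [
        name
        for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    ]


# Invented device map for illustration: two GPUs (0 and 1), with lm_head left on CPU
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": "cpu",
}
print(find_offloaded_modules(example_map))
```

With a real model loaded using `device_map`, the dictionary to pass in would most likely be `model.hf_device_map`; an empty result means nothing was offloaded.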

We'll now clear the quantized model from memory so that we can reload the locally stored version and confirm that the complete process was successful.

In [None]:
import gc
import torch

del model
gc.collect()
torch.cuda.empty_cache()

Ensure that the memory on the GPU device(s) has been freed.

In [None]:
import utils
utils.print_device_info()

Now we'll load the locally stored, quantized model into memory.

In [None]:
%%time

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_dir,
    local_files_only=True
)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    device_map="sequential",
    local_files_only=True
)

We should see that the locally stored model has the same quantization and memory usage as the model that was quantized as it was loaded into memory.

In [None]:
import utils
utils.print_device_info()
utils.print_model_info(model)
utils.print_device_map(model)