<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/GGUF_Quantization_with_an_Importance_Matrix_(imatrix)_and_K_quantization_Example_with_Gemma_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*More details in this article: [GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU](https://newsletter.kaitchup.com/p/gguf-quantization-with-imatrix-and-q-quants)*

This notebook shows how to quantize LLMs with the GGUF format using llama.cpp. The quantization method investigated here relies on k-quantization and an importance matrix.

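For demonstration, it uses Gemma 2 instruct: the quantization cells below are set up for the 9B version (`google/gemma-2-9b-it`), and the benchmark section uses GGUF files of the 2B version.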
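To give some intuition for what an importance matrix changes, here is a toy sketch in plain Python. It is not llama.cpp's implementation: the block size, the candidate search, and the function names (`quantize_with_range`, `importance_aware_quantize`) are all assumptions made for illustration. The idea it shows is that, once each weight has an importance score, the quantization range of a block can be picked to minimize the importance-weighted error instead of the plain error.

```python
import random

def quantize_with_range(block, lo, hi, bits=4):
    """Round each value to one of 2**bits evenly spaced levels in [lo, hi]."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Values outside [lo, hi] are clamped to the nearest end of the range.
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in block]
    return [lo + c * scale for c in codes]

def weighted_error(block, recon, importance):
    """Squared reconstruction error, weighted per value by its importance."""
    return sum(w * (x - r) ** 2 for x, r, w in zip(block, recon, importance))

def importance_aware_quantize(block, importance, bits=4, steps=25):
    """Try progressively narrower ranges and keep the lowest weighted error."""
    lo, hi = min(block), max(block)
    width = hi - lo
    best_err, best_recon = None, None
    for i in range(steps + 1):
        t = 0.25 * i / steps  # t = 0 reproduces the plain min/max range
        recon = quantize_with_range(block, lo + t * width, hi - t * width, bits)
        err = weighted_error(block, recon, importance)
        if best_err is None or err < best_err:
            best_err, best_recon = err, recon
    return best_recon, best_err

random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]  # one toy block of weights
block[0] = 6.0                                       # an outlier weight
importance = [1.0] * 32
importance[0] = 0.01  # assumption: this weight rarely matters on calibration data

naive = quantize_with_range(block, min(block), max(block))
naive_err = weighted_error(block, naive, importance)
_, tuned_err = importance_aware_quantize(block, importance)
print(f"weighted error, min/max range:      {naive_err:.4f}")
print(f"weighted error, importance-aware:   {tuned_err:.4f}")
```

Because the plain min/max range is one of the candidates, the importance-aware result can never be worse on the weighted error; how much better it gets depends on the block and on the importance values.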

#Setup

First, clone llama.cpp, build it with the GPU backend (CUDA), and install its Python dependencies:

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && GGML_CUDA=1 make && pip install -r requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 32950, done.
remote: Counting objects: 100% (8369/8369), done.
remote: Compressing objects: 100% (642/642), done.
remote: Total 32950 (delta 8026), reused 7802 (delta 7711), pack-reused 24581 (from 1)
Receiving objects: 100% (32950/32950), 57.06 MiB | 21.57 MiB/s, done.
Resolving deltas: 100% (23797/23797), done.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration
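Once the build finishes, it can be worth checking that the command-line tools used later in this notebook were actually produced. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and it assumes the binaries land directly in the `llama.cpp` directory, as the `./llama.cpp/llama-...` calls in this notebook suggest.

```python
from pathlib import Path

# Tool names taken from the cells of this notebook.
TOOLS = ["llama-imatrix", "llama-quantize", "llama-perplexity", "llama-bench"]

def missing_tools(build_dir="llama.cpp", tools=TOOLS):
    """Return the tools that are not present as files in build_dir."""
    root = Path(build_dir)
    return [t for t in tools if not (root / t).is_file()]

missing = missing_tools()
print("missing tools:", missing if missing else "none")
```

An empty list means all four binaries are in place; anything else usually points to a failed or interrupted build.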

#Quantize

Then, we define the following variables:

In [None]:
from huggingface_hub import snapshot_download

model_name = "google/gemma-2-9b-it" # the model we want to quantize
#methods = ['q2_k', 'q3_k_m', 'q4_0', 'q4_k_m', 'q5_0', 'q5_k_m', 'q6_k', 'q8_0']
methods = ['Q4_K_S','Q4_K_M']
base_model = "./original_model_gemma2-9b/" # where the downloaded Hugging Face checkpoint will be stored
quantized_path = "./quantized_model_gemma2-9b/" # where the FP16 and quantized GGUF models will be stored

original_model = quantized_path+'FP16.gguf'
!mkdir {quantized_path}

mkdir: cannot create directory ‘./quantized_model_gemma2-9b/’: File exists
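The "File exists" message above appears when the cell is re-run, because plain `mkdir` fails on an existing directory. A small sketch of a re-runnable alternative in Python follows; the helper name `ensure_dir` is hypothetical, and the demonstration uses a temporary directory so that it does not touch the notebook's folders.

```python
import os
import tempfile

def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstration in a throwaway location: calling it twice raises no error.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "quantized_model_demo")
    ensure_dir(target)
    ensure_dir(target)
    print(os.path.isdir(target))
```

In this notebook, `ensure_dir(quantized_path)` could stand in for the `!mkdir {quantized_path}` line.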


We must download the model to quantize from the Hugging Face Hub:

In [None]:
snapshot_download(repo_id=model_name, local_dir=base_model , local_dir_use_symlinks=False)

Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/857 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/25.8k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/39.1k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

(…)ransformers-4.42.0.dev0-py3-none-any.whl:   0%|          | 0.00/9.21M [00:00<?, ?B/s]

'/content/original_model_gemma2-9b'
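Before converting, a quick sanity check of the downloaded folder can catch an incomplete download. This is a sketch only: `summarize_checkpoint` is a hypothetical helper, and it assumes the checkpoint consists of a `config.json` plus `*.safetensors` shards, as in the file listing above.

```python
from pathlib import Path

def summarize_checkpoint(model_dir):
    """Report whether config.json exists and how large the weight shards are."""
    root = Path(model_dir)
    shards = sorted(root.glob("*.safetensors"))
    total_bytes = sum(p.stat().st_size for p in shards)
    return {
        "has_config": (root / "config.json").is_file(),
        "num_shards": len(shards),
        "total_gb": total_bytes / 1e9,
    }

# Works on any directory; a missing directory simply reports zero shards.
print(summarize_checkpoint("./original_model_gemma2-9b/"))
```

For the download shown above, four shards would be expected.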

Next, we convert the downloaded model to the GGUF format with `convert_hf_to_gguf.py`:

In [None]:
!python llama.cpp/convert_hf_to_gguf.py {base_model} --outfile {original_model}

INFO:hf-to-gguf:Loading model: original_model_gemma2-9b
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> F16, shape = {3584, 256000}
INFO:hf-to-gguf:blk.0.attn_norm.weight,            torch.bfloat16 --> F32, shape = {3584}
INFO:hf-to-gguf:blk.0.ffn_down.weight,             torch.bfloat16 --> F16, shape = {14336, 3584}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,             torch.bfloat16 --> F16, shape = {3584, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,               torch.bfloat16 --> F16, shape = {3584, 14336}
INFO:hf-to-gguf:blk.0.post_attention_norm.weight,  torch.bfloat16 --> F32, shape = {3584}
INFO:hf-to-gguf:blk.0.post_ffw_norm.weight,        torch.bfloat16 --> F32, shape = {3584}
INFO:hf-to-gguf:blk.0.
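To confirm that the conversion produced a valid file, the fixed-size GGUF header can be read with the standard library alone. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count; `read_gguf_header` is a hypothetical helper name.

```python
import struct

def read_gguf_header(path):
    """Read the magic, version, tensor count and metadata count of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic = {magic!r})")
        (version,) = struct.unpack("<I", f.read(4))          # uint32, little endian
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))   # two uint64 values
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Calling `read_gguf_header(original_model)` after the conversion should report the same version and counts that llama.cpp prints when it loads the file.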

Download the files to be used for calibration and evaluation:

In [None]:
!wget https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/mono/en.txt.gz
!gunzip en.txt.gz
!head -n 10000 en.txt > en-h10000.txt
!sh llama.cpp/scripts/get-wikitext-2.sh

--2024-09-04 11:44:56--  https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/mono/en.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 532958396 (508M) [application/gzip]
Saving to: ‘en.txt.gz’


2024-09-04 11:45:01 (114 MB/s) - ‘en.txt.gz’ saved [532958396/532958396]

gzip: en.txt already exists; do you wish to overwrite (y or n)? ^C
--2024-09-04 11:50:12--  https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
Resolving huggingface.co (huggingface.co)... 18.239.50.16, 18.239.50.103, 18.239.50.49, ...
Connecting to huggingface.co (huggingface.co)|18.239.50.16|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/c6/78/c67802fcd48fa6f6a86773410b21cc6db1c5c546b20683b6c30b95f327a66922/ef7edb566e3e2b2d31b29c1fdb0c89a4cc6835
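The output above shows `gunzip` stopping to ask whether `en.txt` should be overwritten when the cell is re-run. One way around the prompt, sketched here with the standard library, is to read the first lines straight from the compressed file. The helper name `write_head_from_gzip` is hypothetical; the file names in the usage note are the ones used in the cell above.

```python
import gzip
from itertools import islice

def write_head_from_gzip(src_gz, dst, n_lines):
    """Copy the first n_lines lines of a gzip-compressed text file to dst."""
    written = 0
    with gzip.open(src_gz, "rt", encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in islice(fin, n_lines):  # stops early, no full decompression
            fout.write(line)
            written += 1
    return written
```

For the calibration file of this notebook, the call would be `write_head_from_gzip("en.txt.gz", "en-h10000.txt", 10000)`.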

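Compute the importance matrix once from the calibration text, then, for each selected method, quantize the FP16 model without and with the importance matrix and measure the perplexity of both versions on the WikiText-2 test set: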

In [None]:
# Compute the importance matrix from the calibration text (-ngl 99 offloads the layers to the GPU)
!./llama.cpp/llama-imatrix -m {original_model} -f en-h10000.txt -o {quantized_path}/imatrix.dat --verbosity 1 -ngl 99
for m in methods:
    qtype = f"{quantized_path}/{m.upper()}.gguf"     # quantized without the importance matrix
    iqtype = f"{quantized_path}/{m.upper()}_I.gguf"  # quantized with the importance matrix

    # Baseline: quantize, then evaluate perplexity
    !./llama.cpp/llama-quantize {original_model} {qtype} {m}
    !./llama.cpp/llama-perplexity -m {qtype} -f wikitext-2-raw/wiki.test.raw > {quantized_path}/{m.upper()}_perplexity.txt

    # Same quantization type, guided by the importance matrix
    !./llama.cpp/llama-quantize --imatrix {quantized_path}/imatrix.dat {original_model} {iqtype} {m}
    !./llama.cpp/llama-perplexity -m {iqtype} -f wikitext-2-raw/wiki.test.raw > {quantized_path}/{m.upper()}_I_perplexity.txt




llama_model_loader: loaded meta data with 38 key-value pairs and 464 tensors from ./quantized_model_gemma2-9b//FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Original_Model_Gemma2 9b
llama_model_loader: - kv   3:                           general.basename str              = original_model_gemma2
llama_model_loader: - kv   4:                         general.size_label str              = 9B
llama_model_loader: - kv   5:                            general.license str              = gemma
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:          
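The loop above writes one `*_perplexity.txt` file per model. To compare them, the final score has to be pulled out of each file. The sketch below assumes the file contains a line of the form `Final estimate: PPL = <value> +/- <error>`; that format is an assumption about llama-perplexity's output and should be checked against the actual files. `parse_perplexity` is a hypothetical helper, and the numbers in the sample string are placeholders, not results.

```python
import re

# Assumed format of the summary line printed by llama-perplexity.
PPL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def parse_perplexity(text):
    """Return (perplexity, uncertainty) from the log text, or None if absent."""
    match = PPL_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Placeholder text, only to show the expected shape of the input.
sample = "some earlier log lines\nFinal estimate: PPL = 1.2345 +/- 0.0678\n"
print(parse_perplexity(sample))
```

Applied to the files for `Q4_K_M` and `Q4_K_M_I`, this gives a direct with/without-imatrix comparison for the same quantization type.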

#Benchmark Inference Throughput

Reset your environment and recompile with the CPU backend to benchmark on a CPU, using these commands:

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && make && pip install -r requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 33381, done.
remote: Counting objects: 100% (9041/9041), done.
remote: Compressing objects: 100% (786/786), done.
remote: Total 33381 (delta 8610), reused 8386 (delta 8212), pack-reused 24340 (from 1)
Receiving objects: 100% (33381/33381), 56.89 MiB | 30.06 MiB/s, done.
Resolving deltas: 100% (24135/24135), done.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -W

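Then, download the FP16 and quantized GGUF files to benchmark, here the Gemma 2 2B instruct versions hosted in `kaitchup/gemma-2-2b-it-GGUF`:

In [None]: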
from huggingface_hub import snapshot_download
quantized_path = "./quantized_model/" #where the quantized GGUF model will be stored

original_model = quantized_path+'FP16.gguf'
!mkdir {quantized_path}

snapshot_download(repo_id="kaitchup/gemma-2-2b-it-GGUF", local_dir=quantized_path , local_dir_use_symlinks=False)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.81k [00:00<?, ?B/s]

Q4_K_M_I_perplexity.txt:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

Q4_K_M_perplexity.txt:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

Q4_K_M.gguf:   0%|          | 0.00/1.71G [00:00<?, ?B/s]

FP16.gguf:   0%|          | 0.00/5.24G [00:00<?, ?B/s]

Q4_K_S.gguf:   0%|          | 0.00/1.64G [00:00<?, ?B/s]

Q4_K_M_I.gguf:   0%|          | 0.00/1.71G [00:00<?, ?B/s]

Q4_K_S_I_perplexity.txt:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/30.0 [00:00<?, ?B/s]

Q4_K_S_perplexity.txt:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

Q4_K_S_I.gguf:   0%|          | 0.00/1.64G [00:00<?, ?B/s]

imatrix.dat:   0%|          | 0.00/2.38M [00:00<?, ?B/s]

'/content/quantized_model'
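The file sizes in the download listing above already show how much space 4-bit k-quantization saves. As a small worked example, the snippet below computes the size ratio of each quantized file relative to the FP16 file; the sizes are copied from the listing, and `compression_ratios` is a hypothetical helper.

```python
# Sizes in GB, as reported in the download listing above.
sizes_gb = {"FP16": 5.24, "Q4_K_M": 1.71, "Q4_K_S": 1.64}

def compression_ratios(sizes, reference="FP16"):
    """Return how many times smaller each file is than the reference file."""
    ref = sizes[reference]
    return {name: ref / size for name, size in sizes.items() if name != reference}

for name, ratio in compression_ratios(sizes_gb).items():
    print(f"{name}: {ratio:.2f}x smaller than FP16")
```

Both quantized files come out at roughly one third of the FP16 size, with `Q4_K_S` slightly smaller than `Q4_K_M`.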

Benchmark the FP16 model and the quantized GGUF versions:

In [None]:
!./llama.cpp/llama-bench -m {original_model} -m {quantized_path}/Q4_K_M.gguf -m {quantized_path}/Q4_K_M_I.gguf -m {quantized_path}/Q4_K_S.gguf -m {quantized_path}/Q4_K_S_I.gguf -n 128,256,512

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| gemma2 2B F16                  |   5.97 GiB |     3.20 B | CPU        |       4 |         pp512 |         49.13 ± 0.52 |
| gemma2 2B F16                  |   5.97 GiB |     3.20 B | CPU        |       4 |         tg128 |          9.57 ± 0.11 |
| gemma2 2B F16                  |   5.97 GiB |     3.20 B | CPU        |       4 |         tg256 |          9.48 ± 0.11 |
| gemma2 2B F16                  |   5.97 GiB |     3.20 B | CPU        |       4 |         tg512 |          8.83 ± 0.41 |
| gemma2 2B Q4_K - Medium        |   2.04 GiB |     3.20 B | CPU        |       4 |         pp512 |         44.98 ± 0.51 |
| gemma2 2B Q4_K - Medium        |   2.04 GiB |     3.20 B | CPU        |       4 |         tg128 |         18.58 ± 0.24 |
| gemma2 2B Q4_K