<a href="https://colab.research.google.com/github/shah-zeb-naveed/large-language-models/blob/main/quantization/quantize_codellama2_llama_cpp_gguf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantize Llama 2 models using GGUF and llama.cpp
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

## Usage

* `MODEL_ID`: The ID of the model to quantize (e.g., `mlabonne/EvolCodeLlama-7b`).
* `QUANTIZATION_METHODS`: The list of quantization methods to use (e.g., `["q4_k_m", "q5_k_m"]`).

## Quantization methods

The names of the quantization methods follow the naming convention: "q" + the number of bits + the variant used (detailed below). Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by [TheBloke](https://huggingface.co/TheBloke/):

* `q2_k`: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
* `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_s`: Uses Q3_K for all tensors
* `q4_0`: Original quant method, 4-bit.
* `q4_1`: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.
* `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
* `q4_k_s`: Uses Q4_K for all tensors
* `q5_0`: Higher accuracy, higher resource usage and slower inference.
* `q5_1`: Even higher accuracy, resource usage and slower inference.
* `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
* `q5_k_s`: Uses Q5_K for all tensors
* `q6_k`: Uses Q8_K for all tensors
* `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

As a rule of thumb, **I recommend using Q5_K_M** as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.
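
The naming convention described above ("q" + number of bits + variant) can be checked mechanically. The following is a minimal sketch, not part of llama.cpp: the `QuantName` tuple and `parse_quant_method` helper are hypothetical names introduced here, and the regular expression only covers the methods listed in this notebook.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int               # number of bits, e.g. 4 for q4_k_m
    variant: str            # "0", "1", or "k"
    size: Optional[str]     # "s", "m", or "l" for k-quants, otherwise None


# "q" + bits + "_" + variant, with an optional "_s" / "_m" / "_l" suffix
_PATTERN = re.compile(r"^q(\d)_(0|1|k)(?:_(s|m|l))?$")


def parse_quant_method(name: str) -> QuantName:
    """Split a method name such as 'q5_k_m' into its components."""
    match = _PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized quantization method: {name!r}")
    bits, variant, size = match.groups()
    if size is not None and variant != "k":
        # Size suffixes only appear on k-quants in the list above
        raise ValueError(f"size suffix is only valid for k-quants: {name!r}")
    return QuantName(int(bits), variant, size)


for name in ["q4_0", "q5_k_m", "Q3_K_L", "q6_k"]:
    print(name, "->", parse_quant_method(name))
```

A check like this catches typos such as `q5_km` before the much slower quantization step starts.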

In [1]:
# Variables
MODEL_ID = "mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
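
These two variables determine every file name used later in the notebook. As a rough sketch, the mapping from method to output file can be previewed up front; `planned_outputs` is a hypothetical helper that only mirrors the f-string used in the quantization cell and does not touch the disk.

```python
def planned_outputs(model_id, methods):
    """Return {method: output path} following the notebook's naming scheme."""
    model_name = model_id.split("/")[-1]
    return {
        method: f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
        for method in methods
    }


for method, path in planned_outputs(
    "mlabonne/EvolCodeLlama-7b", ["q4_k_m", "q5_k_m"]
).items():
    print(f"{method:8s} -> {path}")
```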

In [2]:
# Remove any previous checkout; -f avoids an error when the directory does not exist yet
!rm -rf llama.cpp


In [3]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && git reset --hard 895407f31b358e3d9335e847d13f033491ec8a5b

Cloning into 'llama.cpp'...
remote: Enumerating objects: 18399, done.
remote: Counting objects: 100% (4629/4629), done.
remote: Compressing objects: 100% (231/231), done.
remote: Total 18399 (delta 4511), reused 4423 (delta 4398), pack-reused 13770
Receiving objects: 100% (18399/18399), 21.22 MiB | 17.61 MiB/s, done.
Resolving deltas: 100% (12868/12868), done.


In [4]:
!cd llama.cpp && make clean && LLAMA_CUBLAS=1 make
#!pip install -r llama.cpp/requirements.txt

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS: -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.a *.dll benchmark-matmult common/build-info.cpp *.dot *.gcno te

In [5]:
# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

Git LFS initialized.
Cloning into 'EvolCodeLlama-7b'...
remote: Enumerating objects: 35, done.
remote: Total 35 (delta 0), reused 0 (delta 0), pack-reused 35
Unpacking objects: 100% (35/35), 483.38 KiB | 6.36 MiB/s, done.
Filtering content: 100% (5/5), 4.70 GiB | 10.66 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
	pytorch_model-00001-of-00002.bin

See: `git lfs help smudge` for more details.


In [6]:
# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Loading model file EvolCodeLlama-7b/pytorch_model-00001-of-00002.bin
Loading model file EvolCodeLlama-7b/pytorch_model-00001-of-00002.bin
Loading model file EvolCodeLlama-7b/pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32016, n_embd=4096, n_layer=32, n_ctx=16384, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('EvolCodeLlama-7b'))
Found vocab files: {'tokenizer.model': PosixPath('EvolCodeLlama-7b/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': PosixPath('EvolCodeLlama-7b/tokenizer.json')}
Loading vocab file 'EvolCodeLlama-7b/tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32016 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0}, add special tokens unset>
Permuting layer

In [8]:
# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2135 (895407f3)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'EvolCodeLlama-7b/evolcodellama-7b.fp16.bin' to 'EvolCodeLlama-7b/evolcodellama-7b.Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:               
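
The hyperparameters printed by `convert.py` (`n_vocab=32016`, `n_embd=4096`, `n_layer=32`, `n_ff=11008`) are enough to estimate how large each file should be. The sketch below counts parameters for a Llama-style model and derives the effective bits per weight from the file sizes shown by the upload progress bars in the "Push to hub" section (4.08G and 4.78G). It rests on several assumptions: `n_head_kv == n_head` (true in the output above), separate input and output embedding matrices, "G" meaning 10^9 bytes, and negligible metadata overhead. The helper names are hypothetical.

```python
def llama_param_count(n_vocab, n_embd, n_layer, n_ff):
    """Approximate parameter count of a Llama-style model without grouped-query attention."""
    embeddings = 2 * n_vocab * n_embd      # token embedding + output projection
    attention = 4 * n_embd * n_embd        # wq, wk, wv, wo
    feed_forward = 3 * n_embd * n_ff       # w1, w2, w3
    norms = 2 * n_embd                     # attention norm + feed-forward norm
    final_norm = n_embd
    return embeddings + n_layer * (attention + feed_forward + norms) + final_norm


def effective_bits_per_weight(file_size_bytes, n_params):
    """Average number of bits stored per parameter in a file of the given size."""
    return file_size_bytes * 8 / n_params


n_params = llama_param_count(n_vocab=32016, n_embd=4096, n_layer=32, n_ff=11008)
print(f"parameters: {n_params:,}")
print(f"fp16 size:  {n_params * 2 / 1e9:.2f} GB")

# File sizes as reported by the upload progress bars (assumed to be in units of 10^9 bytes)
for name, size_gb in {"Q4_K_M": 4.08, "Q5_K_M": 4.78}.items():
    bpw = effective_bits_per_weight(size_gb * 1e9, n_params)
    print(f"{name}: {bpw:.2f} bits per weight")
```

Both results land above the nominal 4 and 5 bits, which is consistent with the mixed schemes in the list of methods (some tensors are stored in Q6_K) and with the per-block scale data that quantized formats carry alongside the weights.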

## Run inference

Here is a simple script to run your quantized models. The first cell keeps every layer on the CPU (`-ngl 0`) as a baseline; the second offloads every layer to the GPU (`-ngl 35`, which is enough for a 7B-parameter model) to speed up inference.
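
Outside a notebook, the same command can be assembled as an argument list and handed to `subprocess`. This is a minimal sketch using only the flags that appear in the cells below (`-m`, `-n`, `--color`, `-ngl`, `-p`); `build_main_command` is a hypothetical helper, and the binary path assumes llama.cpp was built in `./llama.cpp` as in this notebook.

```python
import shlex
from pathlib import Path


def build_main_command(model_dir, model_file, prompt, n_predict=128, n_gpu_layers=35):
    """Build the argument list for llama.cpp's `main` binary."""
    model_path = Path(model_dir) / model_file
    return [
        "./llama.cpp/main",
        "-m", model_path.as_posix(),   # path to the quantized GGUF file
        "-n", str(n_predict),          # number of tokens to generate
        "--color",
        "-ngl", str(n_gpu_layers),     # layers to offload to the GPU (0 = CPU only)
        "-p", prompt,
    ]


cmd = build_main_command(
    "EvolCodeLlama-7b", "evolcodellama-7b.Q5_K_M.gguf", "generate fibonacci in python"
)
print(shlex.join(cmd))

# To actually run it (requires the compiled binary and the model file):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list instead of a formatted shell string means a prompt containing quotes or `$` is not reinterpreted by the shell.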

In [11]:
import os

# List the quantized models produced by the previous step
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]

prompt = "generate fibonacci in python"  # input("Enter your prompt: ")
chosen_method = "evolcodellama-7b.Q5_K_M.gguf"  # input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # Use the chosen file, not the leftover `method` variable from the quantization loop
    qtype = f"{MODEL_NAME}/{chosen_method}"
    # -ngl 0 keeps all layers on the CPU
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 0 -p "{prompt}"

Enter your prompt: generate fibonacci sequence in Python
Name of the model (options: evolcodellama-7b.Q5_K_M.gguf, evolcodellama-7b.Q4_K_M.gguf): evolcodellama-7b.Q5_K_M.gguf
Log start
main: build = 2135 (895407f3)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1708003509
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       

In [12]:
prompt = "generate fibonacci in python"  # input("Enter your prompt: ")
chosen_method = "evolcodellama-7b.Q5_K_M.gguf"  # input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # Use the chosen file, not the leftover `method` variable from the quantization loop
    qtype = f"{MODEL_NAME}/{chosen_method}"
    # -ngl 35 offloads every layer to the GPU
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Log start
main: build = 2135 (895407f3)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1708003634
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:   

## Push to hub

To push your model to the hub, you'll need to add your Hugging Face token (https://huggingface.co/settings/tokens) as `HF_TOKEN` in Google Colab's "Secrets" tab. The following code creates a new repo named after the model with the "-GGUF1" suffix and uploads the GGUF files to it. Don't forget to change the `username` variable.
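
Before uploading several gigabytes, it can help to preview which files a pattern such as `*.gguf` would select. The sketch below is a local approximation using Python's `fnmatch`; it is an assumption that this matches the filtering `huggingface_hub` applies, so treat it as a sanity check only. `preview_upload` is a hypothetical helper.

```python
import fnmatch
import os


def preview_upload(folder_path, pattern="*.gguf"):
    """Return sorted (relative path, size in bytes) pairs for files matching the pattern."""
    matches = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            full_path = os.path.join(root, name)
            # Use forward slashes so the relative path looks like a repo path
            rel_path = os.path.relpath(full_path, folder_path).replace(os.sep, "/")
            if fnmatch.fnmatch(rel_path, pattern):
                matches.append((rel_path, os.path.getsize(full_path)))
    return sorted(matches)


# "EvolCodeLlama-7b" is the model folder used in this notebook
if os.path.isdir("EvolCodeLlama-7b"):
    for rel_path, size in preview_upload("EvolCodeLlama-7b"):
        print(f"{rel_path}: {size / 1e9:.2f} GB")
```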

In [9]:
!pip install -q huggingface_hub

In [10]:
from huggingface_hub import HfApi
from google.colab import userdata

username = "shahzebnaveed"
repo_id = f"{username}/{MODEL_NAME}-GGUF1"

# Token defined in the "Secrets" tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create an empty repo (no-op if it already exists).
# Calling it through `api` reuses the token set above.
api.create_repo(
    repo_id=repo_id,
    repo_type="model",
    exist_ok=True,
)

# Upload only the GGUF files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=repo_id,
    allow_patterns="*.gguf",
)

evolcodellama-7b.Q5_K_M.gguf:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

evolcodellama-7b.Q4_K_M.gguf:   0%|          | 0.00/4.08G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/shahzebnaveed/EvolCodeLlama-7b-GGUF1/commit/bcf96566d24037d61883969ca924a876b75eb209', commit_message='Upload folder using huggingface_hub', commit_description='', oid='bcf96566d24037d61883969ca924a876b75eb209', pr_url=None, pr_revision=None, pr_num=None)