<a href="https://colab.research.google.com/github/wenqiglantz/llmops/blob/main/Quantize_Mistral_7B_Instruct_v0_2_using_GGUF_and_llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantize Mistral-7B-Instruct-v0.2 using GGUF and llama.cpp

This notebook demonstrates how to quantize `Mistral-7B-Instruct-v0.2` using GGUF and llama.cpp.

* `MODEL_ID`: `mistralai/Mistral-7B-Instruct-v0.2`
* `QUANTIZATION_METHODS`: The quantization methods to apply.
    - Q5_K_M: 5-bit, recommended, low quality loss.
    - Q4_K_M: 4-bit, recommended, balances quality and file size.
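
The "4-bit" and "5-bit" labels are nominal. As a rough sanity check, the sketch below estimates the effective bits per weight from the file sizes that the upload step at the end of this notebook reports (4.37G and 5.13G). Two assumptions are mine and not from the notebook: a parameter count of about 7.24 billion for Mistral-7B, and that the progress bars report decimal gigabytes. The helper name `bits_per_weight` is hypothetical.

```python
# Rough estimate of effective bits per weight from a GGUF file size.
# ASSUMPTION: ~7.24e9 parameters for Mistral-7B (not stated in this notebook).
ASSUMED_PARAM_COUNT = 7.24e9


def bits_per_weight(file_size_bytes: float,
                    param_count: float = ASSUMED_PARAM_COUNT) -> float:
    """Return the average number of bits stored per model parameter."""
    return file_size_bytes * 8 / param_count


# File sizes as shown by the upload progress bars at the end of this notebook.
# ASSUMPTION: "G" means decimal gigabytes (1e9 bytes).
reported_sizes = {"Q4_K_M": 4.37e9, "Q5_K_M": 5.13e9}

for name, size in reported_sizes.items():
    print(f"{name}: ~{bits_per_weight(size):.2f} bits per weight")
```

The estimate also counts metadata and any tensors stored at a different precision, so treat it as an upper-level approximation rather than an exact format specification.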


A big shout-out to Maxime Labonne for his great blog post on quantizing Llama models: https://medium.com/towards-data-science/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172.

## Quantize model

### Log into Hugging Face

Since we will download the base model `mistralai/Mistral-7B-Instruct-v0.2` from the Hugging Face Hub and upload our quantized models back to it, let's log into Hugging Face first. I store my Hugging Face token in the Colab secrets tab on the left. This keeps the token out of the notebook, and the same secret can be reused across all my Colab notebooks.

In [None]:
from google.colab import userdata
from huggingface_hub import HfApi

HF_TOKEN = userdata.get("HF_TOKEN")

api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

wenqiglantz
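
`google.colab.userdata` is only importable inside Colab. If you want the same cell to work elsewhere, one option is to fall back to an environment variable. The following is a minimal sketch; the helper name `get_hf_token` and the environment-variable fallback are my assumptions, not part of the original notebook.

```python
import os


def get_hf_token(secret_name: str = "HF_TOKEN") -> str:
    """Return the token from Colab secrets, else from an environment variable."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        token = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook: fall back to the environment (assumption).
        token = os.environ.get(secret_name)
    if not token:
        raise RuntimeError(f"No Hugging Face token found under '{secret_name}'")
    return token
```

The result can then be passed to `HfApi(token=get_hf_token())` exactly as in the cell above.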


### Install llama.cpp

We need llama.cpp to quantize our base model, so let's install it.

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 15421, done.
remote: Counting objects: 100% (5417/5417), done.
remote: Compressing objects: 100% (303/303), done.
remote: Total 15421 (delta 5284), reused 5133 (delta 5114), pack-reused 10004
Receiving objects: 100% (15421/15421), 17.96 MiB | 22.03 MiB/s, done.
Resolving deltas: 100% (10779/10779), done.
Already up to date.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn
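
Because the build log is long, a failed build is easy to miss. A small check like the sketch below can confirm that the files this notebook relies on later (`quantize`, `main`, and `convert.py` in the `llama.cpp` directory) exist before moving on. The helper name `missing_tools` is hypothetical.

```python
from pathlib import Path


def missing_tools(repo_dir: str = "llama.cpp",
                  tools: tuple = ("quantize", "main", "convert.py")) -> list:
    """Return the names of expected files that are not present in repo_dir."""
    root = Path(repo_dir)
    return [tool for tool in tools if not (root / tool).is_file()]


missing = missing_tools()
if missing:
    print("Missing from llama.cpp:", ", ".join(missing))
else:
    print("All expected llama.cpp files are present.")
```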

### Download, convert, and quantize the base model

We first download the base model, then convert it to fp16, and finally quantize it into 5-bit and 4-bit models.

In [None]:
# Variables
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
QUANTIZATION_METHODS = ["q5_k_m", "q4_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
print(MODEL_NAME)

# Download model
!git lfs install
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{MODEL_ID}

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

Mistral-7B-Instruct-v0.2
Git LFS initialized.
Cloning into 'Mistral-7B-Instruct-v0.2'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 44 (delta 17), reused 0 (delta 0), pack-reused 2
Unpacking objects: 100% (44/44), 469.65 KiB | 1.05 MiB/s, done.
Filtering content: 100% (7/7), 3.46 GiB | 12.07 MiB/s, done.
Encountered 6 file(s) that may not have been copied correctly on Windows:
	pytorch_model-00002-of-00003.bin
	pytorch_model-00003-of-00003.bin
	pytorch_model-00001-of-00003.bin
	model-00003-of-00003.safetensors
	model-00001-of-00003.safetensors
	model-00002-of-00003.safetensors

See: `git lfs help smudge` for more details.
Loading model file Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors
Loading model file Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors
Loading model file Mistral-7B-Instruct-v0.2/model-00002-of-00003.safetensors
Loading model file Mis
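
The cell above relies on Colab's `!` shell magic. Outside a notebook, the same steps could be expressed as argument lists for `subprocess.run`. The sketch below only mirrors the file-naming scheme and the two command lines used above; the helper names (`gguf_path`, `convert_cmd`, `quantize_cmd`, `run`) are hypothetical.

```python
import subprocess


def gguf_path(model_name: str, method: str) -> str:
    """Output path for one quantization method, same naming as the cell above."""
    return f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"


def convert_cmd(model_name: str, fp16_path: str) -> list:
    """Arguments for the fp16 conversion step."""
    return ["python", "llama.cpp/convert.py", model_name,
            "--outtype", "f16", "--outfile", fp16_path]


def quantize_cmd(fp16_path: str, out_path: str, method: str) -> list:
    """Arguments for one quantization run."""
    return ["./llama.cpp/quantize", fp16_path, out_path, method]


def run(cmd: list) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


model_name = "Mistral-7B-Instruct-v0.2"
fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
commands = [convert_cmd(model_name, fp16)]
for method in ["q5_k_m", "q4_k_m"]:
    commands.append(quantize_cmd(fp16, gguf_path(model_name, method), method))

# Print the plan; call run(cmd) on each entry to actually execute it.
for cmd in commands:
    print(" ".join(cmd))
```

Using `check=True` makes a failed conversion stop the pipeline instead of silently continuing to the quantization step.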

### Run inference

Now that we have two quantized models, let's run an inference test by calling `llama.cpp/main`.

In [None]:
import os

model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]
print("Available models: " + ", ".join(model_list))

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # Build the path from the file name the user selected
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Available models: mistral-7b-instruct-v0.2.Q4_K_M.gguf, mistral-7b-instruct-v0.2.Q5_K_M.gguf
Enter your prompt: to infinity and
Name of the model (options: mistral-7b-instruct-v0.2.Q4_K_M.gguf, mistral-7b-instruct-v0.2.Q5_K_M.gguf): mistral-7b-instruct-v0.2.Q5_K_M.gguf
Log start
main: build = 1803 (36e5a08)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1704826355
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:       
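
Note that in the logged run above, `mistral-7b-instruct-v0.2.Q5_K_M.gguf` was entered at the prompt, yet the loader reports reading the `Q4_K_M` file. To make sure the selected file is the one that gets loaded, the path can be derived directly from the user's choice. A minimal sketch, with a hypothetical helper name `resolve_model_path`:

```python
def resolve_model_path(model_dir: str, chosen_file: str, available: list) -> str:
    """Return the path of the chosen GGUF file, or raise if it is not listed."""
    if chosen_file not in available:
        raise ValueError(f"Invalid name: {chosen_file!r}")
    # The path depends only on the selection, not on any earlier loop variable.
    return f"{model_dir}/{chosen_file}"


available = [
    "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
]
print(resolve_model_path("Mistral-7B-Instruct-v0.2", available[1], available))
```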

## Upload the quantized models to Hugging Face hub

Now, we are ready to push our quantized models to the Hugging Face hub to share with the community (and myself).

In [None]:
!pip install -q huggingface_hub
from huggingface_hub import HfApi
from google.colab import userdata

username = "wenqiglantz"

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create empty repo
api.create_repo(
    repo_id = f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

mistral-7b-instruct-v0.2.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

mistral-7b-instruct-v0.2.Q5_K_M.gguf:   0%|          | 0.00/5.13G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF/commit/932449460f4c5f8f3597910b4c033a74402cba82', commit_message='Upload folder using huggingface_hub', commit_description='', oid='932449460f4c5f8f3597910b4c033a74402cba82', pr_url=None, pr_revision=None, pr_num=None)
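
Before uploading several gigabytes, it can help to preview which files the `*.gguf` pattern will pick up. The sketch below approximates that filter locally with Python's `fnmatch` on the top-level files of the folder; it is an illustration, not the library's own matching code, and the helper names `files_matching` and `repo_id_for` are hypothetical.

```python
import fnmatch
import os


def files_matching(folder: str, pattern: str = "*.gguf") -> list:
    """List top-level file names in folder that match a glob-style pattern."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and fnmatch.fnmatch(name, pattern)
    )


def repo_id_for(username: str, model_name: str) -> str:
    """Repository id in the same '<user>/<model>-GGUF' form used above."""
    return f"{username}/{model_name}-GGUF"


print(repo_id_for("wenqiglantz", "Mistral-7B-Instruct-v0.2"))
```

Calling `files_matching(MODEL_NAME)` before `api.upload_folder(...)` should list only the two quantized files, leaving the original weights and the fp16 file out of the upload.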