# Llama.cpp Quantization Walkthrough


We will quantize the Llama 3.1 8B model to 4 bits using the Q4_K_M scheme.

Open this notebook in Google Colab for the best experience. If possible, connect the runtime to a GPU.
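
As a rough sanity check on what 4-bit quantization buys, the sketch below estimates file size from the parameter count and the average bits per weight. The 8.0e9 parameter count is the rounded `general.size_label` reported later in this notebook. The 4.9 bits-per-weight figure for Q4_K_M is an assumption back-computed from the roughly 4.92 GB file uploaded later, not an official constant.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8.0e9  # rounded value from general.size_label = 8.0B

fp16_gb = estimated_size_gb(N_PARAMS, 16)   # 16 bits per weight
q4_gb = estimated_size_gb(N_PARAMS, 4.9)    # assumed average for Q4_K_M

print(f"FP16   ~ {fp16_gb:.1f} GB")
print(f"Q4_K_M ~ {q4_gb:.1f} GB")
```

The estimate ignores metadata and the fact that some tensors are kept at higher precision, so treat it as a ballpark only.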

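## Download Model from Hugging Face

You'll need a Hugging Face (HF) access token, and your account must have been granted access to the gated `meta-llama/Meta-Llama-3.1-8B` repository.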

In [6]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from huggingface_hub import snapshot_download

model_name = "meta-llama/Meta-Llama-3.1-8B"
base_model = "./original_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, ignore_patterns=["*.pth"])
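
Before converting, it can help to confirm that the download actually produced the files the converter reads. The following is a minimal sketch: the helper name `missing_model_files` and the list of required files are assumptions for illustration (the index file name is taken from the conversion log further down), so adjust them for other models.

```python
from pathlib import Path

# Assumed minimum set of files the converter needs; adjust for other models.
REQUIRED_FILES = ("config.json", "model.safetensors.index.json")

def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of required files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one weight shard must be present as well.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing

print(missing_model_files("./original_model/"))
```

An empty list means every expected file was found.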

## Clone llama.cpp Repository

In [3]:
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...

remote: Enumerating objects: 32321, done.

remote: Counting objects: 100% (6837/6837), done.

remote: Compressing objects: 100% (466/466), done.

remote: Total 32321 (delta 6618), reused 6439 (delta 6362), pack-reused 25484 (from 1)

Receiving objects: 100% (32321/32321), 54.48 MiB | 14.67 MiB/s, done.

Resolving deltas: 100% (23257/23257), done.


In [4]:
!mkdir -p models

## Convert model to GGUF format

In [5]:
!python llama.cpp/convert_hf_to_gguf.py ./original_model/ --outfile models/llama_3.1_FP16.gguf

INFO:hf-to-gguf:Loading model: original_model

INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only

INFO:hf-to-gguf:Exporting model...

INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'

INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'

INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}

INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}

INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}

INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}

INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}

INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}

INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}

INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloa
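
The visible part of the log shows a pattern: two-dimensional weight matrices are written as F16, while one-dimensional normalization weights stay in F32. The sketch below tallies this from captured log text. The regular expression is an assumption fitted to the log lines shown above and may need adjusting for other converter versions.

```python
import re
from collections import Counter

# Matches lines such as:
# INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
LINE_RE = re.compile(
    r"INFO:hf-to-gguf:(\S+),\s+torch\.(\w+) --> (\w+), shape = \{([^}]*)\}"
)

def tally_output_types(log_text: str) -> Counter:
    """Count tensors per (number of dimensions, output type)."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        _name, _src, dst, shape = match.groups()
        n_dims = len(shape.split(","))
        counts[(n_dims, dst)] += 1
    return counts

sample = """\
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
"""
print(tally_output_types(sample))
```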

## Build llama.cpp and quantize the Model

In [6]:
!mkdir -p llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release

-- The C compiler identification is GNU 11.4.0

-- The CXX compiler identification is GNU 11.4.0

-- Detecting C compiler ABI info

-- Detecting C compiler ABI info - done

-- Check for working C compiler: /usr/bin/cc - skipped

-- Detecting C compile features

-- Detecting C compile features - done

-- Detecting CXX compiler ABI info

-- Detecting CXX compiler ABI info - done

-- Check for working CXX compiler: /usr/bin/c++ - skipped

-- Detecting CXX compile features

-- Detecting CXX compile features - done

-- Found Git: /usr/bin/git (found version "2.34.1")

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success

-- Found Threads: TRUE

-- Found OpenMP_C: -fopenmp (found version "4.5")

-- Found OpenMP_CXX: -fopenmp (found version "4.5")

-- Found OpenMP: TRUE (found version "4.5")

-- OpenMP found

-- Using llamafile


-- CMAKE_SYSTEM_PROCESSOR: x86_64

-- x86 detected

-- Configuring done (1.6s)

-- Generating done (0.2s)

-- Build files

In [7]:
!cd llama.cpp/build/bin && ./llama-quantize /content/models/llama_3.1_FP16.gguf /content/models/llama_3.1-Q4_K_M.gguf q4_K_M

main: build = 3602 (554b0490)

main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

main: quantizing '/content/models/llama_3.1_FP16.gguf' to '/content/models/llama_3.1-Q4_K_M.gguf' as Q4_K_M

llama_model_loader: loaded meta data with 26 key-value pairs and 292 tensors from /content/models/llama_3.1_FP16.gguf (version GGUF V3 (latest))

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.

llama_model_loader: - kv   0:                       general.architecture str              = llama

llama_model_loader: - kv   1:                               general.type str              = model

llama_model_loader: - kv   2:                               general.name str              = Original_Model

llama_model_loader: - kv   3:                         general.size_label str              = 8.0B

llama_model_loader: - kv   4:                            general.license str              = llama3.1

llama_model_loader: - kv   5:
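
The loader line above reports the GGUF version together with the number of key-value pairs and tensors. These values live in the fixed-size header at the start of the file, so they can be read without loading the model. This is a minimal sketch assuming the little-endian GGUF v3 header layout (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` key-value count); the function name `read_gguf_header` is made up for this example.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor and KV counts."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:])
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}

# Usage (once the file exists):
# read_gguf_header("/content/models/llama_3.1_FP16.gguf")
```

For the FP16 file produced above, the result should agree with the counts printed by `llama_model_loader`.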

## Run Inference with the Quantized Model

In [8]:
!pip install llama-cpp-python==0.2.85

Collecting llama-cpp-python==0.2.85

  Downloading llama_cpp_python-0.2.85.tar.gz (49.3 MB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.3/49.3 MB 43.2 MB/s eta 0:00:00

  Installing build dependencies ... done

  Getting requirements to build wheel ... done

  Installing backend dependencies ... done

  Preparing metadata (pyproject.toml) ... done

Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.85)

  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)

Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.7 MB/s eta 0:00:00

Building wheels for collected packages: llama-cpp-python

  Building wheel for llama-cpp-python (pyproject.toml) ... done

  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.85-cp310-cp310-linux_x86_64.whl size=

In [21]:
from llama_cpp import Llama

In [22]:
model_path = "/content/models/llama_3.1-Q4_K_M.gguf"

In [None]:
llm = Llama(model_path=model_path)

In [None]:
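generation_kwargs = {
    "max_tokens": 300,  # upper bound on the number of generated tokens
    "echo": False,      # do not repeat the prompt in the output
    "top_k": 1,         # keep only the most likely token (greedy decoding)
}

prompt = "Which country hosted 2018 fifa world cup?"
res = llm(prompt, **generation_kwargs)
res["choices"][0]["text"]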
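
Setting `top_k` to 1 restricts sampling to the single most likely token, which makes decoding effectively greedy and repeatable. The toy sketch below illustrates the idea in plain Python. It is not llama.cpp's sampler, and the function name `sample_top_k` and the logits are made up for this example.

```python
import math
import random

def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability).
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.5, 1.7, -0.3]  # toy scores for a 4-token vocabulary
rng = random.Random(0)
print([sample_top_k(logits, 1, rng) for _ in range(5)])  # always index 1
print([sample_top_k(logits, 3, rng) for _ in range(5)])  # can vary between draws
```

With `k=1` only one candidate survives the cut, so the random draw has no effect on the result.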

## Save model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!mkdir -p "/content/drive/My Drive/llama_models"

In [None]:
import shutil

source_file_path = '/content/models/llama_3.1-Q4_K_M.gguf'
destination_file_path = '/content/drive/My Drive/llama_models/llama_3.1-Q4_K_M.gguf'

shutil.copy(source_file_path, destination_file_path)
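
Copies of multi-gigabyte files to a mounted drive can occasionally end up incomplete, so comparing checksums afterwards is a cheap safeguard. This is a minimal sketch; the helper names `sha256_of` and `copies_match` are made up for this example.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models are not loaded into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(src: str, dst: str) -> bool:
    """Return True when both files have identical contents."""
    return sha256_of(src) == sha256_of(dst)

# Usage:
# copies_match(source_file_path, destination_file_path)
```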

## Upload model to Hugging Face Hub

In [13]:
from huggingface_hub import login
login('<hf_access_token_here>')

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.

Token is valid (permission: fineGrained).

Your token has been saved to /root/.cache/huggingface/token

Login successful
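
Pasting a token directly into a cell makes it easy to leak when the notebook is shared. One alternative is to read it from the environment. The sketch below assumes the token has been stored in an environment variable named `HF_TOKEN`; the helper name `get_hf_token` is made up for this example.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Fetch the access token from the environment instead of the notebook source."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token

# Usage (requires the variable to be set beforehand):
# from huggingface_hub import login
# login(get_hf_token())
```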


In [14]:
from huggingface_hub import HfApi
api = HfApi()

model_id = "hf_profile/llama3.1-Q4_K_M-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj='/content/models/llama_3.1-Q4_K_M.gguf',
    path_in_repo="llama3.1-Q4_K_M.gguf",
    repo_id=model_id,
)

llama_3.1-Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wzebrowski/llama3.1-Q4_K_M-gguf/commit/c7278967dcef792b0b170754f0afd13eaa62157a', commit_message='Upload llama3.1-Q4_K_M.gguf with huggingface_hub', commit_description='', oid='c7278967dcef792b0b170754f0afd13eaa62157a', pr_url=None, pr_revision=None, pr_num=None)

## Run Inference on the GPU

In [None]:
from huggingface_hub import snapshot_download

model_name = "hf_profile/llama3.1-Q4_K_M-gguf"
base_model = "./quantized_models/"
snapshot_download(repo_id=model_name, local_dir=base_model)

In [16]:
!nvidia-smi

Sun Aug 18 16:20:11 2024       

+---------------------------------------------------------------------------------------+

| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |

|-----------------------------------------+----------------------+----------------------+

| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |

| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |

|                                         |                      |               MIG M. |


|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |

| N/A   31C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |

|                                         |                      |             Disabled |

+-----------------------------------------+----------------------+----------------------+

                                                        

In [None]:
!pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122

Collecting llama-cpp-python

  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.87-cu122/llama_cpp_python-0.2.87-cp310-cp310-linux_x86_64.whl (394.5 MB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 394.5/394.5 MB 3.8 MB/s eta 0:00:00

Collecting diskcache>=5.6.1 (from llama-cpp-python)

  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)

Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 2.6 MB/s eta 0:00:00

Installing collected packages: diskcache, llama-cpp-python

Successfully installed diskcache-5.6.3 llama-cpp-python-0.2.87


In [None]:
from llama_cpp import Llama
model_path = "./quantized_models/llama3.1-Q4_K_M.gguf"

# n_gpu_layers=-1 offloads all model layers to the GPU
model = Llama(model_path=model_path, n_gpu_layers=-1)
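
If the same notebook may also run on a CPU-only runtime, the offload setting can be chosen at run time. The sketch below uses the presence of `nvidia-smi` on the `PATH` as a heuristic, which is an assumption and not a guarantee that a usable GPU exists. A CUDA-enabled build of `llama-cpp-python` is still required for offloading to take effect, and the helper name `pick_n_gpu_layers` is made up for this example.

```python
import shutil

def pick_n_gpu_layers() -> int:
    """Return -1 (offload every layer) if nvidia-smi is on PATH, else 0 (CPU only)."""
    return -1 if shutil.which("nvidia-smi") else 0

n_gpu_layers = pick_n_gpu_layers()
print(f"n_gpu_layers = {n_gpu_layers}")

# Usage:
# model = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers)
```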

In [None]:
generation_kwargs = {
    "max_tokens":200,
    "echo":False,
    "top_k":1
}

prompt = "Which country hosted 2018 fifa world cup?"
res = model(prompt, **generation_kwargs)
res

Llama.generate: prefix-match hit



llama_print_timings:        load time =     545.44 ms

llama_print_timings:      sample time =      16.89 ms /   200 runs   (    0.08 ms per token, 11842.03 tokens per second)

llama_print_timings: prompt eval time =      67.41 ms /     9 tokens (    7.49 ms per token,   133.52 tokens per second)

llama_print_timings:        eval time =    5408.86 ms /   199 runs   (   27.18 ms per token,    36.79 tokens per second)

llama_print_timings:       total time =    5691.39 ms /   208 tokens


{'id': 'cmpl-94cf14c4-8b6e-4861-a2c8-0eec891a92c2',
 'object': 'text_completion',
 'created': 1723116609,
 'model': './quantized_models/llama3.1-Q4_K_M.gguf',
 'choices': [{'text': " The 2018 fifa world cup was the 21st fifa world cup, an international football tournament contested by the men's national teams of the member associations of fifa. It took place in russia from 14 june to 15 july 2018. It was the first world cup to be held in eastern europe, and the 11th time that it had been held in europe. At an estimated cost of over $14.2 billion, it. The 2018 fifa world cup was the 21st fifa world cup, an international football tournament contested by the men's national teams of the member associations of fifa. It took place in russia from 14 june to 15 july 2018. It was the first world cup to be held in eastern europe, and the 11th time that it had been held in europe. At an estimated cost of over $14.2 billion, it. The 2018",
   'index': 0,
   'logprobs': None,
   'finish_reason': 'l
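
The timing lines printed above can be turned into a throughput figure for comparing CPU and GPU runs. This is a minimal sketch; the regular expression is an assumption fitted to the `llama_print_timings` format shown in this output and may not match other versions.

```python
import re

# Matches e.g. "llama_print_timings:        eval time =    5408.86 ms /   199 runs"
TIMING_RE = re.compile(
    r"^llama_print_timings:\s+eval time =\s*([\d.]+) ms /\s*(\d+) runs", re.M
)

def generation_speed(log_text: str) -> float:
    """Return generated tokens per second from the 'eval time' line."""
    match = TIMING_RE.search(log_text)
    if match is None:
        raise ValueError("no 'eval time' line found")
    millis, runs = float(match.group(1)), int(match.group(2))
    return runs / (millis / 1000.0)

log = """\
llama_print_timings: prompt eval time =      67.41 ms /     9 tokens (    7.49 ms per token,   133.52 tokens per second)
llama_print_timings:        eval time =    5408.86 ms /   199 runs   (   27.18 ms per token,    36.79 tokens per second)
"""
print(f"{generation_speed(log):.2f} tokens/s")  # 36.79, as reported in the log
```

The pattern deliberately skips the `prompt eval time` line, which measures prompt processing and not generation.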

In [None]:
# stop=["\n"] ends generation at the first newline, so the reply is at most one paragraph
output = model("Provide information about World War 2 in 1000 words.", max_tokens=2048, stop=["\n"], echo=False)

In [None]:
print(output['choices'][0]['text'])