# Convert HuggingFace models to GGUF format

This notebook converts HuggingFace models to the GGUF format supported
by llama.cpp. It can also download a model directly from HuggingFace
when the `download_model_id` param is set in Substratus.

Load the params provided by Substratus.

In [12]:
import json
from pathlib import Path

params = {}
params_path = Path("/content/params.json")
if params_path.is_file():
    with params_path.open("r", encoding="UTF-8") as params_file:
        params = json.load(params_file)
# `name` is required: it is used to build the output file names below
if "name" not in params:
    raise ValueError("Missing required param `name`")

name = params["name"]

output_path = params.get("output_path", "/content/model")
params

{'name': 'test',
 'download_model_id': 'lmsys/vicuna-13b-v1.5',
 'quantize': 'Q2_K, Q4_K_M'}
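
The param handling above can be sketched as two small functions, which makes the defaults easier to exercise outside the notebook. This is a minimal sketch, not part of Substratus: `load_params` and `validate_params` are hypothetical helper names, and the sample file is written to a temporary directory rather than `/content/params.json`.

```python
import json
import tempfile
from pathlib import Path


def load_params(params_path):
    """Load params from a JSON file, returning {} if the file is absent."""
    path = Path(params_path)
    if not path.is_file():
        return {}
    with path.open("r", encoding="UTF-8") as params_file:
        return json.load(params_file)


def validate_params(params):
    """Return (name, output_path); raise if the required `name` is missing."""
    if "name" not in params:
        raise ValueError("Missing required param `name`")
    return params["name"], params.get("output_path", "/content/model")


# Hypothetical example: a sample params file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "params.json"
    sample_path.write_text(
        json.dumps({"name": "test", "quantize": "Q2_K, Q4_K_M"}),
        encoding="UTF-8",
    )
    sample_params = load_params(sample_path)
    missing_params = load_params(Path(tmp_dir) / "absent.json")

sample_name, sample_output_path = validate_params(sample_params)
print(sample_name, sample_output_path)
```

Because `output_path` is not set in the sample file, the sketch falls back to the same `/content/model` default the notebook uses.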

Download the model from HuggingFace if the `download_model_id` param is set. Otherwise,
the notebook expects a HuggingFace model to be present at `/content/saved-model`.

In [2]:
from huggingface_hub import snapshot_download

model_path = "/content/saved-model"

model_id = params.get("download_model_id")
if model_id:
    model_path = "/content/downloaded-model"
    snapshot_download(repo_id=model_id, local_dir=model_path,
                      local_dir_use_symlinks=False, revision="main")
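
Before converting, it can help to confirm that the chosen directory actually holds a model. The following is a minimal sketch under one stated assumption: a HuggingFace model directory is recognized by the presence of a `config.json` file. Both `resolve_model_path` and `looks_like_hf_model` are hypothetical helpers, not part of `huggingface_hub`.

```python
import tempfile
from pathlib import Path


def resolve_model_path(params):
    """Pick the directory the model is expected in, mirroring the cell above."""
    if params.get("download_model_id"):
        return "/content/downloaded-model"
    return "/content/saved-model"


def looks_like_hf_model(model_dir):
    """Heuristic (an assumption): a model directory contains a config.json."""
    return (Path(model_dir) / "config.json").is_file()


# Hypothetical check: a temporary directory stands in for the model folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_dir_ok = looks_like_hf_model(tmp_dir)
    (Path(tmp_dir) / "config.json").write_text("{}", encoding="UTF-8")
    populated_dir_ok = looks_like_hf_model(tmp_dir)

print(resolve_model_path({"download_model_id": "lmsys/vicuna-13b-v1.5"}))
print(empty_dir_ok, populated_dir_ok)
```

Failing early on a missing `config.json` gives a clearer error than letting the conversion step fail partway through.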

Convert the model to 16-bit GGUF (f16) so it can then be quantized with the `llama.cpp/examples/quantize` tool.

In [None]:
import os
# Workaround: export the paths as environment variables; otherwise the python3 command below does not work
os.environ["MODEL_PATH"] = model_path
os.environ["OUTFILE"] = f"{output_path}/{name}-f16.gguf"

! mkdir -p {output_path}
! ls -lash {model_path}
! python3 /content/llama.cpp/convert.py \
  --outfile $OUTFILE \
  --outtype f16 $MODEL_PATH

! ls -lash {output_path}
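
As an alternative to the environment-variable workaround, the same conversion command could be assembled as an argument list and run with `subprocess`. This is only a sketch: `build_convert_command` is a hypothetical helper, and it assumes the same `convert.py` location and the `--outfile` and `--outtype` flags used in the cell above.

```python
import shlex
from pathlib import PurePosixPath


def build_convert_command(model_path, output_path, name,
                          convert_script="/content/llama.cpp/convert.py"):
    """Return (argv, outfile) for the f16 conversion step."""
    outfile = str(PurePosixPath(output_path) / f"{name}-f16.gguf")
    argv = [
        "python3", convert_script,
        "--outfile", outfile,
        "--outtype", "f16",
        str(model_path),
    ]
    return argv, outfile


argv, outfile = build_convert_command("/content/saved-model", "/content/model", "test")
# Print the command in shell-quoted form for inspection.
print(shlex.join(argv))
# To actually run the conversion: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell interpolation entirely, so paths containing spaces need no extra quoting.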

Optionally create additional quantized models

In [9]:
! quantize -h

usage: quantize [--help] [--allow-requantize] [--leave-output-tensor] model-f32.gguf [model-quant.gguf] type [nthreads]

  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q
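
The `quantize` param in this notebook is a comma-separated string of the type names from the help output, for example `Q2_K, Q4_K_M`. A minimal sketch of turning that string into one `quantize` invocation per type follows. The helper names `parse_quantize_param` and `build_quantize_commands` are hypothetical, and the argument order (input file, output file, type) is taken from the usage line above.

```python
from pathlib import PurePosixPath


def parse_quantize_param(value):
    """Split a comma-separated string into type names, dropping empty entries."""
    if not value:
        return []
    return [q.strip() for q in value.split(",") if q.strip()]


def build_quantize_commands(f16_file, output_path, name, quantize_types):
    """Return one argv list per quantization type."""
    commands = []
    for quantize_type in quantize_types:
        target = str(PurePosixPath(output_path) / f"{name}-{quantize_type}.gguf")
        commands.append(["quantize", f16_file, target, quantize_type])
    return commands


quantize_types = parse_quantize_param("Q2_K, Q4_K_M")
commands = build_quantize_commands(
    "/content/model/test-f16.gguf", "/content/model", "test", quantize_types
)
for command in commands:
    print(" ".join(command))
# To actually run each one: subprocess.run(command, check=True)
```

The sketch does not check the type names against the list of allowed types; an unknown name would only be rejected by `quantize` itself.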

In [None]:
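# `quantize` is a comma-separated list of quantization types, e.g. "Q2_K, Q4_K_M"
quantize = params.get("quantize")
if quantize:
    quantize = [q.strip() for q in quantize.split(",")]
    for quantize_type in quantize:
        filename = f"{output_path}/{name}-{quantize_type}.gguf"
        # Export the values so the shell command below can read them
        os.environ["filename"] = filename
        os.environ["quantize_type"] = quantize_type
        print(f"Running {quantize_type} quantization and writing to {filename}")
        # $OUTFILE is the f16 GGUF file written by the conversion cell
        ! quantize $OUTFILE $filename $quantize_type
    ! ls -lash {output_path}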

main: build = 1154 (3358c38)
main: quantizing '/content/model/test-f16.gguf' to '/content/model/test-Q2_K.gguf' as Q2_K
llama_model_loader: loaded meta data with 18 key-value pairs and 363 tensors from /content/model/test-f16.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight f16      [  5120, 13824,  