# Convert HuggingFace models to GGUF format

This notebook converts HuggingFace models to the GGUF format supported
by llama.cpp. It can also download a model directly from HuggingFace
when the `download_model_id` param is set in Substratus.

Load the params provided by Substratus.

In [1]:
import json
from pathlib import Path

params = {}
params_path = Path("/content/params.json")
if params_path.is_file():
    with params_path.open("r", encoding="UTF-8") as params_file:
        params = json.load(params_file)
if 'name' not in params:
    raise Exception("Missing required param `name`")

name = params["name"]

output_path = params.get("output_path", "/content/model")
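
For reference, a `params.json` for this notebook might look like the sketch below. Only `name` is required by the cell above; `output_path`, `download_model_id`, `push_to_hub` and `quantize` are the optional keys read elsewhere in this notebook. The values and the `load_params` helper are hypothetical and only illustrate the loading logic against a temporary file.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example values; only the key names come from this notebook.
example_params = {
    "name": "my-model",
    "download_model_id": "some-org/some-model",
    "quantize": "Q4_K_M, Q5_K_S",
}

def load_params(path):
    """Return the params dict stored at `path`, or {} if the file is absent."""
    path = Path(path)
    if not path.is_file():
        return {}
    with path.open("r", encoding="UTF-8") as params_file:
        return json.load(params_file)

with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "params.json"
    params_file.write_text(json.dumps(example_params), encoding="UTF-8")
    loaded = load_params(params_file)
    missing = load_params(Path(tmp) / "does-not-exist.json")

print(loaded["name"])
# `output_path` is not set above, so the notebook default applies.
print(loaded.get("output_path", "/content/model"))
```

A missing file yields an empty dict, which is why the notebook checks for `name` explicitly before continuing.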

Download the model from HuggingFace if the `download_model_id` param is set. Otherwise,
the notebook expects a HuggingFace model to be present at `/content/saved-model`.

In [2]:
from huggingface_hub import snapshot_download

model_path = "/content/saved-model"

download_model_id = params.get("download_model_id")
if download_model_id:
    model_path = "/content/downloaded-model"
    snapshot_download(repo_id=download_model_id, local_dir=model_path,
                      local_dir_use_symlinks=False, revision="main")

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Downloading (…)797effff84/README.md:   0%|          | 0.00/3.35k [00:00<?, ?B/s]

Downloading (…)fff84/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Downloading (…)7effff84/config.json:   0%|          | 0.00/631 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading (…)fff84/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Convert the model to 16-bit GGUF (f16) so it can be processed further with the `llama.cpp/examples/quantize` tool.

In [3]:
import os
# Workaround: pass the paths as environment variables so the shell can read
# them as $MODEL_PATH and $OUTFILE; otherwise the python3 command won't work
os.environ["MODEL_PATH"] = model_path
outfile = f"{output_path}/{name}-f16.gguf"
os.environ["OUTFILE"] = outfile

! mkdir -p {output_path}
! ls -lash {model_path}
! python3 /content/llama.cpp/convert.py \
  --outfile $OUTFILE \
  --outtype f16 $MODEL_PATH

! ls -lash {output_path}

total 13G
4.0K drwxr-xr-x 2 root root 4.0K Sep  4 03:43 .
8.0K drwxr-xr-x 1 root root 4.0K Sep  4 03:43 ..
4.0K -rw-r--r-- 1 root root 1.5K Sep  4 03:40 .gitattributes
4.0K -rw-r--r-- 1 root root 3.3K Sep  4 03:40 README.md
4.0K -rw-r--r-- 1 root root  631 Sep  4 03:40 config.json
9.3G -rw-r--r-- 1 root root 9.3G Sep  4 03:43 pytorch_model-00001-of-00002.bin
3.3G -rw-r--r-- 1 root root 3.3G Sep  4 03:41 pytorch_model-00002-of-00002.bin
 24K -rw-r--r-- 1 root root  24K Sep  4 03:40 pytorch_model.bin.index.json
4.0K -rw-r--r-- 1 root root  438 Sep  4 03:40 special_tokens_map.json
1.8M -rw-r--r-- 1 root root 1.8M Sep  4 03:40 tokenizer.json
492K -rw-r--r-- 1 root root 489K Sep  4 03:40 tokenizer.model
4.0K -rw-r--r-- 1 root root  762 Sep  4 03:40 tokenizer_config.json
Loading model file /content/downloaded-model/pytorch_model-00001-of-00002.bin
Loading model file /content/downloaded-model/pytorch_model-00001-of-00002.bin
Loading model file /content/downloaded-model/pytorch_model-00002-of-
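
The environment-variable workaround in the cell above is one way to hand Python values to a `!` shell command. An alternative sketch, not part of the original notebook, is to build the argument list in Python and run it with `subprocess.run`, which avoids shell quoting altogether. The helper name `build_convert_command` and the example arguments are hypothetical; the script path and the `--outfile` and `--outtype` flags are copied from the cell above. The sketch assumes POSIX-style paths.

```python
import subprocess
from pathlib import Path

def build_convert_command(model_path, output_path, name,
                          script="/content/llama.cpp/convert.py"):
    """Build the argv list for the f16 conversion used in the cell above."""
    outfile = str(Path(output_path) / f"{name}-f16.gguf")
    return ["python3", script, "--outfile", outfile,
            "--outtype", "f16", str(model_path)]

# Hypothetical inputs, for illustration only.
cmd = build_convert_command("/content/saved-model", "/content/model", "my-model")
print(cmd)

# To actually run the conversion (requires llama.cpp at the script path):
# Path("/content/model").mkdir(parents=True, exist_ok=True)
# subprocess.run(cmd, check=True)
```

With `check=True`, a failed conversion raises `CalledProcessError` instead of letting later cells continue with a missing output file.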

Upload the model if the `push_to_hub` param is set.

In [5]:
from huggingface_hub import HfApi
from pathlib import Path

push_to_hub = params.get("push_to_hub")
if push_to_hub:
    hf_api = HfApi()
    model_id = push_to_hub
    print(f"Creating HuggingFace repo {model_id}")
    hf_api.create_repo(model_id, exist_ok=True, repo_type="model")

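def push_to_huggingface(file):
    # Only call this when `push_to_hub` is set: `hf_api` and `model_id`
    # are defined in the branch above.
    hf_api.upload_file(
        path_or_fileobj=file,
        path_in_repo=Path(file).name,
        repo_id=model_id,
    )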

Creating HuggingFace repo substratusai/weaviate-gorilla-v4-random-split-gguf


In [6]:
if push_to_hub:
    push_to_huggingface(outfile)
    readme_path = Path(model_path) / "README.md"
    if readme_path.exists():
        push_to_huggingface(readme_path)

weaviate-gorilla-random-split-f16.gguf:   0%|          | 0.00/13.5G [00:00<?, ?B/s]

Optionally create additional quantized models

In [2]:
! quantize -h

usage: quantize [--help] [--allow-requantize] [--leave-output-tensor] model-f32.gguf [model-quant.gguf] type [nthreads]

  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q
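
The type names in the help output above are what the `quantize` param of this notebook expects, given as a comma-separated string. Before starting a long run it can help to flag names that look mistyped. The sketch below is a hypothetical helper, not part of the original notebook: `parse_quantize_param` is an invented name, and `KNOWN_TYPES` holds only the names visible in the truncated help output, so the tool itself accepts more types than this set. For that reason the helper reports unrecognized names rather than rejecting them.

```python
# Type names visible in the truncated `quantize -h` output above.
# Assumption: the real tool accepts additional types not listed here.
KNOWN_TYPES = {
    "Q4_0", "Q4_1", "Q5_0", "Q5_1", "Q2_K",
    "Q3_K", "Q3_K_S", "Q3_K_M", "Q3_K_L",
    "Q4_K", "Q4_K_S", "Q4_K_M", "Q5_K", "Q5_K_S",
}

def parse_quantize_param(value):
    """Split a comma-separated string into (known, unrecognized) type names."""
    names = [q.strip() for q in (value or "").split(",") if q.strip()]
    known = [q for q in names if q in KNOWN_TYPES]
    unrecognized = [q for q in names if q not in KNOWN_TYPES]
    return known, unrecognized

# Empty entries are dropped; the lowercase name is reported as unrecognized.
known, unrecognized = parse_quantize_param("Q4_K_M, Q5_K_S,, q4_0")
print(known)
print(unrecognized)
```

Returning both lists leaves the decision to the caller, for example printing a warning for unrecognized names and still passing them on to `quantize`.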

In [7]:
quantize = params.get("quantize")
if quantize:
    quantize = [q.strip() for q in quantize.split(",")]
    for quantize_type in quantize:
        filename = f"{output_path}/{name}-{quantize_type}.gguf"
        os.environ["filename"] = filename
        os.environ["quantize_type"] = quantize_type
        print(f"Running {quantize_type} quantization and writing to {filename}")
        ! quantize $OUTFILE $filename $quantize_type
        if push_to_hub:
            push_to_huggingface(filename)
    ! ls -lash {output_path}

Running Q4_K_M quantization and writing to /content/model/weaviate-gorilla-random-split-Q4_K_M.gguf
main: build = 1156 (2753415)
main: quantizing '/content/model/weaviate-gorilla-random-split-f16.gguf' to '/content/model/weaviate-gorilla-random-split-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /content/model/weaviate-gorilla-random-split-f16.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - t

weaviate-gorilla-random-split-Q4_K_M.gguf:   0%|          | 0.00/4.08G [00:00<?, ?B/s]

total 17G
4.0K drwxr-xr-x 2 root root 4.0K Sep  4 03:51 .
8.0K drwxr-xr-x 1 root root 4.0K Sep  4 03:44 ..
3.9G -rw-r--r-- 1 root root 3.9G Sep  4 03:53 weaviate-gorilla-random-split-Q4_K_M.gguf
 13G -rw-r--r-- 1 root root  13G Sep  4 03:43 weaviate-gorilla-random-split-f16.gguf
