*More details in this article: [Fast Inference with GGUF LoRA Adapters on the CPU](https://newsletter.kaitchup.com/p/fast-inference-with-gguf-lora-adapters)*

This notebook shows how to convert a LoRA adapter to GGUF and how to use it with llama.cpp. It also evaluates the perplexity of the GGUF LoRA and compares it with the standard approach of merging first and then converting to GGUF.

1.   `save_pretrained_gguf`/`push_to_hub_gguf`: preparing a merged checkpoint that can be converted to GGUF needs a lot of RAM, and Colab crashes.
2.   With `unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit` as the base model, the LoRA conversion fails with `ERROR:lora-to-gguf:Model MllamaForConditionalGeneration is not supported`.
3. Conversion of Llama-3.2-11B (vision) to GGUF for Ollama is not available yet.
4. When the LoRA adapter is merged with the Llama-3.2-3B foundation model, the result is an LLM, not a vision-language model.
5. Perplexity run with llama.cpp benchmarks only the language decoder of the merged model. (It can still be used for a comparison with the foundation model.)
6. Is it possible to attach a vision encoder to this decoder? Which parts of the LoRA adapter can be patched? Can we split the weights?
7. Use Llama from scratch and add a vision encoder?
8. Is it possible to connect the vision tower from Llama-3.2-11B to Llama-3.2-3B?
9. Libraries like Hugging Face Transformers can help with loading the pretrained CLIP and Llama models and building the combined architecture.
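
For the question of which adapter weights belong to the vision side and which to the language decoder, one possible starting point is to group the adapter's tensor names by submodule. The sketch below is plain Python over a list of names. The markers `vision_model`, `multi_modal_projector`, and `language_model` are assumptions about how an Mllama-style checkpoint names its submodules, and the example keys are hypothetical; in practice the names would come from the adapter's weights file.

```python
from collections import defaultdict

# Assumed submodule markers for an Mllama-style checkpoint (not verified against this adapter).
COMPONENT_MARKERS = ("vision_model", "multi_modal_projector", "language_model")


def group_lora_keys(keys):
    """Group LoRA tensor names by the first component marker found in each name."""
    groups = defaultdict(list)
    for key in keys:
        component = next((m for m in COMPONENT_MARKERS if m in key), "other")
        groups[component].append(key)
    return dict(groups)


# Hypothetical tensor names, for illustration only.
example_keys = [
    "base_model.model.vision_model.transformer.layers.0.self_attn.q_proj.lora_A.weight",
    "base_model.model.language_model.model.layers.0.self_attn.q_proj.lora_A.weight",
    "base_model.model.language_model.model.layers.0.self_attn.q_proj.lora_B.weight",
]

for component, names in group_lora_keys(example_keys).items():
    print(component, len(names))
```

If most tensors fall under the language decoder, that would suggest the decoder part of the adapter is the natural candidate to split off, though whether those weights transfer to a different base model is a separate question.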

# Installation

In [None]:
!pip install --upgrade transformers peft accelerate

Install llama.cpp (this can take a while to compile)

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
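# The Makefile build is deprecated in recent llama.cpp; build with CMake (binaries land in llama.cpp/build/bin)
!cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j $(nproc) && pip install -r requirements.txt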

Run the next cell only if you have issues with PyTorch.

In [None]:
!pip3 install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# The Standard Way: Merging, then GGUF

## Merging an Adapter with Transformers

In [None]:
'''
from peft import  PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
import torch

model_name = "meta-llama/Llama-3.2-3B-Instruct"
sft_adapter = "kaitchup/Llama-3.2-3B-Instruct-UltraChat" #Your adapter to merge

compute_dtype = torch.float16 #or use torch.bfloat16 on GPUs that support it


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
      model_name, device_map={"": 0}, torch_dtype=compute_dtype)

model = PeftModel.from_pretrained(model, sft_adapter)
model = model.merge_and_unload()
model.save_pretrained("./SFT_LoRA_Merged/")
tokenizer.save_pretrained("./SFT_LoRA_Merged/")
'''

## Conversion to GGUF

In [None]:
gguf_model = './SFT_LoRA_Merged/FP16.gguf'
!python llama.cpp/convert_hf_to_gguf.py ./SFT_LoRA_Merged/ --outtype f16 --outfile {gguf_model}
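
The perplexity cell that follows reads `wikitext-2-raw/wiki.test.raw`, which this notebook never downloads. A minimal sketch for fetching and unpacking it is shown here. The URL is an assumption (a mirror of the WikiText-2 raw archive), and `download_file` and `extract_zip` are hypothetical helper names, not part of llama.cpp.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url, dest):
    """Download url to dest unless the file is already present."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_zip(zip_path, out_dir="."):
    """Extract a zip archive into out_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Assumed mirror of the WikiText-2 raw archive; swap in whichever URL you trust.
WIKITEXT_URL = "https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip"

# Usage (requires network access):
# members = extract_zip(download_file(WIKITEXT_URL, "wikitext-2-raw-v1.zip"))
# print(members)  # should include the wiki.test.raw file used for evaluation
```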

Evaluation of the merged model converted to GGUF:

In [None]:
!./llama.cpp/build/bin/llama-perplexity -m {gguf_model} -f wikitext-2-raw/wiki.test.raw > ./SFT_LoRA_Merged/merged_LoRA_perplexity.txt
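
As a reminder of what the number means, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities. It illustrates the definition only and does not reproduce how llama.cpp chunks the text or aggregates its estimate.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every token probability 1/4 has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))  # 4.0
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.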

# GGUF LoRA

We must download the adapter (if you don't have it locally already).

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="sindsub/llama_flickr8k_lora_model", local_dir="./llama_flickr8k_lora_model")

## Conversion to GGUF

In [7]:
!python llama.cpp/convert_lora_to_gguf.py --outfile ./llama_flickr8k_lora_model.gguf ./llama_flickr8k_lora_model/

INFO:lora-to-gguf:Loading base model from Hugging Face: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
ERROR:lora-to-gguf:Model MllamaForConditionalGeneration is not supported
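
The log shows that the converter resolved the base model from the adapter itself. A quick way to see which base model an adapter points at, before running the conversion, is to read `adapter_config.json`. This is a minimal sketch: `base_model_name_or_path` is the field PEFT writes to that file, while `read_adapter_base` and `looks_like_vision_base` are hypothetical helpers, and the name check is only a heuristic.

```python
import json
from pathlib import Path


def read_adapter_base(adapter_dir):
    """Return the base model recorded in a PEFT adapter's adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("base_model_name_or_path")


def looks_like_vision_base(base_name):
    """Heuristic only: flag base model names that mention 'vision'."""
    return base_name is not None and "vision" in base_name.lower()


# Usage:
# base = read_adapter_base("./llama_flickr8k_lora_model")
# print(base, looks_like_vision_base(base))
```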


In [8]:
!python llama.cpp/convert-lora-to-ggml.py ./llama_flickr8k_lora_model/adapter_config.json

python3: can't open file '/content/llama.cpp/convert-lora-to-ggml.py': [Errno 2] No such file or directory


In [None]:
#Convert adapter_config.json to ggml-adapter-model.bin with llama.cpp
#https://sarinsuriyakoon.medium.com/unsloth-lora-with-ollama-lightweight-solution-to-full-cycle-llm-development-edadb6d9e0f0

## Evaluation

We will load the GGUF LoRA adapter on top of the GGUF base model, so the base model must be converted to GGUF first.

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="meta-llama/Llama-3.2-3B-Instruct", local_dir="./Llama-3.2-3B-Instruct/")

In [None]:
gguf_base_model = './Llama-3.2-3B-Instruct/FP16_base.gguf'
!python llama.cpp/convert_hf_to_gguf.py ./Llama-3.2-3B-Instruct/ --outtype f16 --outfile {gguf_base_model}
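
Once a GGUF adapter exists, it can be applied at load time with llama.cpp's `--lora` option instead of being merged. The sketch below only builds the command line. The binary path assumes a CMake build, `./adapter.gguf` is a hypothetical placeholder (the adapter conversion in this notebook failed, so no such file was produced here), and `perplexity_command` is a hypothetical helper.

```python
import shlex


def perplexity_command(binary, model, text_file, lora=None):
    """Build a llama-perplexity command line, optionally applying a GGUF LoRA adapter."""
    cmd = [binary, "-m", model, "-f", text_file]
    if lora is not None:
        cmd += ["--lora", lora]
    return cmd


cmd = perplexity_command(
    "./llama.cpp/build/bin/llama-perplexity",  # location after a CMake build
    "./Llama-3.2-3B-Instruct/FP16_base.gguf",
    "wikitext-2-raw/wiki.test.raw",
    lora="./adapter.gguf",  # hypothetical adapter path
)
print(shlex.join(cmd))
```

The adapter must have been trained on the same architecture as the base model it is loaded onto; an adapter trained on the 11B vision model is not expected to fit the 3B text model.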

Check which LoRA conversion script this llama.cpp checkout provides:

In [None]:
import os

llama_cpp_path = "./llama.cpp"
# Legacy script name; the earlier attempt to run it failed with "No such file or directory".
script_path = os.path.join(llama_cpp_path, "convert-lora-to-ggml.py")

if os.path.exists(script_path):
    print(f"The script {script_path} exists.")
    # Print the command to run rather than running it, since the earlier attempt errored.
    print("Command to try:")
    print(f"!python {script_path} ./llama_flickr8k_lora_model/adapter_config.json")
else:
    print(f"The script {script_path} does not exist.")
    print("Searching for alternative conversion tools in llama.cpp...")
    # List every file whose name mentions LoRA
    !find ./llama.cpp -name "*lora*"

In [11]:
# Convert the LoRA adapter to GGUF format using the found script
!python ./llama.cpp/convert_lora_to_gguf.py --outfile ./llama_flickr8k_lora_model.gguf ./llama_flickr8k_lora_model/

INFO:lora-to-gguf:Loading base model from Hugging Face: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
ERROR:lora-to-gguf:Model MllamaForConditionalGeneration is not supported


# Task
Run the perplexity benchmark on a LoRA adapter model using "https://huggingface.co/datasets/wikitext/resolve/main/wikitext-2-raw-v1.zip" as the test dataset.

## Merge LoRA adapter

### Subtask:
Merge your specific LoRA adapter with the base model using the `peft` library.


**Reasoning**:
The instructions require merging a LoRA adapter with a base model using the peft library. This involves loading the base model and tokenizer, loading the peft model with the adapter, merging the adapter, and saving the resulting merged model and tokenizer. The provided code block already does this.



In [18]:
from peft import  PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
import torch

model_name = "meta-llama/Llama-3.2-3B-Instruct"
sft_adapter = "/content/llama_flickr8k_lora_model" #Local path to the adapter to merge

compute_dtype = torch.float16 #or use torch.bfloat16 on GPUs that support it

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
      model_name, device_map={"": 0}, torch_dtype=compute_dtype)

model = PeftModel.from_pretrained(model, sft_adapter)
model = model.merge_and_unload()
model.save_pretrained("./SFT_LoRA_Merged/")
tokenizer.save_pretrained("./SFT_LoRA_Merged/")

('./SFT_LoRA_Merged/tokenizer_config.json',
 './SFT_LoRA_Merged/special_tokens_map.json',
 './SFT_LoRA_Merged/chat_template.jinja',
 './SFT_LoRA_Merged/tokenizer.json')

In [19]:
gguf_merged_model = './SFT_LoRA_Merged/merged_lora_model.gguf'
!python llama.cpp/convert_hf_to_gguf.py ./SFT_LoRA_Merged/ --outtype f16 --outfile {gguf_merged_model}

INFO:hf-to-gguf:Loading model: SFT_LoRA_Merged
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {8192, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.attn_k.wei

# Task
Run the perplexity benchmark on the COCO dataset captions using the GGUF merged model "./SFT_LoRA_Merged/merged_lora_model.gguf" and the `llama.cpp/llama-perplexity` tool.

## Download coco captions

### Subtask:
Download the COCO dataset caption annotations file.


**Reasoning**:
Download and unzip the COCO dataset caption annotations file as instructed.



In [20]:
!wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!unzip annotations_trainval2017.zip

--2025-07-23 07:28:09--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 3.5.25.184, 3.5.29.89, 3.5.3.112, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|3.5.25.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘annotations_trainval2017.zip’


2025-07-23 07:28:16 (38.6 MB/s) - ‘annotations_trainval2017.zip’ saved [252907541/252907541]

Archive:  annotations_trainval2017.zip
  inflating: annotations/instances_train2017.json  
  inflating: annotations/instances_val2017.json  
  inflating: annotations/captions_train2017.json  
  inflating: annotations/captions_val2017.json  
  inflating: annotations/person_keypoints_train2017.json  
  inflating: annotations/person_keypoints_val2017.json  


## Extract captions

### Subtask:
Extract the image captions from the downloaded annotations file and save them to a plain text file.





In [21]:
import json

with open('annotations/captions_val2017.json', 'r') as f:
    coco_data = json.load(f)

with open('coco_val_captions.txt', 'w') as f:
    for annotation in coco_data['annotations']:
        f.write(annotation['caption'] + '\n')

## Run perplexity benchmark

### Subtask:
Use the `llama.cpp/llama-perplexity` tool to run the benchmark on the extracted captions file using the GGUF merged model.


**Reasoning**:
Use the llama-perplexity tool to run the benchmark on the extracted COCO captions using the GGUF merged model.



In [22]:
!./llama.cpp/llama-perplexity -m ./SFT_LoRA_Merged/merged_lora_model.gguf -f coco_val_captions.txt > ./SFT_LoRA_Merged/merged_lora_perplexity_coco.txt

/bin/bash: line 1: ./llama.cpp/llama-perplexity: No such file or directory


**Reasoning**:
The previous command failed because the executable `./llama.cpp/llama-perplexity` was not found. I need to check if the executable exists and if the path is correct.



In [23]:
import os

perplexity_script_path = "./llama.cpp/llama-perplexity"

if os.path.exists(perplexity_script_path):
    print(f"The perplexity executable exists at: {perplexity_script_path}")
    # The file exists, so the previous error might be due to permissions or other execution issues.
    # I will try running the command again, ensuring the path is correct.
    print("Attempting to run the perplexity benchmark again:")
    !./llama.cpp/llama-perplexity -m ./SFT_LoRA_Merged/merged_lora_model.gguf -f coco_val_captions.txt > ./SFT_LoRA_Merged/merged_lora_perplexity_coco.txt
else:
    print(f"The perplexity executable does NOT exist at: {perplexity_script_path}")
    print("Listing files in ./llama.cpp to find the executable:")
    !ls ./llama.cpp

The perplexity executable does NOT exist at: ./llama.cpp/llama-perplexity
Listing files in ./llama.cpp to find the executable:
AUTHORS			       examples    poetry.lock
build-xcframework.sh	       flake.lock  prompts
ci			       flake.nix   __pycache__
cmake			       ggml	   pyproject.toml
CMakeLists.txt		       gguf-py	   pyrightconfig.json
CMakePresets.json	       grammars    README.md
CODEOWNERS		       include	   requirements
common			       LICENSE	   requirements.txt
CONTRIBUTING.md		       licenses    scripts
convert_hf_to_gguf.py	       Makefile    SECURITY.md
convert_hf_to_gguf_update.py   media	   src
convert_llama_ggml_to_gguf.py  models	   tests
convert_lora_to_gguf.py        mypy.ini    tools
docs			       pocs	   vendor


**Reasoning**:
The `llama-perplexity` executable was not found in the `./llama.cpp` directory. I need to rebuild llama.cpp to ensure the executable is created.



In [24]:
!cd llama.cpp && GGML_CUDA=1 make

Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.


**Reasoning**:
The Makefile build is deprecated according to the error message. I need to use the CMake build system to compile llama.cpp and generate the required executables, including `llama-perplexity`.



In [25]:
!cd llama.cpp && mkdir build && cd build && cmake .. && make -j $(nproc)

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- ggml version: 0.0.59

**Reasoning**:
The llama.cpp project has been successfully built using CMake, and the output confirms that the `llama-perplexity` executable was built and placed in the `./llama.cpp/build/bin` directory. Now I can use the correct path to run the perplexity benchmark.



In [26]:
!./llama.cpp/build/bin/llama-perplexity -m ./SFT_LoRA_Merged/merged_lora_model.gguf -f coco_val_captions.txt > ./SFT_LoRA_Merged/merged_lora_perplexity_coco.txt

build: 5966 (14c28dfc) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 28 key-value pairs and 255 tensors from ./SFT_LoRA_Merged/merged_lora_model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SFT_LoRA_Merged
llama_model_loader: - kv   3:                         general.size_label str              = 3.2B
llama_model_loader: - kv   4:                          llama.block_count u32              = 28
llama_model_loader: - kv   5:                       llama.context_length u32              = 131072
llama_model_loader: - kv   6:                     llama.embedding_l
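
To compare runs, the final score has to be pulled out of the tool's output. The sketch below assumes the output contains a line of the form `Final estimate: PPL = <value> +/- <error>`; that format, and whether the line lands in the redirected file or on stderr, may vary between llama.cpp builds. `parse_final_ppl` is a hypothetical helper.

```python
import re

# Assumed format of the summary line printed by llama-perplexity.
PPL_PATTERN = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(text):
    """Return (ppl, uncertainty) from llama-perplexity output, or None if absent."""
    match = PPL_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Usage:
# with open("./SFT_LoRA_Merged/merged_lora_perplexity_coco.txt") as f:
#     print(parse_final_ppl(f.read()))
```

If the function returns `None`, the summary line was probably written to stderr; redirecting with `2>&1` is one way to capture it.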

In [28]:
# Download merged_lora_model.gguf
from google.colab import files
files.download('./SFT_LoRA_Merged/merged_lora_model.gguf')
