## Chapter 6: Deploying It Locally

### Spoilers

In this chapter, we will:

- Load adapters and merge them into the base model for faster inference
- Query the model to generate responses or completions
- Convert the fine-tuned model to the GGUF file format used by llama.cpp
- Use Ollama and llama.cpp to serve the model through web interfaces and REST APIs

### Setup

For better reproducibility during training, use the pinned versions below, the same versions used in the book:

In [None]:
# Original version
# !pip install transformers==4.46.2 peft==0.13.2 accelerate==1.1.1 trl==0.12.1 bitsandbytes==0.45.2 datasets==3.1.0 huggingface-hub==0.26.2 safetensors==0.4.5 pandas==2.2.2 matplotlib==3.8.0 numpy==1.26.4
# Updated version - October/2025
!pip install transformers==4.56.1 peft==0.17.0 accelerate==1.10.0 trl==0.23.1 bitsandbytes==0.47.0 datasets==4.0.0 huggingface-hub==0.34.4 safetensors==0.6.2 pandas==2.2.2 matplotlib==3.10.0 numpy==2.0.2

In [None]:
# If you're running on Colab
#!pip install datasets bitsandbytes trl

In [None]:
# If you're running on runpod.io's Jupyter Template
#!pip install datasets bitsandbytes trl transformers peft huggingface-hub accelerate safetensors pandas matplotlib

### Imports

In [None]:
import pandas as pd
import requests
import torch
from contextlib import nullcontext
from dataclasses import asdict
from datasets import load_dataset
from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM, get_model_status, \
    get_layer_status, prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig

In [None]:
# If you're running on Colab, you need to download the helper functions' Python file
!wget https://raw.githubusercontent.com/dvgodoy/FineTuningLLMs/refs/heads/main/helper_functions.py

from helper_functions import *

### The Goal

We convert our fine-tuned models and adapters to the GGUF format and then quantize them so they are small enough to run on consumer-grade hardware (without GPUs). Next, we configure and import these models and adapters into Ollama or llama.cpp so we can serve them. That way, we can query them directly in a web interface or using a REST API.

### The Road So Far

****
**IMPORTANT UPDATE**: The release of `trl` version 0.20 brought several changes to `SFTConfig`:
- packing is performed differently than before, unless `packing_strategy='wrapped'` is set;
- the `max_seq_length` argument was renamed to `max_length`;
- `bf16` now defaults to `True` but, at the time of this update (Aug/2025), the library didn't check whether the BF16 type was actually available, so it is set explicitly in the configuration below.

Moreover, at the time of writing, the current version of `trl` (0.21) has a known issue where training fails if the LoRA configuration has already been applied to the model, as the trainer freezes the whole model, including the adapters.

However, it works as expected when the configuration is passed as the `peft_config` argument to the trainer, since it is applied after freezing the existing layers. 

If the model already contains the adapters, as in our case, training still works, but we need to use the underlying original model instead (`model.base_model.model`) to ensure the `save_model()` method functions correctly.
****
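
The renames above can be handled with a small version check. The sketch below is a hypothetical helper (`sft_length_kwargs` is not part of `trl`) that returns the keyword arguments matching a given version, assuming the 0.20 cutoff described in the update. It only builds a dictionary and doesn't import `trl`.

```python
def sft_length_kwargs(trl_version, max_len, packing=True, bf16_supported=False):
    """Builds version-dependent SFTConfig keyword arguments (hypothetical helper)."""
    # Only the major and minor components are needed for the comparison
    major, minor = (int(piece) for piece in trl_version.split('.')[:2])
    if (major, minor) >= (0, 20):
        # Newer versions: renamed length argument, explicit bf16 flag
        kwargs = {'max_length': max_len, 'bf16': bf16_supported}
        if packing:
            # Approximates the original packing behavior
            kwargs['packing_strategy'] = 'wrapped'
    else:
        # Older versions still use the original argument name
        kwargs = {'max_seq_length': max_len}
    kwargs['packing'] = packing
    return kwargs

print(sft_length_kwargs('0.23.1', 64))
print(sft_length_kwargs('0.12.1', 64))
```

The resulting dictionary could then be unpacked into the configuration, as in `SFTConfig(output_dir=..., **sft_length_kwargs(trl.__version__, 64, bf16_supported=supported))`.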

In [None]:
# From Chapter 2
supported = torch.cuda.is_bf16_supported(including_emulation=False)
compute_dtype = (torch.bfloat16 if supported else torch.float32)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=compute_dtype
)
model_q4 = AutoModelForCausalLM.from_pretrained(
  "facebook/opt-350m", device_map='cuda:0', quantization_config=nf4_config
)
# From Chapter 3
model_q4 = prepare_model_for_kbit_training(model_q4)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model_q4, config)

# From Chapter 4
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
tokenizer = modify_tokenizer(tokenizer)
tokenizer = add_template(tokenizer)

peft_model = modify_model(peft_model, tokenizer)

dataset = load_dataset("dvgodoy/yoda_sentences", split="train")
dataset = dataset.rename_column("sentence", "prompt")
dataset = dataset.rename_column("translation_extra", "completion")

# **IMPORTANT UPDATE**: unfortunately, in more recent versions of the `trl` library,
# the "instruction" format is not properly supported anymore, thus leading to the chat
# template not being applied to the dataset. In order to avoid this issue, we can
# convert the dataset to the "conversational" format.
dataset = dataset.map(format_dataset)
dataset = dataset.remove_columns(['prompt', 'completion', 'translation'])

# From Chapter 5
min_effective_batch_size = 8
lr = 3e-4
max_seq_length = 64
collator_fn = None
packing = (collator_fn is None)
steps = 20
num_train_epochs = 10

sft_config = SFTConfig(
    output_dir='./future_name_on_the_hub',
    # Dataset
    packing=packing,
    packing_strategy='wrapped',  # added to approximate original packing behavior
    max_length=64,               # renamed in v0.20
    # Gradients / Memory
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},
    gradient_accumulation_steps=2,
    per_device_train_batch_size=min_effective_batch_size,
    auto_find_batch_size=True,
    # Training
    num_train_epochs=num_train_epochs,
    learning_rate=lr,
    # Env and Logging
    report_to='tensorboard',
    logging_dir='./logs',
    logging_strategy='steps',
    logging_steps=steps,
    save_strategy='steps',
    save_steps=steps,
    bf16=supported               # ensures bf16 (the new default) is only used when it is actually available
)

trainer = SFTTrainer(
    model=peft_model,
    processing_class=tokenizer,
    train_dataset=dataset,
    data_collator=collator_fn,
    args=sft_config
)
trainer.train()
trainer.save_model('yoda-adapter') # trainer.push_to_hub()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 20 | 3.3681 |
| 40 | 2.2997 |
| 60 | 2.0115 |
| 80 | 1.8842 |
| 100 | 1.8059 |
| 120 | 1.7562 |
| 140 | 1.722 |
| 160 | 1.6822 |
| 180 | 1.6506 |
| 200 | 1.661 |


### Loading Models and Adapters

In [None]:
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
model = AutoPeftModelForCausalLM.from_pretrained(repo_or_folder,
                                                 device_map='auto',
                                                 adapter_name='yoda')
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 512, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23): 24 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (yoda): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (yoda): Linear(in_features=1024, out_features=8, bias=False)
     

****
**IMPORTANT UPDATE**: currently, the embedding layer is only resized if the tokenizer's length exceeds it, thus fixing a long-standing issue that was pointed out in the original version of the book.

Due to that issue, one had to load the base model according to the LoRA config, and then combine it with the adapter using the `PeftModel` class, as shown below:

```python
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
config = PeftConfig.from_pretrained(repo_or_folder)
base_model = AutoModelForCausalLM.from_pretrained(
  config.base_model_name_or_path,
  device_map='auto'
)
model = PeftModel.from_pretrained(
  base_model,
  repo_or_folder,
  adapter_name='yoda'
)
```
****

In [None]:
model.merge_adapter(['yoda'])
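
Merging folds each adapter's low-rank update into the weights of its base layer, so inference needs a single matrix multiplication per layer instead of a base path plus an adapter path. The toy sketch below illustrates the standard LoRA update, `W' = W + (lora_alpha / r) * B @ A`, using plain Python lists. The 2x2 sizes, the values, and the helper names are made up for illustration, and this is not how `peft` implements merging internally.

```python
def matmul(a, b):
    """Multiplies two matrices represented as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Adds two matrices element-wise."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scale(a, factor):
    """Multiplies every element of a matrix by a scalar."""
    return [[x * factor for x in row] for row in a]

W = [[1.0, 2.0], [3.0, 4.0]]  # base weight (out_features x in_features)
A = [[0.5, -1.0]]             # lora_A (r x in_features), with r=1
B = [[2.0], [1.0]]            # lora_B (out_features x r)
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1.0], [1.0]]            # input as a column vector

# Unmerged: base path plus scaled adapter path
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the scaled update into the base weight, then one matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths produce the same output, which is why a merged model behaves like the adapted one while skipping the extra adapter computation.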

In [None]:
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
tokenizer = AutoTokenizer.from_pretrained(repo_or_folder)

In [None]:
df = pd.DataFrame(asdict(layer) for layer in get_layer_status(model))
df

|   | name | module_type | enabled | active_adapters | merged_adapters | requires_grad | available_adapters | devices |
|---|------|-------------|---------|-----------------|-----------------|---------------|--------------------|---------|
| 0 | model.model.decoder.layers.0.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 1 | model.model.decoder.layers.0.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 2 | model.model.decoder.layers.1.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 3 | model.model.decoder.layers.1.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 4 | model.model.decoder.layers.2.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 5 | model.model.decoder.layers.2.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 6 | model.model.decoder.layers.3.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 7 | model.model.decoder.layers.3.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 8 | model.model.decoder.layers.4.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 9 | model.model.decoder.layers.4.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |


In [None]:
print(get_model_status(model))

TunerModelStatus(base_model_type='OPTForCausalLM', adapter_model_type='LoraModel', peft_types={'yoda': 'LORA'}, trainable_params=0, total_params=331982848, num_adapter_layers=48, enabled=True, active_adapters=['yoda'], merged_adapters=['yoda'], requires_grad={'yoda': False}, available_adapters=['yoda'], devices={'yoda': ['cuda']})
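
The totals in this status line can be sanity-checked by hand. The sketch below assumes every adapter layer wraps a 1024x1024 projection with rank 8, as in the printed `lora_A` shape earlier (the matching `lora_B` shape, 8 to 1024, is an assumption, since that printout is truncated), and splits `total_params` into adapter and base parameters.

```python
# Shapes taken from the printed model; lora_B's shape is assumed to mirror lora_A's
in_features, out_features, rank = 1024, 1024, 8
num_adapter_layers = 48          # from get_model_status()
total_params = 331_982_848       # from get_model_status()

# Each adapter layer holds lora_A (rank x in) and lora_B (out x rank), without biases
params_per_layer = rank * in_features + out_features * rank
adapter_params = num_adapter_layers * params_per_layer
base_params = total_params - adapter_params

print(f"adapter: {adapter_params:,} | base: {base_params:,} | "
      f"share: {adapter_params / total_params:.2%}")
```

Under these assumptions, the adapters account for 786,432 parameters, leaving 331,196,416 for the base model, a small fraction of the total.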


In [None]:
model.unload()

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=409

### Querying the Model

In [None]:
def gen_prompt(tokenizer, sentence):
    converted_sample = [
        {"role": "user", "content": sentence},
    ]
    prompt = tokenizer.apply_chat_template(converted_sample,
                                           tokenize=False,
                                           add_generation_prompt=True)
    return prompt

In [None]:
prompt = gen_prompt(tokenizer, 'There is bacon in this sandwich.')
print(prompt)

<|im_start|>user
There is bacon in this sandwich.<|im_end|>
<|im_start|>assistant
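
The printed prompt follows a ChatML-style layout: each message is wrapped in `<|im_start|>` and `<|im_end|>`, and the generation prompt opens an assistant turn for the model to complete. The function below is a plain-Python approximation of that layout, written only to reproduce the output above. `render_chatml` is a hypothetical name, and the real template is whatever the tokenizer carries.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximates the ChatML-style layout seen in the printed prompt."""
    text = ''
    for message in messages:
        # Each turn: start marker + role, a newline, the content, then the end marker
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Opens the assistant's turn so the model knows it should answer next
        text += '<|im_start|>assistant\n'
    return text

prompt_text = render_chatml(
    [{'role': 'user', 'content': 'There is bacon in this sandwich.'}]
)
print(prompt_text)
```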



In [None]:
def generate(model, tokenizer, prompt,
             max_new_tokens=64,
             skip_special_tokens=False,
             response_only=False):
    # Tokenizes the formatted prompt
    tokenized_input = tokenizer(prompt,
                                add_special_tokens=False,
                                return_tensors="pt").to(model.device)

    model.eval()
    # Generates the response/completion
    # if it was trained using mixed precision, uses autocast context
    ctx = torch.autocast(device_type=model.device.type, dtype=model.dtype) \
          if model.dtype in [torch.float16, torch.bfloat16] else nullcontext()
    with ctx:    
        generation_output = model.generate(**tokenized_input,
                                           eos_token_id=tokenizer.eos_token_id,
                                           max_new_tokens=max_new_tokens)

    # If required, removes the tokens belonging to the prompt
    if response_only:
        input_length = tokenized_input['input_ids'].shape[1]
        generation_output = generation_output[:, input_length:]

    # Decodes the tokens back into text
    output = tokenizer.batch_decode(generation_output,
                                    skip_special_tokens=skip_special_tokens)[0]
    return output

In [None]:
print(generate(model, tokenizer, prompt, skip_special_tokens=False, response_only=False))

<|im_start|>user
There is bacon in this sandwich.<|im_end|>
<|im_start|>assistant
In this sandwich, bacon there is.<|im_end|>


In [None]:
print(generate(model, tokenizer, prompt, skip_special_tokens=True, response_only=True))

In this sandwich, bacon there is.


In [None]:
sentences = ['There is bacon in this sandwich.', 'Add some cheddar to it.']

In [None]:
def batch_generate(model, tokenizer, sentences,
             max_new_tokens=64,
             skip_special_tokens=False,
             response_only=False):

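    # Accepts a single sentence as well as a list of sentences
    sentence_list = [sentences] if isinstance(sentences, str) else sentences

    # Converts prompts into conversational format
    converted_samples = [[{"role": "user", "content": sentence}]
                         for sentence in sentence_list]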

    # Applies the chat template to format the prompts
    prompts = tokenizer.apply_chat_template(converted_samples,
                                            tokenize=False,
                                            add_generation_prompt=True)

    # Forces padding to the left for batch generation
    tokenizer.padding_side = 'left'
    # Tokenizes the formatted prompts with padding
    tokenized_inputs = tokenizer(prompts,
                                 padding=True,
                                 add_special_tokens=False,
                                 return_tensors='pt').to(model.device)

    model.eval()
    # Generates the responses/completions
    # if it was trained using mixed precision, uses autocast context
    ctx = torch.autocast(device_type=model.device.type, dtype=model.dtype) \
          if model.dtype in [torch.float16, torch.bfloat16] else nullcontext()
    with ctx:   
        generation_output = model.generate(**tokenized_inputs,
                                           eos_token_id=tokenizer.eos_token_id,
                                           pad_token_id=tokenizer.pad_token_id,
                                           max_new_tokens=max_new_tokens)

    # If required, removes the tokens belonging to the prompts
    if response_only:
        input_length = tokenized_inputs['input_ids'].shape[1]
        generation_output = generation_output[:, input_length:]

    # Decodes the tokens back into text
    output = tokenizer.batch_decode(generation_output,
                                    skip_special_tokens=skip_special_tokens)
    return output[0] if isinstance(sentences, str) else output

In [None]:
batch_generate(model, tokenizer, sentences, skip_special_tokens=True, response_only=True)

['In this sandwich, bacon there is.', 'To it, add some cheddar, you must.']
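
Left padding matters because generation appends new tokens at the right end of every row: with right padding, pad tokens would sit between a short prompt and its completion. With left padding, all prompts end at the same column, so a single slice removes every prompt at once. The sketch below mimics this with plain lists. The token IDs and the pad ID are made up, and the "generated" tokens are hard-coded rather than produced by a model.

```python
PAD = 0  # made-up pad token ID

def left_pad(batch, pad_id=PAD):
    """Pads every row on the left so all rows share the same length."""
    width = max(len(row) for row in batch)
    return [[pad_id] * (width - len(row)) + row for row in batch]

prompt_ids = [[11, 12, 13, 14], [21, 22]]
padded = left_pad(prompt_ids)
input_length = len(padded[0])  # same role as tokenized_inputs['input_ids'].shape[1]

# Pretend the model appended two new tokens to each row
new_tokens = [[101, 102], [201, 202]]
generated = [row + new for row, new in zip(padded, new_tokens)]

# One slice strips the (padded) prompt from every row
responses = [row[input_length:] for row in generated]

print(padded)
print(responses)
```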

### Llama.cpp

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp.png?raw=True)

<center>Figure 6.1 - Screenshot of llama.cpp's GitHub Repo</center>

#### Converting Adapters

In order to convert an adapter to the GGUF format, we need to do the following:

- save the adapter to a local folder, either by calling the `save_model()` method after training, as we did in the last chapter, or by downloading it from the Hugging Face Hub (see the aside for details)
- clone the llama.cpp repository from GitHub

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

- install the `gguf-py` package

In [None]:
!pip install llama.cpp/gguf-py

In [None]:
# required in newer versions
!pip install mistral-common

***
**Downloading Models from the Hub**

In [None]:
from huggingface_hub import login
login()

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="dvgodoy/phi3-mini-yoda-adapter", local_dir='./phi3-mini-yoda-adapter')

***

- run the `convert_lora_to_gguf.py` script

In [None]:
!python ./llama.cpp/convert_lora_to_gguf.py \
        ./phi3-mini-yoda-adapter \
        --outfile adapter.gguf \
        --outtype q8_0

- the `outtype` may be one of the following choices: `f32`, `f16`, `bf16`, `q8_0`, or `auto`, which defaults to the highest-fidelity 16-bit float type depending on the first loaded tensor.
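
When converting several adapters, it can help to assemble the command in Python and validate the output type first. The helper below is a hypothetical wrapper (`build_lora_convert_cmd` isn't part of llama.cpp) that uses only the script path, flags, and choices shown above. It returns the argument list without running anything.

```python
import shlex

# Output types listed above for convert_lora_to_gguf.py
VALID_OUTTYPES = ('f32', 'f16', 'bf16', 'q8_0', 'auto')

def build_lora_convert_cmd(adapter_dir, outfile, outtype='q8_0',
                           script='./llama.cpp/convert_lora_to_gguf.py'):
    """Builds the argument list for the adapter conversion (hypothetical helper)."""
    if outtype not in VALID_OUTTYPES:
        raise ValueError(f"outtype must be one of {VALID_OUTTYPES}, got {outtype!r}")
    return ['python', script, adapter_dir, '--outfile', outfile, '--outtype', outtype]

cmd = build_lora_convert_cmd('./phi3-mini-yoda-adapter', 'adapter.gguf')
print(shlex.join(cmd))
```

The list can be passed as is to `subprocess.run(cmd, check=True)` when running outside a notebook.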

#### Converting Full Models

##### Using "GGUF My Repo"

https://huggingface.co/spaces/ggml-org/gguf-my-repo

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/gguf_my_repo.png?raw=True)

<center>Figure 6.2 - Screenshot of "GGUF My Repo" space on Hugging Face</center>

##### Using Unsloth

In [None]:
# Original
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Updated  - August/2025
!pip install unsloth
!pip install protobuf==3.20.1  # this is required at the time of the update
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained('dvgodoy/phi3-mini-yoda-adapter')

```
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Mistral patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

model.safetensors: 100%
 2.26G/2.26G [00:30<00:00, 202MB/s]
generation_config.json: 100%
 194/194 [00:00<00:00, 17.5kB/s]
tokenizer_config.json:
 3.34k/? [00:00<00:00, 202kB/s]
tokenizer.model: 100%
 500k/500k [00:00<00:00, 1.10MB/s]
added_tokens.json: 100%
 293/293 [00:00<00:00, 21.3kB/s]
special_tokens_map.json: 100%
 458/458 [00:00<00:00, 43.5kB/s]
tokenizer.json:
 1.84M/? [00:00<00:00, 6.64MB/s]
adapter_model.safetensors: 100%
 50.4M/50.4M [00:00<00:00, 60.2MB/s]

Unsloth 2025.8.9 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
```

In [None]:
model

```
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32009)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (k_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (v_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm((3072,), eps=1e-05)
            (post_attention_layernorm): MistralRMSNorm((3072,), eps=1e-05)
          )
        )
        (norm): MistralRMSNorm((3072,), eps=1e-05)
      )
      (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
    )
  )
)
```

In [None]:
# This command may fail for several reasons, as it depends on the environment
# and the stability of llama.cpp (which is installed during its execution)

# Removing the llama.cpp folder we cloned above, so Unsloth can install it on its own
!rm -rf llama.cpp/

model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")

 ```
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.0 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...

100%|██████████| 32/32 [00:01<00:00, 24.95it/s]

Unsloth: Saving tokenizer... Done.
Unsloth: Saving gguf_model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving gguf_model/pytorch_model-00002-of-00002.bin...
Done.

Unsloth: Converting mistral model. Can use fast conversion = True.

==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at gguf_model into f16 GGUF format.
The output location will be /content/gguf_model/unsloth.F16.gguf
This might take 3 minutes...

Unsloth: Extending gguf_model/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).

INFO:hf-to-gguf:Loading model: gguf_model
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
...
INFO:hf-to-gguf:blk.31.attn_norm.weight,     torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,      torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:output_norm.weight,          torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:output.weight,               torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 3072
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 32
INFO:hf-to-gguf:gguf: rope theta = 10000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 32000
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 32009
INFO:gguf.vocab:Setting add_bos_token to False
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/content/gguf_model/unsloth.F16.gguf: n_tensors = 291, total_size = 7.6G
Writing: 100%|██████████| 7.64G/7.64G [02:24<00:00, 53.0Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /content/gguf_model/unsloth.F16.gguf
Unsloth: Conversion completed! Output location: /content/gguf_model/unsloth.F16.gguf
Unsloth: [2] Converting GGUF 16bit into q4_k_m. This might take 20 minutes...
main: build = 6230 (30649cab)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/gguf_model/unsloth.F16.gguf' to '/content/gguf_model/unsloth.Q4_K_M.gguf' as Q4_K_M using 4 threads
llama_model_loader: loaded meta data with 34 key-value pairs and 291 tensors from /content/gguf_model/unsloth.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gguf_Model
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 3.8B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["unsloth", "llama.cpp"]
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 4096
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 llama.attention.key_length u32              = 96
llama_model_loader: - kv  16:               llama.attention.value_length u32              = 96
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 32064
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 32009
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                        output.weight - [ 3072, 32064,     1,     1], type =    f16, converting to q6_K .. size =   187.88 MiB ->    77.06 MiB
[   2/ 291]                   output_norm.weight - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
...
[ 290/ 291]               blk.31.ffn_norm.weight - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 291/ 291]                 blk.31.ffn_up.weight - [ 3072,  8192,     1,     1], type =    f16, converting to q4_K .. size =    48.00 MiB ->    13.50 MiB
llama_model_quantize_impl: model size  =  7288.51 MB
llama_model_quantize_impl: quant size  =  2210.78 MB

main: quantize time = 435560.84 ms
main:    total time = 435560.84 ms
Unsloth: Conversion completed! Output location: /content/gguf_model/unsloth.Q4_K_M.gguf
```
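
The sizes reported at the end of the log give a quick estimate of what the quantization achieved. The arithmetic below assumes the F16 file stores 16 bits per weight on average (a slight simplification, since the log shows 65 tensors kept in F32) and derives the effective bits per weight of the `Q4_K_M` file.

```python
# Values copied from the llama_model_quantize_impl lines in the log above
model_size_mb = 7288.51  # size before quantization (F16)
quant_size_mb = 2210.78  # size after quantization (Q4_K_M)

# Fraction of the original size that remains after quantization
ratio = quant_size_mb / model_size_mb

# Assumption: the source file averages 16 bits per weight
bits_per_weight = 16 * ratio

print(f"{ratio:.1%} of the F16 size, about {bits_per_weight:.2f} bits per weight")
```

The result lands somewhat above 4 bits because `Q4_K_M` mixes quantization types: in the log, `output.weight` is converted to `q6_K`, while `blk.31.ffn_up.weight` is converted to `q4_K`.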

##### Using Docker Images

To convert the model, we need to run the command below:

```
docker run --rm \
           -v "/path/to/saved_model":/repo \
           ghcr.io/ggerganov/llama.cpp:full \
           --convert "/repo" \
           --outtype f32 \
           --outfile /repo/gguf-model-f32.gguf

1. `--rm`: Automatically removes the container once it finishes running, which is useful in cases such as ours, where we're only interested in running a script once.
2. `-v [local path]:[path inside container]`: Maps a folder on your computer to a folder inside the container. This allows the container to "see" your local folder as if it were located inside the container itself.
3. `[docker image]`: We're using llama.cpp's Docker image, `ghcr.io/ggerganov/llama.cpp:full`.
4. `--convert [path inside container]`: This is the command we're executing. It isn't a Docker command, but rather a command that's available in the particular image we're using.
5. `--outtype [GGUF type]`: An argument of the `--convert` command that specifies the data type of the resulting GGUF file.
6. `--outfile [GGUF filename]`: Another argument of the `--convert` command. It specifies the name of the GGUF file. Note that it points to a path inside the container (`/repo`) that was mapped to a folder on your local computer, so the file is generated directly in your local folder.
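
If you'd rather launch the conversion from Python, the arguments described above can be assembled as a list. This is a minimal sketch: `build_convert_command` is a hypothetical helper (it is not part of llama.cpp or Docker), and it only reproduces the command shown above without running it.

```python
import shlex

def build_convert_command(model_dir, outtype="f32",
                          outfile="gguf-model-f32.gguf",
                          image="ghcr.io/ggerganov/llama.cpp:full"):
    """Hypothetical helper: returns the docker invocation as a list of arguments."""
    return [
        "docker", "run", "--rm",
        "-v", f"{model_dir}:/repo",       # map the local folder to /repo
        image,
        "--convert", "/repo",             # command provided by the image
        "--outtype", outtype,
        "--outfile", f"/repo/{outfile}",  # path inside the container
    ]

cmd = build_convert_command("/path/to/saved_model")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided Docker is installed and the folder exists.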

To quantize the converted model, we need to run the following command:

```
docker run --rm \
           -v "/path/to/saved_model":/repo \
           ghcr.io/ggerganov/llama.cpp:full \
           --quantize "/repo/gguf-model-f32.gguf" \
           "/repo/gguf-model-Q4_K_M.gguf" \
           "Q4_K_M"
```

7. `--quantize [GGUF filename]`: This is the new command we're executing. Like `--convert`, it is provided by this particular image rather than by Docker. Its first argument is the GGUF file to be quantized (usually the outfile from the convert command).
8. `[quantized GGUF filename]`: The name of the quantized file produced by the script. Make sure it points to the mapped folder so you can access it directly in your local folder as well.
9. `[quantization type]`: For a full list of quantization types, please check the [documentation](https://github.com/ggerganov/llama.cpp/blob/main/examples/quantize/README.md).

##### Building llama.cpp

```python
!git clone https://github.com/ggerganov/llama.cpp
!pip install llama.cpp/gguf-py
!pip install -r llama.cpp/requirements.txt
```

```python
!python ./llama.cpp/convert_hf_to_gguf.py /path/to/saved_model --outtype f16
```

```python
!cd llama.cpp && make clean && make
```

```python
!./llama.cpp/quantize \
    ./path/to/saved_model/ggml-model-f16.gguf \
    ./path/to/saved_model/ggml-model-q4_0.gguf \
    q4_0
```
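
Before importing a converted or quantized file anywhere, it can be worth confirming that it really is a GGUF file. The sketch below reads the fixed-size header, assuming the layout used by GGUF version 2 and later (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct

def read_gguf_header(path):
    """Hypothetical helper: parse the fixed-size GGUF header (v2+ layout assumed)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": kv_count}
```

For example, `read_gguf_header('/path/to/saved_model/ggml-model-q4_0.gguf')` should return a tensor count matching the number of tensors reported in the conversion log.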

### Serving Models

#### Ollama

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/ollama.png?raw=True)
<center>Figure 6.3 - Screenshot of Ollama's page</center>

```
ollama run phi3:mini
```

##### Installing Ollama

****
**IMPORTANT UPDATE**: at the time of this update (August/2025), the latest versions of Ollama crashed while trying to query a model built from a custom file containing adapters. Version 0.9.6, however, works without any issues.
****

In [None]:
# !curl -fsSL https://ollama.com/install.sh |  sh
# Update - August/2025
!curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.9.6 sh

```
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
```

##### Running Ollama in Colab

In [None]:
# Adapted from https://stackoverflow.com/questions/77697302/how-to-run-ollama-in-google-colab

import os
import asyncio
import threading

# NB: depending on the backend you are running, you may need to set these
# environment variables to get CUDA working.
# Add the CUDA binaries to the PATH
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to include both /usr/lib64-nvidia and CUDA lib directories
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # Forward each line of a stream to the notebook output
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # Stream stdout and stderr concurrently until the process exits
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

In [None]:
# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()

# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()

>>> starting
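
Since the server starts in a background thread, the next cells may run before it is ready to accept connections. A minimal sketch of a readiness check is shown below; `wait_for_port` is a hypothetical helper, and the default address is the one reported by the installer (127.0.0.1:11434).

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=11434, timeout=30.0, interval=0.5):
    """Hypothetical helper: return True once a TCP connection succeeds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` right after starting the thread blocks until the server is listening, or returns `False` after 30 seconds.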

##### Model Files

| Instruction | Description |
|---|---|
|FROM (required) | Defines the base model to use. |
|PARAMETER | Sets the parameters for how Ollama will run the model. |
|TEMPLATE | The full prompt template to be sent to the model. |
|SYSTEM | Specifies the system message that will be set in the template. |
|ADAPTER | Defines the (Q)LoRA adapters to apply to the model. |
|LICENSE | Specifies the legal license. |
|MESSAGE | Specifies the message history. |

```
ollama show --modelfile phi3:mini
```


```
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM phi3:mini

FROM /usr/share/ollama/.ollama/models/blobs/sha256-633fc...
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
LICENSE """Microsoft.
Copyright (c) Microsoft Corporation.
...
```

In [None]:
from transformers import AutoTokenizer
tokenizer_phi3 = AutoTokenizer.from_pretrained('microsoft/phi-3-mini-4k-instruct')
print(tokenizer_phi3.chat_template)


{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
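
The Jinja template printed above and Ollama's `TEMPLATE` describe the same layout in two different syntaxes. To make the layout easier to see, here is a plain-Python sketch of what the Jinja template does. `format_phi3_chat` is a hypothetical helper, and the default `eos_token` value is an assumption (check `tokenizer_phi3.eos_token` for the actual one).

```python
def format_phi3_chat(messages, add_generation_prompt=True,
                     eos_token="<|endoftext|>"):  # assumed default EOS token
    """Hypothetical helper mirroring the Jinja chat template printed above."""
    parts = []
    for message in messages:
        role = message["role"]
        # Only these three roles are handled by the template
        if role in ("system", "user", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # Either open the assistant's turn or close the sequence
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)

print(format_phi3_chat([{"role": "user", "content": "Hello!"}]))
```

Each message is wrapped in its role tag and `<|end|>`, which is exactly what the `{{ .System }}` and `{{ .Prompt }}` blocks in Ollama's template produce.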


##### Importing Models

###### Custom (Full) Model File

```python
modelfile = """
FROM ./phi3-full-model
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
"""

with open('phi3-full-modelfile', 'w') as f:
    f.write(modelfile)
```

```
!ollama create our_own_phi3 -f phi3-full-modelfile
```

```
!ollama list
```

###### Custom Adapters

In [None]:
adapterfile = """
FROM phi3:mini
ADAPTER ./adapter.gguf
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
"""

with open('phi3-adapter-file', 'w') as f:
    f.write(adapterfile)
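
The full-model file and the adapter file differ only in their `FROM` line and in the optional `ADAPTER` line, so both could be generated from the same pieces. The sketch below assumes nothing beyond the instructions already used above; `build_modelfile` is a hypothetical helper, not an Ollama function.

```python
PHI3_TEMPLATE = """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PHI3_STOPS = ["<|end|>", "<|user|>", "<|assistant|>"]

def build_modelfile(base, template=PHI3_TEMPLATE, stops=PHI3_STOPS, adapter=None):
    """Hypothetical helper: assemble the text of an Ollama model file."""
    lines = [f"FROM {base}"]
    if adapter is not None:
        lines.append(f"ADAPTER {adapter}")
    lines.append(f'TEMPLATE "{template}"')
    lines.extend(f"PARAMETER stop {stop}" for stop in stops)
    return "\n".join(lines) + "\n"

print(build_modelfile("phi3:mini", adapter="./adapter.gguf"))
```

Calling `build_modelfile('./phi3-full-model')` without an adapter yields the equivalent of the full-model file.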

In [None]:
!ollama create our_own_phi3_adapted -f phi3-adapter-file

In [None]:
!ollama list

[GIN] 2025/08/21 - 14:22:45 | 200 |      39.731µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/21 - 14:22:45 | 200 |    1.833702ms |       127.0.0.1 | GET      "/api/tags"
NAME                           ID              SIZE      MODIFIED       
our_own_phi3_adapted:latest    4aa0c981ee99    2.2 GB    36 seconds ago    
phi3:mini                      4f2222927938    2.2 GB    36 seconds ago    
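
The server log above shows that `ollama list` is answered by a `GET` request to `/api/tags`. The same check can be done from Python, as in the sketch below. It assumes the server is running at the default address, and both function names are hypothetical.

```python
import json
import urllib.request

def model_names(tags_payload):
    """Extract the model names from the JSON returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]

def list_ollama_models(base_url="http://127.0.0.1:11434"):
    """Hypothetical helper: query a running Ollama server for its models."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))
```

With the server running, `'our_own_phi3_adapted:latest' in list_ollama_models()` confirms that the model was created.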


##### Querying the Model

In [None]:
!pip install ollama

In [None]:
import ollama

prompt = "The Force is strong in this one!"
response = ollama.generate(model='our_own_phi3_adapted',
                           prompt=prompt)
print(response)

In [None]:
print(response['response'])

In this one, the Force is strong.


In [None]:
messages = [{'role': 'user', 'content': prompt}]
formatted = tokenizer_phi3.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

response = ollama.generate(model='our_own_phi3_adapted',
                           prompt=formatted,
                           raw=True)
print(response['response'])

<|user|>
The Force is strong in this one!<|end|>
<|assistant|>

[GIN] 2025/08/21 - 14:24:12 | 200 | 19.741434736s |       127.0.0.1 | POST     "/api/generate"
In this one, the Force is strong!
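
The `ollama` Python package is a thin client over Ollama's REST API, and the log line above shows the underlying `POST` to `/api/generate`. For reference, here is a sketch of the same call using only the standard library. The helper names are hypothetical, and `stream` is set to `False` so that the server returns a single JSON object.

```python
import json
import urllib.request

def build_generate_request(model, prompt, raw=False,
                           base_url="http://127.0.0.1:11434"):
    """Hypothetical helper: build the POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "raw": raw, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, raw=False):
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, raw)) as resp:
        return json.load(resp)["response"]
```

With the server running, `generate('our_own_phi3_adapted', prompt)` should behave like the `ollama.generate()` call above.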


#### Llama.cpp

Using the full Docker image, one that can be used to convert, quantize, and serve:

```
docker run -v "/path/to/saved_model":/model  \
           -p 8080:8000 \
           ghcr.io/ggerganov/llama.cpp:full \
           --server \
           -m /model/gguf-model-Q4_K_M.gguf \
           --port 8000 \
           --host 0.0.0.0
```

1. `-v [local path]:[path inside container]`: Maps a folder on your computer to a folder inside the container, so the container can "see" your local folder as if it were located inside the container itself.
2. `-p [host port]:[container port]`: Forwards requests sent to a port on the host (e.g., 8080) to a port inside the container (e.g., 8000).
3. `[docker image]`: We're using llama.cpp's Docker image, `ghcr.io/ggerganov/llama.cpp:full`.
4. `--server`: This is the command we're executing. It's not a Docker command, but rather one that's available within the specific image we're using.
5. `-m /model/[quantized_gguf_file].gguf`: This is the model we're serving.
6. `--port [container port]`: This is the port inside the container used to serve the model. It should match the container port specified in the `-p` argument.
7. `--host [ip address]`: The IP address the server binds to inside the container. Using `0.0.0.0` makes it listen on all interfaces, so it can receive the requests Docker forwards from the host.

Using a smaller Docker image that's specifically built for serving:

```
docker run -v "/path/to/saved_model":/model \
           -p 8080:8000 \
           ghcr.io/ggerganov/llama.cpp:server \
           -m /model/gguf-model-Q4_K_M.gguf \
           --port 8000 \
           --host 0.0.0.0
```
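
Loading the model can take a while after the container starts. The llama.cpp server exposes a `/health` endpoint that can be polled before sending requests. The sketch below assumes the host port 8080 from the commands above and treats any response other than HTTP 200 as "not ready"; `is_server_ready` is a hypothetical helper.

```python
import urllib.request

def is_server_ready(base_url="http://localhost:8080", timeout=2.0):
    """Hypothetical helper: True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        return False
```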

##### Web Interface

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp_ui.png?raw=True)
<center>Figure 6.4 - Screenshot of llama.cpp's web UI</center>

If you click on the settings button at the top-right corner, you'll see plenty of parameters you can set, such as temperature:

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp_settings.png?raw=True)
<center>Figure 6.5 - Screenshot of llama.cpp's settings</center>

##### REST API

```python
url = 'http://0.0.0.0:8080/completion'
headers = {'Content-Type': 'application/json'}

data = {'prompt': 'There is bacon in this sandwich.',
        'n_predict': 128}

response = requests.post(url, json=data, headers=headers)
```

```python
print(response.json()['content'])
```

```
 There is no bacon in this sandwich. This statement is a paradox because it contradicts itself, yet it seems to suggest that the sandwich has both bacon and no bacon at the same time.

2. This statement is also a paradox, as it claims that it is a lie that it is lying. If the statement is true, then it is indeed a lie, making it false. But if it is false, then it is not a lie, making it true. This creates a circular reasoning that can't be resolved.

3. This statement is a paradox
```
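
The `/completion` endpoint receives the prompt as raw text, without any chat formatting, which helps explain why the output above reads like a continuation of a document rather than a reply. The llama.cpp server also offers an OpenAI-compatible `/v1/chat/completions` endpoint that takes a list of messages and formats them with the chat template stored in the GGUF file. The sketch below shows one way to call it; the helper names are hypothetical, and the URL assumes the port mapping used earlier.

```python
import json
import urllib.request

def build_chat_request(user_message, max_tokens=128,
                       base_url="http://localhost:8080"):
    """Hypothetical helper: build a request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_reply(response_json):
    """Pull the assistant's text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]
```

With the server running, `extract_reply(json.load(urllib.request.urlopen(build_chat_request('There is bacon in this sandwich.'))))` returns the model's reply to the message.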

### Thank You!

If you have any suggestions, or if you find any errors, please don't hesitate to contact me through [GitHub](https://github.com/dvgodoy), [X](https://x.com/dvgodoy), [BlueSky](https://bsky.app/profile/dvgodoy.bsky.social), or [LinkedIn](https://www.linkedin.com/in/dvgodoy/).

If you'd like to receive notifications about new book releases, updates, freebies, and discounts, follow me at:

<center><a href="https://danielgodoy.gumroad.com/subscribe">https://danielgodoy.gumroad.com/subscribe</a></center>

I'm looking forward to hearing back from you!