<a href="https://colab.research.google.com/github/rickiepark/fine-tuning-llm/blob/main/Chapter6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

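## Chapter 6: Deploying It Locally

### Spoilers
In this chapter, you will learn how to:

- Load adapters and merge them into the base model for faster inference.
- Use the model to generate responses or completions.
- Convert the fine-tuned model to GGUF, the file format used by llama.cpp.
- Use Ollama and llama.cpp to serve the model through a web interface and a REST API.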

### Installing Packages

To reproduce the training results, install the same package versions that this book uses.

In [None]:
!pip install transformers==4.56.1 peft==0.17.0 accelerate==1.10.0 trl==0.23.1 bitsandbytes==0.47.0 datasets==4.0.0 huggingface-hub==0.34.4 safetensors==0.6.2 pandas==2.2.2 matplotlib==3.10.0 numpy==2.0.2


### Imports

In [None]:
import pandas as pd
import requests
import torch
from contextlib import nullcontext
from dataclasses import asdict
from datasets import load_dataset
from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM, get_model_status, \
    get_layer_status, prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig

In [None]:
# Download helper_functions.py from GitHub.
!wget https://raw.githubusercontent.com/rickiepark/fine-tuning-llm/refs/heads/main/helper_functions.py

from helper_functions import *

--2025-09-02 02:50:58--  https://raw.githubusercontent.com/rickiepark/fine-tuning-llm/refs/heads/main/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6492 (6.3K) [text/plain]
Saving to: ‘helper_functions.py’


2025-09-02 02:50:59 (71.1 MB/s) - ‘helper_functions.py’ saved [6492/6492]



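### Goal

We convert the fine-tuned model and its adapter to the GGUF format and then quantize them so they can run on personal hardware (without a GPU). Next, we load them into Ollama or llama.cpp for serving. This lets us send queries directly to the model through a web interface or a REST API.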
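
As a rough preview of what querying a served model over REST can look like, here is a minimal sketch. It assumes an Ollama server running locally on its default port (11434) and a model registered under the hypothetical name `yoda`. It uses only the standard library so that it stays self-contained; the `requests` package imported in this notebook would work just as well.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model_name, prompt, url=OLLAMA_URL):
    """Build (but do not send) a POST request for one non-streamed completion."""
    payload = {"model": model_name, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(model_name, prompt, url=OLLAMA_URL, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model_name, prompt, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# "yoda" is a hypothetical model name; nothing is sent until query() is called.
request = build_generate_request("yoda", "There is bacon in this sandwich.")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload before a server is available.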

### Setup Code

In [None]:
# Chapter 2
supported = torch.cuda.is_bf16_supported(including_emulation=False)
compute_dtype = (torch.bfloat16 if supported else torch.float32)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=compute_dtype
)
model_q4 = AutoModelForCausalLM.from_pretrained(
  "facebook/opt-350m", device_map='cuda:0', quantization_config=nf4_config
)
# Chapter 3
model_q4 = prepare_model_for_kbit_training(model_q4)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model_q4, config)

# Chapter 4
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
tokenizer = modify_tokenizer(tokenizer)
tokenizer = add_template(tokenizer)

peft_model = modify_model(peft_model, tokenizer)

dataset = load_dataset("dvgodoy/yoda_sentences", split="train")
dataset = dataset.rename_column("sentence", "prompt")
dataset = dataset.rename_column("translation_extra", "completion")
# Convert the prompt/completion pairs into conversational messages.
dataset = dataset.map(format_dataset)
dataset = dataset.remove_columns(["prompt", "completion", "translation"])

# Chapter 5
min_effective_batch_size = 8
lr = 3e-4
max_seq_length = 64
collator_fn = None
packing = (collator_fn is None)
steps = 20
num_train_epochs = 10

sft_config = SFTConfig(
    output_dir='./future_name_on_the_hub',
    # Dataset
    packing=packing,
    packing_strategy='wrapped',
    max_length=max_seq_length,
    # Gradients / memory
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},
    gradient_accumulation_steps=2,
    per_device_train_batch_size=min_effective_batch_size,
    auto_find_batch_size=True,
    # Training
    num_train_epochs=num_train_epochs,
    learning_rate=lr,
    # Environment and logging
    report_to='tensorboard',
    logging_dir='./logs',
    logging_strategy='steps',
    logging_steps=steps,
    save_strategy='steps',
    save_steps=steps,
    bf16=supported
)

trainer = SFTTrainer(
    model=peft_model,
    processing_class=tokenizer,
    train_dataset=dataset,
    data_collator=collator_fn,
    args=sft_config
)
trainer.train()
trainer.save_model('yoda-adapter') # trainer.push_to_hub()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 20 | 3.3623 |
| 40 | 2.3007 |
| 60 | 2.0119 |
| 80 | 1.8822 |
| 100 | 1.8075 |
| 120 | 1.7628 |
| 140 | 1.7295 |
| 160 | 1.6878 |
| 180 | 1.6569 |
| 200 | 1.6683 |


### Loading the Model and Adapter

In [None]:
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
model = AutoPeftModelForCausalLM.from_pretrained(repo_or_folder,
                                                 device_map='auto',
                                                 adapter_name='yoda')
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 512, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23): 24 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (yoda): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (yoda): Linear(in_features=1024, out_features=8, bias=False)
     

****
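**Important**: The embedding layer is now resized only when the tokenizer's vocabulary is larger than the embedding layer. This resolves the long-standing issue mentioned in the book.

Because of that issue, we previously had to load the base model named in the adapter's LoRA configuration and then use the `PeftModel` class to combine the model and the adapter, as follows.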

```python
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
config = PeftConfig.from_pretrained(repo_or_folder)
base_model = AutoModelForCausalLM.from_pretrained(
  config.base_model_name_or_path,
  device_map='auto'
)
model = PeftModel.from_pretrained(
  base_model,
  repo_or_folder,
  adapter_name='yoda'
)
```
****

In [None]:
model.merge_adapter(['yoda'])
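
To see why merging speeds up inference, here is a toy sketch of the arithmetic behind a LoRA merge. It assumes the standard LoRA update, `W_merged = W + (lora_alpha / r) * B @ A`, and uses tiny hand-written matrices rather than real model weights. The helper names are made up for illustration and are not part of PEFT.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply(weight, x):
    """Compute y = W x for a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_B, lora_A)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: in_features=3, out_features=2, r=1 (illustrative values only)
W = [[1, 0, 2], [0, 1, 0]]
A = [[1, 2, 3]]        # r x in
B = [[0.5], [-1.0]]    # out x r
alpha, r = 2, 1
x = [1, 1, 1]

# Unmerged: the base path and the adapter path are computed separately.
adapter_out = apply(B, apply(A, x))
y_unmerged = [b + (alpha / r) * a for b, a in zip(apply(W, x), adapter_out)]

# Merged: a single matrix multiplication gives the same result.
W_merged = merge_lora(W, A, B, alpha, r)
y_merged = apply(W_merged, x)

print(y_unmerged, y_merged)
```

After merging, each adapted layer needs one matrix multiplication instead of three, which is where the inference speed-up comes from.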

In [None]:
repo_or_folder = 'dvgodoy/opt-350m-lora-yoda'
tokenizer = AutoTokenizer.from_pretrained(repo_or_folder)

In [None]:
df = pd.DataFrame(asdict(layer) for layer in get_layer_status(model))
df

|   | name | module_type | enabled | active_adapters | merged_adapters | requires_grad | available_adapters | devices |
|---|------|-------------|---------|-----------------|-----------------|---------------|--------------------|---------|
| 0 | model.model.decoder.layers.0.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 1 | model.model.decoder.layers.0.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 2 | model.model.decoder.layers.1.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 3 | model.model.decoder.layers.1.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 4 | model.model.decoder.layers.2.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 5 | model.model.decoder.layers.2.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 6 | model.model.decoder.layers.3.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 7 | model.model.decoder.layers.3.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 8 | model.model.decoder.layers.4.self_attn.v_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |
| 9 | model.model.decoder.layers.4.self_attn.q_proj | lora.Linear | True | [yoda] | [yoda] | {'yoda': False} | [yoda] | {'yoda': ['cuda']} |


In [None]:
print(get_model_status(model))

TunerModelStatus(base_model_type='OPTForCausalLM', adapter_model_type='LoraModel', peft_types={'yoda': 'LORA'}, trainable_params=0, total_params=331982848, num_adapter_layers=48, enabled=True, active_adapters=['yoda'], merged_adapters=['yoda'], requires_grad={'yoda': False}, available_adapters=['yoda'], devices={'yoda': ['cuda']})
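
The numbers in this status line can be cross-checked with a little arithmetic. The sketch below assumes that each LoRA layer adds `r * (in_features + out_features)` parameters, with rank 8 and 1024-wide `q_proj`/`v_proj` projections as in the model printed earlier. The function name is made up for illustration.

```python
def lora_param_count(num_layers, r, in_features, out_features):
    """Parameters added by LoRA: one (r x in) and one (out x r) matrix per layer."""
    return num_layers * r * (in_features + out_features)


num_adapter_layers = 48      # from the status line: q_proj and v_proj in 24 layers
rank = 8                     # out_features of lora_A in the printed model
hidden = 1024                # in/out features of q_proj and v_proj
total_params = 331_982_848   # from the status line

lora_params = lora_param_count(num_adapter_layers, rank, hidden, hidden)
base_params = total_params - lora_params

print(lora_params)            # 786432
print(base_params)            # 331196416
print(lora_params * 4 / 1e6)  # adapter size in MB at 4 bytes per float32 weight
```

So the adapter accounts for well under 1% of the total, which is why it is so much cheaper to store and share than the full model.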


In [None]:
model.unload()

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=409

### Querying the Model

In [None]:
def gen_prompt(tokenizer, sentence):
    converted_sample = [
        {"role": "user", "content": sentence},
    ]
    prompt = tokenizer.apply_chat_template(converted_sample,
                                           tokenize=False,
                                           add_generation_prompt=True)
    return prompt

In [None]:
prompt = gen_prompt(tokenizer, 'There is bacon in this sandwich.')
print(prompt)

<|im_start|>user
There is bacon in this sandwich.<|im_end|>
<|im_start|>assistant
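
The printed prompt follows the ChatML layout. As a rough illustration of what `apply_chat_template()` produces here, the sketch below rebuilds the same string by hand. It only mirrors the output shown above; the real template is whatever `add_template()` from `helper_functions.py` installed on the tokenizer, and `chatml_prompt` is a hypothetical helper.

```python
def chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} messages in ChatML style."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # The generation prompt opens the assistant's turn so the model completes it.
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [{"role": "user", "content": "There is bacon in this sandwich."}]
manual_prompt = chatml_prompt(messages)
print(manual_prompt)
```

The trailing `<|im_start|>assistant` line is what `add_generation_prompt=True` adds: it tells the model that the next tokens belong to its own reply.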



In [None]:
def generate(model, tokenizer, prompt,
             max_new_tokens=64,
             skip_special_tokens=False,
             response_only=False):
    # Tokenize the formatted prompt.
    tokenized_input = tokenizer(prompt,
                                add_special_tokens=False,
                                return_tensors="pt").to(model.device)

    model.eval()
    # If the model weights are in half precision, generate inside an autocast context.
    ctx = torch.autocast(device_type=model.device.type, dtype=model.dtype) \
        if model.dtype in [torch.float16, torch.bfloat16] else nullcontext()
    with ctx:
        generation_output = model.generate(**tokenized_input,
                                           eos_token_id=tokenizer.eos_token_id,
                                           max_new_tokens=max_new_tokens)

    # If requested, drop the tokens that belong to the prompt.
    if response_only:
        input_length = tokenized_input['input_ids'].shape[1]
        generation_output = generation_output[:, input_length:]

    # Decode the tokens back into text.
    output = tokenizer.batch_decode(generation_output,
                                    skip_special_tokens=skip_special_tokens)[0]
    return output

In [None]:
print(generate(model, tokenizer, prompt, skip_special_tokens=False, response_only=False))

<|im_start|>user
There is bacon in this sandwich.<|im_end|>
<|im_start|>assistant
In this sandwich, bacon there is.<|im_end|>


In [None]:
print(generate(model, tokenizer, prompt, skip_special_tokens=True, response_only=True))

In this sandwich, bacon there is.


In [None]:
sentences = ['There is bacon in this sandwich.', 'Add some cheddar to it.']

In [None]:
def batch_generate(model, tokenizer, sentences,
                   max_new_tokens=64,
                   skip_special_tokens=False,
                   response_only=False):
    # Accept a single sentence by wrapping it in a list
    # (otherwise the string would be iterated character by character).
    single_input = isinstance(sentences, str)
    if single_input:
        sentences = [sentences]

    # Convert each sentence into a conversation.
    converted_samples = [[{"role": "user", "content": sentence}]
                         for sentence in sentences]

    # Apply the chat template to format the prompts.
    prompts = tokenizer.apply_chat_template(converted_samples,
                                            tokenize=False,
                                            add_generation_prompt=True)

    # Use left padding for batched generation.
    tokenizer.padding_side = 'left'
    # Tokenize the formatted prompts, padding them to the same length.
    tokenized_inputs = tokenizer(prompts,
                                 padding=True,
                                 add_special_tokens=False,
                                 return_tensors='pt').to(model.device)

    model.eval()
    # If the model weights are in half precision, generate inside an autocast context.
    ctx = torch.autocast(device_type=model.device.type, dtype=model.dtype) \
        if model.dtype in [torch.float16, torch.bfloat16] else nullcontext()
    with ctx:
        generation_output = model.generate(**tokenized_inputs,
                                           eos_token_id=tokenizer.eos_token_id,
                                           pad_token_id=tokenizer.pad_token_id,
                                           max_new_tokens=max_new_tokens)

    # If requested, drop the tokens that belong to the prompts.
    if response_only:
        input_length = tokenized_inputs['input_ids'].shape[1]
        generation_output = generation_output[:, input_length:]

    # Decode the tokens back into text.
    output = tokenizer.batch_decode(generation_output,
                                    skip_special_tokens=skip_special_tokens)
    if single_input:
        output = output[0]
    return output

In [None]:
batch_generate(model, tokenizer, sentences, skip_special_tokens=True, response_only=True)

['In this sandwich, bacon there is.', 'To it, add some cheddar, you must.']
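
Why left padding? The toy sketch below uses made-up token IDs and a stand-in for `model.generate()` that simply appends new tokens to each row. It is not how a model actually generates; it only shows where the padding ends up relative to the prompt and the response.

```python
PAD = 0  # hypothetical padding token ID


def pad_batch(sequences, side="left", pad_id=PAD):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    if side == "left":
        return [[pad_id] * (width - len(seq)) + seq for seq in sequences]
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def fake_generate(batch, new_tokens):
    """Stand-in for model.generate(): append 'generated' tokens to each row."""
    return [row + extra for row, extra in zip(batch, new_tokens)]


prompts = [[11, 12, 13, 14], [21, 22]]   # two prompts of different lengths
generated = [[101, 102], [201, 202]]     # pretend continuations

left = fake_generate(pad_batch(prompts, "left"), generated)
right = fake_generate(pad_batch(prompts, "right"), generated)

print(left[1])   # padding first, then prompt and response back to back
print(right[1])  # padding sits between the prompt and the response

# With a shared input length, one slice removes every prompt in the batch.
input_length = max(len(p) for p in prompts)
responses = [row[input_length:] for row in left]
print(responses)
```

With left padding, the last prompt token sits directly before the first generated token in every row, so the model continues from real text rather than from padding.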

### Llama.cpp

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp.png?raw=True)

<center>Figure 6.1 - Screenshot of the llama.cpp GitHub repository</center>

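#### Converting Adapters

To convert an adapter to the GGUF format, follow these steps:

- Save the adapter to a local folder, either by calling `save_model()` after training, as we did in the previous chapter, or by downloading it from the Hugging Face Hub (see the sidebar for details).
- Clone the llama.cpp repository from GitHub.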

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 60558, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 60558 (delta 1), reused 2 (delta 1), pack-reused 60545 (from 1)
Receiving objects: 100% (60558/60558), 150.73 MiB | 14.97 MiB/s, done.
Resolving deltas: 100% (43981/43981), done.


- Install the `gguf-py` and `mistral-common` packages.

In [None]:
!pip install llama.cpp/gguf-py

Processing ./llama.cpp/gguf-py
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: gguf
  Building wheel for gguf (pyproject.toml) ... done
  Created wheel for gguf: filename=gguf-0.17.1-py3-none-any.whl size=104600 sha256=8944201e1ee9bbf870fa3a510789aeb3ada09ee2c91d1565a23a69a3358fc46d
  Stored in directory: /root/.cache/pip/wheels/f6/b7/15/8e6796fb0734c2c4fa1234732c782043ead15465c2e6f75560
Successfully built gguf
Installing collected packages: gguf
Successfully installed gguf-0.17.1


In [None]:
!pip install mistral-common

Collecting mistral-common
  Downloading mistral_common-1.8.4-py3-none-any.whl.metadata (5.1 kB)
Collecting pydantic-extra-types>=2.10.5 (from pydantic-extra-types[pycountry]>=2.10.5->mistral-common)
  Downloading pydantic_extra_types-2.10.5-py3-none-any.whl.metadata (3.9 kB)
Collecting pycountry>=23 (from pydantic-extra-types[pycountry]>=2.10.5->mistral-common)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading mistral_common-1.8.4-py3-none-any.whl (6.5 MB)
Downloading pydantic_extra_types-2.10.5-py3-none-any.whl (38 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Installing collected packages: pycountry, pydantic-extra-types, mistral-common
Successfully installed mistral-common-1.8.4 pycountry-24.6.1 pydant

***
**Downloading Models from the Hugging Face Hub**

In [None]:
from huggingface_hub import login
login()

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="dvgodoy/phi3-mini-yoda-adapter", local_dir='./phi3-mini-yoda-adapter')

'/content/phi3-mini-yoda-adapter'

***
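
Before running the conversion, it can help to confirm that the local folder really contains an adapter. The sketch below assumes that the folder should hold `adapter_config.json` and `adapter_model.safetensors`, the two files PEFT writes when it saves a LoRA adapter. `missing_adapter_files` is a hypothetical helper and not part of llama.cpp or PEFT.

```python
from pathlib import Path

# Assumption: the two files PEFT writes when a LoRA adapter is saved.
REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


missing = missing_adapter_files("./phi3-mini-yoda-adapter")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Adapter folder looks complete.")
```

A check like this fails fast with a readable message instead of letting the conversion script stop partway through.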

- Run the `convert_lora_to_gguf.py` script.

In [None]:
!python ./llama.cpp/convert_lora_to_gguf.py \
        ./phi3-mini-yoda-adapter \
        --outfile adapter.gguf \
        --outtype q8_0

INFO:lora-to-gguf:Loading base model from Hugging Face: microsoft/Phi-3-mini-4k-instruct
config.json: 100% 967/967 [00:00<00:00, 5.69MB/s]
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:lora-to-gguf:Exporting model...
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_a, torch.float32 --> F32, shape = {8192, 8}
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_b, torch.float32 --> F32, shape = {8, 3072}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_a, torch.float32 --> F32, shape = {3072, 8}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_b, torch.float32 --> F32, shape = {8, 16384}
INFO:hf-to-gguf:blk.0.attn_output.weight.lora_a, torch.float32 --> F32, shape = {3072, 8}
INFO:hf-to-gguf:blk.0.attn_output.weight.lora_b, torch.float32 --> F32, shape = {8, 3072}
INFO:hf-to-gguf:blk.0.attn_qkv.weight.lora_a, torch.float32 --> F32, shape = {3072, 8}
INFO:hf-to-gguf:blk.0.attn_qkv.weight.lora_b, torch.float32 --> F32, shape = {8, 9216}
INFO:hf-to-gguf:blk.1.ffn_down.weight.lora_a, torch.floa

- `outtype` must be one of `f32`, `f16`, `bf16`, `q8_0`, or `auto`. With `auto`, the script picks the highest-fidelity 16-bit floating-point type based on the first tensor it loads.
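
To give a feel for what `q8_0` means, here is a simplified sketch of 8-bit block quantization. It assumes the usual layout of this type: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. The real implementation stores the scale as a 16-bit float and is written in C; this pure-Python version keeps the scale as a regular float and uses Python's `round()` for simplicity.

```python
BLOCK_SIZE = 32  # number of weights that share one scale


def quantize_q8_0(values):
    """Quantize a flat list of floats into (scale, int8 values) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax else 0.0  # map the largest value to +/-127
        quants = [round(v / scale) if scale else 0 for v in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Recover approximate floats: each weight is scale * quantized value."""
    return [scale * q for scale, quants in blocks for q in quants]


weights = [(i - 16) / 10 for i in range(BLOCK_SIZE)]  # toy values from -1.6 to 1.5
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per block: 2 bytes for the scale + 32 bytes for the quantized values.
bits_per_weight = (2 + BLOCK_SIZE) * 8 / BLOCK_SIZE
print(bits_per_weight, max_error)
```

At 8.5 bits per weight, the file is roughly half the size of a 16-bit version, and the rounding error of each weight stays within half of its block's scale.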

#### Converting Full Models

##### Using "GGUF My Repo"

https://huggingface.co/spaces/ggml-org/gguf-my-repo

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/gguf_my_repo.png?raw=True)

<center>Figure 6.2 - The "GGUF My Repo" Space on Hugging Face</center>

##### Using Unsloth

In [None]:
!pip install unsloth protobuf==3.20.1
!pip install --no-deps xformers trl peft accelerate bitsandbytes

Collecting unsloth
  Downloading unsloth-2025.8.10-py3-none-any.whl.metadata (52 kB)
Collecting protobuf==3.20.1
  Downloading protobuf-3.20.1-py2.py3-none-any.whl.metadata (720 bytes)
Collecting unsloth_zoo>=2025.8.9 (from unsloth)
  Downloading unsloth_zoo-2025.8.9-py3-none-any.whl.metadata (9.5 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.31-py3-none-any.whl.metadata (11 kB)
Collecting datasets<4.0.0,>=3.4.1 (from unsloth)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting cut_cross_entropy (from unsloth_zoo>=2025.8.9->unsloth)
  Downloading cut_cross_entropy-25.1.1-py3-none-



In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained('dvgodoy/phi3-mini-yoda-adapter')

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.10: Fast Mistral patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.8.10 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


```
==((====))==  Unsloth 2024.10.0: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/50.4M [00:00<?, ?B/s]

Unsloth 2024.10.0 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
```

In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32009)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (k_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (v_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (defaul

```
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32009)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (k_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (v_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm((3072,), eps=1e-05)
            (post_attention_layernorm): MistralRMSNorm((3072,), eps=1e-05)
          )
        )
        (norm): MistralRMSNorm((3072,), eps=1e-05)
      )
      (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
    )
  )
)
```

In [None]:
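# This command can fail for several reasons, depending on your environment and the stability of the llama.cpp version installed during the run.

# Delete the cloned llama.cpp folder so that Unsloth can install its own copy.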
!rm -rf llama.cpp/

model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.38 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:02<00:00, 15.04it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving gguf_model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving gguf_model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at gguf_model into f16 GGUF format.
The output location will be /content/gguf_model/unsloth.F16.gguf
This might take 3 minutes...


Unsloth: Extending gguf_model/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


INFO:hf-to-gguf:Loading model: gguf_model
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:

 ```
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.02 out of 12.67 RAM for saving.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving gguf_model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving gguf_model/pytorch_model-00002-of-00002.bin...
Done.

Unsloth: Converting mistral model. Can use fast conversion = True.

==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...

Unsloth: Extending gguf_model/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).

Unsloth: [1] Converting model at gguf_model into f16 GGUF format.
The output location will be /content/gguf_model/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: gguf_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
...
INFO:hf-to-gguf:blk.31.attn_norm.weight,     torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,      torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:output_norm.weight,          torch.float16 --> F32, shape = {3072}
INFO:hf-to-gguf:output.weight,               torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 3072
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 32
INFO:hf-to-gguf:gguf: rope theta = 10000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 32000
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 32009
INFO:gguf.vocab:Setting add_bos_token to False
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/content/gguf_model/unsloth.F16.gguf: n_tensors = 291, total_size = 7.6G
Writing: 100%|██████████| 7.64G/7.64G [01:56<00:00, 65.5Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /content/gguf_model/unsloth.F16.gguf
Unsloth: Conversion completed! Output location: /content/gguf_model/unsloth.F16.gguf
Unsloth: [2] Converting GGUF 16bit into q4_k_m. This will take 20 minutes...
main: build = 3934 (3752217e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/gguf_model/unsloth.F16.gguf' to '/content/gguf_model/unsloth.Q4_K_M.gguf' as Q4_K_M using 4 threads
llama_model_loader: loaded meta data with 34 key-value pairs and 291 tensors from /content/gguf_model/unsloth.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 3 Mini 4k Instruct Bnb 4bit
llama_model_loader: - kv   3:                       general.organization str              = Unsloth
llama_model_loader: - kv   4:                           general.finetune str              = 4k-instruct-bnb-4bit
llama_model_loader: - kv   5:                           general.basename str              = phi-3
llama_model_loader: - kv   6:                         general.size_label str              = mini
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 4096
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 llama.attention.key_length u32              = 96
llama_model_loader: - kv  16:               llama.attention.value_length u32              = 96
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 32064
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv  20:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 32009
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 3072, 32064,     1,     1], type =    f16, converting to q4_K .. size =   187.88 MiB ->    52.84 MiB
[   2/ 291]                  blk.0.attn_q.weight - [ 3072,  3072,     1,     1], type =    f16, converting to q4_K .. size =    18.00 MiB ->     5.06 MiB
...
[ 290/ 291]                   output_norm.weight - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 291/ 291]                        output.weight - [ 3072, 32064,     1,     1], type =    f16, converting to q6_K .. size =   187.88 MiB ->    77.06 MiB
llama_model_quantize_internal: model size  =  7288.51 MB
llama_model_quantize_internal: quant size  =  2210.78 MB

main: quantize time = 426187.37 ms
main:    total time = 426187.37 ms
Unsloth: Conversion completed! Output location: /content/gguf_model/unsloth.Q4_K_M.gguf
```
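
As a quick sanity check on the log above, the two size figures reported by the quantizer can be turned into a compression ratio and an approximate bits-per-weight figure. This is only a back-of-the-envelope sketch. It assumes the F16 file stores roughly 16 bits per weight (a few norm tensors are kept in F32, so the result is approximate), and the function name `summarize_quantization` is hypothetical.

```python
def summarize_quantization(model_size_mb, quant_size_mb, source_bits=16):
    """Return (compression ratio, approximate bits per weight)."""
    ratio = model_size_mb / quant_size_mb
    # Assumption: every weight in the source file uses `source_bits` bits
    approx_bits = source_bits / ratio
    return ratio, approx_bits

# Values copied from the "model size" and "quant size" lines of the log
ratio, bits = summarize_quantization(7288.51, 2210.78)
print(f"compression: {ratio:.2f}x, about {bits:.2f} bits per weight")
```

The result is roughly 3.3x and a little under 5 bits per weight. A figure above 4 bits is plausible here because the log shows a mix of `q4_K` and `q6_K` tensors.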

##### Using the Docker Image

To convert the model, run the following command:

```
docker run --rm \
           -v "/path/to/saved_model":/repo \
           ghcr.io/ggerganov/llama.cpp:full \
           --convert "/repo" \
           --outtype f32 \
           --outfile /repo/gguf-model-f32.gguf
```

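1. `--rm`: automatically removes the container once it finishes running. This is useful for one-off scripts such as this one.
2. `-v [local path]:[path inside container]`: maps a folder on your computer to a folder inside the container, so the container can refer to the local folder as if it were one of its own.
3. `[docker image]`: uses llama.cpp's Docker image, `ghcr.io/ggerganov/llama.cpp:full`.
4. `--convert [path inside container]`: the command to run. It is not a Docker command; it is a command provided by this particular image.
5. `--outtype [GGUF type]`: an argument to the `--convert` command that sets the data type of the GGUF file.
6. `--outfile [GGUF filename]`: another argument to the `--convert` command. It sets the name of the GGUF file. We used the in-container path `/repo`, which is mapped to a folder on the local computer, so the file is created in that local folder.

To quantize the converted model, run the following command: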

```
docker run --rm \
           -v "/path/to/saved_model":/repo \
           ghcr.io/ggerganov/llama.cpp:full \
           --quantize "/repo/gguf-model-f32.gguf" \
           "/repo/gguf-model-Q4_K_M.gguf" \
           "Q4_K_M"
```

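7. `--quantize [GGUF filename]`: a new command to run, also specific to this particular image. It specifies which GGUF file to quantize (typically the `outfile` of the convert command).
8. `[quantized GGUF filename]`: the name of the quantized file produced once the script finishes. Point it at the mapped folder so the file is accessible from your local folder.
9. `[quantization type]`: see the [online documentation](https://github.com/ggerganov/llama.cpp/blob/main/examples/quantize/README.md) for the full list of quantization types.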
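
The two Docker invocations described in the list above can also be assembled programmatically, which helps when the same conversion is repeated for several folders. The following is a minimal sketch, not part of llama.cpp: the helper names `docker_convert_cmd` and `docker_quantize_cmd` are hypothetical, and the image name, flags, and file names are the ones used in the commands above.

```python
import shlex

IMAGE = "ghcr.io/ggerganov/llama.cpp:full"  # image named in the text

def docker_convert_cmd(local_dir, outtype="f32", outfile="gguf-model-f32.gguf"):
    """Build the argument list for the conversion step."""
    return [
        "docker", "run", "--rm",
        "-v", f"{local_dir}:/repo",      # map the local folder to /repo
        IMAGE,
        "--convert", "/repo",
        "--outtype", outtype,
        "--outfile", f"/repo/{outfile}",
    ]

def docker_quantize_cmd(local_dir, infile, outfile, qtype="Q4_K_M"):
    """Build the argument list for the quantization step."""
    return [
        "docker", "run", "--rm",
        "-v", f"{local_dir}:/repo",
        IMAGE,
        "--quantize", f"/repo/{infile}", f"/repo/{outfile}", qtype,
    ]

convert = docker_convert_cmd("/path/to/saved_model")
quantize = docker_quantize_cmd("/path/to/saved_model",
                               "gguf-model-f32.gguf",
                               "gguf-model-Q4_K_M.gguf")
print(shlex.join(convert))
print(shlex.join(quantize))
# To actually run a step: subprocess.run(convert, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the local path contains spaces.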

##### Building llama.cpp

```python
!git clone https://github.com/ggerganov/llama.cpp
!pip install llama.cpp/gguf-py
!pip install -r llama.cpp/requirements.txt
```

```python
!python ./llama.cpp/convert_hf_to_gguf.py /path/to/saved_model --outtype f16
```

```python
!cd llama.cpp && make clean && make
```

```python
!./llama.cpp/quantize \
    ./path/to/saved_model/ggml-model-f16.gguf \
    ./path/to/saved_model/ggml-model-q4_0.gguf \
    q4_0
```
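
After converting or quantizing, it can be reassuring to confirm that the output really is a GGUF file before handing it to a server. The sketch below reads only the fixed-size header. It assumes the little-endian layout used by recent GGUF versions (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts); the function name `read_gguf_header` is hypothetical, and the demonstration uses a synthetic header rather than a real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(24)  # 4-byte magic + uint32 + 2 x uint64
    if len(raw) < 24:
        raise ValueError("file is too small to be a GGUF file")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Demonstration on a synthetic 24-byte header (not a real model)
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(struct.pack("<4sIQQ", b"GGUF", 3, 291, 34))
print(read_gguf_header(tmp.name))
os.remove(tmp.name)
```

On a real file, the returned counts should agree with the "key-value pairs" and "tensors" figures that llama.cpp prints when it loads the model.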

### Serving the Model

#### Ollama

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/ollama.png?raw=True)
<center>Figure 6.3 - Screenshot of Ollama’s page</center>

```
ollama run phi3:mini
```

##### Installing Ollama

In [None]:
!curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.9.6 sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


```
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
```

##### Running Ollama in Colab

In [None]:
# Adapted from https://stackoverflow.com/questions/77697302/how-to-run-ollama-in-google-colab

import os
import asyncio
import threading

# Note: depending on the backend you are running, you may need these settings to enable CUDA.
# Add the CUDA binaries to PATH.
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Point LD_LIBRARY_PATH at /usr/lib64-nvidia and the CUDA lib directory.
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # Print every line of a stream as it arrives.
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # Forward stdout and stderr concurrently until the process closes them.
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

In [None]:
# Create an event loop that will run in a new thread.
new_loop = asyncio.new_event_loop()

# Start ollama serve in a separate thread so the cell does not block execution.
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()

>>> starting ollama serve
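
Because the server starts in a background thread, the next cell may run before Ollama is ready to accept requests. One way to guard against that is to poll until the API answers. The sketch below is an assumption-laden helper rather than part of Ollama: `wait_until_ready` and `ollama_probe` are hypothetical names, the address `127.0.0.1:11434` is taken from the install log, and the probe assumes that a plain GET on the root URL returns status 200 once the server is up.

```python
import time

def wait_until_ready(probe, timeout=30.0, interval=0.5):
    """Call `probe()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False

def ollama_probe(base_url="http://127.0.0.1:11434"):
    """Return True if the Ollama API answers at `base_url`."""
    import requests  # imported lazily so the helper above has no dependencies
    return requests.get(base_url, timeout=2).status_code == 200

# Usage (assumes `ollama serve` was started as above):
# wait_until_ready(ollama_probe)
```

Keeping the probe separate from the polling loop means the same loop could be reused for another server by passing a different probe.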


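##### Model Files

| Instruction | Description |
|---|---|
| FROM (required) | Defines the base model to use. |
| PARAMETER | Sets parameters that control how Ollama runs the model. |
| TEMPLATE | The full prompt template sent to the model. |
| SYSTEM | Specifies the system message to include in the template. |
| ADAPTER | Defines the (Q)LoRA adapter to apply to the model. |
| LICENSE | Specifies the legal license. |
| MESSAGE | Specifies the message history. |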
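
To make the role of each instruction concrete, the sketch below assembles Modelfile text from a few Python arguments. It is an illustration only: `build_modelfile` is a hypothetical helper, it covers just a subset of the instructions in the table, and it wraps multi-line values in triple quotes, which is an assumption about how you want the template quoted.

```python
def build_modelfile(base, template=None, stops=(), adapter=None, system=None):
    """Assemble Modelfile text from a subset of the instructions in the table."""
    lines = [f"FROM {base}"]                  # FROM is the only required one
    if adapter:
        lines.append(f"ADAPTER {adapter}")
    if template:
        lines.append(f'TEMPLATE """{template}"""')
    if system:
        lines.append(f'SYSTEM """{system}"""')
    for stop in stops:
        lines.append(f"PARAMETER stop {stop}")
    return "\n".join(lines) + "\n"

text = build_modelfile("phi3:mini", stops=["<|end|>", "<|user|>"])
print(text)
```

The returned string can be written to a file and passed to `ollama create` with the `-f` option.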

```
ollama show --modelfile phi3:mini
```


```
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM phi3:mini

FROM /usr/share/ollama/.ollama/models/blobs/sha256-633fc...
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
LICENSE """Microsoft.
Copyright (c) Microsoft Corporation.
...
```

In [None]:
from transformers import AutoTokenizer

tokenizer_phi3 = AutoTokenizer.from_pretrained('microsoft/phi-3-mini-4k-instruct')
print(tokenizer_phi3.chat_template)

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
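
The Jinja template printed above is dense, so it may help to see the same logic written out as ordinary Python. The following sketch mirrors the template branch by branch. `format_phi3_chat` is a hypothetical name, and the default `eos_token` value is an assumption that should be checked against `tokenizer_phi3.eos_token`.

```python
def format_phi3_chat(messages, add_generation_prompt=True,
                     eos_token="<|endoftext|>"):
    """Plain-Python equivalent of the chat template printed above."""
    out = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; others are skipped
        if role in ("system", "user", "assistant"):
            out.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # The template ends with the assistant tag or with the EOS token
    out.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(out)

print(format_phi3_chat([{"role": "user", "content": "Hello there"}]))
```

The Ollama `TEMPLATE` shown earlier lays out the same tags in the same order, using Go template syntax and the `.System`, `.Prompt`, and `.Response` variables in place of a message list.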


##### Importing Models

###### Custom (Full) Model File

```python
modelfile = """
FROM ./phi3-full-model
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
"""

with open('phi3-full-modelfile', 'w') as f:
    f.write(modelfile)
```

```
!ollama create our_own_phi3 -f phi3-full-modelfile
```

```
!ollama list
```

###### Custom Adapter

In [None]:
adapterfile = """
FROM phi3:mini
ADAPTER ./adapter.gguf
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
"""

with open('phi3-adapter-file', 'w') as f:
    f.write(adapterfile)

In [None]:
!ollama create our_own_phi3_adapted -f phi3-adapter-file

[GIN] 2025/09/02 - 03:25:17 | 200 |     123.538µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/02 - 03:25:18 | 201 |  420.460532ms |       127.0.0.1 | POST     "/api/blobs/sha256:1e0d1652754702c76c0dce6c1f9ad17ca57783c5587ba428116f7666175e5405"
time=2025-09-02T03:25:19.691Z level=INFO source=download.go:177 msg

In [None]:
!ollama list

[GIN] 2025/09/02 - 03:25:58 | 200 |      37.413µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/02 - 03:25:58 | 200 |     826.923µs |       127.0.0.1 | GET      "/api/tags"
NAME                           ID              SIZE      MODIFIED               
our_own_phi3_adapted:latest    4aa0c981ee99    2.2 GB    Less than a second ago    
phi3:mini                      4f2222927938    2.2 GB    Less than a second ago    


##### Querying the Model

In [None]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.5.3-py3-none-any.whl.metadata (4.3 kB)
Downloading ollama-0.5.3-py3-none-any.whl (13 kB)
Installing collected packages: ollama
Successfully installed ollama-0.5.3


In [None]:
import ollama

prompt = "The Force is strong in this one!"
response = ollama.generate(model='our_own_phi3_adapted',
                           prompt=prompt)
print(response)

time=2025-09-02T03:26:05.882Z level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-633fc5be925f9a484b61d6f9b9a78021eeb462100bd557309f01ba84cac26adf gpu=GPU-8caa9397-9fac-1744-bf08-28392bfdf945 parallel=2 available=13180731392 required="6.1 GiB"
time=2025-09-02T03:26:06.146Z level=INFO source=server.go:135 msg="system memory" total="12.7 GiB" free="6.7 GiB" free_swap="0 B"
time=2025-09-02T03:26:06.147Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[12.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="2.0 GiB" memory.weights.repeating="1.9 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
llama_model_loader: loaded

In [None]:
print(response['response'])

In this one, the Force is strong. Yes, hrrrm.


In [None]:
messages = [{'role': 'user', 'content': prompt}]
formatted = tokenizer_phi3.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

response = ollama.generate(model='our_own_phi3_adapted',
                           prompt=formatted,
                           raw=True)
print(response)

<|user|>
The Force is strong in this one!<|end|>
<|assistant|>

[GIN] 2025/09/02 - 03:26:15 | 200 |   394.70269ms |       127.0.0.1 | POST     "/api/generate"
model='our_own_phi3_adapted' created_at='2025-09-02T03:26:15.668060038Z' done=True done_reason='stop' total_duration=394641382 load_duration=78638877 prompt_eval_count=17 prompt_eval_duration=14965866 eval_count=16 eval_duration=300399596 response='In this one, the Force is strong. Yes, hrrrm.' thinking=None context=None
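
The `[GIN]` log line above shows that the Python client ends up sending a POST request to Ollama's `/api/generate` endpoint, so the same query can be made from any HTTP client. The sketch below only builds and prints the request body. `build_generate_payload` is a hypothetical helper, the address comes from the install log, and the commented-out request assumes the server is running and that `stream` is disabled so a single JSON object is returned.

```python
import json

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # address from the install log

def build_generate_payload(model, prompt, raw=False, stream=False):
    """Build the JSON body for a single /api/generate request."""
    return {"model": model, "prompt": prompt, "raw": raw, "stream": stream}

payload = build_generate_payload("our_own_phi3_adapted",
                                 "The Force is strong in this one!")
print(json.dumps(payload, indent=2))

# With the server running, the request could be sent like this:
# import requests
# reply = requests.post(OLLAMA_URL, json=payload, timeout=120)
# print(reply.json()["response"])
```

Setting `raw=True` corresponds to the second call above, where the prompt was already formatted with the tokenizer's chat template.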


#### Llama.cpp

A full Docker image is available that covers conversion, quantization, and serving.

```
docker run -v "/path/to/saved_model":/model  \
           -p 8080:8000 \
           ghcr.io/ggerganov/llama.cpp:full \
           --server \
           -m /model/gguf-model-Q4_K_M.gguf \
           --port 8000 \
           --host 0.0.0.0
```

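1. `-v [local path]:[path inside container]`: maps a folder on your computer to a folder inside the container, so the container can access the local folder as if it were inside the container.
2. `-p [host port]:[container port]`: forwards requests sent to the host port (for example, 8080) to the port inside the container (for example, 8000).
3. `[docker image]`: the llama.cpp Docker image to use, here `ghcr.io/ggerganov/llama.cpp:full`.
4. `--server`: the command to run. It is not a Docker command; it is provided by this particular image.
5. `-m /model/[quantized_gguf_file].gguf`: the model to serve.
6. `--port [container port]`: the port inside the container used to serve the model. It must match the container port given in the second argument.
7. `--host [ip address]`: the local IP address on which the model is served.

If you are not interested in using Docker for conversion or quantization, you can pick a smaller Docker image built specifically for serving.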

```
docker run -v "/path/to/saved_model":/model \
           -p 8080:8000 \
           ghcr.io/ggerganov/llama.cpp:server \
           -m /model/gguf-model-Q4_K_M.gguf \
           --port 8000 \
           --host 0.0.0.0
```

##### Web Interface

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp_ui.png?raw=True)
<center>Figure 6.4 - llama.cpp web UI</center>

The settings button in the top-right corner exposes a range of parameters, including the temperature.

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch6/llama_cpp_settings.png?raw=True)
<center>Figure 6.5 - llama.cpp settings</center>

##### REST API

```python
url = 'http://0.0.0.0:8080/completion'
headers = {'Content-Type': 'application/json'}

data = {'prompt': 'There is bacon in this sandwich.',
        'n_predict': 128}

response = requests.post(url, json=data, headers=headers)
```

```python
print(response.json()['content'])
```

```
 There is no bacon in this sandwich. This statement is a paradox because it contradicts itself, yet it seems to suggest that the sandwich has both bacon and no bacon at the same time.

2. This statement is also a paradox, as it claims that it is a lie that it is lying. If the statement is true, then it is indeed a lie, making it false. But if it is false, then it is not a lie, making it true. This creates a circular reasoning that can't be resolved.

3. This statement is a paradox
```
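
The completion above reads like free-form text continuation. One possible reason, offered here as an assumption rather than a confirmed diagnosis, is that the prompt was sent to `/completion` as plain text, without the chat tags the model was fine-tuned on. The sketch below wraps the user text in the Phi-3 tags before building the request body. `build_completion_request` is a hypothetical helper, and the `temperature` and `stop` values are illustrative choices.

```python
def build_completion_request(user_text, n_predict=128, temperature=0.8):
    """Wrap `user_text` in Phi-3 chat tags for the /completion endpoint."""
    prompt = f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": temperature,
        "stop": ["<|end|>"],   # stop at the end-of-turn tag
    }

data = build_completion_request("There is bacon in this sandwich.")
print(data["prompt"])

# With the llama.cpp server from this section running:
# import requests
# response = requests.post("http://localhost:8080/completion", json=data)
# print(response.json()["content"])
```

The `temperature` field corresponds to the setting exposed in the web interface, so the same value can be tried in both places.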

### Thank You!

If you have any suggestions or spot an error, do not hesitate to reach out on [GitHub](https://github.com/dvgodoy), [X](https://x.com/dvgodoy), [BlueSky](https://bsky.app/profile/dvgodoy.bsky.social), or [LinkedIn](https://www.linkedin.com/in/dvgodoy/).

If you would like to be notified about new book releases, updates, and discounts, follow me on Gumroad.

<center><a href="https://danielgodoy.gumroad.com/subscribe">https://danielgodoy.gumroad.com/subscribe</a></center>

I look forward to hearing from you!