Based on the Unsloth Gemma 3 finetuning notebook.

### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [69]:
from unsloth import FastModel
import torch
max_seq_length = 4096

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

==((====))==  Unsloth 2025.12.9: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.559 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


We now add LoRA adapters so we only need to update a small number of parameters!

In [70]:
model = FastModel.get_peft_model(
    model,
    
    # Lora Rank r - Controls the number of trainable parameters in the LoRA adapter matrices. A higher rank increases model capacity but also memory usage. 
    # Usually 16 or 32, but 270M is tiny so we can do 128.
    r = 128,
    
    # Specify which parts of the model you want to apply LoRA adapters to — either the attention, the MLP, or both.
    # Attention: q_proj, k_proj, v_proj, o_proj
    # MLP: gate_proj, up_proj, down_proj
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],

    # Scales the strength of the fine-tuned adjustments in relation to the rank (r).
    # Usually r, or r * 2
    lora_alpha = 128,

    # Randomly sets a fraction of LoRA activations to zero during training to prevent overfitting
    # Supports any, but = 0 is optimized
    lora_dropout = 0, 

    # Supports any, but = "none" is optimized
    bias = "none",    
    
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    # True or "unsloth" for very long context
    use_gradient_checkpointing = "unsloth", 
    
    # Seed
    random_state = 3407,

    # Rank-stabilized LoRA: scales the adapters by lora_alpha / sqrt(r) instead of lora_alpha / r, which helps avoid training instability or performance degradation at higher ranks
    use_rslora = False,  

    # Initializes LoRA matrices with the top 'r' singular vectors from the pretrained weights. This can improve accuracy but may cause a significant memory spike at the start of training.
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
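To make the effect of `r` concrete, here is a rough sketch of how the number of trainable LoRA parameters follows from the rank. Each adapted linear layer gets two matrices of shape `(r, in_features)` and `(out_features, r)`, so it adds `r * (in_features + out_features)` parameters. The layer dimensions below are assumptions about the Gemma 3 270M architecture and are not stated in this notebook; with them, the count comes out to the 30,375,936 trainable parameters reported in the training log further down.

```python
# Assumed Gemma 3 270M dimensions (not stated in this notebook).
HIDDEN = 640
INTERMEDIATE = 2048
NUM_LAYERS = 18
Q_OUT = 4 * 256   # assumed: 4 query heads x head_dim 256
KV_OUT = 1 * 256  # assumed: 1 key/value head x head_dim 256

# (in_features, out_features) of every linear layer listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: rank * (in + out) per adapted layer, per block."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers

for rank in (16, 32, 128):
    print(f"r = {rank:>3}: {lora_params(rank):,} trainable parameters")
```

The count grows linearly with `r`, which is why a rank as high as 128 is affordable only because the base model is so small.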


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes. Gemma-3 renders multi turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
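As an illustration of that layout, the sketch below renders a message list by hand. It is a simplified approximation, not the template Unsloth or the tokenizer actually applies: the function name `render_gemma3` is hypothetical, and folding the system prompt into the first user turn is an assumption based on the rendered example shown later in this notebook.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Hand-rolled approximation of the Gemma-3 turn layout (illustrative only)."""
    turns = list(messages)
    system = ""
    # Assumption: there is no separate system turn, so the system prompt
    # is prepended to the first remaining turn.
    if turns and turns[0]["role"] == "system":
        system = turns[0]["content"] + "\n\n"
        turns = turns[1:]

    out = "<bos>"
    for i, msg in enumerate(turns):
        # The template calls the assistant role "model".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        prefix = system if i == 0 else ""
        out += f"<start_of_turn>{role}\n{prefix}{msg['content']}<end_of_turn>\n"

    if add_generation_prompt:
        # Leave an open model turn for the model to complete.
        out += "<start_of_turn>model\n"
    return out

print(render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

For real training and inference, rely on `tokenizer.apply_chat_template`, which handles edge cases this sketch ignores.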

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [71]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma3",
)

In [72]:
import json

def load_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

raw_data = load_jsonl("./finetuning_dataset_v2.jsonl")
len(raw_data)

6196

In [73]:
from datasets import Dataset

dataset = Dataset.from_list(raw_data)
dataset

Dataset({
    features: ['messages'],
    num_rows: 6196
})
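The conversion step that follows indexes each row's `messages` by position (0, 1, 2), so a quick structural check before mapping can save a confusing failure later. The sketch below is one way to do that; the helper name `find_bad_rows` is hypothetical, and the expected role order (system, user, assistant) is an assumption about this dataset rather than something the notebook output confirms.

```python
# Assumed role order for every row; adjust if the dataset differs.
EXPECTED_ROLES = ["system", "user", "assistant"]

def find_bad_rows(rows):
    """Return the indices of rows whose messages don't match EXPECTED_ROLES."""
    bad = []
    for i, row in enumerate(rows):
        messages = row.get("messages")
        if not isinstance(messages, list):
            bad.append(i)
            continue
        roles = [m.get("role") for m in messages]
        if roles != EXPECTED_ROLES:
            bad.append(i)
    return bad

# Toy rows standing in for raw_data.
toy_rows = [
    {"messages": [
        {"role": "system", "content": "s"},
        {"role": "user", "content": "u"},
        {"role": "assistant", "content": "a"},
    ]},
    {"messages": [{"role": "user", "content": "u"}]},  # missing turns
]
print(find_bad_rows(toy_rows))
```

Running `find_bad_rows(raw_data)` on the real data should return an empty list if every row has the assumed shape.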

In [20]:
dataset[5]

{'messages': [{'content': 'You are a recipe extraction assistant. Your task is to analyze the provided text and extract recipe information in LD-JSON format following the schema.org Recipe specification.\n\nGiven a piece of text that describes a cooking recipe (possibly including a long blog post, personal stories, comments, etc.), extract a single Recipe object in JSON-LD format.\n\nReturn ONLY valid JSON in the following format:\n{\n  "@context": "https://schema.org",\n  "@type": "Recipe",\n  "name": "Recipe Name",\n  "description": "Recipe description",\n  "recipeIngredient": ["quantity of ingredient 1", "quantity of ingredient 2", ...],\n  "recipeInstructions": [\n    {\n      "@type": "HowToStep",\n      "text": "Step 1 instruction"\n    }\n  ],\n  "prepTime": "PT15M",\n  "cookTime": "PT30M",\n  "totalTime": "PT45M",\n  "recipeYield": "4 servings",\n  "recipeCategory": "Main Course",\n  "recipeCuisine": "Italian",\n  "keywords": "pasta, dinner"\n}\n\nInclude as many fields as you 

We now use `convert_to_chatml` to convert each row into the `conversations` format used for finetuning, stripping any `<think>` reasoning traces from the assistant reply!

In [74]:
import regex as re

def convert_to_chatml(example):
    ct = example.get("messages")[2]['content']

    # Remove reasoning traces, since Gemma 3 270M is non-reasoning
    s = re.sub(r"<think>(.|\n)*<\/think>\n\n", '', ct)
    
    return {
        "conversations": [
            {"role": "system", "content": example["messages"][0]["content"]},
            {"role": "user", "content": example["messages"][1]["content"]},
            {"role": "assistant", "content": s}
        ]
    }

dataset = dataset.map(
    convert_to_chatml
)

Map:   0%|          | 0/6196 [00:00<?, ? examples/s]
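To see what the reasoning-trace removal does in isolation, here is a small standalone sketch using the standard-library `re` module. It is a variant of the pattern used in `convert_to_chatml`, not a copy: it matches non-greedily with `re.DOTALL` and swallows any trailing whitespace, which are choices made for this example. The sample string is made up.

```python
import re

# Non-greedy match so that each <think>...</think> block is removed on its own.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(text):
    """Remove <think>...</think> blocks and the whitespace that follows them."""
    return THINK_BLOCK.sub("", text)

sample = '<think>\nThe title is at the top of the page.\n</think>\n\n{"name": "Tarte tatin"}'
print(strip_reasoning(sample))
```

Text without a `<think>` block passes through unchanged, so the function is safe to apply to every row.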

Let's see what row 5 looks like!

In [22]:
dataset[5].get('conversations')

[{'content': 'You are a recipe extraction assistant. Your task is to analyze the provided text and extract recipe information in LD-JSON format following the schema.org Recipe specification.\n\nGiven a piece of text that describes a cooking recipe (possibly including a long blog post, personal stories, comments, etc.), extract a single Recipe object in JSON-LD format.\n\nReturn ONLY valid JSON in the following format:\n{\n  "@context": "https://schema.org",\n  "@type": "Recipe",\n  "name": "Recipe Name",\n  "description": "Recipe description",\n  "recipeIngredient": ["quantity of ingredient 1", "quantity of ingredient 2", ...],\n  "recipeInstructions": [\n    {\n      "@type": "HowToStep",\n      "text": "Step 1 instruction"\n    }\n  ],\n  "prepTime": "PT15M",\n  "cookTime": "PT30M",\n  "totalTime": "PT45M",\n  "recipeYield": "4 servings",\n  "recipeCategory": "Main Course",\n  "recipeCuisine": "Italian",\n  "keywords": "pasta, dinner"\n}\n\nInclude as many fields as you can extract f

We now have to apply the chat template for `Gemma3` onto the conversations, and save it to `text`.

In [75]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/6196 [00:00<?, ? examples/s]

Let's see how the chat template did!

In [25]:
dataset[5]['text']

'<start_of_turn>user\nYou are a recipe extraction assistant. Your task is to analyze the provided text and extract recipe information in LD-JSON format following the schema.org Recipe specification.\n\nGiven a piece of text that describes a cooking recipe (possibly including a long blog post, personal stories, comments, etc.), extract a single Recipe object in JSON-LD format.\n\nReturn ONLY valid JSON in the following format:\n{\n  "@context": "https://schema.org",\n  "@type": "Recipe",\n  "name": "Recipe Name",\n  "description": "Recipe description",\n  "recipeIngredient": ["quantity of ingredient 1", "quantity of ingredient 2", ...],\n  "recipeInstructions": [\n    {\n      "@type": "HowToStep",\n      "text": "Step 1 instruction"\n    }\n  ],\n  "prepTime": "PT15M",\n  "cookTime": "PT30M",\n  "totalTime": "PT45M",\n  "recipeYield": "4 servings",\n  "recipeCategory": "Main Course",\n  "recipeCuisine": "Italian",\n  "keywords": "pasta, dinner"\n}\n\nInclude as many fields as you can e

<a name="Train"></a>
### Train the model
Now let's train our model. We do 50 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.

Sometimes previous training runs aren't emptied correctly. Sometimes clearing memory manually helps, other times it doesn't. Python sucks.

In [56]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [76]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",

        # The number of samples processed in a single forward/backward pass on one GPU.
        # Primary Driver of VRAM Usage. 
        per_device_train_batch_size = 4,

        # The number of micro-batches to process before performing a single model weight update.
        # Primary Driver of Training Time.
        gradient_accumulation_steps = 4,
        
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 50,

        # Controls how much a model changes its internal parameters (weights) in response to the estimated error each time the model weights are updated.
        learning_rate = 2e-5, # Already the lower rate suggested for long training runs

        # Number of update steps between two logs
        logging_steps = 1,

        # An optimizer is an algorithm that adjusts a model's internal parameters (weights and biases) to minimize the loss function.
        # The loss measures prediction errors; the optimizer uses its gradients to help the model learn from data and improve accuracy.
        optim = "adamw_8bit",

        # A regularization term that penalizes large weights to prevent overfitting and improve generalization.
        weight_decay = 0.001,

        # A learning rate schedule is a predefined framework that adjusts the learning rate between epochs or iterations as the training progresses.
        lr_scheduler_type = "linear",
        
        seed = 3407,
        
        output_dir="outputs",

        # Use TrackIO/WandB etc
        report_to = "none", 
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=20):   0%|          | 0/6196 [00:00<?, ? examples/s]
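A back-of-envelope sketch of what these settings imply may help when choosing `max_steps`. The helpers below are hypothetical and only mirror the arithmetic: the effective batch size is the per-device batch times the accumulation steps times the number of GPUs, and the learning rate is modelled as a linear warmup followed by a linear decay to zero, which is how a "linear" schedule is commonly implemented. The trainer's own rounding may differ by a step.

```python
import math

def training_coverage(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, approx. steps per epoch, fraction of data seen)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    fraction_seen = effective_batch * max_steps / num_examples
    return effective_batch, steps_per_epoch, fraction_seen

def linear_lr(step, base_lr=2e-5, warmup_steps=5, total_steps=50):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

batch, steps_per_epoch, fraction = training_coverage(6196, 4, 4, max_steps=50)
print(f"effective batch = {batch}, steps per epoch ~ {steps_per_epoch}")
print(f"50 steps cover about {fraction:.1%} of the dataset")
print([f"{linear_lr(s):.1e}" for s in (0, 5, 25, 50)])
```

With the values used here, 50 steps see only a fraction of the 6,196 examples, so a full epoch needs several hundred optimizer steps.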

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [77]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=20):   0%|          | 0/6196 [00:00<?, ? examples/s]
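Conceptually, response-only training keeps the input tokens as they are and sets the label of every token outside a model turn to `-100`, the index that the cross-entropy loss ignores. The toy sketch below shows that idea on plain lists of token ids. It is an assumption-laden illustration, not Unsloth's implementation: the function name and the small integer ids standing in for the tokenized markers are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(response_ids)] == response_ids:
            in_response = True
            i += len(response_ids)  # the marker itself stays masked
            continue
        if input_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False
            i += len(instruction_ids)
            continue
        if in_response:
            labels[i] = input_ids[i]
        i += 1
    return labels

# Toy ids: [1, 2] plays "<start_of_turn>user\n", [1, 3] plays "<start_of_turn>model\n".
ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
print(mask_non_responses(ids, instruction_ids=[1, 2], response_ids=[1, 3]))
```

Only the tokens of the model's answer keep their labels, so the prompt contributes nothing to the loss.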

Let's verify that the instruction part is masked! Let's print row 5 again.

In [37]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<bos><start_of_turn>user\nYou are a recipe extraction assistant. Your task is to analyze the provided text and extract recipe information in LD-JSON format following the schema.org Recipe specification.\n\nGiven a piece of text that describes a cooking recipe (possibly including a long blog post, personal stories, comments, etc.), extract a single Recipe object in JSON-LD format.\n\nReturn ONLY valid JSON in the following format:\n{\n  "@context": "https://schema.org",\n  "@type": "Recipe",\n  "name": "Recipe Name",\n  "description": "Recipe description",\n  "recipeIngredient": ["quantity of ingredient 1", "quantity of ingredient 2", ...],\n  "recipeInstructions": [\n    {\n      "@type": "HowToStep",\n      "text": "Step 1 instruction"\n    }\n  ],\n  "prepTime": "PT15M",\n  "cookTime": "PT30M",\n  "totalTime": "PT45M",\n  "recipeYield": "4 servings",\n  "recipeCategory": "Main Course",\n  "recipeCuisine": "Italian",\n  "keywords": "pasta, dinner"\n}\n\nInclude as many fields as you 

Now let's print the masked out example - you should see only the answer is present:

In [78]:
# Label -100 marks tokens that are excluded from the loss: everything up to and including '<start_of_turn>model\n'.
# We swap those labels for pad tokens, then spaces, so we're left with just the answer

tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[5]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         {\n  "@context": "https://schema.org",\n  "@type": "Recipe",\n  "name": "2-Ingredient Pineapple Angel Food Cake",\n  "description": "This light, fluffy dessert is amazingly simple and naturally fruity, made with just cake mix and crushed pineapple. It’s a great last-minute cake that comes together fast and still feels special, espec

In [79]:
space = tokenizer(" ", add_special_tokens=False).input_ids[0]

def show_supervised(idx=0, n=800):
    row = trainer.train_dataset[idx]
    labels = row["labels"]
    kept = sum(x != -100 for x in labels)
    masked = sum(x == -100 for x in labels)
    print("kept:", kept, "masked:", masked, "kept_ratio:", kept/(kept+masked))

    # Visualize: masked tokens -> spaces, unmasked -> original tokens
    vis = [space if x == -100 else x for x in labels]
    print(tokenizer.decode(vis)[:n])

show_supervised(2)
show_supervised(3)

kept: 1284 masked: 968 kept_ratio: 0.5701598579040853
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
kept: 506 masked: 1316 kept_ratio: 0.27771679473106475
                                                                                          

In [42]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090. Max memory = 23.559 GB.
0.635 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`

In [None]:
# Rule of thumb: 1B parameters in 4-bit take roughly 500 MB for the weights alone (no context)
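The rule of thumb in that note (a 1B-parameter model in 4-bit needs about 500 MB before any context) is just parameters times bits per parameter divided by eight. A minimal sketch of the arithmetic follows; the function name is hypothetical, and the figure covers weights only, not activations, optimizer state, or the KV cache. The base-model size used in the second call is the total parameter count from the training log minus the LoRA parameters.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# 1B parameters stored in 4-bit.
print(weight_memory_gb(1_000_000_000, 4))

# Base model of this notebook in 16-bit: 298,474,112 total - 30,375,936 LoRA.
base_params = 298_474_112 - 30_375_936
print(round(weight_memory_gb(base_params, 16), 3))
```

This is why a 270M model comfortably loads in 16-bit here: the weights are small next to the memory that training itself needs.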

In [80]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6,196 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 30,375,936 of 298,474,112 (10.18% trained)


Step,Training Loss
1,0.6292
2,0.6458
3,0.4656
4,0.5733
5,0.3781
6,0.5111
7,0.2806
8,0.2844
9,0.3291
10,0.3115


In [44]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

172.27 seconds used for training.
2.87 minutes used for training.
Peak reserved memory = 14.807 GB.
Peak reserved memory for training = 14.172 GB.
Peak reserved memory % of max memory = 62.851 %.
Peak reserved memory for training % of max memory = 60.155 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [81]:
SYSTEM_PROMPT = """You are a recipe extraction assistant. Your task is to analyze the provided text and extract recipe information in LD-JSON format following the schema.org Recipe specification.

Given a piece of text that describes a cooking recipe (possibly including a long blog post, personal stories, comments, etc.), extract a single Recipe object in JSON-LD format.

Return ONLY valid JSON in the following format:
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Recipe Name",
  "description": "Recipe description",
  "recipeIngredient": ["quantity of ingredient 1", "quantity of ingredient 2", ...],
  "recipeInstructions": [
    {
      "@type": "HowToStep",
      "text": "Step 1 instruction"
    }
  ],
  "prepTime": "PT15M",
  "cookTime": "PT30M",
  "totalTime": "PT45M",
  "recipeYield": "4 servings",
  "recipeCategory": "Main Course",
  "recipeCuisine": "Italian",
  "keywords": "pasta, dinner"
}

Include as many fields as you can extract from the text. If you cannot find recipe information, return an empty object.

Do not include any explanatory text, only the JSON."""

In [82]:
user_content = """Extract the recipe data from this blog post:

Skip to main content
Beetroot & red onion tarte tatin
• Elena Silcock
Save recipe
Serves 4 - 6
Easy
Prep: 10 mins
Cook: 1 hr and 20 mins
A star rating of 4.5 out of 5. 78 ratings Rate
loading...Try this vegan tart for a show-stopping centrepiece. The bold red of beetroot against the vibrant green salad also makes it ideal for a meat-free celebration.
Gluten-free
Vegan
Vegetarian
Print
Ad
Skip to ingredientsAlternatives
Complete the dish
Showing items 1 to 3 of 3

Ingredients

Nutrition
• 400g beetroot cut into wedges
• 1 red onion cut into wedges
• 3 tbsp olive oil
• 2 tbsp rice wine vinegar
• 2 tbsp soft brown sugar
• 2 star anise
• flour for rolling
• 500g block puff pastry (we used vegan Jus-Rol)
• 1 orange zested
• peppery green salad to serve
Keep the screen awake with cook mode on the Good Food app.
Nutrition: Per serving
• kcal 444
• fat 27 g
• saturates 11 g
• carbs 40 g
• sugars 14 g
• fibre 5 g
• protein 6 g
• salt 0.9 g
Ad

Method
• step 1 Heat oven to 200C/180C fan/gas 6. In a bowl, toss the beetroot and onion in 2 tbsp of the oil, the vinegar and sugar. Add the star anise and season well. Heat the rest of the oil in a large, ovenproof non-stick frying pan , then nestle in the veg so that they cover the surface of the pan. Cover with foil and cook in the oven for 45 mins.
• step 2 On a well-floured surface, roll the pastry to a thickness of 0.5cm and cut out a circle the same size as your frying pan. Carefully take the pan out of the oven, remove the foil and wiggle the beets and onion around in the pan to make a compact layer. Put the pastry on top, tucking it in all around the edges, then return the pan to the oven and bake for 35 mins or until the pastry has puffed up and is a deep golden brown.
• step 3 Slide a palate knife around the edge of the tart, then put a plate on top of the pastry, serving side down. Flip the pan over to turn the tart out onto the plate – be careful not to burn yourself with the handle. Top with the orange zest and a sprinkle of sea salt, then serve with a peppery salad on the side.
"""

In [83]:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role" : "user", "content" : user_content}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
).removeprefix('<bos>')

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 2000,
    temperature = 1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Gluten-Free Vegan Tart with Beetroot and Red Onion",
  "description": "A vegan tart featuring beetroot and red onions with a gluten-free, vegan-friendly filling. This vibrant tart provides a satisfying balance of sweet and savory, featuring gluten-free ingredients and vegan flavors.",
  "recipeIngredient": [
    "400g beetroot cut into wedges",
    "1 red onion cut into wedges",
    "3 tbsp olive oil",
    "2 tbsp rice wine vinegar",
    "2 tbsp soft brown sugar",
    "2 star anise",
    "flour for rolling",
    "500g block puff pastry (we used vegan Jus-Rol)",
    "1 orange zested",
    "peppery green salad to serve"
  ],
  "recipeInstructions": [
    {
      "@type": "HowToStep",
      "text": "Heat oven to 200C/180C fan/gas 6. In a bowl, toss the beetroot and onion in 2 tbsp of the oil, the vinegar and sugar. Add the star anise and season well. Heat the rest of the oil in a large, ovenproof non-stick frying pan, t

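Since the system prompt asks for JSON only, it is worth checking that a generation actually parses before using it downstream. The sketch below is one hedged way to do that with the standard library. The helper name `parse_recipe` is hypothetical, and stripping a trailing `<end_of_turn>` marker is an assumption about what the raw generated text may contain.

```python
import json

def parse_recipe(generated_text):
    """Return the parsed Recipe dict, or None if the text is not a usable Recipe."""
    # Assumption: the raw text may end with the turn terminator.
    cleaned = generated_text.replace("<end_of_turn>", "").strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("@type") != "Recipe":
        return None
    return data

good = '{"@context": "https://schema.org", "@type": "Recipe", "name": "Tart"}<end_of_turn>'
truncated = '{"@type": "Recipe", "name": "Tart'

print(parse_recipe(good))
print(parse_recipe(truncated))
```

A generation cut off by `max_new_tokens` fails to parse and yields `None`, as does the empty object the prompt requests when no recipe is found.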
<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("your_name/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/gemma-3", token = "...") # Online saving

('gemma-3/tokenizer_config.json',
 'gemma-3/special_tokens_map.json',
 'gemma-3/chat_template.jinja',
 'gemma-3/tokenizer.model',
 'gemma-3/added_tokens.json',
 'gemma-3/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "gemma-3", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = False,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("gemma-3-finetune")
    tokenizer.save_pretrained("gemma-3-finetune")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/gemma-3-finetune", token = "")
    tokenizer.push_to_hub("hf/gemma-3-finetune", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!

In [104]:
if True: # Set to False to skip saving to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/vlad/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `gemma-3-finetune`: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00,  4.42it/s]


Successfully copied all 1 files from cache to `gemma-3-finetune`
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `gemma-3-finetune`: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 264.67it/s]


Successfully copied all 1 files from cache to `gemma-3-finetune`


Unsloth: Preparing safetensor model files: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46603.38it/s]
Unsloth: Merging weights into 16bit: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.08it/s]


Unsloth: Merge process complete. Saved to `/home/vlad/jupyter/gemma-3-finetune`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['gemma-3-270m-it.BF16.gguf']
Unsloth: [2] Converting GGUF bf16 into q8_0. This might take 10 minutes...


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['gemma-3-270m-it.Q8_0.gguf']
Unsloth: example usage for text only LLMs: llama-cli --model gemma-3-270m-it.Q8_0.gguf -p "why is the sky blue?"
Unsloth: Saved Ollama Modelfile to current directory
Unsloth: convert model to ollama format by running - ollama create model_name -f ./Modelfile - inside current directory.


Likewise, if you want to push the GGUF to your Hugging Face account instead, change `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma-finetune-gguf",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "hf_...",
    )