To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.25.1
!pip install torchao==0.15.0
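
The Colab branch above pins `xformers` to a build that matches the installed `torch` major.minor version. As a minimal sketch, the same selection logic can be written as a standalone function. `pick_xformers` is a hypothetical name used only for illustration, and the version table simply mirrors the install cell.

```python
import re

# Version table copied from the install cell; any other torch version falls
# back to the oldest pinned build, as the one-line conditional does.
XFORMERS_BY_TORCH = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
FALLBACK_XFORMERS = "0.0.29.post3"

def pick_xformers(torch_version: str) -> str:
    """Return the pip requirement string for a torch version (hypothetical helper)."""
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    return "xformers==" + XFORMERS_BY_TORCH.get(match.group(0), FALLBACK_XFORMERS)

print(pick_xformers("2.9.0+cu126"))
```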

### Unsloth

In [23]:
from unsloth import FastLanguageModel
import torch

# Models supported for Phone Deployment
fourbit_models = [
    "unsloth/Qwen3-4B",              # Any Qwen3 model like 0.6B, 4B, 8B, 32B
    "unsloth/Qwen3-32B",
    "unsloth/Llama-3.1-8B-Instruct", # Llama 3 models work
    "unsloth/Llama-3.3-70B-Instruct",
    "unsloth/gemma-3-270m-it",       # Gemma 3 models work
    "unsloth/gemma-3-27b-it",
    "unsloth/Qwen2.5-7B-Instruct",   # And more models!
    "unsloth/Phi-4-mini-instruct",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    full_finetuning = True,
    qat_scheme = "int4", # Gemma3 needs int4 due to large vocab (262K)
)

==((====))==  Unsloth 2025.12.6: Fast Gemma3 patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained
Unsloth: Applying QAT to mitigate quantization degradation


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-3 renders multi-turn conversations as shown below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
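
To make the layout concrete, here is a simplified sketch that renders a list of role/content messages into this turn format in plain Python. `render_gemma3` is a hypothetical helper for illustration only: it ignores system prompts and image content, and the real template is the one supplied by the tokenizer.

```python
def render_gemma3(messages, add_bos=True):
    """Render role/content messages in the Gemma-3 turn layout (simplified)."""
    # Gemma-3 labels the assistant turn "model"; other roles pass through.
    role_names = {"assistant": "model"}
    text = "<bos>" if add_bos else ""
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(conversation))
```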

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [27]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [28]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

We sample 10k rows to speed up training for the first successful phone deployment.

In [30]:
dataset = dataset.shuffle(seed=3407).select(range(10000))

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [31]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]
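
As a rough sketch of the kind of renaming involved, assume ShareGPT-style rows that store turns as `from`/`value` pairs with speakers such as `human` and `gpt`. `to_role_content` is a hypothetical helper written for illustration; it is not Unsloth's implementation of `standardize_data_formats`.

```python
# Assumed ShareGPT speaker names mapped to the role names chat templates expect.
SPEAKER_TO_ROLE = {"human": "user", "gpt": "assistant", "system": "system"}

def to_role_content(conversation):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    converted = []
    for turn in conversation:
        if "role" in turn and "content" in turn:
            # Already in the target layout; copy it through unchanged.
            converted.append({"role": turn["role"], "content": turn["content"]})
            continue
        role = SPEAKER_TO_ROLE.get(turn["from"], turn["from"])
        converted.append({"role": role, "content": turn["value"]})
    return converted

sharegpt_row = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there!"},
]
print(to_role_content(sharegpt_row))
```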

Let's see what row 100 looks like!

In [32]:
dataset[100]

{'conversations': [{'content': 'Consider the following optimization problem: \n\nMinimize the function f(x) = x^2 - 5x + 6 using cuckoo search algorithm over the interval [0,6]. \n\nUse an initial population size of 15 cuckoos, a maximum number of generations of 1000, a step size of 0.01, and a probability of cuckoo eggs being laid in a new nest of 0.25. \n\nDetermine the global minimum and value of the objective function.',
   'role': 'user'},
  {'content': "To solve the optimization problem using the cuckoo search algorithm, we will follow these steps:\n\n1. Initialize the population of 15 cuckoos randomly within the interval [0, 6].\n2. Evaluate the fitness of each cuckoo (i.e., calculate the objective function f(x) for each cuckoo).\n3. Perform the search for a maximum number of generations (1000).\n4. In each generation, update the cuckoo's position using the step size (0.01) and the probability of laying eggs in a new nest (0.25).\n5. Evaluate the fitness of the new cuckoos and r

We now apply the `Gemma-3` chat template to the conversations and save the result to `text`. We remove the leading `<bos>` token with `removeprefix('<bos>')` because the processor adds this token before training and the model expects only one.

In [33]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
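
The effect of the `removeprefix('<bos>')` call in `formatting_prompts_func` can be seen in isolation. The strings below are made up for illustration, and `str.removeprefix` requires Python 3.9 or newer.

```python
rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

# Strip the template's leading <bos> so that only one remains once the
# processor prepends its own during tokenization.
text = rendered.removeprefix("<bos>")
print(repr(text))

# removeprefix is a no-op when the prefix is absent, so applying it again is safe.
assert text.removeprefix("<bos>") == text

# Simulate the prepended token with and without the strip.
without_strip = "<bos>" + rendered
with_strip = "<bos>" + text
print(without_strip.count("<bos>"), with_strip.count("<bos>"))
```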

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [34]:
dataset[100]["text"]

"<start_of_turn>user\nConsider the following optimization problem: \n\nMinimize the function f(x) = x^2 - 5x + 6 using cuckoo search algorithm over the interval [0,6]. \n\nUse an initial population size of 15 cuckoos, a maximum number of generations of 1000, a step size of 0.01, and a probability of cuckoo eggs being laid in a new nest of 0.25. \n\nDetermine the global minimum and value of the objective function.<end_of_turn>\n<start_of_turn>model\nTo solve the optimization problem using the cuckoo search algorithm, we will follow these steps:\n\n1. Initialize the population of 15 cuckoos randomly within the interval [0, 6].\n2. Evaluate the fitness of each cuckoo (i.e., calculate the objective function f(x) for each cuckoo).\n3. Perform the search for a maximum number of generations (1000).\n4. In each generation, update the cuckoo's position using the step size (0.01) and the probability of laying eggs in a new nest (0.25).\n5. Evaluate the fitness of the new cuckoos and replace the 

<a name="Train"></a>
### Train the model
Fine-tuning requires careful experimentation. To avoid wasting hours on a broken pipeline, we start with a 5-step sanity check. This ensures the training stabilizes and the model exports correctly to your phone.

Run this short test first. If the export succeeds, come back and set `max_steps = -1` (or `num_train_epochs = 1`) for the full training run.

In [35]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",

        # --- FASTEST SETTINGS ---
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 2,

        # --- THE SPEED RUN ---
        max_steps = 5,  # Finish in seconds!
        warmup_steps = 1,

        # Stability settings (kept just in case)
        learning_rate = 5e-6,
        optim = "adamw_torch",
        max_grad_norm = 1.0,

        logging_steps = 1,
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/10000 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [36]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
14.111 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [37]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 5
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 2 x 1) = 32
 "-____-"     Trainable parameters = 435,870,336 of 435,870,336 (100.00% trained)


Step,Training Loss
1,18.1457
2,17.3064
3,18.096
4,17.5309
5,17.3119
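
The numbers in the training banner above follow from simple arithmetic. As a quick check, using the values from the `SFTConfig` cell (the variable names here are just for illustration):

```python
import math

num_examples = 10_000        # rows kept after .select(range(10000))
per_device_batch_size = 16   # per_device_train_batch_size
grad_accum_steps = 2         # gradient_accumulation_steps
num_gpus = 1
max_steps = 5

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
# Examples seen during the 5-step sanity check.
examples_seen = effective_batch * max_steps
# Approximate optimizer steps for one full pass; the exact count can differ by
# one depending on how the trainer treats the final partial batch.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch, examples_seen, steps_per_epoch)
```

The sanity check therefore touches 160 of the 10,000 sampled rows, so large changes in the loss should not be expected from it.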


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_memory_for_training / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_training} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {training_percentage} %.")

115.0737 seconds used for training.
1.92 minutes used for training.
Peak reserved memory = 14.111 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 95.726 %.
Peak reserved memory for training % of max memory = 0.0 %.
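
The 0.0 GB training figure deserves a second look. Both `start_gpu_memory` and `used_memory` come from `torch.cuda.max_memory_reserved()`, which reports a high-water mark, so when the peak (14.111 GB here) was already reached before training, the difference is zero. The sketch below reproduces the arithmetic with the printed values; `memory_report` is a hypothetical helper.

```python
def memory_report(start_gb, peak_gb, total_gb):
    """Recompute the memory statistics printed by the cell above (values in GB)."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Values taken from the printed output of this run.
print(memory_report(start_gb=14.111, peak_gb=14.111, total_gb=14.741))
```

One option, not used in this notebook, is to call `torch.cuda.reset_peak_memory_stats()` right before `trainer.train()` so that the peak reflects the training phase only; memory already held by the caching allocator may still count towards it.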


<a name="Save"></a>
### Saving, loading finetuned models

To save the model for phone deployment, we first save the model and tokenizer via `save_pretrained`.

In [39]:
# Save the model and tokenizer directly
model.save_pretrained("gemma_phone_model")
tokenizer.save_pretrained("gemma_phone_model")

('gemma_phone_model/tokenizer_config.json',
 'gemma_phone_model/special_tokens_map.json',
 'gemma_phone_model/chat_template.jinja',
 'gemma_phone_model/tokenizer.model',
 'gemma_phone_model/added_tokens.json',
 'gemma_phone_model/tokenizer.json')

### Install ExecuTorch Export Dependencies

The export requires executorch and optimum-executorch. We install executorch with `--no-deps` to avoid replacing your existing CUDA torch (executorch is CPU-only, designed for mobile deployment).

In [None]:
%%capture
# Install ExecuTorch export dependencies (stable release)
!pip install optimum==1.24.0 pytorch-tokenizers
!pip install executorch==1.0.1 --extra-index-url https://download.pytorch.org/whl/executorch --no-deps
!pip install git+https://github.com/huggingface/optimum-executorch.git@v0.1.0 --no-deps
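
Because these packages are installed with `--no-deps`, it can help to confirm what actually ended up in the environment before exporting. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution names passed to it are assumed to match the pip names used above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

to_check = ["torch", "executorch", "optimum", "optimum-executorch"]
for name, found in installed_versions(to_check).items():
    print(f"{name}: {found or 'not installed'}")
```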

We then export directly from the local folder using Optimum ExecuTorch, as per [the documentation](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md).

In [None]:
from optimum.executorch import ExecuTorchModelForCausalLM
import shutil
import os

print("Exporting Gemma3-270M to ExecuTorch format...")

# Export the trained model using Python API
et_model = ExecuTorchModelForCausalLM.from_pretrained(
    "gemma_phone_model",
    export=True,
    recipe="xnnpack",
    task="text-generation",
)

# Copy .pte file to output directory
temp_dir = et_model._temp_dir.name
os.makedirs("gemma_output", exist_ok=True)

for f in os.listdir(temp_dir):
    src = os.path.join(temp_dir, f)
    dst = os.path.join("gemma_output", f)
    shutil.copy2(src, dst)
    size_mb = os.path.getsize(dst) / 1024 / 1024
    print(f"Exported: {f} ({size_mb:.2f} MB)")

print("\nExport complete!")

This gives us the exported Gemma3 file `model.pte`, with a size of 306M!

In [None]:
!ls -lh gemma_output/model.pte
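
Before moving on to inference, a small check that the export directory really contains a `.pte` file can save a confusing error later. A sketch using `pathlib`; `list_pte_files` is a hypothetical helper, and `gemma_output` is the directory created by the export cell.

```python
from pathlib import Path

def list_pte_files(directory):
    """Return (file name, size in MiB) for every .pte file in `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size / 1024 / 1024)
        for path in root.glob("*.pte")
    )

exported = list_pte_files("gemma_output")
if not exported:
    print("No .pte file found; re-run the export cell.")
for name, size_mib in exported:
    print(f"{name}: {size_mib:.2f} MiB")
```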

### Test Inference on Exported Model

In [None]:
from transformers import AutoTokenizer

# Load the exported model for inference
et_model = ExecuTorchModelForCausalLM.from_pretrained("gemma_output", export=False)
tokenizer = AutoTokenizer.from_pretrained("gemma_phone_model")

# Test generation
prompt = "<start_of_turn>user\nWhat is 2 + 2?<end_of_turn>\n<start_of_turn>model\n"
output = et_model.text_generation(tokenizer, prompt, max_seq_len=50)
print(f"Generated: {output}")
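
The prompt string above is written by hand. For multi-turn tests it may be easier to assemble it from a message list; the sketch below does this in plain Python and leaves the final `model` turn open so that generation continues from there. `build_gemma3_prompt` is a hypothetical helper that mirrors the format of the hand-written prompt; it is not the tokenizer's chat template.

```python
def build_gemma3_prompt(messages):
    """Build a Gemma-3 style prompt that ends with an open model turn."""
    # Gemma-3 labels the assistant turn "model"; other roles pass through.
    role_names = {"assistant": "model"}
    prompt = ""
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        prompt += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    # Open the model turn without closing it: this is the generation prompt.
    return prompt + "<start_of_turn>model\n"

prompt = build_gemma3_prompt([{"role": "user", "content": "What is 2 + 2?"}])
print(repr(prompt))
```

The resulting string can be passed to `et_model.text_generation(tokenizer, prompt, max_seq_len=50)` in the same way as the hand-written prompt.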

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
