To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Long-Context GRPO for reinforcement learning: train stably at massive sequence lengths. Fine-tune models with up to 7x more context length efficiently. [Read Blog](https://unsloth.ai/docs/new/grpo-long-context)

3× faster training with optimized sequence packing: higher throughput with no quality loss. [Read Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)

500k context-length fine-tuning: push long-context models further with memory-efficient training. [Read Blog](https://unsloth.ai/docs/new/500k-context-length-fine-tuning)

Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.57.3
!pip install --no-deps trl==0.25.1
!pip install torchao==0.14.0 optimum==1.24.0 pytorch-tokenizers executorch
!pip install git+https://github.com/huggingface/optimum-executorch.git@v0.1.0 --no-deps
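
The Colab branch of the install cell pins `xformers` based on the installed `torch` version. As a rough sketch of that lookup in plain Python (the function name `pick_xformers` and the dictionary layout are illustrative, not part of Unsloth; the version pins themselves are copied from the cell above):

```python
import re

# Pins copied from the install cell: torch "major.minor" -> xformers version.
XFORMERS_PINS = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
FALLBACK_PIN = "0.0.29.post3"  # used for any other torch version


def pick_xformers(torch_version: str) -> str:
    """Build the pip requirement string for a given torch version (hypothetical helper)."""
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized torch version: {torch_version!r}")
    return "xformers==" + XFORMERS_PINS.get(match.group(0), FALLBACK_PIN)


print(pick_xformers("2.9.0+cu126"))
```

Only the leading `major.minor` part of the version string is matched, so local build suffixes such as `+cu126` do not affect which pin is chosen.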

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch

# Models supported for Phone Deployment
fourbit_models = [
    "unsloth/Qwen3-4B",              # Any Qwen3 model like 0.6B, 4B, 8B, 32B
    "unsloth/Qwen3-32B",
    "unsloth/Llama-3.1-8B-Instruct", # Llama 3 models work
    "unsloth/Llama-3.3-70B-Instruct",
    "unsloth/gemma-3-270m-it",       # Gemma 3 models work
    "unsloth/gemma-3-27b-it",
    "unsloth/Qwen2.5-7B-Instruct",   # And more models!
    "unsloth/Phi-4-mini-instruct",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    full_finetuning = True,
    qat_scheme = "int4", # Gemma3 needs int4 due to large vocab (262K)
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained


model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Unsloth: Applying QAT to mitigate quantization degradation


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-3 renders multi turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
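
To make the layout above concrete, here is a minimal hand-rolled sketch of how a conversation could be turned into that text. It is not the real chat template (the notebook uses `get_chat_template` and `apply_chat_template` for that); the function name `render_gemma3` is hypothetical, and folding a leading system message into the first turn is an assumption based on the rendered sample printed later in this notebook.

```python
# Role names used in the rendered text; "assistant" turns appear as "model".
ROLE_NAMES = {"user": "user", "assistant": "model"}


def render_gemma3(conversation, add_bos=True):
    """Hand-rolled approximation of the Gemma-3 chat layout (not the real template)."""
    turns = list(conversation)
    prefix = ""
    # Assumption: a leading system message is folded into the first turn.
    if turns and turns[0]["role"] == "system":
        prefix = turns[0]["content"] + "\n\n"
        turns = turns[1:]
    parts = ["<bos>"] if add_bos else []
    for index, turn in enumerate(turns):
        content = (prefix if index == 0 else "") + turn["content"]
        role = ROLE_NAMES[turn["role"]]
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    return "".join(parts)


print(render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

For real training data, rely on the tokenizer's own template rather than a sketch like this, since the official template may handle edge cases this version ignores.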

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [3]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [4]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We sample 10k rows to speed up training for the first successful phone deployment.

In [5]:
dataset = dataset.shuffle(seed=3407).select(range(10000))

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [6]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]
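
As a rough illustration of what "standardizing" means here, the sketch below converts ShareGPT-style turns into the `role` / `content` layout shown in the next cell. It assumes the raw rows use `from` / `value` keys with `human` and `gpt` speaker names; the helper `to_role_content` is hypothetical and is not Unsloth's implementation of `standardize_data_formats`.

```python
# Assumed ShareGPT speaker names mapped to chat-template role names.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_role_content(sharegpt_turns):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    return [
        # Unknown speaker names are passed through unchanged.
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in sharegpt_turns
    ]


example = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
print(to_role_content(example))
```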

Let's see what row 100 looks like!

In [7]:
dataset[100]

{'conversations': [{'content': 'You are a helpful assistant, with no access to external functions.',
   'role': 'system'},
  {'content': '20 // 3 - 4 ** 2 + 5=', 'role': 'user'},
  {'content': "The answer to this expression is -6. \nHere's how to solve it: \n1. First, we need to evaluate the exponentiation operator (**), which has higher precedence than the other operators. So, 4 ** 2 equals 16. \n2. Next, we need to perform the division operation (20 // 3), which gives us 6 (since 20 divided by 3 is 6 with a remainder of 2). \n3. Now we can substitute the values we have so far into the expression: \n6 - 16 + 5 \n4. Finally, we can perform the remaining addition and subtraction operations from left to right: \n6 - 16 = -10 \n-10 + 5 = -5 \nTherefore, the final answer is -6.",
   'role': 'assistant'},
  {'content': 'Can you explain the concept of neural networks?',
   'role': 'user'},
  {'content': 'Sure, I can explain neural networks. Neural networks are a type of machine learning algo

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.

In [8]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
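
The effect of the `removeprefix('<bos>')` call in the cell above can be checked in isolation. In this small sketch, `add_bos_like_processor` is a hypothetical stand-in for the step that prepends `<bos>` during tokenization. It only works on strings and does not call the real processor.

```python
BOS = "<bos>"


def add_bos_like_processor(text: str) -> str:
    """Hypothetical stand-in: prepend one BOS marker, as the processor is said to do."""
    return BOS + text


rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"
stripped = rendered.removeprefix(BOS)  # str.removeprefix needs Python 3.9+

# Without stripping, the marker would appear twice after the processor step.
print(add_bos_like_processor(rendered).count(BOS))
# With stripping, it appears exactly once.
print(add_bos_like_processor(stripped).count(BOS))
```

`removeprefix` only strips the marker when it is at the very start of the string, so text that does not begin with `<bos>` is returned unchanged.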

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [9]:
dataset[100]["text"]

"<start_of_turn>user\nYou are a helpful assistant, with no access to external functions.\n\n20 // 3 - 4 ** 2 + 5=<end_of_turn>\n<start_of_turn>model\nThe answer to this expression is -6. \nHere's how to solve it: \n1. First, we need to evaluate the exponentiation operator (**), which has higher precedence than the other operators. So, 4 ** 2 equals 16. \n2. Next, we need to perform the division operation (20 // 3), which gives us 6 (since 20 divided by 3 is 6 with a remainder of 2). \n3. Now we can substitute the values we have so far into the expression: \n6 - 16 + 5 \n4. Finally, we can perform the remaining addition and subtraction operations from left to right: \n6 - 16 = -10 \n-10 + 5 = -5 \nTherefore, the final answer is -6.<end_of_turn>\n<start_of_turn>user\nCan you explain the concept of neural networks?<end_of_turn>\n<start_of_turn>model\nSure, I can explain neural networks. Neural networks are a type of machine learning algorithm that are inspired by the structure and functio

<a name="Train"></a>
### Train the model
Fine-tuning requires careful experimentation. To avoid wasting hours on a broken pipeline, we start with a short sanity check (`max_steps = 60` in the cell below). This confirms that training is stable and that the model exports correctly to your phone.

Run this short test first. If the export succeeds, come back and set `max_steps = -1` together with `num_train_epochs = 1` for the full training run.
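
To get a feel for how much of the data the short run covers compared with a full epoch, the arithmetic can be sketched as follows. The numbers come from this notebook's configuration (10,000 sampled rows, batch size 8, gradient accumulation 4, one GPU, 60 steps). The steps-per-epoch figure is an estimate, since the trainer's exact rounding and any padding-free batching may differ.

```python
import math

num_examples = 10_000      # rows kept after sampling
per_device_batch = 8       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_gpus = 1
max_steps = 60             # length of the short test run

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples seen during the short run, and the share of the dataset that represents.
examples_seen = max_steps * effective_batch
fraction_of_epoch = examples_seen / num_examples
# Approximate optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.1%} of the data)")
print(f"Approximate steps for one epoch: {steps_per_epoch}")
```

Under these assumptions, the short run touches roughly a fifth of the sampled data, so a full epoch should take about five times as long.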

In [10]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        warmup_steps = 5,
        learning_rate = 5e-6,
        optim = "adamw_torch",
        max_grad_norm = 1.0,
        logging_steps = 1,
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/10000 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [11]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.666 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 268,098,176 of 268,098,176 (100.00% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.603
2,2.6463
3,2.751
4,2.564
5,2.197
6,2.1417
7,2.1798
8,1.8403
9,2.1675
10,2.0056


In [13]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_memory_for_training / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_training} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {training_percentage} %.")

851.6698 seconds used for training.
14.19 minutes used for training.
Peak reserved memory = 13.57 GB.
Peak reserved memory for training = 11.904 GB.
Peak reserved memory % of max memory = 92.056 %.
Peak reserved memory for training % of max memory = 80.754 %.
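
The derived figures above follow directly from three measured values. As a quick sanity check, this sketch recomputes them from the numbers printed in this run (no GPU needed). The `headroom` value is an extra illustration and is not printed by the notebook.

```python
# Values printed earlier in this run.
start_gpu_memory = 1.666   # GB reserved before training
max_memory = 14.741        # GB total on the Tesla T4
used_memory = 13.57        # GB peak reserved after training
train_runtime = 851.6698   # seconds

used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_memory_for_training / max_memory * 100, 3)
minutes = round(train_runtime / 60, 2)
headroom = round(max_memory - used_memory, 3)  # illustrative: GB left unreserved

print(f"Training used {used_memory_for_training} GB ({training_percentage} % of max).")
print(f"Peak usage was {used_percentage} % of max, leaving about {headroom} GB.")
print(f"Runtime: {minutes} minutes.")
```

With only about 1 GB of headroom on a T4, increasing `per_device_train_batch_size` or `max_seq_length` is likely to run out of memory in this configuration.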


<a name="Save"></a>
### Saving, loading finetuned models

To save the model for phone deployment, we first save the model and tokenizer via `save_pretrained`.

In [14]:
# Save the model and tokenizer directly
model.save_pretrained("gemma_phone_model")
tokenizer.save_pretrained("gemma_phone_model")

('gemma_phone_model/tokenizer_config.json',
 'gemma_phone_model/special_tokens_map.json',
 'gemma_phone_model/chat_template.jinja',
 'gemma_phone_model/tokenizer.model',
 'gemma_phone_model/added_tokens.json',
 'gemma_phone_model/tokenizer.json')

We then export directly from the local folder using Optimum ExecuTorch as per [the documentation](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md).

In [15]:
%%file export_gemma_model.py
from optimum.executorch import ExecuTorchModelForCausalLM
import shutil
import os

print("Exporting Gemma3-270M to ExecuTorch format...")

# Export the trained model using Python API
et_model = ExecuTorchModelForCausalLM.from_pretrained(
    "gemma_phone_model",
    export = True,
    recipe = "xnnpack",
    task = "text-generation",
)

# Copy .pte file to output directory
temp_dir = et_model._temp_dir.name
os.makedirs("gemma_output", exist_ok=True)

for f in os.listdir(temp_dir):
    src = os.path.join(temp_dir, f)
    dst = os.path.join("gemma_output", f)
    shutil.copy2(src, dst)
    size_mb = os.path.getsize(dst) / 1024 / 1024
    print(f"Exported: {f} ({size_mb:.2f} MB)")

print("\nExport complete!")

Writing export_gemma_model.py


In [16]:
!python export_gemma_model.py

2026-01-02 09:15:27.775125: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1767345327.794751    6153 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1767345327.800610    6153 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1767345327.815563    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345327.815587    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345327.815591    6153 computation_placer.cc:177] computation placer alr

And we have the exported Gemma3 `model.pte` file! The listing below shows its size on disk.

In [17]:
!ls -lh gemma_output/model.pte

-rw-r--r-- 1 root root 1.7G Jan  2 09:20 gemma_output/model.pte
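
If you script the export, the same size check can be done from Python instead of `ls`. This is a minimal sketch. The helper names `size_in_mb` and `list_exports` are hypothetical, and the folder name `gemma_output` is taken from the export script above.

```python
from pathlib import Path


def size_in_mb(path) -> float:
    """File size in mebibytes (bytes / 1024 / 1024)."""
    return Path(path).stat().st_size / 1024 / 1024


def list_exports(directory, suffix=".pte"):
    """Return sorted (file name, size in MB) pairs for exported files in a folder."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # nothing exported yet
    return sorted(
        (p.name, round(size_in_mb(p), 2))
        for p in folder.iterdir()
        if p.is_file() and p.suffix == suffix
    )


for name, mb in list_exports("gemma_output"):
    print(f"{name}: {mb:.2f} MB")
```

Checking the size this way makes it easy to fail early in a pipeline, for example if no `.pte` file was produced.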


### Test Inference on Exported Model

In [18]:
%%file test_executorch.py
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Load the exported model for inference
et_model = ExecuTorchModelForCausalLM.from_pretrained("gemma_output", export = False)
tokenizer = AutoTokenizer.from_pretrained("gemma_phone_model")

# Test generation
prompt = "<start_of_turn>user\nWhat is 2 + 2?<end_of_turn>\n<start_of_turn>model\n"
output = et_model.text_generation(tokenizer, prompt, max_seq_len = 50)
print(f"Generated: {output}")

Writing test_executorch.py


In [19]:
!python test_executorch.py

2026-01-02 09:25:10.490054: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1767345910.715842    8915 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1767345910.779700    8915 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1767345911.259764    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345911.259808    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345911.259812    8915 computation_placer.cc:177] computation placer alr

And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, continued pretraining, conversational finetuning and more in our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️
</div>

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
