### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm
# Install latest Hugging Face for Gemma-3!
!pip install --no-deps git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
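
To see what the `re.sub` filter above does, here is a small sketch run on a made-up requirements snippet. The package pins below are hypothetical stand-ins, not the real contents of vLLM's `common.txt`; only the pattern is taken from the install cell.

```python
import re

# Hypothetical requirements text; the real file is downloaded from the vLLM repo.
sample = b"numpy<2.0\nrequests>=2.26\ntransformers>=4.48\nxformers==0.0.29\ntqdm\n"

# Same pattern as in the install cell: drop the pins for these three packages
# so that pip does not reinstall them.
pattern = rb"(transformers|numpy|xformers)[^\n]{1,}\n"
filtered = re.sub(pattern, b"", sample)

print(filtered.decode())
```

Only `requests>=2.26` and `tqdm` survive. Note that the pattern is not anchored to the start of a line, so a package whose name merely contains `numpy` or `transformers` would be partly stripped as well.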

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [3]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-07 07:17:18 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



We now add LoRA adapters so we only need to update a small number of parameters!

In [4]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model` require gradients
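
As a rough sanity check on how few parameters LoRA trains, the sketch below counts the adapter weights for `r = 8`. The layer shapes are an assumption: they come from the published Gemma-3 1B configuration, not from anything printed in this notebook, and the helper name `lora_params` is made up for illustration. Each adapted linear layer gets two small matrices, `A` (`r x in_features`) and `B` (`out_features x r`).

```python
# Assumed Gemma-3 1B shapes (from the published model config, not from this notebook).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 1152, 6912, 26
Q_OUT, KV_OUT = 1024, 256  # 4 query heads and 1 key/value head of size 256

RANK = 8  # the `r` passed to get_peft_model above

# (in_features, out_features) of every adapted linear layer in one decoder block
layer_shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters in A (rank x in_features) plus B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in layer_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 6,522,880
```

Under these assumptions the total matches the `Trainable parameters = 6,522,880` line that the trainer prints when training starts.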


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. Gemma-3 renders multi-turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
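
To make that layout concrete, here is a minimal pure-Python sketch that renders a list of messages into the same shape. `render_gemma3` is a hypothetical helper written for illustration; it is not the real chat template. The way it folds a system message into the following user turn is inferred from the rendered row printed later in this notebook.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Hypothetical helper that mimics the rendered Gemma-3 layout."""
    text = "<bos>"
    system_prefix = ""
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # There is no system turn: the text is prepended to the next user turn.
            system_prefix = content + "\n\n"
            continue
        if role == "user":
            content = system_prefix + content
            system_prefix = ""
        # The assistant role is written as "model".
        turn_role = "model" if role == "assistant" else role
        text += f"<start_of_turn>{turn_role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        text += "<start_of_turn>model\n"
    return text

chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(chat))
```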

We use our `get_chat_template` function to get the correct chat template.

In [5]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [24]:
import pickle
with open("/content/data_gemma3_fine_tune.pickle","rb") as f:
  dataset=pickle.load(f)

In [27]:
from sklearn.model_selection import train_test_split
train_dataset, test_dataset=train_test_split(dataset,test_size=0.05)
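
The split above is random and unseeded, so the held-out rows change from run to run. Below is a minimal sketch of a seeded 95/5 split in plain Python. The `split_rows` helper, the seed value, and the 100 stand-in rows are all illustrative; with scikit-learn, passing `random_state` to `train_test_split` is the usual way to get a repeatable split.

```python
import math
import random

def split_rows(rows, test_size=0.05, seed=3407):
    """Shuffle with a fixed seed, then hold out the first `test_size` fraction."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = [{"id": i} for i in range(100)]  # stand-in for the pickled conversations
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 95 5
```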

In [29]:
from datasets import Dataset
dataset=Dataset.from_list(train_dataset)

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [30]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Let's see what the first row looks like!

In [33]:
dataset[0]

{'messages': [{'content': '**Task:**  \nYou are given a **question** along with multiple **contexts** and their associated **metadata**.\n\n**Your goal is to:**\n\n1. **If the question is answered with "Yes" by any context:**  \n   - Return the **context** that supports a "Yes" answer, along with its **metadata** and the answer `"Yes"`.\n\n2. **If no context supports a "Yes" answer:**  \n   - Return a **context that contradicts** the question (i.e., implies the answer is "No"), along with its **metadata** and the answer `"No"`.',
   'role': 'system'},
  {'content': '**Question:** Does the report mention sustainability initiatives?\n**Context 1:** "TAKING CARE OF THE ENVIRONMENT \n\n Maintaining and supporting the health of our natural environment \n is vital to our customers, our business and \n our planet\'s future. Working with partners in the UK \n and around the world, we continue to invest in innovation \n and conservation programmes that will help us \n restore and protect our ri

We now apply the `Gemma-3` chat template to the conversations and save the result in a `text` column.

In [34]:
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["messages"],
                                          tokenize=False)
    return { "text" : texts }
pass
dataset = dataset.map(apply_chat_template, batched = True)

Map:   0%|          | 0/95 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` by default!

In [35]:
dataset[0]["text"]

'<bos><start_of_turn>user\n**Task:**  \nYou are given a **question** along with multiple **contexts** and their associated **metadata**.\n\n**Your goal is to:**\n\n1. **If the question is answered with "Yes" by any context:**  \n   - Return the **context** that supports a "Yes" answer, along with its **metadata** and the answer `"Yes"`.\n\n2. **If no context supports a "Yes" answer:**  \n   - Return a **context that contradicts** the question (i.e., implies the answer is "No"), along with its **metadata** and the answer `"No"`.\n\n**Question:** Does the report mention sustainability initiatives?\n**Context 1:** "TAKING CARE OF THE ENVIRONMENT \n\n Maintaining and supporting the health of our natural environment \n is vital to our customers, our business and \n our planet\'s future. Working with partners in the UK \n and around the world, we continue to invest in innovation \n and conservation programmes that will help us \n restore and protect our rivers, ensure a sustainable water \n 

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! We run 46 steps because the dataset is small and to keep the run short, but you can set `num_train_epochs=1` for a full run and turn off the step limit with `max_steps=None`.
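
To see how 46 steps relates to epochs, here is a back-of-the-envelope sketch. It uses the batch settings from the `SFTConfig` in this section and the 95 training rows reported in the logs. The rounding is a simplification and may not match the trainer's internal bookkeeping exactly.

```python
import math

num_examples = 95       # training rows after the 95/5 split
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1
max_steps = 46

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus

# Optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Number of passes over the data that 46 steps will touch.
epochs_touched = math.ceil(max_steps / steps_per_epoch)

print(effective_batch, steps_per_epoch, epochs_touched)  # 8 12 4
```

So 46 steps is a little under four full passes over the data, which agrees with the `Num Epochs = 4` and `Total batch size (2 x 4 x 1) = 8` lines in the training banner.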

In [36]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 46,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/95 [00:00<?, ? examples/s]
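
The config above combines `warmup_steps = 5` with a `linear` scheduler. As a sketch of what that implies for the learning rate, the function below ramps up linearly during warmup and then decays linearly to zero at the last step. It is an illustrative re-implementation of that shape, not code taken from the trainer, and the function name is made up.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=46):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 1, 5, 25, 46):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The peak of `2e-4` is reached at step 5, and the rate is back to zero by step 46.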

We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [37]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/95 [00:00<?, ? examples/s]
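
Conceptually, the masking replaces every label outside a model turn with `-100`, which the loss ignores. The toy sketch below shows the idea on a hand-made token list. It is an assumption-laden illustration rather than Unsloth's implementation: `mask_non_response` is a hypothetical helper, and each marker is treated as a single token, which a real tokenizer would not do.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_non_response(tokens, token_ids, instruction_part, response_part):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for token, token_id in zip(tokens, token_ids):
        if token == response_part:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token == instruction_part:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id if in_response else IGNORE_INDEX)
    return labels

tokens = ["<bos>", "<start_of_turn>user\n", "Hello!", "<end_of_turn>\n",
          "<start_of_turn>model\n", "Hey", "there!", "<end_of_turn>\n"]
token_ids = list(range(len(tokens)))  # toy ids: each token's position

labels = mask_non_response(
    tokens, token_ids,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
print(labels)  # [-100, -100, -100, -100, -100, 5, 6, 7]
```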

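Let's verify that the instruction part is masked! First, let's print the 0th row again. The output starts with `<bos><bos>` because the chat template already inserted one `<bos>` and the tokenizer prepends another: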

In [38]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

'<bos><bos><start_of_turn>user\n**Task:**  \nYou are given a **question** along with multiple **contexts** and their associated **metadata**.\n\n**Your goal is to:**\n\n1. **If the question is answered with "Yes" by any context:**  \n   - Return the **context** that supports a "Yes" answer, along with its **metadata** and the answer `"Yes"`.\n\n2. **If no context supports a "Yes" answer:**  \n   - Return a **context that contradicts** the question (i.e., implies the answer is "No"), along with its **metadata** and the answer `"No"`.\n\n**Question:** Does the report mention sustainability initiatives?\n**Context 1:** "TAKING CARE OF THE ENVIRONMENT \n\n Maintaining and supporting the health of our natural environment \n is vital to our customers, our business and \n our planet\'s future. Working with partners in the UK \n and around the world, we continue to invest in innovation \n and conservation programmes that will help us \n restore and protect our rivers, ensure a sustainable wate

Now let's print the masked out example - you should see only the answer is present:

In [40]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[0]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       **Answer:** yes\n**Chunk:** "TAKING CARE OF THE ENVIRONMENT \n\n Maintaining and supporting the health of our natural environment \n is vital to our customers, our business and \n our planet\'s future. Working with partners in the UK \n and around the world, we continue t

In [41]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.512 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [42]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 95 | Num Epochs = 4 | Total steps = 46
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 6,522,880/1,000,000,000 (0.65% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 0.5463 |
| 2 | 0.5167 |
| 3 | 0.4322 |
| 4 | 0.4815 |
| 5 | 0.4302 |
| 6 | 0.2553 |
| 7 | 0.193 |
| 8 | 0.2035 |
| 9 | 0.1701 |
| 10 | 0.1233 |


In [43]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

545.4 seconds used for training.
9.09 minutes used for training.
Peak reserved memory = 2.502 GB.
Peak reserved memory for training = 0.99 GB.
Peak reserved memory % of max memory = 16.973 %.
Peak reserved memory for training % of max memory = 6.716 %.


<a name="Inference"></a>
### Inference
Let's run the model using Unsloth's native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`.
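
For intuition about what these three settings do to the next-token distribution, here is a simplified pure-Python sketch on toy logits. It follows the usual order (temperature, then top-k, then top-p on the renormalised survivors) but it is an illustration only, not the library's sampling code, and the function name is made up.

```python
import math

def filter_next_token_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature rescales the logits before softmax (1.0 leaves them unchanged).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    k_mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
probs = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(probs)
```

With these toy values, top-k drops the last token and top-p then drops the third, leaving only the two most likely tokens to sample from.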

In [51]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages=test_dataset[0]['messages'][:2]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation,
    tokenize=False
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 500, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><bos><start_of_turn>user\n**Task:**  \nYou are given a **question** along with multiple **contexts** and their associated **metadata**.\n\n**Your goal is to:**\n\n1. **If the question is answered with "Yes" by any context:**  \n   - Return the **context** that supports a "Yes" answer, along with its **metadata** and the answer `"Yes"`.\n\n2. **If no context supports a "Yes" answer:**  \n   - Return a **context that contradicts** the question (i.e., implies the answer is "No"), along with its **metadata** and the answer `"No"`.\n\n**Question:** Does the report mention sustainability initiatives?\n**Context 1:** "Our strategy & commitments in \n motion \n\n We are already taking action to build a more sustainable \n future and contribute to the well-being \n of the communities in which we operate. \n Join us and follow our progress at www.bs-group-sa.com/about-us/ \n sustainability/"\n **Metadata:** Company: B&S Group Sarl | Page: 9\n\n**Context 2:** "Our commitments \n\n We inves

In [52]:
# Ground truth: the reference answer for this test example
test_dataset[0]['messages'][2]

{'role': 'assistant',
 'content': '**Answer:** no\n**Chunk:** "Our strategy & commitments in \n motion \n\n We are already taking action to build a more sustainable \n future and contribute to the well-being \n of the communities in which we operate. \n Join us and follow our progress at www.bs-group-sa.com/about-us/ \n sustainability/"\n**Company:** B&S Group Sarl\n**Page:** 9'}

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [55]:
messages = test_dataset[0]['messages'][:2]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize=False
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 500, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

**Answer:** no
**Chunk:** "Our strategy & commitments in 
 motion 

 We are already taking action to build a more sustainable 
 future and contribute to the well-being 
 of the communities in which we operate. 
 Join us and follow our progress at www.bs-group-sa.com/about-us/ sustainability/"
**Company:** B&S Group Sarl
**Page:** 9<end_of_turn>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [56]:
model.save_pretrained("gemma-3-1b-fine-tuned-lora-adapter")  # Local saving
tokenizer.save_pretrained("gemma-3-1b-fine-tuned-lora-adapter")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

('gemma-3-1b-fine-tuned-lora-adapter/tokenizer_config.json',
 'gemma-3-1b-fine-tuned-lora-adapter/special_tokens_map.json',
 'gemma-3-1b-fine-tuned-lora-adapter/tokenizer.model',
 'gemma-3-1b-fine-tuned-lora-adapter/added_tokens.json',
 'gemma-3-1b-fine-tuned-lora-adapter/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `if False` to `if True`:

In [57]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "/content/gemma-3-1b-fine-tuned-lora-adapter", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = test_dataset[0]['messages'][:2]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation,
    tokenize=False
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 500, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
**Answer:** no
**Chunk:** "Our strategy & commitments in 
 motion 

 We are already taking action to build a more sustainable 
 future and contribute to the well-being 
 of the communities in which we operate. 
 Join us and follow our progress at www.bs-group-sa.com/about-us/ sustainability/"
**Company:** B&S Group Sarl
**Page:** 9<end_of_turn>


### Saving to float16 for vLLM

Save to `float16` directly for deployment!

In [58]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3-1b-fine-tuned-float16", tokenizer)

Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:32<00:00, 92.85s/it]
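
"Merging" here means folding the LoRA update back into the base weights so the saved model no longer needs the adapter. The sketch below shows the standard LoRA merge formula, `W + (alpha / r) * B @ A`, on tiny hand-made matrices. It is a plain-Python illustration of that formula, not Unsloth's merge code, and the helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # base layer: 2 outputs, 3 inputs
lora_a = [[1.0, 2.0, 3.0]]                   # A: rank 1 x 3 inputs
lora_b = [[0.5], [-1.0]]                     # B: 2 outputs x rank 1

# alpha == rank gives a scale of 1.0, as with r = 8 and lora_alpha = 8 above.
merged = merge_lora(weight, lora_a, lora_b, alpha=1, rank=1)
print(merged)  # [[1.5, 1.0, 1.5], [-1.0, -1.0, -3.0]]
```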


Upload / push to your Hugging Face account.

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-1b-fine-tuned-float16", tokenizer,
        token = "hf_..."
    )

### GGUF / llama.cpp Conversion
To save to `GGUF` for `llama.cpp`, you can easily convert the model to `Q8_0`, `F16` or `BF16` precision.

In [61]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-1b-fine-tuned-float16",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Unsloth: Updating system package directories
Unsloth: Install GGUF and other packages
Unsloth GGUF:hf-to-gguf:Loading model: gemma-3-1b-fine-tuned-float16
Unsloth GGUF:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
Unsloth GGUF:hf-to-gguf:Exporting model...
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model.safetensors'
Unsloth GGUF:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> Q8_0, shape = {1152, 262144}
Unsloth GGUF:hf-to-gguf:output_norm.weight,                torch.bfloat16 --> F32, shape = {1152}
Unsloth GGUF:hf-to-gguf:Set meta model
Unsloth GGUF:hf-to-gguf:Set model parameters
Unsloth GGUF:hf-to-gguf:Set model tokenizer
Unsloth GGUF:gguf.vocab:Setting special token type bos to 2
Unsloth GGUF:gguf.vocab:Setting special token type eos to 106
Unsloth GGUF:gguf.vocab:Setting special token type unk to 3
Unsloth GGUF:gguf.vocab:Setting special token type pad to 0
Unsloth GGUF:gguf.vocab:Setting add_bos_token to True
Unsloth GGUF:gguf.vocab:Set

Unsloth: GGUF conversion:   0%|          | 0/100 [00:00<?, ?it/s]

Unsloth GGUF:hf-to-gguf:Model successfully exported to ./
Unsloth: Converted to gemma-3-1b-fine-tuned-float16.Q8_0.gguf with size = 1.1G
Unsloth: Successfully saved GGUF to:
gemma-3-1b-fine-tuned-float16.Q8_0.gguf
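
For a feel of what `Q8_0` does to the weights, here is a simplified sketch of block-wise 8-bit quantization: each block of 32 weights shares one scale and stores 32 signed 8-bit integers. This is a toy illustration under that assumption, not the converter's code. In particular, the real format stores the scale as a 16-bit float, which this sketch skips.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32

def quantize_block(values):
    """Quantize one block to int8-range values plus a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 0.0
    quants = [round(v / scale) if scale else 0 for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the integers."""
    return [scale * q for q in quants]

block = [(i - 16) / 10 for i in range(BLOCK_SIZE)]  # toy weights in [-1.6, 1.5]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage per block: 32 one-byte integers plus a 2-byte scale.
bits_per_weight = (BLOCK_SIZE * 1 + 2) * 8 / BLOCK_SIZE
print(bits_per_weight)  # 8.5
```

At about 8.5 bits per weight, a model with roughly a billion weights lands a little above 1 GB, which is in the same range as the 1.1G file reported above.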


Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [62]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-1b-fine-tuned-float16",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-3-1b-fine-tuned-gguf",
        token = "hf_...",
    )