### Tuning a Small Model - Gemma 3 270M

#### https://unsloth.ai/

### Installation

In [None]:
! pip install unsloth

In [None]:
!pip install transformers==4.56.2

In [None]:
!pip install --no-deps trl==0.22.2

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [7]:
from unsloth import FastModel
import torch
max_seq_length = 2048
# fourbit_models = [
#     # 4bit dynamic quants for superior accuracy and low memory use
#     "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

#     # Other popular models!
#     "unsloth/Llama-3.1-8B",
#     "unsloth/Llama-3.2-3B",
#     "unsloth/Llama-3.3-70B",
#     "unsloth/mistral-7b-instruct-v0.3",
#     "unsloth/Phi-4",
# ] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.2: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA RTX 4000 Ada Generation. Num GPUs = 1. Max memory: 19.674 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.



#### We now add LoRA adapters so we only need to update a small number of parameters!

In [8]:
#https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide#hyperparameters-and-recommendations
model = FastModel.get_peft_model(
    model,
    r = 32 , # 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
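
As a sanity check on the LoRA settings above, the number of trainable parameters can be worked out by hand: each adapted linear layer with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` weights. The sketch below assumes the Gemma 3 270M layer sizes (hidden size 640, 18 layers, 4 query heads and 1 key/value head of dimension 256, MLP width 2048). These sizes are not stated in this notebook, so treat them as assumptions to verify against the model's `config.json`. Under those assumptions the total agrees with the `Trainable parameters` figure reported in the training log.

```python
# Assumed Gemma 3 270M dimensions (not stated in this notebook; verify in config.json).
HIDDEN = 640
NUM_LAYERS = 18
Q_DIM = 4 * 256    # 4 query heads x head_dim 256
KV_DIM = 1 * 256   # 1 key/value head x head_dim 256
MLP_DIM = 2048

# (input features, output features) for every module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(rank=32)
print(f"{trainable:,} trainable LoRA parameters")
print(f"LoRA update scale (lora_alpha / r) = {64 / 32}")
```

Halving `r` halves the adapter size, which is one way to judge the cost of the rank choices suggested in the code comment.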


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes, with the [Pluto chat dataset](https://huggingface.co/datasets/droidnext/pluto_dataset_chat) as training data. Gemma-3 renders multi-turn conversations like this:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma3",
)

In [None]:
from datasets import load_dataset
dataset = load_dataset("droidnext/pluto_dataset_chat", split = "train[:500]")

We now use a `convert_to_chatml` helper to convert each dataset row into the conversation format needed for finetuning!

In [None]:
def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "system", "content": example["task"]},
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["expected_output"]}
        ]
    }

dataset = dataset.map(
    convert_to_chatml
)
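
`dataset.map` calls the function once per row and merges the returned dictionary into that row. The sketch below imitates this on a plain dict, so it runs without downloading the dataset. The dict merge stands in for `map`, and the `task` text is shortened for illustration.

```python
def convert_to_chatml(example):
    # Same mapping as the cell above: task -> system, input -> user, expected_output -> assistant.
    return {
        "conversations": [
            {"role": "system", "content": example["task"]},
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["expected_output"]},
        ]
    }


# One row in the shape of the Pluto dataset (task text shortened for illustration).
row = {
    "task": "Respond with the tone and persona of Pluto.",
    "input": "You can jump the hoop attempt 50!",
    "expected_output": "*clears hoop with flourish attempt 50*",
}

# Emulates what map() does for a single row: original columns plus the new one.
mapped = {**row, **convert_to_chatml(row)}
for turn in mapped["conversations"]:
    print(f"{turn['role']:>9}: {turn['content']}")
```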

Let's see what row 100 looks like!

In [14]:
dataset[100]

{'task': 'Given a input respond with a tone and persona of disney character Pluto . Input Format: Input chat message talking to pluto. Output Format: Pluto responding to input chat message.',
 'input': 'You can jump the hoop attempt 50!',
 'expected_output': '*clears hoop with flourish attempt 50*',
 'conversations': [{'content': 'Given a input respond with a tone and persona of disney character Pluto . Input Format: Input chat message talking to pluto. Output Format: Pluto responding to input chat message.',
   'role': 'system'},
  {'content': 'You can jump the hoop attempt 50!', 'role': 'user'},
  {'content': '*clears hoop with flourish attempt 50*', 'role': 'assistant'}]}

We now have to apply the chat template for `Gemma3` onto the conversations, and save it to `text`.

In [None]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Let's see how the chat template did!


In [16]:
dataset[100]['text']

'<start_of_turn>user\nGiven a input respond with a tone and persona of disney character Pluto . Input Format: Input chat message talking to pluto. Output Format: Pluto responding to input chat message.\n\nYou can jump the hoop attempt 50!<end_of_turn>\n<start_of_turn>model\n*clears hoop with flourish attempt 50*<end_of_turn>\n'
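
The rendered text shows two details of the Gemma-3 format: the `assistant` role is written as `model`, and the system message is folded into the first user turn, separated by a blank line. The pure-Python sketch below imitates that layout for illustration only. `render_gemma3` is a hypothetical helper, not the Jinja template that `apply_chat_template` actually runs, and the system text is shortened.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Approximate the Gemma-3 chat layout (without the leading <bos>)."""
    system_prefix = ""
    if messages and messages[0]["role"] == "system":
        # The system message has no turn of its own; it is prepended to the first turn.
        system_prefix = messages[0]["content"] + "\n\n"
        messages = messages[1:]

    parts = []
    for index, message in enumerate(messages):
        role = "model" if message["role"] == "assistant" else message["role"]
        prefix = system_prefix if index == 0 else ""
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{prefix}{content}<end_of_turn>\n")

    if add_generation_prompt:
        # Leaves the model turn open so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "Respond as Pluto."},
    {"role": "user", "content": "You can jump the hoop attempt 50!"},
    {"role": "assistant", "content": "*clears hoop with flourish attempt 50*"},
]
print(render_gemma3(example))
```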

<a name="Train"></a>
### Train the model
Now let's train our model. We run 100 steps to keep things quick; for a full run, set `num_train_epochs=1` and remove the `max_steps` argument.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 5e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir="outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)
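
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up from zero over the first five steps and then decays linearly to zero at step 100. The sketch below reproduces that shape, assuming the usual linear-warmup, linear-decay rule. `linear_lr` is an illustrative helper, not a function from `trl` or `transformers`.

```python
def linear_lr(step, base_lr=5e-5, warmup_steps=5, total_steps=100):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: scale up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale down from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 50, 100):
    print(f"step {step:3d}: lr = {linear_lr(step):.2e}")
```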

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Let's verify that the instruction part is masked! First, let's print row 100 again.

In [19]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nGiven a input respond with a tone and persona of disney character Pluto . Input Format: Input chat message talking to pluto. Output Format: Pluto responding to input chat message.\n\nYou can jump the hoop attempt 50!<end_of_turn>\n<start_of_turn>model\n*clears hoop with flourish attempt 50*<end_of_turn>\n'

Now let's print the masked out example - you should see only the answer is present:

In [20]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                       *clears hoop with flourish attempt 50*<end_of_turn>\n'
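
The masking idea can be illustrated on toy token IDs: every label is set to `-100` (the index the loss ignores) except the tokens that follow a response marker, up to the next instruction marker. This is a conceptual sketch under those assumptions, not Unsloth's implementation. `mask_labels` and the toy IDs are made up for illustration.

```python
IGNORE_INDEX = -100


def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_labels(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    position = 0
    while True:
        marker = find_sub(input_ids, response_ids, position)
        if marker == -1:
            break
        start = marker + len(response_ids)
        next_instruction = find_sub(input_ids, instruction_ids, start)
        end = len(input_ids) if next_instruction == -1 else next_instruction
        labels[start:end] = input_ids[start:end]
        position = end
    return labels


# Toy IDs: [1, 2] plays the role of "<start_of_turn>user\n", [1, 3] of "<start_of_turn>model\n".
USER, MODEL = [1, 2], [1, 3]
ids = USER + [10, 11, 9] + MODEL + [20, 21, 9]
print(mask_labels(ids, USER, MODEL))
```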

In [21]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA RTX 4000 Ada Generation. Max memory = 19.674 GB.
0.549 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [22]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 2 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 7,593,984 of 275,692,160 (2.75% trained)


Step,Training Loss
1,5.4291
2,5.3073
3,5.0449
4,4.2921
5,4.5635
6,3.9566
7,3.1464
8,2.8401
9,3.1276
10,3.5296
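
The `Num Epochs = 2` line in the log follows from the batch settings: 500 examples at 8 per step give 63 optimizer steps per epoch, so 100 steps spill into a second pass over the data. The arithmetic can be sketched as follows. `epochs_needed` is an illustrative helper, and the rounding is an approximation of what the trainer does.

```python
import math


def epochs_needed(num_examples, per_device_batch_size, grad_accum_steps, max_steps, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, epochs touched by max_steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return effective_batch, steps_per_epoch, epochs


batch, steps_per_epoch, epochs = epochs_needed(
    num_examples=500, per_device_batch_size=8, grad_accum_steps=1, max_steps=100
)
print(f"effective batch = {batch}, steps per epoch = {steps_per_epoch}, epochs = {epochs}")
```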


In [23]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

51.5878 seconds used for training.
0.86 minutes used for training.
Peak reserved memory = 1.234 GB.
Peak reserved memory for training = 0.685 GB.
Peak reserved memory % of max memory = 6.272 %.
Peak reserved memory for training % of max memory = 3.482 %.
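
The percentages above are simple ratios of reserved memory to the card's total. The sketch below separates that arithmetic from the `torch.cuda` calls so it can be checked against the numbers printed in this run. `memory_report` is an illustrative helper.

```python
def memory_report(peak_gb, baseline_gb, total_gb):
    """Derive training-only memory and percentages from three GB readings."""
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Values taken from the outputs of this run.
report = memory_report(peak_gb=1.234, baseline_gb=0.549, total_gb=19.674)
print(report)
```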


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [24]:
# Reuse the system and user turns from row 10 as the prompt
sample = dataset[10]["conversations"]
messages = [
    {"role": "system", "content": sample[0]["content"]},
    {"role": "user", "content": sample[1]["content"]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
).removeprefix('<bos>')

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 125,
    temperature = 1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

*sweats proudly, then rain bakes*<end_of_turn>
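
To make the `temperature`, `top_k` and `top_p` settings used above more concrete, here is a simplified pure-Python sketch of how such filters narrow the next-token distribution before sampling. It is a teaching aid under stated assumptions (temperature first, then top-k, then top-p over the renormalized top-k candidates), not the `transformers` implementation. The function names and toy logits are made up.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]  # unnormalized softmax

    # Top-k: keep only the k highest-scoring candidates.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filter_distribution(toy_logits, top_k=3, top_p=0.75))
print(sample(toy_logits, top_k=3, top_p=0.75))
```

Lower `top_p` or `top_k` values prune more of the tail and make the output more predictable, while higher `temperature` flattens the distribution before the filters apply.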


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("your_name/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/gemma-3", token = "...") # Online saving

('gemma-3/tokenizer_config.json',
 'gemma-3/special_tokens_map.json',
 'gemma-3/chat_template.jinja',
 'gemma-3/tokenizer.model',
 'gemma-3/added_tokens.json',
 'gemma-3/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "gemma-3", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = False,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
hf_token ="YOUR_HF_TOKEN"
hf_repo = "droidnext/gemma_3b_270m_pluto_tunning"

In [None]:
# Enable ONE option per repo: pushes to the same repo overwrite files with the same name.
# Merge to 16bit (the format loaded by the test cell below)
if True: # Pushing to HF Hub
    model.push_to_hub_merged(hf_repo, tokenizer, save_method = "merged_16bit", token = hf_token)

# Merge to 4bit
if False: # Pushing to HF Hub
    model.push_to_hub_merged(hf_repo, tokenizer, save_method = "merged_4bit", token = hf_token)

# Just LoRA adapters
if False: # Pushing to HF Hub
    model.push_to_hub(hf_repo, token = hf_token)
    tokenizer.push_to_hub(hf_repo, token = hf_token)
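
Conceptually, a merged save folds each adapter into its base weight as `W + (lora_alpha / r) * B @ A`, so the result can be loaded without any LoRA code. The sketch below shows that formula on toy 2x2 matrices in plain Python. It illustrates the math only and is not Unsloth's merge routine. `merge_lora` and the matrices are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Return W + (lora_alpha / rank) * B @ A as a new matrix."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[3.0, 4.0]]              # LoRA A, rank x d_in
B = [[1.0], [2.0]]            # LoRA B, d_out x rank
merged = merge_lora(W, A, B, lora_alpha=2, rank=1)
print(merged)
```

Applying `merged` to an input gives the same result as applying `W` and the scaled adapter path separately, which is why merging does not change the model's outputs apart from rounding.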


## Test the Tuned Model

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "droidnext/gemma_3b_270m_pluto_tunning"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",       # Use the dtype stored in the checkpoint config
    device_map="auto"         # Place the weights on a GPU automatically when one is available
)

In [8]:
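# Example inference
# TRANSFORMERS_VERBOSITY is an environment variable, so assigning it as a Python
# variable has no effect; set the log level through the logging API instead.
from transformers import logging as hf_logging
hf_logging.set_verbosity_info()

prompt = "Hold still for a picture!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens = 125,
    temperature = 1, top_p = 0.95, top_k = 64,
)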

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hold still for a picture!

*takes a few hesitant steps, then bounds through a doorway with a snap*

*eyes alert, ears on high alert*

*marches proudly in front of the doorway, then back to the ground*

*marches proudly in front of the doorway, then back to the ground*

*marches proudly in front of the doorway, then back to the ground*

*marches proudly in front of the doorway, then back to the ground*

*marches proudly in front of the doorway, then back to the ground*

*marches proudly in front of the doorway, then back
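
The generation above was prompted with a bare string, while the training examples were wrapped in Gemma-3 turn markers with the task description in front of the user message. Matching that layout at inference time may give responses closer to the finetuning data. A minimal sketch follows. `build_pluto_prompt` is a hypothetical helper, and the task text is copied verbatim from the dataset sample shown earlier.

```python
# Task description copied verbatim from the dataset row printed earlier.
TASK = (
    "Given a input respond with a tone and persona of disney character Pluto . "
    "Input Format: Input chat message talking to pluto. "
    "Output Format: Pluto responding to input chat message."
)


def build_pluto_prompt(user_message, task=TASK):
    """Wrap a user message in the same turn layout as the training text."""
    return (
        f"<start_of_turn>user\n{task}\n\n{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"  # open model turn: generation continues from here
    )


prompt = build_pluto_prompt("Hold still for a picture!")
print(prompt)
# This string can replace the bare prompt passed to the tokenizer above.
# <bos> is left out on the assumption that the tokenizer adds it, as in the training cells.
```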
