More details in this article: [Fine-Tuning Gemma 3 on Your Computer with LoRA and QLoRA (+model review)](https://kaitchup.substack.com/p/fine-tuning-gemma-3-on-your-computer)

This notebook shows how to fine-tune Gemma 3 on a single GPU. It supports full fine-tuning, LoRA, and QLoRA.

* Gemma 3 1B and 4B can be fine-tuned on a 24 GB GPU without quantization (LoRA)
* Gemma 3 12B can be fine-tuned on a 24 GB GPU with quantization (QLoRA)
* Gemma 3 27B can be fine-tuned on a 40 GB GPU with quantization (QLoRA). Technically, with very short sequences and without retraining the embeddings, QLoRA fine-tuning of the 27B model would also fit on a 24 GB GPU.
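
A back-of-the-envelope estimate helps explain these thresholds. The sketch below only counts the memory taken by the model weights. It assumes nominal parameter counts (1B, 4B, 12B, 27B), 2 bytes per parameter in bfloat16, and roughly 0.5 bytes per parameter for 4-bit quantization. It ignores quantization constants, modules kept in higher precision, activations, gradients, and optimizer states, so real usage is higher. The helper name `weight_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory used by the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB

# Nominal sizes (assumption): the real checkpoints differ slightly.
NOMINAL_PARAMS = {"1B": 1e9, "4B": 4e9, "12B": 12e9, "27B": 27e9}

for name, n in NOMINAL_PARAMS.items():
    bf16 = weight_memory_gib(n, 2.0)   # bfloat16: 2 bytes per parameter
    int4 = weight_memory_gib(n, 0.5)   # 4-bit: about half a byte per parameter
    print(f"Gemma 3 {name}: bf16 ~ {bf16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Under these assumptions, the 12B weights alone take more than 22 GiB in bfloat16, which leaves almost nothing for training on a 24 GB GPU, while the 4-bit 27B weights fit in well under 24 GiB before any training overhead is added.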

# Install

We need a special version of Transformers (the `v4.49.0-Gemma-3` tag). It will probably be merged into the main branch soon, after which a simple `--upgrade` of Transformers might be enough.

In [None]:
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3 trl peft datasets accelerate bitsandbytes

Collecting git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
  Cloning https://github.com/huggingface/transformers (to revision v4.49.0-Gemma-3) to /tmp/pip-req-build-d1th1sit
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-d1th1sit
  Running command git checkout -q 1c0f782fe5f983727ff245c4c1b3906f9b99eec2
  Resolved https://github.com/huggingface/transformers to commit 1c0f782fe5f983727ff245c4c1b3906f9b99eec2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=10936429 sha256=ced86dd1abe3b214b6e47150a40e65cc8de2cd8f2f260479512d17c3c220db02
  Stored in directory: /tmp/pip-eph

# Fine-Tuning Code

In [None]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, #not supported yet
    AutoTokenizer,
    Gemma3ForConditionalGeneration, #we need this
    BitsAndBytesConfig,
    set_seed
)
from trl import SFTTrainer, SFTConfig
set_seed(1234)

compute_dtype = torch.bfloat16
attn_implementation = 'eager'

def fine_tune(model_name, batch_size=1, gradient_accumulation_steps=32, LoRA=False, QLoRA=False):

  #The tokenizer already has a pad token.
  #According to the technical report, the BOS token must be added. The tokenizer argument for this is add_bos_token.
  tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True)
  tokenizer.padding_side = 'right'

  #Following recent practice, Google did not include the chat template in the tokenizer of the base model.
  #Copy it from the instruct model so we can fine-tune the model with the chat template.
  tokenizer.chat_template = AutoTokenizer.from_pretrained("google/gemma-3-4b-it", add_bos_token=True).chat_template

  ds_train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:30000]")
  #Add the EOS token
  def process(row):
      row["text"] = tokenizer.apply_chat_template(row["messages"], tokenize=False, add_generation_prompt=False)+tokenizer.eos_token
      return row

  ds_train = ds_train.map(
      process,
      num_proc= multiprocessing.cpu_count(),
      load_from_cache_file=False,
  )

  ds_train = ds_train.remove_columns(["messages","prompt","prompt_id"])

  print(ds_train[0])


  if QLoRA:
    bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=True,
    )
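    #AutoModelForCausalLM is not supported yet (see the imports), so use the Gemma 3 class here as well
    model = Gemma3ForConditionalGeneration.from_pretrained(
              model_name, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
    )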
    model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})
  else:
    model = Gemma3ForConditionalGeneration.from_pretrained(
              model_name, device_map={"": 0}, torch_dtype=compute_dtype, attn_implementation=attn_implementation
    )
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant':True})

  print(model)

  if LoRA or QLoRA:
    peft_config = LoraConfig(
            lora_alpha=16,
            lora_dropout=0.05,
            r=16,
            bias="none",
            task_type="CAUSAL_LM",
            target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"],
            modules_to_save=['embed_tokens', 'lm_head'] #because of the chat template potentially containing untrained special tokens, we need to retrain the embeddings
    )
  else:
    peft_config = None

  if LoRA:
    output_dir = "./LoRA/"
  elif QLoRA:
    output_dir = "./QLoRA/"
  else:
    output_dir = "./FFT/"

  training_arguments = SFTConfig(
          output_dir=output_dir,
          #eval_strategy="steps",
          #do_eval=True,
          optim="paged_adamw_8bit",
          per_device_train_batch_size=batch_size,
          gradient_accumulation_steps=gradient_accumulation_steps,
          #per_device_eval_batch_size=batch_size,
          log_level="debug",
          save_strategy="epoch",
          logging_steps=25,
          learning_rate=1e-5,
          bf16 = True,
          #eval_steps=25,
          num_train_epochs=1,
          warmup_ratio=0.1,
          lr_scheduler_type="linear",
          dataset_text_field="text",
          max_seq_length=1024,
          report_to="none"
  )

  trainer = SFTTrainer(
          model=model,
          train_dataset=ds_train,
          #eval_dataset=ds['test'],
          peft_config=peft_config,
          processing_class=tokenizer,
          args=training_arguments,
  )

  #--code by Unsloth: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=pCqnaKmlO1U9

  gpu_stats = torch.cuda.get_device_properties(0)
  start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
  max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
  print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
  print(f"{start_gpu_memory} GB of memory reserved.")

  trainer_ = trainer.train()


  used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
  used_memory_for_trainer= round(used_memory - start_gpu_memory, 3)
  used_percentage = round(used_memory/max_memory*100, 3)
  trainer_percentage = round(used_memory_for_trainer/max_memory*100, 3)
  print(f"{trainer_.metrics['train_runtime']} seconds used for training.")
  print(f"{round(trainer_.metrics['train_runtime']/60, 2)} minutes used for training.")
  print(f"Peak reserved memory = {used_memory} GB.")
  print(f"Peak reserved memory for training = {used_memory_for_trainer} GB.")
  print(f"Peak reserved memory % of max memory = {used_percentage} %.")
  print(f"Peak reserved memory for training % of max memory = {trainer_percentage} %.")
  print("-----")
  #----
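
The `modules_to_save` setting in the LoRA configuration above has a large effect on the number of trainable parameters. The sketch below estimates that number for `google/gemma-3-4b-pt` with `r=16`. The text-decoder dimensions (vocabulary size, hidden size, number of layers, projection sizes) are assumptions not stated in this notebook. The vision-tower dimensions (27 layers, 1152 features) come from the printed model. It also assumes that adapters attach to every module whose name matches `target_modules`, which includes `q_proj`, `k_proj`, and `v_proj` in the vision tower but not its `out_proj`. The helper `lora_params` is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int = 16) -> int:
    """A LoRA adapter adds two matrices: (r x in_features) and (out_features x r)."""
    return r * (in_features + out_features)

# Assumed text-decoder dimensions for gemma-3-4b (not given in this notebook)
VOCAB, HIDDEN, TEXT_LAYERS = 262_208, 2_560, 34
Q_OUT, KV_OUT, MLP = 2_048, 1_024, 10_240

# Vision tower dimensions, as shown in the printed model
VISION_LAYERS, VISION_DIM = 27, 1_152

text_per_layer = (
    lora_params(HIDDEN, Q_OUT)          # q_proj
    + 2 * lora_params(HIDDEN, KV_OUT)   # k_proj, v_proj
    + lora_params(Q_OUT, HIDDEN)        # o_proj
    + 2 * lora_params(HIDDEN, MLP)      # gate_proj, up_proj
    + lora_params(MLP, HIDDEN)          # down_proj
)
vision_per_layer = 3 * lora_params(VISION_DIM, VISION_DIM)  # q_proj, k_proj, v_proj

lora_total = TEXT_LAYERS * text_per_layer + VISION_LAYERS * vision_per_layer
# modules_to_save makes full trainable copies of embed_tokens and lm_head
saved_total = 2 * VOCAB * HIDDEN

total_trainable = lora_total + saved_total
print(f"LoRA adapters:   {lora_total:,}")
print(f"Saved modules:   {saved_total:,}")
print(f"Total trainable: {total_trainable:,}")
```

Under these assumptions, the total agrees with the `Number of trainable parameters` value in the Gemma 3 4B training log, and the copies of `embed_tokens` and `lm_head` account for roughly 98% of it. This is why skipping the retraining of the embeddings saves so much memory.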

# Example of LoRA Fine-Tuning for Gemma 3 4B

In [None]:
fine_tune("google/gemma-3-4b-pt", batch_size=1, gradient_accumulation_steps=32, LoRA=True)

Map (num_proc=12):   0%|          | 0/30000 [00:00<?, ? examples/s]

{'text': "<bos><start_of_turn>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<end_of_turn>\n<start_of_turn>model\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<end_of_turn>\n<start_of_turn>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<end_of_turn>\n<start_of_turn>mo

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Gemma3ForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(4096, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-26): 27 x SiglipEncoderLayer(
            (self_attn): SiglipAttention(
              (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear(in_features=1152,

Converting train dataset to ChatML:   0%|          | 0/30000 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/30000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/30000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/30000 [00:00<?, ? examples/s]

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
Currently training with a batch size of: 1
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: text. If text are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 30,000
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 32
  Total optimization steps = 937
  Number of trainable parameters = 1,375,293,440


GPU = NVIDIA L4. Max memory = 22.161 GB.
11.891 GB of memory reserved.


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
25,49.239
50,48.4804
75,46.7481
100,44.4121
125,43.1718
150,41.8848
175,41.4747
200,40.7445
225,39.5927
250,40.0644


Saving model checkpoint to ./LoRA/checkpoint-937
tokenizer config file saved in ./LoRA/checkpoint-937/tokenizer_config.json
Special tokens file saved in ./LoRA/checkpoint-937/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




31032.9612 seconds used for training.
517.22 minutes used for training.
Peak reserved memory = 18.529 GB.
Peak reserved memory for training = 6.638 GB.
Peak reserved memory % of max memory = 83.611 %.
Peak reserved memory for training % of max memory = 29.954 %.
-----
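
The step count and runtime reported above can be related with a little arithmetic. The sketch below assumes that an incomplete final group of accumulated micro-batches does not count as an optimization step, which is consistent with the 937 steps in this log but may differ across Trainer versions. The helper name `optimization_steps` is hypothetical.

```python
def optimization_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int = 1) -> int:
    """Number of optimizer updates, dropping an incomplete last accumulation group."""
    micro_batches = -(-num_examples // batch_size)  # ceiling division
    return epochs * (micro_batches // grad_accum)

# Values taken from the run above
steps = optimization_steps(num_examples=30_000, batch_size=1, grad_accum=32)
train_runtime = 31032.9612  # seconds, as reported by the trainer

seconds_per_step = train_runtime / steps
examples_per_second = steps * 32 / train_runtime

print(f"Optimization steps: {steps}")
print(f"Seconds per step: {seconds_per_step:.1f}")
print(f"Examples per second: {examples_per_second:.2f}")
```

On the L4 GPU used here, this works out to a little over 33 seconds per optimization step, or just under one training example per second.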
