# Part 1: Introduction & Setup

### **Live Model Fine-Tuning: Customizing Gemma 3 270M in Under 30 Minutes**

<div style="text-align: center;">
  <img src="images/devfest_waterloo.png" alt="DevFest Waterloo" style="width: 100%; max-width: 600px;">
</div>

This session provides a comprehensive, live demonstration of a complete fine-tuning workflow for Google's Gemma 3 270M model, executed in under 30 minutes. The primary objective is to illustrate how recent advancements in model architecture, training techniques, and software optimization have made customizing powerful Large Language Models (LLMs) accessible to a broader audience, even in resource-constrained environments.

The traditional approach, full fine-tuning, involves updating every single parameter in the model. This process is computationally intensive, requiring significant GPU memory and extended training times. This notebook will showcase a modern, efficient alternative by strategically combining three key technologies:

1.  **Gemma 3 270M**: A compact, state-of-the-art open model from Google. Its small footprint makes it an ideal candidate for rapid experimentation and deployment on consumer-grade hardware.
2.  **QLoRA (Quantized Low-Rank Adaptation)**: A highly efficient fine-tuning method. QLoRA reduces memory use by quantizing the base model's weights to 4 bits and then inserting small, trainable "adapter" layers. Only these adapters are updated; the base model's parameters stay frozen.
3.  **Unsloth**: An open-source library with optimized GPU kernels that, according to the project, can make fine-tuning up to 2x faster while reducing memory usage by 60-70%. Unsloth is what makes this sub-30-minute demonstration feasible on a free Google Colab GPU.

This rapid fine-tuning is possible due to the convergence of these three elements: a capable base model, a memory-frugal training technique, and a speed-optimized library.

**Agenda:**
1.  **Prerequisites:** What are Fine-Tuning and LoRA?
2.  **Setup & Installation:** Getting our environment ready.
3.  **Data Generation:** Creating a synthetic dataset with the Gemini Batch API.
4.  **Live Fine-Tuning:** Loading, configuring, and training our model.
5.  **Evaluation:** Was it worth it? A side-by-side comparison with an LLM-as-a-Judge.
6.  **Conclusion & Resources.**


### The Overall Fine-Tuning Pipeline

This diagram illustrates the end-to-end workflow we will follow in this notebook, from generating a synthetic dataset to exporting a locally-runnable model.

```mermaid
graph LR
    subgraph "Part 1: Data Generation"
        A[Start: Define Creative Themes] --> B{Gemini Batch API};
        B --> C[Generate Synthetic Dataset];
    end

    subgraph "Part 2: Model Training"
        C --> D[Load Base Gemma 3 Model];
        D --> E[Inject QLoRA Adapters];
        E --> F(Fine-Tune with Unsloth);
    end

    subgraph "Part 3: Evaluation & Export"
        F --> G{A/B Test: Base vs. Fine-Tuned};
        G --> H[LLM-as-a-Judge Verdict];
        H --> I[Save LoRA Adapters];
        I --> J[Export to GGUF for Local Inference];
        J --> K[End: Ready for LM Studio];
    end

    style F fill:#a2de89,stroke:#333,stroke-width:2px
    style J fill:#f9d77f,stroke:#333,stroke-width:2px
```


### What is Fine-Tuning?

Fine-tuning adapts a general pre-trained model to a specific task using a smaller, focused dataset. It's like teaching a polymath to become a specialist in a particular field.

### What is LoRA (Low-Rank Adaptation)?

Full fine-tuning requires updating all model weights, which is computationally expensive. **LoRA** offers a much more efficient alternative. It freezes the original weights and injects small, trainable "adapter" matrices (A and B) into the model. We only train these adapters, which are a tiny fraction of the total parameters.
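
To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-augmented linear layer. The tiny dimensions, the `matmul` helper, and the function names are illustrative assumptions rather than any library's API; real implementations such as PEFT work on tensors and handle initialization for you.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Return x @ W + (alpha / r) * (x @ A) @ B for a batch of row vectors x."""
    base = matmul(x, W)               # frozen path through the original weights
    update = matmul(matmul(x, A), B)  # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def param_counts(d_in, d_out, r):
    """Return (full layer weights, adapter weights) for one linear layer."""
    return d_in * d_out, r * (d_in + d_out)

# Toy layer: 4 inputs, 3 outputs, rank 2.
d_in, d_out, r, alpha = 4, 3, 2, 4
W = [[0.1 * (i + j) for j in range(d_out)] for i in range(d_in)]  # frozen
A = [[0.01 * (i - j) for j in range(r)] for i in range(d_in)]     # trainable, small init
B = [[0.0] * d_out for _ in range(r)]                             # trainable, zero init
x = [[1.0, 2.0, 3.0, 4.0]]

y_base = matmul(x, W)
y_lora = lora_forward(x, W, A, B, alpha, r)  # identical to y_base while B is all zeros

full, adapter = param_counts(640, 640, 16)   # a hypothetical 640 x 640 layer at rank 16
print(f"adapter / full = {adapter / full:.1%}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one; training then moves only `A` and `B`.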

<div style="text-align: center;">
  <img src="images/what_is_lora.png" alt="LoRA Diagram" style="width: 100%; max-width: 800px;">
</div>

This makes fine-tuning incredibly fast and memory-efficient. But does it compromise on quality?

> #### **Key Insight from "LoRA Without Regret"**
>
> Research shows that LoRA can match the performance of full fine-tuning if configured correctly. One of the most critical, and often overlooked, conditions is **applying LoRA to all possible layers.**
>
> From the blog post: *"Attention-only LoRA significantly underperforms MLP-only LoRA, and does not further improve performance on top of LoRA-on-MLP."*
>
> This is because the MLP (feed-forward) layers contain a vast number of parameters compared to the attention layers. By ignoring them, you're leaving most of the model's knowledge untapped and creating a bottleneck for learning. **Therefore, we will target all linear layers in our configuration.**
>
> Link to the blog post: [LoRA Without Regret](https://thinkingmachines.ai/blog/lora/)

<div style="text-align: center;">
  <img src="images/lora_without_regret.png" alt="LoRA Diagram" style="width: 100%; max-width: 800px;">
</div>
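
As a rough check on the claim that the MLP layers hold most of the weights, the sketch below counts base and LoRA parameters per module type. The layer shapes are assumptions about Gemma 3 270M (hidden size 640, MLP width 2048, 4 query heads and 1 key/value head of size 256, 18 layers); verify them against `model.config` before relying on the numbers.

```python
# Assumed Gemma 3 270M shapes (not read from the model; check model.config).
HIDDEN, INTERMEDIATE = 640, 2048
Q_OUT, KV_OUT = 1024, 256  # 4 query heads and 1 key/value head, each of size 256
NUM_LAYERS, RANK = 18, 16

# (input features, output features) of every linear layer in one transformer block
ATTENTION = {"q_proj": (HIDDEN, Q_OUT), "k_proj": (HIDDEN, KV_OUT),
             "v_proj": (HIDDEN, KV_OUT), "o_proj": (Q_OUT, HIDDEN)}
MLP = {"gate_proj": (HIDDEN, INTERMEDIATE), "up_proj": (HIDDEN, INTERMEDIATE),
       "down_proj": (INTERMEDIATE, HIDDEN)}

def base_params(shapes):
    """Weights in the frozen linear layers of one block."""
    return sum(d_in * d_out for d_in, d_out in shapes.values())

def lora_params(shapes, rank):
    """A LoRA pair on a d_in x d_out layer adds rank * (d_in + d_out) weights."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())

attn_base = base_params(ATTENTION) * NUM_LAYERS
mlp_base = base_params(MLP) * NUM_LAYERS
attn_lora = lora_params(ATTENTION, RANK) * NUM_LAYERS
mlp_lora = lora_params(MLP, RANK) * NUM_LAYERS

print(f"Base weights  attention: {attn_base:,}  MLP: {mlp_base:,}")
print(f"LoRA weights  attention: {attn_lora:,}  MLP: {mlp_lora:,}")
```

Under these assumed shapes, the MLP projections carry more than twice as many base weights as the attention projections, which is consistent with the argument for targeting them.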


### Install Dependencies

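Install Unsloth from GitHub for the latest patches and cap the versions of the other libraries for reproducibility.

In [None]:
# Cell 1: Install Dependencies
# Install Unsloth from GitHub for the latest patches.
!pip install "unsloth[colab-new]@git+https://github.com/unslothai/unsloth.git"
# Cap the remaining libraries with upper bounds; --no-deps skips dependency
# resolution so pip does not pull in or replace other packages.
!pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0"

# Install both Google AI SDKs: google-genai provides `from google import genai`
# (the Batch API client), and google-generativeai provides `google.generativeai`
# (used by the judge cell).
!pip install -q -U google-genai google-generativeai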

### Configure API Keys

To run this notebook, you'll need API keys for both Google Gemini and the Hugging Face Hub. For security, we'll use Colab's built-in secret manager.

1.  Click on the **key icon** (🔑) in the left-hand sidebar.
2.  Create two new secrets:
    *   **Name:** `GOOGLE_API_KEY` -> **Value:** Your Gemini API Key
    *   **Name:** `HF_TOKEN` -> **Value:** Your Hugging Face access token (with `write` permissions)

Once you've added these secrets, the notebook will be able to access them securely.
<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/colab_secrets.png" width="300">
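
If you also want to run the notebook outside Colab, one option is a small helper that tries Colab's secret manager first and falls back to environment variables. This is a sketch: the `get_secret` name is our own, and only `userdata.get` comes from Colab.

```python
import os

def get_secret(name):
    """Return a secret from Colab's secret manager, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        pass  # not in Colab, or the secret is missing or access was not granted
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} not found in Colab secrets or the environment.")
    return value
```

For example, `get_secret("HF_TOKEN")` would return the same token in Colab and on a local machine where `HF_TOKEN` is exported.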


In [None]:
#@title  Notebook Configurations
#@markdown ---
#@markdown ### **Model & Dataset**
model_name = "unsloth/gemma-3-270m-it" #@param {type:"string"}
max_seq_length = 2048 #@param {type:"integer"}
use_pregenerated_dataset = True #@param {type:"boolean"}
hf_user = "your-username" #@param {type:"string"}

#@markdown ---
#@markdown ### **LoRA Parameters**
#@markdown These settings control the size and complexity of the LoRA adapters.
lora_r = 16 #@param {type:"slider", min:4, max:64, step:4}

#@markdown ---
#@markdown ### **Training Parameters**
#@markdown Adjust these to control the training process.
max_steps = 150 #@param {type:"integer"}
learning_rate = 2e-4 #@param {type:"number"}
#@markdown ---


### Import Libraries & Configure Secrets    

In [None]:
# Core ML and Unsloth imports
from unsloth import FastModel
import torch
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments
from datasets import load_dataset

# Google Gemini API imports
from google import genai
from google.genai import types
import json
import time
from google.colab import userdata

# Get API keys from Colab secrets
hf_token = userdata.get('HF_TOKEN')
google_api_key = userdata.get('GOOGLE_API_KEY')

# Part 2: Data is Fuel - Synthetic Data Generation


### Bootstrapping Our Dataset with the Gemini Batch API

<div style="text-align: center;">
  <img src="images/gemini_batch_api.png" alt="Gemini Batch API" style="width: 100%; max-width: 800px;">
</div>

A common challenge in fine-tuning is the "cold start" problem, where you have a task in mind but lack labeled training examples. **Synthetic data generation** offers a powerful solution, allowing us to use a highly capable LLM to create a labeled dataset from scratch.

This section demonstrates how to use the **Gemini Batch API** to generate our creative writing dataset. This API is specifically designed for high-volume, asynchronous tasks. It processes large numbers of requests in parallel at a 50% cost discount compared to its real-time counterpart, making it an ideal tool for large-scale data generation.

The workflow is as follows:
1. A prompt template is designed to instruct the Gemini model to generate creative stories.
2. These prompts are formatted into a JSON Lines (JSONL) file.
3. The file is submitted as an asynchronous batch job.

Since batch jobs are not instantaneous, this demonstration shows the code for submitting the job and then proceeds with a pre-generated set of results to maintain the pace of the live session. Each theme is submitted many times at a high sampling temperature, so the dataset contains several different stories per theme rather than a single example.


In [None]:
# --- CONFIGURATION ---
HF_DATASET_NAME = f"{hf_user}/gemma-3-creative-writing"

if not use_pregenerated_dataset:
    # --- Gemini API Setup ---
    # Initialize the client for Batch API usage
    client = genai.Client(api_key=google_api_key)

    # --- Data Generation Pipeline ---
    SYSTEM_PROMPT = "You are a master storyteller. Write a short, imaginative story based on the user's request. The story should be concise and suitable for a general audience. The story should not exceed 2048 tokens."

    STORY_THEMES = [
        "A robot who discovers music for the first time.",
        "A magical library where books come to life.",
        "A detective who solves crimes in a city powered by steam.",
        "Two pen pals from different planets meeting for the first time.",
        "A mischievous forest spirit who plays pranks on hikers.",
        "The last dragon on Earth sharing its wisdom with a young child.",
    ]

    def create_batch_requests(themes, system_prompt):
        """Creates a JSONL file for the Gemini Batch API."""
        jsonl_file_path = 'synthetic_story_requests.jsonl'
        with open(jsonl_file_path, 'w') as f:
            for i, theme in enumerate(themes):
                request = {
                    "key": f"request_{i}",
                    "request": {
                        # `contents` only accepts "user" and "model" roles,
                        # so the system prompt goes in `system_instruction`.
                        "system_instruction": {"parts": [{"text": system_prompt}]},
                        "contents": [
                            {"role": "user", "parts": [{"text": theme}]}
                        ],
                        "generation_config": {
                            "max_output_tokens": 2048,
                            "temperature": 0.9,
                        }
                    }
                }
                f.write(json.dumps(request) + '\n')
        return jsonl_file_path

    # Create and submit the batch job (simplified for the demo)
    print("Creating batch requests file...")
    requests_file = create_batch_requests(STORY_THEMES * 50, SYSTEM_PROMPT) # 6 themes x 50 = 300 requests
    print(f"Uploading file: {requests_file}")
    # Set the MIME type explicitly, since it may not be inferred from the .jsonl extension.
    uploaded_file = client.files.upload(
        file=requests_file,
        config=types.UploadFileConfig(display_name="synthetic-story-requests", mime_type="jsonl"),
    )
    batch_job = client.batches.create(
        model="models/gemini-1.5-flash-latest",
        src=uploaded_file.name,
    )
    print(f"Batch job created: {batch_job.name}")

    # Polling logic would go here...
    # After completion, results would be parsed and saved to a Hugging Face dataset.
    print("For the demo, we will now proceed with the pre-generated dataset.")
    use_pregenerated_dataset = True
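
The polling step that the cell above leaves out could look roughly like the sketch below. It assumes the job object exposes `state.name`, that terminal states use the `JOB_STATE_*` names from the Batch API documentation, and that each result line carries a `key` plus a standard `response` with `candidates`; the helper names `wait_for_batch` and `parse_result_line` are our own.

```python
import json
import time

# Assumed terminal state names; check the Batch API documentation for the current list.
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED",
                   "JOB_STATE_CANCELLED", "JOB_STATE_EXPIRED"}

def wait_for_batch(get_job, job_name, poll_seconds=30, max_polls=240):
    """Call get_job(name=...) repeatedly until the job reaches a terminal state."""
    for _ in range(max_polls):
        job = get_job(name=job_name)
        if job.state.name in TERMINAL_STATES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")

def parse_result_line(line):
    """Extract (request key, generated text) from one JSONL result line."""
    record = json.loads(line)
    parts = record["response"]["candidates"][0]["content"]["parts"]
    return record["key"], "".join(part.get("text", "") for part in parts)
```

In the notebook this might be invoked as `wait_for_batch(client.batches.get, batch_job.name)`, after which the result file can be downloaded and each line passed through `parse_result_line` before building the Hugging Face dataset.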


# Part 3: Live Fine-Tuning with Unsloth


In [None]:
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    load_in_4bit = True, # Use QLoRA for memory efficiency
)


### Data Formatting for Gemma 3

Instruction-tuned models like Gemma 3 are trained on data formatted with specific "chat templates." These templates use special tokens to delineate conversational turns (e.g., from a user and a model). Adhering to this format during fine-tuning is critical for the model to correctly interpret the task and generate responses in the desired style.

The Gemma 3 chat template follows this structure:

`<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n{response}<end_of_turn>\n`

Failing to use the correct template introduces a **distribution shift** between our fine-tuning data and the model's original training data. This mismatch can confuse the model and degrade its output. The formatting step is therefore more than data cleaning: it aligns our custom data with the conversational format the model already expects.
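
To see what the template does to a conversation, here is a hand-written approximation in plain Python. It is for illustration only: `render_gemma_turns` is a hypothetical helper, and in practice `tokenizer.apply_chat_template` is the authoritative implementation.

```python
def render_gemma_turns(messages, add_generation_prompt=False):
    """Approximate the Gemma chat format for a list of {"role", "content"} dicts."""
    role_names = {"user": "user", "assistant": "model"}  # Gemma calls the assistant "model"
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        text += "<start_of_turn>model\n"  # cue the model to start its reply
    return text

example = render_gemma_turns([
    {"role": "user", "content": "Write a haiku about rain."},
    {"role": "assistant", "content": "Soft rain on the roof."},
])
print(example)
```

Training examples include the model's turn, while inference prompts stop after the generation cue (`add_generation_prompt=True`).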


In [None]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

if use_pregenerated_dataset:
    # This is a placeholder. You should create and upload your own dataset.
    # If you haven't, you can use a public one like "databricks/databricks-dolly-15k"
    # For this demo, we assume a dataset with 'prompt' and 'response' columns.
    try:
        dataset = load_dataset(HF_DATASET_NAME, split="train")
    except Exception as e:
        print(f"Could not load dataset from {HF_DATASET_NAME} ({e}). Please ensure it exists and is public.")
        print("Using a fallback dataset for demonstration purposes.")
        dataset = load_dataset("tatsu-lab/alpaca", split="train[:500]")
        # Remap columns to match our expected format.
        # In Alpaca the task lives in 'instruction'; 'input' is optional context and often empty.
        dataset = dataset.rename_columns({'instruction': 'prompt', 'output': 'response'})


# Convert each example into a list of chat turns
def format_for_gemma(example):
    return {
        "conversations": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

dataset = dataset.map(format_for_gemma, remove_columns=dataset.column_names)

print("Dataset formatted. Here's an example:")
print(dataset[0]['conversations'])

# Attach the Gemma 3 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma3",
)

# Render each conversation into the single 'text' string that SFTTrainer reads.
# The leading <bos> is stripped because the tokenizer adds it again during training.
def apply_template(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>')
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(apply_template, batched = True)


### The Fine-Tuning Core: QLoRA with Unsloth

This section constitutes the core of the demonstration: the fine-tuning process itself. Here, QLoRA is employed to adapt the Gemma 3 model. The Unsloth library's `FastModel` class is central to this, automatically applying memory and speed optimizations that abstract away much of the boilerplate code.

A key focus here is the **deliberate and evidence-based selection of LoRA hyperparameters**. The choices for rank (`r`), `lora_alpha`, and `target_modules` are not arbitrary; they are directly informed by empirical findings and best practices to maximize performance.

| Hyperparameter     | Value             | Justification & Source                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|:-------------------|:------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **r (Rank)**       | 16                | The rank determines the capacity of the LoRA adapters. `r = 16` provides a robust balance between expressive power and parameter efficiency for many tasks. While higher ranks can capture more complex patterns, they also increase memory usage and the risk of overfitting. (Source: Unsloth LoRA Guide)                                                                                                                                                                |
| **lora_alpha**     | 32                | The alpha parameter acts as a scalar for the adapter weights. A common and effective heuristic is to set `lora_alpha = 2 * r`. This scaling gives more weight to the fine-tuned adjustments, helping the model learn the new task more effectively. (Source: Unsloth LoRA Guide)                                                                                                                                                                                           |
| **target_modules** | All Linear Layers | This is a critical choice based on strong empirical evidence. Research shows that applying LoRA to all major linear layers—encompassing both self-attention (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the feed-forward/MLP blocks (`gate_proj`, `up_proj`, `down_proj`)—yields performance far superior to applying it only to attention layers. This ensures the adaptation happens across the model's full representational capacity. (Source: "LoRA Without Regret") |



In [None]:
model = FastModel.get_peft_model(
    model,
    r = lora_r,
    lora_alpha = lora_r * 2, # Heuristic: 2 * r.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_dropout = 0.05, # A bit of regularization
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)


In [None]:
from unsloth.chat_templates import train_on_responses_only

# 1. Set up the trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text", # Column holding the chat-templated training text
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Effective batch size of 4 x 4 = 16
        warmup_steps = 10,
        max_steps = max_steps,
        learning_rate = learning_rate,
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# 2. Mask out user prompts for better performance
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

# 3. Start training!
trainer_stats = trainer.train()
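
Conceptually, masking the user prompt means setting the labels of every prompt token to an ignore value so that only the model's reply contributes to the loss. The sketch below shows the idea for a single-turn example using made-up token ids; `mask_prompt_tokens` is a hypothetical helper, not what `train_on_responses_only` does internally.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt_tokens(tokens, response_marker):
    """Return labels that ignore everything up to and including the response marker.

    tokens and response_marker are lists of token ids; only single-turn
    examples are handled in this sketch.
    """
    n = len(response_marker)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == response_marker:
            cut = start + n
            return [IGNORE_INDEX] * cut + tokens[cut:]
    return [IGNORE_INDEX] * len(tokens)  # marker not found: nothing to train on

# Made-up ids: [user turn ...] [marker for "<start_of_turn>model\n"] [reply ...]
tokens = [1, 10, 11, 20, 21, 30, 31, 32]
labels = mask_prompt_tokens(tokens, response_marker=[20, 21])
print(labels)
```

Only the last three positions keep their token ids as labels, so the gradient comes from the story text rather than from the prompt the user supplied.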


In [None]:
# Part 3.5: Inference & Evaluation
import torch

# --- 1. Inference with Fine-Tuned Model ---
prompt = "Write a short story about a lighthouse keeper who befriends a whale"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True).removeprefix('<bos>')
inputs = tokenizer(text, return_tensors = "pt").to("cuda")

# Generate output and decode, skipping the prompt
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
finetuned_model_output = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

print("--- Fine-Tuned Model Output ---")
print(finetuned_model_output)


# --- 2. Inference with Base Model ---
# Keep the fine-tuned `model` in memory: later cells still need it to save the adapters,
# and a 270M-parameter model loaded in 4-bit leaves ample VRAM for a second copy.
base_model, _ = FastModel.from_pretrained(
    model_name = model_name, # Load the same base model for a fair comparison
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)

# Generate output and decode, skipping the prompt
outputs = base_model.generate(**inputs, max_new_tokens = 256, use_cache = True)
base_model_output = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

print("\n--- Base Model Output ---")
print(base_model_output)


### Evaluation Pipeline

The diagram below shows our evaluation process. We generate responses from both the original base model and our newly fine-tuned model, then pass both outputs to a more powerful "judge" model for an objective comparison.

```mermaid
graph TD
    subgraph "Input"
        A[Test Prompt]
    end

    subgraph "Model Inference"
        A --> B[Base Gemma 3 Model];
        A --> C[Fine-Tuned Gemma 3 Model];
        B --> D[Output A: Base Response];
        C --> E[Output B: Fine-Tuned Response];
    end

    subgraph "Judgment"
        D --> F{"LLM-as-a-Judge<br>(Gemini 1.5 Flash)"};
        E --> F;
        F --> G[JSON Verdict & Comparison];
    end

    style F fill:#f9d77f,stroke:#333,stroke-width:2px
    style G fill:#a2de89,stroke:#333,stroke-width:2px
```


### LLM-as-a-Judge: The Verdict

We have the outputs, but which is better? Let's ask an impartial judge.


In [None]:
# The judge uses the older `google.generativeai` SDK. Note that this import rebinds the
# name `genai`, which earlier pointed to `google.genai`; re-run the imports cell before
# using the Batch API client again.
import google.generativeai as genai
import json
import os

# Configure the API key for the judge model, preferring Colab secrets.
try:
    from google.colab import userdata
    genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))
except Exception:
    # Not running in Colab, or the secret is missing or not accessible.
    if os.environ.get("GOOGLE_API_KEY"):
         genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
    else:
         print("Please set your GOOGLE_API_KEY in Colab Secrets or as an environment variable to run the judge model.")

# Use f-strings to embed the model outputs directly into the prompt
JUDGE_PROMPT = f"""
You are an expert AI model evaluator. You will be given a prompt and two stories generated by two different AI models. Your task is to analyze them and determine which one is better.

**Prompt:** {prompt}

**Story A (from Base Model):**
{base_model_output}

**Story B (from Fine-Tuned Model):**
{finetuned_model_output}

**Evaluation Criteria:**
1.  **Creativity:** How original, imaginative, and engaging is the story?
2.  **Coherence:** Is the story logical, well-structured, and easy to follow?
3.  **Adherence to Prompt:** How well does the story capture the essence of the user's request?

**Instructions:**
Provide a step-by-step analysis comparing the two stories based on the criteria above. Conclude with a final verdict in JSON format, declaring which model produced the better story and why. Provide scores from 1-10 for each criterion.

**JSON Output Format:**
```json
{{
  "winner": "Model A" or "Model B",
  "reasoning": "A brief explanation for your choice, highlighting the key differences.",
  "scores": {{
    "model_a": {{ "creativity": X, "coherence": Y, "adherence": Z }},
    "model_b": {{ "creativity": X, "coherence": Y, "adherence": Z }}
  }}
}}
```
"""

try:
    judge_model = genai.GenerativeModel('gemini-1.5-flash-latest')
    response = judge_model.generate_content(JUDGE_PROMPT)
    print(response.text)
except Exception as e:
    print(f"An error occurred while calling the judge model: {e}")
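
Since the judge replies with free-form analysis followed by a JSON verdict, it can help to pull the verdict out programmatically. The sketch below assumes the verdict arrives inside a fenced `json` block, as the prompt requests; `extract_verdict` is a hypothetical helper, and a judge that ignores the format will simply yield `None`.

```python
import json
import re

def extract_verdict(response_text):
    """Return the last parseable JSON object found in a fenced json block, or None."""
    # `{3} matches a run of three backticks, i.e. a markdown code fence.
    blocks = re.findall(r"`{3}json\s*(.*?)`{3}", response_text, flags=re.DOTALL)
    for block in reversed(blocks):
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block, try an earlier one
    return None
```

With the cell above, `verdict = extract_verdict(response.text)` would give a dictionary whose `winner` and `scores` fields can be logged or aggregated over several test prompts.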


# Part 4: Conclusion & Resources


### Saving the Model Adapters
The final step is to save the trained LoRA adapters. These adapters are lightweight, typically only a few megabytes in size, which underscores the efficiency of the PEFT methodology. They contain all the new knowledge learned during fine-tuning and can be easily stored, shared, and loaded on top of the base Gemma 3 model for future inference tasks.


In [None]:
# Save the LoRA adapters
model.save_pretrained("gemma-3-creative-writer-lora")
tokenizer.save_pretrained("gemma-3-creative-writer-lora")

# You can also merge the adapters and save as a full model
# model.save_pretrained_merged("gemma-3-creative-writer-merged", tokenizer, save_method = "merged_16bit")


### Exporting to GGUF for Local Inference

To run our fine-tuned model on a local machine using tools like LM Studio or Ollama, we need to convert it to **GGUF (GPT-Generated Unified Format)**. This format is designed for efficient inference on CPUs and a range of GPUs, and it is widely supported by local inference tools.

The process involves two steps:
1.  **Merge Adapters:** First, we merge the trained LoRA adapters back into the base model to create a full, fine-tuned model.
2.  **Save as GGUF:** We then save this merged model in the GGUF format. Unsloth simplifies this with a built-in `save_pretrained_gguf` method.
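
Merging works because the adapter is just a low-rank update to each weight matrix: folding it into the base weights gives the same outputs as running the two paths separately. Here is a toy check in plain Python, with made-up matrices and hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weights: W + (alpha / r) * (A @ B)."""
    scale = alpha / r
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.0, 2.0, 0.0]]  # base, 4 x 3
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]                      # adapter, 4 x 2
B = [[0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]                                    # adapter, 2 x 3
alpha, r = 4, 2
x = [[1.0, 2.0, 3.0, 4.0]]

# Path 1: one matrix multiply with the merged weights.
y_merged = matmul(x, merge_lora(W, A, B, alpha, r))

# Path 2: base output plus the scaled adapter output.
adapter_out = matmul(matmul(x, A), B)
y_separate = [[b + (alpha / r) * u for b, u in zip(b_row, u_row)]
              for b_row, u_row in zip(matmul(x, W), adapter_out)]

print(y_merged, y_separate)
```

After merging, the model no longer needs the PEFT wrapper at inference time, which is what allows it to be exported as a single GGUF file.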


In [None]:
# Reload the fine-tuned model from the adapters saved above,
# so the export starts from the saved checkpoint.
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name = "gemma-3-creative-writer-lora", # Load our saved adapters
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)

# Merge the LoRA adapters into the base model.
# This creates a new, full model with our fine-tuned weights.
model.save_pretrained_merged("gemma-3-creative-writer-merged", tokenizer, save_method = "merged_16bit",)

# Now, save the merged model in GGUF format.
model.save_pretrained_gguf("gemma-3-creative-writer-gguf", tokenizer)

print("Model successfully converted to GGUF format.")
print("You can now find the GGUF file in the 'gemma-3-creative-writer-gguf' directory.")


### Conclusion

In under 30 minutes, we have:
1.  Understood the theory behind effective LoRA configuration.
2.  Set up a pipeline for generating a synthetic dataset with the Gemini Batch API.
3.  Fine-tuned a Gemma 3 model on a custom task.
4.  **Compared the base and fine-tuned outputs** using an LLM-as-a-Judge.

This demonstrates how accessible and powerful modern AI development can be with the right tools and techniques.

### Acknowledgements & Further Reading
*   **Unsloth:** [GitHub Repository](https://github.com/unslothai/unsloth)
*   **Key Reference:** [LoRA Without Regret - Thinking Machines Lab](https://thinkingmachines.ai/blog/lora/)
*   **Gemini API:** [Batch Mode Documentation](https://ai.google.dev/gemini-api/docs/batch-mode)
