# Fine-tuning an LLM on a MacBook Pro
### System Configuration
- MacBook Pro (M4 Pro)
- 48GB RAM
- GPU Cores: 20
- Metal Support: Metal 4

### MPS (Metal Performance Shaders)
- MPS is the Apple-provided GPU backend that PyTorch can use on Apple Silicon. It typically speeds up training and inference compared to running on the CPU.
- PyTorch tensors and model weights can be moved to the GPU with `device="mps"`.
- The forward and backward passes then run on the GPU, which is usually much faster than the CPU.


### Environment Setup

```
uv init --python 3.12
uv venv
source .venv/bin/activate
uv add ipykernel jupyter
uv add torch torchvision torchaudio
uv add transformers accelerate datasets huggingface_hub safetensors peft

# Register your venv as a fresh kernel
python -m ipykernel install --user \
  --name=gemma270m \
  --display-name "Gemma 270M (venv)"
```

Restart the IDE and select the kernel `Gemma 270M (venv)`.

In [24]:
# Check whether MPS is available to train on Apple GPU (Metal):
import torch
print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())

MPS available: True
MPS built: True


### Configure Hugging Face authentication
In a terminal, run:
```
huggingface-cli login
```
and paste your token from https://huggingface.co/settings/tokens when prompted.

### Load the model and the tokenizer
- Tokenizer:
  - Maps words/subwords → integers
  - Handles special tokens like <pad>, <eos>
  - Formats input sequences the way the model expects (for example, through the chat template)

- Model:
  - PyTorch object containing all 270M weights
  - Decoder-only transformer: takes token IDs → predicts next token logits
  - `from_pretrained` downloads the weights and config (caching them locally) and builds the model from them
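
To make the tokenizer's role concrete, here is a toy whitespace tokenizer in plain Python. It is only a sketch: the `ToyTokenizer` class, its methods, and its tiny vocabulary are hypothetical, and the real Gemma tokenizer uses a much larger learned subword vocabulary rather than whole words.

```python
# Toy illustration only: a whitespace tokenizer with <pad>, <eos>, and <unk> special tokens.
# The class name and vocabulary are hypothetical, not part of transformers.
class ToyTokenizer:
    def __init__(self, words):
        # Reserve the first IDs for special tokens, then number the known words.
        self.vocab = {"<pad>": 0, "<eos>": 1, "<unk>": 2}
        for word in words:
            self.vocab.setdefault(word, len(self.vocab))
        self.inverse = {idx: tok for tok, idx in self.vocab.items()}

    def encode(self, text, max_length):
        # Map each word to its ID, append <eos>, then pad to a fixed length.
        ids = [self.vocab.get(w, self.vocab["<unk>"]) for w in text.lower().split()]
        ids = ids[: max_length - 1] + [self.vocab["<eos>"]]
        return ids + [self.vocab["<pad>"]] * (max_length - len(ids))

    def decode(self, ids):
        # Drop padding and map IDs back to tokens.
        return " ".join(self.inverse[i] for i in ids if i != self.vocab["<pad>"])


toy = ToyTokenizer(["best", "places", "to", "visit", "in", "switzerland"])
ids = toy.encode("Best places to visit in Japan", max_length=8)
print(ids)              # [3, 4, 5, 6, 7, 2, 1, 0]
print(toy.decode(ids))  # best places to visit in <unk> <eos>
```

The unknown word "Japan" maps to `<unk>` here; a subword tokenizer would instead split it into known pieces.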

In [23]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "google/functiongemma-270m-it"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model
model = AutoModelForCausalLM.from_pretrained(model_name)

### Move model to GPU (MPS)
- We want PyTorch to use MPS (Metal Performance Shaders) for GPU acceleration. The M4 Pro used here has 20 GPU cores.
- `model.to(device)` moves all weights to the GPU.
- The forward pass (computing predictions) and the backward pass (computing gradients) then run on the GPU.
- Running on the CPU is much slower because large matrix multiplications parallelize far less efficiently there.

In [25]:
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)

Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear(in_features=640, out_features=1024, bias=False)
          (k_proj): Linear(in_features=640, out_features=256, bias=False)
          (v_proj): Linear(in_features=640, out_features=256, bias=False)
          (o_proj): Linear(in_features=1024, out_features=640, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
          (up_proj): Linear(in_features=640, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=640, bias=False)
          (act_fn): GELUTanh()
        )
        (input_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
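
As a sanity check on the "270M" in the model name, the layer shapes in the printout above can be multiplied out. This is a rough sketch: it counts only the embedding table and the `Linear` layers that are visible, and it leaves out the norm weights because the printout is truncated, so the result is a lower bound.

```python
# Rough parameter count from the layer shapes printed above.
# Norm weights are left out because the printout is truncated, so this is a lower bound.
vocab_size, hidden = 262144, 640
num_layers = 18

embedding = vocab_size * hidden

attention = (
    hidden * 1024    # q_proj
    + hidden * 256   # k_proj
    + hidden * 256   # v_proj
    + 1024 * hidden  # o_proj
)
mlp = 3 * hidden * 2048  # gate_proj, up_proj, down_proj

per_layer = attention + mlp
total = embedding + num_layers * per_layer

print(f"embedding: {embedding:,}")   # embedding: 167,772,160
print(f"per layer: {per_layer:,}")   # per layer: 5,570,560
print(f"total:     {total:,}")       # total:     268,042,240
```

The embedding table alone accounts for well over half of the weights, which is typical for a small model with a large vocabulary.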

### Prepare Dataset

In [26]:
import json
from datasets import Dataset

# --- Tool Definition for a Travel Planner ---
def search_web(query: str) -> str:
    """
    Search the web using DuckDuckGo.

    Args:
        query: query string
    Returns:
        Top web result as string
    """
    return f"Search result for: {query}"


In [34]:
# Prepare the travel planner dataset
# Each example represents user intent and the function call the model should make.

simple_tool_calling = [
    {
        "user_content": "Best places to visit in Switzerland for 5 days",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "best places to visit in Switzerland for 5 days"}'
    },
    {
        "user_content": "Best time to visit Japan for cherry blossoms",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "best time to visit Japan for cherry blossoms"}'
    },
    {
        "user_content": "Top day trips from Zurich suitable for a 5-day Switzerland vacation",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "top day trips from Zurich for a 5 day Switzerland vacation"}'
    },
    {
        "user_content": "Family-friendly 5-day Switzerland trip with easy trains and kids activities",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "family friendly 5 day Switzerland trip trains kids activities"}'
    },
    {
        "user_content": "Top-rated car rental services in Los Angeles",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "top rated car rental services in Los Angeles"}'
    },
    {
        "user_content": "Find family-friendly hotels in Paris for 2 nights",
        "tool_name": "search_web",
        "tool_arguments": '{"query": "family friendly hotels in Paris for 2 nights"}'
    },
]


In [35]:
# Prepare Dataset for training
from transformers.utils import get_json_schema

# Convert function to schema for model awareness - lets the model understand the function signature
TOOLS = [get_json_schema(search_web)]

DEFAULT_SYSTEM_MSG = (
    "You are a travel assistant that MUST use the available tools.\n"
    "You are NOT allowed to refuse travel-related questions.\n"
    "For ANY travel planning, destination, itinerary, transport, or activity question, "
    "you MUST call the `search_web` function.\n"
    "Do NOT ask follow-up questions.\n"
    "Do NOT provide explanations.\n"
    "ONLY return a function call."
)

# Map each raw sample to the conversational format: messages plus the expected tool call.
def create_conversation(sample):
    return {
        "messages": [
            {"role": "developer", "content": DEFAULT_SYSTEM_MSG},
            {"role": "user", "content": sample["user_content"]},
            {"role": "assistant", "tool_calls": [
                {
                    "type": "function",
                    "function": {
                        "name": sample["tool_name"],
                        "arguments": json.loads(sample["tool_arguments"])
                    }
                }
            ]}
        ],
        "tools": TOOLS
    }

# Convert list to Hugging Face dataset
dataset = Dataset.from_list(simple_tool_calling)

# Map to conversation format
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

# Split into train/test (50/50)
dataset = dataset.train_test_split(test_size=0.5, shuffle=True)



### Test Before Fine-Tuning

In [36]:
def check_success_rate():
    success_count = 0
    for idx, item in enumerate(dataset['test']):
        # Take the developer (system-style) and user messages only
        messages = [
            item["messages"][0],  # developer message
            item["messages"][1],  # user message
        ]

        # Prepare inputs using chat template
        inputs = tokenizer.apply_chat_template(
            messages,
            tools=TOOLS,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        )

        # Generate model output
        out = model.generate(
            **inputs.to(model.device),
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=128
        )

        # Decode generated tokens
        output = tokenizer.decode(
            out[0][len(inputs["input_ids"][0]):],
            skip_special_tokens=False
        )

        print(f"{idx+1}. Prompt: {item['messages'][1]['content']}")
        print(f"   Output: {output}")

        # Check if the model called the correct tool
        expected_tool = item['messages'][2]['tool_calls'][0]['function']['name']

        if expected_tool in output:
            print("   -> ✅ correct!")
            success_count += 1
        else:
            print(f"   -> ❌ wrong (expected '{expected_tool}' missing)")

    print(f"\nOverall Success: {success_count} / {len(dataset['test'])}")

check_success_rate()


1. Prompt: Family-friendly 5-day Switzerland trip with easy trains and kids activities
   Output: <start_function_call>call:search_web{query:<escape>family-friendly 5-day Switzerland trip with easy trains and kids activities<escape>}<end_function_call><start_function_response>
   -> ✅ correct!
2. Prompt: Top day trips from Zurich suitable for a 5-day Switzerland vacation
   Output: <start_function_call>call:search_web{query:<escape>Top day trips from Zurich suitable for a 5-day Switzerland vacation<escape>}<end_function_call><start_function_response>
   -> ✅ correct!
3. Prompt: Find family-friendly hotels in Paris for 2 nights
   Output: <start_function_call>call:search_web{query:<escape>family-friendly hotels in Paris<escape>}<end_function_call><start_function_response>
   -> ✅ correct!

Overall Success: 3 / 3
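
The check above only tests whether the tool name appears somewhere in the output. A stricter check could parse the call and compare the arguments as well. The sketch below assumes the output format seen in the printed results; the regular expressions are inferred from those samples rather than from an official grammar, and `parse_function_call` is a hypothetical helper that only handles string arguments wrapped in `<escape>`.

```python
import re

# Patterns inferred from the outputs printed above; this is not an official grammar
# and it only handles string arguments wrapped in <escape> markers.
CALL_RE = re.compile(
    r"<start_function_call>call:(?P<name>\w+)\{(?P<args>.*?)\}<end_function_call>",
    re.DOTALL,
)
ARG_RE = re.compile(r"(?P<key>\w+):<escape>(?P<value>.*?)<escape>", re.DOTALL)


def parse_function_call(output):
    """Return (tool_name, arguments) or None if no well-formed call is found."""
    match = CALL_RE.search(output)
    if match is None:
        return None
    arguments = {m["key"]: m["value"] for m in ARG_RE.finditer(match["args"])}
    return match["name"], arguments


sample = (
    "<start_function_call>call:search_web{query:<escape>family-friendly hotels "
    "in Paris<escape>}<end_function_call><start_function_response>"
)
print(parse_function_call(sample))
# ('search_web', {'query': 'family-friendly hotels in Paris'})
print(parse_function_call("Sorry, I cannot help with that."))
# None
```

With a parser like this, the success check could require both the expected tool name and a non-empty `query` argument instead of a substring match.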


### Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA)
- PEFT (Parameter-Efficient Fine-Tuning) is a method to adapt large pretrained models (like LLMs) to new tasks without retraining all model parameters.
- Instead of updating billions of weights, most of the base model is frozen, which greatly reduces training time, memory usage, and hardware requirements.
- LoRA (Low-Rank Adaptation) is a popular PEFT technique that adds small, trainable low-rank matrices to selected layers of the model.
- During training, only these added LoRA parameters are updated, while the original model weights remain unchanged.
- This allows the model to learn task-specific behavior efficiently, often achieving performance close to full fine-tuning.
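
To get a feel for how small the adapter is, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the values used in this walkthrough (`r=8`, `lora_alpha=32`, adapters on `q_proj` and `v_proj`) and the layer shapes from the model printout; `lora_params` is a hypothetical helper, not a PEFT function.

```python
# LoRA learns the update to a weight W (out x in) as B (out x r) @ A (r x in),
# so each adapted layer adds r * (in + out) trainable parameters.
def lora_params(in_features, out_features, r):
    return r * (in_features + out_features)


r, lora_alpha, num_layers = 8, 32, 18

# Shapes taken from the model printout: q_proj is 640 -> 1024, v_proj is 640 -> 256.
per_layer = lora_params(640, 1024, r) + lora_params(640, 256, r)
trainable = num_layers * per_layer

print(per_layer)        # 20480
print(trainable)        # 368640
print(lora_alpha / r)   # 4.0, the factor applied to the low-rank update
```

Roughly 0.37M trainable parameters is a small fraction of the base model's weights, which is why LoRA training fits comfortably in laptop memory.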



In [37]:
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset

# Ensure the tokenizer has eos and pad tokens. Causal LM training requires consistent special tokens and padding.
# Resize the embeddings whenever a token was actually added, not only for the pad token.
num_added = 0
if tokenizer.eos_token is None:
    num_added += tokenizer.add_special_tokens({"eos_token": "<eos>"})
if tokenizer.pad_token is None:
    num_added += tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# ===== PEFT/LoRA config =====
# r, alpha, and target_modules control the adapter capacity and where adapters are inserted
lora_config = LoraConfig(
    r=8, # low-rank dimension (tweakable)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical transformer proj names; OK for Gemma-style
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap model with LoRA (freezes base params by default)
model = get_peft_model(model, lora_config)

# ===== Prepare the dataset: create input_ids and labels =====
# We will build prompt = tokenizer.apply_chat_template(system+user, tools=TOOLS, add_generation_prompt=True)
# and target = serialized assistant tool-call string (from dataset["messages"][2]).
# Helper to serialize expected assistant tool call in the same format the model uses:
def serialize_tool_call(tool_call):
    # tool_call: dict with "function": {"name": ..., "arguments": {...}}
    name = tool_call["function"]["name"]
    args = tool_call["function"]["arguments"]
    # For simplicity, assume only a "query" argument exists (as in this dataset)
    q = args.get("query", "")
    # Reproduce the textual format observed in the model outputs earlier;
    # the doubled braces {{ }} emit literal { } inside the f-string.
    return f"<start_function_call>call:{name}{{query:<escape>{q}<escape>}}<end_function_call><start_function_response>"

# Map HF dataset to tokenized training samples
def prepare_example(example):
    # example["messages"] exists (developer, user, assistant)
    system_msg = example["messages"][0]
    user_msg = example["messages"][1]
    assistant_tool_call = example["messages"][2]["tool_calls"][0]

    # Build the prompt with the chat template, the same way check_success_rate does
    prompt_inputs = tokenizer.apply_chat_template(
        [system_msg, user_msg],
        tools=TOOLS,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )

    prompt_ids = prompt_inputs["input_ids"][0]  # tensor
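    # Build the target string and tokenize it without special tokens,
    # so that no extra token (such as <bos>) is inserted in the middle of the sequence
    target_text = serialize_tool_call(assistant_tool_call)
    target_ids = tokenizer(target_text, add_special_tokens=False, return_tensors="pt")["input_ids"][0]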

    # Concatenate prompt + target for labels. We will mask prompt tokens with -100 so loss only computed on target tokens.
    input_ids = torch.cat([prompt_ids, target_ids], dim=0)
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids], dim=0)

    return {"input_ids": input_ids, "labels": labels}

# Apply to one dataset split. `dataset` is a DatasetDict with train/test splits.
def dataset_to_train_format(hf_dataset):
    # hf_dataset is a single split (an iterable of examples)
    mapped = {"input_ids": [], "labels": []}
    for ex in hf_dataset:
        prepared = prepare_example(ex)
        # Store plain Python lists; the data collator converts them back to tensors and pads them
        mapped["input_ids"].append(prepared["input_ids"].tolist())
        mapped["labels"].append(prepared["labels"].tolist())
    return Dataset.from_dict({"input_ids": mapped["input_ids"], "labels": mapped["labels"]})

train_ds = dataset_to_train_format(dataset["train"])
eval_ds  = dataset_to_train_format(dataset["test"])

# ===== Data collator: pads input_ids and labels to same length =====
class DataCollatorForCausalWithPadding:
    def __init__(self, tokenizer, pad_token_id=None):
        self.tokenizer = tokenizer
        self.pad_token_id = pad_token_id if pad_token_id is not None else tokenizer.pad_token_id

    def __call__(self, features):
        # features: list of {"input_ids": list[int], "labels": list[int]}
        input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in features]
        labels = [torch.tensor(f["labels"], dtype=torch.long) for f in features]
        # Right-pad every sequence to the longest length in the batch
        max_len = max(t.size(0) for t in input_ids)
        padded_inputs, padded_labels, masks = [], [], []
        for inp, lab in zip(input_ids, labels):
            pad_len = max_len - inp.size(0)
            padded_inputs.append(torch.cat([inp, torch.full((pad_len,), self.pad_token_id, dtype=torch.long)]))
            padded_labels.append(torch.cat([lab, torch.full((pad_len,), -100, dtype=torch.long)]))
            # Attention mask: 1 for real tokens, 0 for padding
            masks.append(torch.cat([torch.ones(inp.size(0), dtype=torch.long), torch.zeros(pad_len, dtype=torch.long)]))
        return {
            "input_ids": torch.stack(padded_inputs),
            "labels": torch.stack(padded_labels),
            "attention_mask": torch.stack(masks),
        }

data_collator = DataCollatorForCausalWithPadding(tokenizer)

# ===== Training arguments =====
training_args = TrainingArguments(
    output_dir="./functiongemma_lora_out",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=10,
    fp16=False,  # fp16 on MPS is often problematic; keep float32
    remove_unused_columns=False,
    push_to_hub=False
)

# ===== Trainer =====
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
)

# ===== Run training =====
trainer.train()

# ===== Save LoRA adapter weights (and optionally full merged model) =====
model.save_pretrained("./functiongemma_lora_adapter")
tokenizer.save_pretrained("./functiongemma_lora_adapter")
print("LoRA adapter saved to ./functiongemma_lora_adapter")
# This is just the adapter weights, not the full base model.
# We can merge later if needed, but using LoRA keeps it lightweight.







LoRA adapter saved to ./functiongemma_lora_adapter
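
The two ideas that the training cell relies on, masking the prompt in the labels and padding a batch, can be sketched with plain Python lists. The token IDs below are made up, and `build_example` and `pad_batch` are hypothetical helpers that mirror the logic of `prepare_example` and the collator rather than replace them. The sketch also builds an attention mask so that padded positions can be told apart from real tokens.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids, target_ids):
    # The model sees prompt + target, but only target positions contribute to the loss.
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_token_id):
    # Right-pad every sequence to the longest one; padded label positions are ignored too.
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in examples:
        pad_len = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad_len)
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * pad_len)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad_len)
    return batch


# Made-up token IDs, purely for illustration.
a = build_example(prompt_ids=[5, 6, 7], target_ids=[8, 9])
b = build_example(prompt_ids=[5, 6], target_ids=[9])
batch = pad_batch([a, b], pad_token_id=0)
print(batch["input_ids"])       # [[5, 6, 7, 8, 9], [5, 6, 9, 0, 0]]
print(batch["labels"])          # [[-100, -100, -100, 8, 9], [-100, -100, 9, -100, -100]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

With `per_device_train_batch_size=1`, as in the training arguments above, every batch holds a single sequence, so no padding is actually added during this run.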




### Merge LoRA Adapter and Base Model

In [38]:
#Load base model fresh
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "google/functiongemma-270m-it"
lora_path = "./functiongemma_lora_adapter"
merged_output_path = "./functiongemma_merged"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float32,
    device_map=None
)

#Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    lora_path
)

#Merge and unload LoRA
merged_model = model.merge_and_unload()

#Save merged model to get standalone fine-tuned model
merged_model.save_pretrained(merged_output_path)
tokenizer.save_pretrained(merged_output_path)


('./functiongemma_merged/tokenizer_config.json',
 './functiongemma_merged/special_tokens_map.json',
 './functiongemma_merged/chat_template.jinja',
 './functiongemma_merged/tokenizer.json')
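
Conceptually, merging folds each adapter into its base weight as W' = W + (lora_alpha / r) · B · A, so the merged model gives the same outputs without the extra adapter path. The toy example below checks that identity with small nested lists; the matrices and the `matmul` and `merge_lora` helpers are hypothetical and are not how PEFT implements the merge internally.

```python
def matmul(a, b):
    # Plain-Python matrix product for small nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    # W' = W + (lora_alpha / r) * B @ A
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy shapes: W is 2x3 (out x in), rank r = 1, so B is 2x1 and A is 1x3.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
x = [[1.0], [1.0], [1.0]]  # column-vector input

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# Adapter path: base output plus the scaled low-rank correction (scale = 2 / 1).
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[base[0] + 2.0 * low[0]] for base, low in zip(base_out, lora_out)]

print(merged)             # [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
print(matmul(merged, x))  # [[9.0], [-11.0]]
print(adapter_out)        # [[9.0], [-11.0]]
```

Both paths produce the same output, which is why the merged checkpoint can be loaded as a regular model with no PEFT dependency.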

### Test After Fine-Tuning (Merged Model, No PEFT)

In [39]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./functiongemma_merged")

model = AutoModelForCausalLM.from_pretrained(
    "./functiongemma_merged",
    dtype=torch.float32
)

model.to(device)
model.eval()

check_success_rate()

The tokenizer you are loading from './functiongemma_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


1. Prompt: Family-friendly 5-day Switzerland trip with easy trains and kids activities
   Output: <start_function_call>call:search_web{query:<escape>family-friendly 5-day Switzerland trip with easy trains and kids activities<escape>}<end_function_call><start_function_response>
   -> ✅ correct!
2. Prompt: Top day trips from Zurich suitable for a 5-day Switzerland vacation
   Output: <start_function_call>call:search_web{query:<escape>top day trips from Zurich suitable for a 5-day Switzerland vacation<escape>}<end_function_call><start_function_response>
   -> ✅ correct!
3. Prompt: Find family-friendly hotels in Paris for 2 nights
   Output: <start_function_call>call:search_web{query:<escape>family-friendly hotels in Paris<escape>}<end_function_call><start_function_response>
   -> ✅ correct!

Overall Success: 3 / 3
