# XLAM 2.0 Function Calling Fine-tuning

Fine-tune Qwen 2.5 Instruct on a dataset in XLAM 2.0 (APIGen-MT) format to improve function calling.

**Model**: Qwen/Qwen2.5-7B-Instruct  
**Method**: LoRA with 4-bit quantization  
**Framework**: Unsloth for 2x faster training

## 1. Setup and Installation

In [34]:
# Install dependencies
!pip install -q unsloth
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q xformers trl peft accelerate bitsandbytes

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [35]:
import json
import torch
from datasets import load_dataset, Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from pathlib import Path

## 2. Configuration

In [36]:
# Model configuration
MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = True

# LoRA configuration (from APIGen-MT paper)
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0.0
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Training configuration
BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
WARMUP_STEPS = 10
LOGGING_STEPS = 10
SAVE_STEPS = 100

# Dataset configuration
HF_DATASET_REPO = "alwaysfurther/deepfabric-xlam-tools"  # Set to your HuggingFace repo, e.g., "username/xlam-dataset"
#LOCAL_DATASET_PATH = "xlam_v2_formatted.jsonl"  # XLAM 2.0 formatted dataset

# Output configuration
OUTPUT_DIR = "./xlam_checkpoints"
FINAL_MODEL_NAME = "/content/xlam-qwen2.5-7b-lora"

## 3. Load Model with Unsloth

In [38]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=LOAD_IN_4BIT,
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")
print(f"4-bit quantization: {LOAD_IN_4BIT}")

==((====))==  Unsloth 2025.10.1: Fast Qwen2 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 4. Max memory: 39.494 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded: unsloth/Qwen2.5-7B-Instruct
Max sequence length: 2048
4-bit quantization: True


## 4. Configure LoRA

In [39]:
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

print(f"LoRA configured with rank={LORA_R}, alpha={LORA_ALPHA}")

LoRA configured with rank=16, alpha=16
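
The size of the adapter follows directly from the rank and the shapes of the targeted projections: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below works this out for rank 16. The layer dimensions are assumptions taken from the published Qwen2.5-7B model config, not from this notebook; with them the total comes to 40,370,176, the trainable parameter count that Unsloth reports in the training log.

```python
# Assumed Qwen2.5-7B dimensions (from the published model config, not this notebook).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size of the MLP
NUM_LAYERS = 28        # num_hidden_layers
KV_DIM = 4 * 128       # 4 key/value heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) of every targeted projection in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: matrices A (rank x in) and B (out x rank) per module."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


print(f"LoRA parameters at r=16: {lora_param_count(16):,}")
```

Because the count is linear in the rank, doubling `LORA_R` doubles the number of trainable weights.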


## 5. Load and Prepare Dataset

In [40]:
# ===================================
# 5. Load and Prepare Dataset
# ===================================


dataset = load_dataset(HF_DATASET_REPO, split="train")
print(f"Dataset loaded: {len(dataset)} samples")
print(f"Dataset format: {dataset.column_names}")

# Verify XLAM 2.0 format
if "conversations" not in dataset.column_names:
    raise ValueError("Expected XLAM 2.0 format with 'conversations' field")

print(f"\n=== Dataset Stats ===")
sample = dataset[0]
print(f"Example turns: {len(sample['conversations'])}")
func_calls = [t for t in sample['conversations'] if t.get('from') == 'function_call']
print(f"Function calls in first sample: {len(func_calls)}")
if func_calls:
    print(f"Example: {func_calls[0]['value'][:80]}...")

Loading dataset from HuggingFace: alwaysfurther/deepfabric-xlam-tools
Dataset loaded: 4004 samples
Dataset format: ['conversations', 'tools', 'system']

=== Dataset Stats ===
Example turns: 12
Function calls in first sample: 1
Example: {"name": "process_appointment_cancellation", "arguments": {"patient_name": "Jane...


In [54]:
# Better conversion: Include tools in the system message
def xlam_to_chat_template_with_tools(sample):
    """
    Convert XLAM 2.0 to chat template WITH tool definitions.
    This teaches the model when to use which tools.
    """
    messages = []

    # Build system message with domain policy AND tools
    system_parts = []

    if sample.get("system"):
        system_parts.append(sample["system"])

    if sample.get("tools"):
        system_parts.append("\nAvailable tools:")
        system_parts.append(sample["tools"])

    if system_parts:
        messages.append({
            "role": "system",
            "content": "\n".join(system_parts)
        })

    # Convert conversation turns
    role_mapping = {
        "human": "user",
        "gpt": "assistant",
        "function_call": "assistant",
        "observation": "user"
    }

    for turn in sample["conversations"]:
        role = role_mapping.get(turn["from"], "user")
        messages.append({
            "role": role,
            "content": turn["value"]
        })

    # Apply Qwen chat template
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )

    return {"text": formatted}

# Re-load and convert dataset
print("Re-loading dataset with tools included...")
dataset = load_dataset(HF_DATASET_REPO, split="train")
dataset = dataset.map(xlam_to_chat_template_with_tools, remove_columns=dataset.column_names)

print(f"✓ Dataset re-converted: {len(dataset)} samples")
print(f"\n=== Example with Tools ===")
print(dataset[0]['text'][:1500])
print("...")

Re-loading dataset with tools included...


Map:   0%|          | 0/4004 [00:00<?, ? examples/s]

✓ Dataset re-converted: 4004 samples

=== Example with Tools ===
<|im_start|>system
Dental Appointment Cancellation Policy: Appointments can be cancelled without a fee if at least 24 hours' notice is provided. Cancellations made with less than 24 hours' notice will incur a $50 cancellation fee. Patients must verify their full name and date of birth to initiate a cancellation. The system will confirm the appointment details before processing. Fees are automatically applied to the patient's account upon cancellation.

Available tools:
[{"name": "get_weather", "description": "Get current weather conditions for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "City name or location (e.g., 'Paris', 'New York')"}, "time": {"type": "string", "description": "Time period for weather data"}}, "required": ["location"]}}, {"name": "get_time", "description": "Get current time for a timezone", "parameters": {"type": "object", "properties": {"
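
To make the conversion easier to inspect, here is a minimal sketch of the same role mapping without the tokenizer step, so the intermediate `messages` list can be examined directly. The toy sample and the `lookup_appointment` tool are hypothetical, and the sketch assumes, as the output above suggests, that `tools` is stored as a JSON string.

```python
import json

ROLE_MAPPING = {
    "human": "user",
    "gpt": "assistant",
    "function_call": "assistant",
    "observation": "user",
}


def to_messages(sample):
    """Build the chat messages list from one XLAM-style sample (no chat template applied)."""
    system_parts = []
    if sample.get("system"):
        system_parts.append(sample["system"])
    if sample.get("tools"):
        system_parts.append("\nAvailable tools:")
        system_parts.append(sample["tools"])

    messages = []
    if system_parts:
        messages.append({"role": "system", "content": "\n".join(system_parts)})
    for turn in sample["conversations"]:
        # Unknown speaker labels fall back to "user", as in the cell above.
        messages.append({"role": ROLE_MAPPING.get(turn["from"], "user"), "content": turn["value"]})
    return messages


# Hypothetical sample, for illustration only.
toy_sample = {
    "system": "Clinic policy: verify the patient's name before any lookup.",
    "tools": json.dumps([{"name": "lookup_appointment", "description": "Find an appointment"}]),
    "conversations": [
        {"from": "human", "value": "When is my appointment? I'm Jane Doe."},
        {"from": "function_call", "value": '{"name": "lookup_appointment", "arguments": {"patient_name": "Jane Doe"}}'},
        {"from": "observation", "value": '{"date": "2025-11-03"}'},
        {"from": "gpt", "value": "Your appointment is on 2025-11-03."},
    ],
}

for message in to_messages(toy_sample):
    print(f"{message['role']:>9}: {message['content'][:60]}")
```

Note that both `function_call` and `gpt` turns become `assistant` messages, so after conversion a function call is recognizable only by its JSON content.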

## 6. Training Configuration

In [55]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=LOGGING_STEPS,
    save_steps=SAVE_STEPS,
    save_total_limit=3,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    report_to="none",  # Change to "wandb" if you want logging
)

print(f"Training configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Gradient accumulation: {GRADIENT_ACCUMULATION_STEPS}")
print(f"  Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Epochs: {NUM_EPOCHS}")

Training configuration:
  Batch size: 2
  Gradient accumulation: 4
  Effective batch size: 8
  Learning rate: 0.0002
  Epochs: 3


## 7. Initialize Trainer

In [56]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
    packing=False,  # Disable packing for function calling (needs clear boundaries)
)

print("Trainer initialized")

Unsloth: Tokenizing ["text"] (num_proc=52):   0%|          | 0/4004 [00:00<?, ? examples/s]

Trainer initialized


In [61]:
# ========================================
# VERIFY DATASET BEFORE TRAINING
# ========================================
print("=== PRE-TRAINING VERIFICATION ===")
print(f"Dataset variable name: 'dataset'")
print(f"Number of samples: {len(dataset)}")
print(f"Has 'text' column: {'text' in dataset.column_names}")

# Check first sample
sample_text = dataset[0]['text']
print(f"\nFirst sample length: {len(sample_text)} chars")

# Critical checks
if "Available tools:" in sample_text:
    print("✅ Tools ARE included in training data")
else:
    print("❌ ERROR: Tools NOT included - DO NOT TRAIN YET!")

if '{"name":' in sample_text and '"arguments":' in sample_text:
    print("✅ Function call examples ARE present")
else:
    print("⚠️  Warning: No function calls found in first sample")

print("\n=== First 1500 characters ===")
print(sample_text[:1500])
print("...")

print("\n=== Ready to train? ===")
if "Available tools:" in sample_text:
    print("YES - proceed with training")
else:
    print("NO - re-run conversion first!")

=== PRE-TRAINING VERIFICATION ===
Dataset variable name: 'dataset'
Number of samples: 4004
Has 'text' column: True

First sample length: 5795 chars
✅ Tools ARE included in training data
✅ Function call examples ARE present

=== First 1500 characters ===
<|im_start|>system
Dental Appointment Cancellation Policy: Appointments can be cancelled without a fee if at least 24 hours' notice is provided. Cancellations made with less than 24 hours' notice will incur a $50 cancellation fee. Patients must verify their full name and date of birth to initiate a cancellation. The system will confirm the appointment details before processing. Fees are automatically applied to the patient's account upon cancellation.

Available tools:
[{"name": "get_weather", "description": "Get current weather conditions for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "City name or location (e.g., 'Paris', 'New York')"}, "time": {"type": "string", "descrip
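
The check above inspects only the first sample. A possible extension, sketched below with the same two string tests, scans every converted text and reports which samples lack the tools block. The helper name `verify_texts` and the toy texts are hypothetical.

```python
def verify_texts(texts):
    """Apply the pre-training string checks to every converted sample."""
    missing_tools = [i for i, text in enumerate(texts) if "Available tools:" not in text]
    with_calls = sum(1 for text in texts if '{"name":' in text and '"arguments":' in text)
    return {
        "total": len(texts),
        "missing_tools": missing_tools,       # indices of samples without a tools block
        "with_function_calls": with_calls,    # samples containing at least one JSON call
    }


# Hypothetical converted texts, for illustration only.
toy_texts = [
    'system\nPolicy\n\nAvailable tools:\n[]\nassistant\n{"name": "get_time", "arguments": {}}',
    "system\nPolicy\n\nAvailable tools:\n[]\nassistant\nHappy to help!",
    "system\nPolicy only, no tools listed\nassistant\nHello!",
]

report = verify_texts(toy_texts)
print(report)
```

In the notebook this could be called as `verify_texts(dataset["text"])`, assuming the `text` column fits in memory.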

## 8. Train Model

In [62]:
# Show GPU memory before training
import datetime
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU: {gpu_stats.name}")
print(f"Memory: {start_gpu_memory} GB / {max_memory} GB reserved")
print("\nStarting training: ")
print(datetime.datetime.now())
# Train
trainer_stats = trainer.train()

# Show GPU memory after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
print(f"\nTraining complete!")
print(f"Peak memory reserved: {used_memory} GB ({used_percentage}%)")
print(f"Memory used for training: {used_memory_for_training} GB")
print("\nFinished training: ")
print(datetime.datetime.now())


GPU: NVIDIA A100-SXM4-40GB
Memory: 33.594 GB / 39.494 GB reserved

Starting training: 
2025-10-11 16:54:33.195109


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,004 | Num Epochs = 3 | Total steps = 378
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 40,370,176 of 7,655,986,688 (0.53% trained)


Step,Training Loss
10,0.1102
20,0.1122
30,0.116
40,0.1173
50,0.12
60,0.1167
70,0.1158
80,0.1186
90,0.1187
100,0.1176



Training complete!
Peak memory reserved: 33.594 GB (85.061%)
Memory used for training: 0.0 GB
\Finished training: 
2025-10-11 17:44:27.440057
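
The step count in the banner can be reproduced with a little arithmetic, which is a useful sanity check on the effective batch size. The sketch below assumes the rounding convention that matches the banner: a partial final batch still counts as a batch, and a partial final accumulation group still counts as a step. Note that the banner reports a per-device batch size of 8, while the Configuration cell sets `BATCH_SIZE = 2`; both cases are computed.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, rounding partial batches and partial accumulation groups up."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Values reported by the training banner: batch 8, accumulation 4, 3 epochs.
print(total_optimizer_steps(4004, per_device_batch=8, grad_accum=4, epochs=3))  # 378

# Values from the Configuration cell: batch 2, accumulation 4, 3 epochs.
print(total_optimizer_steps(4004, per_device_batch=2, grad_accum=4, epochs=3))  # 1503
```

With `WARMUP_STEPS = 10`, warmup covers roughly 2.6% of a 378-step run.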


## 9. Save Model

In [63]:
# Save LoRA adapters locally
model.save_pretrained(FINAL_MODEL_NAME)
tokenizer.save_pretrained(FINAL_MODEL_NAME)
print(f"Model saved to {FINAL_MODEL_NAME}")

# Optionally push to HuggingFace Hub
# model.push_to_hub("your-username/xlam-qwen2.5-7b-lora", token="your_token")
# tokenizer.push_to_hub("your-username/xlam-qwen2.5-7b-lora", token="your_token")

Model saved to /content/xlam-qwen2.5-7b-lora


In [64]:
# Verify the model has LoRA adapters loaded and trained
print("=== Model Verification ===")

# Check if model has PEFT adapters
if hasattr(model, 'peft_config'):
    print("✅ Model has PEFT adapters loaded")
    print(f"   Adapter names: {list(model.peft_config.keys())}")
else:
    print("❌ WARNING: Model doesn't have PEFT adapters!")

# Check if adapters are enabled (not disabled/merged)
if hasattr(model, 'active_adapter'):
    print(f"✅ Active adapter: {model.active_adapter}")
else:
    print("⚠️  Can't detect active adapter")

# Check trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTrainable params: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")

if trainable_params > 0:
    print("✅ Model has trainable parameters (LoRA is active)")
else:
    print("❌ WARNING: No trainable parameters!")

=== Model Verification ===
✅ Model has PEFT adapters loaded
   Adapter names: ['default']
✅ Active adapter: default

Trainable params: 40,370,176 (0.82%)
✅ Model has trainable parameters (LoRA is active)


## 10. Test Inference

In [65]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test with tools in system message (matching training format)
test_tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

test_messages = [
    {
        "role": "system",
        "content": f"""Weather Service Policy: Provides weather information for any location.

Available tools:
{json.dumps(test_tools)}"""
    },
    {
        "role": "user",
        "content": "What's the weather in San Francisco?"
    }
]

# Generate
inputs = tokenizer.apply_chat_template(test_messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens. Decoding the full sequence with
# skip_special_tokens=True strips the <|im_start|> markers, so searching the
# decoded text for them never matches and nothing would be printed.
assistant_msg = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()
print("=== Assistant Response ===")
print(assistant_msg)

# Check if it's a function call
if assistant_msg.startswith("{") and '"name"' in assistant_msg:
    print("\n✓ SUCCESS: Function call generated!")
    try:
        call_json = json.loads(assistant_msg)
        print(f"Function: {call_json.get('name')}")
        print(f"Arguments: {call_json.get('arguments')}")
    except json.JSONDecodeError:
        print("(JSON parsing failed but format looks correct)")
else:
    print("\n⚠️  Still conversational, not a function call")
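
The `startswith("{")` test accepts any text that merely looks like JSON. A stricter variant, sketched below, parses the response and requires a string `name` and a dict `arguments`. The helper name `parse_function_call` is hypothetical, and the expected shape is an assumption based on the function calls shown in the dataset stats.

```python
import json


def parse_function_call(text):
    """Return (name, arguments) if text is a well-formed JSON function call, else None."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    name = payload.get("name")
    arguments = payload.get("arguments")
    if not isinstance(name, str) or not isinstance(arguments, dict):
        return None
    return name, arguments


print(parse_function_call('{"name": "get_weather", "arguments": {"location": "San Francisco"}}'))
print(parse_function_call("Let me check the weather for you."))
print(parse_function_call('{"name": "get_weather"'))  # truncated JSON
```

Returning `None` for every failure mode keeps the caller's check to a single `if`, at the cost of not saying why a response was rejected.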

In [66]:
print("=== Training Data Format ===")
print(dataset[0]['text'][:1200])

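# Find a sample that contains a JSON function call.
# The 'function_call' role was mapped to 'assistant' during conversion, so that
# label never appears in the text; look for the call's "arguments" key instead.
sample_with_func = None
for i in range(min(100, len(dataset))):
    if '"arguments":' in dataset[i]['text']:
        sample_with_func = dataset[i]['text']
        break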

if sample_with_func:
    print("\n=== Sample with Function Call ===")
    print(sample_with_func[:1500])

=== Training Data Format ===
<|im_start|>system
Dental Appointment Cancellation Policy: Appointments can be cancelled without a fee if at least 24 hours' notice is provided. Cancellations made with less than 24 hours' notice will incur a $50 cancellation fee. Patients must verify their full name and date of birth to initiate a cancellation. The system will confirm the appointment details before processing. Fees are automatically applied to the patient's account upon cancellation.

Available tools:
[{"name": "get_weather", "description": "Get current weather conditions for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "City name or location (e.g., 'Paris', 'New York')"}, "time": {"type": "string", "description": "Time period for weather data"}}, "required": ["location"]}}, {"name": "get_time", "description": "Get current time for a timezone", "parameters": {"type": "object", "properties": {"timezone": {"type": "string", "descr

In [68]:
# Show actual outputs, not just accuracy
print("=== ACTUAL MODEL OUTPUTS ===\n")

test_cases = [
    {
        "query": "Book a flight from NYC to LAX for tomorrow",
        "expected_tool": "book_flight",
        "tools": [
            {"name": "book_flight", "description": "Book airline tickets"},
            {"name": "get_weather", "description": "Get weather info"},
        ]
    },
    {
        "query": "What's the temperature in Paris?",
        "expected_tool": "get_weather",
        "tools": [
            {"name": "get_weather", "description": "Get weather info"},
            {"name": "book_hotel", "description": "Book hotel rooms"},
        ]
    },
]

FastLanguageModel.for_inference(model)

for i, case in enumerate(test_cases, 1):
    messages = [
        {"role": "system", "content": f"You are a helpful assistant.\n\nAvailable tools:\n{json.dumps(case['tools'])}"},
        {"role": "user", "content": case['query']}
    ]

    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract assistant response
    if "<|im_start|>assistant\n" in response:
        assistant_msg = response.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0].strip()
    else:
        assistant_msg = response.split("assistant\n")[-1].strip()

    print(f"Test {i}: {case['query']}")
    print(f"Expected: {case['expected_tool']}")
    print(f"Output: {assistant_msg[:200]}")

    # Check format
    if assistant_msg.startswith("{") and '"name"' in assistant_msg:
        print("✅ Proper function call format")
    else:
        print("⚠️  Not a function call - conversational response")
    print()

=== ACTUAL MODEL OUTPUTS ===

Test 1: Book a flight from NYC to LAX for tomorrow
Expected: book_flight
Output: I can help with that! To confirm, which airline would you prefer and what time would you like to depart?
⚠️  Not a function call - conversational response

Test 2: What's the temperature in Paris?
Expected: get_weather
Output: Let me check the current temperature in Paris for you.
⚠️  Not a function call - conversational response



In [None]:
# ========================================
# Try adding explicit instruction to use tools
# ========================================
print("=== TESTING WITH EXPLICIT INSTRUCTION ===\n")

test_messages_instructed = [
    {
        "role": "system",
        "content": f"""You are a helpful assistant with access to tools.

Available tools:
{json.dumps(test_tools)}

When the user's request requires a tool, respond ONLY with a JSON function call in this format:
{{"name": "tool_name", "arguments": {{"param": "value"}}}}"""
    },
    {
        "role": "user",
        "content": "What's the weather in Paris?"
    }
]

inputs = tokenizer.apply_chat_template(
    test_messages_instructed,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=128,
    temperature=None,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()

print(f"Response: {response}")
print()

if response.startswith('{') and '"name"' in response:
    print("✅ Explicit instruction helped! Generated function call")
else:
    print("❌ Still conversational even with explicit instruction")

In [None]:
# ========================================
# Test with temperature=0.0 (completely greedy)
# ========================================
print("=== TESTING WITH TEMPERATURE 0.0 ===\n")

test_tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
]

test_messages = [
    {
        "role": "system",
        "content": f"You are a helpful assistant.\n\nAvailable tools:\n{json.dumps(test_tools)}"
    },
    {
        "role": "user",
        "content": "What's the weather in Paris?"
    }
]

FastLanguageModel.for_inference(model)

inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

# Completely greedy decoding
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    temperature=None,  # Greedy
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()

print(f"Response: {response}")
print()

if response.startswith('{') and '"name"' in response:
    print("✅ Generated a JSON function call!")
    try:
        parsed = json.loads(response)
        print(f"   Function: {parsed.get('name')}")
        print(f"   Arguments: {parsed.get('arguments')}")
    except json.JSONDecodeError:
        print("   (JSON parsing failed but format looks correct)")
else:
    print("❌ Still conversational, not a function call")

In [None]:
# ========================================
# CRITICAL: Check how function calls appear in training data
# ========================================
import re

print("=== ANALYZING FUNCTION CALL FORMAT ===\n")

samples_with_json_calls = 0
samples_with_conversational = 0

for i in range(min(200, len(dataset))):
    text = dataset[i]['text']

    # Find all assistant responses
    assistant_responses = re.findall(r'<\|im_start\|>assistant\n(.*?)<\|im_end\|>', text, re.DOTALL)

    # Classify the sample from all of its assistant turns. Stopping at the first
    # turn would label most samples conversational, since the function call
    # usually comes after an opening conversational reply.
    has_json_call = False
    has_conversational = False

    for response in assistant_responses:
        response = response.strip()

        # Check if it's a JSON function call
        if response.startswith('{') and '"name"' in response and '"arguments"' in response:
            # Print first example
            if samples_with_json_calls == 0 and not has_json_call:
                print("✓ Found JSON function call example:")
                print("=" * 80)
                print(response[:300])
                print("=" * 80)
            has_json_call = True
        # Check if it's conversational
        elif len(response) > 10 and not response.startswith('{'):
            has_conversational = True

    samples_with_json_calls += has_json_call
    samples_with_conversational += has_conversational

print(f"\n📊 Analysis of first 200 samples:")
print(f"  - Samples with JSON function calls: {samples_with_json_calls}")
print(f"  - Samples with conversational responses: {samples_with_conversational}")
print(f"  - Ratio: {samples_with_conversational/max(samples_with_json_calls, 1):.1f}x more conversational")

if samples_with_json_calls == 0:
    print("\n❌ CRITICAL: No JSON function calls found in training data!")
    print("This explains why the model doesn't generate them.")
elif samples_with_conversational > samples_with_json_calls * 3:
    print("\n⚠️  WARNING: Much more conversational data than function calls")
    print("Model is learning conversational responses more strongly")

## 11. Evaluation (Optional)

Compare the fine-tuned model against the base model on function-calling accuracy.

In [67]:
# Evaluation test cases
eval_cases = [
    {
        "query": "Book a flight from NYC to LAX for tomorrow",
        "expected_tool": "book_flight",
        "tools": [
            {"name": "book_flight", "description": "Book airline tickets"},
            {"name": "get_weather", "description": "Get weather info"},
        ]
    },
    {
        "query": "What's the temperature in Paris?",
        "expected_tool": "get_weather",
        "tools": [
            {"name": "get_weather", "description": "Get weather info"},
            {"name": "book_hotel", "description": "Book hotel rooms"},
        ]
    },
]

def evaluate_tool_calling(model, tokenizer, test_cases):
    """Evaluate model's tool calling accuracy."""
    correct = 0
    total = len(test_cases)

    for case in test_cases:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Available tools:\n{json.dumps(case['tools'])}\n\nUser: {case['query']}"}
        ]

        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to("cuda")

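        outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1)
        # Decode only the generated tokens. The full sequence also contains the
        # prompt, which lists every tool name, so a substring check on it always passes.
        response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)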

        # Check if expected tool is called
        if case['expected_tool'] in response:
            correct += 1
            print(f"✓ {case['query']} -> {case['expected_tool']}")
        else:
            print(f"✗ {case['query']} -> Expected {case['expected_tool']}, got: {response[:100]}")

    accuracy = (correct / total) * 100
    print(f"\nAccuracy: {accuracy:.1f}% ({correct}/{total})")
    return accuracy

# Run evaluation
print("=== Evaluating Fine-tuned Model ===")
finetuned_accuracy = evaluate_tool_calling(model, tokenizer, eval_cases)

# Compare with base model
print("\n=== Loading Base Model for Comparison ===")
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=LOAD_IN_4BIT,
)
FastLanguageModel.for_inference(base_model)

print("\n=== Evaluating Base Model ===")
base_accuracy = evaluate_tool_calling(base_model, base_tokenizer, eval_cases)

print(f"\n=== Results ===")
print(f"Base Model Accuracy: {base_accuracy:.1f}%")
print(f"Fine-tuned Model Accuracy: {finetuned_accuracy:.1f}%")
print(f"Improvement: {finetuned_accuracy - base_accuracy:.1f}%")

=== Evaluating Fine-tuned Model ===
✓ Book a flight from NYC to LAX for tomorrow -> book_flight
✓ What's the temperature in Paris? -> get_weather

Accuracy: 100.0% (2/2)

=== Loading Base Model for Comparison ===
==((====))==  Unsloth 2025.10.1: Fast Qwen2 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 4. Max memory: 39.494 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


=== Evaluating Base Model ===
✓ Book a flight from NYC to LAX for tomorrow -> book_flight
✓ What's the temperature in Paris? -> get_weather

Accuracy: 100.0% (2/2)

=== Results ===
Base Model Accuracy: 100.0%
Fine-tuned Model Accuracy: 100.0%
Improvement: 0.0%
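
Identical 100% scores for both models are what a substring check produces whenever the decoded text includes the prompt, because the prompt already lists every tool name. The sketch below contrasts that check with a stricter one that parses only the generated text. Both helper names are hypothetical; the conversational reply is the one recorded in the "ACTUAL MODEL OUTPUTS" cell.

```python
import json


def substring_hit(expected_tool, decoded_text):
    """Lenient check: passes if the tool name appears anywhere in the text."""
    return expected_tool in decoded_text


def strict_hit(expected_tool, generated_text):
    """Strict check: the generated text must be a JSON call naming the expected tool."""
    try:
        payload = json.loads(generated_text.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("name") == expected_tool


prompt = (
    'Available tools:\n[{"name": "get_weather", "description": "Get weather info"}]'
    "\n\nUser: What's the temperature in Paris?"
)
generated = "Let me check the current temperature in Paris for you."
full_decode = prompt + "\n" + generated  # what decoding the whole sequence returns

print("substring on full decode:", substring_hit("get_weather", full_decode))
print("substring on generation: ", substring_hit("get_weather", generated))
print("strict on generation:    ", strict_hit("get_weather", generated))
```

Only the first check passes here, even though the model never called the tool. Two test cases are also too few to separate the models; a held-out slice of the dataset would give a more meaningful comparison.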


## 12. Export for Production

Merge the LoRA weights into the base model for deployment.

In [48]:
# Merge and save full model (larger but faster inference)
model.save_pretrained_merged(
    f"{FINAL_MODEL_NAME}-merged",
    tokenizer,
    save_method="merged_16bit",  # or "merged_4bit" for smaller size
)
print(f"Merged model saved to {FINAL_MODEL_NAME}-merged")

# Save in GGUF format for llama.cpp (optional)
# model.save_pretrained_gguf(
#     f"{FINAL_MODEL_NAME}-gguf",
#     tokenizer,
#     quantization_method="q4_k_m"
# )

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [00:00<00:00, 31068.92it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [02:36<00:00, 39.00s/it]


Unsloth: Merge process complete. Saved to `/content/xlam-qwen2.5-7b-lora-merged`
Merged model saved to /content/xlam-qwen2.5-7b-lora-merged
