# Tool-Use DPO: End-to-End Training Pipeline

This notebook runs the complete **Tool-Use DPO** pipeline for aligning LLMs to strictly adhere to API/tool schemas using Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO).

## Pipeline Overview

1. **Setup**: Install dependencies and configure environment
2. **Data Generation**: Generate synthetic DPO triplets using GPT-4o (optional)
3. **SFT Cold Start**: Supervised fine-tuning on correct tool calls
4. **DPO Training**: Preference optimization with chosen vs. rejected pairs
5. **Evaluation**: Measure SSPR (Strict Schema Pass Rate)
6. **Inference**: Test the aligned model
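
The metric in step 5 is only named in this overview. A minimal sketch of how it could be computed, assuming SSPR is the fraction of model outputs that parse as JSON and pass a schema check. `strict_schema_pass_rate`, `is_valid`, and `limit_is_int` are hypothetical names used for illustration, and the simple type check stands in for full JSON Schema validation.

```python
import json
from typing import Callable, Iterable

def strict_schema_pass_rate(outputs: Iterable[str], is_valid: Callable[[dict], bool]) -> float:
    """Fraction of outputs that are JSON objects AND satisfy is_valid."""
    total = 0
    passed = 0
    for raw in outputs:
        total += 1
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failure
        if isinstance(call, dict) and is_valid(call):
            passed += 1
    return passed / total if total else 0.0

# Hypothetical check standing in for full schema validation
def limit_is_int(call: dict) -> bool:
    return isinstance(call.get("parameters", {}).get("limit"), int)

outputs = [
    '{"tool": "search_users", "parameters": {"location": "New York", "limit": 10}}',
    '{"tool": "search_users", "parameters": {"location": "New York", "limit": "10"}}',
    'Sure! Here is the call you asked for.',
    '{"tool": "search_users", "parameters": {"location": "Boston", "limit": 5}}',
]
sspr = strict_schema_pass_rate(outputs, limit_is_int)
print(f"SSPR: {sspr:.2f}")
```

Counting unparseable text as a failure keeps the metric strict: an answer wrapped in explanation or markdown scores the same as a schema violation.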

## Requirements
- Google Colab with GPU runtime (T4 or better, A100 recommended)
- OpenAI API key (optional, for data generation)
- ~15GB GPU VRAM (4-bit quantization reduces memory usage)

---
## 1. Setup & Installation

First, let's check the GPU and install all required dependencies.

In [None]:
# Check GPU availability
!nvidia-smi

Fri Dec 12 05:29:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   43C    P8             15W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:

# === INSTALLATION CELL ===
# Run this cell ONCE, then restart runtime and SKIP this cell

# Step 1: Fix numpy version FIRST (critical for compatibility)
!pip uninstall numpy -y
!pip install "numpy<2.0.0"

# Step 2: Install Unsloth (optimized for Colab)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Step 3: Install other dependencies
!pip install trl transformers datasets accelerate bitsandbytes peft
!pip install pyyaml tqdm jsonschema openai python-dotenv wandb

print("Installation complete! Now run the next cell to restart runtime.")

Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Collecting numpy<2.0.0
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is inc

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-w3p_qg_n/unsloth_f820425ff181490e8383124e1f23d2c3
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-w3p_qg_n/unsloth_f820425ff181490e8383124e1f23d2c3
  Resolved https://github.com/unslothai/unsloth.git to commit 345f5a5eb4ee17f79fde2d7c51b466fb9a213e98
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.12.3 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.12.3-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.gi

In [None]:
# === VERIFY INSTALLATION ===
# After restart, continue from HERE (skip installation and restart cells)

import numpy as np
print(f"NumPy version: {np.__version__}")
if np.__version__.startswith("2"):
    print("WARNING: NumPy 2.x detected! Run installation cell and restart runtime.")
else:
    print("NumPy version OK!")

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Test TRL import
from trl import SFTTrainer, SFTConfig
print("TRL imported successfully!")

NumPy version: 1.26.4
NumPy version OK!
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: NVIDIA L4
VRAM: 23.8 GB
TRL imported successfully!


In [None]:
# === MOUNT GOOGLE DRIVE ===
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set up project directory on Google Drive
DRIVE_PROJECT_DIR = "/content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO"
os.makedirs(DRIVE_PROJECT_DIR, exist_ok=True)
os.makedirs(f"{DRIVE_PROJECT_DIR}/checkpoints", exist_ok=True)
os.makedirs(f"{DRIVE_PROJECT_DIR}/data", exist_ok=True)
os.makedirs(f"{DRIVE_PROJECT_DIR}/logs", exist_ok=True)

print(f"Project directory: {DRIVE_PROJECT_DIR}")
print(f"Checkpoints will be saved to: {DRIVE_PROJECT_DIR}/checkpoints")

# Check for existing checkpoints
sft_checkpoint = f"{DRIVE_PROJECT_DIR}/checkpoints/sft_cold_start"
dpo_checkpoint = f"{DRIVE_PROJECT_DIR}/checkpoints/dpo_final"

if os.path.exists(sft_checkpoint):
    print(f"✓ Found existing SFT checkpoint: {sft_checkpoint}")
else:
    print(f"✗ No SFT checkpoint found (will train from scratch)")

if os.path.exists(dpo_checkpoint):
    print(f"✓ Found existing DPO checkpoint: {dpo_checkpoint}")
else:
    print(f"✗ No DPO checkpoint found (will train from scratch)")

Mounted at /content/drive
Project directory: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO
Checkpoints will be saved to: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints
✗ No SFT checkpoint found (will train from scratch)
✗ No DPO checkpoint found (will train from scratch)


---
## 2. Create Project Structure

Create the necessary directories and source files.

In [None]:
import os

# Create directory structure
directories = ['src', 'configs', 'data', 'checkpoints', 'logs']
for d in directories:
    os.makedirs(d, exist_ok=True)
    print(f"Created directory: {d}")

Created directory: src
Created directory: configs
Created directory: data
Created directory: checkpoints
Created directory: logs


### 2.1 Create Source Modules

In [None]:
%%writefile src/__init__.py
# Tool-Use DPO Source Package

Writing src/__init__.py


In [None]:
%%writefile src/utils.py
import os
import random
import logging
import torch
import numpy as np
from typing import Optional

def setup_logging(log_file: Optional[str] = None, level=logging.INFO):
    """Configures logging to console and optional file."""
    handlers = [logging.StreamHandler()]
    if log_file:
        log_dir = os.path.dirname(log_file)
        if log_dir:  # Only create directory if path has a directory component
            os.makedirs(log_dir, exist_ok=True)
        handlers.append(logging.FileHandler(log_file))

    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=level,
        handlers=handlers
    )

def seed_everything(seed: int = 42):
    """Ensures reproducibility across random, numpy, and torch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    logging.info(f"Global seed set to {seed}")

def get_device() -> str:
    """Returns the available computation device."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

Writing src/utils.py


In [None]:
%%writefile src/config_parser.py
import yaml
import argparse
import logging
from typing import Dict, Any

def load_config(config_path: str) -> Dict[str, Any]:
    """Loads a YAML config file."""
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config

def parse_args_with_config() -> Dict[str, Any]:
    """
    Parses CLI arguments.
    Allows specifying a --config YAML file and overriding keys via CLI.
    Example: python train.py --config cfg.yaml --learning_rate 0.0001
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True, help="Path to YAML config")

    # Parse known args first to get the config path
    args, remaining_argv = parser.parse_known_args()

    config = load_config(args.config)

    # Add config keys as CLI arguments for overriding
    parser_override = argparse.ArgumentParser()
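    for key, value in config.items():
        if isinstance(value, bool):
            # type=bool would treat any non-empty string (even "false") as True
            arg_type = lambda s: s.lower() in ("1", "true", "yes")
        else:
            arg_type = type(value) if value is not None else str
        parser_override.add_argument(f"--{key}", type=arg_type, default=value)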

    # Re-parse to allow overrides
    args_final = parser_override.parse_args(remaining_argv)

    # Convert to dict
    final_config = vars(args_final)
    logging.info(f"Loaded Configuration: {final_config}")
    return final_config

Writing src/config_parser.py
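
The override mechanism can be tried without a YAML file or a real command line. A minimal sketch, assuming an in-memory dict in place of `load_config` and an explicit `argv` list; `apply_overrides` and the `use_wandb` key are hypothetical. Booleans get a dedicated parser because `argparse` with `type=bool` treats any non-empty string as true.

```python
import argparse
from typing import Any, Dict, List

def _str_to_bool(text: str) -> bool:
    return text.strip().lower() in ("1", "true", "yes")

def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Expose every config key as a --flag whose default is the config value."""
    parser = argparse.ArgumentParser()
    for key, value in config.items():
        if isinstance(value, bool):
            arg_type = _str_to_bool
        elif value is None:
            arg_type = str
        else:
            arg_type = type(value)
        parser.add_argument(f"--{key}", type=arg_type, default=value)
    return vars(parser.parse_args(argv))

# Illustrative values; use_wandb is not a key in this notebook's configs
base = {"learning_rate": 3.0e-5, "epochs": 6, "use_wandb": False}
final = apply_overrides(base, ["--learning_rate", "0.0001", "--use_wandb", "true"])
print(final)
```

Keys that are not overridden keep their config values, so `epochs` stays at 6 here.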


In [None]:
%%writefile src/validation.py
import json
import jsonschema
import logging
from typing import Tuple, Dict, Any

def validate_tool_call(json_str: str, schema: Dict[str, Any]) -> Tuple[bool, str]:
    """
    Validates a JSON string against a tool schema.
    Returns: (is_valid, error_message)
    """
    try:
        data = json.loads(json_str)
        jsonschema.validate(instance=data, schema=schema)
        return True, ""
    except json.JSONDecodeError:
        return False, "Invalid JSON Syntax"
    except jsonschema.ValidationError as e:
        return False, f"Schema Violation: {e.message}"
    except Exception as e:
        return False, f"Unknown Error: {str(e)}"

Writing src/validation.py
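
A short usage sketch for the validator, run against the first sample triplet used later in this notebook. The function body is repeated in condensed form so the example runs on its own; it assumes the `jsonschema` package from the installation cell is available.

```python
import json
import jsonschema

def validate_tool_call(json_str, schema):
    """Condensed copy of the logic in src/validation.py."""
    try:
        jsonschema.validate(instance=json.loads(json_str), schema=schema)
        return True, ""
    except json.JSONDecodeError:
        return False, "Invalid JSON Syntax"
    except jsonschema.ValidationError as e:
        return False, f"Schema Violation: {e.message}"

schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "const": "search_users"},
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "limit": {"type": "integer"}},
            "required": ["location", "limit"],
            "additionalProperties": False,
        },
    },
    "required": ["tool", "parameters"],
    "additionalProperties": False,
}

chosen = '{"tool": "search_users", "parameters": {"location": "New York", "limit": 10}}'
rejected = '{"tool": "search_users", "parameters": {"location": "New York", "limit": "10"}}'

chosen_ok, _ = validate_tool_call(chosen, schema)
rejected_ok, reason = validate_tool_call(rejected, schema)
print(chosen_ok, rejected_ok, reason)
```

A triplet is only useful for training when the chosen call passes and the rejected call fails, which is the filter the data generation step applies.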


In [None]:
%%writefile src/model.py
from unsloth import FastLanguageModel  # keep the unsloth import first in this file
import logging
import torch

def load_model_and_tokenizer(
    model_name: str,
    max_seq_length: int = 2048,
    load_in_4bit: bool = True
):
    """
    Wrapper for Unsloth's FastLanguageModel.
    Optimized for Qwen-2.5-Coder and A10G/3090 hardware.
    """
    logging.info(f"Loading Unsloth model: {model_name}")

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=None,  # Auto-detect (Float16 or Bfloat16)
        load_in_4bit=load_in_4bit,
    )

    return model, tokenizer

def prepare_model_for_peft(model):
    """Configures LoRA adapters for training."""
    model = FastLanguageModel.get_peft_model(
        model,
        r=32,                # Increased from 16 for more capacity
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=64,       # 2x rank; with use_rslora the scale is alpha / sqrt(r), not alpha / r
        lora_dropout=0,      # 0 is optimized for Unsloth
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=True,     # Rank-stabilized LoRA for better training
        loftq_config=None,
    )
    return model

Writing src/model.py
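
The interaction between `r`, `lora_alpha`, and `use_rslora` is easy to check by hand. A minimal sketch of the scaling factor applied to the low-rank update, assuming the usual definitions: `alpha / r` for standard LoRA and `alpha / sqrt(r)` for rank-stabilized LoRA. `lora_scale` is a hypothetical helper, not an Unsloth or PEFT function.

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

r, alpha = 32, 64  # values from prepare_model_for_peft
standard = lora_scale(alpha, r, use_rslora=False)
stabilized = lora_scale(alpha, r, use_rslora=True)

print(f"standard LoRA scale: {standard:.2f}")
print(f"rsLoRA scale:        {stabilized:.2f}")
```

With these settings the rank-stabilized scale is several times larger than the `alpha / r` value, so an `alpha = 2 * r` rule of thumb carried over from standard LoRA may need retuning.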


In [None]:
%%writefile src/dataset.py
import json
from datasets import Dataset
import logging

SYSTEM_PROMPT = """You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object representing the tool call. Do not include any explanation, markdown formatting, or code blocks. Output raw JSON only."""

def format_dpo_pair(example):
    """
    Formats the triplet for ChatML.
    Qwen-2.5 expects specific chat templates.
    """
    # Include system prompt to constrain output format
    prompt_text = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{example['prompt']}<|im_end|>\n<|im_start|>assistant\n"

    return {
        "prompt": prompt_text,
        "chosen": example["chosen"] + "<|im_end|>",
        "rejected": example["rejected"] + "<|im_end|>"
    }

def load_dpo_dataset(data_path: str, tokenizer=None):
    """Loads JSONL data and prepares it for DPO training."""
    logging.info(f"Loading dataset from {data_path}")

    data = []
    with open(data_path, 'r') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))

    dataset = Dataset.from_list(data)

    # Map to ChatML format
    dataset = dataset.map(format_dpo_pair)

    return dataset

def load_sft_dataset(data_path: str, tokenizer):
    """
    Loads JSONL data for SFT Cold Start.
    Uses only 'prompt' + 'chosen'.
    """
    logging.info(f"Loading SFT dataset from {data_path}")

    def format_sft(example):
        # Include system prompt to teach the model output format
        text = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{example['prompt']}<|im_end|>\n<|im_start|>assistant\n{example['chosen']}<|im_end|>"
        return {"text": text}

    data = []
    with open(data_path, 'r') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))

    dataset = Dataset.from_list(data)
    dataset = dataset.map(format_sft)
    return dataset

Writing src/dataset.py
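
To see what the two loaders produce, the ChatML layout can be rendered for a single example. A minimal sketch, assuming a shortened stand-in for `SYSTEM_PROMPT`; `to_chatml` is a hypothetical helper that mirrors the f-strings in `format_dpo_pair` and `format_sft`.

```python
def to_chatml(system: str, user: str, assistant: str = "") -> str:
    """Render a ChatML conversation; leave the assistant turn open if no reply is given."""
    text = (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    if assistant:
        text += f"{assistant}<|im_end|>"
    return text

system = "Respond with raw JSON only."  # shortened stand-in for SYSTEM_PROMPT
user = "Get weather for San Francisco.\nTool: get_weather(city: str, units: enum['celsius', 'fahrenheit'])"
chosen = '{"tool": "get_weather", "parameters": {"city": "San Francisco", "units": "celsius"}}'

dpo_prompt = to_chatml(system, user)        # DPO: prompt only, completion kept separate
sft_text = to_chatml(system, user, chosen)  # SFT: prompt and chosen reply in one string
print(sft_text)
```

The DPO prompt ends with an open assistant turn, and the chosen or rejected completion plus `<|im_end|>` is appended to it, so both training stages see the same token layout.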


### 2.2 Create Configuration Files

In [None]:
%%writefile configs/sft_config.yaml
model_name: "Qwen/Qwen2.5-Coder-7B-Instruct"
data_path: "data/synthetic_triplets.jsonl"
output_dir: "checkpoints/sft_cold_start"
log_file: "logs/sft_training.log"
max_seq_length: 2048
batch_size: 8              # L4 can handle 8
grad_accum_steps: 2        # Effective batch = 16
learning_rate: 3.0e-5      # Slightly higher for stronger learning
epochs: 6                  # 6 epochs for thorough schema learning
warmup_steps: 100          # Not ~2%: 802 samples at effective batch 16 give ~300 steps over 6 epochs
weight_decay: 0.01         # Light regularization
seed: 42

Writing configs/sft_config.yaml
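
The batch and warmup settings can be sanity-checked with a little arithmetic. A rough sketch, assuming the 802-sample dataset that the SFT run later in this notebook reports and a ceiling-based step count; the trainer's own rounding may differ by a step per epoch. `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_samples: int, batch_size: int, grad_accum_steps: int, epochs: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs

num_samples = 802  # assumption: dataset size reported by the SFT run in this notebook
batch_size, grad_accum_steps, epochs, warmup_steps = 8, 2, 6, 100

effective_batch = batch_size * grad_accum_steps
total = optimizer_steps(num_samples, batch_size, grad_accum_steps, epochs)
warmup_fraction = warmup_steps / total

print(f"effective batch:       {effective_batch}")
print(f"total optimizer steps: {total}")
print(f"warmup share:          {warmup_fraction:.0%}")
```

Under these assumptions, 100 warmup steps cover roughly a third of training, and with a 200-sample dataset the warmup would outlast the run, so `warmup_steps` is worth scaling with the dataset size.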


In [None]:
%%writefile configs/dpo_config.yaml
# Point this to the output of SFT
model_name: "checkpoints/sft_cold_start"
data_path: "data/synthetic_triplets.jsonl"
output_dir: "checkpoints/dpo_final"
log_file: "logs/dpo_training.log"
max_seq_length: 2048
batch_size: 4              # L4 can handle 4 for DPO
grad_accum_steps: 4        # Effective batch = 16
learning_rate: 1.0e-6      # Slightly higher for better preference learning
epochs: 3                  # 3 epochs for stronger preference alignment
beta: 0.05                 # With IPO, the target log-ratio margin is 1/(2*beta) = 10
loss_type: "ipo"           # IPO targets a fixed margin, which can be more stable than standard DPO
max_prompt_length: 768     # Increased to handle longer prompts with system message
weight_decay: 0.01         # Light regularization
seed: 42

Writing configs/dpo_config.yaml
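
The effect of `beta` and `loss_type` is easier to see on scalar values. A minimal sketch of the two per-example losses, assuming the standard formulations: the sigmoid DPO loss `-log(sigmoid(beta * margin))` and the IPO loss `(margin - 1/(2*beta))^2`. Here `margin` is the chosen-minus-rejected difference of policy-to-reference log-ratios; the function names are hypothetical, and the sketch ignores batching and any length normalization a trainer may apply.

```python
import math

def sigmoid_dpo_loss(margin: float, beta: float) -> float:
    """Standard DPO: -log(sigmoid(beta * margin)), written as log(1 + exp(-x))."""
    return math.log1p(math.exp(-beta * margin))

def ipo_loss(margin: float, beta: float) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

beta = 0.05  # value from dpo_config.yaml
target = 1.0 / (2.0 * beta)
print(f"IPO target margin: {target}")

for margin in (0.0, 5.0, 10.0, 20.0):
    print(
        f"margin={margin:5.1f}  "
        f"dpo={sigmoid_dpo_loss(margin, beta):.4f}  "
        f"ipo={ipo_loss(margin, beta):.1f}"
    )
```

The sigmoid loss keeps shrinking as the margin grows, while the IPO loss is smallest at the target and penalizes overshooting it, which is the regularizing behavior that motivates `loss_type: "ipo"` here.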


In [None]:
# === UPDATE CONFIGS TO USE GOOGLE DRIVE PATHS ===
# This cell updates the configs to save checkpoints to Google Drive

import yaml

# Update SFT config with Drive paths
with open('configs/sft_config.yaml', 'r') as f:
    sft_cfg = yaml.safe_load(f)

sft_cfg['output_dir'] = f"{DRIVE_PROJECT_DIR}/checkpoints/sft_cold_start"
sft_cfg['log_file'] = f"{DRIVE_PROJECT_DIR}/logs/sft_training.log"

with open('configs/sft_config.yaml', 'w') as f:
    yaml.dump(sft_cfg, f, default_flow_style=False)

# Update DPO config with Drive paths
with open('configs/dpo_config.yaml', 'r') as f:
    dpo_cfg = yaml.safe_load(f)

dpo_cfg['model_name'] = f"{DRIVE_PROJECT_DIR}/checkpoints/sft_cold_start"
dpo_cfg['output_dir'] = f"{DRIVE_PROJECT_DIR}/checkpoints/dpo_final"
dpo_cfg['log_file'] = f"{DRIVE_PROJECT_DIR}/logs/dpo_training.log"

with open('configs/dpo_config.yaml', 'w') as f:
    yaml.dump(dpo_cfg, f, default_flow_style=False)

print("✓ Configs updated to use Google Drive paths:")
print(f"  SFT output: {sft_cfg['output_dir']}")
print(f"  DPO model:  {dpo_cfg['model_name']}")
print(f"  DPO output: {dpo_cfg['output_dir']}")

✓ Configs updated to use Google Drive paths:
  SFT output: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
  DPO model:  /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
  DPO output: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/dpo_final


### 2.3 Create Sample Data

This sample data demonstrates the DPO triplet format (prompt, chosen, rejected).
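
Each rejected call in the sample file differs from its chosen call in exactly one way. A minimal stdlib sketch that labels the difference by comparing the two `parameters` objects; `describe_rejection` is a hypothetical helper, and because it compares the pair directly instead of consulting the schema, it cannot tell an enum violation from any other changed value.

```python
import json

def describe_rejection(chosen: str, rejected: str) -> str:
    """Label how the rejected call's parameters differ from the chosen call's."""
    good = json.loads(chosen)["parameters"]
    bad = json.loads(rejected)["parameters"]
    extra = sorted(set(bad) - set(good))
    missing = sorted(set(good) - set(bad))
    if extra:
        return f"hallucinated parameter: {extra[0]}"
    if missing:
        return f"missing parameter: {missing[0]}"
    for key in good:
        if type(good[key]) is not type(bad[key]):
            return f"type mismatch on {key}"
        if good[key] != bad[key]:
            return f"changed value on {key}"
    return "no difference"

def call(tool: str, **parameters) -> str:
    return json.dumps({"tool": tool, "parameters": parameters})

# Pairs modeled on the sample triplets
pairs = [
    (call("search_users", location="New York", limit=10),
     call("search_users", location="New York", limit="10")),
    (call("book_flight", origin="NYC", passengers=2),
     call("book_flight", origin="NYC", passengers=2, **{"class": "economy"})),
    (call("send_email", to="john@example.com", subject="Meeting", body="Hi"),
     call("send_email", to="john@example.com", subject="Meeting")),
    (call("create_task", title="Review PR", priority="high"),
     call("create_task", title="Review PR", priority="urgent")),
]
labels = [describe_rejection(c, r) for c, r in pairs]
for label in labels:
    print(label)
```

A check like this can help confirm that a dataset covers all the intended failure modes instead of over-representing one of them.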

In [None]:
%%writefile data/sample_triplets.jsonl
{"prompt": "Search for users in New York with a limit of 10 results.\nTool: search_users(location: str, limit: int)", "chosen": "{\"tool\": \"search_users\", \"parameters\": {\"location\": \"New York\", \"limit\": 10}}", "rejected": "{\"tool\": \"search_users\", \"parameters\": {\"location\": \"New York\", \"limit\": \"10\"}}", "schema": {"type": "object", "properties": {"tool": {"type": "string", "const": "search_users"}, "parameters": {"type": "object", "properties": {"location": {"type": "string"}, "limit": {"type": "integer"}}, "required": ["location", "limit"], "additionalProperties": false}}, "required": ["tool", "parameters"], "additionalProperties": false}}
{"prompt": "Create a new task with title 'Review PR' and priority high.\nTool: create_task(title: str, priority: enum['low', 'medium', 'high'])", "chosen": "{\"tool\": \"create_task\", \"parameters\": {\"title\": \"Review PR\", \"priority\": \"high\"}}", "rejected": "{\"tool\": \"create_task\", \"parameters\": {\"title\": \"Review PR\", \"priority\": \"urgent\"}}", "schema": {"type": "object", "properties": {"tool": {"type": "string", "const": "create_task"}, "parameters": {"type": "object", "properties": {"title": {"type": "string"}, "priority": {"type": "string", "enum": ["low", "medium", "high"]}}, "required": ["title", "priority"], "additionalProperties": false}}, "required": ["tool", "parameters"], "additionalProperties": false}}
{"prompt": "Send an email to john@example.com with subject 'Meeting' and body 'Let's meet tomorrow'.\nTool: send_email(to: str, subject: str, body: str)", "chosen": "{\"tool\": \"send_email\", \"parameters\": {\"to\": \"john@example.com\", \"subject\": \"Meeting\", \"body\": \"Let's meet tomorrow\"}}", "rejected": "{\"tool\": \"send_email\", \"parameters\": {\"to\": \"john@example.com\", \"subject\": \"Meeting\"}}", "schema": {"type": "object", "properties": {"tool": {"type": "string", "const": "send_email"}, "parameters": {"type": "object", "properties": {"to": {"type": "string"}, "subject": {"type": "string"}, "body": {"type": "string"}}, "required": ["to", "subject", "body"], "additionalProperties": false}}, "required": ["tool", "parameters"], "additionalProperties": false}}
{"prompt": "Book a flight from NYC to LAX on 2024-03-15 for 2 passengers.\nTool: book_flight(origin: str, destination: str, date: str, passengers: int)", "chosen": "{\"tool\": \"book_flight\", \"parameters\": {\"origin\": \"NYC\", \"destination\": \"LAX\", \"date\": \"2024-03-15\", \"passengers\": 2}}", "rejected": "{\"tool\": \"book_flight\", \"parameters\": {\"origin\": \"NYC\", \"destination\": \"LAX\", \"date\": \"2024-03-15\", \"passengers\": 2, \"class\": \"economy\"}}", "schema": {"type": "object", "properties": {"tool": {"type": "string", "const": "book_flight"}, "parameters": {"type": "object", "properties": {"origin": {"type": "string"}, "destination": {"type": "string"}, "date": {"type": "string"}, "passengers": {"type": "integer"}}, "required": ["origin", "destination", "date", "passengers"], "additionalProperties": false}}, "required": ["tool", "parameters"], "additionalProperties": false}}
{"prompt": "Get weather for San Francisco.\nTool: get_weather(city: str, units: enum['celsius', 'fahrenheit'])", "chosen": "{\"tool\": \"get_weather\", \"parameters\": {\"city\": \"San Francisco\", \"units\": \"celsius\"}}", "rejected": "{\"tool\": \"get_weather\", \"parameters\": {\"city\": \"San Francisco\", \"units\": \"metric\"}}", "schema": {"type": "object", "properties": {"tool": {"type": "string", "const": "get_weather"}, "parameters": {"type": "object", "properties": {"city": {"type": "string"}, "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["city", "units"], "additionalProperties": false}}, "required": ["tool", "parameters"], "additionalProperties": false}}

Writing data/sample_triplets.jsonl


In [None]:
# Verify sample data
import json

with open('data/sample_triplets.jsonl', 'r') as f:  # the file written by the previous cell
    samples = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(samples)} sample triplets")
print("\nExample triplet:")
print(f"  Prompt: {samples[0]['prompt'][:60]}...")
print(f"  Chosen: {samples[0]['chosen']}")
print(f"  Rejected: {samples[0]['rejected']}")

Loaded 5 sample triplets

Example triplet:
  Prompt: Search for users in New York with a limit of 10 results.
Too...
  Chosen: {"tool": "search_users", "parameters": {"location": "New York", "limit": 10}}
  Rejected: {"tool": "search_users", "parameters": {"location": "New York", "limit": "10"}}


---
## 3. Data Generation (Optional)

Generate synthetic DPO triplets using GPT-4o. **Skip this section if you don't have an OpenAI API key** - the sample data is sufficient for testing.

Set `NUM_SAMPLES` to control how many samples to generate (100-500 recommended for training).

In [None]:
# Set your OpenAI API key (optional)
import os
from getpass import getpass

# Uncomment and run to set API key
# os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

GENERATE_DATA = True  # Set to True to generate synthetic data
NUM_SAMPLES = 200  # Number of samples to generate

In [None]:
if GENERATE_DATA:
    import json
    import logging
    from openai import OpenAI
    from src.utils import setup_logging
    from src.validation import validate_tool_call

    setup_logging()

    SYSTEM_PROMPT = """
    You are a Synthetic Data Generator for an LLM Alignment project.
    Your goal is to generate "Hard Negative" DPO triplets for Tool Use.

    Triplets: (Prompt, Chosen, Rejected)

    1. **Prompt**: A user query requiring a specific tool call.
    2. **Chosen**: A PERFECTLY valid JSON tool call adhering to the schema.
    3. **Rejected**: A SUBTLY incorrect tool call. It MUST be valid JSON, but it must fail the schema in one of these specific ways:
       - **Hallucinated Parameter**: Add a plausible argument not in the schema.
       - **Type Mismatch**: Pass a string "5" where an integer 5 is required.
       - **Enum Violation**: Pass "urgent" when allowed values are ["high", "medium", "low"].
       - **Missing Required**: Omit a mandatory argument.

    Output Format: a single JSON object with one key, "examples", holding a list of triplets.
    Each triplet must contain keys: "prompt", "chosen", "rejected", "schema".
    The "schema" key should contain the full JSON schema used for validation.
    """

    def generate_batch(client, num_samples=10):
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Generate {num_samples} unique, diverse tool-use triplets."}
            ],
            response_format={"type": "json_object"}
        )
        content = completion.choices[0].message.content
        data = json.loads(content)
        return data.get("examples", data.get("triplets", [data]))

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    valid_samples = 0
    output_file = "data/synthetic_triplets.jsonl"

    with open(output_file, "a") as f:
        while valid_samples < NUM_SAMPLES:
            try:
                logging.info(f"Generating batch... ({valid_samples}/{NUM_SAMPLES})")
                batch = generate_batch(client, num_samples=min(5, NUM_SAMPLES - valid_samples))

                for item in batch:
                    schema = item.get("schema")
                    chosen_str = item["chosen"] if isinstance(item["chosen"], str) else json.dumps(item["chosen"])
                    rejected_str = item["rejected"] if isinstance(item["rejected"], str) else json.dumps(item["rejected"])

                    chosen_valid, _ = validate_tool_call(chosen_str, schema)
                    rejected_valid, _ = validate_tool_call(rejected_str, schema)

                    if chosen_valid and not rejected_valid:
                        item["chosen"] = chosen_str
                        item["rejected"] = rejected_str
                        f.write(json.dumps(item) + "\n")
                        valid_samples += 1
                        print(valid_samples)
                    else:
                        logging.warning("Discarded sample: Validation logic failed")

            except Exception as e:
                logging.error(f"Error during generation: {e}")

    print(f"\nGenerated {valid_samples} valid samples to {output_file}")
else:
    # Use sample data for training
    import shutil
    shutil.copy('data/sample_triplets.jsonl', 'data/synthetic_triplets.jsonl')
    print("Using sample data for training (5 examples)")
    print("Note: For better results, generate more data or set GENERATE_DATA=True")



[Progress counter output condensed: valid-sample count printed from 1 to 200]

Generated 200 valid samples to data/synthetic_triplets.jsonl


In [None]:
!zip -r tool_use_dpo_data.zip data/*

print("\nContents of the zip file:")
!unzip -l tool_use_dpo_data.zip

# Download using Colab's file download
from google.colab import files
files.download('tool_use_dpo_data.zip')

  adding: data/sample_triplets.jsonl (deflated 78%)
  adding: data/synthetic_triplets.jsonl (deflated 89%)

Contents of the zip file:
Archive:  tool_use_dpo_data.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     3832  2025-12-11 16:28   data/sample_triplets.jsonl
   131084  2025-12-11 17:56   data/synthetic_triplets.jsonl
---------                     -------
   134916                     2 files



In [None]:
from google.colab import drive
drive.mount('/content/drive')

---
## 4. Stage 1: SFT Cold Start Training

Supervised Fine-Tuning on the "chosen" responses to teach the model the basic tool call format.

In [None]:
import unsloth  # import before trl, as the Unsloth warning in this cell's output requests
import torch
from trl import SFTTrainer, SFTConfig
from src.model import load_model_and_tokenizer, prepare_model_for_peft
from src.dataset import load_sft_dataset
from src.config_parser import load_config
from src.utils import setup_logging, seed_everything

# Load SFT configuration
sft_config = load_config('configs/sft_config.yaml')
print("SFT Configuration:")
for k, v in sft_config.items():
    print(f"  {k}: {v}")


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


  import trl.experimental.openenv.utils as openenv_utils


SFT Configuration:
  batch_size: 8
  data_path: data/synthetic_triplets.jsonl
  epochs: 6
  grad_accum_steps: 2
  learning_rate: 3e-05
  log_file: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/logs/sft_training.log
  max_seq_length: 2048
  model_name: Qwen/Qwen2.5-Coder-7B-Instruct
  output_dir: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
  seed: 42
  warmup_steps: 100
  weight_decay: 0.01


In [None]:
# Initialize logging and seed
setup_logging(sft_config.get("log_file"))
seed_everything(sft_config.get("seed", 42))

# Load model and tokenizer
print("\nLoading model...")
model, tokenizer = load_model_and_tokenizer(
    model_name=sft_config["model_name"],
    max_seq_length=sft_config["max_seq_length"]
)

# Fix tokenizer eos_token for Qwen (required for latest TRL)
if tokenizer.eos_token is None or tokenizer.eos_token == "<EOS_TOKEN>":
    tokenizer.eos_token = "<|im_end|>"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"EOS token: {tokenizer.eos_token}")
print(f"PAD token: {tokenizer.pad_token}")

# Prepare model for PEFT (LoRA)
model = prepare_model_for_peft(model)
print("Model loaded and prepared for training!")


Loading model...
==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

EOS token: <|im_end|>
PAD token: <|PAD_TOKEN|>


Unsloth 2025.12.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Model loaded and prepared for training!


In [None]:
sft_config["data_path"]

'data/synthetic_triplets.jsonl'

In [None]:
# Load dataset
dataset = load_sft_dataset(DRIVE_PROJECT_DIR+"/"+sft_config["data_path"], tokenizer)
print(f"\nDataset loaded: {len(dataset)} samples")
print(f"Sample text:\n{dataset[0]['text'][:200]}...")

# Define formatting function for SFT (latest TRL API)
def formatting_func(example):
    return example["text"]

Map:   0%|          | 0/802 [00:00<?, ? examples/s]


Dataset loaded: 802 samples
Sample text:
<|im_start|>system
You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object representing the tool call. Do not include any explanation,...


In [None]:

# === Debug: Find where <EOS_TOKEN> is coming from ===
print("Checking all eos_token sources...")

print(f"tokenizer.eos_token: {tokenizer.eos_token}")
print(f"tokenizer.eos_token_id: {tokenizer.eos_token_id}")

if hasattr(model, 'config'):
    print(f"model.config.eos_token_id: {getattr(model.config, 'eos_token_id', 'N/A')}")

if hasattr(model, 'generation_config'):
    print(f"model.generation_config.eos_token_id: {getattr(model.generation_config, 'eos_token_id', 'N/A')}")

# Check if there's a special attribute Unsloth added
for attr in dir(tokenizer):
    if 'eos' in attr.lower():
        print(f"tokenizer.{attr}: {getattr(tokenizer, attr, 'N/A')}")

Checking all eos_token sources...
tokenizer.eos_token: <|im_end|>
tokenizer.eos_token_id: 151645
model.config.eos_token_id: 151645
model.generation_config.eos_token_id: [151645, 151643]
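
The output above shows the tokenizer and `model.config` agreeing on id 151645, while `model.generation_config` carries a list of two ids. The cells that follow repeat the same alignment by hand several times, so it could be wrapped in one function. The sketch below is a hypothetical helper (`align_special_tokens` is not part of this notebook, TRL, or Unsloth). It relies only on `convert_tokens_to_ids` and plain attribute assignment, and it reuses EOS as the pad token, as the notebook's own fix does.

```python
def align_special_tokens(tokenizer, model, eos_token="<|im_end|>"):
    """Point EOS and PAD at the same token on the tokenizer and the model configs.

    Hypothetical helper that mirrors the manual assignments used in this notebook.
    """
    eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)
    if eos_token_id is None:
        raise ValueError(f"{eos_token!r} is not in the tokenizer vocabulary")

    tokenizer.eos_token = eos_token
    tokenizer.pad_token = eos_token  # reuse EOS as PAD, as the notebook does

    # Overwrite both configs so a stale list such as [151645, 151643] cannot linger.
    for cfg_name in ("config", "generation_config"):
        cfg = getattr(model, cfg_name, None)
        if cfg is not None:
            cfg.eos_token_id = eos_token_id
            cfg.pad_token_id = eos_token_id
    return eos_token_id
```

Calling `align_special_tokens(tokenizer, model)` once after each model load would replace the repeated blocks of assignments.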


In [None]:
# === Comprehensive fix ===
eos_token = "<|im_end|>"
eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)

# Fix tokenizer
tokenizer.eos_token = eos_token
tokenizer.eos_token_id = eos_token_id
tokenizer.pad_token = eos_token
tokenizer.pad_token_id = eos_token_id

# Fix model config
model.config.eos_token_id = eos_token_id
model.config.pad_token_id = eos_token_id
model.generation_config.eos_token_id = eos_token_id
model.generation_config.pad_token_id = eos_token_id

print(f"Set all eos_token to: {eos_token} (id: {eos_token_id})")

# Pre-tokenize the dataset ourselves to avoid TRL's processing
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=sft_config["max_seq_length"],
        padding=False,
    )

# Tokenize dataset before passing to trainer
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)

print(f"Tokenized dataset: {tokenized_dataset}")

# Create config
sft_args = SFTConfig(
    per_device_train_batch_size=sft_config["batch_size"],
    gradient_accumulation_steps=sft_config["grad_accum_steps"],
    warmup_steps=sft_config["warmup_steps"],
    num_train_epochs=sft_config["epochs"],
    learning_rate=sft_config["learning_rate"],
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    output_dir=sft_config["output_dir"],
    optim="adamw_8bit",
    seed=sft_config["seed"],
)

# Force override eos_token
sft_args.eos_token = None

# Create trainer with pre-tokenized dataset
from transformers import DataCollatorForLanguageModeling

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=tokenized_dataset,
    args=sft_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

print("SFT Trainer initialized!")



Set all eos_token to: <|im_end|> (id: 151645)


Map:   0%|          | 0/802 [00:00<?, ? examples/s]

Tokenized dataset: Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 802
})


ValueError: The specified `eos_token` ('<EOS_TOKEN>') is not found in the vocabulary of the given `processing_class` (Qwen2TokenizerFast). Ensure that the `eos_token` exists in the vocabulary before using it as an EOS token.

In [None]:
# Train!
print("Starting SFT training...\n")
trainer.train()

# Save model
trainer.save_model(sft_config["output_dir"])
tokenizer.save_pretrained(sft_config["output_dir"])
print(f"\nSFT model saved to: {sft_config['output_dir']}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5 | Num Epochs = 1 | Total steps = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 40,370,176 of 7,655,986,688 (0.53% trained)


Starting SFT training...



wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


Unsloth: Will smartly offload gradients to save VRAM!


TypeError: 'function' object is not subscriptable

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# === Fix tokenizer ===
eos_token = "<|im_end|>"
eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)
tokenizer.eos_token = eos_token
tokenizer.eos_token_id = eos_token_id
tokenizer.pad_token = eos_token
tokenizer.pad_token_id = eos_token_id
tokenizer.padding_side = "right"
model.config.eos_token_id = eos_token_id
model.config.pad_token_id = eos_token_id

# Pre-tokenize the dataset (without padding - collator will handle it)
def tokenize_function(examples):
    result = tokenizer(
        examples["text"],
        truncation=True,
        max_length=sft_config["max_seq_length"],
        padding=False,
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
print(f"Tokenized dataset: {len(tokenized_dataset)} samples")

# Use DataCollatorForSeq2Seq which handles padding properly
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt",
)


# Check for existing checkpoint to resume from
import glob
resume_from_checkpoint = None
checkpoint_dirs = glob.glob(f"{sft_config['output_dir']}/checkpoint-*")
if checkpoint_dirs:
    # Get the latest checkpoint
    resume_from_checkpoint = max(checkpoint_dirs, key=lambda x: int(x.split('-')[-1]))
    print(f"✓ Found checkpoint to resume from: {resume_from_checkpoint}")
else:
    print("✗ No checkpoint found, training from scratch")

# Create training arguments with checkpointing
training_args = TrainingArguments(
    output_dir=sft_config["output_dir"],
    per_device_train_batch_size=sft_config["batch_size"],
    gradient_accumulation_steps=sft_config["grad_accum_steps"],
    warmup_steps=sft_config["warmup_steps"],
    num_train_epochs=sft_config["epochs"],
    learning_rate=sft_config["learning_rate"],
    weight_decay=sft_config.get("weight_decay", 0.01),
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    seed=sft_config["seed"],
    report_to="none",
    # Checkpoint settings
    save_strategy="steps",
    save_steps=100,              # Save every 100 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)


print("Trainer initialized!")

# Train
print("Starting training...")
trainer.train(resume_from_checkpoint=resume_from_checkpoint)  # None starts a fresh run

# Save
trainer.save_model(sft_config["output_dir"])
tokenizer.save_pretrained(sft_config["output_dir"])
print(f"Model saved to: {sft_config['output_dir']}")


Map:   0%|          | 0/802 [00:00<?, ? examples/s]

Tokenized dataset: 802 samples
✗ No checkpoint found, training from scratch
Trainer initialized!
Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 802 | Num Epochs = 6 | Total steps = 306
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 2 x 1) = 16
 "-____-"     Trainable parameters = 80,740,352 of 7,696,356,864 (1.05% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.319
2,2.2675
3,2.309
4,2.2988
5,2.4094
6,2.3421
7,2.2234
8,2.2743
9,2.2754
10,2.2083


Model saved to: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
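
The banner above reports 306 total steps for 802 examples, and the DPO run later in the notebook reports 153. Both figures can be reproduced with a little arithmetic. The sketch below assumes the trainer rounds up both the last partial batch and the last partial gradient-accumulation group, which is consistent with the logged totals; `total_optimizer_steps` is an illustrative name, not a library function.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, rounding up partial batches and accumulation groups."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# SFT settings from the banner: batch 8, accumulation 2, 6 epochs
print(total_optimizer_steps(802, 8, 2, 6))  # 306
# DPO settings from dpo_config.yaml: batch 4, accumulation 4, 3 epochs
print(total_optimizer_steps(802, 4, 4, 3))  # 153
```

With `save_steps=100`, the SFT run therefore writes checkpoints at steps 100, 200 and 300.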


In [None]:
# # Train! (with resume support)
# if resume_from_checkpoint:
#     print(f"Resuming SFT training from: {resume_from_checkpoint}\n")
# else:
#     print("Starting SFT training from scratch...\n")

# trainer.train(resume_from_checkpoint=resume_from_checkpoint)

# # Save final model
# trainer.save_model(sft_config["output_dir"])
# tokenizer.save_pretrained(sft_config["output_dir"])
# print(f"\n✓ SFT model saved to: {sft_config['output_dir']}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 802 | Num Epochs = 6 | Total steps = 306
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 2 x 1) = 16
 "-____-"     Trainable parameters = 80,740,352 of 7,696,356,864 (1.05% trained)


Starting SFT training from scratch...



Step,Training Loss
1,0.1145
2,0.1387


KeyboardInterrupt: 

In [None]:
# Free memory before DPO training
del model, trainer
torch.cuda.empty_cache()
print("Memory cleared for DPO training")

Memory cleared for DPO training


---
## 5. Stage 2: DPO Training

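This stage applies preference optimization to chosen vs. rejected pairs so that the model learns to prefer schema-valid outputs. The configuration loaded below uses the IPO loss (`loss_type: ipo`) with `beta: 0.05`, run through TRL's `DPOTrainer`.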

In [None]:
import torch
from unsloth import PatchDPOTrainer, FastLanguageModel
from trl import DPOTrainer, DPOConfig
from src.dataset import load_dpo_dataset
from src.config_parser import load_config
from src.utils import setup_logging, seed_everything

# Patch TRL for Unsloth speedup (Must be before DPOTrainer init)
PatchDPOTrainer()

# Load DPO configuration
dpo_config = load_config('configs/dpo_config.yaml')
print("DPO Configuration:")
for k, v in dpo_config.items():
    print(f"  {k}: {v}")

DPO Configuration:
  batch_size: 4
  beta: 0.05
  data_path: data/synthetic_triplets.jsonl
  epochs: 3
  grad_accum_steps: 4
  learning_rate: 1e-06
  log_file: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/logs/dpo_training.log
  loss_type: ipo
  max_prompt_length: 768
  max_seq_length: 2048
  model_name: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
  output_dir: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/dpo_final
  seed: 42
  weight_decay: 0.01


In [None]:
# Initialize logging and seed
setup_logging(dpo_config.get("log_file"))
seed_everything(dpo_config.get("seed", 42))

# Load the SFT checkpoint
print("\nLoading SFT checkpoint...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=dpo_config["model_name"],
    max_seq_length=dpo_config["max_seq_length"],
    load_in_4bit=True,
)

# Fix tokenizer eos_token for Qwen (required for latest TRL)
if tokenizer.eos_token is None or tokenizer.eos_token == "<EOS_TOKEN>":
    tokenizer.eos_token = "<|im_end|>"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"EOS token: {tokenizer.eos_token}")
print(f"PAD token: {tokenizer.pad_token}")

# Enable training mode
FastLanguageModel.for_training(model)
print("Model loaded and ready for DPO training!")


Loading SFT checkpoint...
==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/qwen2.5-coder-7b-instruct-bnb-4bit does not have a padding token! Will use pad_token = <|vision_pad|>.
EOS token: <|im_end|>
PAD token: <|vision_pad|>
Model loaded and ready for DPO training!


In [None]:
# Load DPO dataset
dataset = load_dpo_dataset(dpo_config["data_path"], tokenizer)
print(f"\nDPO Dataset loaded: {len(dataset)} samples")
print(f"\nSample prompt:\n{dataset[0]['prompt'][:150]}...")
print(f"\nChosen: {dataset[0]['chosen'][:100]}...")
print(f"\nRejected: {dataset[0]['rejected'][:100]}...")

Map:   0%|          | 0/802 [00:00<?, ? examples/s]


DPO Dataset loaded: 802 samples

Sample prompt:
<|im_start|>system
You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object represent...

Chosen: {"action": "schedule_meeting", "platform": "zoom", "with": "Dr. Smith", "date": "2023-12-15", "time"...

Rejected: {"action": "schedule_meeting", "platform": "zoom", "with": "Dr. Smith", "date": "2023-12-15"}<|im_en...


In [None]:
# Fix tokenizer eos_token before creating trainer (Unsloth/Qwen compatibility)
if tokenizer.eos_token is None or tokenizer.eos_token == "<EOS_TOKEN>":
    tokenizer.eos_token = "<|im_end|>"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print(f"Using EOS token: {tokenizer.eos_token}")

# Create DPO trainer (updated API for latest TRL)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Unsloth handles reference model efficiently
    processing_class=tokenizer,
    train_dataset=dataset,
    args=DPOConfig(
        per_device_train_batch_size=dpo_config["batch_size"],
        gradient_accumulation_steps=dpo_config["grad_accum_steps"],
        warmup_ratio=0.1,
        num_train_epochs=dpo_config["epochs"],
        learning_rate=dpo_config["learning_rate"],  # CRITICAL: very low LR (1e-6 in dpo_config.yaml)
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        beta=dpo_config["beta"],
        loss_type=dpo_config["loss_type"],  # "ipo"
        max_length=dpo_config["max_seq_length"],
        output_dir=dpo_config["output_dir"],
        optim="adamw_8bit",
        seed=dpo_config["seed"],
    ),
)

print("DPO Trainer initialized!")

In [None]:
# Train!
print("Starting DPO training...\n")
dpo_trainer.train()

# Save model
dpo_trainer.save_model(dpo_config["output_dir"])
tokenizer.save_pretrained(dpo_config["output_dir"])
print(f"\nDPO model saved to: {dpo_config['output_dir']}")

In [None]:
import torch
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from src.dataset import load_dpo_dataset
from src.config_parser import load_config
from src.utils import setup_logging, seed_everything

# Load DPO configuration
dpo_config = load_config('configs/dpo_config.yaml')
print("DPO Configuration:")
for k, v in dpo_config.items():
    print(f"  {k}: {v}")

# Initialize logging and seed
setup_logging(dpo_config.get("log_file"))
seed_everything(dpo_config.get("seed", 42))

# Load the SFT checkpoint
print("\nLoading SFT checkpoint...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=dpo_config["model_name"],
    max_seq_length=dpo_config["max_seq_length"],
    load_in_4bit=True,
)

# === Fix tokenizer (same as SFT) ===
eos_token = "<|im_end|>"
eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)
tokenizer.eos_token = eos_token
tokenizer.eos_token_id = eos_token_id
tokenizer.pad_token = eos_token
tokenizer.pad_token_id = eos_token_id
tokenizer.padding_side = "right"
model.config.eos_token_id = eos_token_id
model.config.pad_token_id = eos_token_id
model.generation_config.eos_token_id = eos_token_id
model.generation_config.pad_token_id = eos_token_id

print(f"EOS token: {tokenizer.eos_token} (id: {eos_token_id})")

# Enable training mode
FastLanguageModel.for_training(model)
print("Model loaded and ready for DPO training!")

# Load DPO dataset
dataset = load_dpo_dataset(DRIVE_PROJECT_DIR+"/"+dpo_config["data_path"], tokenizer)
print(f"\nDPO Dataset loaded: {len(dataset)} samples")

# Check for existing DPO checkpoint to resume from
import glob
dpo_resume_from_checkpoint = None
dpo_checkpoint_dirs = glob.glob(f"{dpo_config['output_dir']}/checkpoint-*")
if dpo_checkpoint_dirs:
    dpo_resume_from_checkpoint = max(dpo_checkpoint_dirs, key=lambda x: int(x.split('-')[-1]))
    print(f"✓ Found DPO checkpoint to resume from: {dpo_resume_from_checkpoint}")
else:
    print("✗ No DPO checkpoint found, training from scratch")

# Create DPO config with fixes and checkpointing
dpo_args = DPOConfig(
    per_device_train_batch_size=dpo_config["batch_size"],
    gradient_accumulation_steps=dpo_config["grad_accum_steps"],
    warmup_ratio=0.1,
    num_train_epochs=dpo_config["epochs"],
    learning_rate=dpo_config["learning_rate"],
    weight_decay=dpo_config.get("weight_decay", 0.01),
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    beta=dpo_config["beta"],
    loss_type=dpo_config["loss_type"],
    max_length=dpo_config["max_seq_length"],
    max_prompt_length=dpo_config.get("max_prompt_length", 768),
    output_dir=dpo_config["output_dir"],
    optim="adamw_8bit",
    seed=dpo_config["seed"],
    report_to="none",
    dataset_num_proc=1,
    # Checkpoint settings
    save_strategy="steps",
    save_steps=50,               # Save every 50 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
)

# Override eos_token to skip TRL's problematic check
dpo_args.eos_token = None

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=dpo_args,
)

print("DPO Trainer initialized!")


DPO Configuration:
  batch_size: 4
  beta: 0.05
  data_path: data/synthetic_triplets.jsonl
  epochs: 3
  grad_accum_steps: 4
  learning_rate: 1e-06
  log_file: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/logs/dpo_training.log
  loss_type: ipo
  max_prompt_length: 768
  max_seq_length: 2048
  model_name: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/sft_cold_start
  output_dir: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/dpo_final
  seed: 42
  weight_decay: 0.01

Loading SFT checkpoint...
==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/qwen

Map:   0%|          | 0/802 [00:00<?, ? examples/s]


DPO Dataset loaded: 802 samples
✗ No DPO checkpoint found, training from scratch


Extracting prompt in train dataset (num_proc=1):   0%|          | 0/802 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=1):   0%|          | 0/802 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=1):   0%|          | 0/802 [00:00<?, ? examples/s]

DPO Trainer initialized!
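
The checkpoint lookup appears twice in this notebook (once for SFT, once for DPO) as a `glob` plus `max(..., key=...)` expression. A minimal sketch of a shared helper is shown below. `latest_checkpoint` is a hypothetical name. It compares step numbers as integers, because plain string sorting would rank `checkpoint-50` above `checkpoint-100`, and it ignores directories whose suffix is not numeric.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)")

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> directory, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for path in root.iterdir():
        match = _CHECKPOINT_RE.fullmatch(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Pick by step number only, so 100 ranks above 50.
    return str(max(candidates, key=lambda item: item[0])[1])
```

The return value can be passed directly as `resume_from_checkpoint`. `None` means no checkpoint was found and training starts from scratch.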


In [None]:
# Train! (with resume support)
if dpo_resume_from_checkpoint:
    print(f"Resuming DPO training from: {dpo_resume_from_checkpoint}\n")
else:
    print("Starting DPO training from scratch...\n")

dpo_trainer.train(resume_from_checkpoint=dpo_resume_from_checkpoint)

# Save final model
dpo_trainer.save_model(dpo_config["output_dir"])
tokenizer.save_pretrained(dpo_config["output_dir"])
print(f"\n✓ DPO model saved to: {dpo_config['output_dir']}")

The model is already on multiple devices. Skipping the move to device specified in `args`.


Starting DPO training from scratch...



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 802 | Num Epochs = 3 | Total steps = 153
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 80,740,352 of 7,696,356,864 (1.05% trained)


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / chosen,logps / rejected,logits / chosen,logits / rejected,eval_logits / chosen,eval_logits / rejected,nll_loss
1,96.7968,0.015757,0.007665,1.0,0.008092,-0.739339,-1.282674,-1.608313,-1.539703,0,0,0
2,97.3543,0.016097,0.009384,0.875,0.006713,-0.820314,-1.429197,-1.5977,-1.51944,No Log,No Log,No Log
3,97.3956,0.018604,0.01198,0.875,0.006623,-0.69575,-1.164992,-1.625004,-1.540164,No Log,No Log,No Log
4,95.787,0.02209,0.011343,1.0,0.010747,-0.815367,-1.331699,-1.561283,-1.487114,No Log,No Log,No Log
5,96.6703,0.018039,0.009583,0.875,0.008455,-0.864524,-1.313815,-1.582944,-1.51999,No Log,No Log,No Log
6,98.3242,0.018231,0.014002,0.875,0.004229,-0.689189,-1.106068,-1.714337,-1.587417,No Log,No Log,No Log
7,96.9666,0.019569,0.011883,0.875,0.007686,-0.761711,-1.409695,-1.562457,-1.430319,No Log,No Log,No Log
8,97.2344,0.017397,0.010422,0.9375,0.006975,-0.775216,-1.163365,-1.629797,-1.559888,No Log,No Log,No Log
9,96.2354,0.019728,0.010196,0.9375,0.009532,-0.785302,-1.271755,-1.629187,-1.542407,No Log,No Log,No Log
10,96.5487,0.019074,0.010345,0.9375,0.008729,-0.800614,-1.234551,-1.524867,-1.447686,No Log,No Log,No Log
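
A training loss near 97 looks alarming next to the SFT loss, but it is roughly what the IPO objective predicts at the start of training. The sketch below assumes that `loss_type="ipo"` uses the squared-error form `(h - 1/(2*beta))**2`, where `h` is the policy-vs-reference log-ratio gap between chosen and rejected, and that the logged `rewards / margins` equals `beta * h`. Under those assumptions, `beta = 0.05` gives a target gap of 10, so an untrained policy starts at a loss of about 100.

```python
def ipo_loss(log_ratio_gap, beta):
    """Squared distance between the log-ratio gap and the IPO target 1 / (2 * beta)."""
    target = 1.0 / (2.0 * beta)
    return (log_ratio_gap - target) ** 2

beta = 0.05
margin_step_1 = 0.008092        # "rewards / margins" at step 1 in the log above
gap = margin_step_1 / beta      # assumption: margin = beta * gap

print(round(ipo_loss(0.0, beta), 2))  # loss when policy and reference agree
print(round(ipo_loss(gap, beta), 2))  # compare with the logged 96.7968
```

The second value lands close to the logged step-1 loss. The small remaining difference is expected, since the trainer averages per-pair losses rather than evaluating the loss at the average margin. A falling loss therefore means the gap is moving toward 10, not toward zero.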



✓ DPO model saved to: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/dpo_final


In [None]:
# Free memory before evaluation
del model, dpo_trainer
torch.cuda.empty_cache()
print("Memory cleared for evaluation")

Memory cleared for evaluation


---
## 6. Evaluation

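Compare the baseline model with the DPO-aligned model using **SSPR (Strict Schema Pass Rate)**: the fraction of generated tool calls that `validate_tool_call` accepts against the sample's schema. Both runs below score `data/synthetic_triplets.jsonl`, the same file used for SFT and DPO training, so the results measure fit to the training data rather than held-out generalization.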

In [None]:
import json
import logging
import torch  # needed for torch.cuda.empty_cache() in evaluate_model
from tqdm.notebook import tqdm
from unsloth import FastLanguageModel
from src.validation import validate_tool_call
from src.utils import setup_logging, get_device

setup_logging()

SYSTEM_PROMPT = """You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object representing the tool call. Do not include any explanation, markdown formatting, or code blocks. Output raw JSON only."""

def evaluate_model(model_path, data_path, max_new_tokens=128, model_name=None):
    """Evaluate a model and return SSPR score."""
    device = get_device()

    # Load Model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)

    # Load data
    data = []
    with open(data_path, 'r') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))

    passed = 0
    total = 0
    failures_by_type = {
        "json_error": 0,
        "hallucinated_param": 0,
        "type_mismatch": 0,
        "enum_violation": 0,
        "missing_required": 0,
        "other_schema_error": 0
    }

    for item in tqdm(data, desc="Evaluating"):
        # Include system prompt to match training format
        prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{item['prompt']}<|im_end|>\n<|im_start|>assistant\n"
        inputs = tokenizer([prompt], return_tensors="pt").to(device)

        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id
        )

        response = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
        is_valid, error = validate_tool_call(response, item["schema"])

        if is_valid:
            passed += 1
        else:
            if "Invalid JSON" in error:
                failures_by_type["json_error"] += 1
            elif "Additional properties" in error:
                failures_by_type["hallucinated_param"] += 1
            elif "is not of type" in error:
                failures_by_type["type_mismatch"] += 1
            elif "is not one of" in error:
                failures_by_type["enum_violation"] += 1
            elif "is a required property" in error:
                failures_by_type["missing_required"] += 1
            else:
                failures_by_type["other_schema_error"] += 1

        total += 1

    score = passed / total if total > 0 else 0.0

    # Clean up
    del model
    torch.cuda.empty_cache()

    return {
        "model": model_name or model_path,
        "total": total,
        "passed": passed,
        "sspr": score,
        "failures": failures_by_type
    }
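
The `if`/`elif` chain in `evaluate_model` maps error-message substrings to failure buckets. The same mapping can be written as an ordered lookup table, which is easier to extend and to test in isolation. This is a sketch only. `classify_failure` and `FAILURE_PATTERNS` are hypothetical names, and the substrings are copied from the cell above on the assumption that `validate_tool_call` (not shown in this notebook) emits them.

```python
from collections import Counter

# Ordered (substring, bucket) pairs; the first match wins, as in the if/elif chain.
FAILURE_PATTERNS = [
    ("Invalid JSON", "json_error"),
    ("Additional properties", "hallucinated_param"),
    ("is not of type", "type_mismatch"),
    ("is not one of", "enum_violation"),
    ("is a required property", "missing_required"),
]

def classify_failure(error):
    """Map a validator error message to one of the failure buckets."""
    message = error or ""  # tolerate None instead of raising TypeError
    for needle, bucket in FAILURE_PATTERNS:
        if needle in message:
            return bucket
    return "other_schema_error"

# Example messages (made up for illustration) tallied into buckets.
errors = ["'time' is a required property", "Invalid JSON: unexpected end", None]
print(Counter(classify_failure(e) for e in errors))
```

Inside the evaluation loop, the whole `else` branch would reduce to `failures_by_type[classify_failure(error)] += 1`.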

In [None]:
# Evaluate baseline model
print("="*50)
print("BASELINE MODEL EVALUATION")
print("="*50)

baseline_results = evaluate_model(
    model_path="Qwen/Qwen2.5-Coder-7B-Instruct",
    data_path=DRIVE_PROJECT_DIR+"/"+"data/synthetic_triplets.jsonl",
    model_name="Qwen2.5-Coder-7B-Instruct (Baseline)"
)

print(f"\nModel: {baseline_results['model']}")
print(f"Total Samples: {baseline_results['total']}")
print(f"Passed: {baseline_results['passed']}")
print(f"SSPR: {baseline_results['sspr']:.2%}")
print("\nFailure Breakdown:")
for error_type, count in baseline_results['failures'].items():
    if count > 0:
        print(f"  - {error_type}: {count}")

BASELINE MODEL EVALUATION
==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Evaluating:   0%|          | 0/802 [00:00<?, ?it/s]


Model: Qwen2.5-Coder-7B-Instruct (Baseline)
Total Samples: 802
Passed: 60
SSPR: 7.48%

Failure Breakdown:
  - json_error: 3
  - hallucinated_param: 2
  - type_mismatch: 1
  - enum_violation: 24
  - missing_required: 711
  - other_schema_error: 1


In [None]:
# Evaluate DPO-aligned model (using Drive path)
print("="*50)
print("DPO MODEL EVALUATION")
print("="*50)

dpo_model_path = f"{DRIVE_PROJECT_DIR}/checkpoints/dpo_final"
dpo_results = evaluate_model(
    model_path=dpo_model_path,
    data_path=DRIVE_PROJECT_DIR+"/"+"data/synthetic_triplets.jsonl",
    model_name="DPO-Aligned Model"
)

print(f"\nModel: {dpo_results['model']}")
print(f"Total Samples: {dpo_results['total']}")
print(f"Passed: {dpo_results['passed']}")
print(f"SSPR: {dpo_results['sspr']:.2%}")
print("\nFailure Breakdown:")
for error_type, count in dpo_results['failures'].items():
    if count > 0:
        print(f"  - {error_type}: {count}")

DPO MODEL EVALUATION
==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/qwen2.5-coder-7b-instruct-bnb-4bit does not have a padding token! Will use pad_token = <|vision_pad|>.


Evaluating:   0%|          | 0/802 [00:00<?, ?it/s]


Model: DPO-Aligned Model
Total Samples: 802
Passed: 186
SSPR: 23.19%

Failure Breakdown:
  - json_error: 1
  - hallucinated_param: 3
  - type_mismatch: 1
  - enum_violation: 51
  - missing_required: 557
  - other_schema_error: 3


In [None]:
# Debug: See actual model outputs
import json
from unsloth import FastLanguageModel
from src.utils import get_device

device = get_device()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="checkpoints/dpo_final",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Load a few test samples
with open("data/synthetic_triplets.jsonl", 'r') as f:
    samples = [json.loads(line) for line in f][:3]

# Test each sample
for i, item in enumerate(samples):
    print(f"\n{'='*60}")
    print(f"SAMPLE {i+1}")
    print(f"{'='*60}")

    # Show the prompt
    print(f"\nORIGINAL PROMPT:\n{item['prompt'][:200]}...")

    # Format as ChatML
    prompt = f"<|im_start|>user\n{item['prompt']}<|im_end|>\n<|im_start|>assistant\n"

    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

    print(f"\nMODEL OUTPUT:\n{response}")
    print(f"\nEXPECTED (chosen):\n{item['chosen']}")


==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/qwen2.5-coder-7b-instruct-bnb-4bit does not have a padding token! Will use pad_token = <|vision_pad|>.


JSONDecodeError: Expecting value: line 2 column 1 (char 1)
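
Apart from the JSON error, the cell above formats its prompt without the system message, whereas the training samples and `evaluate_model` both include it. A single builder keeps the ChatML format identical everywhere. The function below is a hypothetical helper that reproduces the f-strings already used in this notebook.

```python
def build_chatml_prompt(user_message, system_prompt=None):
    """Format a single-turn ChatML prompt that ends at the start of the assistant turn."""
    parts = []
    if system_prompt:
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

print(build_chatml_prompt("Set an alarm for 7 AM.", system_prompt="Output raw JSON only."))
```

Passing the notebook's `SYSTEM_PROMPT` as `system_prompt` would make debug generations comparable with the scored ones.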

In [None]:
# Check the raw file content
with open("data/synthetic_triplets.jsonl", 'r') as f:
    lines = f.readlines()

print(f"Total lines: {len(lines)}")
print(f"\nFirst 5 lines (raw):")
for i, line in enumerate(lines[:5]):
    print(f"Line {i}: {repr(line[:100])}...")

print(f"\nEmpty lines: {sum(1 for l in lines if not l.strip())}")

# Find problematic lines
print(f"\nProblematic lines:")
for i, line in enumerate(lines[:20]):
    if not line.strip():
        print(f"  Line {i}: EMPTY")
    else:
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"  Line {i}: {e}")

Total lines: 803

First 5 lines (raw):
Line 0: '{"prompt": "Schedule a meeting with Dr. Smith on zoom for tomorrow at 3 PM.", "chosen": "{\\"action\\"'...
Line 1: '{"prompt": "Set an alarm for tomorrow at 7 AM with high priority.", "chosen": "{\\"tool\\": \\"set_alar'...
Line 2: '{"prompt": "I need to send an invite to my calendar for a meeting at 3 PM tomorrow.", "chosen": "{\\"'...
Line 3: '{"prompt": "Calculate the distance between two geographical coordinates (lat/long).", "chosen": "{\\"'...
Line 4: '{"prompt": "I need to schedule a meeting on Monday at 10 AM with the marketing team.", "chosen": "{\\'...

Empty lines: 1

Problematic lines:
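
The check finds one empty line among 803, and nothing is flagged in the first 20 lines, so the blank line sits further down the file. That is enough to explain the earlier `JSONDecodeError`: `[json.loads(line) for line in f][:3]` parses every line before slicing, and `json.loads("\n")` raises exactly "Expecting value: line 2 column 1 (char 1)". A minimal sketch of a loader that skips blank lines and stops after the requested number of records is shown below. `iter_jsonl` is an illustrative name, and an in-memory stream stands in for the real file.

```python
import io
import json
from itertools import islice

def iter_jsonl(lines):
    """Yield one parsed object per non-blank line of a JSONL stream."""
    for line in lines:
        if line.strip():
            yield json.loads(line)

# Stand-in for the real file: two records followed by a blank line.
fake_file = io.StringIO('{"prompt": "a"}\n{"prompt": "b"}\n\n')
samples = list(islice(iter_jsonl(fake_file), 3))
print(len(samples))  # 2
```

Because `islice` stops pulling lines once it has enough records, a malformed line later in the file no longer breaks a cell that only needs the first few samples.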


In [None]:
# Check first few training samples
import json
with open("data/synthetic_triplets.jsonl", 'r') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        sample = json.loads(line)
        print(f"\n--- Sample {i+1} ---")
        print(f"Prompt: {sample['prompt'][:100]}...")
        print(f"Chosen: {sample['chosen'][:100]}...")
        print(f"Rejected: {sample['rejected'][:100]}...")

In [None]:
# Comparison Summary
print("\n" + "="*50)
print("COMPARISON SUMMARY")
print("="*50)
print(f"\n{'Model':<35} {'SSPR':>10}")
print("-"*45)
print(f"{'Baseline (Qwen2.5-Coder-7B)':<35} {baseline_results['sspr']:>10.2%}")
print(f"{'DPO-Aligned Model':<35} {dpo_results['sspr']:>10.2%}")
print("-"*45)
improvement = dpo_results['sspr'] - baseline_results['sspr']
print(f"{'Improvement':<35} {improvement:>+10.2%}")


COMPARISON SUMMARY

Model                                     SSPR
---------------------------------------------
Baseline (Qwen2.5-Coder-7B)              7.48%
DPO-Aligned Model                       23.19%
---------------------------------------------
Improvement                            +15.71%
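
The "+15.71%" in the summary is a difference between two pass rates, so it is best read as percentage points. The short calculation below restates the reported counts (60 and 186 passes out of 802) as both an absolute and a relative change. The helper name `sspr` is illustrative.

```python
def sspr(passed, total):
    """Strict Schema Pass Rate as a fraction in [0, 1]."""
    return passed / total if total else 0.0

baseline = sspr(60, 802)    # counts from the baseline evaluation above
aligned = sspr(186, 802)    # counts from the DPO evaluation above

absolute_pp = (aligned - baseline) * 100
relative = aligned / baseline
print(f"{absolute_pp:.2f} percentage points, {relative:.1f}x the baseline pass rate")
```

Reporting both figures avoids ambiguity: the aligned model passes about three times as often as the baseline, yet more than three quarters of samples still fail, mostly on `missing_required`.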


In [None]:
# Debug: Show actual model outputs for first 5 samples
import json
from unsloth import FastLanguageModel
from src.utils import get_device

SYSTEM_PROMPT = """You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object representing the tool call. Do not include any explanation, markdown formatting, or code blocks. Output raw JSON only."""

def debug_model_outputs(model_path, data_path, num_samples=5):
    """Show actual model outputs to debug low SSPR."""
    device = get_device()

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)

    data = []
    with open(data_path, 'r') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))

    print(f"\n{'='*60}")
    print(f"DEBUG: Examining {num_samples} outputs from {model_path}")
    print(f"{'='*60}\n")

    for i, item in enumerate(data[:num_samples]):
        # Use system prompt to match training format
        prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{item['prompt']}<|im_end|>\n<|im_start|>assistant\n"
        inputs = tokenizer([prompt], return_tensors="pt").to(device)

        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
        )

        response = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

        print(f"--- Sample {i+1} ---")
        print(f"PROMPT: {item['prompt'][:80]}...")
        print(f"EXPECTED: {item['chosen'][:100]}...")
        print(f"ACTUAL: [{response}]")

        # Check if valid JSON
        try:
            json.loads(response)
            print("STATUS: ✓ Valid JSON")
        except json.JSONDecodeError as e:
            print(f"STATUS: ✗ Invalid JSON - {e}")
        print()

    del model
    torch.cuda.empty_cache()

# Debug the DPO model (uncomment to run)
# debug_model_outputs("checkpoints/dpo_final", "data/synthetic_triplets.jsonl", num_samples=5)

# Debug the baseline model (uncomment to run)
# debug_model_outputs("Qwen/Qwen2.5-Coder-7B-Instruct", "data/synthetic_triplets.jsonl", num_samples=5)

==((====))==  Unsloth 2025.12.4: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/qwen2.5-coder-7b-instruct-bnb-4bit does not have a padding token! Will use pad_token = <|vision_pad|>.

DEBUG: Examining 5 outputs from checkpoints/dpo_final

--- Sample 1 ---
PROMPT: Schedule a meeting with Dr. Smith on zoom for tomorrow at 3 PM....
EXPECTED: {"action": "schedule_meeting", "platform": "zoom", "with": "Dr. Smith", "date": "2023-12-15", "time"...
ACTUAL: [{
    "action": "schedule_meeting",
    "details": {
        "host": "Dr. Smith",
        "platform": "Zoom",
        "date": "tomorrow",
        "time": "15:00"
    }
}
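
The output above parses as JSON but does not match the expected call: the arguments are nested under `details`, and `date` is left as `"tomorrow"`. A minimal sketch for spotting this kind of structural drift is to compare top-level keys. The helper name `diff_top_level_keys` is hypothetical, and the expected keys are only those visible in the truncated EXPECTED line, so the real target may contain more.

```python
import json

def diff_top_level_keys(expected_keys, actual_json):
    """Return (missing, unexpected) top-level keys of a JSON object string."""
    actual = json.loads(actual_json)
    if not isinstance(actual, dict):
        raise ValueError("tool call must be a JSON object")
    expected = set(expected_keys)
    found = set(actual)
    return sorted(expected - found), sorted(found - expected)

# Keys visible in the truncated EXPECTED line for Sample 1 (assumption: list may be incomplete)
expected_keys = ["action", "platform", "with", "date", "time"]

# ACTUAL output for Sample 1, copied from the debug run
actual = """{
    "action": "schedule_meeting",
    "details": {
        "host": "Dr. Smith",
        "platform": "Zoom",
        "date": "tomorrow",
        "time": "15:00"
    }
}"""

missing, unexpected = diff_top_level_keys(expected_keys, actual)
print("missing:", missing)
print("unexpected:", unexpected)
```

A check like this only looks at structure. It says nothing about whether values such as the date are normalized correctly.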

---
## 7. Inference

Test the DPO-aligned model with custom prompts.

In [None]:
from unsloth import FastLanguageModel
from src.utils import get_device

# Load the DPO model for inference
device = get_device()
print(f"Using device: {device}")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="checkpoints/dpo_final",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
print("Model loaded for inference!")

Using device: cuda


RuntimeError: Unsloth: No config file found - are you sure the `model_name` is correct?
If you're using a model on your local device, confirm if the folder location exists.
If you're using a HuggingFace online model, check if it exists.
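
The error above suggests that `checkpoints/dpo_final` does not exist relative to the current working directory, or that it contains no config file. A minimal sketch for locating a usable checkpoint before loading follows. It assumes that a loadable directory contains `config.json` or `adapter_config.json`, and that the checkpoint may instead live in a `dpo_final` folder under the Drive checkpoints directory; both are assumptions, and `find_checkpoint` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: a loadable checkpoint directory holds one of these files.
CONFIG_FILES = ("config.json", "adapter_config.json")

def find_checkpoint(candidates):
    """Return the first candidate directory containing a config file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir() and any((path / name).is_file() for name in CONFIG_FILES):
            return path
    return None

# Hypothetical candidate locations; adjust to your own layout.
candidates = [
    "checkpoints/dpo_final",
    "/content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/dpo_final",
]

checkpoint = find_checkpoint(candidates)
if checkpoint is None:
    print("No checkpoint found; check the working directory and that Drive is mounted.")
else:
    print(f"Found checkpoint at: {checkpoint}")
```

If a path is found, it can be passed as `model_name` to `FastLanguageModel.from_pretrained` in place of the hard-coded string.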

In [None]:
SYSTEM_PROMPT = """You are a tool-calling assistant. When given a user request and tool specification, respond with ONLY a valid JSON object representing the tool call. Do not include any explanation, markdown formatting, or code blocks. Output raw JSON only."""

def run_inference(prompt, max_new_tokens=256):
    """Run inference on a single prompt."""
    prompt_text = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,  # greedy decoding, matching the debug cell, so outputs are reproducible
    )

    response = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
    return response.strip()

In [None]:
# Test with example prompts
test_prompts = [
    "Create a new task with title 'Review PR' and priority high.\nTool: create_task(title: str, priority: enum['low', 'medium', 'high'])",
    "Search for users in New York with a limit of 10 results.\nTool: search_users(location: str, limit: int)",
    "Send an email to alice@example.com with subject 'Hello' and body 'How are you?'.\nTool: send_email(to: str, subject: str, body: str)",
]

for prompt in test_prompts:
    print("="*60)
    print(f"Prompt: {prompt}")
    print("-"*60)
    response = run_inference(prompt)
    print(f"Response: {response}")
    print()

In [None]:
# Interactive inference - try your own prompts!
# Uncomment and modify the prompt below:

# custom_prompt = "Your custom tool use prompt here\nTool: your_tool(param1: type1, param2: type2)"
# response = run_inference(custom_prompt)
# print(f"Response: {response}")

---
## 8. Save Model to Google Drive (Optional)

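Checkpoints are already written to Google Drive during training. The cell below prints their location and shows how to copy the training data there as well.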

In [None]:
# Checkpoints are automatically saved to Google Drive!
# Location: /content/drive/MyDrive/Colab Notebooks/Tool-Use-DPO/checkpoints/

print(f"Checkpoints location: {DRIVE_PROJECT_DIR}/checkpoints/")
print("\nTo copy data to Drive as well:")
print(f"  !cp data/synthetic_triplets.jsonl \"{DRIVE_PROJECT_DIR}/data/\"")

# Uncomment to copy training data to Drive
# import shutil
# shutil.copy("data/synthetic_triplets.jsonl", f"{DRIVE_PROJECT_DIR}/data/")

---
## Summary

This notebook covers the complete Tool-Use DPO pipeline:

1. **Setup**: Installed Unsloth and dependencies
2. **Data**: Created/used synthetic DPO triplets
3. **SFT**: Fine-tuned the base model on correct tool calls
4. **DPO**: Aligned the model using preference optimization
5. **Evaluation**: Compared baseline vs. aligned model SSPR
6. **Inference**: Tested the model on custom prompts

### Key Findings

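- **SSPR (Strict Schema Pass Rate)** measures how often the model produces valid, schema-compliant tool calls. In the run recorded above, it rose from 7.48% for the baseline to 23.19% for the DPO-aligned model, a gain of 15.71 percentage points
- **IPO (Identity Preference Optimization)** uses a regularized objective intended to reduce overfitting to preference pairs, which can make training more stable than standard DPO
- **Hard Negatives** (subtle errors like type mismatches, hallucinated params) teach the model precision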
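
To make the SSPR definition concrete, here is a minimal sketch of a strict schema check and the resulting pass rate. The schema format (parameter name mapped to a Python type) and the function names are assumptions for illustration; the evaluator used earlier in this notebook may be stricter, for example by checking enum values.

```python
import json

# Hypothetical schema format: parameter name -> expected Python type.
CREATE_TASK_SCHEMA = {"title": str, "priority": str}

def passes_schema(response, schema):
    """True if response is a JSON object with exactly the schema's keys and types."""
    try:
        call = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or set(call) != set(schema):
        return False
    for key, expected in schema.items():
        value = call[key]
        # bool is a subclass of int in Python, so reject it unless bool is expected
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True

def strict_schema_pass_rate(responses, schema):
    """Fraction of responses that pass the strict schema check."""
    if not responses:
        return 0.0
    return sum(passes_schema(r, schema) for r in responses) / len(responses)

responses = [
    '{"title": "Review PR", "priority": "high"}',            # passes
    '{"title": "Review PR", "priority": "high", "due": 1}',  # hallucinated parameter
    '{"title": "Review PR", "priority": 3}',                 # type mismatch
    'Sure! {"title": "Review PR"}',                          # not raw JSON
]
print(f"SSPR: {strict_schema_pass_rate(responses, CREATE_TASK_SCHEMA):.2%}")
```

The three failing responses correspond to the hard-negative categories listed above, which is why a strict check rather than plain JSON validity is the metric of interest.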

### Next Steps

- Generate more training data (500+ samples) for better results
- Experiment with different `beta` values for DPO
- Try other base models (Llama, Mistral)
- Evaluate on held-out test sets
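
For the held-out evaluation step, one possible approach is a deterministic shuffle-and-split of the triplets before training, so that SSPR is measured on prompts the model has not seen. The sketch below is an assumption about how this could be done, not part of the pipeline above; `split_triplets` and the 20% test fraction are illustrative choices.

```python
import random

def split_triplets(records, test_fraction=0.2, seed=0):
    """Shuffle records deterministically and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative in-memory records with the same keys as the JSONL triplets
records = [
    {"prompt": f"prompt {i}", "chosen": "{}", "rejected": "{}"}
    for i in range(10)
]

train, test = split_triplets(records)
print(f"train: {len(train)}, test: {len(test)}")
```

Fixing the seed keeps the split stable across runs, so baseline and aligned models are compared on the same test prompts. If the dataset contains near-duplicate prompts, deduplicating before splitting would help avoid leakage.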