This repository contains the code, experiments, and data for the paper "What Actually Improves LLM Tool Calling?"
- SFT dominates: Supervised fine-tuning yields a +47-point accuracy gain, dwarfing DPO (<1 point) and RL (<1 point)
- Scaffolding doesn't help: JSON repair and best-of-N sampling provide no benefit when the model already produces valid JSON
- Tool generalization works: SFT generalizes well to unseen tools (79% accuracy)
- Pattern generalization is harder: SFT struggles with unseen call patterns (42% accuracy)
- Diversity > Quantity: With 500 examples, diverse training achieves 53% vs 27% for homogeneous training
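The scaffolding finding is easy to probe: a JSON-repair pass can only matter when the model emits malformed JSON. A minimal sketch of such a repair step (the heuristics here are illustrative, not the ones in this repo's scripts):

```python
import json
import re


def repair_json(text: str) -> dict:
    """Try to parse model output as JSON, applying simple fixes on failure.

    Illustrative heuristics only: strip code fences, swap single quotes,
    drop trailing commas. If the model already emits valid JSON, the first
    json.loads succeeds and the repair path never runs -- which is why
    repair adds nothing once the model is well trained.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strip markdown code fences the model may have wrapped the call in.
    cleaned = re.sub(r"```(?:json)?", "", text).strip()
    # Naive single-quote -> double-quote swap (unsafe if values contain quotes).
    cleaned = cleaned.replace("'", '"')
    # Remove trailing commas before a closing brace/bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)
```

For example, `repair_json("{'name': 'f', 'args': {},}")` recovers `{"name": "f", "args": {}}`.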
```
fc_benchmark/
├── scripts/
│   ├── data_utils.py              # Data loading and preprocessing
│   ├── ast_eval.py                # AST-based evaluation metrics
│   ├── clean_experiments.py       # Main experiment runner (with train/test separation)
│   ├── generalization_ablation.py # Tool & pattern generalization experiments
│   ├── diversity_ablation.py      # Diversity vs. quantity experiment
│   ├── rl_generalization.py       # RL experiments on generalization
│   └── ...
├── data/
│   └── bfcl/                      # Berkeley Function Calling Leaderboard data
├── models/                        # Trained LoRA adapters
├── results/                       # Experiment results (JSON)
└── paper/
    └── paper.tex                  # LaTeX source
```
```
pip install mlx mlx-lm
```

```
python -m fc_benchmark.scripts.clean_experiments
```

This runs:
- Base model evaluation
- SFT training and evaluation
- SFT + DPO training and evaluation
- SFT + RL training and evaluation
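The AST-based metric in `scripts/ast_eval.py` compares the structure of a predicted call against the ground truth rather than the raw string. A rough sketch of the idea, assuming calls are written in Python syntax (the function names and helper below are hypothetical, not the repo's actual API):

```python
import ast


def calls_match(predicted: str, expected: str) -> bool:
    """Parse both strings as Python expressions and compare the function
    name plus keyword arguments structurally, so argument order and
    whitespace differences do not count as errors."""
    def parse_call(src: str):
        node = ast.parse(src, mode="eval").body
        if not isinstance(node, ast.Call):
            raise ValueError(f"not a function call: {src!r}")
        name = ast.unparse(node.func)
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return name, kwargs

    try:
        return parse_call(predicted) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False
```

For example, `calls_match("get_weather(city='Paris', unit='C')", "get_weather(unit='C', city='Paris')")` is true even though the strings differ.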
```
python -m fc_benchmark.scripts.generalization_ablation
```

Tests:
- Base model vs SFT vs Scaffolding on unseen tools
- Base model vs SFT vs Scaffolding on unseen patterns (parallel_multiple)
```
python -m fc_benchmark.scripts.rl_generalization
```

Tests whether reward-filtered RL improves generalization beyond SFT.
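Reward-filtered RL can be thought of as rejection sampling: draw several candidate completions, keep only those that score well under a reward function, and fine-tune on the survivors. A toy sketch of the filtering step (the `reward` function and candidates below are stand-ins, not the repo's code):

```python
def reward(completion: str) -> float:
    """Stand-in reward: 1.0 if the completion looks like a function call."""
    return 1.0 if "(" in completion and completion.endswith(")") else 0.0


def filter_by_reward(candidates, threshold=0.5):
    """Keep only candidates whose reward clears the threshold;
    these survivors form the fine-tuning set for the next round."""
    return [c for c in candidates if reward(c) >= threshold]


samples = ["get_weather(city='Paris')", "I cannot call tools.", "search(q='llm')"]
kept = filter_by_reward(samples)
```

Here `kept` retains only the two well-formed calls; a real reward would score executability or AST match rather than surface shape.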
```
python -m fc_benchmark.scripts.diversity_ablation
```

Compares:
- Low diversity: 500 examples from "simple" only
- High diversity: 125 examples from each of 4 categories
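The two conditions can be built with a simple sampler: 500 examples from one category versus 125 from each of four. A sketch, where `dataset` is a hypothetical mapping from category name to example lists (category names follow the BFCL split):

```python
import random

CATEGORIES = ["simple", "multiple", "parallel", "parallel_multiple"]


def build_training_set(dataset, diverse: bool, total=500, seed=0):
    """Low diversity: all `total` examples drawn from 'simple'.
    High diversity: an even split across the four categories."""
    rng = random.Random(seed)
    if diverse:
        per_cat = total // len(CATEGORIES)  # 125 each when total=500
        return [ex for cat in CATEGORIES
                for ex in rng.sample(dataset[cat], per_cat)]
    return rng.sample(dataset["simple"], total)
```

Both conditions see exactly the same number of examples, isolating diversity from quantity.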
We use the Berkeley Function Calling Leaderboard (BFCL) dataset with four call categories:
- Simple: Single function calls
- Multiple: Sequential dependent calls
- Parallel: Independent concurrent calls
- Parallel-multiple: Combinations of parallel and sequential
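To make the categories concrete, here are hypothetical tool-call outputs for each pattern (the call format and function names are illustrative only; see `data/bfcl/` for the actual schema):

```python
# Hypothetical examples of each BFCL call pattern (illustrative format).
EXAMPLES = {
    # Simple: one call to one function.
    "simple": [{"name": "get_weather", "args": {"city": "Paris"}}],
    # Multiple: a later call depends on an earlier call's result.
    "multiple": [
        {"name": "find_city", "args": {"query": "capital of France"}},
        {"name": "get_weather", "args": {"city": "<result of find_city>"}},
    ],
    # Parallel: independent calls that could run concurrently.
    "parallel": [
        {"name": "get_weather", "args": {"city": "Paris"}},
        {"name": "get_weather", "args": {"city": "Tokyo"}},
    ],
    # Parallel-multiple: parallel calls mixed with dependent steps.
    "parallel_multiple": [
        {"name": "find_city", "args": {"query": "largest city in Japan"}},
        {"name": "get_weather", "args": {"city": "Paris"}},
        {"name": "get_weather", "args": {"city": "<result of find_city>"}},
    ],
}
```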
- Base model: `mlx-community/Qwen2.5-1.5B-Instruct-4bit`
- Fine-tuning: LoRA adapters (rank 8, alpha 16) via the MLX framework
- Trained adapters available on Hugging Face: `siddharthvader/tool-calling-lora-qwen2.5-1.5b`
Main results (full benchmark):

| Method | Accuracy | Improvement |
|---|---|---|
| Base model | 9.7% | --- |
| SFT | 57.0% | +47.3 |
| SFT + DPO | 57.7% | +48.0 |
| SFT + RL | 58.0% | +48.3 |

Generalization to held-out settings:

| Setting | Base | SFT |
|---|---|---|
| Unseen tools | 30% | 79% |
| Unseen patterns | 0% | 42% |

Diversity ablation (500 training examples in each condition):

| Training Data | Overall Accuracy |
|---|---|
| Low diversity (simple only) | 27% |
| High diversity (all categories) | 53% |
```bibtex
@article{ramakrishnan2024toolcalling,
  title={What Actually Improves LLM Tool Calling?},
  author={Ramakrishnan, Siddharth},
  year={2024}
}
```

MIT