This repository contains the code, experiments, and data for the paper "What Actually Improves LLM Tool Calling?"
- SFT dominates: Supervised fine-tuning yields a +47-point accuracy gain, dwarfing DPO (<1 point) and RL (<1 point)
- Scaffolding doesn't help: JSON repair and best-of-N sampling provide no benefit when the model already produces valid JSON
- Tool generalization works: SFT generalizes well to unseen tools (79% accuracy)
- Pattern generalization is harder: SFT struggles with unseen call patterns (42% accuracy)
- Diversity > Quantity: With 500 examples, diverse training achieves 53% vs 27% for homogeneous training
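The scaffolding finding is easy to probe: a JSON-repair pass can only matter when the model emits malformed JSON. A minimal sketch of such a repair step (the heuristics here are illustrative, not the ones in this repo's scripts):

```python
import json
import re


def repair_json(text: str) -> dict:
    """Try to parse model output as JSON, applying simple fixes on failure.

    Illustrative heuristics only: strip code fences, swap single quotes,
    drop trailing commas. If the model already emits valid JSON, the first
    json.loads succeeds and the repair path never runs -- which is why
    repair adds nothing once the model is well trained.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strip markdown code fences the model may have wrapped the call in.
    cleaned = re.sub(r"```(?:json)?", "", text).strip()
    # Naive single-quote -> double-quote swap (unsafe if values contain quotes).
    cleaned = cleaned.replace("'", '"')
    # Remove trailing commas before a closing brace/bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)
```

For example, `repair_json("{'name': 'f', 'args': {},}")` recovers `{"name": "f", "args": {}}`.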
```
fc_benchmark/
├── scripts/
│   ├── data_utils.py              # Data loading and preprocessing
│   ├── ast_eval.py                # AST-based evaluation metrics
│   ├── clean_experiments.py       # Main experiment runner (with train/test separation)
│   ├── generalization_ablation.py # Tool & pattern generalization experiments
│   ├── diversity_ablation.py      # Diversity vs. quantity experiment
│   ├── rl_generalization.py       # RL experiments on generalization
│   └── ...
├── data/
│   └── bfcl/                      # Berkeley Function Calling Leaderboard data
├── models/                        # Trained LoRA adapters
├── results/                       # Experiment results (JSON)
└── paper/
    └── paper.tex                  # LaTeX source
```
```
pip install mlx mlx-lm
```

```
python -m fc_benchmark.scripts.clean_experiments
```

This runs:
- Base model evaluation
- SFT training and evaluation
- SFT + DPO training and evaluation
- SFT + RL training and evaluation
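The AST-based metric in `scripts/ast_eval.py` compares the structure of a predicted call against the ground truth rather than the raw string. A rough sketch of the idea, assuming calls are written in Python syntax (the function names and helper below are hypothetical, not the repo's actual API):

```python
import ast


def calls_match(predicted: str, expected: str) -> bool:
    """Parse both strings as Python expressions and compare the function
    name plus keyword arguments structurally, so argument order and
    whitespace differences do not count as errors."""
    def parse_call(src: str):
        node = ast.parse(src, mode="eval").body
        if not isinstance(node, ast.Call):
            raise ValueError(f"not a function call: {src!r}")
        name = ast.unparse(node.func)
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return name, kwargs

    try:
        return parse_call(predicted) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False
```

For example, `calls_match("get_weather(city='Paris', unit='C')", "get_weather(unit='C', city='Paris')")` is true even though the strings differ.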
```
python -m fc_benchmark.scripts.generalization_ablation
```

Tests:
- Base model vs SFT vs Scaffolding on unseen tools
- Base model vs SFT vs Scaffolding on unseen patterns (parallel_multiple)
```
python -m fc_benchmark.scripts.rl_generalization
```

Tests whether reward-filtered RL improves generalization beyond SFT.
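Reward-filtered RL can be thought of as rejection sampling: draw several candidate completions, keep only those that score well under a reward function, and fine-tune on the survivors. A toy sketch of the filtering step (the `reward` function and candidates below are stand-ins, not the repo's code):

```python
def reward(completion: str) -> float:
    """Stand-in reward: 1.0 if the completion looks like a function call."""
    return 1.0 if "(" in completion and completion.endswith(")") else 0.0


def filter_by_reward(candidates, threshold=0.5):
    """Keep only candidates whose reward clears the threshold;
    these survivors form the fine-tuning set for the next round."""
    return [c for c in candidates if reward(c) >= threshold]


samples = ["get_weather(city='Paris')", "I cannot call tools.", "search(q='llm')"]
kept = filter_by_reward(samples)
```

Here `kept` retains only the two well-formed calls; a real reward would score executability or AST match rather than surface shape.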
```
python -m fc_benchmark.scripts.diversity_ablation
```

Compares:
- Low diversity: 500 examples from "simple" only
- High diversity: 125 examples from each of 4 categories
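The two conditions can be built with a simple sampler: 500 examples from one category versus 125 from each of four. A sketch, where `dataset` is a hypothetical mapping from category name to example lists (category names follow the BFCL split):

```python
import random

CATEGORIES = ["simple", "multiple", "parallel", "parallel_multiple"]


def build_training_set(dataset, diverse: bool, total=500, seed=0):
    """Low diversity: all `total` examples drawn from 'simple'.
    High diversity: an even split across the four categories."""
    rng = random.Random(seed)
    if diverse:
        per_cat = total // len(CATEGORIES)  # 125 each when total=500
        return [ex for cat in CATEGORIES
                for ex in rng.sample(dataset[cat], per_cat)]
    return rng.sample(dataset["simple"], total)
```

Both conditions see exactly the same number of examples, isolating diversity from quantity.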
We use the Berkeley Function Calling Leaderboard (BFCL) dataset with four call categories:
- Simple: Single function calls
- Multiple: Sequential dependent calls
- Parallel: Independent concurrent calls
- Parallel-multiple: Combinations of parallel and sequential
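To make the categories concrete, here are hypothetical tool-call outputs for each pattern (the call format and function names are illustrative only; see `data/bfcl/` for the actual schema):

```python
# Hypothetical examples of each BFCL call pattern (illustrative format).
EXAMPLES = {
    # Simple: one call to one function.
    "simple": [{"name": "get_weather", "args": {"city": "Paris"}}],
    # Multiple: a later call depends on an earlier call's result.
    "multiple": [
        {"name": "find_city", "args": {"query": "capital of France"}},
        {"name": "get_weather", "args": {"city": "<result of find_city>"}},
    ],
    # Parallel: independent calls that could run concurrently.
    "parallel": [
        {"name": "get_weather", "args": {"city": "Paris"}},
        {"name": "get_weather", "args": {"city": "Tokyo"}},
    ],
    # Parallel-multiple: parallel calls mixed with dependent steps.
    "parallel_multiple": [
        {"name": "find_city", "args": {"query": "largest city in Japan"}},
        {"name": "get_weather", "args": {"city": "Paris"}},
        {"name": "get_weather", "args": {"city": "<result of find_city>"}},
    ],
}
```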
- Base model: `mlx-community/Qwen2.5-1.5B-Instruct-4bit`
- Fine-tuning: LoRA adapters (rank 8, alpha 16) via the MLX framework
- Trained adapters available on Hugging Face: `siddharthvader/tool-calling-lora-qwen2.5-1.5b`
Main results (full benchmark):

| Method | Accuracy | Improvement |
|---|---|---|
| Base model | 9.7% | --- |
| SFT | 57.0% | +47.3 |
| SFT + DPO | 57.7% | +48.0 |
| SFT + RL | 58.0% | +48.3 |

Generalization to held-out settings:

| Setting | Base | SFT |
|---|---|---|
| Unseen tools | 30% | 79% |
| Unseen patterns | 0% | 42% |

Diversity ablation (500 training examples in each condition):

| Training Data | Overall Accuracy |
|---|---|
| Low diversity (simple only) | 27% |
| High diversity (all categories) | 53% |
```bibtex
@article{ramakrishnan2024toolcalling,
  title={What Actually Improves LLM Tool Calling?},
  author={Ramakrishnan, Siddharth},
  year={2024}
}
```

MIT