# What Actually Improves LLM Tool Calling?

This repository contains the code, experiments, and data for the paper "What Actually Improves LLM Tool Calling?"

## Key Findings

- **SFT dominates:** supervised fine-tuning yields a +47-point accuracy gain, dwarfing DPO (<1 point) and RL (<1 point).
- **Scaffolding doesn't help:** JSON repair and best-of-N sampling provide no benefit when models already produce valid JSON.
- **Tool generalization works:** SFT generalizes well to unseen tools (79% accuracy).
- **Pattern generalization is harder:** SFT struggles with unseen call patterns (42% accuracy).
- **Diversity beats quantity:** with 500 examples, diverse training data reaches 53% accuracy vs. 27% for homogeneous data.
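The JSON-repair scaffolding mentioned above amounts to cheap string fixes applied only when parsing fails. A minimal sketch of that kind of baseline, assuming tool calls arrive as JSON strings (function names here are illustrative, not the repo's actual API):

```python
import json

def repair_json(text: str) -> str:
    """Apply cheap fixes for common LLM JSON errors: single quotes
    and trailing commas. (Naive: replacing quotes would break
    apostrophes inside strings -- acceptable for a sketch.)"""
    fixed = text.strip().replace("'", '"')
    fixed = fixed.replace(",}", "}").replace(",]", "]")
    return fixed

def parse_call(text: str) -> dict:
    """Parse a tool call, falling back to repair on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(repair_json(text))
```

Because the fine-tuned models already emit valid JSON, the repair fallback is rarely exercised, which is consistent with the finding that this scaffolding adds nothing.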

## Repository Structure

```
fc_benchmark/
├── scripts/
│   ├── data_utils.py          # Data loading and preprocessing
│   ├── ast_eval.py            # AST-based evaluation metrics
│   ├── clean_experiments.py   # Main experiment runner (with train/test separation)
│   ├── generalization_ablation.py  # Tool & pattern generalization experiments
│   ├── diversity_ablation.py  # Diversity vs quantity experiment
│   ├── rl_generalization.py   # RL experiments on generalization
│   └── ...
├── data/
│   └── bfcl/                  # Berkeley Function Calling Leaderboard data
├── models/                    # Trained LoRA adapters
├── results/                   # Experiment results (JSON)
└── paper/
    └── paper.tex              # LaTeX source
```

## Running Experiments

### Prerequisites

```sh
pip install mlx mlx-lm
```

### Experiment 1: Training Methods Comparison

```sh
python -m fc_benchmark.scripts.clean_experiments
```

This runs:

- Base model evaluation
- SFT training and evaluation
- SFT + DPO training and evaluation
- SFT + RL training and evaluation

### Experiment 2: Generalization (Tools & Patterns)

```sh
python -m fc_benchmark.scripts.generalization_ablation
```

Tests:

- Base model vs. SFT vs. scaffolding on unseen tools
- Base model vs. SFT vs. scaffolding on unseen patterns (`parallel_multiple`)

### Experiment 3: RL for Generalization

```sh
python -m fc_benchmark.scripts.rl_generalization
```

Tests whether reward-filtered RL improves generalization beyond SFT.
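Reward-filtered RL can be read as rejection-sampling fine-tuning: sample candidate calls, score each with a reward (e.g., an AST match against the gold call), and keep only high-reward pairs for further training. A minimal sketch of the filtering step, with a hypothetical `reward_fn` signature (not the repo's actual interface):

```python
def filter_by_reward(samples, reward_fn, threshold=1.0):
    """Keep (prompt, completion) pairs whose reward meets the threshold.

    samples: list of (prompt, completion) tuples
    reward_fn: callable scoring a (prompt, completion) pair
    """
    return [(p, c) for p, c in samples if reward_fn(p, c) >= threshold]
```

The surviving pairs then serve as additional SFT data, so the method's ceiling is bounded by what the sampler can already produce.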

### Experiment 4: Diversity vs Quantity

```sh
python -m fc_benchmark.scripts.diversity_ablation
```

Compares:

- Low diversity: 500 examples from the "simple" category only
- High diversity: 125 examples from each of the four categories
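Conceptually, the two 500-example conditions can be built like this (a sketch assuming a hypothetical `examples` dict mapping category names to example lists, not the repo's actual data loader):

```python
import random

def make_splits(examples, n=500, seed=0):
    """Build matched-size training sets for the diversity ablation.

    examples: dict mapping category name -> list of examples
    Returns (low_diversity, high_diversity), each with n examples.
    """
    rng = random.Random(seed)
    # Low diversity: all n examples drawn from the "simple" category.
    low = rng.sample(examples["simple"], n)
    # High diversity: n / num_categories examples from each category.
    per_cat = n // len(examples)
    high = [ex for cat in sorted(examples)
            for ex in rng.sample(examples[cat], per_cat)]
    return low, high
```

Holding total size fixed isolates the effect of category diversity from the effect of simply having more data.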

## Data

We use the Berkeley Function Calling Leaderboard (BFCL) dataset with four call categories:

- **Simple:** single function calls
- **Multiple:** sequential, dependent calls
- **Parallel:** independent, concurrent calls
- **Parallel-multiple:** combinations of parallel and sequential calls
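The AST-based metric behind `ast_eval.py` compares calls structurally rather than as strings, so whitespace and argument order don't affect scoring. A minimal sketch of the idea, assuming calls are rendered as Python expressions and ignoring positional arguments (the actual evaluator may differ):

```python
import ast

def calls_match(pred: str, gold: str) -> bool:
    """Structurally compare two function-call strings: same function
    name and same keyword arguments, regardless of formatting or
    argument order. (Positional arguments ignored in this sketch.)"""
    try:
        p = ast.parse(pred, mode="eval").body
        g = ast.parse(gold, mode="eval").body
    except SyntaxError:
        return False  # unparseable prediction counts as a miss
    if not (isinstance(p, ast.Call) and isinstance(g, ast.Call)):
        return False

    def name(call):
        return ast.unparse(call.func)

    def kwargs(call):
        return {kw.arg: ast.dump(kw.value) for kw in call.keywords}

    return name(p) == name(g) and kwargs(p) == kwargs(g)
```

This is stricter than string equality on intent but looser on surface form, which is what makes it suitable for grading model outputs.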

## Model

Base model: `mlx-community/Qwen2.5-1.5B-Instruct-4bit`

Fine-tuning: LoRA adapters (rank 8, alpha 16) via the MLX framework

Trained adapters are available on HuggingFace: `siddharthvader/tool-calling-lora-qwen2.5-1.5b`

## Results Summary

| Method | Accuracy | Improvement |
| --- | --- | --- |
| Base model | 9.7% | --- |
| SFT | 57.0% | +47.3 |
| SFT + DPO | 57.7% | +48.0 |
| SFT + RL | 58.0% | +48.3 |

### Generalization

| Setting | Base | SFT |
| --- | --- | --- |
| Unseen tools | 30% | 79% |
| Unseen patterns | 0% | 42% |

### Diversity Effect (500 examples each)

| Training Data | Overall Accuracy |
| --- | --- |
| Low diversity (simple only) | 27% |
| High diversity (all categories) | 53% |

## Citation

```bibtex
@article{ramakrishnan2024toolcalling,
  title={What Actually Improves LLM Tool Calling?},
  author={Ramakrishnan, Siddharth},
  year={2024}
}
```

## License

MIT
