Tool-Use Coding Agent Benchmark

A lightweight tool-use coding agent for file-operation tasks, with a benchmark that compares a rule-based selector against an LLM-based selector.

What this project does

This project focuses on a narrow but practical agent setting, limited to four file operations:

read a file
create a file
append to a file
edit a file

The goal is not to build a full autonomous software engineer.
The goal is to compare how different decision layers (selectors) behave when they drive the same executable tool pipeline.
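
As a rough illustration of the four operations listed above, here is a minimal sketch using only the standard library. The function names and the `TOOLS` registry are hypothetical and are not taken from `src/tools/`; the real tools may differ in signature and error handling.

```python
from pathlib import Path


def read_file(path: str) -> str:
    """Return the full text of an existing file."""
    return Path(path).read_text(encoding="utf-8")


def create_file(path: str, content: str = "") -> None:
    """Create a new file; refuse to overwrite an existing one."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"{path} already exists")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")


def append_to_file(path: str, content: str) -> None:
    """Append text to the end of an existing file."""
    target = Path(path)
    if not target.is_file():
        raise FileNotFoundError(path)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(content)


def edit_file(path: str, old: str, new: str) -> int:
    """Replace every occurrence of `old` with `new`; return how many were replaced."""
    if not old:
        raise ValueError("old must be a non-empty string")
    target = Path(path)
    text = target.read_text(encoding="utf-8")
    count = text.count(old)
    if count == 0:
        raise ValueError(f"{old!r} not found in {path}")
    target.write_text(text.replace(old, new), encoding="utf-8")
    return count


# Hypothetical registry: a selector only has to return one of these keys.
TOOLS = {
    "read": read_file,
    "create": create_file,
    "append": append_to_file,
    "edit": edit_file,
}
```

Keeping the tools behind a small registry like this is one way to hold the execution layer fixed while the selector is swapped.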

Pipeline

task -> selector -> repo search / file selection -> tool execution -> optional verification -> metrics
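
The flow above can be sketched end to end as follows. This is an illustrative sketch, not the code in `src/agent_runner.py`: `Decision`, `keyword_selector`, `find_file`, `run_task`, and `summarize` are hypothetical names, and the keyword rules and result fields are assumptions.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Optional

FILENAME_RE = re.compile(r"[\w./-]+\.\w+")


@dataclass
class Decision:
    tool: str                # which tool the selector picked
    filename: Optional[str]  # file name extracted from the task text


def keyword_selector(task: str) -> Decision:
    """Toy rule-based selector: the first keyword found as a whole word wins."""
    lowered = task.lower()
    tool = "read"  # default when no keyword matches
    for keyword in ("append", "create", "edit"):
        if re.search(rf"\b{keyword}\b", lowered):
            tool = keyword
            break
    match = FILENAME_RE.search(task)
    return Decision(tool, match.group(0) if match else None)


def find_file(repo_root: Path, filename: str) -> Optional[Path]:
    """Repo search: first file (in sorted order) under repo_root with this name."""
    matches = sorted(p for p in repo_root.rglob(Path(filename).name) if p.is_file())
    return matches[0] if matches else None


def run_task(
    task: str,
    selector: Callable[[str], Decision],
    repo_root: Path,
    tools: Dict[str, Callable[[Path], str]],
    verify: Optional[Callable[[str], bool]] = None,
) -> dict:
    """Run one task through selector -> search -> tool -> optional verification."""
    decision = selector(task)
    result = {"task": task, "tool": decision.tool, "success": False, "error": None}
    if decision.filename is None:
        result["error"] = "no_filename"
        return result
    path = find_file(repo_root, decision.filename)
    if path is None and decision.tool != "create":
        result["error"] = "file_not_found"
        return result
    path = path or repo_root / decision.filename
    try:
        output = tools[decision.tool](path)
    except Exception as exc:  # record the failure type instead of crashing the run
        result["error"] = type(exc).__name__
        return result
    result["success"] = bool(verify(output)) if verify else True
    if not result["success"]:
        result["error"] = "verification_failed"
    return result


def summarize(results: List[dict]) -> dict:
    """Metrics step: overall success rate across task results."""
    total = len(results)
    passed = sum(r["success"] for r in results)
    return {"total": total, "passed": passed,
            "success_rate": passed / total if total else 0.0}
```

Because `run_task` takes the selector as an argument, a rule-based and an LLM-based selector can be compared on identical search, execution, and verification steps.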

Why this project is useful

This repository is designed to be:

executable
benchmarkable
comparable
easy to explain in an internship interview

Instead of only showing a demo, it includes:

structured execution flow
benchmark runner
metrics summary
error summary
saved outputs

Repository layout

tooluse-coding-agent/
├─ data/
│  └─ benchmark/
│     ├─ tasks.json
│     ├─ tasks_harder.json
│     └─ tasks_holdout.json
├─ docs/
│  └─ experiments.md
├─ logs/
├─ outputs/
│  └─ benchmarks/
├─ src/
│  ├─ agent_runner.py
│  ├─ benchmark_runner.py
│  ├─ config.py
│  ├─ error_analyzer.py
│  ├─ llm_selector.py
│  ├─ logger_utils.py
│  ├─ metrics.py
│  ├─ retry_manager.py
│  ├─ selector.py
│  ├─ task_parser.py
│  ├─ trajectory.py
│  └─ tools/
│     ├─ file_editor.py
│     ├─ file_reader.py
│     ├─ path_utils.py
│     ├─ repo_search.py
│     └─ test_runner.py
├─ tests/
│  ├─ test_task_parser.py
│  └─ test_path_utils.py
├─ requirements.txt
└─ README.md

Quick start

1. Install dependencies

pip install -r requirements.txt

2. Configure environment

To run the LLM selector, create a .env file in the project root. Example:

LLM_MODEL=openai/qwen-plus
LLM_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
DASHSCOPE_API_KEY=your_api_key_here

The .env file is not required if you only run the rule-based selector.
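
For reference, here is one way these settings could be loaded and validated with the standard library alone. This is a sketch under assumptions: `parse_env_file` and `load_llm_settings` are hypothetical helpers, the parser handles only simple `KEY=VALUE` lines, and the project's own `src/config.py` may load them differently.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Variable names taken from the example .env above.
REQUIRED_FOR_LLM = ("LLM_MODEL", "LLM_API_BASE", "DASHSCOPE_API_KEY")


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_llm_settings(
    env_path: Path, environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Merge .env values with the process environment (the environment wins)."""
    environ = os.environ if environ is None else environ
    merged = parse_env_file(env_path)
    merged.update({k: environ[k] for k in REQUIRED_FOR_LLM if k in environ})
    missing = [k for k in REQUIRED_FOR_LLM if not merged.get(k)]
    if missing:
        raise RuntimeError("Missing LLM settings: " + ", ".join(missing))
    return {k: merged[k] for k in REQUIRED_FOR_LLM}
```

Failing early with the list of missing variables makes a misconfigured LLM run easier to diagnose than an authentication error partway through a benchmark.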

3. Run benchmark

Run the default benchmark:

python src/benchmark_runner.py

Run the LLM selector on the holdout set:

python src/benchmark_runner.py \
  --selector-mode llm \
  --task-file data/benchmark/tasks_holdout.json \
  --run-name llm_holdout

Run the rule-based selector on the holdout set:

python src/benchmark_runner.py \
  --selector-mode rule \
  --task-file data/benchmark/tasks_holdout.json \
  --run-name rule_holdout
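
The three flags used above could be declared with `argparse` roughly as follows. The flag names come from the commands above; the default values and the `build_parser` / `parse_args` helpers are assumptions for illustration and may not match `src/benchmark_runner.py`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the benchmark flags; all defaults here are assumed, not confirmed."""
    parser = argparse.ArgumentParser(description="Run a selector benchmark.")
    parser.add_argument(
        "--selector-mode",
        choices=["rule", "llm"],
        default="rule",  # assumed default
        help="Which decision layer picks the tool.",
    )
    parser.add_argument(
        "--task-file",
        default="data/benchmark/tasks.json",  # assumed default
        help="Path to the benchmark task file.",
    )
    parser.add_argument(
        "--run-name",
        default="default",  # assumed default
        help="Label used to tell saved runs apart.",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

Restricting `--selector-mode` with `choices` means a typo such as `--selector-mode lmm` is rejected before any task runs.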

4. Run tests

pytest -q

Benchmark sets

tasks.json: Template-style tasks used to verify basic pipeline stability.
tasks_harder.json: Tasks phrased in more natural language, used to measure selector robustness.
tasks_holdout.json: Holdout tasks; results on this set are the headline numbers.

Results

Report the holdout benchmark as the main result.

Current snapshot

Holdout benchmark (rule): TODO
Holdout benchmark (llm): TODO

Fill in the numbers after running:

python src/benchmark_runner.py --selector-mode rule --task-file data/benchmark/tasks_holdout.json --run-name rule_holdout
python src/benchmark_runner.py --selector-mode llm --task-file data/benchmark/tasks_holdout.json --run-name llm_holdout
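
Once both runs have finished, the two selectors can be compared task by task. The sketch below assumes each run yields a list of records with a `task_id` and a boolean `success` field; that record shape and the `compare_runs` helper are hypothetical, so adapt them to whatever the files under outputs/benchmarks/ actually contain.

```python
from typing import Dict, List


def compare_runs(rule_results: List[dict], llm_results: List[dict]) -> dict:
    """Compare two runs on the tasks they share and list where they disagree."""
    rule: Dict[str, bool] = {r["task_id"]: bool(r["success"]) for r in rule_results}
    llm: Dict[str, bool] = {r["task_id"]: bool(r["success"]) for r in llm_results}
    shared = sorted(rule.keys() & llm.keys())
    total = len(shared)

    def rate(run: Dict[str, bool]) -> float:
        # Success rate restricted to shared tasks, so both rates are comparable.
        return sum(run[t] for t in shared) / total if total else 0.0

    return {
        "tasks": total,
        "rule_success_rate": rate(rule),
        "llm_success_rate": rate(llm),
        "only_rule_passed": [t for t in shared if rule[t] and not llm[t]],
        "only_llm_passed": [t for t in shared if llm[t] and not rule[t]],
    }
```

The disagreement lists are often more informative than the two rates alone, since equal success rates can hide different failure cases.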

Notes on interpretation

This project is still a prototype. That means:

the task space is intentionally narrow
benchmark scale is small
retry / reflection is lightweight
verification is minimal compared with full software engineering workflows
repo search is designed for small repositories, not production-scale systems

So this repository should be interpreted as:

a compact tool-use agent prototype
a selector comparison benchmark
a portfolio project for applied AI / agent internships

not as:

a production-ready autonomous engineer
a full-scale code agent platform

Next improvements

High-value next steps:

strengthen verifier abstraction
improve parser coverage
improve repo search ranking
add CI
report holdout results more prominently
