Open-source, self-hostable benchmark runner for AI coding agents.
- Pluggable Harnesses — Benchmark any agent: OpenShark, Hermes, Claude Code, Codex, OpenCode, or generic OpenAI-compatible APIs
- Benchmark Suites — SWE-bench, Terminal-bench, LiveCodeBench (extensible)
- Concurrent Execution — Semaphore-based parallelism with timeouts and retries
- Persistent Results — SQLite store with queryable history
- Beautiful TUI — Real-time ratatui dashboard with synthwave aesthetic
- Web Dashboard — Embedded Axum server for result visualization
- Report Export — Markdown, JSON, HTML
# Clone
git clone https://github.com/synthalorian/agentbench.git
cd agentbench
# Build
cargo build --release
# List available harnesses and benchmarks
./target/release/agentbench list
# Run the sample benchmark (uses local sample data, no API needed)
./target/release/agentbench run \
--config benches/sample-benchmark.yml \
--harness generic
# Dry-run mode — test without any external API calls
./target/release/agentbench run \
--config benches/sample-benchmark.yml \
--harness generic \
--dry-run
# Limit to first 2 tasks (quick smoke test)
./target/release/agentbench run \
--config benches/sample-benchmark.yml \
--harness generic \
--dry-run \
--max-tasks 2
# Run terminal-bench sample
./target/release/agentbench run \
--config benches/terminal-bench-sample.yml \
--harness generic \
--dry-run
# Run livecodebench sample
./target/release/agentbench run \
--config benches/livecodebench-sample.yml \
--harness generic \
--dry-run
# Start TUI dashboard
./target/release/agentbench tui
# Start web dashboard
./target/release/agentbench web --port 8910
# Generate a report from a previous run
./target/release/agentbench report <run-id>Benchmark suites are defined in YAML. See benches/sample-benchmark.yml:
name: "AgentBench Sample"
description: "Sample benchmark for testing AgentBench with local data"
benchmark_type: "swe_bench"
dataset:
source: "local"
path: "./data/swe-bench-sample.json"
harness:
name: "generic-openai"
adapter: "generic"
endpoint: "http://localhost:8080/v1"
model: "local-model"
extra:
max_tokens: 4096
temperature: 0.0
runner:
max_workers: 2
timeout_secs: 60
retries: 0
scoring:
metric: "pass_rate"
thresholds:
excellent: 0.80
good: 0.50
acceptable: 0.20| Adapter | Description | Config |
|---|---|---|
generic |
OpenAI-compatible API | endpoint, api_key, model |
mock |
Dry-run / testing | No config needed |
openshark |
OpenShark harness | endpoint, api_key |
hermes |
Hermes agent | endpoint or CLI path |
claude_code |
Claude Code CLI | workspace directory |
codex |
OpenAI Codex | api_key |
opencode |
OpenCode CLI/API | endpoint, api_key |
| Type | Description | Validation |
|---|---|---|
swe_bench |
Software engineering tasks | Docker + pytest |
terminal_bench |
Shell command tasks | stdout matching |
livecodebench |
Live coding challenges | Compile + test |
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run a specific test
cargo test test_swe_bench_load_tasks -- --nocaptureagentbench/
├── benches/ # Benchmark config YAML files
├── data/ # Sample datasets for testing
├── src/
│ ├── benchmark/ # Benchmark suite implementations
│ ├── harness/ # Harness adapters
│ ├── tui/ # Ratatui dashboard
│ ├── web/ # Axum web server
│ ├── config.rs # YAML config loading
│ ├── db.rs # SQLite persistence
│ ├── runner.rs # Concurrent task execution
│ ├── metrics.rs # Cost tracking
│ └── report.rs # Report generation
└── tests/ # Integration tests
┌─────────┐ ┌─────────────┐ ┌──────────────┐
│ CLI │───▶│ Runner │───▶│ Harness │
│ / TUI │ │ (tokio) │ │ Adapters │
└─────────┘ └─────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ Benchmark │
│ Suites │
└──────────────┘
│
▼
┌──────────────┐
│ SQLite │
│ Results │
└──────────────┘
MIT — see LICENSE
Made by synth with synthshark