An automated benchmark framework designed for OpenAgent, covering four dimensions: performance baseline, dialogue quality, complex long-horizon tasks, and tool invocation. It enables fast, quantitative evaluation of agent performance.
- Overview
- Core Features
- Benchmark Dimensions
- Quick Start
- CLI Arguments
- Project Structure
- Datasets
- Reading Reports
- Extending AgentBench
- System Requirements
- Security & Privacy
- License
## Overview

AgentBench is a lightweight benchmark framework for the OpenAgent ecosystem. Through standardized datasets, reproducible test workflows, and automated report generation, it helps developers:
- Quantify performance: obtain key metrics such as latency, token consumption, and success rate.
- Validate correctness: check output format, factual accuracy, and instruction-following fidelity.
- Evaluate tool invocation: verify output quality and evidence completeness in tool-calling scenarios (e.g. `web_search`, `shell`, `browser_use`).
- Stress-test long-horizon tasks: exercise deep reasoning and structured-output capabilities via multi-stage complex prompts.
The framework relies solely on the Python standard library (with only optional PyYAML for legacy dataset support), making it plug-and-play and easy to integrate into CI/CD pipelines.
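Because every report is plain JSON, a CI step can gate on the aggregated results with a few lines of standard-library Python. The sketch below is illustrative only: the `success_rate` field name and the `ci_gate.py` filename are assumptions, not part of AgentBench; check a real run's `summary.json` for the actual schema.

```python
# ci_gate.py — illustrative CI gate over an AgentBench session directory.
# The "success_rate" field name is an assumption; inspect summary.json first.
import json
import sys
from pathlib import Path


def gate(session_dir: str, min_success_rate: float = 0.9) -> int:
    summary = json.loads(Path(session_dir, "summary.json").read_text(encoding="utf-8"))
    rate = summary.get("success_rate", 0.0)  # hypothetical field name
    print(f"overall success rate: {rate:.2%}")
    return 0 if rate >= min_success_rate else 1


if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```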
## Core Features

| Feature | Description |
|---|---|
| Four benchmark suites | baseperf (performance baseline), dialogue (dialogue quality), hardchat (complex long-horizon tasks), tool (tool invocation validation) |
| Multi-round & retry | Supports --rounds multi-round testing and --max-attempts retry on failure for more stable results |
| Automated reports | Each run auto-generates report.md, summary.json, and details.jsonl, including a 95% Bootstrap confidence interval |
| Server-side log verification | For the Tool suite, optionally cross-checks invocation traces against OpenAgent server logs |
| Zero-config out-of-the-box | Rich built-in datasets; launch a full benchmark with a single command |
| Extensible architecture | Add new benchmark dimensions easily via the SuiteBase abstract base class |
## Benchmark Dimensions

### BasePerf (performance baseline)

- Goal: test basic OpenAgent availability, response latency, and token consumption.
- Dataset: `datasets/baseperf/dataset.jsonl`
- Validation logic: verify HTTP response success, valid JSON parsing, and non-empty assistant output (see the sketch after this list).
- Output metrics: success rate, average latency, token mean / standard deviation.
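For intuition, a self-contained probe in the same spirit might look like the following. This is illustrative only: the `/v1/chat/completions` path and the response shape are assumptions about the OpenAgent API, not AgentBench's actual `http_client`.

```python
# Illustrative availability/latency probe, not AgentBench's actual code.
# Assumes an OpenAI-compatible chat completions endpoint; adjust to the real API.
import json
import time
import urllib.request


def probe(base_url: str, model: str, provider_key: str, prompt: str, timeout_s: int = 240):
    payload = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {provider_key}"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.loads(resp.read())                    # fails if the reply is not valid JSON
    latency_ms = (time.monotonic() - start) * 1000
    content = body["choices"][0]["message"]["content"]    # assumed response shape
    ok = bool(content and content.strip())                # non-empty assistant output
    return ok, latency_ms, body.get("usage", {})          # usage ≈ token consumption
```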
### Dialogue (dialogue quality)

- Goal: validate format compliance, factual accuracy, and instruction understanding.
- Dataset: `datasets/dialogue/dataset.jsonl`
- Validation logic (see the sketch after this list):
  - Format check: require JSON Object / JSON Array output with mandatory fields.
  - Fact matching: verify key information via regular expressions.
  - Instruction following: check character count, word count, line prefix, forbidden words, etc.
- Typical tasks: JSON formatting, factual Q&A, translation, text rewriting.
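A minimal sketch of what the content checks amount to is shown below. The task field names (`fact_patterns`, `max_chars`, `forbidden_words`) are illustrative placeholders, not the exact dataset schema; the real validation lives in `suites/dialogue.py`.

```python
# Illustrative dialogue-style validator; field names are placeholder assumptions.
import re


def check_dialogue(task: dict, output: str) -> list[str]:
    failures = []
    for pattern in task.get("fact_patterns", []):           # fact matching via regex
        if not re.search(pattern, output):
            failures.append(f"fact_pattern_missing:{pattern}")
    max_chars = task.get("max_chars")
    if max_chars is not None and len(output) > max_chars:   # character-count constraint
        failures.append("max_chars_exceeded")
    for word in task.get("forbidden_words", []):             # forbidden-word constraint
        if word in output:
            failures.append(f"forbidden_word:{word}")
    return failures
```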
### HardChat (complex long-horizon tasks)

- Goal: test the agent's ability to handle high-complexity, multi-stage prompts with strict structured JSON output.
- Dataset: `datasets/hardchat/tasks.jsonl`
- Validation logic: verify that the returned JSON contains the three required stage fields (`collect`, `normalize`, `summarize`).
- Typical tasks:
  - Deep web research and synthesis reports (WebSearch Synthesis)
  - In-depth RFC technical document analysis and comparison
  - Shell environment diagnostics and governance blueprints
  - Browser-based multi-step information workflows
### Tool (tool invocation validation)

- Goal: verify that the agent can produce structured output conforming to specifications in tool-calling scenarios, and present a complete evidence chain.
- Dataset: `datasets/tool/dataset.jsonl`
- Validation logic (see the sketch after this list):
  - Validate mandatory JSON fields (e.g. `task_id`, `claims`, `evidence.tool_calls`).
  - Check that required tool names appear in `evidence.tool_calls`.
  - (Optional) Server-side log verification: scan OpenAgent's `openagent.log` and console output to cross-check invocation traces.
- Covered tool types: `web_search`, `web_fetch`, `shell`, `local_text_write`, `browser_use_open`, `browser_use_snapshot`, etc.
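The sketch below illustrates the tool-name check described above. It is a simplified reading of the behavior documented in this README; the actual implementation lives in `suites/tool.py` and may differ in detail.

```python
# Illustrative evidence-chain check, assuming tool_calls entries are dicts.
def declared_tool_names(response: dict) -> set[str]:
    calls = (response.get("evidence") or {}).get("tool_calls") or response.get("tool_calls") or []
    names = set()
    for call in calls:
        name = (
            call.get("name")
            or call.get("tool")
            or (call.get("function") or {}).get("name")
            or ""
        )
        if name:
            # One plausible reading of "dot-prefixed tool IDs": keep the last segment,
            # e.g. "mcp.web_search" -> "web_search".
            names.add(name.split(".")[-1])
    return names


def missing_tools(required: list[str], response: dict) -> list[str]:
    present = declared_tool_names(response)
    return [t for t in required if t not in present]
```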
## Quick Start

Ensure Python >= 3.10 is installed.
Option 1: Direct install (recommended)
```bash
# Clone the repository
git clone https://github.com/the-open-agent/agentbench.git
cd agentbench

# Install optional dependency (PyYAML for legacy HardChat YAML datasets)
pip install -r requirements.txt
```

Option 2: Docker
```bash
# Build the image
docker build -t agentbench .

# Run the benchmark (point to your OpenAgent instance)
docker run --rm agentbench \
  --base-url http://host.docker.internal:14000 \
  --provider-key $OPENAGENT_PROVIDER_KEY
```

AgentBench sends requests to http://127.0.0.1:14000 by default. Make sure your OpenAgent service is running and listening on that address.
```bash
# Run all suites (recommended for first use)
python -m agentbench run \
  --base-url http://127.0.0.1:14000 \
  --model deepseek/deepseek-v4-flash \
  --provider-key $OPENAGENT_PROVIDER_KEY \
  --rounds 3 \
  --max-attempts 2 \
  --timeout 240

# Run only the tool invocation suite
python -m agentbench run --suite tool --provider-key $OPENAGENT_PROVIDER_KEY

# Run only the dialogue quality suite
python -m agentbench run --suite dialogue --provider-key $OPENAGENT_PROVIDER_KEY
```

Tip: `--provider-key` can also be passed via the `OPENAGENT_PROVIDER_KEY` environment variable to avoid exposing secrets on the command line.
After the run completes, the terminal will print something like:
```
Session written to: results/session-20260512-143052-a1b2c3d4
```
Navigate into that directory to inspect the outputs:
```
session-20260512-143052-a1b2c3d4/
├── meta.json        # Test metadata (config, health check, timestamp)
├── summary.json     # Aggregated stats (success rate, latency, tokens, top-10 failure reasons)
├── details.jsonl    # Detailed record of every request
├── report.md        # Human-readable comprehensive report
├── baseperf/
│   ├── report.md
│   ├── summary.json
│   └── details.jsonl
├── dialogue/
│   └── ...
├── hardchat/
│   └── ...
└── tool/
    └── ...
```
## CLI Arguments

| Argument | Default | Description |
|---|---|---|
| command | `run` | Subcommand; currently only `run` is supported |
| `--suite` | `all` | Select benchmark suite: `baseperf`, `dialogue`, `hardchat`, `tool`, `all` |
| `--base-url` | `http://127.0.0.1:14000` | OpenAgent service URL |
| `--model` | `deepseek/deepseek-v4-flash` | Model name |
| `--provider-key` | `$OPENAGENT_PROVIDER_KEY` | API key |
| `--rounds` | `3` | Number of rounds per task |
| `--max-attempts` | `2` | Maximum retries after a single-round failure |
| `--timeout` | `240` | Per-request timeout in seconds |
## Project Structure

```
agentbench/
├── benchcore/                   # Core engine
│   ├── runner.py                # Benchmark main loop (health check → multi-round execution → result aggregation)
│   ├── models.py                # Data models: RunContext, RunRecord, SuiteResult
│   ├── http_client.py           # HTTP client (health_check, post_json)
│   ├── reporting.py             # Report generation (Markdown + JSON + JSONL)
│   ├── stats.py                 # Statistics utilities (mean, std, Bootstrap 95% CI)
│   └── utils.py                 # Common utilities (JSON read/write, timestamps, JSON parsing fallback)
├── suites/                      # Benchmark suite implementations
│   ├── base.py                  # Abstract base class SuiteBase
│   ├── baseperf.py              # Performance baseline suite
│   ├── dialogue.py              # Dialogue quality suite
│   ├── hardchat.py              # Complex long-horizon task suite
│   ├── tool.py                  # Tool invocation validation suite
│   ├── easychat_common.py       # Shared logic for BasePerf / Dialogue
│   └── openagent_log_verify.py  # OpenAgent server-side log cross-check
├── datasets/                    # Built-in datasets
│   ├── baseperf/dataset.jsonl
│   ├── dialogue/dataset.jsonl
│   ├── hardchat/tasks.jsonl
│   └── tool/dataset.jsonl
├── cli.py                       # CLI entry point
├── __main__.py                  # `python -m agentbench` entry point
├── requirements.txt             # Python dependencies
├── pyproject.toml               # Project metadata and packaging config
├── Dockerfile                   # Docker build file
├── .gitignore                   # Git ignore rules
├── .dockerignore                # Docker build ignore rules
├── LICENSE                      # Apache 2.0 license
└── README.md                    # This file
```
## Datasets

All datasets use the JSON Lines format (one JSON object per line) for easy streaming and extension.
| Dataset | Tasks | Categories | Description |
|---|---|---|---|
| `baseperf` | 40 | format / factual / instruction / readability / completion | Covers JSON formatting, factual Q&A, instruction following, text rewriting, and completion tasks |
| `dialogue` | 32 | format / factual / instruction / readability | Split from baseperf; focused on dialogue quality evaluation |
| `hardchat` | 6 | web_search / web_fetch / shell / file_ops / browser / mixed | High-difficulty long-horizon tasks requiring multi-stage structured output |
| `tool` | 12 | WebSearch / WebFetch / Shell / File / Browser / Mixed | Validates structured output and evidence-chain completeness in tool-calling scenarios |
Some tool tasks use relative paths (e.g. `./scratch/`, `./openagent/logs/`). Create the corresponding directories in the current working directory before running, or modify the paths in `datasets/tool/dataset.jsonl` to suit your environment.

You can directly edit `datasets/<suite>/dataset.jsonl`. The tool suite uses a single `datasets/tool/dataset.jsonl` (one task per line). For backward compatibility, the runtime also supports the legacy `tasks_index.json` + `tasks/*.json` fallback loading pattern.
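For example, tasks can be read and appended with the standard library only. The task fields shown below (`id`, `prompt`, `category`) are placeholders for illustration, not the documented schema; copy an existing line from the dataset as your template.

```python
# Read a JSONL dataset and append one (placeholder) task.
import json
from pathlib import Path

path = Path("datasets/dialogue/dataset.jsonl")
tasks = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
print(f"loaded {len(tasks)} tasks")

new_task = {"id": "dialogue-custom-001", "prompt": "…", "category": "factual"}  # placeholder fields
with path.open("a", encoding="utf-8") as f:
    f.write(json.dumps(new_task, ensure_ascii=False) + "\n")
```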
## Reading Reports

### Success rate

The proportion of tasks across all rounds that passed validation for a given suite. The report also includes a 95% Bootstrap confidence interval to help assess statistical significance (see the sketch at the end of this section).

### Latency

- mean: average response latency in milliseconds
- std: latency variability
- worst: worst-case latency

### Token consumption

- mean / std: average and standard deviation, useful for cost and model-efficiency assessment.

### Failure reasons

Aggregates the most common failure types, for example:

- `empty_response`: empty content returned
- `json_format_invalid`: JSON parsing failed
- `missing_key:xxx`: a mandatory field is missing
- `fact_pattern_missing:xxx`: factual pattern match failed
- `max_chars_exceeded`: response character count exceeds the `max_chars` limit specified in the task (dialogue tasks)
- `missing_tool_call:...`: the required tool name is not declared in `evidence.tool_calls` (or root-level `tool_calls`); the validator recognizes `name` / `tool` / `function.name` and dot-prefixed tool IDs.
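For reference, a percentile-bootstrap 95% confidence interval over per-task pass/fail outcomes can be sketched as follows; `benchcore/stats.py` may use a different resample count or method.

```python
# Percentile bootstrap CI over binary outcomes (1 = pass, 0 = fail).
import random


def bootstrap_ci(outcomes: list[int], resamples: int = 2000, alpha: float = 0.05) -> tuple[float, float]:
    n = len(outcomes)
    means = sorted(
        sum(random.choices(outcomes, k=n)) / n  # resample with replacement, take the mean
        for _ in range(resamples)
    )
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


# e.g. bootstrap_ci([1, 1, 0, 1, 1, 0, 1, 1]) -> roughly (0.5, 1.0)
```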
## Extending AgentBench

- Create a new file under `suites/` that inherits from `SuiteBase`:
```python
from .base import SuiteBase
from ..benchcore.models import RunRecord


class MySuite(SuiteBase):
    name = "mysuite"

    def load_tasks(self) -> list[dict]:
        # Load your dataset
        return [...]

    def run_task(self, task, round_index, attempt, base_url, model, provider_key, timeout_s) -> RunRecord:
        # Implement evaluation logic
        return RunRecord(...)
```

- Register it in `cli.py` under `suite_map`:
```python
from .suites.mysuite import MySuite

suite_map = {
    ...
    "mysuite": MySuite(root),
}
```

- Run:

```bash
python -m agentbench run --suite mysuite
```

Other extension points:

- `benchcore/stats.py`: customize statistical methods (e.g. swap the confidence-interval algorithm).
- `benchcore/reporting.py`: modify the Markdown report template or add new output formats.
## System Requirements

| Item | Requirement |
|---|---|
| Python | >= 3.10 |
| OS | Windows, Linux, macOS |
| Network | Must be able to reach the OpenAgent service address |
Cross-platform note: AgentBench core code depends only on the Python standard library. All path operations use `pathlib`, and HTTP requests use the standard-library `urllib`, so it runs correctly on Windows, Linux, and macOS.
## Security & Privacy

- Never commit API keys, cookies, internal URLs, or evaluation results containing personal data to the repository; `results/` is already ignored in `.gitignore`.
- Optional console log fallback paths: the `OPENAGENT_CONSOLE_LOG` environment variable, or `logs/openagent_console.log` in the current working directory (see `openagent_log_verify.py`).
## License

This project is licensed under the Apache License 2.0.
AgentBench is an important part of the OpenAgent ecosystem. If you discover interesting cases or have suggestions for improvement while benchmarking, please open an issue on agentbench Issues or submit a PR to help refine the evaluation standards.