Terminal-Bench arena for measuring Lenos on real terminal tasks. Agon runs Lenos agents against Terminal-Bench tasks, capturing structured results for benchmarking and regression detection.
agon_bench/
├── runner/ # Runner adapter — invokes Lenos, writes JSONL results
├── tasks/ # Terminal-Bench task corpus
│ └── smoke/ # Minimal smoke task for harness validation
├── results/ # JSONL result output (git-ignored)
├── transcripts/ # Per-run transcripts (git-ignored)
└── STORAGE.md # Result/transcript format docs
Makefile # Build, run, and test commands
scripts/ # Shell scripts backing Makefile targets
- Docker — containerized execution
- Go 1.26+ — building lenos from source inside the Docker image
- Python 3.12+ — runner and test infrastructure (python3, pip3)
- make (or bash) — primary local interface
- ruff (optional) — Python formatting and linting
- shfmt (optional) — shell script formatting
Provider credentials are never stored in this repo, baked into Docker images, or passed as environment variables. The container mounts your host Lenos config directory at runtime:
~/.local/share/lenos → /root/.local/share/lenos:ro
Lenos inside the container reads your existing config.json to authenticate with providers. The mount is read-only — the container can read credentials but cannot modify or write back to the host config.
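The mount above can be sketched in Python as a small helper that assembles the corresponding docker run arguments. The helper name lenos_mount_args is hypothetical and not part of the runner; only the host and container paths and the read-only flag come from the mapping above.

```python
from pathlib import Path


def lenos_mount_args(home: Path) -> list[str]:
    """Return docker-run arguments that bind-mount the host Lenos config.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not the runner's actual API.
    """
    host_dir = home / ".local" / "share" / "lenos"
    container_dir = "/root/.local/share/lenos"
    # The trailing ":ro" makes the bind mount read-only inside the container,
    # so the container can read config.json but cannot write back to the host.
    return ["-v", f"{host_dir}:{container_dir}:ro"]
```

Calling lenos_mount_args(Path.home()) yields the -v argument pair to splice into a docker run invocation ahead of the image name.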
The Docker image is based on the Terminal-Bench base image (ghcr.io/laude-institute/t-bench/ubuntu-24-04), which provides tmux and asciinema. On top of that, it bundles:
- lenos — built from source via git clone + go build (avoids replace directive issues with go install)
- temenos — downloaded from pinned GitHub releases
Build args control versions:
docker build -t agon-bench \
--build-arg LENOS_REF=main \
--build-arg TEMENOS_VERSION=v0.9.0 \
--build-arg GO_VERSION=1.26.2 \
-f agon_bench/runner/Dockerfile .
To override the lenos binary with a locally built one, mount it at runtime:
docker run --rm \
-v $(pwd)/agon_bench/tasks/smoke:/tasks/smoke \
-v $(pwd)/agon_bench/results:/results \
-v $(pwd)/agon_bench/transcripts:/transcripts \
-v $(which lenos):/usr/local/bin/lenos \
agon-bench --task smoke
The CI workflow (.github/workflows/build-image.yml) runs two verification layers:
- Image build — PRs build-check (no push), main/tags push to GHCR
- Binary smoke check — verifies lenos --version, temenos --version, Python version, and runner scripts are present in the built image
- Smoke task verifier — runs solution.sh then pytest against the smoke task test suite; confirms the task package is intact and tests pass against the reference solution
No provider credentials required for Layer 1.
- Agent benchmark — lenos run solves Terminal-Bench tasks with a real model
- Reads provider credentials from the host Lenos config mounted read-only
- Not part of the standard CI pipeline; run locally with make run-smoke
Pre-built images are published to GHCR on every push to main and on tags:
ghcr.io/tta-lab/agon-runner:latest # main branch
ghcr.io/tta-lab/agon-runner:sha-<sha> # immutable per-commit
ghcr.io/tta-lab/agon-runner:v1.0.0 # semver tag
Pull the pre-built image instead of building locally:
docker pull ghcr.io/tta-lab/agon-runner:latest
Use make or ./scripts/*.sh as the primary local interface.
# 1. Set up Lenos on your host (if not already done)
lenos
# 2. Build the benchmark image
make build-image
# 3. Run the smoke task (reads your host Lenos config)
make run-smoke
# 4. Verify reference solution (no keys needed)
make test-solution
# 5. View results
make results
# 6. Interactive shell with Lenos config mounted
make shell
# 7. Format and lint
make fmt
make lint
# 8. Clean up
make clean
All commands delegate to scripts in scripts/. Run make help to see all targets.
- Create a directory under agon_bench/tasks/<task-name>/
- Add the required files:
tasks/<task-name>/
├── task.yaml # Task description and config
├── Dockerfile # Environment (extend agon-bench or custom)
├── docker-compose.yaml # Container orchestration
├── run-tests.sh # Test entrypoint
├── setup-uv-pytest.sh # Install test deps
├── run-uv-pytest.sh # Execute tests
├── solution.sh # Reference solution
└── tests/
    └── test_outputs.py # Verification tests
- See Terminal-Bench task docs for the full format spec
- Tasks inherit the agon-bench base image by default (lenos + temenos + Python)
- Custom tasks needing different environments should follow the Terminal-Bench Docker guidelines
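Before running a new task, it can help to confirm the package is complete. The following is a minimal sketch, assuming the file list above is the full set of required files; the function name missing_task_files is hypothetical and does not exist in the runner.

```python
from pathlib import Path

# File names taken from the task layout listed above.
REQUIRED_FILES = [
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "run-tests.sh",
    "setup-uv-pytest.sh",
    "run-uv-pytest.sh",
    "solution.sh",
    "tests/test_outputs.py",
]


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required files absent from task_dir, in layout order.

    Hypothetical helper for illustration only.
    """
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
```

An empty return value means every listed file is present; it says nothing about whether the contents are valid Terminal-Bench task definitions.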
- More tasks: Add tasks from Terminal-Bench categories — file manipulation, git operations, system administration, data processing
- Batch runner: Run multiple tasks in sequence and aggregate JSONL results
- Model comparison: Run the same task set across different models and compare pass rates
- CI integration: Run benchmark as a GitHub Actions workflow on Lenos releases for regression detection
- Leaderboard: Publish benchmark results to track Lenos improvements over time
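As a rough illustration of the batch-runner and model-comparison items, per-model pass rates could be computed from JSONL result lines along these lines. This is only a sketch: the field names model and passed are assumptions made for the example, and STORAGE.md remains the authority on the actual result schema.

```python
import json
from collections import defaultdict


def pass_rates(jsonl_lines):
    """Compute the pass rate per model from an iterable of JSONL lines.

    Assumes each record carries a "model" string and a boolean "passed"
    field; both names are hypothetical, not the documented schema.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        totals[record["model"]] += 1
        passes[record["model"]] += bool(record["passed"])
    return {model: passes[model] / totals[model] for model in totals}
```

Because the function accepts any iterable of strings, an open result file can be passed directly and is read one line at a time.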