Agon

Terminal-Bench arena for measuring Lenos on real terminal tasks. Agon runs Lenos agents against Terminal-Bench tasks, capturing structured results for benchmarking and regression detection.

Structure

agon_bench/
├── runner/          # Runner adapter — invokes Lenos, writes JSONL results
├── tasks/           # Terminal-Bench task corpus
│   └── smoke/       # Minimal smoke task for harness validation
├── results/         # JSONL result output (git-ignored)
├── transcripts/     # Per-run transcripts (git-ignored)
└── STORAGE.md       # Result/transcript format docs
Makefile             # Build, run, and test commands
scripts/             # Shell scripts backing Makefile targets

Requirements

Host Tools

Docker — containerized execution
Go 1.26+ — building lenos from source inside the Docker image
Python 3.12+ — runner and test infrastructure (python3, pip3)
make (or bash) — primary local interface
ruff (optional) — Python formatting and linting
shfmt (optional) — shell script formatting

Secrets

Provider credentials are never stored in this repo, baked into Docker images, or passed as environment variables. The container mounts your host Lenos config directory at runtime:

~/.local/share/lenos  →  /root/.local/share/lenos:ro

Lenos inside the container reads your existing config.json to authenticate with providers. The mount is read-only — the container can read credentials but cannot modify or write back to the host config.
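The mount rule above can be sketched as a small helper that assembles the `docker run` arguments. This is only an illustration, not the runner's actual code: the function names `lenos_config_mount` and `build_run_command` are hypothetical, and the image name and `--task` argument are borrowed from the examples elsewhere in this README.

```python
from pathlib import Path


def lenos_config_mount(home: Path) -> str:
    """Return the docker -v value for the read-only Lenos config mount."""
    host_dir = home / ".local" / "share" / "lenos"
    # The trailing ":ro" makes the bind mount read-only inside the container.
    return f"{host_dir}:/root/.local/share/lenos:ro"


def build_run_command(task: str, home: Path, image: str = "agon-bench") -> list[str]:
    """Assemble a docker run argv (hypothetical helper).

    Credentials reach the container only through the mounted config
    directory, so no -e/--env flags are added.
    """
    return [
        "docker", "run", "--rm",
        "-v", lenos_config_mount(home),
        image, "--task", task,
    ]


if __name__ == "__main__":
    print(" ".join(build_run_command("smoke", Path.home())))
```

Building the argv as a list rather than a shell string avoids quoting problems if the home directory contains spaces.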

Binary Acquisition

The Docker image is based on the Terminal-Bench base image (ghcr.io/laude-institute/t-bench/ubuntu-24-04), which provides tmux and asciinema. On top of that, it bundles:

lenos — built from source via git clone + go build (avoids replace directive issues with go install)
temenos — downloaded from pinned GitHub releases

Build args control versions:

docker build -t agon-bench \
  --build-arg LENOS_REF=main \
  --build-arg TEMENOS_VERSION=v0.9.0 \
  --build-arg GO_VERSION=1.26.2 \
  -f agon_bench/runner/Dockerfile .

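To override the lenos binary with a locally built one, mount it at runtime. The binary must be a Linux build for the image's architecture, since it runs inside the container:

# Quote the expansions so host paths containing spaces are passed intact
docker run --rm \
  -v "$(pwd)/agon_bench/tasks/smoke:/tasks/smoke" \
  -v "$(pwd)/agon_bench/results:/results" \
  -v "$(pwd)/agon_bench/transcripts:/transcripts" \
  -v "$(which lenos):/usr/local/bin/lenos" \
  agon-bench --task smoke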

CI and Image Publishing

Verification is split into two layers. The CI workflow (.github/workflows/build-image.yml) runs Layer 1 on every build; Layer 2 needs provider credentials and is run locally:

Layer 1: No-key (always runs)

Image build — pull requests get a build check only (no push); pushes to main and tags publish the image to GHCR
Binary smoke check — verifies lenos --version, temenos --version, Python version, and runner scripts are present in the built image
Smoke task verifier — runs solution.sh then pytest against the smoke task test suite; confirms the task package is intact and tests pass against the reference solution

No provider credentials are required for Layer 1.

Layer 2: Secret-gated (requires host Lenos config)

Agent benchmark — lenos run solves Terminal-Bench tasks with a real model
Reads provider credentials from the host Lenos config mounted read-only
Not part of the standard CI pipeline; run locally with make run-smoke

GHCR Image Tags

Pre-built images are published to GHCR on every push to main and on tags:

ghcr.io/tta-lab/agon-runner:latest      # main branch
ghcr.io/tta-lab/agon-runner:sha-<sha>   # immutable per-commit
ghcr.io/tta-lab/agon-runner:v1.0.0      # semver tag

Pull the pre-built image instead of building locally:

docker pull ghcr.io/tta-lab/agon-runner:latest
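The tagging scheme above can be sketched as a function from a Git ref and commit SHA to image tags. This is an illustration under assumptions, not the workflow's actual logic: the helper name `image_tags` is hypothetical, and it assumes every published build gets a `sha-<sha>` tag alongside `latest` or the version tag.

```python
REGISTRY_IMAGE = "ghcr.io/tta-lab/agon-runner"


def image_tags(git_ref: str, sha: str) -> list[str]:
    """Map a Git ref and commit SHA to image tags (hypothetical helper)."""
    # Assumption: every published build carries an immutable per-commit tag.
    tags = [f"{REGISTRY_IMAGE}:sha-{sha}"]
    if git_ref == "refs/heads/main":
        tags.append(f"{REGISTRY_IMAGE}:latest")
    elif git_ref.startswith("refs/tags/"):
        # A Git tag such as v1.0.0 is reused verbatim as the image tag.
        tags.append(f"{REGISTRY_IMAGE}:{git_ref.removeprefix('refs/tags/')}")
    return tags


if __name__ == "__main__":
    for tag in image_tags("refs/heads/main", "abc1234"):
        print(tag)
```

For reproducible benchmark runs, pinning the `sha-<sha>` tag is the safer choice, since `latest` moves with every push to main.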

Quickstart

Use make or ./scripts/*.sh as the primary local interface.

# 1. Set up Lenos on your host (if not already done)
lenos
# 2. Build the benchmark image
make build-image
# 3. Run the smoke task (reads your host Lenos config)
make run-smoke
# 4. Verify reference solution (no keys needed)
make test-solution
# 5. View results
make results
# 6. Interactive shell with Lenos config mounted
make shell
# 7. Format and lint
make fmt
make lint
# 8. Clean up
make clean

All commands delegate to scripts in scripts/. Run make help to see all targets.

Adding New Terminal-Bench Tasks

Create a directory under agon_bench/tasks/<task-name>/

Add the required files:

tasks/<task-name>/
├── task.yaml              # Task description and config
├── Dockerfile             # Environment (extend agon-bench or custom)
├── docker-compose.yaml    # Container orchestration
├── run-tests.sh           # Test entrypoint
├── setup-uv-pytest.sh     # Install test deps
├── run-uv-pytest.sh       # Execute tests
├── solution.sh            # Reference solution
└── tests/
    └── test_outputs.py    # Verification tests

See Terminal-Bench task docs for the full format spec
Tasks inherit the agon-bench base image by default (lenos + temenos + Python)
Custom tasks needing different environments should follow the Terminal-Bench Docker guidelines
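Before running a new task, it can help to confirm that the package contains every file in the layout above. The following is a minimal sketch of such a check: `missing_task_files` is a hypothetical helper, not part of the repository, and it checks only that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File list copied from the task layout in this README.
REQUIRED_FILES = [
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "run-tests.sh",
    "setup-uv-pytest.sh",
    "run-uv-pytest.sh",
    "solution.sh",
    "tests/test_outputs.py",
]


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required files that are absent from task_dir, in list order."""
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
```

Calling `missing_task_files(Path("agon_bench/tasks/smoke"))` from the repository root should return an empty list if the smoke task is intact.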

Next Expansion Path

More tasks: Add tasks from Terminal-Bench categories — file manipulation, git operations, system administration, data processing
Batch runner: Run multiple tasks in sequence and aggregate JSONL results
Model comparison: Run the same task set across different models and compare pass rates
CI integration: Run benchmark as a GitHub Actions workflow on Lenos releases for regression detection
Leaderboard: Publish benchmark results to track Lenos improvements over time
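As a rough illustration of the batch-runner and model-comparison items, the sketch below aggregates JSONL results into a pass rate per model. The field names `model` and `passed` are assumptions made for this example; the real result schema is defined in agon_bench/STORAGE.md and may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def pass_rates(results_dir: Path) -> dict[str, float]:
    """Compute the pass rate per model across all *.jsonl files in results_dir."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for path in sorted(results_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                model = record["model"]  # hypothetical field name
                totals[model] += 1
                passes[model] += bool(record["passed"])  # hypothetical field name
    return {model: passes[model] / totals[model] for model in totals}
```

Reading every file in the results directory keeps the aggregation independent of how many runs produced them, so the same function would serve a single run or a batch.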
