tta-lab/agon

Agon

Terminal-Bench arena for measuring Lenos on real terminal tasks. Agon runs Lenos agents against Terminal-Bench tasks, capturing structured results for benchmarking and regression detection.

Structure

agon_bench/
├── runner/          # Runner adapter — invokes Lenos, writes JSONL results
├── tasks/           # Terminal-Bench task corpus
│   └── smoke/       # Minimal smoke task for harness validation
├── results/         # JSONL result output (git-ignored)
├── transcripts/     # Per-run transcripts (git-ignored)
└── STORAGE.md       # Result/transcript format docs
Makefile             # Build, run, and test commands
scripts/             # Shell scripts backing Makefile targets

Requirements

Host Tools

  • Docker — containerized execution
  • Go 1.26+ — building lenos from source inside the Docker image
  • Python 3.12+ — runner and test infrastructure (python3, pip3)
  • make (or bash) — primary local interface
  • ruff (optional) — Python formatting and linting
  • shfmt (optional) — shell script formatting

Secrets

Provider credentials are never stored in this repo, baked into Docker images, or passed as environment variables. The container mounts your host Lenos config directory at runtime:

~/.local/share/lenos  →  /root/.local/share/lenos:ro

Lenos inside the container reads your existing config.json to authenticate with providers. The mount is read-only — the container can read credentials but cannot modify or write back to the host config.
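
The mount described above can be sketched as a small helper that assembles the docker run arguments. This is a minimal illustration, not the runner's actual code: the function names are hypothetical, and the image name and --task flag are taken from the examples elsewhere in this README.

```python
from pathlib import PurePosixPath

# Path inside the container where Lenos looks for its config (from this README).
CONTAINER_CONFIG_DIR = "/root/.local/share/lenos"


def lenos_config_mount(home: PurePosixPath) -> list[str]:
    """Return the docker -v arguments for a read-only Lenos config mount.

    Hypothetical helper: the ':ro' suffix is what keeps the container
    from writing back to the host config directory.
    """
    host_dir = home / ".local" / "share" / "lenos"
    return ["-v", f"{host_dir}:{CONTAINER_CONFIG_DIR}:ro"]


def docker_run_command(home: PurePosixPath, image: str, task: str) -> list[str]:
    """Assemble a docker run argv list; credentials travel only via the mount."""
    return ["docker", "run", "--rm", *lenos_config_mount(home), image, "--task", task]


if __name__ == "__main__":
    print(" ".join(docker_run_command(PurePosixPath("/home/dev"), "agon-bench", "smoke")))
```

Building the command as a list rather than a shell string avoids quoting problems if the home directory contains spaces.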

Binary Acquisition

The Docker image is based on the Terminal-Bench base image (ghcr.io/laude-institute/t-bench/ubuntu-24-04), which provides tmux and asciinema. On top of that, it bundles:

  • lenos — built from source via git clone + go build (avoids replace directive issues with go install)
  • temenos — downloaded from pinned GitHub releases

Build args control versions:

docker build -t agon-bench \
  --build-arg LENOS_REF=main \
  --build-arg TEMENOS_VERSION=v0.9.0 \
  --build-arg GO_VERSION=1.26.2 \
  -f agon_bench/runner/Dockerfile .

To override the lenos binary with a locally built one, mount it at runtime:

docker run --rm \
  -v $(pwd)/agon_bench/tasks/smoke:/tasks/smoke \
  -v $(pwd)/agon_bench/results:/results \
  -v $(pwd)/agon_bench/transcripts:/transcripts \
  -v $(which lenos):/usr/local/bin/lenos \
  agon-bench --task smoke

CI and Image Publishing

Verification runs in two layers. The CI workflow (.github/workflows/build-image.yml) covers Layer 1; Layer 2 needs provider credentials and is run locally.

Layer 1: No-key (always runs)

  • Image build — pull requests build the image without pushing; pushes to main and tags publish it to GHCR
  • Binary smoke check — verifies lenos --version, temenos --version, Python version, and runner scripts are present in the built image
  • Smoke task verifier — runs solution.sh then pytest against the smoke task test suite; confirms the task package is intact and tests pass against the reference solution

No provider credentials required for Layer 1.
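
A binary smoke check of the kind listed in Layer 1 could look roughly like the sketch below. It is an assumption about the shape of such a check, not the workflow's actual script: the helper names are hypothetical, and only the lenos and temenos binary names and the --version flag come from this README.

```python
import shutil
import subprocess
from typing import Callable, Iterable, Optional


def missing_binaries(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the binaries that cannot be found on PATH.

    The lookup function is injectable so the check can be exercised
    without the real binaries installed.
    """
    return [name for name in names if which(name) is None]


def version_output(binary: str) -> str:
    """Run '<binary> --version' and return its trimmed stdout.

    Raises CalledProcessError if the binary exits non-zero.
    """
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    missing = missing_binaries(["lenos", "temenos"])
    if missing:
        raise SystemExit(f"missing binaries: {', '.join(missing)}")
    for name in ("lenos", "temenos"):
        print(name, version_output(name))
```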

Layer 2: Secret-gated (requires host Lenos config)

  • Agent benchmark — lenos run solves Terminal-Bench tasks with a real model
  • Reads provider credentials from the host Lenos config mounted read-only
  • Not part of the standard CI pipeline; run locally with make run-smoke

GHCR Image Tags

Pre-built images are published to GHCR on every push to main and on tags:

ghcr.io/tta-lab/agon-runner:latest      # main branch
ghcr.io/tta-lab/agon-runner:sha-<sha>   # immutable per-commit
ghcr.io/tta-lab/agon-runner:v1.0.0      # semver tag
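
The tag scheme above can be expressed as a mapping from a Git ref to image references. This sketch is illustrative only: the function is hypothetical, and it assumes that every published image receives both a sha-<sha> tag and one named tag, and that pull request refs publish nothing. The actual workflow may differ.

```python
def image_tags(
    git_ref: str,
    sha: str,
    repo: str = "ghcr.io/tta-lab/agon-runner",
) -> list[str]:
    """Map a Git ref to the image references that would be published.

    Assumption: main maps to 'latest', a 'v'-prefixed tag maps to itself,
    and any other ref (for example a pull request) publishes nothing.
    """
    if git_ref == "refs/heads/main":
        named = "latest"
    elif git_ref.startswith("refs/tags/v"):
        named = git_ref.removeprefix("refs/tags/")
    else:
        return []
    return [f"{repo}:sha-{sha}", f"{repo}:{named}"]


if __name__ == "__main__":
    for ref in ("refs/heads/main", "refs/tags/v1.0.0", "refs/pull/7/merge"):
        print(ref, image_tags(ref, "abc1234"))
```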

Pull the pre-built image instead of building locally:

docker pull ghcr.io/tta-lab/agon-runner:latest

Quickstart

Use make or ./scripts/*.sh as the primary local interface.

# 1. Set up Lenos on your host (if not already done)
lenos
# 2. Build the benchmark image
make build-image
# 3. Run the smoke task (reads your host Lenos config)
make run-smoke
# 4. Verify reference solution (no keys needed)
make test-solution
# 5. View results
make results
# 6. Interactive shell with Lenos config mounted
make shell
# 7. Format and lint
make fmt
make lint
# 8. Clean up
make clean

All commands delegate to scripts in scripts/. Run make help to see all targets.

Adding New Terminal-Bench Tasks

  1. Create a directory under agon_bench/tasks/<task-name>/
  2. Add the required files:
    tasks/<task-name>/
    ├── task.yaml              # Task description and config
    ├── Dockerfile             # Environment (extend agon-bench or custom)
    ├── docker-compose.yaml    # Container orchestration
    ├── run-tests.sh           # Test entrypoint
    ├── setup-uv-pytest.sh     # Install test deps
    ├── run-uv-pytest.sh       # Execute tests
    ├── solution.sh            # Reference solution
    └── tests/
        └── test_outputs.py    # Verification tests
    
  3. See Terminal-Bench task docs for the full format spec
  4. Tasks inherit the agon-bench base image by default (lenos + temenos + Python)
  5. Custom tasks needing different environments should follow the Terminal-Bench Docker guidelines
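
Before running a new task, it can help to confirm that the files from step 2 are all present. The following is a minimal sketch of such a check; the function name is hypothetical and the file list is copied from the layout above, not from the Terminal-Bench spec itself.

```python
import sys
from pathlib import Path

# Files listed in step 2 of this README, relative to the task directory.
REQUIRED_FILES = (
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "run-tests.sh",
    "setup-uv-pytest.sh",
    "run-uv-pytest.sh",
    "solution.sh",
    "tests/test_outputs.py",
)


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required files that do not exist under task_dir."""
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


if __name__ == "__main__":
    # Usage: python check_task.py agon_bench/tasks/<task-name>
    task_dir = Path(sys.argv[1])
    missing = missing_task_files(task_dir)
    if missing:
        raise SystemExit(f"{task_dir}: missing {', '.join(missing)}")
    print(f"{task_dir}: all required files present")
```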

Next Expansion Path

  • More tasks: Add tasks from Terminal-Bench categories — file manipulation, git operations, system administration, data processing
  • Batch runner: Run multiple tasks in sequence and aggregate JSONL results
  • Model comparison: Run the same task set across different models and compare pass rates
  • CI integration: Run benchmark as a GitHub Actions workflow on Lenos releases for regression detection
  • Leaderboard: Publish benchmark results to track Lenos improvements over time
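
The batch runner and model comparison items both come down to aggregating JSONL results. A rough sketch of that aggregation follows. The record fields "model" and "passed" are assumptions made for illustration; the real schema is documented in agon_bench/STORAGE.md and may use different names.

```python
import json
from collections import defaultdict
from typing import Iterable


def pass_rates(jsonl_lines: Iterable[str]) -> dict[str, float]:
    """Compute the pass rate per model from JSONL result lines.

    Assumes each record has a 'model' string and a boolean 'passed'
    (hypothetical field names). Blank lines are skipped.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        entry = counts[record["model"]]
        if record["passed"]:
            entry[0] += 1
        entry[1] += 1
    return {model: passed / total for model, (passed, total) in counts.items()}


if __name__ == "__main__":
    sample = [
        '{"model": "model-a", "task": "smoke", "passed": true}',
        '{"model": "model-a", "task": "other", "passed": false}',
        '{"model": "model-b", "task": "smoke", "passed": true}',
    ]
    for model, rate in sorted(pass_rates(sample).items()):
        print(f"{model}: {rate:.0%}")
```

Because the function accepts any iterable of lines, it can be fed an open file handle for a single results file or a chain of handles across several runs.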
