Release v0.1.5 — Orchestration reliability & right-size triage · zhjai/agent-arena

Hardening derived from real Codex↔Claude orchestration failures, each cross-critiqued by Codex and self-reviewed via agent-arena itself.

Right-size the arena (fix mode under-triage)

Principle #8 changed from "use the lightest arena" to "right-size the arena." The Quick Decision Gate now has bidirectional triage with explicit escalation triggers — persistent/irreversible side effects, structure/contract/policy redesign, genuinely interdependent decisions, repeating a past mistake, output-becomes-a-durable-contract — bounded by durability + external consumers (with a negative example) to avoid flipping into over-triage. Common Mistake #6 reworded to cover both over- and under-triage.

Read/analyze separation (reduce error_max_turns)

For bounded critique, Codex extracts material and Claude analyzes with no tools; Read,Glob,Grep + turn budget is reserved for genuine self-discovery. New context-budget protocol: feed raw evidence (paths, line numbers, omission notes), never Codex's conclusions, to protect Claude's independent first pass.

Stop over-redacting evidence

Task-relevant artifacts (experiment runs, media, predictions, metrics, outputs, dashboards) are evidence, not noise — excluding them forces inference from code instead of verifying real output.

Timing, timeouts, observability

Cross-agent calls take minutes in both directions (measured: single-turn no-tools ~6s; multi-turn repo review 2–5 min). A silent --output-format json run is not a hang. Set timeouts to match --max-turns (5–10 min, not 1 min), prefer stream-json to watch progress, record duration_ms/num_turns. Added a preflight runbook + failure classification.

User-facing failure handling

Arena Limitations template gains Failure type and Retry recommendation; failures must surface a retry decision (retry once, no loop; then stop/narrow/retry), never silently swallowed.

Full list: CHANGELOG.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.1.5 — Orchestration reliability & right-size triage

Choose a tag to compare

Sorry, something went wrong.