v0.1.5 — Orchestration reliability & right-size triage
Hardening derived from real Codex↔Claude orchestration failures, each cross-critiqued by Codex and self-reviewed via agent-arena itself.
Right-size the arena (fix mode under-triage)
Principle #8 changed from "use the lightest arena" to "right-size the arena." The Quick Decision Gate now has bidirectional triage with explicit escalation triggers — persistent/irreversible side effects, structure/contract/policy redesign, genuinely interdependent decisions, repeating a past mistake, output-becomes-a-durable-contract — bounded by durability + external consumers (with a negative example) to avoid flipping into over-triage. Common Mistake #6 reworded to cover both over- and under-triage.
Read/analyze separation (reduce error_max_turns)
For bounded critique, Codex extracts material and Claude analyzes with no tools; Read,Glob,Grep + turn budget is reserved for genuine self-discovery. New context-budget protocol: feed raw evidence (paths, line numbers, omission notes), never Codex's conclusions, to protect Claude's independent first pass.
Stop over-redacting evidence
Task-relevant artifacts (experiment runs, media, predictions, metrics, outputs, dashboards) are evidence, not noise — excluding them forces inference from code instead of verifying real output.
Timing, timeouts, observability
Cross-agent calls take minutes in both directions (measured: single-turn no-tools ~6s; multi-turn repo review 2–5 min). A silent --output-format json run is not a hang. Set timeouts to match --max-turns (5–10 min, not 1 min), prefer stream-json to watch progress, record duration_ms/num_turns. Added a preflight runbook + failure classification.
User-facing failure handling
Arena Limitations template gains Failure type and Retry recommendation; failures must surface a retry decision (retry once, no loop; then stop/narrow/retry), never silently swallowed.
Full list: CHANGELOG.md