Releases: zhjai/agent-arena
v0.1.9 — review the raw file, not a summary
Adds Common Mistake #10: reviewing from a summary instead of the raw file. Format / structure / spec-compliance problems (frontmatter, schema, config, exact YAML) live in the precise text — a summary-fed reviewer is blind to them. Give the reviewer the actual file for those checks.
This captures the v0.1.8 lesson: the skill's own frontmatter stayed off-spec through many summary-fed Codex reviews until one file-reading review caught it. (v0.1.9 tag was previously used by a withdrawn timeout experiment, since reverted; reused here.)
v0.1.8 — spec-compliant frontmatter
Make the SKILL.md frontmatter conform to the agentskills spec: version/author moved under metadata (version a quoted string), tags/related_skills as comma-separated strings, dropped the nested metadata.hermes block. The prior frontmatter was off-spec (parsers tolerated it). Caught by a cross-environment Codex review reading the actual file against the spec — earlier summary-fed reviews never saw the frontmatter. (Unrelated to the withdrawn 0.1.8/0.1.9 timeout experiments, which were reverted; version reused.)
v0.1.7 — model-unavailable failure type
Reported from real use: specifying gpt-5.2-codex was rejected by the Codex account; re-running with the default model succeeded.
- Add
model-unavailableto the failure taxonomy (Preflight runbook + Arena Limitations template). It is distinct fromauth(authentication is fine) andrefusal(the model declined to answer). - Guidance: don't pin a specific model version unless the account is confirmed to support it; prefer the default model. The standard retry for
model-unavailableis to drop the override and re-run with the default model.
This is exactly the v0.1.5 Failure-type / Retry-recommendation mechanism working in the field — the report used that format and the one-variable-change retry succeeded; v0.1.7 just fills the taxonomy gap.
v0.1.6 — groundcheck fact-gate interop
Wires in the new companion skill groundcheck — single-agent, evidence-grounded fact-checking that treats hallucination (where multi-agent debate is weak, because a panel can reinforce a shared hallucination).
What's new
- Pre-debate fact-gate: after each agent's independent answer, run
groundcheck; anyrefutedclaim is sent back to its source agent (with the refuting evidence, not a conclusion) before cross-critique. Re-check new/changed claims post-debate. - Framing: agent-arena (multi-agent debate → overconfidence) + groundcheck (single-agent verification → hallucination) = two depths of one verification stack.
- Added groundcheck to
related_skills; companion section in both READMEs; synced zh README badge.
See the groundcheck fact-gate contract for send-back and anti-loop rules.
v0.1.5 — Orchestration reliability & right-size triage
Hardening derived from real Codex↔Claude orchestration failures, each cross-critiqued by Codex and self-reviewed via agent-arena itself.
Right-size the arena (fix mode under-triage)
Principle #8 changed from "use the lightest arena" to "right-size the arena." The Quick Decision Gate now has bidirectional triage with explicit escalation triggers — persistent/irreversible side effects, structure/contract/policy redesign, genuinely interdependent decisions, repeating a past mistake, output-becomes-a-durable-contract — bounded by durability + external consumers (with a negative example) to avoid flipping into over-triage. Common Mistake #6 reworded to cover both over- and under-triage.
Read/analyze separation (reduce error_max_turns)
For bounded critique, Codex extracts material and Claude analyzes with no tools; Read,Glob,Grep + turn budget is reserved for genuine self-discovery. New context-budget protocol: feed raw evidence (paths, line numbers, omission notes), never Codex's conclusions, to protect Claude's independent first pass.
Stop over-redacting evidence
Task-relevant artifacts (experiment runs, media, predictions, metrics, outputs, dashboards) are evidence, not noise — excluding them forces inference from code instead of verifying real output.
Timing, timeouts, observability
Cross-agent calls take minutes in both directions (measured: single-turn no-tools ~6s; multi-turn repo review 2–5 min). A silent --output-format json run is not a hang. Set timeouts to match --max-turns (5–10 min, not 1 min), prefer stream-json to watch progress, record duration_ms/num_turns. Added a preflight runbook + failure classification.
User-facing failure handling
Arena Limitations template gains Failure type and Retry recommendation; failures must surface a retry decision (retry once, no loop; then stop/narrow/retry), never silently swallowed.
Full list: CHANGELOG.md
v0.1.4 — Alternative model backends & one-line install
Highlights
🔌 Alternative model backends
Agent Arena now explicitly supports running on non-Anthropic models. A different model family is treated as a genuinely heterogeneous arena participant:
- Claude Code (Anthropic protocol) → connect GLM / DeepSeek / Qwen / Kimi / Doubao via an Anthropic-compatible proxy (One API, LiteLLM, etc.)
- Codex (OpenAI protocol) → connect those same models directly via OpenAI-compatible API
- New
## Alternative Model Backendssection with supported configurations, a connection-method table, task-packet declaration format, and a degradation rule.
📦 One-line install
npx skills add zhjai/agent-arena -g -a claude-codeInstalls via the vercel-labs skills CLI — works with Claude Code, Codex, Cursor, OpenCode, and 50+ agents. Verified the CLI auto-detects both skills from the repo layout.
🎨 New banner
Replaced the minimal SVG with an infographic-style banner showing the full 6-step protocol flow, debate-arena visual, and agent ecosystem.
See CHANGELOG.md for the full list. Chinese README: README.zh.md.
agent-arena v0.1.1
Changes
- Clarifies that Codex should discover Claude Code through the external
claudeCLI before falling back to same-model subagents. - Adds context-minimization-without-blindness guidance: external agents may read relevant source/docs/tests within the approved repo scope.
- Adds multi-round cross-critique requirements for non-trivial arenas instead of one-shot heterogeneous review.
- Adds
collaborative_designmode so Codex and Claude Code can co-design architectures, interfaces, experiments, and implementation plans instead of only acting as reviewer/checker. - Documents sensitive-scope exclusions for secrets, datasets, generated results, private logs, and unrelated directories.
v0.1.0 preview release
Preview release of Agent Arena and Deliberative Analysis portable skills.\n\nIncludes:\n- corrected Codex install path using CODEX_HOME / ~/.codex/skills\n- explicit protocol-vs-orchestrator capability boundary\n- safety/privacy and sensitive-context guidance\n- consistent mode list and degraded-mode examples\n- portable LICENSE files and OpenAI/Codex skill metadata\n- SECURITY.md and CHANGELOG.md