Skip to content

Releases: zhjai/agent-arena

v0.1.9 — review the raw file, not a summary

01 Jun 16:39

Choose a tag to compare

Adds Common Mistake #10: reviewing from a summary instead of the raw file. Format / structure / spec-compliance problems (frontmatter, schema, config, exact YAML) live in the precise text — a summary-fed reviewer is blind to them. Give the reviewer the actual file for those checks.

This captures the v0.1.8 lesson: the skill's own frontmatter stayed off-spec through many summary-fed Codex reviews until one file-reading review caught it. (v0.1.9 tag was previously used by a withdrawn timeout experiment, since reverted; reused here.)

v0.1.8 — spec-compliant frontmatter

01 Jun 16:33

Choose a tag to compare

Make the SKILL.md frontmatter conform to the agentskills spec: version/author moved under metadata (version a quoted string), tags/related_skills as comma-separated strings, dropped the nested metadata.hermes block. The prior frontmatter was off-spec (parsers tolerated it). Caught by a cross-environment Codex review reading the actual file against the spec — earlier summary-fed reviews never saw the frontmatter. (Unrelated to the withdrawn 0.1.8/0.1.9 timeout experiments, which were reverted; version reused.)

v0.1.7 — model-unavailable failure type

01 Jun 12:41

Choose a tag to compare

Reported from real use: specifying gpt-5.2-codex was rejected by the Codex account; re-running with the default model succeeded.

  • Add model-unavailable to the failure taxonomy (Preflight runbook + Arena Limitations template). It is distinct from auth (authentication is fine) and refusal (the model declined to answer).
  • Guidance: don't pin a specific model version unless the account is confirmed to support it; prefer the default model. The standard retry for model-unavailable is to drop the override and re-run with the default model.

This is exactly the v0.1.5 Failure-type / Retry-recommendation mechanism working in the field — the report used that format and the one-variable-change retry succeeded; v0.1.7 just fills the taxonomy gap.

v0.1.6 — groundcheck fact-gate interop

30 May 20:03

Choose a tag to compare

Wires in the new companion skill groundcheck — single-agent, evidence-grounded fact-checking that treats hallucination (where multi-agent debate is weak, because a panel can reinforce a shared hallucination).

What's new

  • Pre-debate fact-gate: after each agent's independent answer, run groundcheck; any refuted claim is sent back to its source agent (with the refuting evidence, not a conclusion) before cross-critique. Re-check new/changed claims post-debate.
  • Framing: agent-arena (multi-agent debate → overconfidence) + groundcheck (single-agent verification → hallucination) = two depths of one verification stack.
  • Added groundcheck to related_skills; companion section in both READMEs; synced zh README badge.

See the groundcheck fact-gate contract for send-back and anti-loop rules.

v0.1.5 — Orchestration reliability & right-size triage

30 May 10:59

Choose a tag to compare

Hardening derived from real Codex↔Claude orchestration failures, each cross-critiqued by Codex and self-reviewed via agent-arena itself.

Right-size the arena (fix mode under-triage)

Principle #8 changed from "use the lightest arena" to "right-size the arena." The Quick Decision Gate now has bidirectional triage with explicit escalation triggers — persistent/irreversible side effects, structure/contract/policy redesign, genuinely interdependent decisions, repeating a past mistake, output-becomes-a-durable-contract — bounded by durability + external consumers (with a negative example) to avoid flipping into over-triage. Common Mistake #6 reworded to cover both over- and under-triage.

Read/analyze separation (reduce error_max_turns)

For bounded critique, Codex extracts material and Claude analyzes with no tools; Read,Glob,Grep + turn budget is reserved for genuine self-discovery. New context-budget protocol: feed raw evidence (paths, line numbers, omission notes), never Codex's conclusions, to protect Claude's independent first pass.

Stop over-redacting evidence

Task-relevant artifacts (experiment runs, media, predictions, metrics, outputs, dashboards) are evidence, not noise — excluding them forces inference from code instead of verifying real output.

Timing, timeouts, observability

Cross-agent calls take minutes in both directions (measured: single-turn no-tools ~6s; multi-turn repo review 2–5 min). A silent --output-format json run is not a hang. Set timeouts to match --max-turns (5–10 min, not 1 min), prefer stream-json to watch progress, record duration_ms/num_turns. Added a preflight runbook + failure classification.

User-facing failure handling

Arena Limitations template gains Failure type and Retry recommendation; failures must surface a retry decision (retry once, no loop; then stop/narrow/retry), never silently swallowed.

Full list: CHANGELOG.md

v0.1.4 — Alternative model backends & one-line install

29 May 03:59

Choose a tag to compare

Highlights

🔌 Alternative model backends

Agent Arena now explicitly supports running on non-Anthropic models. A different model family is treated as a genuinely heterogeneous arena participant:

  • Claude Code (Anthropic protocol) → connect GLM / DeepSeek / Qwen / Kimi / Doubao via an Anthropic-compatible proxy (One API, LiteLLM, etc.)
  • Codex (OpenAI protocol) → connect those same models directly via OpenAI-compatible API
  • New ## Alternative Model Backends section with supported configurations, a connection-method table, task-packet declaration format, and a degradation rule.

📦 One-line install

npx skills add zhjai/agent-arena -g -a claude-code

Installs via the vercel-labs skills CLI — works with Claude Code, Codex, Cursor, OpenCode, and 50+ agents. Verified the CLI auto-detects both skills from the repo layout.

🎨 New banner

Replaced the minimal SVG with an infographic-style banner showing the full 6-step protocol flow, debate-arena visual, and agent ecosystem.


See CHANGELOG.md for the full list. Chinese README: README.zh.md.

agent-arena v0.1.1

27 May 09:26

Choose a tag to compare

Changes

  • Clarifies that Codex should discover Claude Code through the external claude CLI before falling back to same-model subagents.
  • Adds context-minimization-without-blindness guidance: external agents may read relevant source/docs/tests within the approved repo scope.
  • Adds multi-round cross-critique requirements for non-trivial arenas instead of one-shot heterogeneous review.
  • Adds collaborative_design mode so Codex and Claude Code can co-design architectures, interfaces, experiments, and implementation plans instead of only acting as reviewer/checker.
  • Documents sensitive-scope exclusions for secrets, datasets, generated results, private logs, and unrelated directories.

v0.1.0 preview release

22 May 06:09

Choose a tag to compare

Preview release of Agent Arena and Deliberative Analysis portable skills.\n\nIncludes:\n- corrected Codex install path using CODEX_HOME / ~/.codex/skills\n- explicit protocol-vs-orchestrator capability boundary\n- safety/privacy and sensitive-context guidance\n- consistent mode list and degraded-mode examples\n- portable LICENSE files and OpenAI/Codex skill metadata\n- SECURITY.md and CHANGELOG.md