
feat(benchmark): add TAU-2 Memory V2 eval runner #2003

Merged
yangxinxin-7 merged 17 commits into volcengine:main from huangruiteng:feat/tau2-benchmark
May 13, 2026

Conversation

@huangruiteng
Contributor

@huangruiteng huangruiteng commented May 12, 2026

Background

This PR comes from the local Agent Harness TAU-2 memory experiments, where we tested no-memory, OpenViking Memory V2, pre-write recall, and fine-grained custom memory variants.

The initial benchmark implementation was provided by @yangxinxin-7, and the Memory V2 eval scope was discussed together.

The clearest current result is under the reasoning-high setting:

| Route | retail avg | airline avg | task-weighted total |
| --- | --- | --- | --- |
| no memory | 0.83750 | 0.72500 | 0.80000 |
| OpenViking Memory V2, experience-only | 0.75000 | 0.77500 | 0.75833 |
| fine-grained custom memory + pre-write/category/scope | 0.85000 | 0.80000 | 0.83333 |

The local conclusion is that fine-grained memory plus recall at well-chosen decision nodes is promising. For this PR, we are not yet moving the custom memory / trajectory-view retrofit into OpenViking. The goal is to unblock downstream OV users with a clean TAU-2 Memory V2 evaluation path first; the custom-memory trajectory prompt changes can follow as a separate PR once this benchmark path is usable.

Summary

Adds an OpenViking-style TAU-2 benchmark entry point under benchmark/tau2/ for running Memory V2 evals end to end:

  • train TAU-2 sessions and commit them into OpenViking Memory V2;
  • evaluate with experience-memory recall at the first user turn;
  • optionally recall again before write-like tool calls (memory_v2_prewrite);
  • write per-cell results, corpus manifests, retrieval traces, and scoreboards.

The default benchmark config runs retail + airline over two Memory V2 strategies with 8 repeats each:

| Strategy | Memory path | Retrieval timing |
| --- | --- | --- |
| memory_v2_experience_only | OpenViking Memory V2 experiences | first user turn |
| memory_v2_prewrite | same corpus as above | first user turn + before write-like tool calls |
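The timing difference between the two strategies can be sketched as a small dispatch function. This is a minimal illustration, not the PR's actual API: `Event`, `should_recall`, and the write-like prefix list are all assumed names.

```python
from dataclasses import dataclass

# Hypothetical event model; field names are illustrative, not the runner's schema.
@dataclass
class Event:
    kind: str            # "first_user_turn" or "tool_call"
    tool_name: str = ""  # set for tool_call events

# Tool-name prefixes treated as "write-like" for pre-write recall (assumed list).
WRITE_LIKE_PREFIXES = ("update_", "create_", "cancel_", "modify_")

def should_recall(strategy: str, event: Event) -> bool:
    """Decide whether to run memory retrieval for this event."""
    if event.kind == "first_user_turn":
        # Both strategies recall at the first user turn.
        return strategy in ("memory_v2_experience_only", "memory_v2_prewrite")
    if event.kind == "tool_call" and strategy == "memory_v2_prewrite":
        # Only the prewrite strategy also recalls before write-like tools.
        return event.tool_name.startswith(WRITE_LIKE_PREFIXES)
    return False
```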

Changes

  • Add TAU-2 orchestration scripts and configs under benchmark/tau2/.
  • Keep TAU-2 as an external checkout / CLI dependency via TAU2_REPO / TAU2_CLI; no dataset or generated artifacts are vendored.
  • Add an optional confirmation-aware user simulator patch helper for benchmark-quality runs. config/official.yaml keeps the upstream simulator path available.
  • Add the Memory V2 runner for train → commit → search/read → injected eval.
  • Add corpus probes and runtime retrieval traces so empty or mis-scoped memory runs are visible in artifacts.

Notes

  • This PR is scoped to Memory V2 eval plumbing plus pre-write recall. It intentionally does not include category rerank, custom procedure templates, or trajectory prompt changes.
  • The PR smoke result below validates the OpenViking TAU-2 path, not a leaderboard claim. Comparable effect claims still require a frozen/full corpus, same-protocol baseline, and full retail+airline repeats.
  • A small Agent Harness comparison was checked for sanity: retail task5 reaches reward=1.0 / db_match=1.0 in both this PR smoke and historical Harness pre-write runs, but the corpora and trace semantics differ, so this is only a smoke-level consistency check.

Validation

  • python3 -m compileall -q benchmark/tau2/scripts && git diff --check
  • plan-only default config: 32 cells, seeds 300..307 for each domain/strategy pair
  • retail task5 Memory V2 prewrite smoke:
    • avg_reward = 1.0
    • db_match_rate = 1.0
    • corpus probe: non-empty search/read result
    • retrieval trace: first-user recall and before-write recall both executed and injected memory
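The 32-cell plan-only grid (2 domains × 2 strategies × 8 seeds) can be reproduced with a small enumeration; the dict keys here are illustrative, not the runner's actual cell schema.

```python
from itertools import product

DOMAINS = ("retail", "airline")
STRATEGIES = ("memory_v2_experience_only", "memory_v2_prewrite")
SEEDS = range(300, 308)  # seeds 300..307, i.e. 8 repeats per domain/strategy pair

def plan_cells():
    """Enumerate the default benchmark grid: 2 x 2 x 8 = 32 cells."""
    return [
        {"domain": d, "strategy": s, "seed": seed}
        for d, s, seed in product(DOMAINS, STRATEGIES, SEEDS)
    ]

cells = plan_cells()
```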

@github-actions

github-actions Bot commented May 12, 2026

PR Reviewer Guide 🔍

(Review updated until commit 581594a)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Modifies external TAU2 repo files without backup

The _ensure_confirmation_aware_prompt function directly appends to TAU2 user simulator prompt files without creating backups or obtaining explicit user consent. This can overwrite customizations in the external TAU2 checkout.

from pathlib import Path

def _ensure_confirmation_aware_prompt(repo: Path) -> bool:
    patched = False
    for path in _prompt_paths(repo):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if _has_confirmation_aware_prompt(text):
            continue  # already patched; skip
        path.write_text(text.rstrip() + CONFIRMATION_AWARE_APPENDIX + "\n", encoding="utf-8")
        patched = True
    return patched
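One way to address the backup concern above is to snapshot each prompt file before appending. This is a hedged sketch, not the PR's code: `patch_with_backup` and the appendix string are illustrative.

```python
from pathlib import Path

APPENDIX = "\n<!-- confirmation-aware appendix (illustrative) -->\n"

def patch_with_backup(path: Path, appendix: str = APPENDIX) -> bool:
    """Append an appendix to a prompt file, keeping a one-time .bak copy."""
    text = path.read_text(encoding="utf-8")
    if appendix.strip() in text:
        return False  # already patched; nothing to do
    backup = path.with_suffix(path.suffix + ".bak")
    if not backup.exists():
        backup.write_text(text, encoding="utf-8")  # preserve the original once
    path.write_text(text.rstrip() + appendix, encoding="utf-8")
    return True
```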
Broad exception handlers hide errors

The _probe_corpus and _retrieve functions use bare except Exception: blocks that swallow all exceptions, potentially hiding real errors in memory read operations.

except Exception:
    text = getattr(match, "abstract", "") or getattr(match, "overview", "") or ""
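A narrower pattern for the fallback above: catch only the attribute errors the fallback is meant to absorb, and log when nothing readable is found instead of silently returning empty text. The match object and field names here are illustrative, not the PR's actual types.

```python
import logging

logger = logging.getLogger("tau2.memory")

def extract_text(match) -> str:
    """Pull display text from a retrieval match, falling back gracefully."""
    try:
        return match.content  # preferred field (illustrative name)
    except AttributeError:
        # Expected shape mismatch: fall back to summary-style fields.
        text = getattr(match, "abstract", "") or getattr(match, "overview", "") or ""
        if not text:
            logger.warning("match %r has no readable text fields", match)
        return text
```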

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng huangruiteng changed the title benchmark: add TAU-2 eval scaffold feat(benchmark): add TAU-2 Memory V2 eval runner May 13, 2026
@huangruiteng huangruiteng marked this pull request as ready for review May 13, 2026 06:47
@github-actions

Persistent review updated to latest commit 581594a


@yangxinxin-7
Collaborator

A single cell failure during serial execution aborts the whole run

In run_eval.py:744, any single cell failure raises RuntimeError immediately; the remaining cells are never executed and no scoreboard is generated.

A full evaluation is 32 cells (2 domains × 2 strategies × 8 repeats). If TAU-2 hits a transient error at, say, cell 20, the results of the first 19 cells have already been written to cell_results/, but the remaining 12 are skipped outright and no scoreboard is produced.

Suggestion: collect failures and keep executing, then write a single scoreboard at the end that marks which cells failed.

@yangxinxin-7 yangxinxin-7 merged commit c9ebba0 into volcengine:main May 13, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 13, 2026
@huangruiteng
Contributor Author

Thanks for pointing this out; long runs like this do affect the usability of manual full evaluations.

The fail-fast default in this version is deliberately conservative: in a benchmark setting, if a cell fails due to a TAU-2, configuration, or environment issue, aborting immediately prevents a partial scoreboard from being misread as complete evidence.

I agree we can add an explicit fail-soft / continue-on-cell-error mode as a follow-up: keep running the remaining cells and still generate the scoreboard/summary, but clearly flag failed cells, missing cells, and overall validity. That suits manual long runs and debugging, while the default strict path stays fail-fast. Since this PR is already merged, I will track this as a follow-up eval-runner improvement rather than mixing it into the trajectory memory PR.
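The proposed continue-on-cell-error mode could look roughly like this; `run_cell` and the scoreboard shape are placeholders, not the runner's real API.

```python
def run_all_cells(cells, run_cell, fail_fast=True):
    """Run cells serially; in fail-soft mode, record failures and keep going."""
    results, failures = {}, {}
    for cell_id, cell in cells.items():
        try:
            results[cell_id] = run_cell(cell)
        except Exception as exc:  # deliberately broad at the loop boundary only
            if fail_fast:
                raise  # strict default: abort the whole run
            failures[cell_id] = repr(exc)
    # In fail-soft mode the scoreboard is always written, with validity flagged.
    return {
        "results": results,
        "failed_cells": sorted(failures),
        "complete": not failures,
    }
```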

ZaynJarvis pushed a commit that referenced this pull request May 13, 2026
* benchmark: add tau2 eval scaffold

* benchmark: gate pending tau2 memory adapter

* benchmark: use litellm provider model default

* benchmark: fold preflight into tau2 runner

* benchmark: document tau2 dependency setup

* benchmark: simplify tau2 simulator patch

* benchmark: keep simulator patch prompt clean

* benchmark: clarify simulator patch config

* benchmark: clarify tau2 adapter boundary

* benchmark: wire tau2 memory v2 eval

* benchmark: harden tau2 memory agent tool calls

* benchmark: tolerate empty tau2 assistant responses

* benchmark: normalize tau2 llm environment

* benchmark: add tau2 memory prewrite strategy

* benchmark: support current tau2 runner api

* benchmark: align tau2 memory prewrite parity

* benchmark: make tau2 eval traces safer

---------

Co-authored-by: huangruiteng <huangruiteng@bytedance.com>

Projects

Status: Done


2 participants