feat(benchmark): add TAU-2 Memory V2 eval runner #2003
Conversation
PR Reviewer Guide 🔍 (review updated until commit 581594a)

PR Code Suggestions ✨: No code suggestions found for the PR.
With serial execution, a single cell failure terminates the entire run

A full evaluation is 32 cells (2 domains × 2 strategies × 8 repeats). If TAU-2 intermittently errors at, say, the 20th cell, the results of the first 19 cells have already been written out, yet the whole run still aborts. Suggestion: collect the failure instead, keep executing the remaining cells, write the scoreboard once at the end, and flag in it which cells failed.

Thanks for pointing this out; this long-run scenario does affect the usability of manual full evaluations. The fail-fast default in this version is deliberately conservative: in a benchmark setting, if a cell fails due to TAU-2, configuration, or runtime-environment issues, aborting immediately prevents a partial scoreboard from being misread as complete evidence. I agree we can follow up with an explicit fail-soft / continue-on-cell-error mode: keep running the remaining cells, still generate the scoreboard/summary, but clearly mark the failed cells, missing cells, and overall validity. That suits manual long runs and troubleshooting, while the default strict path stays fail-fast. Since this PR is already merged, I will track this as a follow-up eval-runner improvement rather than mixing it into the trajectory memory PR.
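The fail-soft mode described above could look roughly like the following. This is a minimal sketch, not the actual runner API: `run_cell`, the scoreboard field names, and the grid constants are illustrative assumptions based on the 2 × 2 × 8 = 32-cell layout mentioned in the thread.

```python
import itertools
import traceback

# Assumed grid from the discussion: 2 domains x 2 strategies x 8 repeats = 32 cells.
DOMAINS = ["retail", "airline"]
STRATEGIES = ["memory_v2_experience_only", "memory_v2_prewrite"]
REPEATS = 8


def run_cell(domain, strategy, repeat):
    # Placeholder for the real TAU-2 cell execution; returns a per-cell result.
    return {"reward": 1.0, "db_match": 1.0}


def run_grid(fail_fast=False):
    results, failures = {}, {}
    for domain, strategy, repeat in itertools.product(DOMAINS, STRATEGIES, range(REPEATS)):
        key = f"{domain}/{strategy}/{repeat}"
        try:
            results[key] = run_cell(domain, strategy, repeat)
        except Exception as exc:
            if fail_fast:
                raise  # current default: abort the whole run on the first bad cell
            # fail-soft: record the error and keep going
            failures[key] = "".join(
                traceback.format_exception_only(type(exc), exc)
            ).strip()
    # Write one scoreboard at the end, with failed cells and validity flagged
    # so a partial run cannot be misread as complete evidence.
    return {
        "results": results,
        "failed_cells": sorted(failures),
        "complete": not failures,
    }
```

With `fail_fast=True` the loop reproduces the current strict behavior; with the default `fail_fast=False` it matches the suggested continue-on-cell-error mode.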
* benchmark: add tau2 eval scaffold
* benchmark: gate pending tau2 memory adapter
* benchmark: use litellm provider model default
* benchmark: fold preflight into tau2 runner
* benchmark: document tau2 dependency setup
* benchmark: simplify tau2 simulator patch
* benchmark: keep simulator patch prompt clean
* benchmark: clarify simulator patch config
* benchmark: clarify tau2 adapter boundary
* benchmark: wire tau2 memory v2 eval
* benchmark: harden tau2 memory agent tool calls
* benchmark: tolerate empty tau2 assistant responses
* benchmark: normalize tau2 llm environment
* benchmark: add tau2 memory prewrite strategy
* benchmark: support current tau2 runner api
* benchmark: align tau2 memory prewrite parity
* benchmark: make tau2 eval traces safer

Co-authored-by: huangruiteng <huangruiteng@bytedance.com>
Background
This PR comes from the local Agent Harness TAU-2 memory experiments, where we tested no-memory, OpenViking Memory V2, pre-write recall, and fine-grained custom memory variants.
The initial benchmark implementation was provided by @yangxinxin-7, and the Memory V2 eval scope was discussed together.
The clearest current result is under the reasoning-high setting:
The local conclusion is that fine-grained memory, combined with recall triggered at better-chosen decision nodes, is promising. This PR does not yet move the custom memory / trajectory-view retrofit into OpenViking. The goal is to first unblock downstream OV users with a clean TAU-2 Memory V2 evaluation path; the custom-memory trajectory prompt changes can follow as a separate PR once this benchmark path is usable.
Summary
Adds an OpenViking-style TAU-2 benchmark entry point under `benchmark/tau2/` for running Memory V2 evals end to end (including the `memory_v2_prewrite` strategy).

The default benchmark config runs `retail + airline` over two Memory V2 strategies with 8 repeats each:

* `memory_v2_experience_only`
* `memory_v2_prewrite`

Changes
* `benchmark/tau2/`
* `TAU2_REPO` / `TAU2_CLI`; no dataset or generated artifacts are vendored.
* `config/official.yaml` keeps the upstream simulator path available.

Notes
* `reward=1.0 / db_match=1.0` in both this PR's smoke run and historical Harness pre-write runs, but the corpora and trace semantics differ, so this is only a smoke-level consistency check.

Validation
* `python3 -m compileall -q benchmark/tau2/scripts && git diff --check`
* Tasks `300..307` for each domain/strategy pair: `avg_reward = 1.0`, `db_match_rate = 1.0`
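For clarity on what the two reported numbers mean, here is a minimal sketch of how such scoreboard metrics could be aggregated from per-task records. The field names (`reward`, `db_match`) and the helper itself are assumptions for illustration, not the runner's actual code.

```python
def summarize(records):
    # records: list of per-task dicts, e.g. {"reward": 1.0, "db_match": 1.0}
    n = len(records)
    avg_reward = sum(r["reward"] for r in records) / n
    # fraction of tasks whose final DB state matched exactly
    db_match_rate = sum(1 for r in records if r["db_match"] == 1.0) / n
    return {"avg_reward": avg_reward, "db_match_rate": db_match_rate}


# A clean smoke run over 8 tasks (e.g. one domain/strategy pair) yields 1.0 for both.
tasks = [{"reward": 1.0, "db_match": 1.0} for _ in range(8)]
print(summarize(tasks))
```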