
feat(benchmark): add TAU-2 Memory V2 eval runner #2003

Merged
yangxinxin-7 merged 17 commits into volcengine:main from huangruiteng:feat/tau2-benchmark
May 13, 2026

Conversation

@huangruiteng
Contributor

@huangruiteng huangruiteng commented May 12, 2026

Background

This PR comes from the local Agent Harness TAU-2 memory experiments, where we tested no-memory, OpenViking Memory V2, pre-write recall, and fine-grained custom memory variants.

The initial benchmark implementation was provided by @yangxinxin-7, and the Memory V2 eval scope was discussed together.

The clearest current result is under the reasoning-high setting:

| Route | retail avg | airline avg | task-weighted total |
| --- | --- | --- | --- |
| no memory | 0.83750 | 0.72500 | 0.80000 |
| OpenViking Memory V2, experience-only | 0.75000 | 0.77500 | 0.75833 |
| fine-grained custom memory + pre-write/category/scope | 0.85000 | 0.80000 | 0.83333 |

The local conclusion is that fine-grained memory plus recall at well-chosen decision nodes is promising. For this PR, we are not yet moving the custom memory / trajectory-view retrofit into OpenViking. The goal is to unblock downstream OV users with a clean TAU-2 Memory V2 evaluation path first; the custom-memory trajectory prompt changes can follow as a separate PR once this benchmark path is usable.

Summary

Adds an OpenViking-style TAU-2 benchmark entry point under benchmark/tau2/ for running Memory V2 evals end to end:

  • train TAU-2 sessions and commit them into OpenViking Memory V2;
  • evaluate with experience-memory recall at the first user turn;
  • optionally recall again before write-like tool calls (memory_v2_prewrite);
  • write per-cell results, corpus manifests, retrieval traces, and scoreboards.

The default benchmark config runs retail + airline over two Memory V2 strategies with 8 repeats each:

| Strategy | Memory path | Retrieval timing |
| --- | --- | --- |
| memory_v2_experience_only | OpenViking Memory V2 experiences | first user turn |
| memory_v2_prewrite | same corpus as above | first user turn + before write-like tool calls |
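The timing difference between the two strategies can be sketched as a small dispatch function. This is a minimal illustration, not the PR's actual API: `Event`, `should_recall`, and the write-like prefix list are all assumed names.

```python
from dataclasses import dataclass

# Hypothetical event model; field names are illustrative, not the runner's schema.
@dataclass
class Event:
    kind: str            # "first_user_turn" or "tool_call"
    tool_name: str = ""  # set for tool_call events

# Tool-name prefixes treated as "write-like" for pre-write recall (assumed list).
WRITE_LIKE_PREFIXES = ("update_", "create_", "cancel_", "modify_")

def should_recall(strategy: str, event: Event) -> bool:
    """Decide whether to run memory retrieval for this event."""
    if event.kind == "first_user_turn":
        # Both strategies recall at the first user turn.
        return strategy in ("memory_v2_experience_only", "memory_v2_prewrite")
    if event.kind == "tool_call" and strategy == "memory_v2_prewrite":
        # Only the prewrite strategy also recalls before write-like tools.
        return event.tool_name.startswith(WRITE_LIKE_PREFIXES)
    return False
```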

Changes

  • Add TAU-2 orchestration scripts and configs under benchmark/tau2/.
  • Keep TAU-2 as an external checkout / CLI dependency via TAU2_REPO / TAU2_CLI; no dataset or generated artifacts are vendored.
  • Add an optional confirmation-aware user simulator patch helper for benchmark-quality runs. config/official.yaml keeps the upstream simulator path available.
  • Add the Memory V2 runner for train → commit → search/read → injected eval.
  • Add corpus probes and runtime retrieval traces so empty or mis-scoped memory runs are visible in artifacts.

Notes

  • This PR is scoped to Memory V2 eval plumbing plus pre-write recall. It intentionally does not include category rerank, custom procedure templates, or trajectory prompt changes.
  • The PR smoke result below validates the OpenViking TAU-2 path, not a leaderboard claim. Comparable effect claims still require a frozen/full corpus, same-protocol baseline, and full retail+airline repeats.
  • A small Agent Harness comparison was checked for sanity: retail task5 reaches reward=1.0 / db_match=1.0 in both this PR smoke and historical Harness pre-write runs, but the corpora and trace semantics differ, so this is only a smoke-level consistency check.

Validation

  • python3 -m compileall -q benchmark/tau2/scripts && git diff --check
  • plan-only default config: 32 cells, seeds 300..307 for each domain/strategy pair
  • retail task5 Memory V2 prewrite smoke:
    • avg_reward = 1.0
    • db_match_rate = 1.0
    • corpus probe: non-empty search/read result
    • retrieval trace: first-user recall and before-write recall both executed and injected memory
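The 32-cell plan-only grid (2 domains × 2 strategies × 8 seeds) can be reproduced with a small enumeration; the dict keys here are illustrative, not the runner's actual cell schema.

```python
from itertools import product

DOMAINS = ("retail", "airline")
STRATEGIES = ("memory_v2_experience_only", "memory_v2_prewrite")
SEEDS = range(300, 308)  # seeds 300..307, i.e. 8 repeats per domain/strategy pair

def plan_cells():
    """Enumerate the default benchmark grid: 2 x 2 x 8 = 32 cells."""
    return [
        {"domain": d, "strategy": s, "seed": seed}
        for d, s, seed in product(DOMAINS, STRATEGIES, SEEDS)
    ]

cells = plan_cells()
```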

@github-actions

github-actions Bot commented May 12, 2026

PR Reviewer Guide 🔍

(Review updated until commit 581594a)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Modifies external TAU2 repo files without backup

The _ensure_confirmation_aware_prompt function directly appends to TAU2 user simulator prompt files without creating backups or obtaining explicit user consent. This can overwrite customizations in the external TAU2 checkout.

from pathlib import Path

def _ensure_confirmation_aware_prompt(repo: Path) -> bool:
    patched = False
    for path in _prompt_paths(repo):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if _has_confirmation_aware_prompt(text):
            continue  # already patched; skip
        path.write_text(text.rstrip() + CONFIRMATION_AWARE_APPENDIX + "\n", encoding="utf-8")
        patched = True
    return patched
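One way to address the backup concern above is to snapshot each prompt file before appending. This is a hedged sketch, not the PR's code: `patch_with_backup` and the appendix string are illustrative.

```python
from pathlib import Path

APPENDIX = "\n<!-- confirmation-aware appendix (illustrative) -->\n"

def patch_with_backup(path: Path, appendix: str = APPENDIX) -> bool:
    """Append an appendix to a prompt file, keeping a one-time .bak copy."""
    text = path.read_text(encoding="utf-8")
    if appendix.strip() in text:
        return False  # already patched; nothing to do
    backup = path.with_suffix(path.suffix + ".bak")
    if not backup.exists():
        backup.write_text(text, encoding="utf-8")  # preserve the original once
    path.write_text(text.rstrip() + appendix, encoding="utf-8")
    return True
```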
Broad exception handlers hide errors

The _probe_corpus and _retrieve functions use bare except Exception: blocks that swallow all exceptions, potentially hiding real errors in memory read operations.

except Exception:
    text = getattr(match, "abstract", "") or getattr(match, "overview", "") or ""
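A narrower pattern for the fallback above: catch only the attribute errors the fallback is meant to absorb, and log when nothing readable is found instead of silently returning empty text. The match object and field names here are illustrative, not the PR's actual types.

```python
import logging

logger = logging.getLogger("tau2.memory")

def extract_text(match) -> str:
    """Pull display text from a retrieval match, falling back gracefully."""
    try:
        return match.content  # preferred field (illustrative name)
    except AttributeError:
        # Expected shape mismatch: fall back to summary-style fields.
        text = getattr(match, "abstract", "") or getattr(match, "overview", "") or ""
        if not text:
            logger.warning("match %r has no readable text fields", match)
        return text
```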

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng huangruiteng changed the title benchmark: add TAU-2 eval scaffold feat(benchmark): add TAU-2 Memory V2 eval runner May 13, 2026
@huangruiteng huangruiteng marked this pull request as ready for review May 13, 2026 06:47
@github-actions

Persistent review updated to latest commit 581594a


@yangxinxin-7
Collaborator

A single cell failure during serial execution aborts the whole run

In run_eval.py:744, any single cell failure raises RuntimeError immediately; the remaining cells are never executed and no scoreboard is generated.

A full evaluation is 32 cells (2 domains × 2 strategies × 8 repeats). If TAU-2 hits a transient error at, say, cell 20, the results of the first 19 cells have already been written to cell_results/, but the remaining 12 are skipped outright and no scoreboard is produced.

Suggestion: collect failures and keep executing, then write a single scoreboard at the end that marks which cells failed.

@yangxinxin-7 yangxinxin-7 merged commit c9ebba0 into volcengine:main May 13, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 13, 2026
@huangruiteng
Contributor Author

Thanks for pointing this out; long runs like this do affect the usability of manual full evaluations.

The fail-fast default in this version is deliberately conservative: in a benchmark setting, if a cell fails due to a TAU-2, configuration, or environment issue, aborting immediately prevents a partial scoreboard from being misread as complete evidence.

I agree we can add an explicit fail-soft / continue-on-cell-error mode as a follow-up: keep running the remaining cells and still generate the scoreboard/summary, but clearly flag failed cells, missing cells, and overall validity. That suits manual long runs and debugging, while the default strict path stays fail-fast. Since this PR is already merged, I will track this as a follow-up eval-runner improvement rather than mixing it into the trajectory memory PR.
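The proposed continue-on-cell-error mode could look roughly like this; `run_cell` and the scoreboard shape are placeholders, not the runner's real API.

```python
def run_all_cells(cells, run_cell, fail_fast=True):
    """Run cells serially; in fail-soft mode, record failures and keep going."""
    results, failures = {}, {}
    for cell_id, cell in cells.items():
        try:
            results[cell_id] = run_cell(cell)
        except Exception as exc:  # deliberately broad at the loop boundary only
            if fail_fast:
                raise  # strict default: abort the whole run
            failures[cell_id] = repr(exc)
    # In fail-soft mode the scoreboard is always written, with validity flagged.
    return {
        "results": results,
        "failed_cells": sorted(failures),
        "complete": not failures,
    }
```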

ZaynJarvis pushed a commit that referenced this pull request May 13, 2026
* benchmark: add tau2 eval scaffold

* benchmark: gate pending tau2 memory adapter

* benchmark: use litellm provider model default

* benchmark: fold preflight into tau2 runner

* benchmark: document tau2 dependency setup

* benchmark: simplify tau2 simulator patch

* benchmark: keep simulator patch prompt clean

* benchmark: clarify simulator patch config

* benchmark: clarify tau2 adapter boundary

* benchmark: wire tau2 memory v2 eval

* benchmark: harden tau2 memory agent tool calls

* benchmark: tolerate empty tau2 assistant responses

* benchmark: normalize tau2 llm environment

* benchmark: add tau2 memory prewrite strategy

* benchmark: support current tau2 runner api

* benchmark: align tau2 memory prewrite parity

* benchmark: make tau2 eval traces safer

---------

Co-authored-by: huangruiteng <huangruiteng@bytedance.com>

Projects

Status: Done


2 participants