bench(tau2): add generic scope fairness check by huangruiteng · Pull Request #2172 · volcengine/OpenViking

huangruiteng · 2026-05-21T10:30:15Z

Summary

This is a narrow follow-up to #2017. PR-B showed that TAU-2 trajectory memory can beat the no-memory baseline, but the headline read compared a no-scope no-memory baseline against a trajectory-memory treatment that used a domain-specific scope prompt. This PR removes the domain-specific scope prompts, routes TAU-2 scope treatments through a benchmark-neutral generic prompt, and adds the missing no-memory scope plumbing so the fair attribution read can use the same generic scope on both sides.

Changes:

- pass `scope_prompt_file` through the no-memory TAU-2 eval path, so no-memory can be measured under the same scope prompt as memory treatments
- add a generic advisory-memory scope guard that avoids retail/airline-specific business hints
- update TAU-2 trajectory/content-matrix configs to use the generic scope prompt and remove the retail/airline scope prompt files
- add a paired TAU-2 scope-fairness config with no-memory, generic-scope no-memory, and trajectory top4 first-user + pre-write under the same generic scope
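
As an illustration of the first item in the list above, the sketch below shows one way a runner could apply the same optional scope prompt to both the no-memory and the memory paths. The helper `build_system_prompt`, its parameters, and the prompt layout are hypothetical; only the `scope_prompt_file` option name comes from this PR, and the actual logic in `run_eval.py` may differ.

```python
from pathlib import Path
from typing import Optional, Sequence


def build_system_prompt(
    base_prompt: str,
    scope_prompt_file: Optional[str] = None,
    memories: Sequence[str] = (),
) -> str:
    """Hypothetical helper: assemble the system prompt for one eval cell.

    The scope prompt is appended whether or not memories are present, so a
    no-memory baseline and a memory treatment can share the same scope text.
    """
    parts = [base_prompt.strip()]
    if scope_prompt_file is not None:
        # Same file for every cell that declares a scope prompt.
        parts.append(Path(scope_prompt_file).read_text(encoding="utf-8").strip())
    if memories:
        bullet_list = "\n".join(f"- {m}" for m in memories)
        parts.append("Retrieved memories (advisory):\n" + bullet_list)
    return "\n\n".join(parts)
```

With this shape, the generic-scope no-memory cell and the trajectory-memory cell differ only in the `memories` argument, which is what makes the paired comparison a like-for-like read.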

Why

The original domain scope prompt is useful, but it can also help no-memory runs and is harder to defend as a benchmark-neutral memory protocol. Without a same-scope baseline, the headline PR-B delta can over-attribute some scope-prompt benefit to memory. The new generic-scope config makes the fair read explicit and reproducible without embedding retail/airline-specific business hints.

The important read is the attribution boundary:

| Read | No-memory scope | Treatment scope | No-memory avg reward | Trajectory memory avg reward | Delta |
| --- | --- | --- | --- | --- | --- |
| Original PR-B headline read | none | domain-specific | 0.80156 | 0.85313 | +0.05157 |
| Same domain-specific scope diagnostic | domain-specific | domain-specific | 0.81719 | 0.85313 | +0.03594 |
| Fair generic-scope read | generic | generic | 0.79844 | 0.84219 | +0.04375 |

So the headline result remains positive, but the benchmark-neutral fair read is the generic-scope delta (+0.04375). The same-domain-scope row is retained only as an attribution diagnostic explaining why the original mixed-scope read was too generous.
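
The deltas in the table can be re-derived from the reported average rewards. The short script below is only an arithmetic check; the dictionary keys are labels chosen here for illustration, and the reward values are copied from the table.

```python
# Average rewards copied from the table: (no-memory, trajectory memory).
reads = {
    "original_headline": (0.80156, 0.85313),
    "same_domain_scope": (0.81719, 0.85313),
    "fair_generic_scope": (0.79844, 0.84219),
}

# Delta = trajectory-memory reward minus no-memory reward, per read.
deltas = {name: round(mem - base, 5) for name, (base, mem) in reads.items()}

# Part of the headline delta that disappears once the baseline
# also receives the domain-specific scope prompt.
scope_share = round(deltas["original_headline"] - deltas["same_domain_scope"], 5)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.5f}")
print(f"scope_share_of_headline: {scope_share:+.5f}")
```

The last value, 0.01563, equals the gain the no-memory baseline gets from the domain-specific scope prompt alone (0.81719 - 0.80156), which is the amount the mixed-scope headline read credited to memory.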

Validation

Local checks:

python3 -m py_compile benchmark/tau2/scripts/run_eval.py
python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/prb_scope_fairness.yaml --run-id pr_scope_fairness_generic_only_preflight2 --domain retail --repeat-count 1 --num-tasks 1 --strategy-id no_memory_generic_scope --strict-preflight --plan-only
python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/trajectory.yaml --run-id pr_trajectory_generic_scope_plancheck --domain retail --repeat-count 1 --num-tasks 1 --strategy-id memory_v2_trajectory_prewrite_scope --strict-preflight --plan-only
Plan inspection confirmed that both the no-memory and trajectory scope cells now resolve to `benchmark/tau2/config/scope_prompts/generic_memory_scope.md`.
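
A plan inspection of this kind could be automated along the following lines. This is only a sketch: the plan layout (a list of cells with `strategy_id` and `scope_prompt_file` keys) is an assumption made for illustration, not the actual output format of `run_eval.py --plan-only`.

```python
from typing import Iterable, Mapping, Optional

EXPECTED_SCOPE = "benchmark/tau2/config/scope_prompts/generic_memory_scope.md"


def scoped_cells_with_wrong_prompt(
    cells: Iterable[Mapping[str, Optional[str]]],
    expected: str = EXPECTED_SCOPE,
) -> list:
    """Return strategy ids whose scope prompt is set but differs from `expected`.

    Cells without a scope prompt (e.g. a plain no-memory baseline) are skipped.
    """
    return [
        cell["strategy_id"]
        for cell in cells
        if cell.get("scope_prompt_file") not in (None, expected)
    ]


# Hypothetical plan cells; the strategy ids are taken from the commands above.
plan = [
    {"strategy_id": "no_memory_generic_scope", "scope_prompt_file": EXPECTED_SCOPE},
    {"strategy_id": "memory_v2_trajectory_prewrite_scope", "scope_prompt_file": EXPECTED_SCOPE},
]
assert scoped_cells_with_wrong_prompt(plan) == []
```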

The full 8-trial diagnostic runs reported above come from the PR-B follow-up workspace. The original PR-B headline row is included to show the mixed-scope attribution issue; the generic-scope row is the fair read enabled by this PR.

Boundary

This PR does not add category rerank, selector/controller logic, failure-memory handling, or outcome-aware experience aggregation. Those remain separate experimental lines.

github-actions · 2026-05-21T10:31:12Z

PR Reviewer Guide 🔍

(Review updated until commit `af1d2f0`)

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

2017 - Fully compliant

Compliant requirements:

- Add TAU-2 runner support for scope prompt in no-memory path
- Add generic scope prompt file
- Update TAU-2 configs to use generic scope prompt
- Add scope fairness config

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-21T10:31:24Z

PR Code Suggestions ✨

No code suggestions found for the PR.

github-actions · 2026-05-21T11:12:08Z

Persistent review updated to latest commit af1d2f0

github-actions · 2026-05-21T11:12:23Z

PR Code Suggestions ✨

No code suggestions found for the PR.

yangxinxin-7

LGTM

bench(tau2): add generic scope fairness check

f6944cf

github-project-automation Bot added this to OpenViking project May 21, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 21, 2026

huangruiteng mentioned this pull request May 21, 2026

feat(memory): upgrade trajectory extraction to beat no-memory baseline #2017

Merged

huangruiteng added 2 commits May 21, 2026 18:59

bench(tau2): focus scope fairness on generic prompt

d9de3ca

bench(tau2): remove domain-specific scope prompts

af1d2f0

huangruiteng marked this pull request as ready for review May 21, 2026 11:11

github-actions Bot added the Review effort 2/5 label May 21, 2026

yangxinxin-7 self-requested a review May 21, 2026 11:46

yangxinxin-7 self-assigned this May 21, 2026

yangxinxin-7 approved these changes May 21, 2026

yangxinxin-7 merged commit 8817abd into volcengine:main May 21, 2026
6 checks passed

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 21, 2026

yangxinxin-7 merged 3 commits into volcengine:main from huangruiteng:feat/tau2-generic-scope-fairness
