bench(tau2): add generic scope fairness check #2172

Merged
yangxinxin-7 merged 3 commits into volcengine:main from huangruiteng:feat/tau2-generic-scope-fairness
May 21, 2026
Contributor

@huangruiteng huangruiteng commented May 21, 2026

Summary

This is a narrow follow-up to #2017. PR-B showed that TAU-2 trajectory memory can beat the no-memory baseline, but the headline read compared a no-scope no-memory baseline against a trajectory-memory treatment that used a domain-specific scope prompt. This PR removes the domain-specific scope prompts, routes TAU-2 scope treatments through a benchmark-neutral generic prompt, and adds the missing no-memory scope plumbing so the fair attribution read can use the same generic scope on both sides.

Changes:

  • pass scope_prompt_file through the no-memory TAU-2 eval path, so no-memory can be measured under the same scope prompt as memory treatments
  • add a generic advisory-memory scope guard that avoids retail/airline-specific business hints
  • update TAU-2 trajectory/content-matrix configs to use the generic scope prompt and remove the retail/airline scope prompt files
  • add a paired TAU-2 scope-fairness config with no-memory, generic-scope no-memory, and trajectory top4 first-user + pre-write under the same generic scope
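The first change above, threading `scope_prompt_file` through the no-memory path, can be pictured with a minimal sketch. This is not the code in `run_eval.py`: `StrategyCell` and `build_system_prompt` are hypothetical names chosen for illustration, and only `scope_prompt_file` comes from the PR text. The point it shows is that the scope prompt is applied whenever it is configured, independent of whether memory is enabled.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class StrategyCell:
    """Hypothetical stand-in for one strategy entry in a TAU-2 eval config."""
    strategy_id: str
    use_memory: bool
    scope_prompt_file: Optional[str] = None  # path relative to the repo root


def build_system_prompt(base_prompt: str, cell: StrategyCell, root: Path) -> str:
    """Append the scope prompt whenever one is configured.

    The decision depends only on scope_prompt_file, not on use_memory, so a
    no-memory cell and a memory cell can share exactly the same scope text.
    """
    if cell.scope_prompt_file is None:
        return base_prompt
    scope_text = (root / cell.scope_prompt_file).read_text(encoding="utf-8").strip()
    return f"{base_prompt}\n\n{scope_text}"
```

With this shape, a plain no-memory cell leaves `scope_prompt_file` unset, while the generic-scope no-memory cell and the trajectory cell both point at the same prompt file.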

Why

The original domain scope prompt is useful, but it can also help no-memory runs and is harder to defend as a benchmark-neutral memory protocol. Without a same-scope baseline, the headline PR-B delta can over-attribute some scope-prompt benefit to memory. The new generic-scope config makes the fair read explicit and reproducible without embedding retail/airline-specific business hints.

The important read is the attribution boundary:

| Read | No-memory scope | Treatment scope | No-memory avg reward | Trajectory memory avg reward | Delta |
| --- | --- | --- | --- | --- | --- |
| Original PR-B headline read | none | domain-specific | 0.80156 | 0.85313 | +0.05157 |
| Same domain-specific scope diagnostic | domain-specific | domain-specific | 0.81719 | 0.85313 | +0.03594 |
| Fair generic-scope read | generic | generic | 0.79844 | 0.84219 | +0.04375 |

So the headline result remains positive, but the benchmark-neutral fair read is the generic-scope delta (+0.04375). The same-domain-scope row is retained only as an attribution diagnostic explaining why the original mixed-scope read was too generous.
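The deltas can be rechecked directly from the reported averages. The short script below uses only the numbers in the table; the dictionary keys and variable names are illustrative. It also computes how much of the original headline delta disappears once the no-memory baseline receives the same domain-specific scope prompt.

```python
# (no-memory avg reward, trajectory-memory avg reward) per read, copied from the table.
rows = {
    "original_headline": (0.80156, 0.85313),
    "same_domain_scope": (0.81719, 0.85313),
    "fair_generic_scope": (0.79844, 0.84219),
}

# Delta = trajectory-memory reward minus no-memory reward, rounded to 5 decimals.
deltas = {name: round(mem - base, 5) for name, (base, mem) in rows.items()}

# Part of the headline delta that vanishes under a same-scope baseline. It equals
# the gain the domain-specific scope prompt gives the no-memory run on its own.
scope_share = round(deltas["original_headline"] - deltas["same_domain_scope"], 5)
baseline_gain = round(rows["same_domain_scope"][0] - rows["original_headline"][0], 5)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.5f}")
print(f"headline delta explained by scope prompt: {scope_share:+.5f}")
```

Both `scope_share` and `baseline_gain` come out to 0.01563, which is the amount by which the mixed-scope headline read overstates the memory effect relative to the same-domain-scope diagnostic.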

Validation

Local checks:

  • python3 -m py_compile benchmark/tau2/scripts/run_eval.py
  • python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/prb_scope_fairness.yaml --run-id pr_scope_fairness_generic_only_preflight2 --domain retail --repeat-count 1 --num-tasks 1 --strategy-id no_memory_generic_scope --strict-preflight --plan-only
  • python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/trajectory.yaml --run-id pr_trajectory_generic_scope_plancheck --domain retail --repeat-count 1 --num-tasks 1 --strategy-id memory_v2_trajectory_prewrite_scope --strict-preflight --plan-only
  • plan inspection confirmed both no-memory and trajectory scope cells now resolve to benchmark/tau2/config/scope_prompts/generic_memory_scope.md
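The plan inspection in the last bullet could be automated along the following lines. This is a sketch under assumptions: the PR does not show what `--plan-only` emits, so the list-of-dicts plan shape, the `strategy_id` and `scope_prompt_file` keys as plan fields, and the helper `scope_mismatches` are all hypothetical. Only the expected prompt path is taken from the PR text.

```python
EXPECTED_SCOPE = "benchmark/tau2/config/scope_prompts/generic_memory_scope.md"


def scope_mismatches(plan_cells, expected=EXPECTED_SCOPE):
    """Return ids of cells whose scope prompt is set but is not the expected file.

    Cells with no scope prompt (the plain no-memory baseline) are allowed.
    `plan_cells` is an assumed structure: a list of dicts, one per strategy cell.
    """
    return [
        cell["strategy_id"]
        for cell in plan_cells
        if cell.get("scope_prompt_file") not in (None, expected)
    ]


# Illustrative plan with the strategy ids named in the commands above.
example_plan = [
    {"strategy_id": "no_memory_generic_scope", "scope_prompt_file": EXPECTED_SCOPE},
    {"strategy_id": "memory_v2_trajectory_prewrite_scope", "scope_prompt_file": EXPECTED_SCOPE},
]
assert scope_mismatches(example_plan) == []
```

An empty result means every scoped cell resolves to the generic prompt, which is the condition the fair read depends on.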

The full 8-trial diagnostic results in the table above come from the PR-B follow-up workspace. The original PR-B headline row is included to show the mixed-scope attribution issue; the generic-scope row is the fair read enabled by this PR.

Boundary

This PR does not add category rerank, selector/controller logic, failure-memory handling, or outcome-aware experience aggregation. Those remain separate experimental lines.

github-actions Bot commented May 21, 2026

PR Reviewer Guide 🔍

(Review updated until commit af1d2f0)

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

#2017 - Fully compliant

Compliant requirements:

  • Add TAU-2 runner support for scope prompt in no-memory path
  • Add generic scope prompt file
  • Update TAU-2 configs to use generic scope prompt
  • Add scope fairness config
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng huangruiteng marked this pull request as ready for review May 21, 2026 11:11
@github-actions

Persistent review updated to latest commit af1d2f0

@yangxinxin-7 yangxinxin-7 self-requested a review May 21, 2026 11:46
@yangxinxin-7 yangxinxin-7 self-assigned this May 21, 2026
Collaborator

@yangxinxin-7 yangxinxin-7 left a comment

LGTM

@yangxinxin-7 yangxinxin-7 merged commit 8817abd into volcengine:main May 21, 2026
6 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 21, 2026