volcengine · yangxinxin-7 · May 13, 2026 · May 12, 2026 · May 12, 2026 · May 12, 2026
diff --git a/benchmark/tau2/.gitignore b/benchmark/tau2/.gitignore
@@ -0,0 +1,5 @@
+result/
+.env.tau2
+.external/
+.venv-tau2/
+__pycache__/
diff --git a/benchmark/tau2/README.md b/benchmark/tau2/README.md
@@ -0,0 +1,146 @@
+# TAU-2 Benchmark
+
+This directory contains a small OpenViking-style entry point for TAU-2 memory
+evaluation. The first version is intentionally narrow:
+
+- fresh OpenViking Memory V2 experience-only baseline;
+- Memory V2 pre-write recall treatment.
+
+Trajectory / procedure-view prompts, category rerank, and other harness-only
+diagnostics are intentionally left out of this first PR.
+
+## Layout
+
+```text
+benchmark/tau2/
+├── config/
+│   ├── baseline.yaml
+│   ├── official.yaml
+│   └── prewrite.yaml
+├── scripts/
+│   ├── run_eval.py
+│   ├── setup_tau2_repo.sh
+│   └── tau2_common.py
+└── run_full_eval.sh
+```
+
+Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.
+
+## Quick Start
+
+This benchmark delegates task simulation and scoring to an external TAU-2
+checkout. Set `TAU2_REPO` and `TAU2_CLI` when the checkout and CLI are not at
+the config defaults (`data/external_benchmarks/tau2-bench`, `tau2` on `PATH`):
+
+```bash
+export TAU2_REPO=/path/to/tau2-bench
+export TAU2_CLI=/path/to/tau2
+```
+
+For a local one-command setup, clone and install TAU-2 into ignored benchmark
+directories:
+
+```bash
+benchmark/tau2/scripts/setup_tau2_repo.sh
+source benchmark/tau2/.env.tau2
+```
+
+Plan the default benchmark without running TAU-2:
+
+```bash
+python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
+```
+
+Add `--preflight` or `--strict-preflight` when you want the runner to write a
+small environment/config check next to the run plan.
+
+After setup, verify the local TAU-2 link and write a one-cell run plan:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/baseline.yaml \
+  --strict-preflight \
+  --domain retail \
+  --strategy-id memory_v2_experience_only \
+  --task-id 5 \
+  --repeat-count 1
+```
+
+Plan a one-cell Memory V2 pre-write smoke run:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/baseline.yaml \
+  --domain retail \
+  --strategy-id memory_v2_prewrite \
+  --num-tasks 1 \
+  --repeat-count 1
+```
+
+Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/baseline.yaml \
+  --execute
+```
+
+For a small E2E smoke run, keep both the eval and train slices tiny:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/baseline.yaml \
+  --domain retail \
+  --strategy-id memory_v2_experience_only \
+  --num-tasks 1 \
+  --train-num-tasks 1 \
+  --repeat-count 1 \
+  --execute
+```
+
+When using Doubao through an OpenAI-compatible endpoint, set `OPENAI_API_KEY`
+and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.
+
+Start the OpenViking service before executing memory cells, and verify it with
+`ov status`. For evidence runs, use a clean OpenViking workspace/config and set
+`OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
+Memory V2 baseline.
+
+## Memory Adapter
+
+`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
+TAU-2 agent adapter in this directory:
+
+- train by writing TAU-2 training conversations into OpenViking sessions;
+- evaluate by retrieving OpenViking experience memory at the first user turn;
+- for pre-write recall, retrieve again before write-like tool calls and
+  regenerate that step with the matched memories;
+- emit artifact metadata to identify the OpenViking account, agent,
+  corpus, retrieval mode, and simulator policy used by each cell.
+
+## User Simulator Policy
+
+If `eval.user_simulator_policy` is omitted, the runner defaults to the official
+TAU-2 user simulator. The bundled OpenViking memory benchmark config sets
+`confirmation_aware`, because a memory benchmark should not treat user
+confirmation as task completion before the backend write has happened.
+
+`confirmation_aware` applies a small idempotent prompt patch to the configured
+TAU-2 checkout before planning or running. The patch appends only the behavioral
+confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
+the upstream PR link is kept in run artifacts, not in the simulator prompt.
+Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).
+
+Use `config/official.yaml` with a clean TAU-2 checkout when you need an
+official-user-simulator parity run. If the checkout was already patched, the
+artifact records that boundary instead of labeling the run as purely official.
+
+## Evidence Boundary
+
+Only completed `retail + airline` runs with the same config, same seeds/repeats,
+and non-empty artifacts should be read as benchmark evidence. Partial runs,
+single-task probes, and runs without OpenViking corpus identity are diagnostics.
+Executed runs write per-cell JSON under `cell_results/` and a strategy/domain
+aggregate to `scoreboard.json`. Memory training artifacts are shared per
+domain and strategy under `memory_corpora/`, so repeated eval cells reuse the
+same fresh corpus instead of rewriting it.
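The "small idempotent prompt patch" described in the README's User Simulator Policy section can be sketched as an append guarded by a marker check. This is a minimal illustration under stated assumptions, not the code in `run_eval.py`: the marker string, the helper name `apply_prompt_patch`, and the placeholder guideline text are all hypothetical, and the real boundary wording comes from the upstream PR.

```python
# Hypothetical marker and helper; the actual patch logic is in run_eval.py
# and is not shown in this part of the diff.
MARKER = "<!-- confirmation-boundary -->"


def apply_prompt_patch(guidelines: str, boundary_text: str) -> tuple[str, bool]:
    """Append boundary_text once. Returns (text, changed)."""
    if MARKER in guidelines:
        # Already patched: return the text untouched so reruns are no-ops.
        return guidelines, False
    patched = f"{guidelines.rstrip()}\n\n{MARKER}\n{boundary_text.strip()}\n"
    return patched, True


# Placeholder text, not the wording used by the upstream PR.
original = "# User simulator guidelines\n- Stay in character.\n"
boundary = "Placeholder: user confirmation alone does not complete the task."

once, changed_first = apply_prompt_patch(original, boundary)
twice, changed_second = apply_prompt_patch(once, boundary)
print(changed_first, changed_second, once == twice)  # True False True
```

Returning a `changed` flag alongside the text is one way to let the caller record in the run artifact whether the checkout was already patched.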
diff --git a/benchmark/tau2/config/baseline.yaml b/benchmark/tau2/config/baseline.yaml
@@ -0,0 +1,53 @@
+benchmark:
+  name: tau2_openviking_baseline
+  domains:
+    - retail
+    - airline
+  train_split_name: train
+  eval_split_name: test
+  repeat_count: 8
+  task_max_concurrency: 10
+  max_steps: 200
+  seed: 300
+  agent: llm_agent
+  user: user_simulator
+  reasoning_effort: high
+
+paths:
+  tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
+  tau2_cli: ${TAU2_CLI:-tau2}
+  output_dir: benchmark/tau2/result
+
+eval:
+  # The runner default is official if this field is omitted. The OpenViking
+  # memory benchmark config opts into a confirmation-aware TAU-2 user simulator
+  # prompt; run_eval.py applies that small prompt patch idempotently when needed.
+  user_simulator_policy: confirmation_aware
+
+model:
+  agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
+  user_llm: ${TAU2_USER_LLM:-openai/doubao-seed-2-0-pro-260215}
+  temperature: 0.0
+
+openviking:
+  url: ${OPENVIKING_URL:-http://localhost:1933}
+  account: ${OPENVIKING_ACCOUNT:-default}
+  agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
+  retrieval_top_k: 4
+  replay_write_policy: read_only
+
+strategies:
+  - id: memory_v2_experience_only
+    label: OpenViking Memory V2 experience-only
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_experience_only
+    train_memory_mode: experience_only
+    retrieval_mode: first_user
+  - id: memory_v2_prewrite
+    label: OpenViking Memory V2 pre-write recall
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_experience_only
+    train_memory_mode: experience_only
+    retrieval_mode: first_user_prewrite
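`baseline.yaml` relies on shell-style `${NAME:-default}` placeholders, which YAML itself does not expand, so the loader has to resolve them. The sketch below shows one way to do that, plus the size of the default matrix. It is an illustration only: `expand_env` and `plan_cells` are hypothetical names, the override path is an example value, and treating one cell as one (domain, strategy, repeat) combination is an assumption that this diff does not confirm.

```python
import re
from itertools import product

# Matches ${NAME} and ${NAME:-default}, the forms used in baseline.yaml.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(value: str, env: dict[str, str]) -> str:
    """Expand placeholders with shell ':-' semantics (unset or empty -> default)."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return env.get(name) or default or ""

    return _PLACEHOLDER.sub(replace, value)


def plan_cells(domains, strategy_ids, repeat_count):
    """Assumed cell shape: one cell per (domain, strategy, repeat index)."""
    return list(product(domains, strategy_ids, range(repeat_count)))


url = expand_env("${OPENVIKING_URL:-http://localhost:1933}", {})
cli = expand_env("${TAU2_CLI:-tau2}", {"TAU2_CLI": "/opt/tau2/bin/tau2"})
cells = plan_cells(
    ["retail", "airline"],
    ["memory_v2_experience_only", "memory_v2_prewrite"],
    repeat_count=8,
)
print(url, cli, len(cells))
```

Under that assumption the default config yields 2 x 2 x 8 = 32 cells, which matches the matrix described in the README.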
diff --git a/benchmark/tau2/config/official.yaml b/benchmark/tau2/config/official.yaml
@@ -0,0 +1,7 @@
+extends: baseline.yaml
+
+benchmark:
+  name: tau2_openviking_official_user_simulator
+
+eval:
+  user_simulator_policy: official
diff --git a/benchmark/tau2/config/prewrite.yaml b/benchmark/tau2/config/prewrite.yaml
@@ -0,0 +1,13 @@
+extends: baseline.yaml
+
+benchmark:
+  name: tau2_openviking_prewrite
+
+strategies:
+  - id: memory_v2_prewrite
+    label: OpenViking Memory V2 pre-write recall
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_experience_only
+    train_memory_mode: experience_only
+    retrieval_mode: first_user_prewrite
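Both `official.yaml` and `prewrite.yaml` use `extends: baseline.yaml` and override only a few keys. A plausible reading is that mappings merge recursively while lists such as `strategies` are replaced wholesale. The sketch below illustrates that reading with trimmed copies of the three configs as plain dicts; `merge_config` is a hypothetical helper, and whether `run_eval.py` resolves `extends` exactly this way is not shown in this part of the diff.

```python
def merge_config(base: dict, override: dict) -> dict:
    """Recursively merge mappings; scalars and lists in override win outright."""
    merged = dict(base)
    for key, value in override.items():
        if key == "extends":
            continue  # directive for the loader, not config data
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Trimmed copies of the configs in this PR.
baseline = {
    "benchmark": {"name": "tau2_openviking_baseline", "repeat_count": 8},
    "eval": {"user_simulator_policy": "confirmation_aware"},
    "strategies": [
        {"id": "memory_v2_experience_only", "retrieval_mode": "first_user"},
        {"id": "memory_v2_prewrite", "retrieval_mode": "first_user_prewrite"},
    ],
}
official = {
    "extends": "baseline.yaml",
    "benchmark": {"name": "tau2_openviking_official_user_simulator"},
    "eval": {"user_simulator_policy": "official"},
}
prewrite = {
    "extends": "baseline.yaml",
    "benchmark": {"name": "tau2_openviking_prewrite"},
    "strategies": [
        {"id": "memory_v2_prewrite", "retrieval_mode": "first_user_prewrite"},
    ],
}

official_cfg = merge_config(baseline, official)
prewrite_cfg = merge_config(baseline, prewrite)
print(official_cfg["benchmark"], [s["id"] for s in prewrite_cfg["strategies"]])
```

With these semantics the official config keeps `repeat_count: 8` and both strategies, while the pre-write config keeps the `confirmation_aware` policy and runs only `memory_v2_prewrite`.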
diff --git a/benchmark/tau2/run_full_eval.sh b/benchmark/tau2/run_full_eval.sh
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+PYTHON_BIN="${PYTHON_BIN:-python3}"
+CONFIG="$SCRIPT_DIR/config/baseline.yaml"
+EXECUTE=false
+PREFLIGHT=false
+STRICT_PREFLIGHT=false
+RUN_ID=""
+RUN_EVAL_EXTRA=()
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --config)
+      CONFIG="$2"
+      shift 2
+      ;;
+    --run-id)
+      RUN_ID="$2"
+      shift 2
+      ;;
+    --execute)
+      EXECUTE=true
+      shift
+      ;;
+    --preflight)
+      PREFLIGHT=true
+      shift
+      ;;
+    --strict-preflight)
+      STRICT_PREFLIGHT=true
+      shift
+      ;;
+    --domain|--repeat-count|--strategy-id|--task-id|--num-tasks|--train-num-tasks)
+      RUN_EVAL_EXTRA+=("$1" "$2")
+      shift 2
+      ;;
+    --help|-h)
+      cat <<'EOF'
+Usage:
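+  benchmark/tau2/run_full_eval.sh [--config PATH] [--run-id ID] [--execute] [--preflight | --strict-preflight]
+      [--domain NAME] [--strategy-id ID] [--task-id ID] [--num-tasks N] [--train-num-tasks N] [--repeat-count N]
+Without --execute the script only writes run_plan artifacts.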
+EOF
+      exit 0
+      ;;
+    *)
+      echo "Unknown argument: $1" >&2
+      exit 1
+      ;;
+  esac
+done
+
+RUN_ARGS=()
+if [[ -n "$RUN_ID" ]]; then
+  RUN_ARGS+=(--run-id "$RUN_ID")
+fi
+
+cd "$REPO_ROOT"
+if [[ "$STRICT_PREFLIGHT" == true ]]; then
+  RUN_EVAL_EXTRA+=(--strict-preflight)
+elif [[ "$PREFLIGHT" == true ]]; then
+  RUN_EVAL_EXTRA+=(--preflight)
+fi
+
+if [[ "$EXECUTE" == true ]]; then
+  "$PYTHON_BIN" "$SCRIPT_DIR/scripts/run_eval.py" --config "$CONFIG" ${RUN_ARGS[@]+"${RUN_ARGS[@]}"} ${RUN_EVAL_EXTRA[@]+"${RUN_EVAL_EXTRA[@]}"} --execute
+else
+  "$PYTHON_BIN" "$SCRIPT_DIR/scripts/run_eval.py" --config "$CONFIG" ${RUN_ARGS[@]+"${RUN_ARGS[@]}"} ${RUN_EVAL_EXTRA[@]+"${RUN_EVAL_EXTRA[@]}"} --plan-only
+fi
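The wrapper's flag handling can be restated as a small pure function, which makes the precedence rules easy to check: `--strict-preflight` wins over `--preflight`, pass-through options are forwarded in order, and the mode flag always comes last. This is a sketch for illustration; `build_run_eval_argv` is a hypothetical name and is not part of the PR.

```python
def build_run_eval_argv(config, run_id="", execute=False, preflight=False,
                        strict_preflight=False, passthrough=()):
    """Mirror the argument order that run_full_eval.sh passes to run_eval.py."""
    argv = ["--config", config]
    if run_id:
        argv += ["--run-id", run_id]
    # Pass-through options such as --domain or --task-id keep their order.
    argv += list(passthrough)
    # Strict preflight takes precedence when both flags are given.
    if strict_preflight:
        argv.append("--strict-preflight")
    elif preflight:
        argv.append("--preflight")
    # Without --execute the wrapper only plans the run.
    argv.append("--execute" if execute else "--plan-only")
    return argv


argv = build_run_eval_argv(
    "benchmark/tau2/config/baseline.yaml",
    preflight=True,
    strict_preflight=True,
    passthrough=["--domain", "retail", "--task-id", "5"],
)
print(argv)
```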