Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions benchmark/tau2/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
result/
.env.tau2
.external/
.venv-tau2/
__pycache__/
146 changes: 146 additions & 0 deletions benchmark/tau2/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
# TAU-2 Benchmark

This directory contains a small OpenViking-style entry point for TAU-2 memory
evaluation. The first version is intentionally narrow:

- fresh OpenViking Memory V2 experience-only baseline;
- Memory V2 pre-write recall treatment.

Trajectory / procedure-view prompts, category rerank, and other harness-only
diagnostics are intentionally left out of this first PR.

## Layout

```text
benchmark/tau2/
├── config/
│ ├── baseline.yaml
│ ├── official.yaml
│ └── prewrite.yaml
├── scripts/
│ ├── run_eval.py
│ ├── setup_tau2_repo.sh
│ └── tau2_common.py
└── run_full_eval.sh
```

Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.

## Quick Start

This benchmark delegates task simulation and scoring to an external TAU-2
checkout. Point the runner at that checkout and CLI explicitly when they are not
on the default path:

```bash
export TAU2_REPO=/path/to/tau2-bench
export TAU2_CLI=/path/to/tau2
```

For a local one-command setup, clone and install TAU-2 into ignored benchmark
directories:

```bash
benchmark/tau2/scripts/setup_tau2_repo.sh
source benchmark/tau2/.env.tau2
```

Plan the default benchmark without running TAU-2:

```bash
python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
```

Add `--preflight` or `--strict-preflight` when you want the runner to write a
small environment/config check next to the run plan.

After setup, verify the local TAU-2 link and write a one-cell run plan:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--strict-preflight \
--domain retail \
--strategy-id memory_v2_experience_only \
--task-id 5 \
--repeat-count 1
```

Plan a one-cell Memory V2 pre-write smoke:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--domain retail \
--strategy-id memory_v2_prewrite \
--num-tasks 1 \
--repeat-count 1
```

Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--execute
```

For a small E2E smoke, keep both the eval and train slices tiny:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--domain retail \
--strategy-id memory_v2_experience_only \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1 \
--execute
```

When using Doubao through an OpenAI-compatible endpoint, set `OPENAI_API_KEY`
and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.

Start the OpenViking service before executing memory cells, and verify it with
`ov status`. For evidence runs, use a clean OpenViking workspace/config and set
`OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
Memory V2 baseline.

## Memory Adapter

`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
TAU-2 agent adapter in this directory:

- train by writing TAU-2 training conversations into OpenViking sessions;
- evaluate by retrieving OpenViking experience memory at the first user turn;
- for pre-write recall, retrieve again before write-like tool calls and
regenerate that step with the matched memories;
- emit artifact metadata to identify the OpenViking account, agent,
corpus, retrieval mode, and simulator policy used by each cell.

## User Simulator Policy

The runner default is the official TAU-2 user simulator if
`eval.user_simulator_policy` is omitted. The bundled OpenViking memory benchmark
config sets `confirmation_aware`, because a memory benchmark should not treat
user confirmation as task completion before the backend write has happened.

`confirmation_aware` applies a small idempotent prompt patch to the configured
TAU-2 checkout before planning or running. The patch appends only the behavioral
confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
the upstream PR link is kept in run artifacts, not in the simulator prompt.
Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).

Use `config/official.yaml` with a clean TAU-2 checkout when you need an
official-user-simulator parity run. If the checkout was already patched, the
artifact records that boundary instead of labeling the run pure official.

## Evidence Boundary

Only completed `retail + airline` runs with the same config, same seeds/repeats,
and non-empty artifacts should be read as benchmark evidence. Partial runs,
single-task probes, or missing OpenViking corpus identity are diagnostics.
Executed runs write per-cell JSON under `cell_results/` and a strategy/domain
aggregate under `scoreboard.json`. Memory training artifacts are shared by
domain and strategy under `memory_corpora/`, so repeated eval cells reuse the
same fresh corpus instead of rewriting it.
53 changes: 53 additions & 0 deletions benchmark/tau2/config/baseline.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
benchmark:
name: tau2_openviking_baseline
domains:
- retail
- airline
train_split_name: train
eval_split_name: test
repeat_count: 8
task_max_concurrency: 10
max_steps: 200
seed: 300
agent: llm_agent
user: user_simulator
reasoning_effort: high

paths:
tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
tau2_cli: ${TAU2_CLI:-tau2}
output_dir: benchmark/tau2/result

eval:
# The runner default is official if this field is omitted. The OpenViking
# memory benchmark config opts into a confirmation-aware TAU-2 user simulator
# prompt; run_eval.py applies that small prompt patch idempotently when needed.
user_simulator_policy: confirmation_aware

model:
agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
user_llm: ${TAU2_USER_LLM:-openai/doubao-seed-2-0-pro-260215}
temperature: 0.0

openviking:
url: ${OPENVIKING_URL:-http://localhost:1933}
account: ${OPENVIKING_ACCOUNT:-default}
agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
retrieval_top_k: 4
replay_write_policy: read_only

strategies:
- id: memory_v2_experience_only
label: OpenViking Memory V2 experience-only
memory_backend: openviking
train_required: true
corpus_id: memory_v2_experience_only
train_memory_mode: experience_only
retrieval_mode: first_user
- id: memory_v2_prewrite
label: OpenViking Memory V2 pre-write recall
memory_backend: openviking
train_required: true
corpus_id: memory_v2_experience_only
train_memory_mode: experience_only
retrieval_mode: first_user_prewrite
7 changes: 7 additions & 0 deletions benchmark/tau2/config/official.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
extends: baseline.yaml

benchmark:
name: tau2_openviking_official_user_simulator

eval:
user_simulator_policy: official
13 changes: 13 additions & 0 deletions benchmark/tau2/config/prewrite.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
extends: baseline.yaml

benchmark:
name: tau2_openviking_prewrite

strategies:
- id: memory_v2_prewrite
label: OpenViking Memory V2 pre-write recall
memory_backend: openviking
train_required: true
corpus_id: memory_v2_experience_only
train_memory_mode: experience_only
retrieval_mode: first_user_prewrite
72 changes: 72 additions & 0 deletions benchmark/tau2/run_full_eval.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
PYTHON_BIN="${PYTHON_BIN:-python3}"
CONFIG="$SCRIPT_DIR/config/baseline.yaml"
EXECUTE=false
PREFLIGHT=false
STRICT_PREFLIGHT=false
RUN_ID=""
RUN_EVAL_EXTRA=()

while [[ $# -gt 0 ]]; do
case "$1" in
--config)
CONFIG="$2"
shift 2
;;
--run-id)
RUN_ID="$2"
shift 2
;;
--execute)
EXECUTE=true
shift
;;
--preflight)
PREFLIGHT=true
shift
;;
--strict-preflight)
STRICT_PREFLIGHT=true
shift
;;
--domain|--repeat-count|--strategy-id|--task-id|--num-tasks|--train-num-tasks)
RUN_EVAL_EXTRA+=("$1" "$2")
shift 2
;;
--help|-h)
cat <<'EOF'
Usage:
benchmark/tau2/run_full_eval.sh [--config PATH] [--run-id ID] [--execute] [--preflight]

Without --execute the script only writes run_plan artifacts.
EOF
exit 0
;;
*)
echo "Unknown argument: $1" >&2
exit 1
;;
esac
done

RUN_ARGS=()
if [[ -n "$RUN_ID" ]]; then
RUN_ARGS+=(--run-id "$RUN_ID")
fi

cd "$REPO_ROOT"
if [[ "$STRICT_PREFLIGHT" == true ]]; then
RUN_EVAL_EXTRA+=(--strict-preflight)
elif [[ "$PREFLIGHT" == true ]]; then
RUN_EVAL_EXTRA+=(--preflight)
fi

if [[ "$EXECUTE" == true ]]; then
"$PYTHON_BIN" "$SCRIPT_DIR/scripts/run_eval.py" --config "$CONFIG" "${RUN_ARGS[@]}" "${RUN_EVAL_EXTRA[@]}" --execute
else
"$PYTHON_BIN" "$SCRIPT_DIR/scripts/run_eval.py" --config "$CONFIG" "${RUN_ARGS[@]}" "${RUN_EVAL_EXTRA[@]}" --plan-only
fi
Loading