You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Per user doctrine ("from now on all tests use the same mode regardless of local or online"), the L5 + A/B test runners should switch from text-output + grep scoring to structured `stream-json` capture with `jq`-based scoring. This fixes #164's measurement bugs at the source, eliminates fixture/format brittleness, and gives local + CI runs an identical capture pipeline.
Current pain (surfaced by v0.5.0-rc.1 dogfood)
`tokens_total=0` across every run (capture sees the `text` output, not the per-tool-call usage that `stream-json` exposes).
Merged-DB report drops most rows due to `UNIQUE constraint failed` (cause: per-pair DBs merged via `.dump | grep INSERT INTO` which collides on auto-increment IDs that aren't renumbered β fragile pipeline that stream-json bypasses entirely).
`tools-required` scorer reads from `debug_trajectory` table, which is populated server-side only when `TMB_DEBUG_TRAJECTORY=1`. Brittle env coupling; stream-json captures hook + tool events directly.
Local L5 dogfood runs use the same text-grep scoring β same gotchas as CI. Single-mode parity required.
Proposed
Switch `l5_run_claude` + `l5_run_arm` (in `tests/dogfood/lib/{flow,ab}-helpers.sh`) to use:
```
claude -p \
--output-format stream-json \
--include-hook-events \
--include-partial-messages \
--plugin-dir "$PLUGIN_ROOT" \
--dangerously-skip-permissions \
"$prompt" 2>&1
```
Pipe stdout to `<run_dir>/trajectory.jsonl` (one JSON event per line).
Replace text-grep scorers (`tests/dogfood/lib/scorers.sh`) with jq-based parsing of trajectory.jsonl:
`outcome` β same SQL on the trajectory DB (unchanged; bro's MCP writes still land in DB)
`trajectory_required` β `jq 'select(.type=="tool_use") | .name'` against the required tools list, no debug_trajectory dependency
`trajectory_forbidden` β same shape, inverted
`cost` β sum `.usage.{input,output}_tokens` across events, no `tokens_total=0` bug
Local + CI parity: same helpers, same flags, same scoring. The only difference is where the runner is invoked from.
Hook events become observable: `--include-hook-events` emits each hook firing as a JSON event. We can finally assert "the no-source-edit hook fired and denied" as a positive signal in `trajectory_required` instead of inferring from absence of Edit.
Per user doctrine ("from now on all tests use the same mode regardless of local or online"), the L5 + A/B test runners should switch from text-output + grep scoring to structured `stream-json` capture with `jq`-based scoring. This fixes #164's measurement bugs at the source, eliminates fixture/format brittleness, and gives local + CI runs an identical capture pipeline.
Current pain (surfaced by v0.5.0-rc.1 dogfood)
Proposed
Switch `l5_run_claude` + `l5_run_arm` (in `tests/dogfood/lib/{flow,ab}-helpers.sh`) to use:
```
claude -p \
--output-format stream-json \
--include-hook-events \
--include-partial-messages \
--plugin-dir "$PLUGIN_ROOT" \
--dangerously-skip-permissions \
"$prompt" 2>&1
```
Pipe stdout to `<run_dir>/trajectory.jsonl` (one JSON event per line).
Replace text-grep scorers (`tests/dogfood/lib/scorers.sh`) with jq-based parsing of trajectory.jsonl:
Drop the merged-DB pipeline in `.github/workflows/ab-scenario.yml`. Each per-pair has its own `trajectory.jsonl`; report aggregates from those files directly. Closes π A/B framework: 3 measurement bugs surfaced by h4 (#153) runΒ #164's UNIQUE-constraint and tokens=0 issues.
Local + CI parity: same helpers, same flags, same scoring. The only difference is where the runner is invoked from.
Hook events become observable: `--include-hook-events` emits each hook firing as a JSON event. We can finally assert "the no-source-edit hook fired and denied" as a positive signal in `trajectory_required` instead of inferring from absence of Edit.
Scope
Closes
Token cost
Same per-call cost as today. Removes the merge-DB step latency in CI. Local runs become identical to CI runs.