🛠️ Switch L5/A/B capture to claude --output-format stream-json --include-hook-events (closes #164, unifies local + CI mode) #179

Per user doctrine ("from now on all tests use the same mode regardless of local or online"), the L5 and A/B test runners should switch from text output with grep scoring to structured `stream-json` capture with `jq`-based scoring. This fixes #164's measurement bugs at the source, eliminates fixture/format brittleness, and gives local and CI runs an identical capture pipeline.

## Current pain (surfaced by v0.5.0-rc.1 dogfood)

- `tokens_total=0` across every run: capture sees only the `text` output, not the per-tool-call usage that `stream-json` exposes.
- The merged-DB report drops most rows with `UNIQUE constraint failed`. Cause: per-pair DBs are merged via `.dump | grep INSERT INTO`, which collides on auto-increment IDs that are never renumbered. It is a fragile pipeline that stream-json bypasses entirely.
- The `tools-required` scorer reads from the `debug_trajectory` table, which is populated server-side only when `TMB_DEBUG_TRAJECTORY=1`. That is brittle env coupling; stream-json captures hook and tool events directly.
- Local L5 dogfood runs use the same text-grep scoring, so they hit the same gotchas as CI. **Single-mode parity required.**

## Proposed

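1. **Switch `l5_run_claude` + `l5_run_arm` (in `tests/dogfood/lib/{flow,ab}-helpers.sh`)** to use:
   ```
   claude -p \
     --output-format stream-json \
     --include-hook-events \
     --include-partial-messages \
     --plugin-dir "$PLUGIN_ROOT" \
     --dangerously-skip-permissions \
     "$prompt" \
     > "$run_dir/trajectory.jsonl" 2> "$run_dir/stderr.log"
   ```
   This writes stdout to `<run_dir>/trajectory.jsonl` (one JSON event per line). Stderr goes to a separate file (`stderr.log` is a placeholder name) rather than being merged with `2>&1`, so non-JSON diagnostics cannot corrupt the JSONL. Depending on the CLI version, `stream-json` in `-p` mode may also require `--verbose`; verify against the installed `claude` before rollout.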

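2. **Replace text-grep scorers** (`tests/dogfood/lib/scorers.sh`) with jq-based parsing of `trajectory.jsonl`:
   - `outcome`: same SQL on the trajectory DB (unchanged; bro's MCP writes still land in the DB)
   - `trajectory_required`: `jq 'select(.type=="tool_use") | .name'` checked against the required tools list, with no `debug_trajectory` dependency. If `tool_use` blocks turn out to be nested inside message events rather than emitted at top level, the filter needs to descend into them.
   - `trajectory_forbidden`: same shape, inverted
   - `cost`: sum `.usage.{input,output}_tokens` from the captured events, which removes the `tokens_total=0` bug. Take care not to double-count if both per-message events and a final summary event report usage.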

3. **Drop the merged-DB pipeline** in `.github/workflows/ab-scenario.yml`. Each pair has its own `trajectory.jsonl`, and the report aggregates from those files directly. This closes #164's UNIQUE-constraint and tokens=0 issues.

4. **Local + CI parity**: same helpers, same flags, same scoring. The only difference is where the runner is invoked from.

5. **Hook events become observable**: `--include-hook-events` emits each hook firing as a JSON event. We can finally assert "the no-source-edit hook fired and denied" as a positive signal in `trajectory_required`, instead of inferring it from the absence of an Edit call.
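
The scorers in item 2 could be sketched roughly as below. This is a minimal illustration in POSIX sh plus `jq`, not the actual `scorers.sh`: the function names (`l5_tools_used`, `l5_score_required`, `l5_score_forbidden`, `l5_score_cost`) are hypothetical, and the event shapes are assumptions. The tool filter searches every nesting level so it works whether `tool_use` appears as a top-level event or nested inside a message, and the cost scorer assumes a final `result` event carries cumulative usage, which should be checked against real captures.

```shell
# Hypothetical helpers; function names and event shapes are assumptions.

# Print the unique tool names seen in a trajectory, one per line.
# Descends into every object, so nested and top-level tool_use both match.
l5_tools_used() {
  jq -r '.. | objects | select(.type == "tool_use") | .name // empty' "$1" | sort -u
}

# Succeed only if every listed tool appears in the trajectory.
l5_score_required() (
  trajectory=$1; shift
  used=$(l5_tools_used "$trajectory")
  missing=0
  for tool in "$@"; do
    if ! printf '%s\n' "$used" | grep -Fxq -- "$tool"; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
)

# Succeed only if none of the listed tools appears in the trajectory.
l5_score_forbidden() (
  trajectory=$1; shift
  used=$(l5_tools_used "$trajectory")
  hit=0
  for tool in "$@"; do
    if printf '%s\n' "$used" | grep -Fxq -- "$tool"; then
      echo "forbidden: $tool"
      hit=1
    fi
  done
  return "$hit"
)

# Print input + output tokens, read only from "result" events (assumed to be
# cumulative) so per-message usage is not counted twice.
l5_score_cost() {
  jq -s '[.[] | objects | select(.type == "result") | .usage
          | (.input_tokens // 0) + (.output_tokens // 0)] | add // 0' "$1"
}
```

A call such as `l5_score_required "$run_dir/trajectory.jsonl" Read Bash` would exit non-zero and print one `missing:` line per absent tool, which keeps the pass/fail signal in the exit status and the detail on stdout.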

## Scope

- `tests/dogfood/lib/flow-helpers.sh`: `l5_run_claude` capture command
- `tests/dogfood/lib/ab-helpers.sh`: `l5_run_arm` capture command
- `tests/dogfood/lib/scorers.sh`: all four scorers rewritten against `trajectory.jsonl`
- `.github/workflows/ab-scenario.yml`: drop the merge-DB step
- `.github/workflows/l5-dogfood.yml`: apply the matching change
- Add `tests/dogfood/scripts/score-jsonl.sh` (or extend `ab-report.sh`) as the central jq-based scorer
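
As a rough idea of what the central scorer could do in place of the merged DB, the sketch below walks one directory per pair and prints a TSV row from each `trajectory.jsonl`. It is an assumption-laden illustration: `l5_report`, the `<runs_root>/<pair>/trajectory.jsonl` layout, and the column names are all hypothetical, and the event shapes (nested or top-level `tool_use`, cumulative usage on a `result` event) are unverified.

```shell
# Hypothetical aggregation; layout, names, and event shapes are assumptions.
# Usage: l5_report <runs_root>   (expects <runs_root>/<pair>/trajectory.jsonl)
l5_report() (
  runs_root=$1
  printf 'run\ttool_calls\ttokens\n'
  for f in "$runs_root"/*/trajectory.jsonl; do
    [ -f "$f" ] || continue            # glob matched nothing
    run=$(basename "$(dirname "$f")")
    # Count tool_use blocks at any nesting level.
    tool_calls=$(jq -r '.. | objects | select(.type == "tool_use") | .name // empty' "$f" \
      | wc -l | tr -d ' ')
    # Sum usage from "result" events only, to avoid double counting.
    tokens=$(jq -s '[.[] | objects | select(.type == "result") | .usage
                     | (.input_tokens // 0) + (.output_tokens // 0)] | add // 0' "$f")
    printf '%s\t%s\t%s\n' "$run" "$tool_calls" "$tokens"
  done
)
```

Because each row is computed from a single file, there are no row IDs to collide, which is the property the merged-DB step lacked.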

## Closes

- #164 (A/B framework measurement bugs)

## Token cost

Same per-call cost as today. Removes the merge-DB step latency in CI. Local runs become identical to CI runs.
