πŸ› οΈ Switch L5/A/B capture to claude --output-format stream-json --include-hook-events (closes #164, unifies local + CI mode)Β #179

@ZaxShen

Per user doctrine ("from now on all tests use the same mode regardless of local or online"), the L5 + A/B test runners should switch from text-output + grep scoring to structured `stream-json` capture with `jq`-based scoring. This fixes #164's measurement bugs at the source, eliminates fixture/format brittleness, and gives local + CI runs an identical capture pipeline.

Current pain (surfaced by v0.5.0-rc.1 dogfood)

  • `tokens_total=0` across every run (capture sees the `text` output, not the per-tool-call usage that `stream-json` exposes).
  • Merged-DB report drops most rows due to `UNIQUE constraint failed` (cause: per-pair DBs merged via `.dump | grep INSERT INTO`, which collides on auto-increment IDs that aren't renumbered; a fragile pipeline that stream-json bypasses entirely).
  • `tools-required` scorer reads from `debug_trajectory` table, which is populated server-side only when `TMB_DEBUG_TRAJECTORY=1`. Brittle env coupling; stream-json captures hook + tool events directly.
  • Local L5 dogfood runs use the same text-grep scoring → same gotchas as CI. Single-mode parity required.

Proposed

  1. Switch `l5_run_claude` + `l5_run_arm` (in `tests/dogfood/lib/{flow,ab}-helpers.sh`) to use:
    ```
    claude -p \
    --output-format stream-json \
    --include-hook-events \
    --include-partial-messages \
    --plugin-dir "$PLUGIN_ROOT" \
    --dangerously-skip-permissions \
    "$prompt" 2>&1
    ```
    Pipe stdout to `<run_dir>/trajectory.jsonl` (one JSON event per line).

  2. Replace text-grep scorers (`tests/dogfood/lib/scorers.sh`) with jq-based parsing of trajectory.jsonl:

    • `outcome`: same SQL on the trajectory DB (unchanged; bro's MCP writes still land in DB)
    • `trajectory_required`: `jq 'select(.type=="tool_use") | .name'` against the required tools list, no debug_trajectory dependency
    • `trajectory_forbidden`: same shape, inverted
    • `cost`: sum `.usage.{input,output}_tokens` across events, no `tokens_total=0` bug
  3. Drop the merged-DB pipeline in `.github/workflows/ab-scenario.yml`. Each pair has its own `trajectory.jsonl`; the report aggregates from those files directly. Closes the UNIQUE-constraint and tokens=0 issues from #164 (🐛 A/B framework: 3 measurement bugs surfaced by h4 (#153) run).

  4. Local + CI parity: same helpers, same flags, same scoring. The only difference is where the runner is invoked from.

  5. Hook events become observable: `--include-hook-events` emits each hook firing as a JSON event. We can finally assert "the no-source-edit hook fired and denied" as a positive signal in `trajectory_required` instead of inferring from absence of Edit.
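
The jq-based scorers from step 2 could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual `scorers.sh`: the function names (`l5_tools_used`, `l5_score_required`, `l5_score_forbidden`, `l5_score_cost`) are hypothetical, and the event layout is assumed. Tool calls are matched as any object with `"type":"tool_use"` at any depth, so the filter works whether they are top-level events or content blocks nested inside a message. Token usage is assumed to live at `.message.usage` on `assistant` events; if one message is emitted as several events, this sum may over-count and a final summary event may be the better source. `-R` with `fromjson?` skips non-JSON lines, which matters because `2>&1` can mix stderr text into the capture.

```bash
#!/usr/bin/env bash
# Hypothetical jq-based scorers for trajectory.jsonl (names are illustrative).

# Print the unique tool names seen anywhere in the trajectory.
# -R + fromjson? skips lines that are not valid JSON (e.g. stderr noise).
l5_tools_used() {
  jq -rR 'fromjson? | .. | objects | select(.type? == "tool_use") | .name // empty' "$1" \
    | sort -u
}

# trajectory_required: succeed only if every listed tool was used.
l5_score_required() {
  local traj="$1"; shift
  local used tool
  used="$(l5_tools_used "$traj")"
  for tool in "$@"; do
    grep -qxF -- "$tool" <<<"$used" || { echo "missing: $tool" >&2; return 1; }
  done
}

# trajectory_forbidden: same shape, inverted.
l5_score_forbidden() {
  local traj="$1"; shift
  local used tool
  used="$(l5_tools_used "$traj")"
  for tool in "$@"; do
    if grep -qxF -- "$tool" <<<"$used"; then
      echo "forbidden: $tool" >&2
      return 1
    fi
  done
}

# cost: sum input + output tokens (ASSUMED location: .message.usage on assistant events).
l5_score_cost() {
  jq -R 'fromjson? | select(.type? == "assistant") | .message.usage? | objects
         | (.input_tokens // 0) + (.output_tokens // 0)' "$1" \
    | jq -s 'add // 0'
}
```

For example, `l5_score_required "$run_dir/trajectory.jsonl" Read` and `l5_score_forbidden "$run_dir/trajectory.jsonl" Edit` would each return 0 on a run that read files but never edited them. Matching `tool_use` at any depth trades precision for robustness to schema differences; `sort -u` keeps duplicates from partial-message events from affecting presence checks.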

Scope

  • `tests/dogfood/lib/flow-helpers.sh`: `l5_run_claude` capture command
  • `tests/dogfood/lib/ab-helpers.sh`: `l5_run_arm` capture command
  • `tests/dogfood/lib/scorers.sh`: all four scorers re-written against jsonl
  • `.github/workflows/ab-scenario.yml`: drop the merge-DB step
  • `.github/workflows/l5-dogfood.yml`: match
  • Add `tests/dogfood/scripts/score-jsonl.sh` (or extend `ab-report.sh`): central jq-based scorer
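
As an illustration of what a central jq-based scorer might do in place of the merged-DB step, the sketch below walks per-pair `trajectory.jsonl` files and prints one TSV row per run. Everything here is an assumption for illustration: the function name `l5_aggregate`, the `<results_root>/<pair>/trajectory.jsonl` directory layout, the report columns, and the event shape (tool calls as `tool_use` blocks under `.message.content[]` and usage at `.message.usage` on `assistant` events). Because each file is read independently, there are no row IDs to collide.

```bash
#!/usr/bin/env bash
# Hypothetical aggregate step: one TSV row per per-pair trajectory.jsonl.
# Usage: l5_aggregate <results_root>
l5_aggregate() {
  local root="$1" traj
  printf 'run\ttool_calls\ttokens_total\n'
  for traj in "$root"/*/trajectory.jsonl; do
    [ -f "$traj" ] || continue   # glob matched nothing
    # -R reads raw lines, -n + inputs lets one program see the whole file,
    # fromjson? drops lines that are not valid JSON.
    jq -rRn --arg run "$(basename "$(dirname "$traj")")" '
      [inputs | fromjson? | select(.type? == "assistant") | .message | objects] as $msgs
      | [ $run,
          ([$msgs[].content[]? | select(.type? == "tool_use")] | length),
          ([$msgs[].usage | objects | (.input_tokens // 0) + (.output_tokens // 0)] | add // 0)
        ]
      | @tsv' "$traj"
  done
}
```

A run such as `l5_aggregate results/ > report.tsv` would then produce the report input directly from the capture files, with no SQLite dump or merge in between.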

Token cost

Same per-call cost as today. Removes the merge-DB step latency in CI. Local runs become identical to CI runs.

Labels: Priority: High, Tests, Workflow