You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
All 4 wired flows failed (01-onboarding, 02-simple-task, 95-anonymous-cold-restart, D-direct-mode)
Every failure: `tokens_total=0, latency_ms=0, trajectory rows=0`
95-anonymous outcome scorer DID pass (4/4 — fixture-set state)
Diagnosis
The 0 tokens + 0 trajectory rows + 0 latency strongly suggests `claude -p` did not actually invoke an LLM in CI. Either:
Auth failed silently — `CLAUDE_CODE_OAUTH_TOKEN` is the right env var name but maybe needs a different format in CI vs local.
`-p` mode doesn't engage `@bro` trigger — the trigger word logic might require interactive mode.
`--plugin-dir` doesn't load the plugin in `-p` mode — feature dimension we never validated.
Plugin loaded but MCP server didn't spawn — `.mcp.json` interpretation differs in `-p` mode.
Plugin loaded + MCP up but bro persona doesn't trigger — CLAUDE.md trigger rule expects message contains "bro" but `-p` might process it differently.
The runner suppresses claude's output (`l6_run_claude ... >/dev/null`) so we have NO diagnostic info from the actual claude invocation. First fix: remove that redirect.
Diagnosis steps (in order — lightest first)
Make `l6_run_claude` flow stdout/stderr to CI logs. Drop the `>/dev/null` from flow run.sh files. Re-trigger. ~$0.20 to read actual claude output.
Add a standalone diagnostic step BEFORE flow runs: `claude -p "say hello in one word"` with no plugin, no env. Just verify auth works at the most basic level.
Add a step that runs `claude --plugin-dir <PLUGIN_PATH> -p "say hi"` without engaging bro. Verify plugin loads in -p mode.
Add a step with the env + plugin + bro: `claude --plugin-dir <PLUGIN_PATH> -p "@bro say hi"`. See if bro triggers.
Based on which step fails, identify the broken layer.
Token secret IS available (workflow's "Verify secret is present" step printed "Token present.")
L6 v2 multi-scorer architecture worked exactly as designed: when claude didn't engage, the trajectory_required scorer caught it (subset semantics, not strict)
Cost spent: ~$0 (claude never ran)
Acceptance criteria
L6 v2 runner exposes claude's stdout/stderr in CI logs
Diagnostic CI run identifies which layer breaks (auth / plugin / bro / MCP)
Once identified: either (a) fix the broken layer, OR (b) document it as "L6 `-p` mode doesn't work; use L5+L6 combined Docker instead"
Re-run L6 light → all 4 wired flows produce non-empty trajectory + correct outcome
Stakes
This blocks #102 (v0.4.1 stable cut) — the framework that's supposed to replace manual L5 dogfood needs to actually work first. Until then, manual L5 stays the only validation.
Cost so far
~$0 (the 4 failed runs each spent 0 tokens because claude never invoked). The diagnostic re-run will be similarly cheap until we actually engage claude.
Symptom
First L6 run on dev (workflow_dispatch via
gh workflow run l6-dogfood.yml --ref dev, run id 24963880924, 2026-04-26 18:28 UTC):Diagnosis
The 0 tokens + 0 trajectory rows + 0 latency strongly suggests `claude -p` did not actually invoke an LLM in CI. Either:
The runner suppresses claude's output (`l6_run_claude ... >/dev/null`) so we have NO diagnostic info from the actual claude invocation. First fix: remove that redirect.
Diagnosis steps (in order — lightest first)
Make `l6_run_claude` flow stdout/stderr to CI logs. Drop the `>/dev/null` from flow run.sh files. Re-trigger. ~$0.20 to read actual claude output.
Add a standalone diagnostic step BEFORE flow runs: `claude -p "say hello in one word"` with no plugin, no env. Just verify auth works at the most basic level.
Add a step that runs `claude --plugin-dir <PLUGIN_PATH> -p "say hi"` without engaging bro. Verify plugin loads in -p mode.
Add a step with the env + plugin + bro: `claude --plugin-dir <PLUGIN_PATH> -p "@bro say hi"`. See if bro triggers.
Based on which step fails, identify the broken layer.
What was good
Acceptance criteria
Stakes
This blocks #102 (v0.4.1 stable cut) — the framework that's supposed to replace manual L5 dogfood needs to actually work first. Until then, manual L5 stays the only validation.
Cost so far
~$0 (the 4 failed runs each spent 0 tokens because claude never invoked). The diagnostic re-run will be similarly cheap until we actually engage claude.