L6 light failed — claude -p produced 0 trajectory rows in CI; diagnose headless mode

## Symptom

First L6 run on dev (workflow_dispatch via `gh workflow run l6-dogfood.yml --ref dev`, run id 24963880924, 2026-04-26 18:28 UTC):

- 12/16 scaffolded flows passed (auto-skip — expected)
- **All 4 wired flows failed** (01-onboarding, 02-simple-task, 95-anonymous-cold-restart, D-direct-mode)
- Every failure: \`tokens_total=0, latency_ms=0, trajectory rows=0\`
- 95-anonymous outcome scorer DID pass (4/4 — fixture-set state)

## Diagnosis

The 0 tokens + 0 trajectory rows + 0 latency strongly suggests **\`claude -p\` did not actually invoke an LLM** in CI. Either:

1. **Auth failed silently** — \`CLAUDE_CODE_OAUTH_TOKEN\` is the right env var name but maybe needs a different format in CI vs local.
2. **\`-p\` mode doesn't engage \`@bro\` trigger** — the trigger word logic might require interactive mode.
3. **\`--plugin-dir\` doesn't load the plugin in \`-p\` mode** — feature dimension we never validated.
4. **Plugin loaded but MCP server didn't spawn** — \`.mcp.json\` interpretation differs in \`-p\` mode.
5. **Plugin loaded + MCP up but bro persona doesn't trigger** — CLAUDE.md trigger rule expects message contains "bro" but \`-p\` might process it differently.

The runner suppresses claude's output (\`l6_run_claude ... >/dev/null\`) so we have NO diagnostic info from the actual claude invocation. **First fix: remove that redirect.**

## Diagnosis steps (in order — lightest first)

1. **Make \`l6_run_claude\` flow stdout/stderr to CI logs.** Drop the \`>/dev/null\` from flow run.sh files. Re-trigger. ~$0.20 to read actual claude output.

2. **Add a standalone diagnostic step BEFORE flow runs**: \`claude -p "say hello in one word"\` with no plugin, no env. Just verify auth works at the most basic level.

3. **Add a step that runs \`claude --plugin-dir <PLUGIN_PATH> -p "say hi"\`** without engaging bro. Verify plugin loads in -p mode.

4. **Add a step with the env + plugin + bro**: \`claude --plugin-dir <PLUGIN_PATH> -p "@bro say hi"\`. See if bro triggers.

5. Based on which step fails, identify the broken layer.

## What was good

- Workflow YAML successfully published to main via PR #115 — \`gh workflow run\` works
- Token secret IS available (workflow's "Verify secret is present" step printed "Token present.")
- L6 v2 multi-scorer architecture worked exactly as designed: when claude didn't engage, the trajectory_required scorer caught it (subset semantics, not strict)
- Cost spent: ~\$0 (claude never ran)

## Acceptance criteria

- [ ] L6 v2 runner exposes claude's stdout/stderr in CI logs
- [ ] Diagnostic CI run identifies which layer breaks (auth / plugin / bro / MCP)
- [ ] Once identified: either (a) fix the broken layer, OR (b) document it as "L6 \`-p\` mode doesn't work; use L5+L6 combined Docker instead"
- [ ] Re-run L6 light → all 4 wired flows produce non-empty trajectory + correct outcome

## Stakes

This blocks #102 (v0.4.1 stable cut) — the framework that's supposed to replace manual L5 dogfood needs to actually work first. Until then, manual L5 stays the only validation.

## Cost so far

~\$0 (the 4 failed runs each spent 0 tokens because claude never invoked). The diagnostic re-run will be similarly cheap until we actually engage claude.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

L6 light failed — claude -p produced 0 trajectory rows in CI; diagnose headless mode #116

Symptom

Diagnosis

Diagnosis steps (in order — lightest first)

What was good

Acceptance criteria

Stakes

Cost so far

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

L6 light failed — claude -p produced 0 trajectory rows in CI; diagnose headless mode #116

Description

Symptom

Diagnosis

Diagnosis steps (in order — lightest first)

What was good

Acceptance criteria

Stakes

Cost so far

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions