Skip to content

L6 light failed — claude -p produced 0 trajectory rows in CI; diagnose headless mode #116

@ZaxShen

Description

@ZaxShen

Symptom

First L6 run on dev (workflow_dispatch via gh workflow run l6-dogfood.yml --ref dev, run id 24963880924, 2026-04-26 18:28 UTC):

  • 12/16 scaffolded flows passed (auto-skip — expected)
  • All 4 wired flows failed (01-onboarding, 02-simple-task, 95-anonymous-cold-restart, D-direct-mode)
  • Every failure: `tokens_total=0, latency_ms=0, trajectory rows=0`
  • 95-anonymous outcome scorer DID pass (4/4 — fixture-set state)

Diagnosis

The 0 tokens + 0 trajectory rows + 0 latency strongly suggests `claude -p` did not actually invoke an LLM in CI. Either:

  1. Auth failed silently — `CLAUDE_CODE_OAUTH_TOKEN` is the right env var name but maybe needs a different format in CI vs local.
  2. `-p` mode doesn't engage `@bro` trigger — the trigger word logic might require interactive mode.
  3. `--plugin-dir` doesn't load the plugin in `-p` mode — feature dimension we never validated.
  4. Plugin loaded but MCP server didn't spawn — `.mcp.json` interpretation differs in `-p` mode.
  5. Plugin loaded + MCP up but bro persona doesn't trigger — CLAUDE.md trigger rule expects message contains "bro" but `-p` might process it differently.

The runner suppresses claude's output (`l6_run_claude ... >/dev/null`) so we have NO diagnostic info from the actual claude invocation. First fix: remove that redirect.

Diagnosis steps (in order — lightest first)

  1. Make `l6_run_claude` flow stdout/stderr to CI logs. Drop the `>/dev/null` from flow run.sh files. Re-trigger. ~$0.20 to read actual claude output.

  2. Add a standalone diagnostic step BEFORE flow runs: `claude -p "say hello in one word"` with no plugin, no env. Just verify auth works at the most basic level.

  3. Add a step that runs `claude --plugin-dir <PLUGIN_PATH> -p "say hi"` without engaging bro. Verify plugin loads in -p mode.

  4. Add a step with the env + plugin + bro: `claude --plugin-dir <PLUGIN_PATH> -p "@bro say hi"`. See if bro triggers.

  5. Based on which step fails, identify the broken layer.

What was good

  • Workflow YAML successfully published to main via PR 🔧 chore(ci): publish L6 + L5+L6 workflow files to main #115 — `gh workflow run` works
  • Token secret IS available (workflow's "Verify secret is present" step printed "Token present.")
  • L6 v2 multi-scorer architecture worked exactly as designed: when claude didn't engage, the trajectory_required scorer caught it (subset semantics, not strict)
  • Cost spent: ~$0 (claude never ran)

Acceptance criteria

  • L6 v2 runner exposes claude's stdout/stderr in CI logs
  • Diagnostic CI run identifies which layer breaks (auth / plugin / bro / MCP)
  • Once identified: either (a) fix the broken layer, OR (b) document it as "L6 `-p` mode doesn't work; use L5+L6 combined Docker instead"
  • Re-run L6 light → all 4 wired flows produce non-empty trajectory + correct outcome

Stakes

This blocks #102 (v0.4.1 stable cut) — the framework that's supposed to replace manual L5 dogfood needs to actually work first. Until then, manual L5 stays the only validation.

Cost so far

~$0 (the 4 failed runs each spent 0 tokens because claude never invoked). The diagnostic re-run will be similarly cheap until we actually engage claude.

Metadata

Metadata

Assignees

Labels

BugSomething isn't workingPriority: HighHigh priority — blocks meaningful workflowsTestsTest infrastructure (L0-L6)

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions