feat(codex-fleet): tighten claude-supervisor classifier + replay harness #116
Merged
NagyVikt merged 1 commit on May 15, 2026
Extracts the asking/blocked classifier from claude-supervisor.sh into a
sourceable pure-bash lib so the daemon and a new fixture-driven harness
share one implementation. Tightens three failure modes that produced
false positives (and missed real asks/blocks) under the prior logic:
1. last_line_is_prompt no longer accepts bare ":$" as a waiting cursor;
bare "?$" is admitted only when the line carries a known question
lead-word (Continue/Approve/Should I/Do you want/Choose/Select/...).
2. is_busy is anchored to the LAST non-empty line. A stale "Working ("
in scrollback no longer masks a fresh interactive cursor below it.
3. is_asking scopes ASK_PATTERN matching to the recent N lines AND
requires the tightened last_line_is_prompt gate — both gates needed.
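
The three gates above can be sketched roughly as follows. This is a minimal illustration, not the code in the lib: the function names mirror the ones named in this message, but the bodies, the shortened lead-word list, the stand-in ASK_PATTERN, and the default of 12 recent lines are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the tightened gates; not the shipped implementation.

# Subset of the lead-words listed above (assumption: the real list is longer).
LEAD_WORDS='Continue|Approve|Proceed|Confirm|Should I|Do you want|Choose|Select'
# Stand-in ask pattern; the real ASK_PATTERN is not shown in this PR.
ASK_PATTERN="(${LEAD_WORDS})"
# Window size for is_asking; the default of 12 is an assumption.
RECENT_LINES="${CLAUDE_SUPERVISOR_RECENT_LINES:-12}"

# Print the last non-empty line of the capture passed as $1.
last_nonempty_line() {
  printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1
}

# Gate: the bottom line ends in "?" AND carries a lead-word.
# A line ending in a bare ":" never qualifies.
last_line_is_prompt() {
  local last re_q='[?][[:space:]]*$' re_lead="(${LEAD_WORDS})"
  last="$(last_nonempty_line "$1")"
  [[ "$last" =~ $re_q && "$last" =~ $re_lead ]]
}

# Busy only when the LAST non-empty line is the "Working (" footer.
is_busy() {
  local last
  last="$(last_nonempty_line "$1")"
  [[ "$last" == *"Working ("* ]]
}

# Asking needs both gates: an ask pattern in the recent window
# and a live prompt on the bottom line.
is_asking() {
  printf '%s\n' "$1" | tail -n "$RECENT_LINES" | grep -Eq "$ASK_PATTERN" \
    && last_line_is_prompt "$1"
}
```

With this shape, a capture whose bottom line is "Reading file: src/heartbeat.rs:" is not an ask even if an answered "Continue?" sits a few lines above it, and a "Working (12s)" line higher up in scrollback does not make the pane busy.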
Extends BLOCKED_PATTERNS with the codex-fleet stuck states the
supervisor was previously deaf to: "CONFLICT (content)" / merge conflict,
"error: uncommitted changes", "fatal: <git>", "Permission denied
(publickey)", "gh: command not found", "Bad credentials",
"MCP server <name> not found|missing|unavailable", "429 Too Many
Requests", and the canonical "BLOCKED:" prefix.
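
As an illustration of how such a pattern set might be matched, here is a small sketch. The alternation below is assembled from the strings listed above; the exact regexes in the lib, and whether they run over the whole capture or only a recent window, are not shown in this PR, so treat both as assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: one extended-regex alternation over the new stuck states.
# The shipped BLOCKED_PATTERNS may be structured differently.
BLOCKED_PATTERNS='CONFLICT \(content\)|Merge conflict'
BLOCKED_PATTERNS+='|error: uncommitted changes|^fatal: '
BLOCKED_PATTERNS+='|Permission denied \(publickey\)'
BLOCKED_PATTERNS+='|gh: command not found|Bad credentials'
BLOCKED_PATTERNS+='|MCP server .*(not found|missing|unavailable)'
BLOCKED_PATTERNS+='|429 Too Many Requests|^BLOCKED:'

# Succeeds when any line of the capture in $1 matches a blocker pattern.
# grep anchors "^" per line, so "^fatal: " only fires at a line start.
is_blocked() {
  printf '%s\n' "$1" | grep -Eq "$BLOCKED_PATTERNS"
}
```

The `.*` in the MCP alternative is what lets a server name sit between the prefix and the failure word, as in "MCP server colony not found".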
Adds scripts/codex-fleet/test/test-claude-supervisor-classifier.sh +
24 pane-capture fixtures under
scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/.
Filename prefix encodes the expected label (busy|asking|blocked|quiet).
24 pass, 0 fail. Daemon --once --dry-run runs clean without tmux.
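
A fixture-driven harness of this kind could look roughly like the sketch below. It assumes the label is the part of the filename before a "__" separator, as the fixture names in this PR suggest. The classify_tail here is a toy stand-in so the block runs on its own; expected_label and run_fixtures are hypothetical helper names, not the shipped ones.

```bash
#!/usr/bin/env bash
# Hypothetical harness sketch; the real script sources the classifier lib instead.

# Print the label encoded before the "__" separator in a fixture filename.
expected_label() {
  local base
  base="$(basename "$1")"
  printf '%s\n' "${base%%__*}"
}

# Toy stand-in classifier (assumption): looks only at the last non-empty line.
classify_tail() {
  local last
  last="$(printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1)"
  case "$last" in
    *"Working ("*) echo busy ;;
    BLOCKED:*)     echo blocked ;;
    *)             echo quiet ;;
  esac
}

# Classify every *.txt fixture in directory $1 and compare with its label.
# Prints a summary line and returns non-zero if any fixture fails.
run_fixtures() {
  local dir="$1" pass=0 fail=0 f want got
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue
    want="$(expected_label "$f")"
    got="$(classify_tail "$(cat "$f")")"
    if [ "$got" = "$want" ]; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
      echo "FAIL $f: want=$want got=$got"
    fi
  done
  echo "$pass pass, $fail fail"
  [ "$fail" -eq 0 ]
}
```

Keeping the expected label in the filename means adding a regression case is a single new file, with no separate manifest to update.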
Cost: fewer ASK false positives → fewer sonnet/medium calls per tick.
Sonnet stays the workhorse for the remaining real asks; opus stays
gated on the now-more-accurate BLOCKED set. Strike guard unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Tightens the `claude-supervisor.sh` asking/blocked classifier so it stops paying Sonnet for false positives and stops ignoring real blockers. Extracts the classifier into a pure-bash lib that the daemon and a new fixture-driven replay harness both consume.

- `scripts/codex-fleet/lib/claude-supervisor-classifier.sh` (new, sourceable)
- `scripts/codex-fleet/claude-supervisor.sh` (replaces inline classifier with `source`)
- `scripts/codex-fleet/test/test-claude-supervisor-classifier.sh` (new) + 24 fixtures under `scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/`
- `openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/` (proposal + spec + tasks)

Audit: top 3 failure modes
The supervisor has never produced a real `metrics.tsv` on disk (no `/tmp/claude-viz/claude-supervisor/metrics.tsv` exists yet), so this audit is grounded in the classifier's structural failure modes rather than replayed production rows. Each failure mode is pinned by a fixture under `scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/`; the harness asserts the post-fix label.

FM-A: bare `:$` cursor admits scrollback `Continue?` as a live ask (false positive)

The prior `last_line_is_prompt` accepted any line ending in `:` or `?` as a waiting cursor: `'[?:][[:space:]]*$'`

Worker tails routinely end with `Reading file: rust/fleet-launcher/src/heartbeat.rs:` mid-work. With that line at the bottom and an older `Continue? yes — answered earlier` ten lines back in the 80-line capture, the prior `is_asking` returned true and paid `--model sonnet --effort medium` to paste an answer codex never asked for. Pure cost.

Fixture: `quiet__colon_ending_stale_continue.txt`

Fix: drop the bare `[?:][[:space:]]*$` rule. Bare-`?` is admitted only when the same line carries a known question lead-word (`Continue|Approve|Proceed|Confirm|Apply|Should I|Do you want|Would you like|Which option/approach/one|Choose|Select|Pick|Need clarification|Need more …|Please clarify/confirm/choose/specify`). Bare-`:` no longer counts at all.

FM-B: stale `Working (` in scrollback masks fresh asks (false negative)

`is_busy` previously grep'd the entire 80-line window. A pane that finished `Working (12s)` 40 lines ago and is now sitting on a fresh menu read as `busy` → skipped entirely. Real asks slipped past the supervisor.

Fixture: `asking__busy_in_scrollback_fresh_cursor.txt`

Fix: anchor `is_busy` to the LAST non-empty line only. codex rewrites the `Working (…)` footer in place; if the worker is genuinely busy, that line is the bottom of the capture. Once the turn completes, codex replaces it with `Worked for …` and the pane is no longer busy.

FM-C: `BLOCKED_PATTERNS` missed real codex-fleet stuck states (false negative)

The prior set only covered the plan-side blockers (`PLAN_SUBTASK_NOT_FOUND`, `stale-claim`, `told me not to rescue`, `less than 5% of your 5h limit`, `Blocked.`). Real production stuck states the supervisor stayed silent on:

- `CONFLICT (content): Merge conflict in …`
- `error: uncommitted changes`
- `fatal: unable to access 'https://github.com/…'` (any `^fatal:` from git)
- `Permission denied (publickey)` (ssh key bust on push)
- `gh: command not found`
- `Bad credentials` (GH token rejected)
- `MCP server <name> (not found|missing|unavailable)`
- `429 Too Many Requests` (cap-swap normally catches this; supervisor is the safety net)
- `BLOCKED:` (canonical Guardex blocker prefix)

Fixtures: `blocked__merge_conflict.txt`, `blocked__uncommitted_changes.txt`, `blocked__fatal_git.txt`, `blocked__permission_denied.txt`, `blocked__mcp_server_missing.txt`, `blocked__bad_credentials.txt`.

Fix: extend `BLOCKED_PATTERNS` with the patterns above; `MCP server .*(not found|...)` admits the server name token in the middle (`MCP server colony not found …`).

Logic changes
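- `last_line_is_prompt`, bare-`:$` rule: previously accepted; now dropped (FM-A).
- `last_line_is_prompt`, bare-`?$` rule: previously accepted; now requires a question lead-word on the same line (FM-A).
- `last_line_is_prompt`, numbered-list rule: `N) item` at bottom; `(recommended)` or `(default)` tag on same line.
- `is_busy` scope: previously the whole 80-line window; now the last non-empty line only (FM-B).
- `is_asking` scope: `ASK_PATTERN` is matched against the recent N lines only (`CLAUDE_SUPERVISOR_RECENT_LINES`), and the tightened `last_line_is_prompt` gate must also pass.
- `BLOCKED_PATTERNS`: extended with the stuck states listed under FM-C.

Cost impact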
The Sonnet/Opus tier split is unchanged: `state=asking` still routes to `--model sonnet --effort medium`, `state=blocked` still routes to `--model claude-opus-4-7 --effort high`. Both still share the JSON-schema + prompt-cache prefix. Effect of the changes:

- Fewer `asking` false positives → fewer Sonnet calls per tick (the dominant cost line).
- Broader `blocked` detection may produce slightly more Opus calls, but those are precisely the panes that previously sat stuck forever: the supervisor was undercounting the genuinely-stuck class.

Replay harness
Fixture naming: `<expected-label>__<short-name>.txt`. Labels: `busy | asking | blocked | quiet`. The harness sources the pure-bash lib (no daemon side effects) and runs `classify_tail` per fixture.

Test plan
- `bash -n` clean on lib, harness, daemon.
- `bash scripts/codex-fleet/test/test-claude-supervisor-classifier.sh`: 24 pass, 0 fail.
- `claude-supervisor.sh --once --dry-run` runs without tmux (no-op tick, rc=0).
- `openspec validate agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25 --type change --strict`: valid.

🤖 Generated with Claude Code