From 4062099e050607576aeeb480accfe1141ce47ae5 Mon Sep 17 00:00:00 2001
From: NagyVikt
Date: Fri, 15 May 2026 22:36:14 +0200
Subject: [PATCH] feat(codex-fleet): tighten claude-supervisor classifier + replay harness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extracts the asking/blocked classifier from claude-supervisor.sh into a
sourceable pure-bash lib so the daemon and a new fixture-driven harness
share one implementation. Tightens three failure modes that produced
false positives (and missed real asks/blocks) under the prior logic:

  1. last_line_is_prompt no longer accepts bare ":$" as a waiting
     cursor; bare "?$" is admitted only when the line carries a known
     question lead-word (Continue/Approve/Should I/Do you want/Choose/
     Select/...).
  2. is_busy is anchored to the LAST non-empty line. A stale "Working ("
     in scrollback no longer masks a fresh interactive cursor below it.
  3. is_asking scopes ASK_PATTERN matching to the recent N lines AND
     requires the tightened last_line_is_prompt gate; both gates needed.

Extends BLOCKED_PATTERNS with the codex-fleet stuck states the
supervisor was previously deaf to: CONFLICT (content / merge conflict,
"error: uncommitted changes", "fatal: ", Permission denied (publickey),
"gh: command not found", "Bad credentials",
"MCP server not found|missing|unavailable", "429 Too Many Requests",
and the canonical "BLOCKED:" prefix.

Adds scripts/codex-fleet/test/test-claude-supervisor-classifier.sh + 24
pane-capture fixtures under
scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/.
Filename prefix encodes the expected label (busy|asking|blocked|quiet).
24 pass, 0 fail. Daemon --once --dry-run runs clean without tmux.

Cost: fewer ASK false positives → fewer sonnet/medium calls per tick.
Sonnet stays the workhorse for the remaining real asks; opus stays gated
on the now-more-accurate BLOCKED set. Strike guard unchanged.

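Illustration only: a minimal sketch of gates 1 and 2 as described above.
The function bodies, the abbreviated lead-word list, the `[Y/n]` check,
and the helper `last_nonempty_line` are assumptions reconstructed from
this message, not the code shipped in
lib/claude-supervisor-classifier.sh.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; names mirror the commit message, bodies are assumed.

# Print the last non-empty line of the text passed as $1.
# (last_nonempty_line is an illustrative helper, not part of the lib.)
last_nonempty_line() {
  local line last=""
  while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ [^[:space:]] ]] && last="$line"
  done <<< "$1"
  printf '%s\n' "$last"
}

# Gate 2: busy only when the LAST non-empty line carries a busy marker,
# so a stale "Working (" higher up in scrollback is ignored.
is_busy() {
  local last
  last="$(last_nonempty_line "$1")"
  [[ "$last" == *"Working ("* || "$last" == *"esc to interrupt"* ]]
}

# Gate 1: a bare trailing ":" never counts; a trailing "?" counts only
# when the same line carries a question lead-word (list abbreviated here).
last_line_is_prompt() {
  local last
  local yn_re='\[[Yy]/[Nn]\][[:space:]]*$'
  local q_re='\?[[:space:]]*$'
  local lead_re='(Continue|Approve|Should I|Do you want|Choose|Select)'
  last="$(last_nonempty_line "$1")"
  [[ "$last" =~ $yn_re ]] && return 0
  [[ "$last" =~ $q_re && "$last" =~ $lead_re ]]
}
```

Under these assumptions, a capture whose scrollback holds "Working (12s)"
but whose bottom line is "Overwrite file? [Y/n] " is not busy and does
pass the prompt gate, while a bottom line of "Reading file:" passes
neither.
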
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../.openspec.yaml | 2 +
 .../proposal.md | 63 ++
 .../spec.md | 89 +++
 .../tasks.md | 38 ++
 .../README.md | 30 +
 .../architect/.openspec.yaml | 9 +
 .../architect/README.md | 12 +
 .../architect/prompt.md | 34 ++
 .../architect/proposal.md | 15 +
 .../architect/specs/architect/spec.md | 10 +
 .../architect/tasks.md | 33 +
 .../checkpoints.md | 4 +
 .../coordinator-prompt.md | 41 ++
 .../critic/.openspec.yaml | 9 +
 .../critic/README.md | 12 +
 .../critic/prompt.md | 34 ++
 .../critic/proposal.md | 15 +
 .../critic/specs/critic/spec.md | 10 +
 .../critic/tasks.md | 33 +
 .../executor/.openspec.yaml | 9 +
 .../executor/README.md | 12 +
 .../executor/checkpoints.md | 4 +
 .../executor/prompt.md | 34 ++
 .../executor/proposal.md | 15 +
 .../executor/specs/executor/spec.md | 10 +
 .../executor/tasks.md | 33 +
 .../kickoff-prompts.md | 108 ++++
 .../open-questions.md | 6 +
 .../phases.md | 15 +
 .../planner/.openspec.yaml | 9 +
 .../planner/README.md | 12 +
 .../planner/plan.md | 65 ++
 .../planner/prompt.md | 34 ++
 .../planner/proposal.md | 15 +
 .../planner/specs/planner/spec.md | 10 +
 .../planner/tasks.md | 33 +
 .../summary.md | 8 +
 .../verifier/.openspec.yaml | 9 +
 .../verifier/README.md | 12 +
 .../verifier/prompt.md | 34 ++
 .../verifier/proposal.md | 15 +
 .../verifier/specs/verifier/spec.md | 10 +
 .../verifier/tasks.md | 33 +
 .../writer/.openspec.yaml | 9 +
 .../writer/README.md | 12 +
 .../writer/prompt.md | 34 ++
 .../writer/proposal.md | 15 +
 .../writer/specs/writer/spec.md | 10 +
 .../writer/tasks.md | 33 +
 scripts/codex-fleet/claude-supervisor.sh | 565 ++++++++++++++++++
 .../lib/claude-supervisor-classifier.sh | 228 +++++++
 ...sking__busy_in_scrollback_fresh_cursor.txt | 11 +
 .../asking__press_digit_to_continue.txt | 6 +
 .../asking__recommended_numbered_menu.txt | 7 +
 ...__stale_blocker_in_scrollback_menu_now.txt | 11 +
 .../asking__yn_lowercase_default.txt | 4 +
 .../asking__yn_uppercase_destructive.txt | 4 +
 .../blocked__bad_credentials.txt | 4 +
 .../blocked__fatal_git.txt | 3 +
 .../blocked__five_pct_limit.txt | 5 +
 .../blocked__mcp_server_missing.txt | 4 +
 .../blocked__merge_conflict.txt | 6 +
 .../blocked__permission_denied.txt | 4 +
 .../blocked__plan_subtask_not_found.txt | 5 +
 .../blocked__stale_claim_told_not_rescue.txt | 5 +
 .../blocked__uncommitted_changes.txt | 4 +
 .../busy__codex_working_active.txt | 6 +
 .../busy__esc_to_interrupt_last_line.txt | 5 +
 .../quiet__colon_ending_stale_continue.txt | 10 +
 .../quiet__do_you_want_in_narration.txt | 6 +
 .../quiet__empty_pane.txt | 2 +
 .../quiet__narrative_should_i_no_cursor.txt | 8 +
 ...quiet__numbered_list_narrative_summary.txt | 7 +
 .../quiet__prompt_ready_for_input.txt | 3 +
 .../quiet__worked_for_completion_footer.txt | 5 +
 .../test/test-claude-supervisor-classifier.sh | 72 +++
 76 files changed, 2151 insertions(+)
 create mode 100644 openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml
 create mode 100644 openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md
 create mode 100644 openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md
 create mode 100644 openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/specs/architect/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/architect/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/checkpoints.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/coordinator-prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/specs/critic/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/critic/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/checkpoints.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/specs/executor/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/executor/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/kickoff-prompts.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/open-questions.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/phases.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/plan.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/specs/planner/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/planner/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/summary.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/specs/verifier/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/verifier/tasks.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/.openspec.yaml
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/README.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/prompt.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/proposal.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/specs/writer/spec.md
 create mode 100644 openspec/plan/agent-claude-masterplan-claude-supervisor-classifier-audit-2026-05-15-22-25/writer/tasks.md
 create mode 100755 scripts/codex-fleet/claude-supervisor.sh
 create mode 100644 scripts/codex-fleet/lib/claude-supervisor-classifier.sh
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__busy_in_scrollback_fresh_cursor.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__press_digit_to_continue.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__recommended_numbered_menu.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__stale_blocker_in_scrollback_menu_now.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__yn_lowercase_default.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/asking__yn_uppercase_destructive.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__bad_credentials.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__fatal_git.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__five_pct_limit.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__mcp_server_missing.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__merge_conflict.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__permission_denied.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__plan_subtask_not_found.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__stale_claim_told_not_rescue.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/blocked__uncommitted_changes.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/busy__codex_working_active.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/busy__esc_to_interrupt_last_line.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__colon_ending_stale_continue.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__do_you_want_in_narration.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__empty_pane.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__narrative_should_i_no_cursor.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__numbered_list_narrative_summary.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__prompt_ready_for_input.txt
 create mode 100644 scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/quiet__worked_for_completion_footer.txt
 create mode 100755 scripts/codex-fleet/test/test-claude-supervisor-classifier.sh

diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml
new file mode 100644
index 0000000..9f70866
--- /dev/null
+++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-15
diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md
new file mode 100644
index 0000000..937d646
--- /dev/null
+++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md
@@ -0,0 +1,63 @@
+## Why
+
+`claude-supervisor.sh`'s asking/blocked classifier had two structural
+weaknesses that produced classifier cost and precision problems:
+
+1. The "is this a real interactive cursor?" gate (`last_line_is_prompt`)
+   accepted any line ending in `:` or `?` as a waiting cursor. Worker
+   tails routinely end with `Reading file:` mid-work, so an old
+   `Continue?` or `Should I` 40 lines back in the 80-line capture
+   was enough to call sonnet/medium and paste an answer codex never
+   asked for. Pure false-positive cost.
+2. `is_busy` grep'd the whole 80-line window for `Working (` or
+   `esc to interrupt`. A pane that finished `Working (12s)` 40 lines
+   ago and is now sitting at a fresh `[Y/n] ` cursor read as "busy"
+   and never reached the ask path; real asks slipped past the
+   supervisor entirely.
+
+Additionally, several real codex-fleet blockers (merge conflict,
+uncommitted-changes, fatal git, ssh permission-denied, MCP server
+missing, bad GH credentials, BLOCKED: prefix) were missing from
+`BLOCKED_PATTERNS`, so panes parked on these states classified as
+`quiet` and the supervisor stayed silent.
+
+## What Changes
+
+- Extract the classifier (BUSY/ASK/BLOCKED patterns + `is_busy`,
+  `is_asking`, `is_blocked`, `last_line_is_prompt`, `classify_tail`,
+  `tail_hash`) into a pure-bash library at
+  `scripts/codex-fleet/lib/claude-supervisor-classifier.sh` so the
+  daemon and a replay harness share one implementation.
+- Tighten `last_line_is_prompt`: drop the bare `[?:][[:space:]]*$`
+  rule. Bare-`?` is only admitted when the line carries a known
+  question lead-word. `:$` no longer counts.
+- Tighten `is_busy`: anchor BUSY_PATTERNS to the LAST non-empty line
+  only. codex rewrites the `Working (…)` footer in place; if the
+  worker is busy it's at the bottom. Stale `Working (` in scrollback
+  no longer masks a fresh interactive prompt.
+- Tighten `is_asking`: scope ASK_PATTERN matching to the recent N
+  non-empty lines (default 8 via `CLAUDE_SUPERVISOR_RECENT_LINES`)
+  AND require `last_line_is_prompt` to pass.
+- Extend `BLOCKED_PATTERNS` with the codex-fleet-specific stuck
+  states listed under "Why".
+- Add a fixture-driven replay harness at
+  `scripts/codex-fleet/test/test-claude-supervisor-classifier.sh`
+  with 24 pane-capture fixtures covering the false-positive,
+  missed-block, and previously-correct cases. Filename prefix
+  encodes the expected classification.
+
+## Impact
+
+- Cost: fewer ASK false positives → fewer sonnet/medium calls per
+  tick. Sonnet stays the workhorse for the remaining real asks.
+  Opus calls are gated on the (now more accurate) BLOCKED set;
+  strike guard caps per-pane spend.
+- Behavior: panes the supervisor used to ignore (real ask under
+  stale `Working (`, merge-conflict, MCP-missing, bad GH creds)
+  now classify correctly.
+- Risk: the harness pins the precision/recall trade-off — any
+  future loosening of the gates regresses the harness.
+- Surfaces touched: `claude-supervisor.sh` replaces its inline
+  classifier with `source`; new lib; new test + fixtures. No
+  daemon-state files, no plan-watcher changes, no cap-swap-daemon
+  changes.
diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md
new file mode 100644
index 0000000..fdbff47
--- /dev/null
+++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md
@@ -0,0 +1,89 @@
+## ADDED Requirements
+
+### Requirement: classifier lives in a pure-bash library
+
+The system SHALL keep the classifier (BUSY/ASK/BLOCKED pattern
+arrays, `is_busy`, `is_asking`, `is_blocked`, `last_line_is_prompt`,
+`classify_tail`, and `tail_hash`) at
+`scripts/codex-fleet/lib/claude-supervisor-classifier.sh`. The lib
+SHALL be safe to `source` with no side effects (no tmux calls, no
+claude calls, no file writes at source time). `claude-supervisor.sh`
+SHALL source this lib rather than inlining the classifier.
+
+#### Scenario: library is sourceable with no side effects
+- **WHEN** `scripts/codex-fleet/lib/claude-supervisor-classifier.sh`
+  is sourced from a bash script
+- **THEN** no tmux, claude, or filesystem-mutating command runs
+- **AND** `classify_tail`, `is_busy`, `is_asking`, `is_blocked`,
+  `last_line_is_prompt`, and `tail_hash` are defined
+
+### Requirement: classify_tail returns one of four labels
+
+`classify_tail ""` SHALL echo exactly one of
+`busy`, `asking`, `blocked`, `quiet`. `asking` SHALL outrank `blocked`
+when both conditions match: a pane that mentioned a stale blocker
+but is now showing an interactive menu wants an answer to the menu.
+
+#### Scenario: asking outranks blocked
+- **WHEN** the recent tail contains both a BLOCKED_PATTERN (e.g.,
+  stale-claim) and an ASK_PATTERN (e.g., a numbered menu with a
+  `(recommended)` option) AND `last_line_is_prompt` accepts the
+  bottom line
+- **THEN** `classify_tail` echoes `asking`
+
+### Requirement: busy is anchored to the last non-empty line
+
+`is_busy` SHALL match BUSY_PATTERNS only against the last non-empty
+line of the ANSI-stripped tail. A stale `Working (` or
+`esc to interrupt` earlier in scrollback SHALL NOT mask a fresh
+interactive cursor at the bottom.
+
+#### Scenario: stale Working in scrollback does not mask a fresh ask
+- **WHEN** the tail contains `Working (` nine lines from the bottom
+  AND the bottom line is a bare prompt sigil (`❯`) under a numbered
+  menu with `(recommended)` option
+- **THEN** `classify_tail` echoes `asking`, not `busy`
+
+### Requirement: last_line_is_prompt rejects bare-colon endings
+
+`last_line_is_prompt` SHALL NOT accept a bare trailing `:` as a
+waiting cursor. A bare trailing `?` SHALL be accepted only when the
+same line carries a known question lead-word (Continue, Approve,
+Proceed, Confirm, Apply, Should I, Do you want, Would you like,
+Which option/approach/one, Choose, Select, Pick, Need clarification,
+Need more …, Please clarify/confirm/choose/specify).
+
+#### Scenario: narrative status line ending in ":" does not trigger asking
+- **WHEN** the bottom line of the tail is `Reading file: …:` AND an
+  older `Continue?` appears earlier in scrollback
+- **THEN** `classify_tail` echoes `quiet`
+
+### Requirement: BLOCKED_PATTERNS cover codex-fleet stuck states
+
+`BLOCKED_PATTERNS` SHALL match each of: git merge conflict
+(`CONFLICT (content`), `error: uncommitted changes`, `fatal: ` git
+errors, `Permission denied (publickey)`, `gh: command not found`,
+`Bad credentials`, `MCP server (not found|missing|unavailable)`,
+`429 Too Many Requests`, and the canonical `BLOCKED:` prefix,
+in addition to the prior set (PLAN_SUBTASK_NOT_FOUND, stale-claim,
+told-not-to-rescue, less-than-5%-limit, etc.).
+
+#### Scenario: missing MCP server is classified as blocked
+- **WHEN** the tail contains `Error: MCP server colony not found in
+  the registered servers`
+- **THEN** `classify_tail` echoes `blocked`
+
+### Requirement: replay harness pins classifier behavior
+
+The system SHALL ship a replay harness at
+`scripts/codex-fleet/test/test-claude-supervisor-classifier.sh`
+that discovers every `*.txt` fixture under
+`scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/`,
+parses the expected label from the filename prefix
+(`