diff --git a/openspec/plans/codex-fleet-orchestration-v2-2026-05-14/design.md b/openspec/plans/codex-fleet-orchestration-v2-2026-05-14/design.md new file mode 100644 index 0000000..4a27724 --- /dev/null +++ b/openspec/plans/codex-fleet-orchestration-v2-2026-05-14/design.md @@ -0,0 +1,460 @@ +# codex-fleet orchestration v2 — design + +Three items deferred from the 2026-05-14 improvement set because they need +cross-system work. Each section: problem → proposal → minimal viable cut → +acceptance → risks. All three are independent in spec but partially ordered +in shipping (see each section's *Scheduling order* subsection). + +Sibling artifacts in the same plan slug touch overlays, dispatch UX, and +ticker hygiene — do not duplicate that scope here. This doc is the single +source of truth for the **pull / orchestrator / dispatch** layer only. + +--- + +## 1. Event-driven worker pulls — kill poll. workers wake on real events. + +### Problem + +Every codex pane runs the loop from `scripts/codex-fleet/worker-prompt.md`: + +``` +ready = mcp__colony__task_ready_for_agent({ agent, limit: 1 }) +if empty: sleep 60; goto 2 +``` + +`worker-prompt.md` also pins step 1 to `mcp__colony__hivemind_context` on +boot. In practice the hivemind probe leaks into the steady-state loop on +some panes (the model re-fetches context when its working memory drifts), +so each idle pane is ~1 `task_ready_for_agent` call per 60s, occasionally +plus a `hivemind_context` re-fetch. + +Rough magnitude on the default 8-pane fleet, idle: + +- 8 panes × 1 ready-call / 60s = 8 calls/min = 480/hour +- + drift-rate `hivemind_context` ≈ 1 call/pane/5min = 96/hour +- Total floor when nothing is queued ≈ **~575 Colony reads/hour, doing nothing** + +Each call is cheap on Colony, but every read is a fresh context inflation +on the pane side (the model re-reads the tool result, the surrounding +prompt, and any working-memory delta). On `model_reasoning_effort=xhigh` +that idle floor is the dominant fleet cost when no plans are publishing. + +### Proposal + +Replace the 60s poll with a per-agent event source that the worker blocks +on until Colony has something for that agent. Two flavours: + +1. **MCP streaming.** New tool `mcp__colony__task_ready_stream({ agent })` + that holds the call open and emits a single `TaskReady` event when a + claim is available. Cleaner: rides the existing MCP transport, gets + auth + cancellation for free. Costs a Colony server change (new + long-lived handler, plus a per-agent subscriber map). +2. **File-tail.** Colony appends one JSON line per ready-event to + `$COLONY_HOME/queue/.jsonl`. Worker runs `tail -F` over MCP + shell or via a thin filesystem-watching MCP tool. No new long-lived + Colony connection, just a small append-on-claim hook. + +Both shapes preserve today's worker loop *semantics* — the worker still +calls `task_ready_for_agent` after waking to claim authoritatively. The +event only carries `{ task_id, plan_slug, sub_idx, ready_at }` as a wake +signal, not a claim grant. Colony stays the source of truth. + +### Minimal viable cut + +**Pick file-tail for v0.** Reasoning: + +- Smaller diff to Colony: one append-on-`task_make_available` hook inside + the existing claim-state transition, plus a startup truncate. No new + MCP handler, no streaming/cancellation correctness window. +- Per-agent file gives natural multi-reader fan-out — a future cockpit + process can tail the same file for telemetry without competing with the + worker. 
+- `tail -F` survives Colony restart automatically (file-renamed-on-rotate
+  is the common rotate path; `-F` re-opens).
+- If file-tail proves too brittle (FS event reliability across overlayfs
+  or NFS), we promote to MCP streaming with the same event shape; only
+  the transport changes, the worker prompt grammar does not.
+
+Concretely v0 is three things:
+
+1. Colony writes `$COLONY_HOME/queue/<agent>.jsonl` lines on every
+   `available` transition that names the agent (direct claim, broadcast
+   plan publish, hand-off-to-agent).
+2. `worker-prompt.md` boot adds: `exec tail -F "$COLONY_HOME/queue/$AGENT.jsonl" | `
+   where the loop reads lines and only then runs `task_ready_for_agent`.
+3. A poll backstop stays at **300s** (not 60s) for safety against missed
+   FS events. Five-minute cap is acceptable — the wakeups handle the hot
+   path.
+
+### Acceptance
+
+- With 8 idle panes and no plan publishes, total Colony reads over 10
+  minutes drop from ~80 (today's 1/min/pane floor) to **≤ 16** (the
+  300s backstop firing roughly twice per pane).
+- A `task_post(kind: 'queue', content: 'plan=X/sub-N agent=Y')`
+  published into Colony triggers the matching pane to claim within
+  **≤ 2s** wallclock, measured by `(claim_ts - publish_ts)` in Colony
+  observation rows.
+- Worker prompt grammar is unchanged for any branch *after* wake —
+  i.e. the only behavioural change is *when* the worker calls
+  `task_ready_for_agent`, not *what* it does after.
+
+### Risks
+
+- **Stale watchers after Colony restart.** If Colony deletes
+  `$COLONY_HOME/queue/` on cold boot, every worker's `tail -F` keeps the
+  old inode and silently never wakes. Mitigation: Colony truncates
+  (`> $file`) rather than `rm`, and the 300s backstop catches the gap.
+- **Per-agent file paths leaking plan slugs.** `queue/<agent>.jsonl`
+  becomes a side-channel for whoever can read `$COLONY_HOME`. The
+  worker pane already has read access; the risk is third-party panes
+  (e.g. a logs window) accidentally tailing it. Mitigation: mode `0600`
+  on the file, owner == colony service user.
+- **`tail -F` reliability across filesystem boundaries.** If
+  `$COLONY_HOME` ends up on a bind-mount or overlayfs (e.g. inside a
+  future container migration), `inotify` may miss events. Mitigation:
+  document the supported FS classes in the worker prompt and keep the
+  poll backstop alive forever.
+- **Wake storms on broadcast publish.** A plan publish that lists N
+  agents wakes all N panes at once; the first to call `task_claim` wins,
+  the rest no-op. This is fine but worth measuring — if the no-op claims
+  show up as cost, switch the broadcast publish to emit one line per
+  agent on a small jittered fan-out (50–250ms).
+
+### Scheduling order
+
+- Ships **first**. §3 (send-keys-free dispatch) depends directly on the
+  per-agent queue file existing. §2 (Rust orchestrator) is independent
+  but is much smaller once §1 lands, because the orchestrator can read
+  the same queue file instead of re-implementing a Colony event probe.
+
+---
+
+## 2. Consolidated Rust orchestrator — one binary owns the ticker zoo.
+
+### Problem
+
+`scripts/codex-fleet/full-bringup.sh` brings up a sibling tmux session
+`fleet-ticker` with one window per daemon.
Today's roster (`lines 365-403` +in `full-bringup.sh` and surrounding): + +- `fleet-tick` — 15s clock + heartbeat row +- `cap-swap` — rotates per-pane account caps on quota +- `state-pump` — flattens Colony into `/tmp/claude-viz/*.json` +- `force-claim` (+ `claim-trigger`) — dispatch idle panes via send-keys +- `claim-release-supervisor` — calls `colony rescue stranded --apply` every 60s +- `stall-watcher` — flags claims with no progress in N minutes +- `supervisor` (optional) — re-bringup on daemon death + +Six-plus separate bash daemons, each scanning every `openspec/plans/*/plan.json` +on its own 15–60s interval. Concrete pain: + +- Duplicated work. `state-pump`, `force-claim`, `stall-watcher`, and + `claim-release-supervisor` all walk the same plan tree. +- Hard to reason about ordering. If `claim-release-supervisor` reaps a + stranded claim 100ms before `force-claim` reads it, the dispatch fires + on a task that's already back in the queue and races itself. Today + this is silently absorbed by Colony's idempotent claim API, but + observability is bad — you see the race only in Colony observation + spam. +- Crash blast radius. A bash `set -eo pipefail` daemon that hits a + transient `python3` JSON parse error exits and stays dead until + `supervisor` restarts it (if `supervisor` is even on). Meanwhile its + invariant (e.g. "no claim older than 30min stays stranded") silently + lapses. + +### Proposal + +A new bin `rust/fleet-orchestrator` that subsumes the daemons into a +single event loop. One process, one log, one PID to babysit. The +orchestrator owns: + +- claim-release (port of `claim-release-supervisor.sh`) +- stall detection (port of `stall-watcher.sh`) +- dispatch (port of `force-claim.sh` + `claim-trigger.sh`) +- state pump (port of `state-pump.sh`) +- plan-complete detection + +It does *not* own: + +- tmux pane geometry / window layout → `style-tabs.sh` keeps that +- cap rotation → `cap-swap` stays a sidecar; ties to + codex CLI state we don't model +- bringup orchestration → `full-bringup.sh` stays the entrypoint + +Single 1Hz tick + event triggers. Structured logging to one file (JSONL +to `/tmp/claude-viz/orchestrator.jsonl`) so the cockpit can tail-render it. + +### Event loop (ASCII) + +``` + ┌──────────────────────────────────────────────┐ + │ fleet-orchestrator │ + │ │ + 1Hz tick ───►│ tick() │ + colony evt ──►│ ├─ scan_plans() (cached, mtime-keyed) │ + queue line ──►│ ├─ reap_stranded() (≥30min claimed → rel) │ + │ ├─ detect_stall() (no progress ≥N min) │ + │ ├─ dispatch_ready() (write queue.jsonl) │ + │ ├─ pump_state() (flatten → viz json) │ + │ └─ emit_log() (one JSONL per action) │ + │ │ + │ state cache: │ + │ - plans: slug → (mtime, parsed_json) │ + │ - claims: task_id → (agent, claimed_at) │ + │ - panes: idx → (state, last_seen) │ + └──────────────────────────────────────────────┘ + │ │ + ▼ ▼ + $COLONY_HOME/ /tmp/claude-viz/ + queue/*.jsonl *.json (state pump) + orchestrator.jsonl (log) +``` + +The Colony-event input lands as a thin `inotify` reader on `$COLONY_HOME` +(reusing the §1 queue-file path) plus a periodic full reconcile every +60s. Pane state input is `tmux list-panes` + `tmux capture-pane` calls +identical to today's bash — the bash regex moves into Rust, the tmux +shell-out stays. + +### Minimal viable cut + +Four phases. Each phase deletes a bash daemon end-to-end before the +next phase starts. No "ports half-done; both daemons running in +parallel forever" state. 
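+Before the phase breakdown, a minimal sketch of the loop shape, assuming
+tokio. The subsystem functions mirror the boxes in the event-loop diagram
+above; their names and signatures are illustrative, not a committed module
+layout.
+
+```rust
+use std::time::Duration;
+use tokio::time::interval;
+
+// Illustrative stubs, one per subsystem box in the diagram above.
+async fn reap_stranded() {}   // colony rescue stranded --apply
+async fn detect_stall() {}    // task_post kind=stall on stale claims
+async fn dispatch_ready() {}  // append wake lines to queue/<agent>.jsonl
+async fn pump_state() {}      // flatten plans into /tmp/claude-viz/*.json
+
+#[tokio::main]
+async fn main() {
+    let mut tick = interval(Duration::from_secs(1));
+    let mut reconcile = interval(Duration::from_secs(60));
+    loop {
+        tokio::select! {
+            _ = tick.tick() => {
+                // Each subsystem runs as its own task; a panic is caught at
+                // the JoinHandle and logged (stderr here, orchestrator.jsonl
+                // in the real bin), so the loop keeps going.
+                for handle in [
+                    tokio::spawn(reap_stranded()),
+                    tokio::spawn(detect_stall()),
+                    tokio::spawn(dispatch_ready()),
+                    tokio::spawn(pump_state()),
+                ] {
+                    if let Err(err) = handle.await {
+                        eprintln!("{{\"event\":\"subsystem_panic\",\"err\":\"{err}\"}}");
+                    }
+                }
+            }
+            _ = reconcile.tick() => {
+                // Periodic full reconcile against Colony, independent of the
+                // queue-file events that feed dispatch_ready().
+            }
+        }
+    }
+}
+```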
+ +**Phase 1 — claim-release.** Simplest: 60s scan, call +`colony rescue stranded --apply`, log. No dispatch, no tmux. Single +file `src/main.rs` + `claim_release.rs`. Replaces `claim-release-supervisor.sh`. + +**Phase 2 — stall-watcher.** Adds the per-claim age check. Same input +(Colony plan scan), different output (Colony `task_post` with kind=stall). +Replaces `stall-watcher.sh`. + +**Phase 3 — dispatch.** Depends on §1 file-tail being live. Replaces +`tmux send-keys` in `force-claim.sh` with `append to queue/.jsonl` +(see §3). The orchestrator becomes the sole writer for the dispatch lane. +This is the phase where the tmux coupling actually goes away. + +**Phase 4 — state pump + plan-complete detector.** Last because it's +the loudest output (a JSON file written every 5s) and the most +likely to surface latent dashboard assumptions about the file shape. +Cockpit (the ratatui Rust UI) gets validated against the new writer +before this phase closes. + +### Acceptance + +- `full-bringup.sh` ticker session shrinks from 7+ windows to **1** + (`orchestrator`) by end of phase 4. `cap-swap` may remain as a + sidecar (explicitly out of scope above), but every plan-walking + daemon is gone. +- Colony query rate (counted at the Colony server) drops by **≥ 50%** + after phase 4, measured by `task_*` call count over a 10-minute + idle window vs. the today-baseline. +- Orchestrator restart preserves the **no-stale-claim invariant**: + killing the process for 30s and restarting must not produce a + stranded claim older than the existing 30-minute reap threshold. + Verified by a restart-during-claim integration test. +- One log file. `grep -c '"event":' /tmp/claude-viz/orchestrator.jsonl` + grows monotonically with every action. + +### Risks + +- **Porting bash regex to Rust types.** The dispatcher's + `idle_panes()` is a sed/grep stack matching against a pane tail. The + Rust port must keep parity with the exact ANSI-stripped regexes + (`Working \([0-9]+[ms]`, `Reviewing approval request`, + `^› (Find and fix|…)`). Mitigation: lift those into a single + `pane_state.rs` module with a golden-file test corpus of captured + pane tails. Do this before phase 3, even though parity only matters + in phase 3. +- **`FORCE_CLAIM_WINDOW=overview` tmux coupling.** Phase 1 and 2 do + not touch tmux, but phase 3 does — the orchestrator has to know + *which window* contains the codex panes to read their state. That + knob can't be hard-coded; it has to remain an env override matching + today's bash defaults. Mitigation: a small `tmux_target.rs` + ingesting `FORCE_CLAIM_SESSION`, `FORCE_CLAIM_WINDOW`, + `CODEX_FLEET_REPO_ROOT` exactly as today. +- **Single-process blast radius.** Today, a `state-pump` bug doesn't + kill claim-release. After consolidation it might. Mitigation: each + subsystem runs in its own task with `tokio::select!` + per-task + panic catch; a panic logs and continues the loop instead of + aborting the process. +- **The `supervisor` daemon's job becomes ours.** Today there's a + bash `supervisor` window that restarts dead daemons. After phase 4 + there's nothing to restart — only the orchestrator itself. A + systemd-style restart-on-exit wrapper at the `tmux new-window` + layer covers this without re-introducing the bash zoo. + +### Scheduling order + +- Phase 1 and Phase 2 are **independent** of §1 and §3 — ship them + first to capture the consolidation win even if §1 stalls. 
+- Phase 3 **blocks on §1** (file-tail queue must exist) and **co-ships
+  with §3** (deleting `force-claim.sh` is the §3 acceptance gate).
+- Phase 4 is last and independent.
+
+---
+
+## 3. Send-keys-free dispatch — control plane stops typing.
+
+### Problem
+
+`scripts/codex-fleet/force-claim.sh` lines 165–179 dispatch tasks by:
+
+```bash
+tmux send-keys -t "$SESSION:$WINDOW.$pane_idx" -l "$prompt"
+tmux send-keys -t "$SESSION:$WINDOW.$pane_idx" Enter
+```
+
+This types a literal claim prompt into the pane's stdin. Concretely it
+fails or misroutes when:
+
+- the pane is in tmux **copy-mode** or **scrollback** — keystrokes go
+  into the selection cursor, not the codex CLI
+- the pane is showing an **approval request** — keystrokes accept or
+  reject the prompt rather than dispatching the next task
+- the pane is mid-`Working (…)` — `idle_panes()` excludes these, but
+  there's a race window between capture and send
+- the status-line redraws mid-paste — `send-keys -l` is literal mode,
+  so the prompt survives, but the trailing `Enter` can hit a stale
+  buffer position on slow paints
+
+The deeper problem is architectural: the control plane is coupled to
+the presentation surface. Anything that changes the pane's UI state
+(future ratatui rewrite of the worker pane, a kitty graphics overlay,
+a tmux popup) becomes a dispatch correctness concern.
+
+### Proposal
+
+Once §1 ships, the worker tails `$COLONY_HOME/queue/<agent>.jsonl` and
+processes lines as wake events. The orchestrator (§2, Rust) appends to
+that queue file instead of typing into the pane. No more `send-keys`.
+
+Wire shape:
+
+```
+orchestrator                          worker pane
+─────────────                         ────────────
+detect ready task ─────►              tail -F queue/<agent>.jsonl
+                                           │
+append JSONL line                          ▼
+{                                      read line
+  "task_id": "...",                        │
+  "plan_slug": "...",                      ▼
+  "sub_idx": 3,              mcp__colony__task_ready_for_agent
+  "ready_at": "..."                        │
+}                                          ▼
+                                      claim, work, finish
+```
+
+The line itself is **a wake signal, not a claim grant** — same shape as
+§1's event. Colony stays the source of truth on who-owns-what; the
+queue file just tells the worker "stop sleeping, go ask Colony now."
+
+### Minimal viable cut
+
+Strictly dependent on §1 file-tail being live. Then it's a one-line
+change in the Rust orchestrator's dispatch path (§2 phase 3):
+
+- **Before:** `tmux send-keys -t "$SESSION:$WINDOW.$pane_idx" -l "$prompt"`
+- **After:** `append_jsonl("$COLONY_HOME/queue/<agent>.jsonl", event)`
+
+Followed by:
+
+1. Delete `scripts/codex-fleet/force-claim.sh`.
+2. Delete `scripts/codex-fleet/claim-trigger.sh` (its job is now the
+   orchestrator's `inotify` reader).
+3. Update `full-bringup.sh` to stop opening the `force-claim` window.
+4. Update `worker-prompt.md` to reflect the new wake source (one-line
+   edit at step 2 of the Loop section).
+
+### Acceptance
+
+- `scripts/codex-fleet/force-claim.sh` no longer exists in the tree.
+- `grep -rn 'send-keys' scripts/codex-fleet/` returns **only**
+  cosmetic / presentation uses — `style-tabs.sh` and similar — and
+  zero control-plane uses.
+- A dispatch fired while the target pane is in copy-mode still
+  triggers a claim within the worker's next loop iteration (≤ 2s),
+  proving control plane is decoupled from pane UI state.
+- The `force-claim` tmux window is gone from `full-bringup.sh`'s
+  ticker layout.
+
+### Risks
+
+- **Queue file growth.** Today's dispatch is fire-and-forget; if it
+  becomes append-only file lines, the file grows.
Mitigation: the + worker truncates the file after reading each line, OR the + orchestrator rotates the file when it exceeds a small threshold + (e.g. 1 MiB). Pick truncate-on-read for v0 — simpler, only one + writer concern. +- **Missing acks.** With `send-keys`, success was visually obvious: + the prompt appeared in the pane. With JSONL append, the dispatch + side can't tell whether the worker actually woke. Mitigation: the + worker emits a Colony `task_post(kind: 'wake', evidence: ready_at)` + on consume, and the orchestrator measures the dispatch→wake + latency. If a wake doesn't arrive within 5s, retry-append (same + line, same `ready_at` — idempotent on the worker side). +- **Concurrent writers.** Colony itself writes to `queue/.jsonl` + for native ready events (§1); the orchestrator also writes for + dispatch decisions. Two writers on one append-only file is safe + on POSIX **only** for writes ≤ `PIPE_BUF` (typically 4096 bytes) + with `O_APPEND`. Mitigation: cap each JSONL line at 1 KiB and + always open with `O_APPEND`. Lines that would exceed 1 KiB get + rejected at write-time and logged — the wake event payload is + small enough that this is a hard ceiling, not a soft one. +- **Direct-typing escape hatch.** Operators sometimes want to paste + a one-off prompt into a pane to override the orchestrator (the + pre-existing approved flow documented in + `feedback_gx_fleet_dispatch_authorized`). Removing `send-keys` + from the control plane must not remove the *operator's* ability + to do this manually. Mitigation: keep `tmux send-keys` working + at the operator's shell — the deletion is of `force-claim.sh` + and its in-process automation, not of tmux's ability to receive + keystrokes. + +### Scheduling order + +- Strictly **after §1** (queue file must exist). +- Co-ships with **§2 phase 3** (the Rust dispatch port is where the + `send-keys` line gets replaced). +- Independent of §2 phases 1, 2, and 4. + +--- + +## Cross-cutting notes + +### Shipping order, end-to-end + +``` +§1 (file-tail queue) + └─► §2 phase 3 (dispatch port) + └─► §3 (delete force-claim.sh) + +§2 phase 1 (claim-release) ── independent, ship first or in parallel +§2 phase 2 (stall-watcher) ── after phase 1 +§2 phase 4 (state pump) ── last; cockpit gate +``` + +### Out of scope + +- Worker pane visual rewrite (ratatui-based codex pane). Tracked + separately under `codex-fleet-overlays-phase5-2026-05-14`. +- Cap rotation (`cap-swap`). Stays bash for now; binds to codex CLI + internals we don't model in Colony. +- Colony server's `task_ready_for_agent` semantics. Unchanged. This + doc is purely about *when* workers call it and *who* triggers + that call. + +### Verification gates per item + +- §1: a 10-minute idle observation window with Colony read counts + before/after, plus a one-shot end-to-end "publish → claim ≤ 2s" + measurement. +- §2: phase-by-phase, the deleted bash daemon's invariants + restated as a Rust integration test (stranded reap, stall + detection, dispatch fan-out, state-pump output diff against + the bash baseline). +- §3: `grep` acceptance above, plus a copy-mode-during-dispatch + manual reproduction confirming the worker still wakes. 
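+### Appendix: dispatch append-path sketch (§2 phase 3 / §3)
+
+Non-normative. A minimal Rust sketch of the queue append described in §3,
+assuming std only; it shows the `O_APPEND` open plus the 1 KiB line cap from
+the concurrent-writers mitigation. The event fields mirror the wire shape
+above; `WakeEvent` and `append_wake` are illustrative names, not a committed
+API.
+
+```rust
+use std::fs::OpenOptions;
+use std::io::{self, Write};
+
+// Wake signal, not a claim grant; mirrors the §1/§3 event shape.
+struct WakeEvent<'a> {
+    task_id: &'a str,
+    plan_slug: &'a str,
+    sub_idx: u32,
+    ready_at: &'a str,
+}
+
+// Hard ceiling from the §3 risk list so appended lines stay small.
+const MAX_LINE_BYTES: usize = 1024;
+
+fn append_wake(queue_path: &str, ev: &WakeEvent) -> io::Result<()> {
+    // Sketch: assumes field values are already JSON-safe; a real writer
+    // would serialize with serde_json instead of format!.
+    let line = format!(
+        "{{\"task_id\":\"{}\",\"plan_slug\":\"{}\",\"sub_idx\":{},\"ready_at\":\"{}\"}}\n",
+        ev.task_id, ev.plan_slug, ev.sub_idx, ev.ready_at
+    );
+    if line.len() > MAX_LINE_BYTES {
+        // Reject-and-log rather than split: a torn line would corrupt the tailer.
+        return Err(io::Error::new(io::ErrorKind::InvalidData, "wake line exceeds 1 KiB cap"));
+    }
+    let mut f = OpenOptions::new()
+        .create(true)
+        .append(true) // O_APPEND, so two writers interleave whole lines in practice
+        .open(queue_path)?;
+    f.write_all(line.as_bytes()) // one write call per line, never split across writes
+}
+```
+
+The worker side needs nothing beyond the `tail -F` reader already quoted in
+§1's minimal cut.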
diff --git a/openspec/plans/codex-fleet-overlays-phase5-2026-05-14/plan.json b/openspec/plans/codex-fleet-overlays-phase5-2026-05-14/plan.json index ad4b30a..3bef92d 100644 --- a/openspec/plans/codex-fleet-overlays-phase5-2026-05-14/plan.json +++ b/openspec/plans/codex-fleet-overlays-phase5-2026-05-14/plan.json @@ -138,6 +138,6 @@ "spec_change_path": "/home/deadpool/Documents/codex-fleet/openspec/changes/codex-fleet-overlays-phase5-2026-05-14/CHANGE.md", "auto_archive": false }, - "created_at": "2026-05-14T16:17:20.446Z", - "updated_at": "2026-05-14T16:17:20.446Z" + "created_at": "2026-05-14T17:34:07.483Z", + "updated_at": "2026-05-14T17:34:07.483Z" } diff --git a/scripts/codex-fleet/accounts.example.yml b/scripts/codex-fleet/accounts.example.yml index fce7f4e..6fc7c08 100644 --- a/scripts/codex-fleet/accounts.example.yml +++ b/scripts/codex-fleet/accounts.example.yml @@ -8,24 +8,49 @@ # # Copy this file to `accounts.yml` and adjust to your real accounts: # cp scripts/codex-fleet/accounts.example.yml scripts/codex-fleet/accounts.yml +# +# Schema (per account): +# id short pane id (kebab/snake) +# email account auth filename in ~/.codex/accounts/ +# skills [list] informational, surfaced to Colony +# rate_limit_tier "high" | "standard" — legacy quota hint +# tier "low" | "medium" | "high" (default "high") +# Drives model_reasoning_effort at spawn AND which task +# difficulties this pane will accept: +# high → reasoning=xhigh; accepts hard/standard/trivial +# medium → reasoning=medium; accepts standard/trivial +# low → reasoning=low; accepts trivial only +# The worker prompt's tier gate releases tasks that +# exceed pane capacity back to the queue. +# specialty [list of plan-slug prefixes] (default [] = generalist) +# Pane prefers tasks whose `plan_slug` starts with one of +# these prefixes. Empty list = accept all plans. accounts: - id: research email: admin@gitguardex.com skills: [research, planning, deep-analysis] rate_limit_tier: high + tier: high + specialty: [] - id: coding email: admin@magnoliavilag.hu skills: [implementation, testing] rate_limit_tier: standard + tier: medium + specialty: ["codex-fleet", "recodee"] - id: review email: admin@mite.hu skills: [code-review, security] rate_limit_tier: standard + tier: low + specialty: [] - id: docs email: dpq@recodee.online skills: [documentation, refactor] rate_limit_tier: standard + tier: low + specialty: ["docs"] diff --git a/scripts/codex-fleet/fleet-config.toml.tmpl b/scripts/codex-fleet/fleet-config.toml.tmpl new file mode 100644 index 0000000..2597a84 --- /dev/null +++ b/scripts/codex-fleet/fleet-config.toml.tmpl @@ -0,0 +1,109 @@ +# codex-fleet pane config — rendered into each staged CODEX_HOME. +# +# Stays deliberately minimal. The fleet worker prompt only calls +# `mcp__colony__*`; every other MCP server in the operator's interactive +# `~/.codex/config.toml` is dead weight here and adds 30-60s of startup +# blocking per pane when its backing daemon is down (e.g. recodee on +# :2455, drawio @drawio/mcp, Higgsfield SaaS). 
+# +# Tokens substituted by fleet_render_config() in lib/mcp-preflight.sh +# (placeholders wrapped in double underscores below — listed here without +# the wrapper so this comment block does not get rewritten in the +# rendered output): +# +# COLONY_ENABLED true / false (from mcp-preflight.sh) +# COLONY_HOME absolute path to COLONY_HOME (from preflight) +# COLONY_BIN absolute path to the `colony` binary +# COLONY_TIMEOUT_SEC integer seconds for startup_timeout_sec +# PATH shell PATH the codex pane should inherit +# REASONING_EFFORT "xhigh" | "medium" | "low" — tiered per pane. +# Driven by the `tier` field in accounts.yml +# (high→xhigh, medium→medium, low→low). Codex +# locks reasoning effort at startup, so this is +# the only knob for cheap panes. +# +# Anything not substituted is intentionally static — edit this file, not +# the per-pane copy. + +# Autonomous-worker policy. The fleet panes execute the worker-loop +# prompt without a human at the keyboard, so on-request approval would +# stall every Colony MCP call against the guardian_subagent reviewer +# (see the "Request denied for codex to run colony mcp" symptom in the +# Phase-5 screenshot). `never` immediately returns execution failures to +# the model instead of blocking on approval — the right shape for a +# pull-loop worker. +approval_policy = "never" +cli_auth_credentials_store = "file" +model = "gpt-5.5" +model_reasoning_effort = "__REASONING_EFFORT__" +personality = "pragmatic" +# Workers do write files (claim → edit → commit). Read-only would +# silently break every task. workspace-write keeps the sandbox on but +# allows edits inside the working tree. +sandbox_mode = "workspace-write" +suppress_unstable_features_warning = true + +[features] +external_migration = true +goals = true +memories = true +prevent_idle_sleep = true +terminal_resize_reflow = true + +# Only Colony is enabled. Every other MCP from the operator's interactive +# config is intentionally omitted — the worker prompt does not call them +# and slow / unreachable servers (drawio's @drawio/mcp, recodee at +# 127.0.0.1:2455 when the daemon is down) would otherwise burn 30s of +# pane startup time and trip the "MCP startup incomplete" banner. +[mcp_servers.colony] +args = ["mcp"] +command = "__COLONY_BIN__" +enabled = __COLONY_ENABLED__ +startup_timeout_sec = __COLONY_TIMEOUT_SEC__ + +[mcp_servers.colony.env] +COLONY_HOME = "__COLONY_HOME__" +COLONY_MCP_METRICS_MAX = "5000" +COLONY_OBSERVATIONS_TTL_DAYS = "30" +COLONY_SESSION_TTL_DAYS = "14" +NODE_OPTIONS = "--max-old-space-size=1024" + +[notice] +fast_default_opt_out = true +hide_full_access_warning = true +"hide_gpt-5.1-codex-max_migration_prompt" = true +hide_gpt5_1_migration_prompt = true +hide_rate_limit_model_nudge = true + +# Trust the fleet's working roots so the workspace-write sandbox doesn't +# escalate on every edit. Mirrors the operator's interactive trust list +# for the project paths the fleet actually touches. 
+[projects."/home/deadpool"] +trust_level = "trusted" + +[projects."/home/deadpool/Documents"] +trust_level = "trusted" + +[projects."/home/deadpool/Documents/codex-fleet"] +trust_level = "trusted" + +[projects."/home/deadpool/Documents/recodee"] +trust_level = "trusted" + +[projects."/home/deadpool/Documents/recodee/colony"] +trust_level = "trusted" + +[shell_environment_policy] +inherit = "core" + +[shell_environment_policy.set] +PATH = "__PATH__" + +[tui] +status_line = [ + "model-with-reasoning", + "git-branch", + "context-remaining", + "five-hour-limit", + "weekly-limit", +] diff --git a/scripts/codex-fleet/full-bringup.sh b/scripts/codex-fleet/full-bringup.sh index 46ff3b0..381dc5b 100755 --- a/scripts/codex-fleet/full-bringup.sh +++ b/scripts/codex-fleet/full-bringup.sh @@ -27,6 +27,7 @@ ATTACH=1 PLAN_SLUG="" FLEET_ID="${FLEET_ID:-}" AUTO_FLEET_ID=0 +NO_CAP_CACHE=0 while [ $# -gt 0 ]; do case "$1" in @@ -35,6 +36,7 @@ while [ $# -gt 0 ]; do --no-attach) ATTACH=0; shift ;; --fleet-id) FLEET_ID="$2"; shift 2 ;; --auto-fleet-id) AUTO_FLEET_ID=1; shift ;; + --no-cap-cache) NO_CAP_CACHE=1; shift ;; *) echo "unknown arg: $1"; exit 2 ;; esac done @@ -43,6 +45,17 @@ log() { printf '\033[36m[full-bringup]\033[0m %s\n' "$*"; } warn() { printf '\033[33m[full-bringup]\033[0m %s\n' "$*"; } die() { printf '\033[31m[full-bringup] FATAL:\033[0m %s\n' "$*"; exit 1; } +# Source the MCP preflight so stage_account() (below) renders a fleet-local +# config.toml driven by FLEET_COLONY_* + FLEET_PATH. The preflight is +# best-effort: an unhealthy Colony degrades the staged config rather than +# failing bringup, matching the worker-prompt's shell-CLI fallback. +preflight_log() { log "preflight: $*"; } +preflight_warn() { warn "preflight: $*"; } +# shellcheck source=lib/mcp-preflight.sh +. "$SCRIPT_DIR/lib/mcp-preflight.sh" + +FLEET_CONFIG_TMPL="${CODEX_FLEET_CONFIG_TMPL:-$SCRIPT_DIR/fleet-config.toml.tmpl}" + cd "$REPO" # Fleet ID handling — lets you run multiple parallel fleets on different @@ -99,17 +112,61 @@ fi [ -f "openspec/plans/$PLAN_SLUG/plan.json" ] || die "plan workspace missing: openspec/plans/$PLAN_SLUG/plan.json" log "priority plan: $PLAN_SLUG" +# 2b. Build --add-dir flags from plan metadata.writable_roots (schema: +# scripts/codex-fleet/lib/plan-meta.md). Falls back to the recodee + +# codex-fleet pair when the plan declares nothing. +ADD_DIR_FLAGS=$(PLAN_FILE="openspec/plans/$PLAN_SLUG/plan.json" python3 - <<'PY' +import json, os +p = os.environ["PLAN_FILE"] +try: + with open(p) as f: + data = json.load(f) +except Exception: + data = {} +roots = (data.get("metadata") or {}).get("writable_roots") or [] +if not roots: + roots = ["/home/deadpool/Documents/recodee", "/home/deadpool/Documents/codex-fleet"] +print(" ".join(f"--add-dir {r}" for r in roots)) +PY +) +[ -n "$ADD_DIR_FLAGS" ] || die "failed to compute ADD_DIR_FLAGS for plan $PLAN_SLUG" + +# Preflight every writable root: must exist + be writable by the current user. +add_count=0 +for path in $(printf '%s\n' "$ADD_DIR_FLAGS" | awk '{for(i=1;i<=NF;i++) if($i=="--add-dir"){print $(i+1)}}'); do + [ -d "$path" ] || die "writable root unreachable: $path (chmod / chown / mount?)" + [ -w "$path" ] || die "writable root unreachable: $path (chmod / chown / mount?)" + add_count=$((add_count + 1)) +done +log "writable roots ok: $add_count root(s)" + # 3. 
Pre-spawn git cleanup (prevents 'incorrect old value provided' inside agent-branch-start.sh) log "pruning stale remote refs" git -C "$REPO" remote prune origin 2>&1 | sed 's/^/ /' || true git -C "$REPO" fetch --prune origin 2>&1 | sed 's/^/ /' >/dev/null || true -# 4. Ensure the plan is published to Colony so task_plan_list shows it -log "ensuring plan is published" -if colony plan publish "$PLAN_SLUG" --agent claude --session "full-bringup-$(date +%s)" 2>&1 | sed 's/^/ /'; then - log "publish: ok (or already published — publish is idempotent)" -else - warn "publish returned non-zero; check above. Workers may not see this plan in task_ready_for_agent." +# 4. Ensure the plan is published to Colony so task_plan_list shows it. +# 5-min publish cache short-circuits the second/third bringup of the same +# plan slug — colony plan publish is idempotent but the round-trip costs +# ~2-3s and an MCP call. +PLAN_PUBLISH_MARK="/tmp/codex-fleet/.plan-publish.$PLAN_SLUG.mark" +mkdir -p /tmp/codex-fleet +plan_publish_skip=0 +if [ -f "$PLAN_PUBLISH_MARK" ]; then + mark_age=$(( $(date +%s) - $(stat -c %Y "$PLAN_PUBLISH_MARK" 2>/dev/null || echo 0) )) + if [ "$mark_age" -lt 300 ]; then + log "plan publish: cache hit, skipping (age=${mark_age}s)" + plan_publish_skip=1 + fi +fi +if [ "$plan_publish_skip" = "0" ]; then + log "ensuring plan is published" + if colony plan publish "$PLAN_SLUG" --agent claude --session "full-bringup-$(date +%s)" 2>&1 | sed 's/^/ /'; then + log "publish: ok (or already published — publish is idempotent)" + touch "$PLAN_PUBLISH_MARK" + else + warn "publish returned non-zero; check above. Workers may not see this plan in task_ready_for_agent." + fi fi # 5. Verify wake prompt exists @@ -149,39 +206,147 @@ CAND_N=$(printf "%s\n" "$CANDIDATES" | wc -l) log "ranked $CAND_N candidates by codex-auth score; running live probe..." # (b) Live probe — keep only candidates whose codex CLI is actually usable. -HEALTHY_EMAILS=$(bash "$SCRIPT_DIR/cap-probe.sh" "$N_PANES" $CANDIDATES 2>/tmp/cap-probe.err) || true +# 5-min cache short-circuits back-to-back bringups (each probe spawns N codex +# subprocesses and takes 30-90s). Bypass with --no-cap-cache. 
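+# Cache file shape (illustrative values): {"ts": <epoch-seconds>, "emails": ["admin@gitguardex.com", ...]}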
+CAP_PROBE_CACHE="/tmp/codex-fleet/.cap-probe-cache.json"
+mkdir -p /tmp/codex-fleet
+HEALTHY_EMAILS=""
+cap_cache_hit=0
+if [ "$NO_CAP_CACHE" = "0" ] && [ -f "$CAP_PROBE_CACHE" ]; then
+  HEALTHY_EMAILS=$(CACHE="$CAP_PROBE_CACHE" python3 - <<'PY'
+import json, os, time, sys
+try:
+    with open(os.environ["CACHE"]) as f:
+        data = json.load(f)
+    ts = int(data.get("ts", 0))
+    age = int(time.time()) - ts
+    if age < 300 and isinstance(data.get("emails"), list) and data["emails"]:
+        print(age)
+        for e in data["emails"]:
+            print(e)
+except Exception:
+    pass
+PY
+)
+  if [ -n "$HEALTHY_EMAILS" ]; then
+    cache_age=$(printf "%s\n" "$HEALTHY_EMAILS" | head -n1)
+    HEALTHY_EMAILS=$(printf "%s\n" "$HEALTHY_EMAILS" | tail -n +2)
+    log "cap-probe cache hit (age=${cache_age}s)"
+    cap_cache_hit=1
+  fi
+fi
+if [ "$cap_cache_hit" = "0" ]; then
+  HEALTHY_EMAILS=$(bash "$SCRIPT_DIR/cap-probe.sh" "$N_PANES" $CANDIDATES 2>/tmp/cap-probe.err) || true
+fi
 HEALTHY_N=$(printf "%s\n" "$HEALTHY_EMAILS" | grep -c "@" || true)
 if [ "$HEALTHY_N" -lt "$N_PANES" ]; then
   warn "cap-probe found only $HEALTHY_N/$N_PANES healthy accounts"
   warn "$(cat /tmp/cap-probe.err 2>/dev/null)"
   [ "$HEALTHY_N" -eq 0 ] && die "no healthy accounts; check /tmp/claude-viz/cap-probe.log"
 fi
+if [ "$cap_cache_hit" = "0" ] && [ "$HEALTHY_N" -gt 0 ]; then
+  # Atomic write: tmp + rename so a concurrent reader never sees half a file.
+  CACHE_TMP="${CAP_PROBE_CACHE}.tmp.$$"
+  EMAILS="$HEALTHY_EMAILS" python3 - <<'PY' > "$CACHE_TMP"
+import json, os, time
+emails = [e.strip() for e in os.environ.get("EMAILS","").splitlines() if e.strip()]
+print(json.dumps({"ts": int(time.time()), "emails": emails}))
+PY
+  mv "$CACHE_TMP" "$CAP_PROBE_CACHE"
+fi
 log "$HEALTHY_N healthy account(s) confirmed by live probe"
-# Map healthy emails to id|email format expected downstream
-ACCOUNTS=$(printf "%s\n" "$HEALTHY_EMAILS" | python3 -c '
-import sys
-m={"magnoliavilag":"magnolia","gitguardex":"gg","pipacsclub":"pipacs"}
+# Map healthy emails to id|email|tier|specialty format. `tier` + `specialty`
+# are looked up from accounts.yml by email; missing entries default to
+# tier=high (xhigh reasoning) and specialty="" (generalist). The downstream
+# stage + spawn loops read 4 fields per line.
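+# Example output line (values from accounts.example.yml):
+#   admin-magnolia|admin@magnoliavilag.hu|medium|codex-fleet,recodee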
+ACCOUNTS_YAML="${ACCOUNTS_YAML:-$SCRIPT_DIR/accounts.yml}" +ACCOUNTS=$(printf "%s\n" "$HEALTHY_EMAILS" | ACCOUNTS_YAML="$ACCOUNTS_YAML" python3 -c ' +import sys, os, re +acct_yml = os.environ.get("ACCOUNTS_YAML", "") +by_email = {} +if acct_yml and os.path.exists(acct_yml): + cur = None + with open(acct_yml) as fh: + for raw in fh: + line = raw.rstrip() + s = line.lstrip() + if not s or s.startswith("#"): + continue + if s.startswith("- id:"): + if cur is not None and cur.get("email"): + by_email[cur["email"]] = cur + cur = {} + continue + if cur is None: + continue + mm = re.match(r"^\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*:\s*(.*?)$", line) + if not mm: continue + k, v = mm.group(1), mm.group(2).strip() + if v.startswith("[") and v.endswith("]"): + v = [x.strip().strip("\"").strip("\x27") for x in v[1:-1].split(",") if x.strip()] + else: + v = v.strip("\"").strip("\x27") + cur[k] = v + if cur is not None and cur.get("email"): + by_email[cur["email"]] = cur +dommap = {"magnoliavilag":"magnolia","gitguardex":"gg","pipacsclub":"pipacs"} for line in sys.stdin: email = line.strip() if not email: continue part, dom = email.split("@", 1) dom = dom.split(".", 1)[0] - dom = m.get(dom, dom) - print(f"{part}-{dom}|{email}") + dom = dommap.get(dom, dom) + aid = f"{part}-{dom}" + info = by_email.get(email, {}) + tier = info.get("tier", "high") + spec = info.get("specialty", "") + if isinstance(spec, list): + spec = ",".join(spec) + print(f"{aid}|{email}|{tier}|{spec}") ') COUNT=$(echo "$ACCOUNTS" | grep -c "|") -log "final account list: $COUNT" +log "final account list: $COUNT (tier+specialty from $ACCOUNTS_YAML)" # 7. Stage CODEX_HOMEs log "staging per-account CODEX_HOMEs" -while IFS='|' read -r id email; do +# Map tier (from accounts.yml) → codex `model_reasoning_effort`. Consumed by +# fleet_render_config's __REASONING_EFFORT__ substitution. +tier_to_effort() { + case "$1" in + low) echo "low" ;; + medium) echo "medium" ;; + *) echo "xhigh" ;; # high or unset + esac +} +while IFS='|' read -r id email tier specialty; do [ -z "$id" ] && continue d="/tmp/codex-fleet/$id" mkdir -p "$d" cp "$HOME/.codex/accounts/$email.json" "$d/auth.json" chmod 600 "$d/auth.json" - [ -e "$d/config.toml" ] || ln -s "$HOME/.codex/config.toml" "$d/config.toml" + export FLEET_REASONING_EFFORT="$(tier_to_effort "$tier")" + # Render a fleet-local config.toml (Colony only, pre-approved, sandbox + # workspace-write) instead of symlinking the operator's interactive + # `~/.codex/config.toml`. The old symlink dragged in drawio / recodee / + # Higgsfield / coolify / hostinger-api / soul-skills MCPs that the + # worker prompt never calls — and when any of their backends were down + # (recodee daemon on :2455, @drawio/mcp), every pane blocked 30-60s on + # MCP startup and tripped the "MCP startup incomplete" banner. A stale + # symlink target is replaced on every bringup so re-staging fixes a + # config that drifted. + if [ -L "$d/config.toml" ] || [ -e "$d/config.toml" ]; then + rm -f "$d/config.toml" + fi + if [ -f "$FLEET_CONFIG_TMPL" ]; then + if ! fleet_render_config "$FLEET_CONFIG_TMPL" "$d/config.toml"; then + warn "failed to render fleet config for $id; falling back to symlink" + ln -s "$HOME/.codex/config.toml" "$d/config.toml" + fi + else + warn "fleet template missing ($FLEET_CONFIG_TMPL); falling back to symlinking ~/.codex/config.toml" + ln -s "$HOME/.codex/config.toml" "$d/config.toml" + fi done <<< "$ACCOUNTS" # 8. 
Create the main session with overview window @@ -217,12 +382,20 @@ tmux select-layout -t "$SESSION:overview" tiled log "launching $N_PANES codex workers" PANE_IDS=( $(tmux list-panes -t "$SESSION:overview" -F '#{pane_id}') ) i=0 -while IFS='|' read -r id email; do +while IFS='|' read -r id email tier specialty; do [ -z "$id" ] && continue pid="${PANE_IDS[$i]}" tmux set-option -p -t "$pid" '@panel' "[codex-$id]" + # --add-dir is required when the active plan touches paths outside the + # codex-fleet repo (e.g. /home/deadpool/Documents/recodee for gx-fleet-* + # plans). Without it, `workspace-write` blocks all writes and the worker + # spins on `outside writable roots` / `.git/FETCH_HEAD: Read-only file + # system` for the entire session. + # CODEX_FLEET_TIER + CODEX_FLEET_SPECIALTY are read by worker-prompt.md's + # "Tier + specialty gate" — pane post-skips tasks beyond its tier or + # outside its specialty prefixes. tmux respawn-pane -k -t "$pid" \ - "env CODEX_GUARD_BYPASS=1 CODEX_HOME=/tmp/codex-fleet/$id CODEX_FLEET_AGENT_NAME=codex-$id CODEX_FLEET_ACCOUNT_EMAIL=$email codex --dangerously-bypass-approvals-and-sandbox \"\$(cat $WAKE)\"" + "env CODEX_GUARD_BYPASS=1 CODEX_HOME=/tmp/codex-fleet/$id CODEX_FLEET_AGENT_NAME=codex-$id CODEX_FLEET_ACCOUNT_EMAIL=$email CODEX_FLEET_TIER=${tier:-high} CODEX_FLEET_SPECIALTY=\"$specialty\" codex --dangerously-bypass-approvals-and-sandbox $ADD_DIR_FLAGS \"\$(cat $WAKE)\"" i=$((i + 1)) done <<< "$ACCOUNTS" @@ -306,6 +479,17 @@ else open_window watcher "$SCRIPT_DIR/watcher-board.sh" "" fi +# Design preview — fleet-tui-poc renders the glass-dock floating nav + the +# iOS overlay surfaces (ContextMenu / Spotlight / ActionSheet) as a live +# reference for design work inside the running fleet. Optional: skip +# silently when the release bin isn't built so design work doesn't gate +# bringup on a non-essential window. Inside the pane, press 1/2/3 to open +# ContextMenu / Spotlight / ActionSheet and reveal the terminal-backdrop +# preview underneath. +if [ -x "$rust_bin_dir/fleet-tui-poc" ]; then + open_window design "$rust_bin_dir/fleet-tui-poc" remain +fi + # 11b. Apply canonical iOS-style chrome (3-row tab strip at top, rounded pane # borders with `▭ #{@panel}` headers, sticky right-click menu). Runs after # windows exist so window-status-format covers all six tabs. @@ -377,14 +561,30 @@ CODEX_FLEET_SESSION="$TICKER_SESSION" bash "$SCRIPT_DIR/style-tabs.sh" >/dev/nul || warn "style-tabs.sh failed for $TICKER_SESSION" # 12c. Verify chrome actually rendered (catches the session-local `status on` -# shadow regression where tmux clamps to 1 row and silently hides the tab -# strip). Expected: status_height=3. -chrome_h=$(tmux display-message -p -t "$SESSION:overview" '#{?status_height,#{status_height},#{e|-|:#{client_height},#{window_height}}}' 2>/dev/null || echo "?") -if [ "$chrome_h" = "3" ]; then - log "iOS chrome verified: status_height=$chrome_h" -else - warn "iOS chrome looks wrong: status_height=$chrome_h (expected 3)" -fi +# shadow regression where tmux clamps the bar away and silently hides the tab +# strip). The acceptable height is whatever STYLE_TABS_HEIGHT asked for — +# default 1 (single-row, clicks work), 2-5 for opt-in floating-dock padding. +# +# Older revisions of this check read the per-client `status_height` format +# var and fell back to `client_height - window_height` when it was empty. 
+# Both are client-scoped — when full-bringup runs with no client attached +# (--no-attach, or while we're still spawning the ticker session) tmux +# returns `''` for status_height and `-` for the subtraction +# (we logged `-76`), tripping the alarm even though the chrome was fine. +# +# Instead, read the GLOBAL `status` option directly: `on`/`off`/`1..5`. +# style-tabs.sh wipes the session-local override before re-setting the +# global, so any non-`off` global value means the chrome is in place. +expected_h="${STYLE_TABS_HEIGHT:-1}" +chrome_status=$(tmux show-options -gv status 2>/dev/null || echo "") +case "$chrome_status" in + ''|off|0) + warn "iOS chrome looks wrong: global status='$chrome_status' (expected on or ${expected_h})" + ;; + *) + log "iOS chrome verified: status=$chrome_status (target ${expected_h})" + ;; +esac log "DONE." log " main session: tmux attach -t $SESSION" diff --git a/scripts/codex-fleet/lib/mcp-preflight.sh b/scripts/codex-fleet/lib/mcp-preflight.sh new file mode 100755 index 0000000..7ce7ff9 --- /dev/null +++ b/scripts/codex-fleet/lib/mcp-preflight.sh @@ -0,0 +1,124 @@ +#!/usr/bin/env bash +# codex-fleet MCP preflight — probe the MCP servers the staged fleet +# config will reference, and export per-MCP enable flags + timeouts +# consumed by the fleet-config.toml.tmpl renderer. +# +# The fleet worker prompt only calls `mcp__colony__*`, so this script +# focuses on Colony. We deliberately do NOT probe recodee / drawio / +# Higgsfield / coolify / hostinger-api / soul-skills — they are absent +# from the rendered fleet config by design, so probing them is wasted +# work AND would falsely block bringup when those daemons are down. +# +# Outputs (exported on success, so the caller can substitute them into +# fleet-config.toml.tmpl with sed): +# +# FLEET_COLONY_BIN absolute path to the colony CLI +# FLEET_COLONY_HOME absolute path to COLONY_HOME +# FLEET_COLONY_ENABLED "true" | "false" (lowercase TOML literal) +# FLEET_COLONY_TIMEOUT_SEC integer (default 60) +# FLEET_PATH PATH the staged config inherits +# +# Failures degrade gracefully: if Colony is unreachable, the fleet still +# stages a config that has `enabled = false` for Colony rather than +# refusing to spawn. Workers will fall back to invoking `colony` as a +# shell CLI (the worker-prompt loop already handles this), and the +# preflight log makes the degradation visible. +# +# Source it; do not exec it. Caller must have already sourced lib/_env.sh. +# This file deliberately does NOT enable `set -u` / `set -e`: those flags +# would bleed into the sourcing shell (`up.sh`, `full-bringup.sh`) and +# trip on shell-snapshot lookups (e.g. unbound ZSH_VERSION) outside our +# control. The caller picks its own strictness; we keep the lib quiet. + +# --- log helpers (no-op friendly if caller defines their own) ----------- +if ! declare -F preflight_log >/dev/null 2>&1; then + preflight_log() { printf "[fleet-preflight] %s\n" "$*" >&2; } +fi +if ! 
declare -F preflight_warn >/dev/null 2>&1; then + preflight_warn() { printf "[fleet-preflight] WARN %s\n" "$*" >&2; } +fi + +# --- defaults ----------------------------------------------------------- +: "${FLEET_COLONY_TIMEOUT_SEC:=60}" +: "${FLEET_COLONY_HOME_DEFAULT:=$HOME/Documents/recodee/colony/.omx/colony-home}" +: "${FLEET_PATH:=$HOME/.bun/bin:$HOME/.nvm/versions/node/v22.22.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin}" + +# --- locate colony binary ------------------------------------------------ +# Prefer the version on PATH (honors $HOME/.nvm symlinks). Fall back to a +# few well-known absolute paths so the preflight works from a non-login +# shell (cron, systemd) that hasn't loaded nvm. +_fleet_locate_colony() { + local candidate + candidate="$(command -v colony 2>/dev/null || true)" + if [[ -n "$candidate" && -x "$candidate" ]]; then + printf '%s\n' "$candidate"; return 0 + fi + for candidate in \ + "$HOME/.nvm/versions/node/v22.22.0/bin/colony" \ + "$HOME/.local/bin/colony" \ + "/usr/local/bin/colony" + do + if [[ -x "$candidate" ]]; then + printf '%s\n' "$candidate"; return 0 + fi + done + return 1 +} + +FLEET_COLONY_BIN="$(_fleet_locate_colony || true)" + +# --- probe colony -------------------------------------------------------- +# Two-step health check: +# 1. Binary present + invokable (`colony --help` exits 0 fast). +# 2. COLONY_HOME directory readable. +# We do NOT spawn the MCP server itself here — that would race with the +# pane spawn and double the bringup time. The codex startup will spin it +# up; the preflight just guarantees the spawn won't blow up immediately. +FLEET_COLONY_HOME="${COLONY_HOME:-$FLEET_COLONY_HOME_DEFAULT}" +FLEET_COLONY_ENABLED="false" + +if [[ -z "$FLEET_COLONY_BIN" ]]; then + preflight_warn "colony CLI not found on PATH or in known locations — fleet panes will fall back to shell calls" +elif [[ ! -d "$FLEET_COLONY_HOME" ]]; then + preflight_warn "COLONY_HOME missing: $FLEET_COLONY_HOME — disabling colony MCP in staged config" +elif ! "$FLEET_COLONY_BIN" --help >/dev/null 2>&1; then + preflight_warn "colony CLI at $FLEET_COLONY_BIN failed --help probe — disabling colony MCP in staged config" +else + FLEET_COLONY_ENABLED="true" + preflight_log "colony MCP healthy: bin=$FLEET_COLONY_BIN home=$FLEET_COLONY_HOME timeout=${FLEET_COLONY_TIMEOUT_SEC}s" +fi + +export FLEET_COLONY_BIN FLEET_COLONY_HOME FLEET_COLONY_ENABLED FLEET_COLONY_TIMEOUT_SEC FLEET_PATH + +# --- render helper ------------------------------------------------------- +# fleet_render_config +# Materializes the fleet-config.toml.tmpl into , substituting +# the FLEET_* env vars above. Caller is responsible for making sure +# FLEET_COLONY_BIN is non-empty when FLEET_COLONY_ENABLED=true. +# +# Tier wiring: the spawn loop sets `FLEET_REASONING_EFFORT` per pane +# based on the account's `tier` field in accounts.yml: +# tier=high → FLEET_REASONING_EFFORT=xhigh (default if unset) +# tier=medium → FLEET_REASONING_EFFORT=medium +# tier=low → FLEET_REASONING_EFFORT=low +# This substitutes __REASONING_EFFORT__ in the template. Codex locks +# model_reasoning_effort at startup, so the tier MUST be decided +# before render, not at task-claim time. +fleet_render_config() { + local tmpl="$1" dst="$2" + if [[ ! -f "$tmpl" ]]; then + preflight_warn "fleet config template missing: $tmpl" + return 1 + fi + # Use a bash-only substitution loop instead of sed -e to avoid quoting + # surprises with the PATH value (contains `/`). 
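+  # e.g. fleet_render_config "$SCRIPT_DIR/fleet-config.toml.tmpl" "/tmp/codex-fleet/research/config.toml"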
+ local content + content="$(<"$tmpl")" + content="${content//__COLONY_ENABLED__/$FLEET_COLONY_ENABLED}" + content="${content//__COLONY_HOME__/$FLEET_COLONY_HOME}" + content="${content//__COLONY_BIN__/${FLEET_COLONY_BIN:-colony}}" + content="${content//__COLONY_TIMEOUT_SEC__/$FLEET_COLONY_TIMEOUT_SEC}" + content="${content//__PATH__/$FLEET_PATH}" + content="${content//__REASONING_EFFORT__/${FLEET_REASONING_EFFORT:-xhigh}}" + printf '%s' "$content" > "$dst" +} diff --git a/scripts/codex-fleet/lib/plan-meta.md b/scripts/codex-fleet/lib/plan-meta.md new file mode 100644 index 0000000..02bd8f7 --- /dev/null +++ b/scripts/codex-fleet/lib/plan-meta.md @@ -0,0 +1,44 @@ +# plan.json metadata schema + +Optional `metadata` object on the plan root. Consumers: `full-bringup.sh` +(writable_roots) and the worker prompt's tier router (difficulty). + +## Fields + +### `metadata.writable_roots: string[]` + +Absolute paths passed as `--add-dir ` to `codex` at worker spawn. +Required when the plan touches files outside the codex-fleet repo +(workspace-write sandbox blocks writes outside listed roots → workers spin +on `outside writable roots` / `Read-only file system`). + +Fallback when absent or empty: `["/home/deadpool/Documents/recodee", +"/home/deadpool/Documents/codex-fleet"]`. + +`full-bringup.sh` preflights each path: `test -d` + `test -w`. Missing or +read-only → `die`. Fix with `chmod` / `chown` / remount. + +### `metadata.subtasks[].difficulty: "trivial"|"standard"|"hard"` + +Per-subtask hint consumed by the worker prompt's tier router. Schema +declared here; routing logic lives in `worker-prompt.md` (Agent B owns). +Default when absent: `"standard"`. + +## Example + +```json +{ + "plan_slug": "demo-2026-05-14", + "metadata": { + "writable_roots": [ + "/home/deadpool/Documents/codex-fleet", + "/home/deadpool/Documents/recodee" + ], + "subtasks": { + "0": { "difficulty": "trivial" }, + "1": { "difficulty": "hard" } + } + }, + "tasks": [ /* ... */ ] +} +``` diff --git a/scripts/codex-fleet/token-meter.sh b/scripts/codex-fleet/token-meter.sh new file mode 100755 index 0000000..fa23628 --- /dev/null +++ b/scripts/codex-fleet/token-meter.sh @@ -0,0 +1,202 @@ +#!/usr/bin/env bash +# token-meter.sh — per-pane spend snapshot for codex-fleet tmux. +# pane /proc env + codex-auth list + tmux capture-pane → sorted table/JSON. + +set -u; LC_ALL=C +SESSION="codex-fleet"; USE_COLOR=1; MODE="table"; WATCH=0 + +usage() { + cat <<'EOF' +token-meter.sh — codex-fleet per-pane spend snapshot +usage: bash scripts/codex-fleet/token-meter.sh [flags] + --session tmux session (default codex-fleet) + --no-color disable ANSI + --json emit JSON array + --watch refresh every 5s, Ctrl-C exits + --help this help +cols: agent | account | 5h% | wk% | ctx% | tasks-done | status +sort asc by 5h% (lowest headroom first). red when 5h%<20 OR wk%<15 OR ctx%<15. 
+EOF +} + +while [ $# -gt 0 ]; do + case "$1" in + --session) SESSION="${2:-codex-fleet}"; shift 2 ;; + --no-color) USE_COLOR=0; shift ;; + --json) MODE="json"; shift ;; + --watch) WATCH=1; shift ;; + --help|-h) usage; exit 0 ;; + *) echo "unknown flag: $1" >&2; usage >&2; exit 2 ;; + esac +done +[ -t 1 ] || USE_COLOR=0 +[ "$MODE" = "json" ] && USE_COLOR=0 +c_red=""; c_dim=""; c_bold=""; c_reset="" +if [ "$USE_COLOR" = "1" ]; then + c_red=$'\033[31m'; c_dim=$'\033[2m'; c_bold=$'\033[1m'; c_reset=$'\033[0m' +fi + +# pull `codex-auth list` once; cache as email\t5h\twk per line +fetch_auth() { + codex-auth list 2>/dev/null \ + | awk '/type=ChatGPT/ { + email=""; fh="n/a"; wk="n/a"; + for (i=1;i<=NF;i++) { + if ($i ~ /@/) email=$i; + else if ($i ~ /^5h=/) { sub(/^5h=/,"",$i); fh=$i } + else if ($i ~ /^weekly=/) { sub(/^weekly=/,"",$i); wk=$i } + } + if (email != "") print email"\t"fh"\t"wk + }' +} + +# tmux pane env via /proc//environ — works even when pane is busy. +pane_env() { + local pid="$1" key="$2" + [ -r "/proc/$pid/environ" ] || { echo ""; return; } + tr '\0' '\n' < "/proc/$pid/environ" 2>/dev/null \ + | awk -F= -v k="$key" '$1==k {print substr($0,length(k)+2); exit}' +} + +# walk descendant pids (BFS) then resolve agent/email from /proc env. +resolve_pane() { + local root="${1:-}" agent="" email="" q next p kids + [ -z "$root" ] && { printf 'n/a\tn/a\n'; return; } + q="$root"; next="" + while [ -n "$q" ]; do + next="" + for p in $q; do + if [ -z "$agent" ]; then agent=$(pane_env "$p" CODEX_FLEET_AGENT_NAME); fi + if [ -z "$email" ]; then email=$(pane_env "$p" CODEX_FLEET_ACCOUNT_EMAIL); fi + [ -n "$agent" ] && [ -n "$email" ] && break 2 + kids=$(pgrep -P "$p" 2>/dev/null || true) + [ -n "$kids" ] && next="$next $kids" + done + q="$next" + done + printf '%s\t%s\n' "${agent:-n/a}" "${email:-n/a}" +} + +# scrape ctx% + status from pane tail. +scrape_pane() { + local target="$1" + local buf ctx="n/a" status="idle" + buf=$(tmux capture-pane -p -t "$target" -S -60 2>/dev/null || true) + [ -z "$buf" ] && { printf '%s\t%s\n' "$ctx" "$status"; return; } + local left + left=$(printf '%s' "$buf" | grep -oE 'Context[[:space:]]+[0-9]+%' | tail -1 | grep -oE '[0-9]+' || true) + [ -n "$left" ] && ctx="$((100 - left))%" + if printf '%s' "$buf" | grep -qiE 'rate[- ]?limit|rate_limit|429'; then status="rate-limited" + elif printf '%s' "$buf" | grep -qiE 'esc to interrupt|^[[:space:]]*Working'; then status="working" + elif printf '%s' "$buf" | grep -qiE 'blocked|BLOCKED:'; then status="blocked" + fi + printf '%s\t%s\n' "$ctx" "$status" +} + +pct_num() { # "62%" → 62; "n/a" → -1 + case "$1" in n/a|"") echo -1 ;; *) echo "${1%\%}" ;; esac +} + +is_hot() { + # codex-auth's 5h% / weekly% report REMAINING quota; codex's ctx% reports + # REMAINING context. Low values = about to wedge → red. (Earlier draft had + # these inverted because the design brief used "spend %" semantics.) 
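+  # e.g. 5h=18% remaining trips the red row even when wk%/ctx% are healthy.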
+ local fh wk ctx + fh=$(pct_num "$1"); wk=$(pct_num "$2"); ctx=$(pct_num "$3") + [ "$fh" -ge 0 ] && [ "$fh" -lt 20 ] 2>/dev/null && return 0 + [ "$wk" -ge 0 ] && [ "$wk" -lt 15 ] 2>/dev/null && return 0 + [ "$ctx" -ge 0 ] && [ "$ctx" -lt 15 ] 2>/dev/null && return 0 + return 1 +} + +collect() { # emits TSV: agent email 5h wk ctx tasks status + local panes auth_tsv + panes=$(tmux list-panes -t "$SESSION:0" -F '#{pane_id} #{pane_pid}' 2>/dev/null || true) + if [ -z "$panes" ]; then + echo "no panes for session '$SESSION' (try --session )" >&2 + return 1 + fi + auth_tsv=$(fetch_auth) + while IFS=' ' read -r pid_tgt pid_pane; do + [ -z "$pid_pane" ] && continue + local rline agent email + rline=$(resolve_pane "$pid_pane") + agent=$(printf '%s' "$rline" | cut -f1) + email=$(printf '%s' "$rline" | cut -f2) + [ "$agent" = "n/a" ] && continue # skip non-codex panes + local fh="n/a" wk="n/a" + if [ -n "$auth_tsv" ] && [ "$email" != "n/a" ]; then + local row + row=$(printf '%s\n' "$auth_tsv" | awk -F'\t' -v e="$email" '$1==e {print $2"\t"$3; exit}') + [ -n "$row" ] && { fh=$(printf '%s' "$row" | cut -f1); wk=$(printf '%s' "$row" | cut -f2); } + fi + local sline ctx status + sline=$(scrape_pane "$pid_tgt") + ctx=$(printf '%s' "$sline" | cut -f1) + status=$(printf '%s' "$sline" | cut -f2) + local tasks="n/a" # colony CLI exposes no per-agent count; fallback only + printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \ + "$agent" "$email" "$fh" "$wk" "$ctx" "$tasks" "$status" + done <<<"$panes" \ + | sort -t$'\t' -k3,3 -n # sort asc by 5h% (lowest headroom first → spend-watch panes on top) +} + +render_table() { + local tsv="$1" + local ts; ts=$(date '+%Y-%m-%d %H:%M') + printf '%s%s%s — %s\n' "$c_bold" "codex-fleet token meter" "$c_reset" "$ts" + printf '%-28s %-28s %-6s %-6s %-6s %-11s %s\n' \ + "agent" "account" "5h%" "wk%" "ctx%" "tasks-done" "status" + [ -z "$tsv" ] && { printf '%s(no panes)%s\n' "$c_dim" "$c_reset"; return; } + while IFS=$'\t' read -r agent email fh wk ctx tasks status; do + [ -z "$agent" ] && continue + local pre="" post="" + if is_hot "$fh" "$wk" "$ctx"; then pre="$c_red"; post="$c_reset"; fi + printf '%s%-28s %-28s %-6s %-6s %-6s %-11s %s%s\n' \ + "$pre" "$agent" "$email" "$fh" "$wk" "$ctx" "$tasks" "$status" "$post" + done <<<"$tsv" +} + +render_json() { + local tsv="$1" + local ts; ts=$(date '+%Y-%m-%dT%H:%M:%S') + TS="$ts" SESS="$SESSION" python3 - "$tsv" <<'PY' +import json, os, sys +rows = [] +for line in (sys.argv[1] or "").splitlines(): + if not line.strip(): + continue + parts = line.split("\t") + if len(parts) != 7: + continue + agent, email, fh, wk, ctx, tasks, status = parts + rows.append({ + "agent": agent, "account": email, + "five_hour_pct": fh, "weekly_pct": wk, "ctx_pct": ctx, + "tasks_done": tasks, "status": status, + }) +print(json.dumps({ + "session": os.environ.get("SESS",""), + "timestamp": os.environ.get("TS",""), + "agents": rows, +}, indent=2)) +PY +} + +one_pass() { + local tsv + tsv=$(collect) || return $? 
+ if [ "$MODE" = "json" ]; then render_json "$tsv" + else render_table "$tsv"; fi +} + +if [ "$WATCH" = "1" ]; then + trap 'printf "\n"; exit 0' INT + while :; do + [ "$USE_COLOR" = "1" ] && printf '\033[2J\033[H' || printf '\n' + one_pass + sleep 5 + done +else + one_pass +fi diff --git a/scripts/codex-fleet/up.sh b/scripts/codex-fleet/up.sh index db1a27e..649d339 100755 --- a/scripts/codex-fleet/up.sh +++ b/scripts/codex-fleet/up.sh @@ -32,9 +32,18 @@ CONFIG="${SCRIPT_DIR}/accounts.yml" SESSION="${CODEX_FLEET_SESSION:-codex-fleet}" WORK_ROOT="${CODEX_FLEET_WORK_ROOT:-/tmp/codex-fleet}" PROMPT_FILE="${SCRIPT_DIR}/worker-prompt.md" +FLEET_CONFIG_TMPL="${CODEX_FLEET_CONFIG_TMPL:-$SCRIPT_DIR/fleet-config.toml.tmpl}" DRY_RUN=0 ATTACH=1 +# Probe Colony / MCP health once, before any pane spawns. Exports +# FLEET_COLONY_* + FLEET_PATH used by fleet_render_config below. The +# preflight is non-fatal: when Colony is unhealthy it disables the MCP +# in the staged config rather than refusing bringup, so the worker +# prompt's shell-CLI fallback still has a chance to keep things moving. +# shellcheck source=lib/mcp-preflight.sh +. "$SCRIPT_DIR/lib/mcp-preflight.sh" + while [[ $# -gt 0 ]]; do case "$1" in --config) CONFIG="$2"; shift 2 ;; @@ -133,9 +142,22 @@ stage_account() { mkdir -p "$dst" cp -f "$src" "$dst/auth.json" chmod 600 "$dst/auth.json" - # config.toml is large and stable — symlink rather than copy. - if [[ -f "$HOME/.codex/config.toml" ]]; then - ln -sf "$HOME/.codex/config.toml" "$dst/config.toml" + # Render a fleet-local config.toml instead of symlinking the operator's + # interactive one. The worker prompt only calls `mcp__colony__*`; every + # other MCP in `~/.codex/config.toml` (drawio, recodee, Higgsfield, …) + # would burn 30-60s of pane startup time blocking on slow / unreachable + # backends. fleet_render_config substitutes preflight-derived enable + # flags and timeouts into the template. + if [[ -f "$FLEET_CONFIG_TMPL" ]]; then + if ! fleet_render_config "$FLEET_CONFIG_TMPL" "$dst/config.toml"; then + echo "fatal: failed to render fleet config from $FLEET_CONFIG_TMPL" >&2 + return 1 + fi + else + echo "[codex-fleet] WARN fleet template missing ($FLEET_CONFIG_TMPL); falling back to symlinking ~/.codex/config.toml" >&2 + if [[ -f "$HOME/.codex/config.toml" ]]; then + ln -sf "$HOME/.codex/config.toml" "$dst/config.toml" + fi fi echo "[codex-fleet] staged $acct_id ($email) -> $dst" } @@ -189,11 +211,16 @@ for a in json.load(sys.stdin): # previous shape) made codex exit on EOF, killing the pane immediately. # Use --prompt-file when available (codex >= 0.x), otherwise fall back # to passing the file contents as the positional argument. + # # --dangerously-bypass-approvals-and-sandbox: auto-approve all MCP tool # calls and shell commands. Fleet workers run unattended; without this # every Colony MCP call would hit the "Allow / Always allow / Cancel" # gate and stall the pull-loop. - pane_cmd="env ${pane_env[*]} codex --dangerously-bypass-approvals-and-sandbox \"\$(cat '$PROMPT_FILE')\"" + # --add-dir extends the workspace-write sandbox so workers can edit + # files in sibling repos that the active plan targets (e.g. + # /home/deadpool/Documents/recodee for gx-fleet-* plans). Without + # this, workers hit `outside writable roots` and silently spin. 
+    pane_cmd="env ${pane_env[*]} codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/recodee --add-dir /home/deadpool/Documents/codex-fleet \"\$(cat '$PROMPT_FILE')\""
     if [[ $FIRST -eq 1 ]]; then
         tmux new-session -d -s "$SESSION" -n "codex-$acct_id" "$pane_cmd"
         FIRST=0
diff --git a/scripts/codex-fleet/worker-prompt.md b/scripts/codex-fleet/worker-prompt.md
index a23aa00..5c5c7b3 100644
--- a/scripts/codex-fleet/worker-prompt.md
+++ b/scripts/codex-fleet/worker-prompt.md
@@ -1,112 +1,121 @@
 # codex-fleet worker loop

-You are a Codex worker in a tmux pane spawned by `scripts/codex-fleet/up.sh`.
-Your environment is set up by the parent script:
+You are pane `$CODEX_FLEET_AGENT_NAME` (Colony agent id) under account
+`$CODEX_FLEET_ACCOUNT_EMAIL`. The orchestrator is the host Claude session
+plus the `force-claim` + `claim-release-supervisor` daemons. Your job:
+pull → preflight → execute → report. Do not propose tasks. Do not chat.

-- `CODEX_HOME` points at a per-pane staged dir with this account's
-  `auth.json` and a symlinked `config.toml`. Do not write to it.
-- `CODEX_FLEET_AGENT_NAME` is your unique agent id (e.g. `codex-research`).
-  Use this exact string whenever a Colony MCP tool asks for `agent`.
-- `CODEX_FLEET_ACCOUNT_EMAIL` is the email of the underlying codex
-  account. Surface it only in handoff notes when a rate-limit issue
-  needs operator attention.
+## Token discipline

-## Your job

+- Fewer words, same proof. No commentary, no narration of your reasoning.
+- Tool calls only when state changes. Skip "let me check…" prose.
+- One Colony observation per real state change. Nothing else.
+- Drop filler tokens (`I will`, `Now`, `Let me`). Imperative + result.

-You are one of N parallel workers. The host Claude session is the
-orchestrator: it proposes tasks via `mcp__colony__task_propose` and
-monitors progress via `mcp__colony__attention_inbox`. Your job is to
-**pull tasks from the Colony queue, execute them, and report back** —
-nothing else.
+## Boot (once)

-Do not propose new tasks. Do not invent work. If `task_ready_for_agent`
-returns nothing, wait and try again.
+1. `mcp__colony__hivemind_context` — confirm Colony reachable. If it fails,
+   stop the loop, run a single shell `echo "colony unreachable"`, then exit.
+   Do not retry indefinitely.

 ## Loop

-Repeat indefinitely:
-
-1. Call `mcp__colony__hivemind_context` once at boot only — to load
-   project context and confirm you can talk to Colony.
-
-2. Call `mcp__colony__task_ready_for_agent({ agent: $CODEX_FLEET_AGENT_NAME })`
-   to claim the next ready sub-task. The server auto-claims when there
-   is exactly one candidate.
-
-3. If you got a task, the response payload includes plan-structure fields
-   you MUST read before editing:
-
-   - `plan_slug` + `sub_idx` — your position in the plan tree.
-   - `parent` — the wave/parent sub-task (if any). Reference it in your
-     completion note so the orchestrator can render the W{n}·sub-{i}
-     lineage on the plan board.
-   - `depends_on` — sub-tasks that must be `done` before yours. Colony
-     already filters by ready deps, so if you got the task they're
-     satisfied. Re-check only if your evidence step needs an artifact
-     from an upstream sub-task; if a dep is `claimed` but not `done`,
-     treat it as a real blocker.
-   - `touches_files` — the EXACT file scope declared in the plan. Treat
-     this as a hard upper bound for what you edit.
-     Adding a test file next to a claimed source is fine; widening into
-     a sibling module is scope creep — post a question to the
-     orchestrator first.
-
-   Then:
-
-   - Call `mcp__colony__task_claim_file` for each file you will edit
-     (subset of `touches_files`, plus any test file you're adding
-     adjacent to a claimed source).
-   - Call `mcp__colony__task_note_working` with `{ agent: $CODEX_FLEET_AGENT_NAME, plan_slug, sub_idx }`
-     immediately after the claim. This is what flips your row in
-     the cockpit's "WORKING ON" column from `idle` to the live
-     `→ sub-N `; without it the tick daemon falls back to
-     scraping your pane content, which lags and looks dead.
-   - Do the work. Match `touches_files` exactly.
-   - Verify with the narrowest meaningful command (cargo check, pytest -k,
-     tsc --noEmit) — see the project's verification gates in AGENTS.md.
-   - On success: open the PR via the agent-branch-finish flow, then
-     call `mcp__colony__task_plan_complete_subtask` with
-     `completed_summary` containing the PR URL or `PR #<n>` token.
-     The plan visualization scans this string for the PR badge.
-   - Then post the working-state note:
-     `mcp__colony__task_post(kind: 'note', content: 'branch=…; \
-     task=plan=<plan_slug>/sub-<sub_idx>; parent=<parent>; \
-     blocker=none; next=…; evidence=<PR URL>')`.
-   - On a real blocker (missing dep, ambiguous spec, broken build):
-     `mcp__colony__task_post(kind: 'blocker', content: 'BLOCKED branch=…; \
-     plan=<plan_slug>/sub-<sub_idx>; reason=…; need=…')` then
-     `mcp__colony__task_hand_off` back to the orchestrator
-     (`to_agent: 'any'`).
-
-4. If `task_ready_for_agent` returns no work:
-   - Sleep ~60 seconds (use the ScheduleWakeup tool when available, or
-     a short shell sleep), then go back to step 2.
-   - Do not poll faster than 60 s; do not silently exit. The host
-     Claude session decides when to tear down the fleet.
-
-## Rate limits
-
-If a codex API call returns a 429 / quota error, **do not retry** in
-this pane. Instead:
-
-1. `mcp__colony__task_post(kind: 'blocker', content: 'rate-limit hit on \
-   account=$CODEX_FLEET_ACCOUNT_EMAIL; releasing claim')`.
-2. Release any active file claims via `mcp__colony__task_claim_file`
-   with the released flag (or `task_hand_off released_files=[...]`).
-3. Sleep ~5 minutes. Then resume the loop. Another pane with a different
-   account will pick up the released task in the meantime.
-
-## What you must NOT do
-
-- Do not switch accounts inside the pane. The pane's `CODEX_HOME` is
-  fixed by the spawn script.
-- Do not run `codex login` / `codex logout`. The auth.json is staged.
-- Do not edit `~/.codex/` or `$CODEX_HOME/` files.
-- Do not stack git commits on `main` / `dev` — start a worktree per
-  the project's worktree-discipline rules in AGENTS.md.
-
-## Reporting cadence
-
-Every meaningful state change → one Colony observation. Mute commentary
-otherwise; the orchestrator reads attention_inbox, not the tmux scrollback.
-
-Now: start the loop. Step 1 first.
+```
+2. ready = mcp__colony__task_ready_for_agent({ agent: $CODEX_FLEET_AGENT_NAME, limit: 1 })
+3. if ready.ready is empty:
+     if ready.next_action contains "rescue" or ready.next_tool == "rescue_stranded_scan":
+       sleep 60   # claim-release-supervisor daemon owns rescue; do not loop on it
+     else:
+       sleep 60
+     goto 2
+4. task = ready.ready[0]
+```
+
+Then preflight, claim, work, report. Sequence below.
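+
+For orientation, an illustrative `ready` payload — the values and the exact
+field set are examples only (this prompt relies on the fields shown; the
+real response may carry more):
+
+```
+{
+  "ready": [{
+    "task_id": "<id>",
+    "plan_slug": "<plan_slug>",
+    "sub_idx": 3,
+    "parent": "<parent sub-task or null>",
+    "depends_on": ["<dep task ids>"],
+    "touches_files": ["<paths declared in the plan>"],
+    "title": "<short title>",
+    "metadata": { "difficulty": "standard" }
+  }],
+  "next_action": null,
+  "next_tool": null
+}
+```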
+
+### Tier + specialty gate (REQUIRED before preflight)
+
+Read once at boot: `tier=$CODEX_FLEET_TIER` (default `high`),
+`spec=$CODEX_FLEET_SPECIALTY` (default empty, comma/space separated).
+Let `d = task.metadata.difficulty` (default `standard`).
+- Capacity: `high`={hard,standard,trivial}, `medium`={standard,trivial}, `low`={trivial}.
+- If `d` not in capacity: `task_post(kind:'note', content:'tier-skip: difficulty=<d> tier=<t>; releasing for stronger pane')`, `task_hand_off(to_agent:'any')`, `sleep 60`, `goto 2`.
+- If `spec` is non-empty AND no entry in `spec` is a prefix of `task.plan_slug`: `task_post(kind:'note', content:'specialty-skip: plan=<plan_slug> spec=<spec>')`, `task_hand_off(to_agent:'any')`, `sleep 60`, `goto 2`.
+- Empty `spec` = generalist; do not skip.
+
+### Preflight (REQUIRED before any edit)
+
+Reject the claim early when the work is unreachable. This is what stops
+the endless blocker churn the previous fleet run produced.
+
+- **Writable-root check.** For every path in `task.touches_files`, verify
+  it falls under one of: the codex pane's `--add-dir` roots
+  (`/home/deadpool/Documents/recodee`, `/home/deadpool/Documents/codex-fleet`),
+  `/tmp`, or `$CODEX_HOME`; a shell sketch of this check follows the
+  Claim + work steps. If any path is outside:
+  - `task_post(kind: 'blocker', content: 'BLOCKED preflight=writable-root; \
+    plan=<plan_slug>/sub-<sub_idx>; path=<offending>; need=add-dir or plan retarget')`
+  - `task_hand_off(to_agent: 'orchestrator')` and `sleep 60`, then `goto 2`.
+  - Do NOT attempt edits, claims, or `gx branch start` — outside the
+    writable roots they fail silently and the pane just spins.
+
+- **Dep-already-claimed check.** If `task.depends_on` has any entry whose
+  status is `claimed` (not `done`) AND the claim is older than 30 minutes,
+  treat it as stranded. Post a tight blocker referencing the dep's task id,
+  then `sleep 60` and `goto 2`. The `claim-release-supervisor`
+  daemon will reap it; you do not call rescue yourself.
+
+### Claim + work
+
+5. `task_claim_file` for each path in `touches_files` you will edit.
+6. `task_note_working({ agent, plan_slug, sub_idx })` — cockpit pulls
+   "WORKING ON" from this; without it the row reads `idle`.
+7. Start the agent worktree:
+   ```
+   gx branch start "<task.title or plan_slug/sub-N>" "$CODEX_FLEET_AGENT_NAME"
+   cd "<printed worktree path>"
+   ```
+   If `gx branch start` fails with `Read-only file system` or
+   `cannot open '.git/...'`, you hit the writable-root bug despite
+   preflight — post `BLOCKED preflight-bypass=gx-write` and `goto 2`.
+8. Edit. Stay inside `touches_files`. Adjacent test files OK.
+9. Verify with the narrowest meaningful command from the project's
+   AGENTS.md verification gates (e.g. `cargo check -p <crate>`,
+   `pytest -k <name>`, `tsc --noEmit`).
+10. Finish (do NOT wait for merge — a supervisor finalizes):
+    ```
+    gx branch finish --branch "<agent-branch>" --via-pr --cleanup
+    ```
+    Then immediately: `task_post(kind: 'pending-merge', content: 'PR=<URL>; plan=<plan_slug>/sub-<sub_idx>')`.
+11. `task_plan_complete_subtask({ plan_slug, sub_idx, completed_summary: "PR #<n> <one-line>" })`.
+    The plan board scans `completed_summary` for the `PR #<n>` badge.
+12. `task_post(kind: 'note', content: 'branch=<br>; plan=<plan_slug>/sub-<sub_idx>; \
+    parent=<parent>; blocker=none; next=<next>; state=pending-merge; pr=<PR URL>')`.
+13. `goto 2`.
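+
+The writable-root check is mechanical enough to run as a single shell probe
+before claiming anything. A minimal sketch — the function name and the
+hard-coded root list below are illustrative, not part of the shipped fleet
+scripts:
+
+```
+# illustrative only; run over every path in task.touches_files
+# (assumes absolute paths — resolve repo-relative paths against the
+# plan's target repo first)
+preflight_writable_roots() {
+  local roots=(
+    /home/deadpool/Documents/recodee
+    /home/deadpool/Documents/codex-fleet
+    /tmp
+    "$CODEX_HOME"
+  )
+  local p r ok
+  for p in "$@"; do
+    ok=0
+    for r in "${roots[@]}"; do
+      case "$p" in "$r"|"$r"/*) ok=1; break ;; esac
+    done
+    if [ "$ok" -eq 0 ]; then
+      echo "outside writable roots: $p" >&2
+      return 1
+    fi
+  done
+  return 0
+}
+```
+
+A non-zero exit maps directly onto the `BLOCKED preflight=writable-root`
+post above; a zero exit means it is safe to proceed to `task_claim_file`.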
+
+### Blocker (real, not preflight)
+
+If verification fails, the build breaks, the spec is ambiguous, or a dep
+artifact is missing:
+- `task_post(kind: 'blocker', content: 'BLOCKED branch=<br>; plan=<plan_slug>/sub-<sub_idx>; reason=<one line>; need=<one line>')`
+- `task_hand_off(to_agent: 'any')`
+- Release file claims so another pane can retry.
+- `goto 2`.
+
+## Rate limits (429 / quota)
+
+Single response, then back off:
+1. `task_post(kind: 'blocker', content: 'rate-limit account=$CODEX_FLEET_ACCOUNT_EMAIL; releasing claim')`
+2. Release file claims via `task_claim_file` with the released flag (or
+   `task_hand_off released_files=[...]`).
+3. `sleep 300`. Then `goto 2`. Another account picks up the released task.
+
+## Don't
+
+- Don't run `codex login` / `codex logout`. CODEX_HOME is fixed.
+- Don't edit `~/.codex/` or `$CODEX_HOME/`.
+- Don't commit on `main` / `dev`. Always work in an agent-branch worktree.
+- Don't call `rescue_stranded_scan` directly — the supervisor daemon owns it.
+- Don't poll faster than 60s on empty queues.
+- Don't propose new tasks. Don't invent scope. Don't widen `touches_files`.
+- Don't narrate. Don't summarize. The orchestrator reads Colony, not pane text.
+
+Now: step 1, once. Then loop from step 2.
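+
+---
+
+Non-normative appendix: the tier + specialty gate restated as a shell
+predicate, for readers skimming the rule. The function name and argument
+convention are illustrative; the worker applies the rule directly from the
+prompt text above.
+
+```
+# gate_accepts <difficulty> <plan_slug>   → exit 0 = take the task
+gate_accepts() {
+  local d="${1:-standard}" plan="$2"
+  local tier="${CODEX_FLEET_TIER:-high}" spec="${CODEX_FLEET_SPECIALTY:-}"
+  case "$tier:$d" in
+    high:hard|high:standard|high:trivial) ;;
+    medium:standard|medium:trivial) ;;
+    low:trivial) ;;
+    *) echo "tier-skip: difficulty=$d tier=$tier"; return 1 ;;
+  esac
+  [ -z "$spec" ] && return 0          # empty spec = generalist, never skip
+  local s
+  for s in $(printf '%s' "$spec" | tr ',' ' '); do
+    case "$plan" in "$s"*) return 0 ;; esac   # any entry that prefixes the slug
+  done
+  echo "specialty-skip: plan=$plan spec=$spec"
+  return 1
+}
+```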