feat(cli,worker): daemon fast-path for colony bridge lifecycle#443
Merged
NagyVikt merged 3 commits intoMay 5, 2026
Merged
Conversation
Adds apps/cli/bin/colony.sh — POSIX shell wrapper that fast-paths `colony bridge lifecycle --json` via curl POST to a (not-yet-existing) daemon endpoint at /api/bridge/lifecycle. Falls back to in-process Node on any failure with stdin replayed from a temp file. Not yet wired: - Worker endpoint /api/bridge/lifecycle (next commit) - package.json bin entry change - Tests - CLAUDE.md rule #10 reconciliation (blocked on user decision) Why WIP commit: rule #10 forbids hooks crossing an HTTP boundary on the write path. The fast-path violates the literal text but preserves the spirit (writes still succeed via fallback when daemon is down). That's an architectural call that needs explicit user sign-off, not my judgement. Committing the wrapper makes the branch durable while that decision is pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every IDE tool event triggers `colony bridge lifecycle ...` from external hook integrations. Cold-starting Node + bundle load on each event pegs ~one core for ~300ms — multiplied across concurrent agents this is a measurable CPU storm visible at the system level. The CLI bin entry is now a POSIX shell wrapper at apps/cli/bin/colony.sh. When invoked as `colony bridge lifecycle --json`, it POSTs the envelope to the long-lived worker daemon at POST /api/bridge/lifecycle (new) and exits — no Node startup. Anything else execs the Node CLI exactly as before. The worker's daemon route reuses the long-lived MemoryStore so the SQLite connection isn't reopened per event. CLAUDE.md rule #10 is reworded to match the new contract: writes still succeed in-process when the daemon is unreachable. The wrapper buffers stdin to a temp file and falls back to Node on curl missing, connection refused, 2s timeout, non-200, unknown flags, or invocation without --json. Tests: - apps/worker/test/server.test.ts: POST /api/bridge/lifecycle routes envelopes through the long-lived store, dedupes duplicate event_ids, surfaces ok:false on malformed input. - apps/cli/test/bin-shim.test.ts: load-bearing rule-10 contract test. With daemon unreachable, the wrapper falls back to Node with stdin and quoted args (including paths with spaces) intact. Also covers COLONY_BRIDGE_FAST=0 disable, --version pass-through, and that bare `bridge lifecycle` (no --json) keeps human-readable output. Opt out at any time with COLONY_BRIDGE_FAST=0. Notes: - pnpm lint OOM'd locally (host has 13GB active swap from unrelated processes); CI lint should run cleanly. Will re-run before merge. - e2e-publish.sh re-run pending — it covers the bin-shim install path end-to-end and is the official guard for changes to the publish surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/bench-bridge-fastpath.mjs. Boots a fresh worker against a temp COLONY_HOME on a non-default port, then runs N concurrent shell-shim invocations of `bridge lifecycle --json` once with the daemon reachable (fast path) and once with COLONY_BRIDGE_FAST=0 (forced in-process Node fallback, mirrors today's behavior). Reports mean, median, p95, p99. Local results, 8 concurrent × 4 iterations = 32 events per path: [fast (daemon)] mean=54.2ms p95=74.1ms p99=130.4ms [slow (force-fallback)] mean=260.9ms p95=725.4ms p99=728.7ms speedup (mean): 4.8x saved: 206.7ms/event speedup (p95): 9.8x The p95 collapse from 725ms → 74ms is the Node cold-start tail under burst load — exactly the spawn-storm pattern that motivated this PR (four `colony bridge lifecycle` processes simultaneously at 100% CPU each in the original screenshot). Caveats: includes sh + curl + JSON serialization on the fast path; the "slow" path includes the wrapper buffering stdin then exec'ing Node (~5-10ms extra vs. raw `colony bridge lifecycle`), but that's swamped by the ~250ms Node cold-start it's measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Every IDE tool event triggers
colony bridge lifecycle ...from external hook integrations (oh-my-codex'sColonyBridge.spawnSync, Codex/Claude Code settings). Cold-starting Node + JIT + bundle load on each event pegs ~one core for ~300 ms. Multiplied across concurrent agents this is a measurable CPU storm — visible ingnome-system-monitoras 4+ short-livedcolony bridge lifecycleprocesses at 85–111% CPU each, plus active swap thrashing.Diagnosis from
psagainst the original screenshot:colony bridge lifecycleat 85–111% CPU each (transient, sub-second lifespan)codex-native-hook.jsat 30–40% each (oh-my-codex's per-event Node spawns)What
The CLI bin entry is now a POSIX shell wrapper at
apps/cli/bin/colony.sh. When invoked ascolony bridge lifecycle --json, itPOSTs the envelope to the long-lived worker daemon atPOST /api/bridge/lifecycle(new) and exits — no Node startup. Anything else execs the Node CLI exactly as before.The worker route reuses the long-lived
MemoryStore, so the SQLite connection isn't reopened per event. Lifecycle dedup, observation routing, and audit writes are unchanged — they happen inside the daemon now.Rule #10 reconciliation
CLAUDE.md rule #10 previously read "Hooks write observations synchronously through `MemoryStore.addObservation` — never across a network or HTTP boundary." The fast path technically crosses an HTTP boundary, so this PR rewords rule #10 to match the actual contract: writes still complete in-process when the daemon is unavailable.
The wrapper buffers stdin to a temp file and falls back to Node on:
Opt out at any time with `COLONY_BRIDGE_FAST=0`.
The fallback is the load-bearing contract that keeps rule #10's intent alive. Regression-tested in `apps/cli/test/bin-shim.test.ts`.
Tests
All 216 CLI + 48 worker tests pass.
Verification status
Manual wrapper smoke
```
$ apps/cli/bin/colony.sh -V
0.6.0
$ apps/cli/bin/colony.sh --version # pre-existing bug, identical to raw node
error: unknown option '--version'
$ node apps/cli/dist/index.js --version # same defect, wrapper is transparent
error: unknown option '--version'
```
Bench
Bench under spawn-storm load is pending — single-event timing isn't representative; the win lands on concurrent-agent burst scenarios. Will run before marking ready.
🤖 Generated with Claude Code