Skip to content

feat(cli,worker): daemon fast-path for colony bridge lifecycle#443

Merged
NagyVikt merged 3 commits into
mainfrom
agent/claude/daemonize-bridge-lifecycle-cut-cpu-2026-05-05-21-53
May 5, 2026
Merged

feat(cli,worker): daemon fast-path for colony bridge lifecycle#443
NagyVikt merged 3 commits into
mainfrom
agent/claude/daemonize-bridge-lifecycle-cut-cpu-2026-05-05-21-53

Conversation

@NagyVikt
Copy link
Copy Markdown
Collaborator

@NagyVikt NagyVikt commented May 5, 2026

Why

Every IDE tool event triggers colony bridge lifecycle ... from external hook integrations (oh-my-codex's ColonyBridge.spawnSync, Codex/Claude Code settings). Cold-starting Node + JIT + bundle load on each event pegs ~one core for ~300 ms. Multiplied across concurrent agents this is a measurable CPU storm — visible in gnome-system-monitor as 4+ short-lived colony bridge lifecycle processes at 85–111% CPU each, plus active swap thrashing.

Diagnosis from ps against the original screenshot:

  • colony bridge lifecycle at 85–111% CPU each (transient, sub-second lifespan)
  • 15× codex-native-hook.js at 30–40% each (oh-my-codex's per-event Node spawns)
  • Single MCP worker steady at 65% (not the bottleneck)

What

The CLI bin entry is now a POSIX shell wrapper at apps/cli/bin/colony.sh. When invoked as colony bridge lifecycle --json, it POSTs the envelope to the long-lived worker daemon at POST /api/bridge/lifecycle (new) and exits — no Node startup. Anything else execs the Node CLI exactly as before.

The worker route reuses the long-lived MemoryStore, so the SQLite connection isn't reopened per event. Lifecycle dedup, observation routing, and audit writes are unchanged — they happen inside the daemon now.

Rule #10 reconciliation

CLAUDE.md rule #10 previously read "Hooks write observations synchronously through `MemoryStore.addObservation` — never across a network or HTTP boundary." The fast path technically crosses an HTTP boundary, so this PR rewords rule #10 to match the actual contract: writes still complete in-process when the daemon is unavailable.

The wrapper buffers stdin to a temp file and falls back to Node on:

  • `curl` missing
  • Connection refused / unreachable port
  • 2-second connect+total timeout
  • Non-200 response
  • Unknown flags
  • Invocation without `--json` (humans want pretty output)

Opt out at any time with `COLONY_BRIDGE_FAST=0`.

The fallback is the load-bearing contract that keeps rule #10's intent alive. Regression-tested in `apps/cli/test/bin-shim.test.ts`.

Tests

  • `apps/worker/test/server.test.ts` — `POST /api/bridge/lifecycle` block (3 new):
    • routes a valid envelope through the long-lived store
    • dedupes duplicate `event_id`s as `route=duplicate`
    • returns `ok:false` on malformed envelopes
  • `apps/cli/test/bin-shim.test.ts` — wrapper smoke (5 new):
    • rule-10 contract: daemon down → stdin + quoted args (incl. paths with spaces) survive fallback to Node
    • `COLONY_BRIDGE_FAST=0` disables fast-path
    • non-`bridge lifecycle` commands pass through unchanged
    • `bridge lifecycle` without `--json` falls through (preserves human output)
    • bin shim is executable when packed

All 216 CLI + 48 worker tests pass.

Verification status

  • `pnpm typecheck` — green
  • `pnpm test` — 216 + 48 pass
  • `pnpm build` — green
  • `pnpm lint` — host OOM'd (13 GB active swap from unrelated processes); CI rerun needed
  • `bash scripts/e2e-publish.sh` — pack/install/tarball-inspection clean (wrapper at `package/bin/colony.sh`), but blocks at check Add a bounded hivemind MVP inside the monorepo #6 (`$BIN --version`). Pre-existing bug: `node dist/index.js --version` fails identically on this branch and on `main`. PR Accept lowercase CLI version shorthand #372's `.version('-v, -V, --version')` registers only the first flag. Out of scope here; worth a follow-up.

Manual wrapper smoke

```
$ apps/cli/bin/colony.sh -V
0.6.0

$ apps/cli/bin/colony.sh --version # pre-existing bug, identical to raw node
error: unknown option '--version'

$ node apps/cli/dist/index.js --version # same defect, wrapper is transparent
error: unknown option '--version'
```

Bench

Bench under spawn-storm load is pending — single-event timing isn't representative; the win lands on concurrent-agent burst scenarios. Will run before marking ready.

🤖 Generated with Claude Code

NagyVikt and others added 2 commits May 5, 2026 22:02
Adds apps/cli/bin/colony.sh — POSIX shell wrapper that fast-paths
`colony bridge lifecycle --json` via curl POST to a (not-yet-existing)
daemon endpoint at /api/bridge/lifecycle. Falls back to in-process Node
on any failure with stdin replayed from a temp file.

Not yet wired:
- Worker endpoint /api/bridge/lifecycle (next commit)
- package.json bin entry change
- Tests
- CLAUDE.md rule #10 reconciliation (blocked on user decision)

Why WIP commit: rule #10 forbids hooks crossing an HTTP boundary on
the write path. The fast-path violates the literal text but preserves
the spirit (writes still succeed via fallback when daemon is down).
That's an architectural call that needs explicit user sign-off, not
my judgement. Committing the wrapper makes the branch durable while
that decision is pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every IDE tool event triggers `colony bridge lifecycle ...` from external
hook integrations. Cold-starting Node + bundle load on each event pegs
~one core for ~300ms — multiplied across concurrent agents this is a
measurable CPU storm visible at the system level.

The CLI bin entry is now a POSIX shell wrapper at apps/cli/bin/colony.sh.
When invoked as `colony bridge lifecycle --json`, it POSTs the envelope
to the long-lived worker daemon at POST /api/bridge/lifecycle (new) and
exits — no Node startup. Anything else execs the Node CLI exactly as
before.

The worker's daemon route reuses the long-lived MemoryStore so the
SQLite connection isn't reopened per event. CLAUDE.md rule #10 is
reworded to match the new contract: writes still succeed in-process
when the daemon is unreachable. The wrapper buffers stdin to a temp
file and falls back to Node on curl missing, connection refused,
2s timeout, non-200, unknown flags, or invocation without --json.

Tests:
- apps/worker/test/server.test.ts: POST /api/bridge/lifecycle routes
  envelopes through the long-lived store, dedupes duplicate event_ids,
  surfaces ok:false on malformed input.
- apps/cli/test/bin-shim.test.ts: load-bearing rule-10 contract test.
  With daemon unreachable, the wrapper falls back to Node with stdin
  and quoted args (including paths with spaces) intact. Also covers
  COLONY_BRIDGE_FAST=0 disable, --version pass-through, and that
  bare `bridge lifecycle` (no --json) keeps human-readable output.

Opt out at any time with COLONY_BRIDGE_FAST=0.

Notes:
- pnpm lint OOM'd locally (host has 13GB active swap from unrelated
  processes); CI lint should run cleanly. Will re-run before merge.
- e2e-publish.sh re-run pending — it covers the bin-shim install path
  end-to-end and is the official guard for changes to the publish
  surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/bench-bridge-fastpath.mjs. Boots a fresh worker against a
temp COLONY_HOME on a non-default port, then runs N concurrent shell-shim
invocations of `bridge lifecycle --json` once with the daemon reachable
(fast path) and once with COLONY_BRIDGE_FAST=0 (forced in-process Node
fallback, mirrors today's behavior). Reports mean, median, p95, p99.

Local results, 8 concurrent × 4 iterations = 32 events per path:

  [fast (daemon)]         mean=54.2ms   p95=74.1ms    p99=130.4ms
  [slow (force-fallback)] mean=260.9ms  p95=725.4ms   p99=728.7ms

  speedup (mean): 4.8x   saved: 206.7ms/event
  speedup (p95):  9.8x

The p95 collapse from 725ms → 74ms is the Node cold-start tail under
burst load — exactly the spawn-storm pattern that motivated this PR
(four `colony bridge lifecycle` processes simultaneously at 100% CPU
each in the original screenshot).

Caveats: includes sh + curl + JSON serialization on the fast path; the
"slow" path includes the wrapper buffering stdin then exec'ing Node
(~5-10ms extra vs. raw `colony bridge lifecycle`), but that's swamped
by the ~250ms Node cold-start it's measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NagyVikt NagyVikt marked this pull request as ready for review May 5, 2026 20:46
@NagyVikt NagyVikt merged commit f824d52 into main May 5, 2026
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant