perf(app): kill the cold-launch beachball + guards against its return#242
Merged
Conversation
…results
Cold-launch beachball traced to AcpManager.detectAgents → cachedResolve →
resolveSystemBinary, which uses execSync('<user-shell> -ilc command -v
<name>') once per agent. On a machine with a heavy .zshrc the three
serialised interactive-shell spawns block the main-process event loop
for several seconds and macOS surfaces a spinning pinwheel.
Three changes restore the launch experience:
1. core/resolve-bin: new `resolveSystemBinaryAsync`, `cachedResolveAsync`,
`hydrateResolveCache`, and `getResolveCacheSnapshot`. The async path
tries process PATH first, then well-known install locations (pure
filesystem probes — homebrew, mise shims, nvm, ~/.local/bin), and
only falls back to interactive-shell lookup when nothing else
matched. Sync variants kept for API compatibility.
2. app/main/binaryCache: persists the resolver's snapshot to
~/.spool/resolved-bins.json after every miss-then-resolve, and
hydrates from disk during `app.whenReady` before any IPC handler can
run. Cached entries are validated with `existsSync` before being
served, so a brew upgrade or manual removal triggers a re-resolve
instead of leaking a stale path.
3. app/main/acp: detectAgents parallelises its three lookups via
Promise.all; query() awaits the async resolver for both the native
binary path and node fallback; envSetup is now async so the claude
path can also flow through the async cache.
Net effect on this machine:
- cold cache (first ever launch): 3 lookups in parallel, each tries
fast paths before any shell exec — typically resolves via
well-known paths in <50ms
- warm cache (second+ launch): zero shell exec; cached paths
served from in-memory map, validated against fs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A blocking call on the main-process JS thread (sync child_process, sync
large I/O, an unbroken JSON.parse on a huge blob) manifests to the user
as a beachball within seconds of opening the app. v0.4.17 shipped one
such regression via the agent-binary resolver; the perf commit before
this one is the fix.
To stop that class of bug from recurring without enumerating offenders,
we observe the symptom directly:
- `main/eventLoopMonitor.ts` wraps `node:perf_hooks.monitorEventLoopDelay`,
a nanosecond-precision C++ histogram that exists for exactly this
purpose. Started at the very top of `main/index.ts` module evaluation
so it covers Electron bootstrap, not just whenReady-and-after.
- `main/index.ts` installs a `globalThis.__spoolEventLoopLag` getter
when `SPOOL_E2E_TEST=1`, so the e2e harness can pull the histogram
via `electronApp.evaluate(...)` without exposing an IPC channel in
production builds.
- `e2e/startup-no-beachball.spec.ts` launches the packaged app under
the standard launchApp fixture (which already provisions a fresh
SPOOL_HOME — i.e. an empty `resolved-bins.json`, i.e. cold cache)
and asserts two thresholds:
p99 < 200ms — typical experience hasn't drifted
max < 1000ms — no beachball, ever
Thresholds are intentionally permissive enough to survive CI runners
and GC jitter. The point is to fail loudly if a sub-second stall
returns, not to police micro-stutter.
Local run on this machine:
maxMs: 152, p99Ms: 35, meanMs: 13, count: 185
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ESLint had not been wired up before. Adding it now for a single targeted purpose: stop sync `execSync` / `spawnSync` / `execFileSync` from reappearing in `packages/app/src/main/**`. Those calls block the AppKit event loop and produced the launch beachball traced in the perf commit on this branch — the test guard added in the prior commit catches it after the fact, this lint rule catches it at PR time. Setup: - Root devDependencies: `eslint@10` + `@typescript-eslint/parser` - `eslint.config.mjs` at repo root using ESLint's flat config - Root `lint` script: `eslint .` (was `turbo lint` orchestrating an empty per-package script that no package defined) - The rule itself: `no-restricted-imports` on `packages/app/src/main/**/*.ts`, banning the three sync `node:child_process` exports by name. Bare `child_process` form banned too. Two pre-existing call sites are kept and justified inline: - `acp.ts` execSync inside `getLoginShellEnv` — fires only on the first user-triggered ACP query, never on the launch path - `terminal.ts` execSync inside the "Resume in terminal" handler — also strictly user-triggered, not launch path Both carry a one-line `eslint-disable-next-line no-restricted-imports` with a comment explaining why the sync form is safe there. Anyone adding a new sync-exec usage anywhere else under `main/` will see the error at lint time, with the rule message pointing them at the async path. Layered with the perf_hooks e2e test from the prior commit and the async-resolver fix two commits back, this gives three independent defenses against the same class of regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9a9a0a3 to
2b5ff92
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
v0.4.17 had a multi-second main-process stall on cold launch on machines with a heavy shell init — long enough that macOS surfaced a beachball within seconds of opening the app. Root cause was the agent-binary resolver doing three serialised `execSync(' -ilc command -v ')` calls on the main-process JS thread. This branch fixes it and adds two independent guards so the same class of regression cannot return silently.
How it manifested
`sample` of the main process during launch showed it pinned at `v8::Function::Call` → JIT inside what turned out to be `AcpManager.detectAgents` → `cachedResolve` → `resolveSystemBinary` → `execSync(' -ilc ...')`, with no chance for the AppKit event loop to make forward progress.
What's in this PR
Three commits, each scoped to one defense in depth.
1. `perf(app): async + parallel agent-binary resolution with disk-cached results`
Net effect:
2. `test(app): catch any main-thread stall on cold launch`
Symptom-level guard — observes main-thread event-loop delay directly via `node:perf_hooks.monitorEventLoopDelay` (nanosecond-precision C++ histogram, built for exactly this). The monitor is started at the very top of `main/index.ts` module evaluation so it covers Electron bootstrap, not just `whenReady`-and-after.
When `SPOOL_E2E_TEST=1` the snapshot is exposed via a `globalThis` getter the e2e harness reads with `electronApp.evaluate(...)` — no production IPC surface.
`e2e/startup-no-beachball.spec.ts` launches the packaged app under the standard launchApp fixture (which already provisions a fresh `SPOOL_HOME` — empty `resolved-bins.json` — i.e. cold cache) and asserts two thresholds:
Thresholds are intentionally permissive to survive CI runners and GC jitter. The point is to fail loudly if a sub-second stall returns, not to police micro-stutter.
Local run: `maxMs: 152, p99Ms: 35, meanMs: 13, count: 185` (over a 2.4s window).
This catches the symptom regardless of cause — any future sync child_process, sync big I/O, or unbroken `JSON.parse` on a huge blob will fail it.
3. `chore(repo): ESLint setup that bans sync child_process in main/**`
ESLint had not been wired up before. Added for a single targeted purpose: stop sync `execSync` / `spawnSync` / `execFileSync` from reappearing in `packages/app/src/main/**`.
Two pre-existing call sites are kept and justified inline with `eslint-disable-next-line` plus a comment explaining why the sync form is safe there (both fire on explicit user gestures long after launch, never on the cold path). Anyone adding a new sync-exec usage anywhere else under `main/` will see the error at lint time with the rule message pointing at the async path.
Defense layers
The three commits stack:
Test plan