[codex] Prevent EMFILE startup failures and stale workspace WARNs #50

Draft
brsbl wants to merge 1 commit into ymichael:main from brsbl:codex/fix-status-data-watch-emfile

brsbl commented May 23, 2026

Summary

Fixes two startup-facing failure modes that showed up as terminal errors while starting bb:

  1. The server could fail its health check with "EMFILE: too many open files, watch" because the STATUS-data watcher recursively watched the full thread-storage tree.
  2. The host daemon logged an alarming "command execution failed" WARN for workspace.status when probing a stale managed worktree path that no longer exists.

Bugs And Failure Modes

EMFILE health-check timeout

The server started a recursive watcher at the full thread-storage root, usually ~/.bb/thread-storage, even though the watcher only needs the durable STATUS state files at <thread-id>/STATUS-data/<key>.json.

On machines with many retained thread-storage artifacts, generated reports, screenshots, browser user-data directories, or archived worktrees, that recursive watch could consume too many file descriptors or watch handles. The observable failure mode was:

Error: EMFILE: too many open files, watch
  at FSWatcher._handle.onchange (node:internal/fs/watchers:207:21)
✗ Server failed to start (health check timed out)

Because the watcher starts before the HTTP server reports healthy, the launcher timed out while waiting for /health.

Stale managed-worktree status warning

After the server starts, the daemon may receive workspace.status for an environment whose managed worktree was already removed, for example an archived/destroyed environment still referenced by retained history. The command correctly returns path_not_found, and server cleanup/status callers can handle that result, but the host daemon also logged it as:

WARN: [host-daemon] command execution failed {"type":"workspace.status"}
WorkspaceError: Managed workspace path does not exist: ...

That made an expected stale-workspace probe look like another startup failure even though the server and daemon were already healthy.

Fix

  • Added a STATUS-data-specific ignore predicate for the server watcher.
  • Limited watcher traversal to the only required shape: <thread-id>/STATUS-data/<key>.json.
  • Kept thread directories and STATUS-data directories watchable so late-created STATUS-data directories still emit broadcasts.
  • Reused the same relative-path guard for parsing and directory-add handling.
  • Suppressed host-daemon WARN logging only for workspace.status failures whose structured error code is path_not_found.
  • Kept the workspace.status command result as ok: false with errorCode: "path_not_found", so server/UI logic still receives the real failure.
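
The traversal limit in the list above can be sketched as a pure guard over paths relative to the thread-storage root. This is a minimal illustration, not the PR's implementation: the names splitSegments, classifyRelativePath, shouldIgnoreRelativePath, and parseStatusKey are hypothetical, and it assumes the real watcher passes a predicate like this to its ignore option.

```typescript
// Hypothetical sketch: none of these names come from the PR itself.
type StatusPathKind =
  | "root"
  | "thread-dir"
  | "status-data-dir"
  | "status-key-file"
  | "other";

// Split a root-relative path into segments, tolerating both separators.
function splitSegments(relativePath: string): string[] {
  return relativePath.split(/[\\/]+/).filter((s) => s !== "" && s !== ".");
}

// Classify a path by the only shape the watcher needs:
// <thread-id>/STATUS-data/<key>.json
function classifyRelativePath(relativePath: string): StatusPathKind {
  const segments = splitSegments(relativePath);
  if (segments.length === 0) return "root";
  if (segments.includes("..")) return "other"; // never escape the root
  // A single segment is treated as a thread directory; the guard alone
  // cannot tell a stray top-level file from a directory without a stat.
  if (segments.length === 1) return "thread-dir";
  if (segments[1] !== "STATUS-data") return "other";
  if (segments.length === 2) return "status-data-dir";
  const isKeyFile =
    segments.length === 3 &&
    segments[2].endsWith(".json") &&
    segments[2].length > ".json".length;
  return isKeyFile ? "status-key-file" : "other";
}

// Ignore everything except the root, thread dirs, STATUS-data dirs,
// and JSON key files, so late-created STATUS-data dirs stay watchable.
function shouldIgnoreRelativePath(relativePath: string): boolean {
  return classifyRelativePath(relativePath) === "other";
}

// Reuse the same guard when parsing a changed file into (thread, key).
function parseStatusKey(
  relativePath: string
): { threadId: string; key: string } | null {
  if (classifyRelativePath(relativePath) !== "status-key-file") return null;
  const [threadId, , fileName] = splitSegments(relativePath);
  return { threadId, key: fileName.slice(0, -".json".length) };
}
```

Sharing one classifier between the ignore predicate and the parser means the two cannot drift apart: any path the watcher is allowed to see is also one the parser knows how to accept or reject.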

Testing And Validation

  • pnpm exec turbo run typecheck --filter=@bb/server
  • pnpm exec turbo run test --filter=@bb/server -- test/services/threads/status-state-watcher.test.ts
  • pnpm exec turbo run build --filter=@bb/server
  • Startup smoke with BB_THREAD_STORAGE=/Users/brsbl/.bb/thread-storage: server /health returned OK.
  • pnpm exec turbo run test --filter=@bb/host-daemon -- test/command/command-router.test.ts
  • pnpm exec turbo run typecheck --filter=@bb/host-daemon
  • pnpm exec turbo run build --filter=@bb/host-daemon
  • Packaged bb-app rebuild plus packaged-server smoke using packages/bb-app/server/dist/index.js with BB_THREAD_STORAGE=/Users/brsbl/.bb/thread-storage: /health returned OK.

STATUS fallback read warning

The unified status route intentionally probes STATUS/index.html before falling back to STATUS.html. When the STATUS/ directory is absent, host.read_file_relative returned ENOENT correctly but classified the miss as a normal command failure, causing the daemon to log WARN: command execution failed even though the server expected and handled the fallback.

The fix now classifies missing relative-read roots as expected ENOENT failures. The command result is unchanged (ok: false, errorCode: "ENOENT"), but the daemon no longer emits a WARN for that expected read miss.

Additional validation:

  • pnpm exec turbo run test --filter=@bb/host-daemon -- test/command/command-router.test.ts src/command-handlers/host-files.test.ts
  • pnpm exec turbo run typecheck --filter=@bb/host-daemon
  • pnpm exec turbo run build --filter=@bb/host-daemon
  • pnpm exec turbo run test --filter=@bb/server -- test/public/public-thread-data.test.ts
  • pnpm exec turbo run build --filter=bb-app
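
The logging behavior described in this PR (the failure result is returned unchanged, and only the WARN is skipped for specific expected error codes) can be sketched as a small decision function. This is an assumption-laden illustration: the CommandResult shape, the EXPECTED_FAILURE_CODES table, and shouldWarn are hypothetical names, and the actual change may mark failures as expected inside the command handlers rather than through a lookup table in the router.

```typescript
// Hypothetical result shape; only ok and errorCode appear in the PR text.
interface CommandSuccess {
  ok: true;
}
interface CommandFailure {
  ok: false;
  errorCode: string;
  message: string;
}
type CommandResult = CommandSuccess | CommandFailure;

// Assumed table: command type -> error codes that are expected misses.
const EXPECTED_FAILURE_CODES = new Map<string, ReadonlySet<string>>([
  ["workspace.status", new Set(["path_not_found"])],
  ["host.read_file_relative", new Set(["ENOENT"])],
]);

// Decide whether a result deserves a WARN log line. The result itself
// is never modified, so callers still see ok: false and the errorCode.
function shouldWarn(commandType: string, result: CommandResult): boolean {
  if (result.ok) return false;
  const expected = EXPECTED_FAILURE_CODES.get(commandType);
  const isExpected = expected !== undefined && expected.has(result.errorCode);
  return !isExpected;
}
```

Keeping the suppression keyed on both the command type and the structured error code is what prevents it from broadening: an invalid_path failure from host.read_file_relative, or a path_not_found from any other command, still produces a WARN in this sketch.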

brsbl left a comment

Code review result: no findings.

I reviewed the watcher scope change for missed STATUS-data broadcasts, accidental over-filtering, and startup behavior with late-created STATUS-data directories. The bounded depth still permits root thread directories, STATUS-data directories, and JSON key files, while excluding unrelated thread-storage artifacts. Existing priming remains shallow and targeted, and the new test covers the important late-directory creation path.

Validation already performed:

  • pnpm exec turbo run typecheck --filter=@bb/server
  • pnpm exec turbo run test --filter=@bb/server -- test/services/threads/status-state-watcher.test.ts
  • pnpm exec turbo run build --filter=@bb/server
  • Startup smoke with BB_THREAD_STORAGE=/Users/brsbl/.bb/thread-storage: /health returned OK.

Residual risk: the depth semantics are chokidar (dependency) behavior rather than something this code controls, but the focused test exercises the required add path through the real watcher.

brsbl force-pushed the codex/fix-status-data-watch-emfile branch from 93eb783 to d4e58e0 on May 23, 2026 21:59

brsbl changed the title from "[codex] Limit STATUS-data watcher to prevent EMFILE startup failures" to "[codex] Prevent EMFILE startup failures and stale workspace WARNs" on May 23, 2026

brsbl left a comment

Review result after the latest amend: no blocking findings. I re-reviewed the watcher narrowing and the host-daemon WARN suppression. The watcher still preserves late-created STATUS-data directories and JSON file events, and the host-daemon change only suppresses WARN logging for workspace.status path_not_found while preserving the failed command result for server/UI handling. Validation run: server typecheck, focused server watcher test, server build, production-style startup smoke, host-daemon command-router test, host-daemon typecheck, and host-daemon build.

brsbl force-pushed the codex/fix-status-data-watch-emfile branch from d4e58e0 to 97b96a6 on May 23, 2026 22:12

brsbl left a comment

Follow-up review after addressing the host.read_file_relative warning: no blocking findings. The additional change keeps the daemon command result as ok:false/ENOENT for missing relative reads, but marks the miss as expected so the router does not emit WARN during the intentional STATUS/index.html -> STATUS.html fallback. I rechecked that invalid_path and other non-ENOENT errors are not broadened by this change. Additional validation passed: focused host-daemon command-router + host-files tests, host-daemon typecheck/build, server public-thread-data route suite, and bb-app package rebuild.
