
Fix battery drain from hung Codex RPC + 60 FPS animation loop #844

Closed
hyspacex wants to merge 2 commits into steipete:main from hyspacex:fix/issue-842-codex-rpc-battery-drain

Conversation

@hyspacex
Contributor

@hyspacex hyspacex commented May 5, 2026

Closes #842.

Summary

Three connected changes that together stop the menu-bar from spending CPU forever when a Codex RPC call hangs:

  1. Per-method RPC timeouts so account/rateLimits/read (and friends) can't await stdout indefinitely.
  2. Process termination on timeout so the AsyncStream reader actually exits instead of leaking.
  3. Animation hardening — drop loading FPS 60→30, add a 30s hard ceiling on continuous animation.

Root cause (verified in #842)

When codex app-server accepted initialize but never replied to account/rateLimits/read:

  1. CodexRPCClient.request (Sources/CodexBarCore/UsageFetcher.swift) awaited for await lineData in self.stdoutLineStream with no per-request deadline.
  2. loadRPCUsage hung, so withFallback never saw a throw and the TTY secondary never ran.
  3. UsageStore.isStale(provider:) returns errors[provider] != nil. A hang records no error → isStale = false.
  4. shouldAnimate returned !hasData && !isStale → true, so DisplayLinkDriver.start(fps: 60) ran main-thread NSImage redraws indefinitely. macOS flagged this as "Using Significant Energy."

The existing fallback-only animation guard (Animation.swift:670: "Animating the fallback causes unnecessary CPU usage (battery drain). See #269, #139.") didn't cover the hung-primary case.
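The staleness/animation chain above can be condensed into a few lines. This is a hypothetical model for illustration: the names (isStale, shouldAnimate) mirror the PR text, but the real UsageStore and controller differ.

```swift
import Foundation

// Condensed, hypothetical model of the bug chain; not the real UsageStore.
struct Store {
    var errors: [String: Error] = [:]
    var data: [String: String] = [:]
    func isStale(_ provider: String) -> Bool { errors[provider] != nil }
    func hasData(_ provider: String) -> Bool { data[provider] != nil }
}

func shouldAnimate(_ store: Store, provider: String) -> Bool {
    // Animate while loading: no data yet AND no recorded error.
    !store.hasData(provider) && !store.isStale(provider)
}

// A hung RPC records neither data nor an error, so this stays true forever.
var store = Store()
print(shouldAnimate(store, provider: "codex")) // true

// Once a timeout records an error, isStale flips and animation stops.
store.errors["codex"] = NSError(domain: "timeout", code: 1)
print(shouldAnimate(store, provider: "codex")) // false
```

The fix makes the hung path record an error (via the new timeout), which is what flips `isStale` and breaks the infinite-animation condition.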

What changed

Sources/CodexBarCore/UsageFetcher.swift

  • RPCWireError.timeout(method:) added.
  • CodexRPCClient and UsageFetcher take internal-only initializeTimeoutSeconds (default 8s), requestTimeoutSeconds (default 3s), ttyTimeoutSeconds (default 10s). Defaults match production behavior; tests inject shorter values.
  • request() races the stdout read against Task.sleep via withThrowingTaskGroup. On timeout: process.terminate() (closes stdout, unblocks the AsyncStream reader, body task unwinds) → throw RPCWireError.timeout → propagates through withFallback to TTY.
  • loadTTYUsage / loadTTYCredits get a defensive runWithTTYTimeout wrapper. CodexStatusProbe.fetch() already has its own internal timeout; the outer race is defense-in-depth and bounds injected test fetchers that bypass the probe.
  • The terminate-on-timeout log line names the failing method so prod logs distinguish initialize vs account/rateLimits/read hangs.
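The timeout race described above can be sketched as follows. This is a minimal model under stated assumptions: RPCWireError is simplified, and the real CodexRPCClient.request also owns the child process and its stdout line stream; here the termination side effect is injected as a closure.

```swift
import Foundation

// Simplified stand-in for the PR's RPCWireError.timeout(method:).
enum RPCWireError: Error {
    case timeout(method: String)
}

// Race `body` against a deadline inside a throwing task group. On timeout,
// run `onTimeout` (in production: process.terminate(), which closes stdout
// and unblocks the AsyncStream reader) and throw RPCWireError.timeout.
func withRequestTimeout<T: Sendable>(
    seconds: Double,
    method: String,
    onTimeout: @escaping @Sendable () -> Void,
    body: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await body() }
        group.addTask {
            try await Task.sleep(for: .seconds(seconds))
            onTimeout()
            throw RPCWireError.timeout(method: method)
        }
        // First child to finish wins; next() rethrows if that child threw.
        let result = try await group.next()!
        group.cancelAll() // cancel the losing child
        return result
    }
}
```

On the happy path the body finishes first and the sleeping child is cancelled; on a hang the timeout child throws, the error propagates through `withFallback`, and the TTY secondary gets its turn.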

Sources/CodexBar/StatusItemController+Animation.swift

  • Loading animation dropped from 60 FPS to 30 FPS, with a derived phase increment so the sweep speed is unchanged.
  • 30s hard ceiling on continuous animation.

Sources/CodexBar/StatusItemController.swift

  • var animationStartedAt: Date? to track the ceiling.
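The ceiling logic can be sketched like this. The type name and method are hypothetical; only `animationStartedAt` and the 30s constant come from the PR.

```swift
import Foundation

// Hypothetical sketch of the 30s continuous-animation ceiling; not the
// actual StatusItemController implementation.
final class AnimationCeiling {
    static let maxContinuousSeconds: TimeInterval = 30
    var animationStartedAt: Date?

    /// Returns false once continuous animation exceeds the ceiling, even if
    /// no error was ever recorded for the provider (belt-and-suspenders).
    func mayContinue(now: Date = Date()) -> Bool {
        let start = animationStartedAt ?? now
        animationStartedAt = start
        if now.timeIntervalSince(start) > Self.maxContinuousSeconds {
            animationStartedAt = nil // reset so a later refresh can animate again
            return false
        }
        return true
    }
}

let ceiling = AnimationCeiling()
print(ceiling.mayContinue(now: Date(timeIntervalSince1970: 0)))  // true
print(ceiling.mayContinue(now: Date(timeIntervalSince1970: 31))) // false
```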

Tests/CodexBarTests/CodexUsageFetcherFallbackTests.swift

4 new regression tests, added to the existing file (no parallel suite):

Each test pairs a scenario with the behavior it locks in:

  • Hung rate-limits read times out and falls back to TTY within budget: RPC timeout fires fast enough that TTY fallback completes in <2s.
  • Repeated hung RPC requests stay bounded across calls: process termination actually unblocks the reader (no leaked task / wedged pipe).
  • Hanging TTY fallback also times out within budget: the defensive TTY wrapper bounds custom fetchers; specifically asserts RPCWireError.timeout, not just any throw.
  • Recorded error marks provider stale and stops menu-bar animation: the second half of the chain; once an error is recorded, shouldAnimate returns false and animationDriver stays nil.

The first stub (makeHungRateLimitsStubCodexCLI) mirrors the production failure shape: responds to initialize, hangs forever on account/rateLimits/read, answers account/read and TTY normally.
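A closure-shaped model of that stub looks like the following. This is hypothetical for illustration: the real makeHungRateLimitsStubCodexCLI stubs a child process writing JSON-RPC lines, not an in-process closure.

```swift
// Hypothetical model of the hung stub: answers `initialize` and other reads,
// but suspends forever on `account/rateLimits/read`.
func hungRateLimitsStub(method: String) async throws -> String {
    switch method {
    case "initialize":
        return #"{"result":{}}"#
    case "account/rateLimits/read":
        // Never resumes: models a child process that accepts the request but
        // never writes a response line to stdout. Only an external timeout
        // (or process termination) can unblock a caller awaiting this.
        try await withCheckedThrowingContinuation { (_: CheckedContinuation<Void, Error>) in }
        fatalError("unreachable")
    default:
        return #"{"result":{"ok":true}}"#
    }
}
```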

Decisions

  • Timeouts: 8s for initialize (covers child-process spawn + cold init), 3s for everything else (read calls are sub-second on healthy networks). Single-budget would either be too tight for initialize or too loose for the actual bug path.
  • 30 FPS: The energy cost halves vs 60 FPS but stays perceptually smooth. 12 FPS was on the table but visibly choppy on continuous sweeps (Knight Rider).
  • 30s hard ceiling: Belt-and-suspenders. Max legitimate refresh budget is ~21s (8s init + 3s request + 10s TTY); 30s is well above that. Cheap insurance against future code paths bypassing the error-recording chain.
  • No new user-facing settings. All timeouts are internal init parameters for tests only.

Verification

  • swift build — clean
  • swift test --filter CodexUsageFetcherFallbackTests — 10/10 (6 existing + 4 new)
  • swift test --filter BatteryDrainDiagnosticTests — 3/3
  • Fast-path test runtimes well under their assertions (Test A: 0.41s vs 2s budget; Test C: 0.82s; Test D: 1.01s)

The macOS "Using Significant Energy" badge is a heuristic over time and not deterministically reproducible on demand. What's verified here is the underlying mechanism: the hung-RPC chain that produced sustained 60 FPS main-thread redraws.

Out of scope

  • WebKit/AuthKit dashboard refresh resilience (already addressed upstream)
  • New "battery saver" user-facing toggle (this is a correctness fix, not a configurable behavior)

🤖 Generated with Claude Code

hyspacex and others added 2 commits May 4, 2026 21:45
Hung account/rateLimits/read left UsageFetcher awaiting stdout forever,
recording no error so isStale stayed false and the 60 FPS DisplayLink
ran indefinitely, triggering macOS "Using Significant Energy". Adds
per-method RPC timeouts (8s initialize, 3s request) that race the
stdout read and terminate the codex process so the AsyncStream reader
exits, plus a defensive 10s wrapper on the TTY fallback. Drops the
loading animation to 30 FPS with a derived phase increment, and adds a
30s hard ceiling so any future !hasData && !isStale path can't pin
the menu bar at full FPS again. See steipete#842.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review follow-ups on the steipete#842 fix:
- terminate-on-timeout log line now names the failing RPC method so prod
  diagnostics show whether initialize, account/read, or rateLimits hung
- SendableJSONMessage doc no longer overclaims thread safety of
  Foundation JSON values; reframes safety around the single-message
  produce/wrap/unwrap usage pattern
- RPCWireError exposed as internal so @testable tests can pattern-match
  on .timeout instead of swallowing every error
- Test D now asserts RPCWireError.timeout specifically, so a future
  bug that throws an unrelated error type fails loudly instead of
  silently passing the bound-duration check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a62d8def47


Comment on lines +822 to +824
group.addTask {
    try await Task.sleep(for: .seconds(budget))
    throw RPCWireError.timeout(method: "tty/status")

P2: Make TTY timeout actually preempt the fallback task

runWithTTYTimeout races body() against Task.sleep, but when the sleep task throws it only calls group.cancelAll() and does not stop the underlying TTY operation. In Swift task groups, the scope still waits for the other child to finish, so if body() is stuck in a non-cooperative call (for example CodexStatusProbe.fetch() reaching synchronous PTY work), this method can still block well past ttyTimeoutSeconds, defeating the new timeout guarantee.
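The failure mode can be reduced to a few lines. Timings are parameterized for illustration; the function name and TTYTimeoutError are hypothetical, and Thread.sleep stands in for the blocking PTY work inside CodexStatusProbe.fetch().

```swift
import Foundation

struct TTYTimeoutError: Error {}

// Swift cancellation is cooperative: group.cancelAll() only *marks* children
// cancelled. A child stuck in synchronous work never checks Task.isCancelled,
// and the task-group scope still waits for it before returning.
func raceThatCannotPreempt(blockingSeconds: Double, timeoutSeconds: Double) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask {
            // Non-cooperative: ignores cancellation entirely.
            Thread.sleep(forTimeInterval: blockingSeconds)
        }
        group.addTask {
            try await Task.sleep(for: .seconds(timeoutSeconds))
            throw TTYTimeoutError()
        }
        // next() rethrows the timeout error quickly...
        _ = try await group.next()
        group.cancelAll()
        // ...but leaving the scope still awaits the blocked child, so the
        // caller does not regain control until blockingSeconds have elapsed.
    }
}
```

With blockingSeconds well above timeoutSeconds, the caller observes an elapsed time close to blockingSeconds, not timeoutSeconds, which is the gap the review flags.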


@steipete
Owner

steipete commented May 5, 2026

Thanks for the report and patch, Harry. The root bug was real, but this PR conflicted with the newer shared Codex CLI snapshot path on main, so I landed an adapted fix in 021f739 instead.

What landed:

  • Codex RPC reads now have bounded per-request timeouts and terminate a hung codex app-server process on timeout: Sources/CodexBarCore/UsageFetcher.swift.
  • The loading animation now runs at 30 FPS and has a 30s continuous ceiling: Sources/CodexBar/StatusItemController+Animation.swift.
  • Regression tests cover hung account/rateLimits/read calls and the “provider errored, do not animate forever” path.

Verification on main: full local swift test, pnpm check, packaged app relaunch via ./Scripts/compile_and_run.sh, live bundled CodexBarCLI usage --provider codex --source cli, and GitHub CI 25360102746 are green.

I preserved contributor credit via Co-authored-by on the landed commit and thanked you in the changelog.

@steipete steipete closed this May 5, 2026


Development

Successfully merging this pull request may close these issues.

Hung Codex JSON-RPC account/rateLimits/read leaves StatusItem animating at 60 FPS (battery drain path)
