Fix battery drain from hung Codex RPC + 60 FPS animation loop #844
hyspacex wants to merge 2 commits into steipete:main
Hung account/rateLimits/read left UsageFetcher awaiting stdout forever, recording no error so isStale stayed false and the 60 FPS DisplayLink ran indefinitely, triggering macOS "Using Significant Energy".

Adds per-method RPC timeouts (8s initialize, 3s request) that race the stdout read and terminate the codex process so the AsyncStream reader exits, plus a defensive 10s wrapper on the TTY fallback.

Drops the loading animation to 30 FPS with a derived phase increment, and adds a 30s hard ceiling so any future !hasData && !isStale path can't pin the menu bar at full FPS again.

See steipete#842.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review follow-ups on the steipete#842 fix:
- terminate-on-timeout log line now names the failing RPC method so prod diagnostics show whether initialize, account/read, or rateLimits hung
- SendableJSONMessage doc no longer overclaims thread safety of Foundation JSON values; reframes safety around the single-message produce/wrap/unwrap usage pattern
- RPCWireError exposed as internal so @testable tests can pattern-match on .timeout instead of swallowing every error
- Test D now asserts RPCWireError.timeout specifically, so a future bug that throws an unrelated error type fails loudly instead of silently passing the bound-duration check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
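The point about asserting RPCWireError.timeout specifically can be illustrated with a minimal sketch. The enum and helper below are hypothetical stand-ins (SketchWireError and isTimeout are not names from the PR); they only show why matching the exact case is stricter than accepting any thrown error.

```swift
// Hypothetical stand-in for the PR's RPCWireError; the cases are assumptions.
enum SketchWireError: Error {
    case timeout(method: String)
    case malformed(String)
}

/// Returns true only when `error` is the timeout case for the expected method.
/// A test that checks only "something was thrown" would also accept `.malformed`.
func isTimeout(_ error: Error, method expected: String) -> Bool {
    if let wire = error as? SketchWireError, case .timeout(let method) = wire {
        return method == expected
    }
    return false
}
```

With a helper like this, a regression that throws an unrelated error within the time bound fails the assertion instead of passing the duration check by accident.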
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a62d8def47
    group.addTask {
        try await Task.sleep(for: .seconds(budget))
        throw RPCWireError.timeout(method: "tty/status")
Make TTY timeout actually preempt the fallback task
runWithTTYTimeout races body() against Task.sleep, but when the sleep task throws it only calls group.cancelAll() and does not stop the underlying TTY operation. In Swift task groups, the scope still waits for the other child to finish, so if body() is stuck in a non-cooperative call (for example CodexStatusProbe.fetch() reaching synchronous PTY work), this method can still block well past ttyTimeoutSeconds, defeating the new timeout guarantee.
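One way to make the caller return at the deadline even when the body ignores cancellation is to step outside structured concurrency. The sketch below is an assumption about how that could look, not the code that landed: runDetachedWithTimeout, OnceResumer, and SketchTimeout are hypothetical names. It runs the body in a detached task and resumes a continuation exactly once, from whichever side finishes first.

```swift
import Foundation

struct SketchTimeout: Error {}

/// Resumes a continuation at most once, whichever side gets there first.
final class OnceResumer<T: Sendable>: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<T, Error>?

    init(_ continuation: CheckedContinuation<T, Error>) {
        self.continuation = continuation
    }

    func resume(with result: Result<T, Error>) {
        lock.lock()
        let pending = continuation
        continuation = nil
        lock.unlock()
        pending?.resume(with: result)
    }
}

/// Unlike a task group, this does not wait for `body` after the deadline:
/// the caller gets `SketchTimeout` even if `body` is stuck in blocking work.
func runDetachedWithTimeout<T: Sendable>(
    seconds: Double,
    body: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<T, Error>) in
        let once = OnceResumer(continuation)
        let work = Task.detached {
            do {
                let value = try await body()
                once.resume(with: .success(value))
            } catch {
                once.resume(with: .failure(error))
            }
        }
        Task.detached {
            try? await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            work.cancel() // Only helps if `body` checks for cancellation.
            once.resume(with: .failure(SketchTimeout()))
        }
    }
}

/// Stand-in for synchronous PTY work that never checks for cancellation.
func blockingWork() -> Int {
    Thread.sleep(forTimeInterval: 1.0)
    return 1
}
```

The trade-off is that the abandoned work keeps running and holds a thread until it finishes on its own, so this bounds the caller's wait but does not stop the underlying operation. Stopping it still requires terminating whatever the body is blocked on.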
Thanks for the report and patch, Harry. The root bug was real, but this PR conflicted with the newer shared Codex CLI snapshot path on main, so I landed an adapted fix in 021f739 instead. What landed:
Verification on main: full local
I preserved contributor credit via
Closes #842.
Summary
Three connected changes that together stop the menu-bar from spending CPU forever when a Codex RPC call hangs:
- account/rateLimits/read (and friends) can't await stdout indefinitely.

Root cause (verified in #842)
When codex app-server accepted initialize but never replied to account/rateLimits/read:
- CodexRPCClient.request (Sources/CodexBarCore/UsageFetcher.swift) awaited for await lineData in self.stdoutLineStream with no per-request deadline.
- loadRPCUsage hung, so withFallback never saw a throw and the TTY secondary never ran.
- UsageStore.isStale(provider:) returns errors[provider] != nil. A hang records no error → isStale = false.
- shouldAnimate returned !hasData && !isStale → true, so DisplayLinkDriver.start(fps: 60) ran main-thread NSImage redraws indefinitely. macOS flagged this as "Using Significant Energy."

The existing fallback-only animation guard (Animation.swift:670: "Animating the fallback causes unnecessary CPU usage (battery drain). See #269, #139.") didn't cover the hung-primary case.

What changed
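The central change, a per-request deadline that races the stdout read, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's code: withDeadline and SketchRPCError are hypothetical names, and the onTimeout closure stands in for the step where the codex process is terminated so the stdout reader unblocks.

```swift
import Foundation

// Hypothetical stand-in for RPCWireError; only the timeout case is sketched.
enum SketchRPCError: Error, Equatable {
    case timeout(method: String)
}

/// Races `operation` against a deadline. On timeout, `onTimeout` runs first
/// (the place to terminate the child process so a blocked reader exits),
/// then the timeout error is thrown to the caller.
func withDeadline<T: Sendable>(
    method: String,
    seconds: Double,
    onTimeout: @escaping @Sendable () -> Void,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            onTimeout()
            throw SketchRPCError.timeout(method: method)
        }
        // Whichever child finishes first decides the outcome; the other is cancelled.
        defer { group.cancelAll() }
        guard let result = try await group.next() else {
            throw CancellationError()
        }
        return result
    }
}
```

Because the group still waits for the losing child before returning, the deadline only holds if the operation responds to cancellation or if onTimeout unblocks it, which is why the timeout path has to terminate the process and not merely cancel the task.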
Sources/CodexBarCore/UsageFetcher.swift
- RPCWireError.timeout(method:) added.
- CodexRPCClient and UsageFetcher take internal-only initializeTimeoutSeconds (default 8s), requestTimeoutSeconds (default 3s), ttyTimeoutSeconds (default 10s). Defaults match production behavior; tests inject shorter values.
- request() races the stdout read against Task.sleep via withThrowingTaskGroup. On timeout: process.terminate() (closes stdout, unblocks the AsyncStream reader, body task unwinds) → throw RPCWireError.timeout → propagates through withFallback to TTY.
- loadTTYUsage / loadTTYCredits get a defensive runWithTTYTimeout wrapper. CodexStatusProbe.fetch() already has its own internal timeout; the outer race is defense-in-depth and bounds injected test fetchers that bypass the probe.
- initialize vs account/rateLimits/read hangs.

Sources/CodexBar/StatusItemController+Animation.swift
- loadingAnimationFPS = 30.0 (was 60): halves main-thread NSImage redraw cost. 12 FPS would be choppy; 30 stays smooth.
- loadingAnimationPhaseIncrement = 2.7 / FPS, derived so visible animation speed is unchanged when FPS is tuned.
- loadingAnimationMaxContinuousDuration: 30.0, a hard ceiling on continuous animation. Even if a future code path produces !hasData && !isStale indefinitely, the driver force-stops after 30s. Defense-in-depth on a bug class that has shipped twice (Significant usage of battery #269, consuming too much power on my macbook #139, Hung Codex JSON-RPC account/rateLimits/read leaves StatusItem animating at 60 FPS (battery drain path) #842).
- stopLoadingAnimation() so the ceiling and the natural-stop path use the same teardown.

Sources/CodexBar/StatusItemController.swift
- var animationStartedAt: Date? to track ceiling.

Tests/CodexBarTests/CodexUsageFetcherFallbackTests.swift
4 new regression tests, extended in the existing file (no parallel suite):
- RPCWireError.timeout not just any throw
- shouldAnimate returns false and animationDriver stays nil

The first stub (makeHungRateLimitsStubCodexCLI) mirrors the production failure shape: responds to initialize, hangs forever on account/rateLimits/read, answers account/read and TTY normally.

Decisions
- initialize (covers child-process spawn + cold init), 3s for everything else (read calls are sub-second on healthy networks). Single-budget would either be too tight for initialize or too loose for the actual bug path.

Verification
- swift build: clean
- swift test --filter CodexUsageFetcherFallbackTests: 10/10 (6 existing + 4 new)
- swift test --filter BatteryDrainDiagnosticTests: 3/3

The macOS "Using Significant Energy" badge is a heuristic over time and not deterministically reproducible on demand. What's verified here is the underlying mechanism: the hung-RPC chain that produced sustained 60 FPS main-thread redraws.
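For the animation side of the change, the derived phase increment and the 30s ceiling can be sketched as below. The constants come from the description above; the LoadingAnimationPolicy type, its member names, and the reading of 2.7 as phase advance per second are assumptions made for illustration.

```swift
import Foundation

// Hypothetical container; the PR keeps these values on StatusItemController.
struct LoadingAnimationPolicy {
    static let fps: Double = 30.0
    // Assumed interpretation: 2.7 is the phase advance per second, so the
    // per-frame step shrinks or grows with the frame rate.
    static let phasePerSecond: Double = 2.7
    static let maxContinuousDuration: TimeInterval = 30.0

    /// Per-frame phase step. At 60 FPS this would be 0.045; at 30 FPS it is 0.09,
    /// so the visible speed stays the same.
    static var phaseIncrement: Double { phasePerSecond / fps }

    /// Animate only while loading (no data, not stale) and under the ceiling.
    static func shouldKeepAnimating(
        hasData: Bool,
        isStale: Bool,
        startedAt: Date?,
        now: Date
    ) -> Bool {
        guard !hasData && !isStale else { return false }
        guard let startedAt = startedAt else { return true }
        return now.timeIntervalSince(startedAt) < maxContinuousDuration
    }
}
```

Keeping the ceiling check separate from the hasData and isStale conditions means it still stops the driver if some future path leaves both flags false indefinitely.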
Out of scope
🤖 Generated with Claude Code