
fix: prevent WS-broadcast OOM crash under connection churn #1

Open
skialpine wants to merge 1 commit into feat/scale-telemetry from fix/ws-oom-broadcast


@skialpine
Owner

Stacked on feat/scale-telemetry (PR decentespresso#57). Base will be retargeted to main once decentespresso#57 merges. The telemetry in decentespresso#57 (reset_reason, heap) is what made this diagnosable.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds with -fno-exceptions, so the throw goes to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure, and it is not thermal (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)
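
To illustrate the failure mode in the stack above, here is a minimal host-side sketch of the two ways `operator new` can report exhaustion. `Frame`, `tryAllocFrame` and `tryAllocBytes` are hypothetical names for illustration, not library types. This is not the fix taken in this PR (the allocation lives inside the library); it only shows why a throwing `new` is fatal when exceptions are disabled.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for a queued WebSocket frame; not the library's type.
struct Frame {
    char payload[64];
};

// Non-throwing allocation: returns nullptr when the heap cannot satisfy the
// request. A plain `new Frame` would instead throw std::bad_alloc, which a
// -fno-exceptions build cannot catch, so the program ends in std::terminate().
Frame* tryAllocFrame() {
    return new (std::nothrow) Frame;
}

// Requests `bytes` from the heap without throwing; nullptr signals exhaustion.
void* tryAllocBytes(std::size_t bytes) {
    return ::operator new(bytes, std::nothrow);
}
```

A caller that checks for `nullptr` can drop the frame and carry on, whereas the throwing form leaves no recovery point in this build configuration.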

The existing 15 KB heap watchdog can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation path hits std::bad_alloc long before the watchdog acts.
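
The timing mismatch can be put in numbers. The sketch below uses only the figures quoted above (10 Hz broadcast, 2 s debounce, up to 60 s deferral); the client count is a hypothetical input, and `allocationsDuring` is an illustrative helper, not code from the firmware.

```cpp
#include <cstdint>

// Figures taken from the description above.
constexpr std::uint32_t kBroadcastHz = 10;         // weight broadcast rate
constexpr std::uint32_t kWatchdogDebounceSec = 2;  // heap watchdog debounce
constexpr std::uint32_t kBleDeferMaxSec = 60;      // max reboot deferral while BLE is connected

// Per-client message allocations attempted while broadcasting for `seconds`
// to `clients` connected WebSocket clients (client count is hypothetical).
constexpr std::uint32_t allocationsDuring(std::uint32_t seconds, std::uint32_t clients) {
    return kBroadcastHz * seconds * clients;
}

// With a single client, the debounce window alone spans 20 allocation attempts.
static_assert(allocationsDuring(kWatchdogDebounceSec, 1) == 20, "10 Hz * 2 s * 1 client");
```

Any one of those attempts can fail once the heap is exhausted, which is why a check at the point of allocation is needed rather than a periodic watchdog.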

Fix

  • wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
  • -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
  • CLAUDE.md: documented the footgun (notes + troubleshooting table).
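
The heap-floor gate described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the free-heap reading is passed in as a parameter so the sketch runs off-target (on the device it would presumably come from `ESP.getFreeHeap()`), `broadcastIfHeapOk` is a hypothetical wrapper, and the floor is written as 25000 to match the log line in the verification notes.

```cpp
#include <cstdint>

// 25 KB floor from the description, above the 15 KB watchdog threshold.
// Written as 25000 to match the "[ws] low heap 17736 < 25000" log line.
constexpr std::uint32_t WS_BROADCAST_HEAP_FLOOR = 25000;

// Assumed signature: the real helper likely reads the free heap itself.
inline bool wsBroadcastHeapOk(std::uint32_t freeHeapBytes) {
    return freeHeapBytes >= WS_BROADCAST_HEAP_FLOOR;
}

// Hypothetical wrapper: invokes `send` only when the heap is at or above the
// floor, and reports whether the frame was sent.
template <typename SendFn>
bool broadcastIfHeapOk(std::uint32_t freeHeapBytes, SendFn send) {
    if (!wsBroadcastHeapOk(freeHeapBytes)) {
        return false;  // skip this frame: nothing is allocated
    }
    send();
    return true;
}
```

Checking before the broadcast call means a low-heap tick costs one comparison and no allocation, so the next tick can succeed once memory is released.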

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

  • Free heap driven to 6436 bytes (old build crashed at ~4684).
  • Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
  • No abort, no reboot, weight stream uninterrupted (uptime continuous).

🤖 Generated with Claude Code

Root-caused from a captured panic backtrace: under sustained multi-client
WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap
collapses and AsyncWebSocket's printfAll path allocates an
AsyncWebSocketMessage per client -> operator new throws std::bad_alloc ->
(Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot.
That OOM-reboot is the "weight stops being collected under load" failure
(not thermal -- die temp was 33 C). Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)  <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button,
  power-off) with wsBroadcastHeapOk(): skip the frame when free heap is
  below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog)
  instead of allocating into an exhausted heap and crashing. Dropping a
  frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib
  default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Stacked on the scale-telemetry branch (PR decentespresso#57) whose reset_reason / heap
telemetry made this diagnosable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>