chore(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race #119
Merged
StefanSteiner merged 2 commits intoJun 8, 2026
When two MCP clients cold-start simultaneously, both scan and find the base port free, and both call spawn_detached(base). One daemon wins the bind (stays on base); the other fails bind and exits. A third client (or the same race victim) scans again, finds base now occupied, lands on base+1, and writes daemon.json, overwriting the base-port daemon's file. The result is two live daemons on adjacent ports and double the idle hyperd CPU overhead.

Fix: after wait_for_daemon() returns a daemon on a non-base port, re-scan the ports below it. If a lower-port daemon is live, STOP the off-base daemon (best-effort) and adopt the base-port one instead. The lower-port daemon wins because it bound the socket first and is the canonical single instance.
A long-lived idle connection to hyperd that goes half-open (laptop sleep, network blip, or a hyperd that vanished without a FIN) had no way to be detected: the next blocking read would hang until the OS default keepalive idle timeout (7200s / 2h on macOS and Linux). Because the MCP serializes all tool calls behind a single engine mutex with no per-op timeout, one such stalled connection makes EVERY tool call, including `status`, appear to hang indefinitely. This became materially more likely in 0.5.0, where the daemon went resident-by-default: connections now live indefinitely across laptop suspends instead of being torn down by the old 30-minute idle shutdown.

Fix: set TCP keepalive (60s idle / 10s interval / 3 probes -> dead peer in ~90s) at both the sync and async TCP connect sites, alongside the existing nodelay/buffer-size socket options. Best-effort (.ok()): a kernel that rejects a knob leaves the connection at OS defaults. Probe count is Linux-only (macOS honors idle+interval). Deliberately NOT a query timeout: HyperDB runs legitimately long analytics queries, and keepalive only probes during true silence, so it can't abort live work.

Follow-up (not in this commit): the single global engine mutex means a slow/stalled op blocks unrelated tool calls. That is an architectural change warranting its own design pass.
Summary
Two reliability fixes for the shared-daemon MCP path, both surfaced by real
usage after the resident-by-default daemon shipped in v0.5.0:
- TCP keepalive on `hyperd` connections, so a half-open idle connection is
  detected in ~90s instead of blocking for the 2-hour OS default.
- Dedup of redundant daemons after a cold-start race, so MCP clients racing to
  start don't leave a duplicate daemon+`hyperd` pair burning idle CPU on an
  adjacent port.
Targeting a 0.5.1 patch release.
Fix 1 — TCP keepalive on hyperd connections
Problem
A user reported the `status` tool "hanging." `status` is trivial, so this was
surprising. Investigation (sockets, netstat, process state) found:
- […] `hyperd` were healthy; fresh connects were ~1.6ms.
- […] `ESTABLISHED` with empty send/recv queues (not dropped) but had been
  idle for hours.
- […] network/wake event occurred during the idle gap.
Root cause: the `hyperd` connection set `TCP_NODELAY` and socket buffer sizes
but no keepalive. A long-lived idle connection that goes half-open (laptop
sleep, a network blip, or a `hyperd` that vanished without a FIN) has nothing
to detect the dead peer, so the next blocking `read()` hangs until the OS idle
timeout (up to 2h). This became materially more likely in v0.5.0, where the
daemon became resident-by-default and connections now live indefinitely across
suspends instead of being reaped by the old 30-minute idle shutdown.
(The hang is amplified by the MCP serializing all tool calls behind a single
engine mutex with no per-op timeout — tracked separately in #118 as an
architectural follow-up. This PR fixes the immediate trigger.)
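The amplification described above can be illustrated with a small, self-contained sketch. It is not the MCP's actual code: the `engine` mutex, the worker thread, and the channel standing in for a stalled `read()` are all hypothetical. It only shows why a trivial call that needs the same lock cannot proceed while another call is stuck holding it.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Returns (status_blocked_while_stalled, status_ok_after_release).
/// All names here are illustrative stand-ins, not the real MCP types.
fn simulate_stall() -> (bool, bool) {
    // Stand-in for the single global engine mutex.
    let engine = Arc::new(Mutex::new(()));
    let (locked_tx, locked_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let worker_engine = Arc::clone(&engine);
    let worker = thread::spawn(move || {
        // A tool call takes the engine lock...
        let _guard = worker_engine.lock().unwrap();
        locked_tx.send(()).unwrap();
        // ...and then blocks, like a read() on a half-open socket.
        release_rx.recv().unwrap();
    });

    // Wait until the worker definitely holds the lock.
    locked_rx.recv().unwrap();
    // A trivial `status` call needs the same mutex and cannot get it.
    let blocked = engine.try_lock().is_err();

    // Once the stalled call ends (e.g. keepalive reports a dead peer),
    // the lock is released and `status` can run again.
    release_tx.send(()).unwrap();
    worker.join().unwrap();
    let ok_after = engine.try_lock().is_ok();

    (blocked, ok_after)
}

fn main() {
    let (blocked, ok_after) = simulate_stall();
    println!("status blocked while stalled: {blocked}; ok after release: {ok_after}");
}
```

In this sketch the stall ends only because the test releases it explicitly; in the scenario above, keepalive plays that role by turning a silent half-open socket into an error.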
Fix
Enable TCP keepalive at both the sync and async TCP connect sites in
`hyperdb-api-core`, alongside the existing nodelay/buffer-size options:
- 60s idle / 10s interval / 3 probes → dead peer detected in ~90s, after
  which the existing `ConnectionLost` → drop-and-reconnect path takes over.
- Best-effort (`.ok()`): a kernel that rejects a knob leaves the connection at
  OS defaults.
- Probe count is Linux-only (`with_retries`); macOS/Windows honor idle+interval.
- Deliberately not a query timeout: HyperDB runs legitimately long
  analytics queries; keepalive only probes during true silence, so it detects a
  dead peer without ever aborting live work.
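The ~90s figure follows directly from the three knobs: the first probe goes out after the idle period, and the peer is declared dead once the configured number of probes, spaced one interval apart, have gone unanswered. A minimal sketch of that arithmetic, where `KeepaliveParams` and `dead_peer_after` are hypothetical names and not part of `hyperdb-api-core`:

```rust
use std::time::Duration;

/// Hypothetical holder for the three keepalive knobs described above.
#[derive(Debug, Clone, Copy)]
struct KeepaliveParams {
    /// Silence on the connection before the first probe is sent.
    idle: Duration,
    /// Gap between successive unanswered probes.
    interval: Duration,
    /// Unanswered probes tolerated before the peer is declared dead.
    probes: u32,
}

impl KeepaliveParams {
    /// Approximate time from the last traffic until the kernel gives up:
    /// idle + interval * probes.
    fn dead_peer_after(&self) -> Duration {
        self.idle + self.interval * self.probes
    }
}

fn main() {
    let params = KeepaliveParams {
        idle: Duration::from_secs(60),
        interval: Duration::from_secs(10),
        probes: 3,
    };
    // 60s + 3 * 10s
    println!("dead peer detected after ~{}s", params.dead_peer_after().as_secs());
}
```

On platforms where the probe count is left at the OS default, the total depends on that default, so the 90s bound should be read as applying to the case where all three values take effect.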
Fix 2 — Stop redundant off-base daemon after concurrent cold-start
Problem
When two MCP clients cold-start simultaneously, both scan, both find the base
port free, and both `spawn_detached(base)`. One daemon wins the bind; the other
fails and exits. But a third client (or the race victim) then scans again,
finds base occupied, lands on `base+1`, and overwrites `daemon.json`, leaving
two live daemons on adjacent ports, each with its own `hyperd`, doubling
idle CPU. Observed in the wild: daemons on 7485 and 7486 each ~0.9% CPU.
Fix
After `wait_for_daemon()` returns a daemon on a non-base port, re-scan the
ports below it. If a lower-port daemon is live, send the off-base daemon `STOP`
(best-effort) and adopt the lower-port one: it bound the socket first and is
the canonical single instance. Adds a unit test asserting `scan_for_daemon`
prefers the lowest-port daemon.
Testing
- `cargo clippy --all-targets -- -D warnings` clean; `cargo fmt --all --check`
  clean (affected crates).
- `hyperdb-api-core` unit tests (169) + `hyperdb-api` roundtrip/integration
  tests pass against real `hyperd`; the modified connect path is exercised.
- `hyperdb-mcp` daemon tests pass (48), including the new lower-port-preference
  test. Socket-binding tests were run with the local sandbox disabled.
Notes / honest caveats
- […] healthy by the time of investigation), but the missing-keepalive + half-open
  socket mechanism is a confirmed structural gap that matches every symptom, and
  keepalive is the correct defense regardless.
- The engine-mutex amplification is not addressed here by design; it's an
  architectural change tracked in #118 (MCP: single engine mutex + no op timeout
  lets one stalled hyperd call hang all tool calls).
Changelog
`hyperdb-mcp/CHANGELOG.md` `[Unreleased]` documents the keepalive fix. (The
cold-start dedup landed in an earlier commit on this branch.)