
chore(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race #119

Merged
StefanSteiner merged 2 commits into tableau:main from StefanSteiner:fix/daemon-cold-start-dedup on Jun 8, 2026

@StefanSteiner

Summary

Two reliability fixes for the shared-daemon MCP path, both surfaced by real
usage after the resident-by-default daemon shipped in v0.5.0:

  1. TCP keepalive on hyperd connections — so a half-open idle connection
    is detected in ~90s instead of blocking for the 2-hour OS default.
  2. Stop redundant off-base daemons after a concurrent cold-start — so two
    MCP clients racing to start don't leave a duplicate daemon+hyperd pair
    burning idle CPU on an adjacent port.

Targeting a 0.5.1 patch release.

Fix 1 — TCP keepalive on hyperd connections

Problem

A user reported the status tool "hanging." status is trivial, so this was
surprising. Investigation (sockets, netstat, process state) found:

  • The daemon and hyperd were healthy — fresh connects were ~1.6ms.
  • The MCP↔hyperd socket was ESTABLISHED with empty send/recv queues — not
    dropped — but had been idle for hours.
  • TCP keepalive was off (macOS/Linux default idle = 7200s / 2h), and a
    network/wake event occurred during the idle gap.

Root cause: the hyperd connection set TCP_NODELAY and socket buffer sizes
but no keepalive. A long-lived idle connection that goes half-open — laptop
sleep, a network blip, or a hyperd that vanished without a FIN — has nothing
to detect the dead peer, so the next blocking read() hangs until the OS idle
timeout (up to 2h). This became materially more likely in v0.5.0, where the
daemon became resident-by-default and connections now live indefinitely across
suspends instead of being reaped by the old 30-minute idle shutdown.

(The hang is amplified by the MCP serializing all tool calls behind a single
engine mutex with no per-op timeout — tracked separately in #118 as an
architectural follow-up. This PR fixes the immediate trigger.)
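
To make the amplifier concrete, here is a minimal sketch of how one call that never returns starves every other call behind a single mutex. `Engine`, `run_tool_call`, `engine_is_available`, and `status_blocked_while_engine_held` are hypothetical names for illustration, not the MCP's actual types; the held guard stands in for a read stalled on a half-open socket.

```rust
use std::sync::{Mutex, TryLockError};
use std::thread;

/// Hypothetical stand-in for the engine state guarded by one global mutex.
pub struct Engine {
    pub calls_served: u64,
}

/// Every tool call, however trivial, takes the same lock first.
/// With no per-op timeout, `lock()` waits for as long as the holder keeps it.
pub fn run_tool_call(engine: &Mutex<Engine>) -> u64 {
    let mut guard = engine.lock().unwrap();
    guard.calls_served += 1;
    guard.calls_served
}

/// Non-blocking probe: could a tool call acquire the engine right now?
pub fn engine_is_available(engine: &Mutex<Engine>) -> bool {
    match engine.try_lock() {
        Ok(_guard) => true,
        Err(TryLockError::WouldBlock) => false,
        // A previous holder panicked; treat the engine as unavailable.
        Err(TryLockError::Poisoned(_)) => false,
    }
}

/// Holds the engine lock (as a stalled call would) and checks from a second
/// thread whether another call could get in. Returns true if it is locked out.
pub fn status_blocked_while_engine_held() -> bool {
    let engine = Mutex::new(Engine { calls_served: 0 });
    let _stalled_call = engine.lock().unwrap(); // never released in this scope
    thread::scope(|s| s.spawn(|| !engine_is_available(&engine)).join().unwrap())
}
```

The sketch uses `try_lock` only so the demonstration itself cannot hang; a caller using `lock()`, as in `run_tool_call`, would simply wait.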

Fix

Enable TCP keepalive at both the sync and async TCP connect sites in
hyperdb-api-core, alongside the existing nodelay/buffer-size options:
60s idle / 10s interval / 3 probes → dead peer detected in ~90s, after
which the existing ConnectionLost → drop-and-reconnect path takes over.

  • Best-effort (.ok()): errors from setting the keepalive options are
    discarded, so a kernel that rejects a knob leaves the connection at OS
    defaults instead of failing the connect.
  • Probe count is Linux-only (with_retries); macOS/Windows honor idle+interval.
  • Deliberately not a query timeout. HyperDB runs legitimately long
    analytics queries; keepalive only probes during true silence, so it detects a
    dead peer without ever aborting live work.
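
For reference, the ~90s figure follows directly from the three knobs: 60s of idle, then 3 unanswered probes spaced 10s apart. A minimal sketch of that arithmetic follows; `KeepaliveSettings` and its methods are hypothetical names for illustration, not types from hyperdb-api-core, and nothing here touches a real socket.

```rust
use std::time::Duration;

/// The three keepalive knobs described above (hypothetical container type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KeepaliveSettings {
    /// Silence on the connection before the first probe is sent.
    pub idle: Duration,
    /// Spacing between consecutive unanswered probes.
    pub interval: Duration,
    /// Unanswered probes before the kernel declares the peer dead.
    pub probes: u32,
}

impl KeepaliveSettings {
    /// Values quoted in this PR: 60s idle / 10s interval / 3 probes.
    pub const fn pr_values() -> Self {
        Self {
            idle: Duration::from_secs(60),
            interval: Duration::from_secs(10),
            probes: 3,
        }
    }

    /// Approximate time from the last traffic to dead-peer detection:
    /// the idle period plus one interval per unanswered probe.
    pub fn detection_time(&self) -> Duration {
        self.idle + self.interval * self.probes
    }
}
```

Because probes only start after `idle` seconds of silence, a connection that is actively streaming query results never reaches the probe phase, which is why this is not a query timeout.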

Fix 2 — Stop redundant off-base daemon after concurrent cold-start

Problem

When two MCP clients cold-start simultaneously, both scan, both find the base
port free, and both call spawn_detached(base). One daemon wins the bind and
stays on base; the other fails to bind and exits. A third client (or the
client whose daemon lost the race) then scans again, finds base occupied,
lands on base+1, and overwrites the base-port daemon's daemon.json. The result
is two live daemons on adjacent ports, each with its own hyperd, doubling idle
CPU. Observed in the wild: daemons on 7485 and 7486, each at ~0.9% CPU.

Fix

After wait_for_daemon() returns a daemon on a non-base port, re-scan the
ports below it. If a lower-port daemon is live, send STOP (best-effort) to
the off-base daemon that was just returned and adopt the lower-port one
instead: it bound the socket first and is the canonical single instance.
Adds a unit test asserting that scan_for_daemon prefers the lowest-port daemon.
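
The decision described above can be sketched as a pure function over a liveness probe. This is an illustration under assumptions, not the code in hyperdb-mcp: `Dedup`, `lowest_live_below`, and `dedup_after_start` are hypothetical names, and `is_live` stands in for whatever check the real scan performs.

```rust
/// Outcome of the post-startup check (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Dedup {
    /// No live daemon below the returned port; keep using it.
    Keep(u16),
    /// A lower-port daemon is live: adopt it and STOP the redundant one.
    Adopt { adopt: u16, stop: u16 },
}

/// Scans ports in `base..found`, lowest first, and returns the first live one.
pub fn lowest_live_below(base: u16, found: u16, is_live: impl Fn(u16) -> bool) -> Option<u16> {
    (base..found).find(|&port| is_live(port))
}

/// `found` is the port of the daemon that startup returned.
/// If it is off-base and a lower-port daemon is live, prefer the lower one.
pub fn dedup_after_start(base: u16, found: u16, is_live: impl Fn(u16) -> bool) -> Dedup {
    match lowest_live_below(base, found, is_live) {
        Some(lower) => Dedup::Adopt { adopt: lower, stop: found },
        None => Dedup::Keep(found),
    }
}
```

With the ports from the observation above, and assuming 7485 is the base, a client handed the daemon on 7486 while 7485 is live would get `Adopt { adopt: 7485, stop: 7486 }`. When `found` equals `base` the range is empty and the daemon is kept without any probing.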

Testing

  • cargo clippy --all-targets -- -D warnings clean; cargo fmt --all --check
    clean (affected crates).
  • hyperdb-api-core unit tests (169) + hyperdb-api roundtrip/integration
    tests pass against real hyperd — the modified connect path is exercised.
  • hyperdb-mcp daemon tests pass (48), including the new lower-port-preference
    test. Socket-binding tests were run with the local sandbox disabled.

Notes / honest caveats

  • The hang could not be reproduced on demand (the environment was healthy by
    the time of investigation), so the keepalive fix has not been verified
    against a live repro. The missing-keepalive + half-open socket mechanism is
    nonetheless a confirmed structural gap that matches every symptom, and
    keepalive is the correct defense regardless.
  • The deeper amplifier (a single global engine mutex with no per-op timeout)
    is deliberately not addressed here; it is an architectural change tracked
    in #118 ("MCP: single engine mutex + no op timeout lets one stalled hyperd
    call hang all tool calls").

Changelog

hyperdb-mcp/CHANGELOG.md [Unreleased] documents the keepalive fix. (The
cold-start dedup landed in an earlier commit on this branch.)

When two MCP clients cold-start simultaneously, both scan and find the
base port free, both call spawn_detached(base). One daemon wins the
bind (stays on base); the other fails bind and exits. A third (or the
same race victim) scans again, finds base now occupied, lands on
base+1 — and writes daemon.json overwriting the base-port daemon's
file, producing two live daemons on adjacent ports and doubling the
idle hyperd CPU overhead.

Fix: after wait_for_daemon() returns a daemon on a non-base port,
re-scan the ports below it. If a lower-port daemon is live, STOP the
off-base daemon (best-effort) and adopt the base-port one instead.
The lower-port daemon wins because it bound the socket first and is
the canonical single instance.

A long-lived idle connection to hyperd that goes half-open (laptop
sleep, network blip, or a hyperd that vanished without a FIN) had no
way to be detected: the next blocking read would hang until the OS
default keepalive idle timeout (7200s / 2h on macOS and Linux). Because
the MCP serializes all tool calls behind a single engine mutex with no
per-op timeout, one such stalled connection makes EVERY tool call —
including `status` — appear to hang indefinitely.

This became materially more likely in 0.5.0, where the daemon went
resident-by-default: connections now live indefinitely across laptop
suspends instead of being torn down by the old 30-minute idle shutdown.

Fix: set TCP keepalive (60s idle / 10s interval / 3 probes -> dead peer
in ~90s) at both the sync and async TCP connect sites, alongside the
existing nodelay/buffer-size socket options. Best-effort (.ok()): a
kernel that rejects a knob leaves the connection at OS defaults. Probe
count is Linux-only (macOS honors idle+interval). Deliberately NOT a
query timeout — HyperDB runs legitimately long analytics queries, and
keepalive only probes during true silence, so it can't abort live work.

Follow-up (not in this commit): the single global engine mutex means a
slow/stalled op blocks unrelated tool calls. That is an architectural
change warranting its own design pass.
StefanSteiner changed the title from "fix(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race" to "chore(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race" on Jun 8, 2026
StefanSteiner merged commit 480a579 into tableau:main on Jun 8, 2026
11 checks passed