chore(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race by StefanSteiner · Pull Request #119 · tableau/hyper-api-rust

StefanSteiner · 2026-06-08T01:16:41Z

Summary

Two reliability fixes for the shared-daemon MCP path, both surfaced by real
usage after the resident-by-default daemon shipped in v0.5.0:

1. TCP keepalive on hyperd connections: a half-open idle connection is
   detected in ~90s instead of blocking for the 2-hour OS default.
2. Stop redundant off-base daemons after a concurrent cold-start: two MCP
   clients racing to start no longer leave a duplicate daemon+hyperd pair
   burning idle CPU on an adjacent port.

Targeting a 0.5.1 patch release.

Fix 1 — TCP keepalive on hyperd connections

Problem

A user reported the status tool "hanging." status is trivial, so this was
surprising. Investigation (sockets, netstat, process state) found:

- The daemon and hyperd were healthy: fresh connects were ~1.6ms.
- The MCP↔hyperd socket was ESTABLISHED with empty send/recv queues (not
  dropped) but had been idle for hours.
- TCP keepalive was off (macOS/Linux default idle = 7200s / 2h), and a
  network/wake event occurred during the idle gap.

Root cause: the hyperd connection set TCP_NODELAY and socket buffer sizes
but no keepalive. A long-lived idle connection that goes half-open — laptop
sleep, a network blip, or a hyperd that vanished without a FIN — has nothing
to detect the dead peer, so the next blocking read() hangs until the OS idle
timeout (up to 2h). This became materially more likely in v0.5.0, where the
daemon became resident-by-default and connections now live indefinitely across
suspends instead of being reaped by the old 30-minute idle shutdown.

(The hang is amplified by the MCP serializing all tool calls behind a single
engine mutex with no per-op timeout — tracked separately in #118 as an
architectural follow-up. This PR fixes the immediate trigger.)

Fix

Enable TCP keepalive at both the sync and async TCP connect sites in
hyperdb-api-core, alongside the existing nodelay/buffer-size options:
60s idle / 10s interval / 3 probes → dead peer detected in ~90s, after
which the existing ConnectionLost → drop-and-reconnect path takes over.

- Best-effort (.ok()): a kernel that rejects a knob leaves the connection at
  OS defaults.
- Probe count is Linux-only (with_retries); macOS/Windows honor idle+interval.
- Deliberately not a query timeout. HyperDB runs legitimately long
  analytics queries; keepalive only probes during true silence, so it detects a
  dead peer without ever aborting live work.
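The ~90s figure follows from the three knobs: the first probe goes out after the idle period, and the peer is declared dead once the configured number of probes has gone unanswered. A minimal sketch of that arithmetic, using only the standard library; `KeepaliveParams` and `detection_time` are hypothetical names for illustration, not types from hyperdb-api-core:

```rust
use std::time::Duration;

/// Hypothetical holder for the three keepalive knobs named in this PR.
#[derive(Debug, Clone, Copy)]
struct KeepaliveParams {
    idle: Duration,     // silence on the connection before the first probe
    interval: Duration, // gap between consecutive unanswered probes
    probes: u32,        // unanswered probes before the peer is declared dead
}

impl KeepaliveParams {
    /// Approximate time from the last traffic to a dead-peer error:
    /// the idle period plus one interval per unanswered probe.
    fn detection_time(&self) -> Duration {
        self.idle + self.interval * self.probes
    }
}

fn main() {
    // Values taken from the PR description: 60s idle / 10s interval / 3 probes.
    let tuned = KeepaliveParams {
        idle: Duration::from_secs(60),
        interval: Duration::from_secs(10),
        probes: 3,
    };
    println!("{:?}", tuned.detection_time()); // prints "90s"
}
```

On platforms where the probe count cannot be set, the last term depends on the OS default count, so the total there may differ from 90s.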

Fix 2 — Stop redundant off-base daemon after concurrent cold-start

Problem

When two MCP clients cold-start simultaneously, both scan, both find the base
port free, and both spawn_detached(base). One daemon wins the bind; the other
fails and exits. But a third client (or the race victim) then scans again,
finds base occupied, lands on base+1, and overwrites daemon.json — leaving
two live daemons on adjacent ports, each with its own hyperd, doubling
idle CPU. Observed in the wild: daemons on 7485 and 7486 each ~0.9% CPU.
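The interleaving above can be replayed with a toy model in which a set of bound ports stands in for the OS. This is only an illustration of the race under that assumption; `Ports`, `first_free`, `try_bind`, and `simulate_race` are hypothetical names and do not correspond to functions in hyperdb-mcp:

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the OS port table: ports that currently have a daemon.
struct Ports {
    bound: BTreeSet<u16>,
}

impl Ports {
    /// First port in `base..base + span` with no daemon bound to it.
    fn first_free(&self, base: u16, span: u16) -> Option<u16> {
        (base..base + span).find(|p| !self.bound.contains(p))
    }

    /// Returns true if the caller won the bind, false if the port was taken.
    fn try_bind(&mut self, port: u16) -> bool {
        self.bound.insert(port)
    }
}

/// Replays the cold-start race and returns the ports left with a live daemon.
fn simulate_race(base: u16) -> Vec<u16> {
    let mut ports = Ports { bound: BTreeSet::new() };

    // Both clients scan before either daemon has bound, so both pick `base`.
    let a = ports.first_free(base, 8).unwrap();
    let b = ports.first_free(base, 8).unwrap();
    assert_eq!(a, b);

    assert!(ports.try_bind(a)); // first daemon wins the bind
    assert!(!ports.try_bind(b)); // second daemon fails and exits

    // The race victim scans again: base is occupied, so it lands on base+1.
    let retry = ports.first_free(base, 8).unwrap();
    assert!(ports.try_bind(retry));

    ports.bound.iter().copied().collect()
}

fn main() {
    println!("{:?}", simulate_race(7485)); // prints "[7485, 7486]"
}
```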

Fix

After wait_for_daemon() returns a daemon on a non-base port, re-scan the
ports below it. If a lower-port daemon is live, stop the off-base daemon
(best-effort STOP) and adopt the lower-port one: it bound its socket first and
is the canonical single instance. Adds a unit test asserting that
scan_for_daemon prefers the lowest-port daemon.
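The adopt-or-stop decision can be sketched as a pure function over a liveness probe. This is a simplified model under stated assumptions, not the code in this PR: `Decision`, `lowest_live`, and `dedup` are hypothetical names, and the `is_live` closure stands in for whatever probe the real scan performs:

```rust
/// Which daemon the client should use, and which one (if any) to stop.
#[derive(Debug, PartialEq)]
struct Decision {
    adopt: u16,
    stop: Option<u16>,
}

/// Lowest port in `base..below` whose daemon answers the liveness probe.
fn lowest_live(base: u16, below: u16, is_live: impl Fn(u16) -> bool) -> Option<u16> {
    (base..below).find(|&p| is_live(p))
}

/// Given the port the startup wait returned, prefer any live lower-port daemon.
fn dedup(base: u16, found: u16, is_live: impl Fn(u16) -> bool) -> Decision {
    if found > base {
        if let Some(lower) = lowest_live(base, found, &is_live) {
            // A lower-port daemon bound first: adopt it, stop the off-base one.
            return Decision { adopt: lower, stop: Some(found) };
        }
    }
    // Already on base, or nothing live below: keep what was found.
    Decision { adopt: found, stop: None }
}

fn main() {
    // Ports from the PR description: daemons observed on 7485 and 7486.
    let live = [7485u16, 7486];
    let decision = dedup(7485, 7486, |p| live.contains(&p));
    println!("{:?}", decision); // prints "Decision { adopt: 7485, stop: Some(7486) }"
}
```

Keeping the probe as a closure lets the preference rule be unit-tested without binding real sockets, which matters where tests run inside a sandbox.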

Testing

- cargo clippy --all-targets -- -D warnings clean; cargo fmt --all --check
  clean (affected crates).
- hyperdb-api-core unit tests (169) + hyperdb-api roundtrip/integration
  tests pass against real hyperd; the modified connect path is exercised.
- hyperdb-mcp daemon tests pass (48), including the new lower-port-preference
  test. Socket-binding tests were run with the local sandbox disabled.

Notes / honest caveats

- The hang could not be reproduced on demand (the environment was healthy by
  the time of investigation), but the missing-keepalive + half-open socket
  mechanism is a confirmed structural gap that matches every symptom, and
  keepalive is the correct defense regardless.
- The deeper amplifier (single global engine mutex + no op timeout) is not
  addressed here by design; it's an architectural change tracked in #118
  ("MCP: single engine mutex + no op timeout lets one stalled hyperd call hang
  all tool calls").

Changelog

hyperdb-mcp/CHANGELOG.md [Unreleased] documents the keepalive fix. (The
cold-start dedup landed in an earlier commit on this branch.)

When two MCP clients cold-start simultaneously, both scan and find the base port free, and both call spawn_detached(base). One daemon wins the bind (stays on base); the other fails to bind and exits. A third client (or the same race victim) scans again, finds base now occupied, lands on base+1, and writes daemon.json, overwriting the base-port daemon's file. The result is two live daemons on adjacent ports and double the idle hyperd CPU overhead.

Fix: after wait_for_daemon() returns a daemon on a non-base port, re-scan the ports below it. If a lower-port daemon is live, STOP the off-base daemon (best-effort) and adopt the lower-port one instead. The lower-port daemon wins because it bound the socket first and is the canonical single instance.

A long-lived idle connection to hyperd that goes half-open (laptop sleep, network blip, or a hyperd that vanished without a FIN) had no way to be detected: the next blocking read would hang until the OS default keepalive idle timeout (7200s / 2h on macOS and Linux). Because the MCP serializes all tool calls behind a single engine mutex with no per-op timeout, one such stalled connection makes EVERY tool call, including `status`, appear to hang indefinitely. This became materially more likely in 0.5.0, where the daemon went resident-by-default: connections now live indefinitely across laptop suspends instead of being torn down by the old 30-minute idle shutdown.

Fix: set TCP keepalive (60s idle / 10s interval / 3 probes -> dead peer in ~90s) at both the sync and async TCP connect sites, alongside the existing nodelay/buffer-size socket options. Best-effort (.ok()): a kernel that rejects a knob leaves the connection at OS defaults. Probe count is Linux-only (macOS honors idle+interval). Deliberately NOT a query timeout: HyperDB runs legitimately long analytics queries, and keepalive only probes during true silence, so it can't abort live work.

Follow-up (not in this commit): the single global engine mutex means a slow/stalled op blocks unrelated tool calls. That is an architectural change warranting its own design pass.

StefanSteiner added 2 commits June 7, 2026 11:11

StefanSteiner changed the title ~~fix(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race~~ chore(mcp): TCP keepalive on hyperd connections + dedup redundant daemons on cold-start race Jun 8, 2026

StefanSteiner merged commit 480a579 into tableau:main Jun 8, 2026
11 checks passed

StefanSteiner merged 2 commits into tableau:main from StefanSteiner:fix/daemon-cold-start-dedup
