
fix(mcp): harden daemon discovery — identified PONG, port scanning, version takeover, resident-by-default #115

Merged
StefanSteiner merged 7 commits into tableau:main from StefanSteiner:fix/daemon-discovery-hardening on Jun 7, 2026

@StefanSteiner
Contributor

Summary

Hardens the hyperdb-mcp single-instance daemon in four ways: clients can no
longer mistake a foreign process for the daemon, the default health port no
longer collides with hyperd's gRPC port, upgrades take effect immediately, and
the daemon (plus the hyperd it owns) stays resident. The last change eliminates
the "hyper is restarting, please retry" round-trip clients hit after an idle
shutdown.

Resolves #114.

Background

The daemon advertises a TCP health port in ~/.hyperdb/daemon.json that serves
as both a single-instance lock and a control channel. Four problems motivated
this work:

  1. Weak liveness check. discover() trusted a bare TCP connect() to the
    health port and never sent PING — any process camped on the port read as a
    live daemon.
  2. Collision-prone default port. The default was 7484, which is hyperd's
    conventional gRPC port (ListenMode::Both { grpc_port: 7484 }) — exactly the
    kind of process that triggers problem 1.
  3. Slow upgrades. A newer client silently reused an old daemon; the stale
    hyperd lingered until the 30-min idle timeout.
  4. Restart-churn UX. On idle timeout the daemon killed hyperd; the next
    client hit a connection error surfaced as "hyper restarting, retry."

Changes

  • Identity-checked discovery. PING now replies PONG hyperdb-mcp <version>.
    Clients verify the exact tokens (not a string prefix) before trusting a daemon;
    is_daemon_alive uses this instead of a bare TCP connect. A process that
    answers TCP but not the identified protocol is classified as "camped" and
    skipped; a stale/foreign daemon.json is detected and removed.
  • Port scanning. Default base port is now 7485 (away from hyperd gRPC).
    resolve_port_scan() returns a PortScan { base, span }: pins the exact port
    when HYPERDB_DAEMON_PORT is set, else scans 16 ports upward. probe_port
    classifies each as OurDaemon / Camped / Refused; the daemon spawns on the
    first refused port. daemon status / daemon stop locate the daemon via
    discovery + scan (CLI --port is now optional).
  • Newer-client version takeover. A starting client whose semver is strictly
    newer STOPs the old daemon (which drops its HyperProcess, stopping hyperd),
    waits for the port to release, and respawns on the same port. Equal / older /
    unparseable versions reuse the daemon — never a downgrade-kill.
  • Resident by default. DaemonConfig.idle_timeout is now Option, set only
    via --idle-timeout / HYPERDB_DAEMON_IDLE_TIMEOUT (flag wins over env). With
    neither set the idle monitor never arms; the daemon and hyperd stay warm. The
    hyperd restart-limit shutdown (3 failures / 60s) is unchanged.
  • status tool surfaces the endpoint. New engine block: mode
    (daemon/local), hyperd_endpoint, and daemon_health_port — previously
    only reachable via ~/.hyperdb/daemon.json or hyperdb-mcp daemon status.
  • Heartbeat correctness fix. The server heartbeat previously re-resolved the
    port, which would target the wrong port under scanning; the engine now records
    the discovered health_port and the heartbeat uses it.

Decisions

  • HYPERDB_DAEMON_PORT set ⇒ pin that exact port (no scan).
  • Scan tuning: 16 ports from 7485, 300ms connect/read per probe.
  • Version takeover only on strictly-newer semver (two builds of the same version
    — e.g. dev rebuilds — are equal and won't take over; use daemon stop to force).
  • Old/foreign daemon failing the identity check ⇒ delete stale daemon.json,
    re-spawn.
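
The pin-versus-scan decision above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: only `PortScan { base, span }`, `resolve_port_scan`, the base port 7485, and the span of 16 come from the description. Taking the environment value as a parameter, using `span: 1` for a pinned port, falling back to a scan when the value does not parse, and the `ports()` helper are all assumptions made for the example.

```rust
/// Mirrors the `PortScan { base, span }` shape named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortScan {
    base: u16,
    span: u16,
}

const DEFAULT_BASE_PORT: u16 = 7485; // from the PR: moved away from hyperd gRPC (7484)
const DEFAULT_SCAN_SPAN: u16 = 16; // from the PR: scan 16 ports upward

/// `env_port` stands in for the raw value of HYPERDB_DAEMON_PORT (assumption:
/// the real function reads the environment itself).
fn resolve_port_scan(env_port: Option<&str>) -> PortScan {
    match env_port.and_then(|raw| raw.trim().parse::<u16>().ok()) {
        // Pinned: exactly one candidate port, no scan.
        Some(port) => PortScan { base: port, span: 1 },
        // Unset (or, in this sketch, unparseable): scan upward from the default.
        None => PortScan {
            base: DEFAULT_BASE_PORT,
            span: DEFAULT_SCAN_SPAN,
        },
    }
}

impl PortScan {
    /// Hypothetical helper: candidate ports in scan order, never wrapping past 65535.
    fn ports(&self) -> impl Iterator<Item = u16> {
        let base = self.base;
        (0..self.span).filter_map(move |offset| base.checked_add(offset))
    }
}
```

With no override the candidates run from 7485 through 7500; with an override the scan collapses to the single pinned port.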

Known tradeoff

Resident-by-default removes the idle timeout that used to implicitly reap a
hung-but-alive hyperd. Such a daemon now stays wedged until a client reports
an error (REPORT_HYPERD_ERROR, fired on client-side ConnectionLost) or an
operator runs daemon stop. Documented in DEVELOPMENT.md; a future daemon-side
liveness probe could close the "all clients idle + hyperd hung" gap. See #114.

Testing

  • New unit tests: identified-PONG accept / reject-foreign / reject-refused /
    reject-token-lookalike; resolve_port_scan pin-vs-scan; scan
    finds-via-STATUS / skips-camped / all-refused; client_should_take_over
    version matrix; DaemonConfig::from_args none/flag/env/precedence; status
    engine block.
  • cargo test -p hyperdb-mcp green; cargo clippy --workspace --all-targets -- -D warnings clean; cargo fmt --all --check clean; cargo deny check clean
    (adds semver as a direct dep, already in the lock).
  • Verified end-to-end against a rebuilt MCP: status reports
    engine.hyperd_endpoint and daemon_health_port: 7486 (scanner correctly
    skipped an occupied 7485), version stamp matches the branch HEAD.

Docs

README (Operating Modes + CLI reference), DEVELOPMENT.md (daemon internals +
tradeoff), CHANGELOG Unreleased. Wiki: new Shared Daemon page.

Liveness checks now send PING and require an identifying
'PONG hyperdb-mcp <version>' reply (verified by exact tokens, not a
string prefix), so a foreign process camped on the health port no
longer reads as a live daemon. Default base port moves 7484 -> 7485
(7484 is hyperd's conventional gRPC port) and a PortScan resolver is
introduced (pin when HYPERDB_DAEMON_PORT is set, else scan span 16).
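
To make the "exact tokens, not a string prefix" rule concrete, here is a minimal sketch of such a check. The function name `parse_identified_pong` is hypothetical, and so is the assumption that the reply is whitespace-separated with nothing after the version; only the reply shape `PONG hyperdb-mcp <version>` comes from the commit message.

```rust
/// Returns the advertised version if `reply` consists of exactly the three
/// tokens `PONG`, `hyperdb-mcp`, `<version>`; otherwise returns `None`.
fn parse_identified_pong(reply: &str) -> Option<&str> {
    let mut tokens = reply.split_whitespace();
    match (tokens.next(), tokens.next(), tokens.next(), tokens.next()) {
        // Whole-token comparison: "hyperdb-mcp-evil" does not match "hyperdb-mcp",
        // which a `starts_with("PONG hyperdb-mcp")` prefix test would accept.
        (Some("PONG"), Some("hyperdb-mcp"), Some(version), None) => Some(version),
        _ => None,
    }
}
```

A bare `PONG`, a lookalike service name, or extra trailing tokens are all rejected, which is what keeps a foreign process from passing as the daemon.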

Also fixes a latent bug: the server heartbeat re-resolved the port
instead of using the daemon's discovered health_port, which would
target the wrong port once scanning lands. The engine now carries
daemon_health_port and the heartbeat uses it.

ensure_daemon now scans a port range (PortScan) instead of a single
fixed port: it PING-identifies each port, returns the first running
hyperdb-mcp daemon (verified via STATUS), and otherwise spawns a fresh
daemon on the first connection-refused port. probe_port distinguishes
our-daemon / camped-foreign / refused; a process that answers TCP but
not the identified protocol is treated as camped and skipped.
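
The probe-and-scan flow described above might look roughly like the sketch below. It is not the PR's implementation. The names `probe_port`, `OurDaemon`, `Camped`, `Refused`, `Found`, and `FreePort` appear in the PR text, while `PortState`, `ScanOutcome`, `Exhausted`, `scan_ports`, and every signature are hypothetical. The sketch also assumes a newline-terminated `PING`, treats any connect failure as refused, and omits the STATUS verification step.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Classification of a single probed port.
#[derive(Debug, PartialEq, Eq)]
enum PortState {
    OurDaemon(String), // carries the advertised version
    Camped,            // something answers TCP but not the identified protocol
    Refused,           // nothing listening: safe to spawn here
}

/// Result of scanning a range. `Exhausted` is a hypothetical name.
#[derive(Debug, PartialEq, Eq)]
enum ScanOutcome {
    Found { port: u16, version: String },
    FreePort(u16),
    Exhausted,
}

fn probe_port(port: u16, timeout: Duration) -> PortState {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    // Simplification: every connect failure counts as "refused".
    let Ok(mut stream) = TcpStream::connect_timeout(&addr, timeout) else {
        return PortState::Refused;
    };
    let _ = stream.set_read_timeout(Some(timeout));
    let _ = stream.set_write_timeout(Some(timeout));
    if stream.write_all(b"PING\n").is_err() {
        return PortState::Camped;
    }
    let mut line = String::new();
    if BufReader::new(stream).read_line(&mut line).is_err() {
        return PortState::Camped;
    }
    let mut tokens = line.split_whitespace();
    match (tokens.next(), tokens.next(), tokens.next(), tokens.next()) {
        (Some("PONG"), Some("hyperdb-mcp"), Some(version), None) => {
            PortState::OurDaemon(version.to_string())
        }
        _ => PortState::Camped,
    }
}

/// Prefer any running daemon in the range; otherwise report the first refused port.
fn scan_ports(base: u16, span: u16, timeout: Duration) -> ScanOutcome {
    let mut first_free = None;
    for port in (0..span).filter_map(|offset| base.checked_add(offset)) {
        match probe_port(port, timeout) {
            PortState::OurDaemon(version) => return ScanOutcome::Found { port, version },
            PortState::Refused => {
                first_free.get_or_insert(port);
            }
            PortState::Camped => {}
        }
    }
    first_free.map_or(ScanOutcome::Exhausted, ScanOutcome::FreePort)
}
```

The scan keeps going after the first refused port so that a daemon living further up the range is reused rather than shadowed by a fresh spawn.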

A newly starting client whose semver is strictly newer than the running
daemon takes it over: STOP the old daemon (which drops its HyperProcess
and stops hyperd), wait for the health port to release, then respawn on
the same port. Equal/older/unparseable versions reuse the daemon, never
a downgrade-kill. Adds semver as a direct dependency (already in lock).
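
The takeover rule reduces to a single strict comparison, sketched here. The PR uses the `semver` crate; to stay dependency-free this sketch substitutes a simplified `major.minor.patch` parser (hypothetical helper `parse_triple`) that ignores pre-release and build metadata, so it is only an approximation of real semver ordering. The function name `client_should_take_over` comes from the PR's test list, but its signature here is an assumption.

```rust
/// Simplified stand-in for semver parsing: accepts exactly `major.minor.patch`.
fn parse_triple(version: &str) -> Option<(u64, u64, u64)> {
    let mut parts = version.trim().split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components: treat as unparseable
    }
    Some((major, minor, patch))
}

/// True only when the client is strictly newer than the running daemon.
/// Equal, older, or unparseable versions on either side mean "reuse the daemon".
fn client_should_take_over(client_version: &str, daemon_version: &str) -> bool {
    match (parse_triple(client_version), parse_triple(daemon_version)) {
        // Tuples compare lexicographically, and numerically per component,
        // so 1.10.0 correctly ranks above 1.9.0.
        (Some(client), Some(daemon)) => client > daemon,
        _ => false,
    }
}
```

Defaulting to `false` whenever either side fails to parse is what guarantees that a malformed version string can never trigger a kill.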
Idle shutdown is now opt-in: DaemonConfig.idle_timeout is Option, set
only when --idle-timeout or HYPERDB_DAEMON_IDLE_TIMEOUT is provided
(flag wins over env). With neither set the idle monitor never arms, so
the daemon and its hyperd stay resident — eliminating the connection
error + 'hyper restarting, retry' churn a client hit after a 30-min
idle shutdown. The hyperd restart-limit shutdown path is unchanged.
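
The flag-over-env precedence and the "never arms when neither is set" behaviour can be illustrated with a small sketch. The helper name `resolve_idle_timeout` is hypothetical (the PR routes this through `DaemonConfig::from_args`), and interpreting both inputs as whole seconds, as well as ignoring an unparseable environment value, are assumptions made only for the example.

```rust
use std::time::Duration;

/// `flag_secs` stands in for a parsed `--idle-timeout`; `env_value` stands in
/// for the raw HYPERDB_DAEMON_IDLE_TIMEOUT string. Units (seconds) are assumed.
fn resolve_idle_timeout(flag_secs: Option<u64>, env_value: Option<&str>) -> Option<Duration> {
    flag_secs
        // The env value is consulted only when the flag is absent: flag wins.
        .or_else(|| env_value.and_then(|raw| raw.trim().parse::<u64>().ok()))
        .map(Duration::from_secs)
}
```

A `None` result corresponds to the resident default: with no timeout configured there is nothing for an idle monitor to arm.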

The daemon CLI --port is now Option<u16>: 'daemon stop'/'status' omit it
and resolve the live daemon via find_running_daemon() (discover + scan),
so they no longer miss a daemon that scanned onto a non-base port. A
bare 'hyperdb-mcp daemon' binds resolve_port_scan().base.
…dent default

Update README Operating Modes + CLI reference, DEVELOPMENT daemon
internals, and the CHANGELOG Unreleased entry to reflect: identity-checked
discovery (PONG hyperdb-mcp <version>), port scanning from 7485 (was fixed
7484, which clashes with hyperd gRPC), newer-client version takeover, and
idle shutdown now being opt-in (daemon stays resident by default).
…radeoff

Final-sweep follow-ups:
- maybe_take_over: before respawning on the freed port, adopt a
  concurrently-published identity-verified daemon on that same port if
  one already exists, avoiding a redundant spawn and a stale-endpoint
  return during simultaneous version takeovers.
- Expand the heartbeat comment to explain why the discovered health_port
  is used instead of re-resolving (scanning can land off the base port).
- DEVELOPMENT.md: document the resident-by-default tradeoff for a
  hung-but-alive hyperd and note a possible daemon-side liveness probe.

The status tool now includes an "engine" block: mode (daemon/local),
hyperd_endpoint (the libpq endpoint queries run against), and
daemon_health_port (the shared daemon's control/lock port, null in local
mode). Previously the endpoint was only reachable by reading
~/.hyperdb/daemon.json or via 'hyperdb-mcp daemon status'.

The test scanned the full port range between two arbitrary OS-assigned
ports. Other tests leak identity-answering HealthListeners on random
high ports for the process lifetime; one could land inside that range
and be returned as Found instead of FreePort (observed on Linux CI).

Narrow the scan to exactly two adjacent ports {base (camped), base+1
(free)}, confirming base+1 is bindable immediately before scanning, so
a leaked listener can no longer fall inside the window.
@StefanSteiner StefanSteiner merged commit 05019b9 into tableau:main Jun 7, 2026
11 checks passed
@StefanSteiner StefanSteiner mentioned this pull request Jun 7, 2026