MCP: single engine mutex + no op timeout lets one stalled hyperd call hang all tool calls #118

@StefanSteiner

Summary

The MCP server serializes every tool call behind a single
Arc<Mutex<Option<Engine>>> (server.rs:821), and with_engine holds that lock
across the entire blocking hyperd operation (server.rs:1228). Because there is
no per-operation timeout, a single slow or stalled hyperd call blocks all other
tool calls, including lightweight ones like status, for as long as the slow
call runs.
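
A minimal sketch of the head-of-line blocking described above, using only the standard library. The Engine type, its slow_query and status methods, and this with_engine signature are simplified stand-ins assumed for illustration; they are not the actual code in server.rs.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real engine.
struct Engine;

impl Engine {
    // Simulates a blocking hyperd call that takes `stall` to return.
    fn slow_query(&self, stall: Duration) {
        thread::sleep(stall);
    }

    // A trivial call that needs no real work.
    fn status(&self) -> &'static str {
        "ok"
    }
}

// Assumed shape: take the single lock, then run the whole operation
// while still holding it.
fn with_engine<T>(
    engine: &Arc<Mutex<Option<Engine>>>,
    op: impl FnOnce(&Engine) -> T,
) -> Option<T> {
    let guard = engine.lock().unwrap();
    guard.as_ref().map(op)
}

// Measures how long a `status` call waits when it is issued while
// another call is stalled inside `with_engine`.
fn status_latency_behind_stall(stall: Duration) -> Duration {
    let engine = Arc::new(Mutex::new(Some(Engine)));
    let (started_tx, started_rx) = mpsc::channel();

    let slow = {
        let engine = Arc::clone(&engine);
        thread::spawn(move || {
            with_engine(&engine, |e| {
                // Signal that the lock is now held, then stall.
                started_tx.send(()).unwrap();
                e.slow_query(stall);
            });
        })
    };

    // Wait until the slow call owns the lock before timing `status`.
    started_rx.recv().unwrap();
    let start = Instant::now();
    let status = with_engine(&engine, |e| e.status());
    let waited = start.elapsed();

    slow.join().unwrap();
    assert_eq!(status, Some("ok"));
    waited
}
```

Calling status_latency_behind_stall with a 300 ms stall should report a wait of roughly that length, even though status itself does no work. The wait tracks the stalled call, not the cost of status.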

How it surfaced

A user reported status "hanging." Investigation showed that status itself is
trivial, but it was queued behind another operation on a long-idle connection
that had gone half-open (laptop sleep / network blip). The immediate trigger
was the absence of TCP keepalive, which left the half-open socket blocked for
the 2h OS idle default; that is fixed separately (TCP keepalive, ~90s dead-peer
detection). This issue is the amplifier: the global mutex turns one stalled
connection into a total stall of the MCP surface, and the missing timeout
removes the only other backstop.

This became more impactful in v0.5.0, where the daemon became
resident-by-default and connections now live indefinitely across suspends.

Why this is filed as a follow-up (not fixed in the keepalive PR)

Keepalive bounds the worst case to ~90s and is a safe, surgical change.
Removing the serialization is an architectural change (concurrency model of
the engine) and deserves its own design pass rather than a rushed patch. Filing
to track it.

Options to evaluate (not a decision)

  1. Per-operation timeout / cancellation. Wrap blocking hyperd calls so a
    stalled op can't hold the lock unboundedly. The connection builder already
    exposes query_timeout (connection_builder.rs:136), but a blanket query
    timeout is wrong for HyperDB, where long analytics queries are legitimate.
    Any timeout must target liveness (is the peer responding), not duration
    (how long the query runs).
  2. Connection pool instead of a single engine. Let independent tool calls
    use independent connections so a slow call doesn't block unrelated ones.
    Larger change; interacts with the ephemeral-primary / per-session workspace
    model and the catalog-bootstrap-once logic.
  3. Cheap read-only fast path. Let status (and other non-engine-mutating
    introspection) answer without taking the engine lock — e.g. from cached
    metadata + the daemon health port — so diagnostics never hang even if the
    data plane is stalled.
  4. Run blocking calls on spawn_blocking with a watchdog that drops/replaces
    the engine if an op exceeds a liveness deadline (reusing the existing
    ConnectionLost → drop-and-reconnect path in with_engine).
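
To make the liveness-versus-duration distinction in options 1 and 4 concrete, here is a rough sketch of a liveness tracker that a watchdog could consult. The Liveness type, its methods, and the watchdog_fires helper are hypothetical names introduced for illustration; how "progress" would actually be observed on a hyperd connection is left open.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical helper: remembers when the peer last showed a sign of life.
struct Liveness {
    last_progress: Mutex<Instant>,
}

impl Liveness {
    fn new(now: Instant) -> Self {
        Liveness { last_progress: Mutex::new(now) }
    }

    // Assumed to be called whenever any data arrives from the peer.
    fn record_progress(&self, now: Instant) {
        *self.last_progress.lock().unwrap() = now;
    }

    // True when the peer has been silent for longer than `deadline`,
    // regardless of how long the operation as a whole has been running.
    fn is_stalled(&self, now: Instant, deadline: Duration) -> bool {
        now.duration_since(*self.last_progress.lock().unwrap()) > deadline
    }
}

// Walks a simulated clock through an operation lasting `total`, with the
// peer reporting progress every `tick` (or never, when `tick` is None).
// Returns whether a watchdog using `deadline` would have fired.
fn watchdog_fires(total: Duration, tick: Option<Duration>, deadline: Duration) -> bool {
    let start = Instant::now();
    let live = Liveness::new(start);
    let step = Duration::from_secs(1);
    let mut elapsed = Duration::ZERO;
    let mut since_tick = Duration::ZERO;

    while elapsed < total {
        elapsed += step;
        since_tick += step;
        if let Some(t) = tick {
            if since_tick >= t {
                live.record_progress(start + elapsed);
                since_tick = Duration::ZERO;
            }
        }
        if live.is_stalled(start + elapsed, deadline) {
            return true;
        }
    }
    false
}
```

With a 90-second deadline (borrowed from the keepalive figure above purely as an example), a one-hour query whose peer reports progress every 30 seconds never trips the watchdog, while a peer that goes silent trips it after 91 simulated seconds. A duration-based cutoff would not be able to tell those two cases apart.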

Acceptance criteria (rough)

  • A stalled or very slow hyperd operation on one connection does not make
    unrelated tool calls (especially status) hang indefinitely.
  • Legitimate long-running analytics queries are not aborted by a
    duration-based cutoff.
  • The fix is verified with a test that simulates a wedged/slow connection and
    asserts a second concurrent call still returns (or fails fast).
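
One possible shape for the test in the last criterion, sketched with the standard library only. The Server struct, the cached_status field, and both status methods are assumptions made for this example (the fast path loosely follows option 3); the wedged connection is simulated by a thread that holds the engine lock until it is released.

```rust
use std::sync::{mpsc, Arc, Mutex, RwLock};
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins; the real types in server.rs will differ.
struct Engine;

struct Server {
    engine: Mutex<Option<Engine>>,
    // Assumed cache that some other path keeps up to date.
    cached_status: RwLock<String>,
}

impl Server {
    // Current shape: status waits for the engine lock like any other call.
    fn status_via_engine(&self) -> String {
        let guard = self.engine.lock().unwrap();
        if guard.is_some() { "ready".to_string() } else { "no engine".to_string() }
    }

    // Fast path: answers from the cache and never touches the engine lock.
    fn status_fast(&self) -> String {
        self.cached_status.read().unwrap().clone()
    }
}

// Wedges the engine lock, issues a status call on another thread, and
// returns its answer if one arrives within `budget`.
fn status_while_wedged(fast_path: bool, budget: Duration) -> Option<String> {
    let server = Arc::new(Server {
        engine: Mutex::new(Some(Engine)),
        cached_status: RwLock::new("ready (cached)".to_string()),
    });
    let (held_tx, held_rx) = mpsc::channel();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    // Simulated wedged operation: holds the lock until told to let go.
    let wedged = {
        let server = Arc::clone(&server);
        thread::spawn(move || {
            let _guard = server.engine.lock().unwrap();
            held_tx.send(()).unwrap();
            let _ = release_rx.recv();
        })
    };
    held_rx.recv().unwrap();

    // Second, concurrent call.
    let (result_tx, result_rx) = mpsc::channel();
    let caller = {
        let server = Arc::clone(&server);
        thread::spawn(move || {
            let answer = if fast_path {
                server.status_fast()
            } else {
                server.status_via_engine()
            };
            let _ = result_tx.send(answer);
        })
    };
    let answer = result_rx.recv_timeout(budget).ok();

    // Unwedge so both threads can finish and the test does not leak them.
    release_tx.send(()).unwrap();
    wedged.join().unwrap();
    caller.join().unwrap();
    answer
}
```

Running the call on its own thread and waiting with recv_timeout means a regression shows up as a failed assertion instead of a hung test run. In this sketch the lock-taking variant yields None within the budget, and the fast-path variant returns the cached string.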
