MCP: single engine mutex + no op timeout lets one stalled hyperd call hang all tool calls

## Summary

The MCP server serializes **every** tool call behind a single
`Arc<Mutex<Option<Engine>>>` ([`server.rs:821`](../../blob/main/hyperdb-mcp/src/server.rs#L821)),
and `with_engine` holds that lock across the entire blocking `hyperd` operation
([`server.rs:1228`](../../blob/main/hyperdb-mcp/src/server.rs#L1228)). With no
per-operation timeout, a single slow or stalled `hyperd` call blocks *all*
other tool calls — including lightweight ones like `status` — for as long as
the slow call runs.
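
To make the failure mode concrete, here is a minimal std-only sketch of the lock shape described above. `Engine`, `slow_call`, and `status_latency_during_stall` are hypothetical stand-ins, and this `with_engine` only mirrors the described behavior (guard held for the whole operation). It is not the code in `server.rs`.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real engine handle.
struct Engine;

impl Engine {
    /// Simulates a blocking `hyperd` call that takes `d` to return.
    fn slow_call(&self, d: Duration) {
        thread::sleep(d);
    }

    /// Simulates a trivial call such as `status`.
    fn status(&self) -> &'static str {
        "ok"
    }
}

type SharedEngine = Arc<Mutex<Option<Engine>>>;

/// Assumed shape of `with_engine`: the guard lives for the whole operation.
fn with_engine<T>(shared: &SharedEngine, op: impl FnOnce(&Engine) -> T) -> Option<T> {
    let guard = shared.lock().unwrap();
    match &*guard {
        Some(engine) => Some(op(engine)),
        None => None,
    }
}

/// Returns how long a `status` call waits when it is issued while another
/// call is stalled for `stall` under the same lock.
fn status_latency_during_stall(stall: Duration) -> Duration {
    let shared: SharedEngine = Arc::new(Mutex::new(Some(Engine)));
    let (held_tx, held_rx) = mpsc::channel();

    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            with_engine(&shared, |engine| {
                held_tx.send(()).unwrap(); // signal: the lock is now held
                engine.slow_call(stall);
            });
        })
    };

    held_rx.recv().unwrap(); // wait until the worker holds the lock
    let start = Instant::now();
    let status = with_engine(&shared, |engine| engine.status());
    let waited = start.elapsed();

    assert_eq!(status, Some("ok"));
    worker.join().unwrap();
    waited
}
```

In this sketch, `status_latency_during_stall(Duration::from_millis(200))` comes back only after roughly the full stall, even though `status` does no work of its own.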

## How it surfaced

A user reported `status` "hanging." Investigation showed `status` itself is
trivial, but it was queued behind another operation on a long-idle connection
that had gone half-open (laptop sleep / network blip). The immediate trigger —
no TCP keepalive, so a half-open socket blocked for the 2h OS idle default — is
fixed separately (TCP keepalive, ~90s dead-peer detection). **This issue is the
amplifier**: the global mutex turns one stalled connection into a total stall of
the MCP surface, and the missing timeout removes the only other backstop.

Exposure grew in v0.5.0, where the daemon became resident-by-default and
connections now live indefinitely across suspends.

## Why this is filed as a follow-up (not fixed in the keepalive PR)

Keepalive bounds the worst case to ~90s and is a safe, surgical change.
Removing the serialization is an **architectural** change (concurrency model of
the engine) and deserves its own design pass rather than a rushed patch. Filing
to track it.

## Options to evaluate (not a decision)

1. **Per-operation timeout / cancellation.** Wrap blocking `hyperd` calls so a
   stalled op can't hold the lock unboundedly. The connection builder already
   exposes `query_timeout` ([`connection_builder.rs:136`](../../blob/main/hyperdb-api/src/connection_builder.rs#L136)),
   but a blanket query timeout is **wrong** for HyperDB, where long-running
   analytics queries are legitimate. Any timeout must target *liveness* (is the
   peer responding), not *duration* (how long the query runs).
2. **Connection pool instead of a single engine.** Let independent tool calls
   use independent connections so a slow call doesn't block unrelated ones.
   Larger change; interacts with the ephemeral-primary / per-session workspace
   model and the catalog-bootstrap-once logic.
3. **Cheap read-only fast path.** Let `status` (and other non-engine-mutating
   introspection) answer without taking the engine lock — e.g. from cached
   metadata + the daemon health port — so diagnostics never hang even if the
   data plane is stalled.
4. **Run blocking calls on `spawn_blocking` with a watchdog** that drops/replaces
   the engine if an op exceeds a liveness deadline (reusing the existing
   `ConnectionLost` → drop-and-reconnect path in `with_engine`).
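
The liveness-versus-duration distinction in options 1 and 4 can be sketched with std threads and a channel. Everything named here (`run_with_liveness`, `Watched`, `Event`, the `beat` callback) is hypothetical, and the sketch assumes the blocking op can report progress somehow (for example, once per result chunk received), which the issue does not confirm for `hyperd`.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of a watched operation (hypothetical type).
#[derive(Debug, PartialEq)]
enum Watched<T> {
    Done(T),
    PeerSilent,
}

/// Messages the worker sends back to the watchdog (hypothetical type).
enum Event<T> {
    Heartbeat,
    Finished(T),
}

/// Runs `op` on a worker thread. The watchdog gives up only when no event
/// arrives for `liveness`; total run time is never bounded.
fn run_with_liveness<T: Send + 'static>(
    liveness: Duration,
    op: impl FnOnce(&dyn Fn()) -> T + Send + 'static,
) -> Watched<T> {
    let (tx, rx) = mpsc::channel::<Event<T>>();

    thread::spawn(move || {
        let beat_tx = tx.clone();
        // The op calls `beat()` whenever the peer shows signs of life.
        let beat = move || {
            let _ = beat_tx.send(Event::Heartbeat);
        };
        let out = op(&beat);
        let _ = tx.send(Event::Finished(out));
    });

    loop {
        match rx.recv_timeout(liveness) {
            Ok(Event::Heartbeat) => continue, // peer is alive, keep waiting
            Ok(Event::Finished(value)) => return Watched::Done(value),
            // Silence past the deadline. A worker that died without
            // finishing (channel disconnected) is treated the same way here.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return Watched::PeerSilent
            }
        }
    }
}
```

A query that runs longer than `liveness` but keeps calling `beat()` completes normally, while a silent one is reported as `PeerSilent`. Note that the sketch only stops waiting: the stalled worker thread keeps running, so a real watchdog would still have to drop and replace the engine to unblock it.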

## Acceptance criteria (rough)

- A stalled or very slow `hyperd` operation on one connection does not make
  unrelated tool calls (especially `status`) hang indefinitely.
- Legitimate long-running analytics queries are **not** aborted by a
  duration-based cutoff.
- The fix is verified with a test that simulates a wedged/slow connection and
  asserts a second concurrent call still returns (or fails fast).
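
One possible shape for that test is sketched below, using the option 3 fast path as the fix under test. `Server`, `last_known_status`, and `status_while_wedged` are hypothetical names, the wedged connection is simulated with a sleep while the lock is held, and the real test would need to wedge an actual `hyperd` connection instead.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real engine handle.
struct Engine;

/// Hypothetical server state: the engine lock plus a status snapshot
/// that can be read without touching the engine.
struct Server {
    engine: Mutex<Option<Engine>>,
    last_known_status: Mutex<String>,
}

impl Server {
    /// `status` that never waits on the engine lock.
    fn status(&self) -> String {
        match self.engine.try_lock() {
            Ok(guard) => {
                let fresh = if guard.is_some() { "connected" } else { "disconnected" };
                *self.last_known_status.lock().unwrap() = fresh.to_string();
                fresh.to_string()
            }
            // Engine is busy: answer from the snapshot instead of queueing.
            Err(TryLockError::WouldBlock) => {
                format!("{} (cached, engine busy)", self.last_known_status.lock().unwrap())
            }
            Err(TryLockError::Poisoned(_)) => "engine lock poisoned".to_string(),
        }
    }
}

/// Wedges the engine lock for `stall`, then times a concurrent `status` call.
fn status_while_wedged(stall: Duration) -> (String, Duration) {
    let server = Arc::new(Server {
        engine: Mutex::new(Some(Engine)),
        last_known_status: Mutex::new("connected".to_string()),
    });
    let (held_tx, held_rx) = mpsc::channel();

    let wedged = {
        let server = Arc::clone(&server);
        thread::spawn(move || {
            let _guard = server.engine.lock().unwrap();
            held_tx.send(()).unwrap(); // signal: the lock is now held
            thread::sleep(stall); // simulated stalled hyperd call
        })
    };

    held_rx.recv().unwrap(); // wait until the engine lock is held
    let start = Instant::now();
    let reply = server.status();
    let elapsed = start.elapsed();

    wedged.join().unwrap();
    (reply, elapsed)
}
```

The assertion of interest is that `elapsed` stays well below `stall` and that the reply is marked as cached, so a caller can tell a stale answer from a fresh one.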

## Related

- TCP keepalive fix (the immediate-trigger mitigation): branch
  `fix/daemon-cold-start-dedup`, PR for v0.5.1.
- Resident-by-default daemon that increased exposure: #114 / v0.5.0.

