Description
It was demonstrated in [stacklok/toolhive#1062](#1062) that stdio-based MCP servers do not scale well - if at all - inside Kubernetes. That impacts every server using the stdio transport. We should lean on ToolHive to provide a scaling path.
For production workloads under high load, the recommendation is to use Streamable-HTTP (or future transports that handle scaling better). Still, we should offer a basic scaling story for stdio.
The goal for stdio servers is modest: allow requests to succeed under light load (with acceptable wait time). This issue explores ways ToolHive could enable that in Kubernetes.
Options
Option A - Simple Request-attach, Response-detach (S-RARD)
For each request, the ToolHive proxy attaches to a backend stdio server, then detaches after sending the response to the client. This avoids a 1:1 proxy↔server pod mapping and lets the proxy serve additional client requests via subsequent attach/detach cycles. The proxy will need to queue pending requests while waiting to attach.
Pros
- Eliminates long-lived connections
- Allows multiple requests over time with a single proxy
Cons
- Longer or more numerous requests increase latency for subsequent ones
- No MCP notifications (no persistent connection for server-initiated updates)
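A minimal Go sketch of the S-RARD cycle, assuming a single backend guarded by a mutex; the `backend`/`handle` names are illustrative only, not ToolHive's actual API. Requests queue on the lock while waiting to attach, and each one releases the backend after its response is produced:

```go
package main

import (
	"fmt"
	"sync"
)

// backend models a single stdio MCP server; only one request may be
// attached at a time (hypothetical sketch, not ToolHive code).
type backend struct {
	mu sync.Mutex // serializes attach/detach cycles
}

// handle attaches to the backend, forwards the request, and detaches
// once the response has been produced.
func (b *backend) handle(req string) string {
	b.mu.Lock()         // attach: block until the backend is free
	defer b.mu.Unlock() // detach: release for the next queued request
	return "response:" + req
}

func main() {
	b := &backend{}
	var wg sync.WaitGroup
	out := make(chan string, 3)
	// Three concurrent client requests share one backend over time.
	for _, r := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			out <- b.handle(r)
		}(r)
	}
	wg.Wait()
	close(out)
	n := 0
	for range out {
		n++
	}
	fmt.Println(n) // all queued requests eventually succeed
}
```

The mutex makes the head-of-line blocking visible: a slow request delays every request queued behind it, which is exactly the latency con noted above.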
Option B - Pooled Request-attach, Response-detach (P-RARD)
Same model as Option A, but with a pool of backend stdio servers ready to take work. This targets higher throughput by reducing queue time.
Pros
- Eliminates long-lived connections
- Supports multiple requests
- Scales better than Option A: the backend pool shortens queue waits
Cons
- Longer or more numerous requests still add latency under heavy load
- No MCP notifications (no persistent connection)
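The pooled variant can be sketched in Go with a buffered channel holding idle backend IDs; `pool`/`newPool` are hypothetical names for illustration, not ToolHive APIs. A request checks a backend out, responds, then returns it to the pool:

```go
package main

import "fmt"

// pool hands out idle backends; the buffered channel models the set of
// ready stdio servers (hypothetical sketch).
type pool struct {
	idle chan int
}

func newPool(size int) *pool {
	p := &pool{idle: make(chan int, size)}
	for i := 0; i < size; i++ {
		p.idle <- i // pre-warm: all backends start idle
	}
	return p
}

// handle checks out any idle backend, forwards the request, and returns
// the backend to the pool after responding.
func (p *pool) handle(req string) string {
	id := <-p.idle                  // attach: blocks only if every backend is busy
	defer func() { p.idle <- id }() // detach: backend is ready again
	return fmt.Sprintf("backend-%d:%s", id, req)
}

func main() {
	p := newPool(2)
	fmt.Println(p.handle("a"))
	fmt.Println(p.handle("b"))
}
```

With a pool of size N, up to N requests proceed in parallel before anything queues, which is how this option reduces the wait time relative to Option A.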
Option C - Server-per-Request
For each client request, the proxy spins up a dedicated MCP server, keeps it running until the client disconnects (or times out), then shuts it down.
Pros
- Supports long-lived connections and MCP notifications
- No head-of-line blocking from prior requests
Cons
- Higher latency from per-request server startup (cold starts)
- Larger container footprint (e.g., 100 requests in 60s → ~100 pods)
Conclusion
All three approaches are viable. Since we already recommend different transports for high-scale scenarios, our stdio strategy only needs to cover basic scaling. Weighing complexity against benefit, I (@ChrisJBurns) recommend Option A: it's the simplest, carries the fewest trade-offs, and is likely the most practical default. Option C preserves notifications but incurs longer wait times and a heavier footprint, which many users won't accept.
Open to other views and refinements.
The following related issues were raised in connection with this one: