Basic Scaling of stdio Servers inside of Kubernetes #1589

@ChrisJBurns

Description
It was demonstrated in [stacklok/toolhive#1062](#1062) that stdio-based MCP servers do not scale well - if at all - inside Kubernetes. That impacts every server using the stdio transport. We should lean on ToolHive to provide a scaling path.

For production workloads under high load, the recommendation is to use Streamable-HTTP (or future transports that handle scaling better). Still, we should offer a basic scaling story for stdio.

The goal for stdio servers is modest: allow requests to succeed under light load (with acceptable wait time). This issue explores ways ToolHive could enable that in Kubernetes.

Options

Option A - Simple Request-attach, Response-detach (S-RARD)

For each request, the ToolHive proxy attaches to a backend stdio server, then detaches after sending the response to the client. This avoids a 1:1 proxy↔server pod mapping and lets the proxy serve additional client requests via subsequent attach/detach cycles. The proxy will need to queue pending requests while waiting to attach.

Pros

  • Eliminates long-lived connections
  • Allows multiple requests over time with a single proxy

Cons

  • Longer or more numerous requests increase latency for subsequent ones
  • No MCP notifications (no persistent connection for server-initiated updates)

Option B - Pooled Request-attach, Response-detach (P-RARD)

Same model as Option A, but with a pool of backend stdio servers ready to take work. This targets higher throughput by reducing queue time.

Pros

  • Eliminates long-lived connections
  • Supports multiple requests
  • Scales better than Option A: the backend pool shortens queue waits

Cons

  • Longer or more numerous requests still add latency under heavy load
  • No MCP notifications (no persistent connection)

Option C - Server-per-Request

For each client request, the proxy spins up a dedicated MCP server, keeps it running until the client disconnects (or times out), then shuts it down.

Pros

  • Supports long-lived connections and MCP notifications
  • No head-of-line blocking from prior requests

Cons

  • Higher latency from per-request server startup (cold starts)
  • Larger container footprint (e.g., 100 requests in 60s → ~100 pods)

Conclusion

All three approaches are viable. Since we already recommend different transports for high-scale scenarios, our stdio strategy only needs to cover basic scaling. Given complexity vs. benefit, I (@ChrisJBurns) recommend Option A. It’s the simplest, carries fewer trade-offs, and is likely the most practical default. Option C preserves notifications but incurs longer wait times and a heavier footprint, which many users won’t accept.

Open to other views and refinements.

The following issues were raised that are pertinent to this one:
