Description
It was demonstrated in [stacklok/toolhive#1062](#1062) that stdio-based MCP servers do not scale well - if at all - inside Kubernetes. That impacts every server using the stdio transport. We should lean on ToolHive to provide a scaling path.
For production workloads under high load, the recommendation is to use Streamable-HTTP (or future transports that handle scaling better). Still, we should offer a basic scaling story for stdio.
The goal for stdio servers is modest: allow requests to succeed under light load (with acceptable wait time). This issue explores ways ToolHive could enable that in Kubernetes.
Options
Option A - Simple Request-attach, Response-detach (S-RARD)
For each request, the ToolHive proxy attaches to a backend stdio server, then detaches after sending the response to the client. This avoids a 1:1 proxy↔server pod mapping and lets the proxy serve additional client requests via subsequent attach/detach cycles. The proxy will need to queue pending requests while waiting to attach.
Pros
- Eliminates long-lived connections
- Allows multiple requests over time with a single proxy
Cons
- Longer or more numerous requests increase latency for subsequent ones
- No MCP notifications (no persistent connection for server-initiated updates)
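A minimal Go sketch of the S-RARD cycle, assuming a single backend guarded by a mutex; the `backend`/`handle` names are illustrative only, not ToolHive's actual API. Requests queue on the lock while waiting to attach, and each one releases the backend after its response is produced:

```go
package main

import (
	"fmt"
	"sync"
)

// backend models a single stdio MCP server; only one request may be
// attached at a time (hypothetical sketch, not ToolHive code).
type backend struct {
	mu sync.Mutex // serializes attach/detach cycles
}

// handle attaches to the backend, forwards the request, and detaches
// once the response has been produced.
func (b *backend) handle(req string) string {
	b.mu.Lock()         // attach: block until the backend is free
	defer b.mu.Unlock() // detach: release for the next queued request
	return "response:" + req
}

func main() {
	b := &backend{}
	var wg sync.WaitGroup
	out := make(chan string, 3)
	// Three concurrent client requests share one backend over time.
	for _, r := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			out <- b.handle(r)
		}(r)
	}
	wg.Wait()
	close(out)
	n := 0
	for range out {
		n++
	}
	fmt.Println(n) // all queued requests eventually succeed
}
```

The mutex makes the head-of-line blocking visible: a slow request delays every request queued behind it, which is exactly the latency con noted above.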
Option B - Pooled Request-attach, Response-detach (P-RARD)
Same model as Option A, but with a pool of backend stdio servers ready to take work. This targets higher throughput by reducing queue time.
Pros
- Eliminates long-lived connections
- Supports multiple requests
- Scales better than Option A: the backend pool shortens queue waits
Cons
- Longer or more numerous requests still add latency under heavy load
- No MCP notifications (no persistent connection)
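The pooled variant can be sketched in Go with a buffered channel holding idle backend IDs; `pool`/`newPool` are hypothetical names for illustration, not ToolHive APIs. A request checks a backend out, responds, then returns it to the pool:

```go
package main

import "fmt"

// pool hands out idle backends; the buffered channel models the set of
// ready stdio servers (hypothetical sketch).
type pool struct {
	idle chan int
}

func newPool(size int) *pool {
	p := &pool{idle: make(chan int, size)}
	for i := 0; i < size; i++ {
		p.idle <- i // pre-warm: all backends start idle
	}
	return p
}

// handle checks out any idle backend, forwards the request, and returns
// the backend to the pool after responding.
func (p *pool) handle(req string) string {
	id := <-p.idle                  // attach: blocks only if every backend is busy
	defer func() { p.idle <- id }() // detach: backend is ready again
	return fmt.Sprintf("backend-%d:%s", id, req)
}

func main() {
	p := newPool(2)
	fmt.Println(p.handle("a"))
	fmt.Println(p.handle("b"))
}
```

With a pool of size N, up to N requests proceed in parallel before anything queues, which is how this option reduces the wait time relative to Option A.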
Option C - Server-per-Request
For each client request, the proxy spins up a dedicated MCP server, keeps it running until the client disconnects (or times out), then shuts it down.
Pros
- Supports long-lived connections and MCP notifications
- No head-of-line blocking from prior requests
Cons
- Higher latency from per-request server startup (cold starts)
- Larger container footprint (e.g., 100 requests in 60s → ~100 pods)
Conclusion
All three approaches are viable. Since we already recommend different transports for high-scale scenarios, our stdio strategy only needs to cover basic scaling. Weighing complexity against benefit, I (@ChrisJBurns) recommend Option A: it's the simplest, carries the fewest trade-offs, and is likely the most practical default. Option C preserves notifications but incurs longer wait times and a heavier footprint, which many users won't accept.
Open to other views and refinements.
The following related issues were raised in connection with this one: