Summary
When a VirtualMCPServer pod starts at the same time as its backend MCPServer proxies (e.g., during a simultaneous Flux reconciliation), the health monitor's initial health checks fail because the proxies aren't ready yet. Once marked unhealthy, the backends never recover despite becoming fully healthy — the vMCP stays in a Degraded state indefinitely until the pod is manually restarted.
Reproduction
- Deploy a VirtualMCPServer and its backend MCPServers simultaneously (e.g., via a single Flux Kustomization)
- The vMCP pod starts and runs its initial health checks before the backend proxy pods are ready
- Initial health checks fail with:
  `transport error: server returned 4xx for initialize POST, likely a legacy SSE server`
- Subsequent health checks (every 30s) continue to fail with the same error, even though the backends are now fully healthy
- The vMCP reports `Degraded` with increasing `consecutiveFailures`
Observed behavior
- The vMCP `status.phase` stays `Degraded` with `message: "Some backends are unhealthy"`
- `consecutiveFailures` increments every health check interval (30s) and never resets
- Port-forwarding to the backend proxy services and sending the same initialize POST returns HTTP 200 with a valid MCP response
- A second VirtualMCPServer (anonymous auth, already running before the backends were deployed) shows all the same backends as `ready` using the exact same proxy URLs
- Deleting the vMCP pod (forcing a fresh start) immediately resolves the issue: all backends become healthy
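The port-forward verification can be reproduced with a small Go probe. The payload fields are illustrative, and `probeInitialize` and `startFakeBackend` are hypothetical helpers written for this report, not toolhive code:

```go
package main

import (
	"bytes"
	"net/http"
	"net/http/httptest"
)

// probeInitialize sends a minimal MCP initialize POST (JSON-RPC over
// Streamable HTTP) to the given URL and returns the HTTP status code.
func probeInitialize(url string) (int, error) {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"probe","version":"0.0.1"}}}`)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json, text/event-stream")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// startFakeBackend returns a stub server that answers every request with
// the given status code, standing in for the real proxy.
func startFakeBackend(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}
```

Against the real proxy services this returned HTTP 200, which is what makes the persistent `Degraded` state surprising.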
Expected behavior
Each health check cycle should create a fully independent client connection to the backend. If a backend was unavailable at startup but later becomes healthy, the health monitor should detect the recovery and transition the backend from unhealthy → healthy.
Environment
- toolhive v0.12.4 (
ghcr.io/stacklok/toolhive/vmcp:v0.12.4) - mcp-go v0.45.0
- Kubernetes (EKS), backends managed via Flux CD
- VirtualMCPServer with
manualconflict resolution and OIDC + Cedar auth
Analysis
The health check path (`health/monitor.go` → `health/checker.go` → `client/client.go` → `ListCapabilities` → `defaultClientFactory`) appears to create a new `*client.Client` and `http.Client` on each cycle, but all share `http.DefaultTransport` as the base round tripper. The mcp-go StreamableHTTP transport returns `ErrLegacySSEServer` for any 4xx during an initialize POST: this may be getting cached, or the connection pool in `http.DefaultTransport` may be holding onto stale connection state.
Secondary issue
A previously deleted MCPServer (`admin-tools`) remained in the vMCP's `discoveredBackends` list after the resource was removed from the cluster. The vMCP controller did not reconcile the removal, resulting in perpetual DNS lookup failures for `mcp-admin-tools-proxy`.
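The expected reconciliation step can be sketched as pruning discovered backends whose MCPServer no longer exists. The function name and signature are hypothetical; in toolhive this would presumably happen inside the VirtualMCPServer controller's reconcile loop:

```go
package main

// pruneBackends drops entries from the discovered-backend list whose
// backing MCPServer is no longer present in the cluster, so the vMCP
// stops probing (and failing DNS lookups for) deleted backends.
func pruneBackends(discovered []string, existing map[string]bool) []string {
	kept := make([]string, 0, len(discovered))
	for _, name := range discovered {
		if existing[name] {
			kept = append(kept, name)
		}
	}
	return kept
}
```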