vMCP health monitor permanently marks backends unhealthy after startup race #4278

@jerm-dro

Description

Summary

When a VirtualMCPServer pod starts at the same time as its backend MCPServer proxies (e.g., during a simultaneous Flux reconciliation), the health monitor's initial health checks fail because the proxies aren't ready yet. Once marked unhealthy, the backends never recover even after they become fully healthy — the vMCP remains Degraded until the pod is manually restarted.

Reproduction

  1. Deploy a VirtualMCPServer and its backend MCPServers simultaneously (e.g., via a single Flux Kustomization)
  2. The vMCP pod starts and runs its initial health checks before the backend proxy pods are ready
  3. Initial health checks fail with: transport error: server returned 4xx for initialize POST, likely a legacy SSE server
  4. Subsequent health checks (every 30s) continue to fail with the same error, even though the backends are now fully healthy
  5. The vMCP reports Degraded with increasing consecutiveFailures

Observed behavior

  • The vMCP status.phase stays Degraded with message: "Some backends are unhealthy"
  • consecutiveFailures increments every health check interval (30s) and never resets
  • Port-forwarding to the backend proxy services and sending the same initialize POST returns HTTP 200 with a valid MCP response
  • A second VirtualMCPServer (anonymous auth, already running before the backends were deployed) shows all the same backends as ready using the exact same proxy URLs
  • Deleting the vMCP pod (forcing a fresh start) immediately resolves the issue — all backends become healthy

Expected behavior

Each health check cycle should create a fully independent client connection to the backend. If a backend was unavailable at startup but later becomes healthy, the health monitor should detect the recovery and transition the backend from unhealthy to healthy.

Environment

  • toolhive v0.12.4 (ghcr.io/stacklok/toolhive/vmcp:v0.12.4)
  • mcp-go v0.45.0
  • Kubernetes (EKS), backends managed via Flux CD
  • VirtualMCPServer with manual conflict resolution and OIDC + Cedar auth

Analysis

The health check path (health/monitor.go → health/checker.go → client/client.go ListCapabilities → defaultClientFactory) appears to create a new *client.Client and http.Client on each cycle, but all share http.DefaultTransport as the base round tripper. The mcp-go StreamableHTTP transport returns ErrLegacySSEServer for any 4xx during an initialize POST — this result may be getting cached, or the connection pool in http.DefaultTransport may be holding onto stale connection state.

Secondary issue

A previously deleted MCPServer (admin-tools) remained in the vMCP's discoveredBackends list after the resource was removed from the cluster. The vMCP controller did not reconcile the removal, so health checks produced perpetual DNS lookup failures for mcp-admin-tools-proxy.


Labels

  • bug — Something isn't working
  • kubernetes — Items related to Kubernetes
  • vmcp — Virtual MCP Server related issues
