Summary
When a VirtualMCPServer pod starts at the same time as its backend MCPServer proxies (e.g., during a simultaneous Flux reconciliation), the health monitor's initial health checks fail because the proxies aren't ready yet. Once marked unhealthy, the backends never recover despite becoming fully healthy — the vMCP stays in a Degraded state indefinitely until the pod is manually restarted.
Reproduction
- Deploy a VirtualMCPServer and its backend MCPServers simultaneously (e.g., via a single Flux Kustomization)
- The vMCP pod starts and runs its initial health checks before the backend proxy pods are ready
- Initial health checks fail with:
  `transport error: server returned 4xx for initialize POST, likely a legacy SSE server`
- Subsequent health checks (every 30s) continue to fail with the same error, even though the backends are now fully healthy
- The vMCP reports `Degraded` with increasing `consecutiveFailures`
Observed behavior
- The vMCP `status.phase` stays `Degraded` with `message: "Some backends are unhealthy"`
- `consecutiveFailures` increments every health check interval (30s) and never resets
- Port-forwarding to the backend proxy services and sending the same initialize POST returns HTTP 200 with a valid MCP response
- A second VirtualMCPServer (anonymous auth, already running before the backends were deployed) shows all the same backends as `ready` using the exact same proxy URLs
- Deleting the vMCP pod (forcing a fresh start) immediately resolves the issue: all backends become healthy
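The port-forward verification can be reproduced with a small Go probe. The payload fields are illustrative, and `probeInitialize` and `startFakeBackend` are hypothetical helpers written for this report, not toolhive code:

```go
package main

import (
	"bytes"
	"net/http"
	"net/http/httptest"
)

// probeInitialize sends a minimal MCP initialize POST (JSON-RPC over
// Streamable HTTP) to the given URL and returns the HTTP status code.
func probeInitialize(url string) (int, error) {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"probe","version":"0.0.1"}}}`)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json, text/event-stream")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// startFakeBackend returns a stub server that answers every request with
// the given status code, standing in for the real proxy.
func startFakeBackend(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}
```

Against the real proxy services this returned HTTP 200, which is what makes the persistent `Degraded` state surprising.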
Expected behavior
Each health check cycle should create a fully independent client connection to the backend. If a backend was unavailable at startup but later becomes healthy, the health monitor should detect the recovery and transition the backend from unhealthy → healthy.
Environment
- toolhive v0.12.4 (
ghcr.io/stacklok/toolhive/vmcp:v0.12.4) - mcp-go v0.45.0
- Kubernetes (EKS), backends managed via Flux CD
- VirtualMCPServer with
manualconflict resolution and OIDC + Cedar auth
Analysis
The health check path (`health/monitor.go` → `health/checker.go` → `client/client.go` → `ListCapabilities` → `defaultClientFactory`) appears to create a new `*client.Client` and `http.Client` on each cycle, but all share `http.DefaultTransport` as the base round tripper. The mcp-go StreamableHTTP transport returns `ErrLegacySSEServer` for any 4xx during an initialize POST: this may be getting cached, or the connection pool in `http.DefaultTransport` may be holding onto stale connection state.
Secondary issue
A previously deleted MCPServer (`admin-tools`) remained in the vMCP's `discoveredBackends` list after the resource was removed from the cluster. The vMCP controller did not reconcile the removal, resulting in perpetual DNS lookup failures for `mcp-admin-tools-proxy`.
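The expected reconciliation step can be sketched as pruning discovered backends whose MCPServer no longer exists. The function name and signature are hypothetical; in toolhive this would presumably happen inside the VirtualMCPServer controller's reconcile loop:

```go
package main

// pruneBackends drops entries from the discovered-backend list whose
// backing MCPServer is no longer present in the cluster, so the vMCP
// stops probing (and failing DNS lookups for) deleted backends.
func pruneBackends(discovered []string, existing map[string]bool) []string {
	kept := make([]string, 0, len(discovered))
	for _, name := range discovered {
		if existing[name] {
			kept = append(kept, name)
		}
	}
	return kept
}
```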