Prevent permanent backend unhealthy marking after startup race#4290
Prevent permanent backend unhealthy marking after startup race#4290
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a vMCP health-monitor startup race where backends could be marked unhealthy indefinitely by improving HTTP connection isolation, error classification, and status reporting responsiveness (Fixes #4278).
Changes:
- Clone
http.DefaultTransportfor each backend client creation to isolate connection pools and avoid stale keep-alive connections after pod replacement. - Map mcp-go transport sentinel errors (
ErrUnauthorized,ErrLegacySSEServer) to vMCP sentinel errors inwrapBackendError, and extendIsAuthenticationErrorto match"unauthorized (401)". - Add a 2s DynamicRegistry version poller to trigger immediate status reporting (and backend refresh) when backends are added/removed.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/server/status_reporting.go | Adds DynamicRegistry version polling to trigger immediate status reports on backend registry changes. |
| pkg/vmcp/server/status_reporting_test.go | Adds a unit test ensuring version changes trigger an immediate status report (no need to wait for the main interval). |
| pkg/vmcp/health/checker_test.go | Extends auth-error classification tests to include mcp-go’s "unauthorized (401)" format and wrapped auth sentinel behavior. |
| pkg/vmcp/errors.go | Updates IsAuthenticationError string matching to include "unauthorized (401)". |
| pkg/vmcp/client/client.go | Clones the base HTTP transport per client creation and adds explicit mapping for mcp-go transport sentinel errors in wrapBackendError. |
| pkg/vmcp/client/client_test.go | Adds tests verifying wrapBackendError maps mcp-go transport sentinel errors to the correct vMCP sentinel errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad39e47b36
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4290 +/- ##
==========================================
- Coverage 68.95% 68.77% -0.19%
==========================================
Files 473 473
Lines 47854 47986 +132
==========================================
Hits 33000 33000
- Misses 12266 12284 +18
- Partials 2588 2702 +114 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…race Health checks could permanently mark backends as unhealthy due to two issues: shared http.DefaultTransport kept stale keep-alive connections to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by the error classification chain, causing auth failures to surface as generic backend unavailability. - Clone http.DefaultTransport per client factory call to isolate connection pools and avoid stale connections after pod replacement - Map transport.ErrUnauthorized to ErrAuthenticationFailed and transport.ErrLegacySSEServer to ErrBackendUnavailable in wrapBackendError before falling back to string-based detection - Add "unauthorized (401)" pattern to IsAuthenticationError to match mcp-go's ErrUnauthorized string format - Poll DynamicRegistry version every 2s and trigger an immediate status report when backends are added or removed, rather than waiting for the full 30s reporting interval Closes: #4278
Summary
Health checks could permanently mark backends as unhealthy due to two issues: shared http.DefaultTransport kept stale keep-alive connections to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by the error classification chain, causing auth failures to surface as generic backend unavailability.
Fixes #4278
Type of change
Test plan
task test)task test-e2e)task lint-fix)