
Prevent permanent backend unhealthy marking after startup race#4290

Open
yrobla wants to merge 1 commit into main from issue-4278

@yrobla yrobla commented Mar 20, 2026

Summary

Health checks could permanently mark backends as unhealthy because of two issues. First, the shared http.DefaultTransport kept stale keep-alive connections to replaced K8s pods, and requests over those connections kept returning 4xx indefinitely. Second, mcp-go sentinel errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by the error classification chain, so auth failures surfaced as generic backend unavailability.

  • Cache one *http.Transport per backend (rather than cloning per call) to isolate connection pools and avoid stale connections after pod replacement; flush the cached transport via FlushIdleConnections on health check failure or backend removal
  • Map transport.ErrUnauthorized to ErrAuthenticationFailed and transport.ErrLegacySSEServer to ErrBackendUnavailable in wrapBackendError before falling back to string-based detection
  • Add "unauthorized (401)" pattern to IsAuthenticationError to match mcp-go's ErrUnauthorized string format
  • Poll DynamicRegistry version every 2s and trigger an immediate status report when backends are added or removed, rather than waiting for the full 30s reporting interval

Fixes #4278

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

@yrobla yrobla requested a review from Copilot March 20, 2026 11:20
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 20, 2026
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026

Copilot AI left a comment

Pull request overview

Fixes a vMCP health-monitor startup race where backends could be marked unhealthy indefinitely by improving HTTP connection isolation, error classification, and status reporting responsiveness (Fixes #4278).

Changes:

  • Clone http.DefaultTransport for each backend client creation to isolate connection pools and avoid stale keep-alive connections after pod replacement.
  • Map mcp-go transport sentinel errors (ErrUnauthorized, ErrLegacySSEServer) to vMCP sentinel errors in wrapBackendError, and extend IsAuthenticationError to match "unauthorized (401)".
  • Add a 2s DynamicRegistry version poller to trigger immediate status reporting (and backend refresh) when backends are added/removed.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/vmcp/server/status_reporting.go | Adds DynamicRegistry version polling to trigger immediate status reports on backend registry changes. |
| pkg/vmcp/server/status_reporting_test.go | Adds a unit test ensuring version changes trigger an immediate status report (no need to wait for the main interval). |
| pkg/vmcp/health/checker_test.go | Extends auth-error classification tests to include mcp-go's "unauthorized (401)" format and wrapped auth sentinel behavior. |
| pkg/vmcp/errors.go | Updates IsAuthenticationError string matching to include "unauthorized (401)". |
| pkg/vmcp/client/client.go | Clones the base HTTP transport per client creation and adds explicit mapping for mcp-go transport sentinel errors in wrapBackendError. |
| pkg/vmcp/client/client_test.go | Adds tests verifying wrapBackendError maps mcp-go transport sentinel errors to the correct vMCP sentinel errors. |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad39e47b36

@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 59.72222% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.77%. Comparing base (4d4fbe2) to head (7049642).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/vmcp/client/client.go | 50.00% | 21 Missing and 2 partials ⚠️ |
| pkg/vmcp/health/checker.go | 0.00% | 1 Missing and 2 partials ⚠️ |
| pkg/vmcp/server/status_reporting.go | 83.33% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4290      +/-   ##
==========================================
- Coverage   68.95%   68.77%   -0.19%     
==========================================
  Files         473      473              
  Lines       47854    47986     +132     
==========================================
  Hits        33000    33000              
- Misses      12266    12284      +18     
- Partials     2588     2702     +114     

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@yrobla yrobla requested a review from Copilot March 20, 2026 11:50
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


…race

Health checks could permanently mark backends as unhealthy due to two
issues: shared http.DefaultTransport kept stale keep-alive connections
to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel
errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by
the error classification chain, causing auth failures to surface as
generic backend unavailability.

- Clone http.DefaultTransport per client factory call to isolate
  connection pools and avoid stale connections after pod replacement
- Map transport.ErrUnauthorized to ErrAuthenticationFailed and
  transport.ErrLegacySSEServer to ErrBackendUnavailable in
  wrapBackendError before falling back to string-based detection
- Add "unauthorized (401)" pattern to IsAuthenticationError to match
  mcp-go's ErrUnauthorized string format
- Poll DynamicRegistry version every 2s and trigger an immediate
  status report when backends are added or removed, rather than
  waiting for the full 30s reporting interval

Closes: #4278
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@yrobla yrobla changed the title from "fix(vmcp): prevent permanent backend unhealthy marking after startup race" to "Prevent permanent backend unhealthy marking after startup race" Mar 20, 2026

Labels

size/M Medium PR: 300-599 lines changed

Development

Successfully merging this pull request may close these issues.

vMCP health monitor permanently marks backends unhealthy after startup race

3 participants