Skip to content

Conversation

@JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Dec 3, 2025

Summary

  • Fix intermittent EOF errors in VirtualMCPServer E2E tests by adding proper readiness checks
  • Add WaitForVMCPHealthy helper to poll /health endpoint until server responds
  • Add WaitForVMCPFullyReady helper that combines all readiness checks (CR ready + pods ready + health check)
  • Update 8 test files to use the new comprehensive readiness helper

Problem

The VirtualMCPServer Inline Auth with Anonymous Incoming test (and potentially others) was failing intermittently with:

transport error: failed to send request: Post "http://localhost:30016/mcp": EOF

This happened because there's a timing gap between when the Kubernetes operator marks the VirtualMCPServer CR as Ready and when the actual HTTP server inside the pod is fully initialized and ready to handle requests.

Solution

Added a comprehensive readiness check sequence:

  1. Wait for VirtualMCPServer CRD to reach Ready status
  2. Wait for all vMCP pods to be running and containers ready
  3. Poll the /health endpoint until the server responds successfully

This ensures the vMCP server is truly ready before tests attempt to make MCP requests.

Test plan

  • Code compiles successfully
  • Linting passes with 0 issues
  • CI E2E tests pass (this PR should fix the flaky failures)

🤖 Generated with Claude Code

The VirtualMCPServer E2E tests were intermittently failing with EOF
errors because they weren't waiting for the vMCP server to be fully
initialized before making MCP requests. The tests only waited for the
Kubernetes Ready condition, but there's a timing gap between when the
operator marks the CR as Ready and when the actual HTTP server inside
the pod is fully initialized.

This change adds a comprehensive readiness check sequence:
1. Wait for VirtualMCPServer CRD to reach Ready status
2. Wait for all vMCP pods to be running and containers ready
3. Poll the /health endpoint until the server responds successfully

Added two new helpers:
- WaitForVMCPHealthy: Polls /health endpoint until successful
- WaitForVMCPFullyReady: Combines all checks and returns NodePort

Updated 8 test files to use WaitForVMCPFullyReady, replacing the
previous pattern of separate WaitForVirtualMCPServerReady and
GetVMCPNodePort calls.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Dec 3, 2025
@JAORMX JAORMX requested review from jerm-dro, jhrozek and yrobla December 3, 2025 19:03
@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.60%. Comparing base (15f0beb) to head (2fa0d8b).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2881   +/-   ##
=======================================
  Coverage   56.59%   56.60%           
=======================================
  Files         322      322           
  Lines       31439    31439           
=======================================
+ Hits        17794    17796    +2     
+ Misses      12110    12108    -2     
  Partials     1535     1535           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +719 to +723
// 1. Waits for the VirtualMCPServer CRD to reach Ready status
// 2. Waits for all vMCP pods to be running and ready
// 3. Retrieves the NodePort for the service
// 4. Polls the /health endpoint until the server is fully initialized
//
Copy link
Contributor

@jerm-dro jerm-dro Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking): Am I understanding correctly?

I'm surprised CRD readiness does not imply a pod is ready to receive traffic on /mcp. Based on my reading of the CRD status definition, CRD readiness implies a pod is ready.

Is this the readiness and liveness config here?

If so, then the only thing the health check gets us is more time for something to finish initializing. Would it make more sense to have a stricter definition of readiness? Otherwise, we could encounter issues where we serve traffic before pods are truly ready in prod environments.

}, timeout, 2*time.Second).Should(gomega.Succeed(), "VirtualMCPServer health endpoint should be reachable")
}

// WaitForVMCPFullyReady performs a complete readiness check sequence for a VirtualMCPServer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we implement a better readiness and healthness check then? that actually checks for those conditions before declaring vmcp as ready and healthy?

// WaitForVMCPHealthy polls the VirtualMCPServer's /health endpoint until it responds successfully.
// This ensures the server is fully initialized and ready to handle MCP requests, which may happen
// after the Kubernetes readiness probe passes but before the MCP server is fully initialized.
func WaitForVMCPHealthy(nodePort int32, timeout time.Duration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make the function accept a context so we can pass one from the parent for cancellation

@JAORMX JAORMX closed this Dec 4, 2025
@JAORMX
Copy link
Collaborator Author

JAORMX commented Dec 4, 2025

Lets fix the readiness checks instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants