-
Notifications
You must be signed in to change notification settings - Fork 156
Fix flaky vMCP E2E tests with proper readiness checks #2881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The VirtualMCPServer E2E tests were intermittently failing with EOF errors because they weren't waiting for the vMCP server to be fully initialized before making MCP requests. The tests only waited for the Kubernetes Ready condition, but there's a timing gap between when the operator marks the CR as Ready and when the actual HTTP server inside the pod is fully initialized. This change adds a comprehensive readiness check sequence: 1. Wait for VirtualMCPServer CRD to reach Ready status 2. Wait for all vMCP pods to be running and containers ready 3. Poll the /health endpoint until the server responds successfully Added two new helpers: - WaitForVMCPHealthy: Polls /health endpoint until successful - WaitForVMCPFullyReady: Combines all checks and returns NodePort Updated 8 test files to use WaitForVMCPFullyReady, replacing the previous pattern of separate WaitForVirtualMCPServerReady and GetVMCPNodePort calls. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2881 +/- ##
=======================================
Coverage 56.59% 56.60%
=======================================
Files 322 322
Lines 31439 31439
=======================================
+ Hits 17794 17796 +2
+ Misses 12110 12108 -2
Partials 1535 1535 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // 1. Waits for the VirtualMCPServer CRD to reach Ready status | ||
| // 2. Waits for all vMCP pods to be running and ready | ||
| // 3. Retrieves the NodePort for the service | ||
| // 4. Polls the /health endpoint until the server is fully initialized | ||
| // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking): Am I understanding correctly?
I'm surprised CRD readiness does not imply a pod is ready to receive traffic on /mcp. Based on my reading of the CRD status definition, CRD readiness implies a pod is ready.
Is this the readiness and liveness config here?
If so, then the only thing the health check gets us is more time for something to finish initializing. Would it make more sense to have a stricter definition of readiness? Otherwise, we could encounter issues where we serve traffic before pods are truly ready in prod environments.
| }, timeout, 2*time.Second).Should(gomega.Succeed(), "VirtualMCPServer health endpoint should be reachable") | ||
| } | ||
|
|
||
| // WaitForVMCPFullyReady performs a complete readiness check sequence for a VirtualMCPServer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we implement a better readiness and healthness check then? that actually checks for those conditions before declaring vmcp as ready and healthy?
| // WaitForVMCPHealthy polls the VirtualMCPServer's /health endpoint until it responds successfully. | ||
| // This ensures the server is fully initialized and ready to handle MCP requests, which may happen | ||
| // after the Kubernetes readiness probe passes but before the MCP server is fully initialized. | ||
| func WaitForVMCPHealthy(nodePort int32, timeout time.Duration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make the function accept a context so we can pass one from the parent for cancellation
|
Lets fix the readiness checks instead |
Summary
WaitForVMCPHealthyhelper to poll/healthendpoint until server respondsWaitForVMCPFullyReadyhelper that combines all readiness checks (CR ready + pods ready + health check)Problem
The
VirtualMCPServer Inline Auth with Anonymous Incomingtest (and potentially others) was failing intermittently with:This happened because there's a timing gap between when the Kubernetes operator marks the VirtualMCPServer CR as Ready and when the actual HTTP server inside the pod is fully initialized and ready to handle requests.
Solution
Added a comprehensive readiness check sequence:
/healthendpoint until the server responds successfullyThis ensures the vMCP server is truly ready before tests attempt to make MCP requests.
Test plan
🤖 Generated with Claude Code