Fix flaky vMCP E2E tests with proper readiness checks #2881

JAORMX · 2025-12-03T19:02:22Z

Summary

Fix intermittent EOF errors in VirtualMCPServer E2E tests by adding proper readiness checks
Add WaitForVMCPHealthy helper to poll /health endpoint until server responds
Add WaitForVMCPFullyReady helper that combines all readiness checks (CR ready + pods ready + health check)
Update 8 test files to use the new comprehensive readiness helper

Problem

The VirtualMCPServer Inline Auth with Anonymous Incoming test (and potentially others) was failing intermittently with:

transport error: failed to send request: Post "http://localhost:30016/mcp": EOF

This happened because there's a timing gap between when the Kubernetes operator marks the VirtualMCPServer CR as Ready and when the actual HTTP server inside the pod is fully initialized and ready to handle requests.

Solution

Added a comprehensive readiness check sequence:

1. Wait for VirtualMCPServer CRD to reach Ready status
2. Wait for all vMCP pods to be running and containers ready
3. Poll the /health endpoint until the server responds successfully

This ensures the vMCP server is truly ready before tests attempt to make MCP requests.

Test plan

Code compiles successfully
Linting passes with 0 issues
CI E2E tests pass (this PR should fix the flaky failures)

🤖 Generated with Claude Code

The VirtualMCPServer E2E tests were intermittently failing with EOF errors because they weren't waiting for the vMCP server to be fully initialized before making MCP requests. The tests only waited for the Kubernetes Ready condition, but there's a timing gap between when the operator marks the CR as Ready and when the actual HTTP server inside the pod is fully initialized.

This change adds a comprehensive readiness check sequence:

1. Wait for VirtualMCPServer CRD to reach Ready status
2. Wait for all vMCP pods to be running and containers ready
3. Poll the /health endpoint until the server responds successfully

Added two new helpers:

- WaitForVMCPHealthy: Polls /health endpoint until successful
- WaitForVMCPFullyReady: Combines all checks and returns NodePort

Updated 8 test files to use WaitForVMCPFullyReady, replacing the previous pattern of separate WaitForVirtualMCPServerReady and GetVMCPNodePort calls.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

codecov · 2025-12-03T19:09:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.60%. Comparing base (15f0beb) to head (2fa0d8b).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2881   +/-   ##
=======================================
  Coverage   56.59%   56.60%           
=======================================
  Files         322      322           
  Lines       31439    31439           
=======================================
+ Hits        17794    17796    +2     
+ Misses      12110    12108    -2     
  Partials     1535     1535


jerm-dro · 2025-12-04T00:51:16Z

test/e2e/thv-operator/virtualmcp/helpers.go

+// 1. Waits for the VirtualMCPServer CRD to reach Ready status
+// 2. Waits for all vMCP pods to be running and ready
+// 3. Retrieves the NodePort for the service
+// 4. Polls the /health endpoint until the server is fully initialized
+//


(non-blocking): Am I understanding correctly?

I'm surprised CRD readiness does not imply a pod is ready to receive traffic on /mcp. Based on my reading of the CRD status definition, CRD readiness implies a pod is ready.

Is this the readiness and liveness config here?

If so, then the only thing the health check gets us is more time for something to finish initializing. Would it make more sense to have a stricter definition of readiness? Otherwise, we could encounter issues where we serve traffic before pods are truly ready in prod environments.

yrobla · 2025-12-04T08:32:44Z

test/e2e/thv-operator/virtualmcp/helpers.go

+	}, timeout, 2*time.Second).Should(gomega.Succeed(), "VirtualMCPServer health endpoint should be reachable")
+}
+
+// WaitForVMCPFullyReady performs a complete readiness check sequence for a VirtualMCPServer:


Should we implement better readiness and health checks then, ones that actually verify those conditions before declaring the vMCP ready and healthy?

jhrozek · 2025-12-04T11:13:34Z

test/e2e/thv-operator/virtualmcp/helpers.go

+// WaitForVMCPHealthy polls the VirtualMCPServer's /health endpoint until it responds successfully.
+// This ensures the server is fully initialized and ready to handle MCP requests, which may happen
+// after the Kubernetes readiness probe passes but before the MCP server is fully initialized.
+func WaitForVMCPHealthy(nodePort int32, timeout time.Duration) {


Let's make the function accept a context so we can pass one from the parent for cancellation.

JAORMX · 2025-12-04T13:58:58Z

Let's fix the readiness checks instead.

github-actions bot added the size/S Small PR: 100-299 lines changed label Dec 3, 2025

JAORMX requested review from jerm-dro, jhrozek and yrobla December 3, 2025 19:03

jerm-dro reviewed Dec 4, 2025

yrobla reviewed Dec 4, 2025

jhrozek reviewed Dec 4, 2025

JAORMX closed this Dec 4, 2025
