Skip caching partial capability aggregation results #4110

Open
ChrisJBurns wants to merge 4 commits into main from fix/discovery-cache-partial-results

Conversation

@ChrisJBurns
Collaborator

Summary

  • When a new backend is added to an MCPGroup, the vMCP's first capability discovery attempt may fail to query it because the MCP server isn't fully ready yet. The aggregator silently continues with the other backends, and the discovery manager caches the partial result (only 4 tools from 2 backends instead of 6 from 3). Because the cache key includes all 3 backend IDs and the registry version doesn't change between retries, subsequent requests hit the stale cache entry for up to 5 minutes, so the new backend's tools don't appear until the entry expires.
  • Track QueriedBackendCount alongside BackendCount in AggregationMetadata. When the discovery manager sees a partial result (queried < total), it skips caching so the next request re-aggregates and picks up the now-ready backend.
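To make the failure mode above concrete, here is a minimal sketch under stated assumptions. `AggregationMetadata` is reduced to the two counters named in this PR, while `cacheKey` and `isPartial` are hypothetical helpers written for illustration; the real key derivation in the discovery manager may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// AggregationMetadata is a simplified stand-in holding only the two
// counters discussed in this PR; the real struct may have more fields.
type AggregationMetadata struct {
	BackendCount        int // backends in the group
	QueriedBackendCount int // backends that answered the capability query
}

// cacheKey is a hypothetical key builder: sorted backend IDs plus a
// registry version. It only illustrates why retries map to the same entry.
func cacheKey(backendIDs []string, registryVersion int) string {
	ids := append([]string(nil), backendIDs...)
	sort.Strings(ids)
	return fmt.Sprintf("v%d:%s", registryVersion, strings.Join(ids, ","))
}

// isPartial reports whether at least one backend was dropped during aggregation.
func isPartial(m *AggregationMetadata) bool {
	return m != nil && m.QueriedBackendCount < m.BackendCount
}

func main() {
	ids := []string{"backend-a", "backend-b", "backend-c"}

	// The key does not depend on which backends actually responded,
	// so the first attempt and every retry share one cache entry.
	fmt.Println(cacheKey(ids, 1) == cacheKey(ids, 1)) // true

	fmt.Println(isPartial(&AggregationMetadata{BackendCount: 3, QueriedBackendCount: 2})) // true
	fmt.Println(isPartial(&AggregationMetadata{BackendCount: 3, QueriedBackendCount: 3})) // false
}
```

Since the key alone cannot distinguish a partial result from a complete one, the distinction has to travel with the result itself, which is what the extra counter provides.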

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)

Added TestDiscover_PartialAggregationNotCached which verifies that partial results (2 of 3 backends queried) are not cached, allowing the next call to re-aggregate and return the complete result.

Changes

| File | Change |
| --- | --- |
| pkg/vmcp/aggregator/aggregator.go | Add QueriedBackendCount field to AggregationMetadata |
| pkg/vmcp/aggregator/default_aggregator.go | Set QueriedBackendCount from successful query count |
| pkg/vmcp/discovery/manager.go | Skip caching when QueriedBackendCount < BackendCount |
| pkg/vmcp/discovery/manager_test.go | Add test for partial aggregation cache skip |

Special notes for reviewers

This fixes the flaky virtualmcp_discovered_mode_test.go:554 E2E test ("should detect new backend and update tool list"), which consistently fails with "expected more tools, got 4 (was 4)". The root cause is that the aggregator's QueryAllCapabilities silently drops backends that fail to respond (logs a warning, continues), but the result was still cached, poisoning subsequent discovery attempts for up to 5 minutes.

Generated with Claude Code

When a new backend is added to a group, the first capability discovery
may fail to query it (e.g. MCP server not fully ready). The partial
result (missing the new backend's tools) was cached under a key that
includes all 3 backend IDs. Subsequent retries hit the stale cache
entry for 5 minutes, preventing the new backend's tools from appearing.

Track QueriedBackendCount in AggregationMetadata and skip caching when
it is less than BackendCount. This ensures partial results are always
re-aggregated on the next request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Mar 11, 2026
codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.62%. Comparing base (c474326) to head (4fc81b0).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4110      +/-   ##
==========================================
- Coverage   68.68%   68.62%   -0.06%     
==========================================
  Files         454      458       +4     
  Lines       46027    46238     +211     
==========================================
+ Hits        31612    31731     +119     
- Misses      11977    12059      +82     
- Partials     2438     2448      +10     

- // Update metadata with backend count
+ // Update metadata with backend counts
  aggregated.Metadata.BackendCount = len(backends)
+ aggregated.Metadata.QueriedBackendCount = len(capabilities)
Contributor

are backends and capabilities the same thing? e.g. QueriedBackendCount = len(capabilities)

Collaborator Author

@reyortiz3

Good question! I don't think they're the same thing. capabilities is a map[backendID]*BackendCapabilities returned by QueryAllCapabilities. When a backend fails to respond, it's silently skipped (logged as a warning, not added to the map), so len(capabilities) equals the number of backends that were successfully queried, which can be less than len(backends). I've added a clarifying comment in the code. Happy to be corrected by @yrobla @amirejaz @jerm-dro here though.

Contributor

@reyortiz3 reyortiz3 Mar 11, 2026

ok, thanks for clarifying, I see now. I was thrown off by the naming mismatch, aggregated.Metadata.QueriedBackendCount = len(capabilities), would have expected something like aggregated.Metadata.QueriedCapabilitiesCount = len(capabilities)

but this is not a blocking comment, I defer to vmcp team

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +160 to +170
// Only cache complete aggregation results. When some backends fail to respond,
// the result is partial and should not be cached — otherwise subsequent requests
// would keep hitting the stale cache entry until it expires (5 min TTL),
// preventing newly added backends from being discovered.
if caps.Metadata != nil && caps.Metadata.QueriedBackendCount < caps.Metadata.BackendCount {
	slog.Debug("skipping cache for partial aggregation result",
		"queried_backends", caps.Metadata.QueriedBackendCount,
		"total_backends", caps.Metadata.BackendCount)
} else {
	m.cacheCapabilities(cacheKey, caps)
}
Contributor

Honestly, I wonder if we should just delete the cache? It seems like a premature optimization. New sessions are rare and latency is going to be dominated by whatever the LLM is doing.


3 participants