Skip caching partial capability aggregation results#4110
Skip caching partial capability aggregation results#4110ChrisJBurns wants to merge 4 commits intomainfrom
Conversation
When a new backend is added to a group, the first capability discovery may fail to query it (e.g. MCP server not fully ready). The partial result (missing the new backend's tools) was cached under a key that includes all 3 backend IDs. Subsequent retries hit the stale cache entry for 5 minutes, preventing the new backend's tools from appearing. Track QueriedBackendCount in AggregationMetadata and skip caching when it is less than BackendCount. This ensures partial results are always re-aggregated on the next request. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4110 +/- ##
==========================================
- Coverage 68.68% 68.62% -0.06%
==========================================
Files 454 458 +4
Lines 46027 46238 +211
==========================================
+ Hits 31612 31731 +119
- Misses 11977 12059 +82
- Partials 2438 2448 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // Update metadata with backend count | ||
| // Update metadata with backend counts | ||
| aggregated.Metadata.BackendCount = len(backends) | ||
| aggregated.Metadata.QueriedBackendCount = len(capabilities) |
There was a problem hiding this comment.
are backends and capabilities the same thing? e.g. QueriedBackendCount = len(capabilities)
There was a problem hiding this comment.
Good question! I don't think they they're the same thing. Capabilities is a map[backendID]*BackendCapabilities returned by QueryAllCapabilities. When a backend fails to respond, it's silently skipped (logged as a warning, not added to the map), so len(capabilities) equals the number of backends that were successfully queried, which can be less than len(backends). I've added a clarifying comment in the code. Happy to be corrected by @yrobla @amirejaz @jerm-dro here though
There was a problem hiding this comment.
ok, thanks for clarifying, I see now. I was thrown off by the naming mismatch, aggregated.Metadata.QueriedBackendCount = len(capabilities), would have expected something like aggregated.Metadata.QueriedCapabilitiesCount = len(capabilities)
but this is not a blocking comment, I defer to vmcp team
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| // Only cache complete aggregation results. When some backends fail to respond, | ||
| // the result is partial and should not be cached — otherwise subsequent requests | ||
| // would keep hitting the stale cache entry until it expires (5 min TTL), | ||
| // preventing newly added backends from being discovered. | ||
| if caps.Metadata != nil && caps.Metadata.QueriedBackendCount < caps.Metadata.BackendCount { | ||
| slog.Debug("skipping cache for partial aggregation result", | ||
| "queried_backends", caps.Metadata.QueriedBackendCount, | ||
| "total_backends", caps.Metadata.BackendCount) | ||
| } else { | ||
| m.cacheCapabilities(cacheKey, caps) | ||
| } |
There was a problem hiding this comment.
Honestly, I wonder if we should just delete the cache? It seems like a premature optimization. New sessions are rare and latency is going to be dominated by whatever the LLM is doing.
Summary
QueriedBackendCountalongsideBackendCountinAggregationMetadata. When the discovery manager sees a partial result (queried < total), it skips caching so the next request re-aggregates and picks up the now-ready backend.Type of change
Test plan
task test)Added
TestDiscover_PartialAggregationNotCachedwhich verifies that partial results (2 of 3 backends queried) are not cached, allowing the next call to re-aggregate and return the complete result.Changes
pkg/vmcp/aggregator/aggregator.goQueriedBackendCountfield toAggregationMetadatapkg/vmcp/aggregator/default_aggregator.goQueriedBackendCountfrom successful query countpkg/vmcp/discovery/manager.goQueriedBackendCount < BackendCountpkg/vmcp/discovery/manager_test.goSpecial notes for reviewers
This fixes the flaky
virtualmcp_discovered_mode_test.go:554E2E test ("should detect new backend and update tool list") which consistently fails withexpected more tools, got 4 (was 4). The root cause is that the aggregator'sQueryAllCapabilitiessilently drops backends that fail to respond (logs a warning, continues), but the result was still cached — poisoning subsequent discovery attempts for up to 5 minutes.Generated with Claude Code