Skip caching partial capability aggregation results by ChrisJBurns · Pull Request #4110 · stacklok/toolhive

ChrisJBurns · 2026-03-11T19:10:13Z

Summary

When a new backend is added to an MCPGroup, the vMCP's first capability discovery attempt may fail to query it because the MCP server isn't fully ready yet. The aggregator silently continues with the other backends and caches the partial result (only 4 tools from 2 backends instead of 6 from 3). Because the cache key includes all 3 backend IDs and the registry version doesn't change between retries, subsequent requests hit the stale cache entry for up to 5 minutes, so the new backend's tools don't appear until that entry expires.
Track QueriedBackendCount alongside BackendCount in AggregationMetadata. When the discovery manager sees a partial result (queried < total), it skips caching so the next request re-aggregates and picks up the now-ready backend.
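
The behaviour described in this summary can be sketched in a few lines of Go. This is a minimal illustration, not the code in `pkg/vmcp/discovery/manager.go`: `Manager`, `Capabilities`, `isPartial` and `newDemoManager` are hypothetical stand-ins, the cache has no TTL or locking, and only the two counter names come from this PR.

```go
package main

import "fmt"

// AggregationMetadata carries the two counters named in this PR.
// Any other fields of the real struct are omitted here.
type AggregationMetadata struct {
	BackendCount        int // backends in the group
	QueriedBackendCount int // backends that actually answered
}

// Capabilities is a hypothetical stand-in for the aggregated result.
type Capabilities struct {
	Tools    []string
	Metadata *AggregationMetadata
}

// Manager is a hypothetical, single-goroutine stand-in for the discovery
// manager: a plain map instead of a TTL cache, and no locking.
type Manager struct {
	cache     map[string]*Capabilities
	aggregate func() *Capabilities
}

// isPartial reports whether some backends failed to respond.
func isPartial(c *Capabilities) bool {
	return c.Metadata != nil && c.Metadata.QueriedBackendCount < c.Metadata.BackendCount
}

// Discover returns a cached result if present; otherwise it aggregates and
// stores the result only when every backend was queried successfully.
func (m *Manager) Discover(cacheKey string) *Capabilities {
	if hit, ok := m.cache[cacheKey]; ok {
		return hit
	}
	caps := m.aggregate()
	if !isPartial(caps) {
		m.cache[cacheKey] = caps
	}
	return caps
}

// newDemoManager simulates a third backend that is only ready from the
// second aggregation onward. It also returns the aggregation call counter.
func newDemoManager() (*Manager, *int) {
	calls := 0
	m := &Manager{cache: map[string]*Capabilities{}}
	m.aggregate = func() *Capabilities {
		calls++
		if calls == 1 {
			return &Capabilities{
				Tools:    []string{"a1", "a2", "b1", "b2"},
				Metadata: &AggregationMetadata{BackendCount: 3, QueriedBackendCount: 2},
			}
		}
		return &Capabilities{
			Tools:    []string{"a1", "a2", "b1", "b2", "c1", "c2"},
			Metadata: &AggregationMetadata{BackendCount: 3, QueriedBackendCount: 3},
		}
	}
	return m, &calls
}

func main() {
	m, calls := newDemoManager()
	first := m.Discover("a,b,c")  // partial, not cached
	second := m.Discover("a,b,c") // re-aggregated, complete, cached
	third := m.Discover("a,b,c")  // served from cache
	fmt.Println(len(first.Tools), len(second.Tools), len(third.Tools), *calls) // 4 6 6 2
}
```

In this sketch the second call re-aggregates because nothing was stored after the first one, and the third call is a cache hit, so the aggregator runs only twice.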

Type of change

Bug fix

Test plan

Unit tests (task test)

Added TestDiscover_PartialAggregationNotCached which verifies that partial results (2 of 3 backends queried) are not cached, allowing the next call to re-aggregate and return the complete result.

Changes

| File | Change |
| --- | --- |
| `pkg/vmcp/aggregator/aggregator.go` | Add `QueriedBackendCount` field to `AggregationMetadata` |
| `pkg/vmcp/aggregator/default_aggregator.go` | Set `QueriedBackendCount` from successful query count |
| `pkg/vmcp/discovery/manager.go` | Skip caching when `QueriedBackendCount < BackendCount` |
| `pkg/vmcp/discovery/manager_test.go` | Add test for partial aggregation cache skip |

Special notes for reviewers

This fixes the flaky E2E test at `virtualmcp_discovered_mode_test.go:554` ("should detect new backend and update tool list"), which consistently fails with `expected more tools, got 4 (was 4)`. The root cause is that the aggregator's `QueryAllCapabilities` silently drops backends that fail to respond (it logs a warning and continues), but the result was still cached, poisoning subsequent discovery attempts for up to 5 minutes.

Generated with Claude Code

When a new backend is added to a group, the first capability discovery may fail to query it (e.g. MCP server not fully ready). The partial result (missing the new backend's tools) was cached under a key that includes all 3 backend IDs. Subsequent retries hit the stale cache entry for 5 minutes, preventing the new backend's tools from appearing.

Track QueriedBackendCount in AggregationMetadata and skip caching when it is less than BackendCount. This ensures partial results are always re-aggregated on the next request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-11T19:13:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.62%. Comparing base (c474326) to head (4fc81b0).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4110      +/-   ##
==========================================
- Coverage   68.68%   68.62%   -0.06%     
==========================================
  Files         454      458       +4     
  Lines       46027    46238     +211     
==========================================
+ Hits        31612    31731     +119     
- Misses      11977    12059      +82     
- Partials     2438     2448      +10

reyortiz3 · 2026-03-11T21:40:02Z

pkg/vmcp/aggregator/default_aggregator.go

-	// Update metadata with backend count
+	// Update metadata with backend counts
 	aggregated.Metadata.BackendCount = len(backends)
+	aggregated.Metadata.QueriedBackendCount = len(capabilities)


are backends and capabilities the same thing? e.g. QueriedBackendCount = len(capabilities)

@reyortiz3

Good question! I don't think they're the same thing. `capabilities` is a `map[backendID]*BackendCapabilities` returned by `QueryAllCapabilities`. When a backend fails to respond, it's silently skipped (logged as a warning, not added to the map), so `len(capabilities)` equals the number of backends that were successfully queried, which can be less than `len(backends)`. I've added a clarifying comment in the code. Happy to be corrected by @yrobla @amirejaz @jerm-dro here though.
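
For readers following along, the distinction can be sketched as below. This is only an illustration under assumptions: `queryAll`, `queryFunc` and `demoQuery` are hypothetical names, backend IDs are plain strings, and the real `QueryAllCapabilities` may query concurrently and has a different signature.

```go
package main

import (
	"errors"
	"fmt"
)

// BackendCapabilities is a trimmed-down stand-in for the real type.
type BackendCapabilities struct {
	Tools []string
}

// queryFunc is a hypothetical per-backend query.
type queryFunc func(backendID string) (*BackendCapabilities, error)

// queryAll collects capabilities per backend and skips any backend whose
// query fails, so the returned map can be smaller than the input slice.
func queryAll(backends []string, query queryFunc) map[string]*BackendCapabilities {
	capabilities := make(map[string]*BackendCapabilities, len(backends))
	for _, id := range backends {
		caps, err := query(id)
		if err != nil {
			continue // the real aggregator logs a warning here
		}
		capabilities[id] = caps
	}
	return capabilities
}

// demoQuery simulates two healthy backends and one that is not ready yet.
func demoQuery(id string) (*BackendCapabilities, error) {
	if id == "backend-c" {
		return nil, errors.New("backend not ready")
	}
	return &BackendCapabilities{Tools: []string{id + "/tool1", id + "/tool2"}}, nil
}

func main() {
	backends := []string{"backend-a", "backend-b", "backend-c"}
	capabilities := queryAll(backends, demoQuery)
	// BackendCount vs QueriedBackendCount in the PR's terms.
	fmt.Println(len(backends), len(capabilities)) // 3 2
}
```

The map has one entry per backend that answered, which is why its length works as a count of queried backends rather than a count of individual capabilities.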

ok, thanks for clarifying, I see now. I was thrown off by the naming mismatch, aggregated.Metadata.QueriedBackendCount = len(capabilities), would have expected something like aggregated.Metadata.QueriedCapabilitiesCount = len(capabilities)

but this is not a blocking comment, I defer to vmcp team

jerm-dro · 2026-03-11T23:40:30Z

pkg/vmcp/discovery/manager.go

+	// Only cache complete aggregation results. When some backends fail to respond,
+	// the result is partial and should not be cached — otherwise subsequent requests
+	// would keep hitting the stale cache entry until it expires (5 min TTL),
+	// preventing newly added backends from being discovered.
+	if caps.Metadata != nil && caps.Metadata.QueriedBackendCount < caps.Metadata.BackendCount {
+		slog.Debug("skipping cache for partial aggregation result",
+			"queried_backends", caps.Metadata.QueriedBackendCount,
+			"total_backends", caps.Metadata.BackendCount)
+	} else {
+		m.cacheCapabilities(cacheKey, caps)
+	}


Honestly, I wonder if we should just delete the cache? It seems like a premature optimization. New sessions are rare and latency is going to be dominated by whatever the LLM is doing.

ChrisJBurns requested review from JAORMX, amirejaz, jerm-dro, jhrozek and yrobla as code owners March 11, 2026 19:10

github-actions bot added the size/XS Extra small PR: < 100 lines changed label Mar 11, 2026

Merge branch 'main' into fix/discovery-cache-partial-results

997e519

Merge branch 'main' into fix/discovery-cache-partial-results

f856ef0

reyortiz3 reviewed Mar 11, 2026

Add clarifying comment for QueriedBackendCount assignment

4fc81b0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jerm-dro reviewed Mar 11, 2026

ChrisJBurns wants to merge 4 commits into main from fix/discovery-cache-partial-results

3 participants
