
Conversation

@tgrunnagle
Contributor

Implements issue #2503 by adding an in-memory cache to the discovery
manager that caches capability aggregation results per (user, backend-set)
combination.

**Implementation Details:**

- Cache keyed by `userID:sha256(sorted-backend-ids)` for stability (derivation sketched after this list)
- 5-minute TTL per cache entry (hardcoded)
- 1000 entry maximum capacity (hardcoded)
- Simple eviction: rejects new entries when at capacity
- Background cleanup goroutine removes expired entries every minute (see the sketch after the Cache Behavior list)
- Thread-safe with sync.RWMutex protecting all cache operations
- Graceful shutdown via new Stop() method on Manager interface
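
A minimal sketch of the key derivation described above, assuming a hypothetical function name (the actual implementation lives in `pkg/vmcp/discovery/manager.go`):

```go
package discovery

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// cacheKey derives a stable key of the form userID:sha256(sorted-backend-ids).
// Sorting a copy of the IDs makes the key independent of backend order, and a
// null-byte separator keeps adjacent IDs like ["ab", "c"] and ["a", "bc"]
// from hashing identically.
func cacheKey(userID string, backendIDs []string) string {
	ids := make([]string, len(backendIDs))
	copy(ids, backendIDs)
	sort.Strings(ids)

	h := sha256.New()
	for _, id := range ids {
		h.Write([]byte(id))
		h.Write([]byte{0})
	}
	return userID + ":" + hex.EncodeToString(h.Sum(nil))
}
```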

**Cache Behavior:**

- Cache hit: Returns cached capabilities without calling aggregator
- Cache miss: Calls aggregator, caches result (if under size limit)
- Expired entries: Treated as cache miss, triggers re-aggregation
- Backend order: Hash normalized via sorting for stable keys
- User isolation: Separate cache entries per user identity
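
The background cleanup and graceful shutdown described above boil down to a ticker-driven goroutine guarded by a stop channel and a WaitGroup. A minimal sketch, assuming hypothetical type and field names rather than the exact ones in `manager.go`:

```go
package discovery

import (
	"sync"
	"time"
)

// cacheEntry pairs an aggregation result with its expiry deadline.
type cacheEntry struct {
	capabilities any // placeholder for the aggregated capabilities type
	expiresAt    time.Time
}

type DefaultManager struct {
	cacheMu sync.RWMutex
	cache   map[string]*cacheEntry
	stopCh  chan struct{}
	wg      sync.WaitGroup
}

// startCleanup launches the background sweeper: on every tick it removes
// expired entries under the write lock, until Stop closes stopCh.
func (m *DefaultManager) startCleanup(interval time.Duration) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				now := time.Now()
				m.cacheMu.Lock()
				for k, e := range m.cache {
					if now.After(e.expiresAt) {
						delete(m.cache, k)
					}
				}
				m.cacheMu.Unlock()
			case <-m.stopCh:
				return
			}
		}
	}()
}

// Stop signals the sweeper goroutine to exit and waits for it, which is what
// prevents goroutine leaks on server shutdown.
func (m *DefaultManager) Stop() {
	close(m.stopCh)
	m.wg.Wait()
}
```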

**Changes:**

- `pkg/vmcp/discovery/manager.go`:
  - Added cache infrastructure to DefaultManager
  - Modified Discover() to check cache before aggregation
  - Added Stop() method to Manager interface
  - Added background cleanup goroutine
  - Added cache management helper methods

- `pkg/vmcp/discovery/manager_test.go`:
  - Added 9 comprehensive test cases covering:
    - Cache hits and misses (user/backend variations)
    - Cache key stability (backend order independence)
    - Concurrent access thread safety
    - Expiration and cleanup
    - Size limit enforcement
    - Graceful shutdown
  - Added defer mgr.Stop() to prevent goroutine leaks

- `pkg/vmcp/server/server.go`:
  - Added call to discovery manager Stop() on shutdown

- Regenerated mocks for the updated Manager interface
@tgrunnagle tgrunnagle requested review from JAORMX and jhrozek November 17, 2025 20:10
@tgrunnagle
Contributor Author

@claude Review this PR, which addresses #2503

@codecov

codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 97.18310% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.56%. Comparing base (4ceed40) to head (d64b4dc).
⚠️ Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/vmcp/discovery/manager.go | 97.10% | 2 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2620      +/-   ##
==========================================
- Coverage   55.84%   55.56%   -0.29%
==========================================
  Files         312      312
  Lines       29541    29782     +241
==========================================
+ Hits        16498    16549      +51
- Misses      11601    11792     +191
+ Partials     1442     1441       -1
```


@claude
Contributor

claude bot commented Nov 17, 2025

Claude finished @tgrunnagle's task


Code Review for PR #2620

I've completed a thorough review of the per-user caching implementation. Overall, this is a well-designed and correctly implemented feature with excellent test coverage. Here are my findings:


✅ Strengths

  1. Solid Cache Design

    • Cache key generation (userID:sha256(sorted-backend-ids)) is well-designed with collision resistance
    • Backend order normalization via sorting ensures stable keys (tested in manager_test.go:255)
    • 5-minute TTL is reasonable for capability data
    • Simple eviction strategy (reject when at capacity) is appropriate for this use case
  2. Excellent Thread Safety

    • Proper use of sync.RWMutex with read locks for lookups and write locks for mutations
    • Background cleanup goroutine with clean shutdown via stopCh and sync.WaitGroup
    • Test coverage for concurrent access (manager_test.go:297)
  3. Comprehensive Test Coverage

    • 9 test cases covering all major scenarios (hits/misses, expiration, concurrency, size limits, shutdown)
    • Proper use of defer mgr.Stop() to prevent goroutine leaks
    • Good use of mocks and assertions
  4. Clean Integration

    • Stop() method properly added to Manager interface
    • Server correctly calls discoveryMgr.Stop() on shutdown (server.go:440-442)
    • Mocks regenerated correctly

🔍 Issues & Recommendations

1. Critical: Race Condition in Expired Entry Handling

Location: pkg/vmcp/discovery/manager.go:150-165

Issue: When getCachedCapabilities() finds an expired entry, it returns nil but does not delete the entry. This means:

  • Expired entries accumulate until background cleanup runs (up to 1 minute delay)
  • Under heavy load, cache could fill with expired entries, causing legitimate new entries to be rejected
  • The background cleanup acquires a write lock, which could cause contention

Recommendation: Consider one of these approaches:

```go
// Option 1: Delete expired entries on read (lazy deletion)
func (m *DefaultManager) getCachedCapabilities(key string) *aggregator.AggregatedCapabilities {
	m.cacheMu.RLock()
	entry, ok := m.cache[key]
	if !ok {
		m.cacheMu.RUnlock()
		return nil
	}

	// Check if entry has expired
	if time.Now().After(entry.expiresAt) {
		m.cacheMu.RUnlock()
		// Upgrade to write lock to delete. Re-check under the write lock:
		// another goroutine may have refreshed the entry between the two locks.
		m.cacheMu.Lock()
		if e, ok := m.cache[key]; ok && time.Now().After(e.expiresAt) {
			delete(m.cache, key)
		}
		m.cacheMu.Unlock()
		return nil
	}

	caps := entry.capabilities
	m.cacheMu.RUnlock()
	return caps
}

// Option 2: Make background cleanup more aggressive (run every 10-30 seconds)
const cleanupInterval = 30 * time.Second
```

Impact: Medium priority. Won't cause correctness issues but could impact cache efficiency under load.


2. Enhancement: Cache Size Limit Behavior Could Be More Intelligent

Location: pkg/vmcp/discovery/manager.go:168-185

Current behavior: When cache reaches 1000 entries, new entries are rejected (unless the key already exists).

Observations:

  • No LRU/LFU eviction - the first 1000 (user, backend-set) combinations "own" the cache
  • If you have 1001 unique users, the 1001st user will never get cached entries
  • Expired entries count toward the limit until cleanup runs

Recommendation (for future enhancement):

  • Document this behavior in code comments as intentional for simplicity
  • Consider implementing proper LRU eviction in a future PR if this becomes a bottleneck
  • For now, add a comment explaining the trade-off:
```go
// Simple eviction: reject caching when at capacity.
// Trade-off: the first N unique (user, backend-set) combinations occupy the
// cache until expiration. This is acceptable because:
// 1. 1000 entries is large enough for most deployments
// 2. 5-minute TTL ensures natural churn
// 3. Avoids the complexity of LRU/LFU tracking
if len(m.cache) >= maxCacheSize {
	_, exists := m.cache[key]
	if !exists {
		logger.Debugf("Cache at capacity (%d entries), not caching new entry", maxCacheSize)
		return
	}
}
```

Impact: Low priority. Current implementation is reasonable; this is just for future consideration.


3. Minor: Cache Key Collision Separator Could Be More Robust

Location: pkg/vmcp/discovery/manager.go:140-142

Current implementation:

```go
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Separator to avoid collisions
}
```

Observation: The null byte separator is good, but SHA-256 already provides collision resistance. The separator is defense-in-depth, which is fine.

Recommendation: No change needed, but consider documenting why the separator is there:

```go
// Hash the sorted backend IDs with separators to avoid ambiguity
// Example: ["ab", "c"] vs ["a", "bc"] produce different hashes
h := sha256.New()
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Null byte separator prevents concatenation ambiguity
}
```

Impact: Very low priority. Current code is correct; this is just for clarity.


4. Minor: Test Coverage Gap - Cache Eviction Boundary

Location: pkg/vmcp/discovery/manager_test.go:446-507

Observation: The size limit test fills the cache to exactly 1000 entries, then adds one more. However, it doesn't test:

  • What happens when cache is at capacity and an existing key is re-cached (should succeed based on line 174-178)
  • Cache behavior with multiple goroutines hitting capacity simultaneously

Recommendation (optional enhancement):

```go
t.Run("can update existing key when at capacity", func(t *testing.T) {
	// Fill cache to capacity
	// ... existing setup ...

	// Try to update an existing entry - should succeed
	existingIdentity := &auth.Identity{Subject: "user0"}
	ctxWithExisting := auth.WithIdentity(ctx, existingIdentity)
	backends := []vmcp.Backend{newTestBackend("backend1")}
	_, err = dm.Discover(ctxWithExisting, backends)
	require.NoError(t, err)

	// Verify cache size didn't increase
	dm.cacheMu.RLock()
	finalSize := len(dm.cache)
	dm.cacheMu.RUnlock()
	assert.Equal(t, maxCacheSize, finalSize)
})
```

Impact: Very low priority. Current tests are already comprehensive.


📝 Documentation & Code Quality

Positive:

  • ✅ Clear comments explaining cache behavior
  • ✅ Exported errors with descriptive messages
  • ✅ Proper godoc comments on public functions
  • ✅ Follows Go conventions (receiver names, error handling)
  • ✅ Consistent with CLAUDE.md guidelines (public methods first, private after)

Suggestions:

  • Consider adding a package-level comment about cache characteristics (TTL, size, eviction policy)
  • The PR description is excellent - consider adding a similar summary to the package godoc

🧪 Test Quality Assessment

Excellent coverage across:

  • ✅ Cache hits and misses (different users, different backends)
  • ✅ Cache key stability (backend order independence)
  • ✅ Concurrent access (thread safety verification)
  • ✅ Expiration and cleanup
  • ✅ Size limit enforcement
  • ✅ Graceful shutdown (goroutine leak prevention)

Test quality observations:

  • Proper use of t.Parallel() for test isolation
  • Good use of gomock for dependency injection
  • Clear test names following Go conventions
  • Appropriate assertions with helpful failure messages

🎯 Summary & Recommendation

Overall Assessment: Approve with minor suggestions

This is a solid implementation that correctly addresses issue #2503. The cache design is appropriate, thread safety is properly handled, and test coverage is comprehensive.

Key Points:

  1. Core functionality is correct - caching works as intended
  2. Thread safety is properly implemented - no data races
  3. Resource management is clean - proper shutdown via Stop()
  4. ⚠️ One medium-priority issue - expired entry handling could be more efficient (see issue 1 above)
  5. 💡 Minor enhancements available - mostly documentation improvements

Recommendation: The expired entry handling (issue #1) should be addressed before merge, but it's not blocking if you're comfortable with the current behavior under load. The other suggestions are nice-to-haves that can be addressed in future PRs.

Specifically for issue #1: I'd recommend either:

  • Quick fix: Reduce cleanupInterval to 30 seconds (one-line change)
  • Better fix: Implement lazy deletion in getCachedCapabilities() (5-10 lines)

Great work on this PR! The implementation quality is high and the tests are thorough. 🎉


@tgrunnagle
Copy link
Contributor Author

tgrunnagle commented Nov 17, 2025

Regarding claude's review:

  1. Critical: Race Condition in Expired Entry Handling
    The expired entry is re-cached after the cache miss, even if the cache is full, so expired entries do not really pose a problem in practice; this also avoids an extra lock acquisition, since the write lock is already taken when caching the newly retrieved capabilities (see the sketch below).
  2. Enhancement: Cache Size Limit Behavior Could Be More Intelligent
    Not unfair, but this is implemented per the issue description. It can be revisited if we encounter issues.
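
To make point 1 concrete, here is a sketch of the store path being described, reusing the illustrative cacheEntry/DefaultManager types from the earlier snippet (the constants and function name are hypothetical, not the exact code in manager.go): at capacity, brand-new keys are rejected, but an existing key, expired or not, is overwritten in place.

```go
const (
	maxCacheSize = 1000            // hardcoded capacity from the PR description
	cacheTTL     = 5 * time.Minute // hardcoded TTL from the PR description
)

// setCachedCapabilities overwrites existing keys even when the cache is full,
// so an expired entry is refreshed on the very next miss for that key; only
// brand-new keys are rejected at capacity. This reuses the write lock that is
// already required to store the freshly aggregated result.
func (m *DefaultManager) setCachedCapabilities(key string, caps any) {
	m.cacheMu.Lock()
	defer m.cacheMu.Unlock()

	if _, exists := m.cache[key]; !exists && len(m.cache) >= maxCacheSize {
		return // full and the key is new: skip caching; the next call re-aggregates
	}
	m.cache[key] = &cacheEntry{capabilities: caps, expiresAt: time.Now().Add(cacheTTL)}
}
```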

@tgrunnagle tgrunnagle linked an issue (vMCP: Add per-user caching to discovery manager) on November 17, 2025 that may be closed by this pull request
jhrozek
jhrozek previously approved these changes Nov 17, 2025
Contributor

@jhrozek jhrozek left a comment


two nits (one just FYI), but overall looks good!

@tgrunnagle tgrunnagle merged commit ce9a8a5 into main Nov 18, 2025
29 checks passed
@tgrunnagle tgrunnagle deleted the issue_2503_2025-11-17 branch November 18, 2025 15:45