# Add per-user caching to vMCP discovery manager #2620

## Conversation
Implements issue #2503 by adding an in-memory cache to the discovery manager that caches capability aggregation results per (user, backend-set) combination.

**Implementation Details:**
- Cache keyed by `userID:sha256(sorted-backend-ids)` for stability
- 5-minute TTL per cache entry (hardcoded)
- 1000-entry maximum capacity (hardcoded)
- Simple eviction: rejects new entries when at capacity
- Background cleanup goroutine removes expired entries every minute
- Thread-safe, with a sync.RWMutex protecting all cache operations
- Graceful shutdown via a new Stop() method on the Manager interface

**Cache Behavior:**
- Cache hit: returns cached capabilities without calling the aggregator
- Cache miss: calls the aggregator and caches the result (if under the size limit)
- Expired entries: treated as a cache miss, triggering re-aggregation
- Backend order: hash input normalized via sorting for stable keys
- User isolation: separate cache entries per user identity

**Changes:**
- `pkg/vmcp/discovery/manager.go`:
  - Added cache infrastructure to DefaultManager
  - Modified Discover() to check the cache before aggregation
  - Added Stop() method to the Manager interface
  - Added background cleanup goroutine
  - Added cache management helper methods
- `pkg/vmcp/discovery/manager_test.go`:
  - Added 9 comprehensive test cases covering:
    - Cache hits and misses (user/backend variations)
    - Cache key stability (backend order independence)
    - Concurrent access thread safety
    - Expiration and cleanup
    - Size limit enforcement
    - Graceful shutdown
  - Added `defer mgr.Stop()` to prevent goroutine leaks
- Regenerated mocks for the updated Manager interface
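To make the keying scheme and hit/miss flow concrete, here is a minimal, self-contained sketch of the design described above. The names (`manager`, `cacheEntry`, `discover`, `aggregate`) and the `[]string` capabilities type are illustrative stand-ins, not the PR's actual identifiers; only the key format, TTL, capacity rule, and eviction behavior follow the description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"sync"
	"time"
)

const (
	cacheTTL     = 5 * time.Minute // hardcoded per-entry TTL
	maxCacheSize = 1000            // hardcoded capacity
)

type cacheEntry struct {
	capabilities []string // stand-in for the aggregated capabilities type
	expiresAt    time.Time
}

type manager struct {
	mu    sync.RWMutex
	cache map[string]cacheEntry
}

// cacheKey builds "userID:sha256(sorted-backend-ids)". Sorting makes the key
// independent of backend order; the null-byte separator keeps inputs like
// ["ab", "c"] and ["a", "bc"] from hashing identically.
func cacheKey(userID string, backendIDs []string) string {
	ids := append([]string(nil), backendIDs...)
	sort.Strings(ids)
	h := sha256.New()
	for _, id := range ids {
		h.Write([]byte(id))
		h.Write([]byte{0})
	}
	return userID + ":" + hex.EncodeToString(h.Sum(nil))
}

// discover returns cached capabilities on a hit and falls back to the
// aggregator on a miss or expired entry, caching the result only while
// under the size limit (or when refreshing an existing key).
func (m *manager) discover(key string, aggregate func() []string) []string {
	m.mu.RLock()
	entry, ok := m.cache[key]
	m.mu.RUnlock()
	if ok && time.Now().Before(entry.expiresAt) {
		return entry.capabilities // cache hit: skip aggregation
	}

	caps := aggregate() // cache miss or expired entry: re-aggregate

	m.mu.Lock()
	if _, exists := m.cache[key]; exists || len(m.cache) < maxCacheSize {
		m.cache[key] = cacheEntry{capabilities: caps, expiresAt: time.Now().Add(cacheTTL)}
	}
	m.mu.Unlock()
	return caps
}

func main() {
	m := &manager{cache: make(map[string]cacheEntry)}
	calls := 0
	agg := func() []string { calls++; return []string{"tool:echo"} }

	m.discover(cacheKey("alice", []string{"b", "a"}), agg)
	m.discover(cacheKey("alice", []string{"a", "b"}), agg) // same key: hit
	fmt.Println("aggregator calls:", calls)                // prints 1
}
```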
**Codecov Report** ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2620      +/-   ##
==========================================
- Coverage   55.84%   55.56%   -0.29%
==========================================
  Files         312      312
  Lines       29541    29782     +241
==========================================
+ Hits        16498    16549      +51
- Misses      11601    11792     +191
+ Partials     1442     1441       -1
```
**Claude finished @tgrunnagle's task.**

## Code Review for PR #2620

I've completed a thorough review of the per-user caching implementation. Overall, this is a well-designed and correctly implemented feature with excellent test coverage. Here are my findings:

### ✅ Strengths
### 🔍 Issues & Recommendations

#### 1. Critical: Race Condition in Expired Entry Handling

**Location:** …

**Issue:** When an entry expires, reads treat it as a cache miss but leave the entry in the map; it is only removed when the periodic background cleanup runs, so stale entries keep occupying cache capacity in the meantime.

**Recommendation:** Consider one of these approaches:

```go
// Option 1: Delete expired entries on read (lazy deletion)
func (m *DefaultManager) getCachedCapabilities(key string) *aggregator.AggregatedCapabilities {
	m.cacheMu.RLock()
	entry, ok := m.cache[key]
	if !ok {
		m.cacheMu.RUnlock()
		return nil
	}
	// Check if entry has expired
	if time.Now().After(entry.expiresAt) {
		m.cacheMu.RUnlock()
		// Upgrade to a write lock to delete; re-check under the write lock,
		// since another goroutine may have refreshed the entry in between.
		m.cacheMu.Lock()
		if entry, ok := m.cache[key]; ok && time.Now().After(entry.expiresAt) {
			delete(m.cache, key)
		}
		m.cacheMu.Unlock()
		return nil
	}
	caps := entry.capabilities
	m.cacheMu.RUnlock()
	return caps
}
```
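A note on the lock upgrade above: between `RUnlock` and `Lock` another goroutine may refresh the same key, so the sketch re-checks existence and expiry under the write lock before deleting. Without that re-check, a freshly cached entry could be dropped — harmless for correctness, but it wastes an aggregation.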
```go
// Option 2: Make background cleanup more aggressive (run every 10-30 seconds)
const cleanupInterval = 30 * time.Second
```

**Impact:** Medium priority. Won't cause correctness issues but could impact cache efficiency under load.

#### 2. Enhancement: Cache Size Limit Behavior Could Be More Intelligent

**Location:** …

**Current behavior:** When the cache reaches 1000 entries, new entries are rejected (unless the key already exists).

**Observations:** …
**Recommendation** (for future enhancement):

```go
// Simple eviction: reject caching when at capacity
// Trade-off: First N unique (user, backend-set) combinations occupy cache
// until expiration. This is acceptable because:
// 1. 1000 entries is large enough for most deployments
// 2. 5-minute TTL ensures natural churn
// 3. Avoids complexity of LRU/LFU tracking
if len(m.cache) >= maxCacheSize {
	_, exists := m.cache[key]
	if !exists {
		logger.Debugf("Cache at capacity (%d entries), not caching new entry", maxCacheSize)
		return
	}
}
```

**Impact:** Low priority. Current implementation is reasonable; this is just for future consideration.

#### 3. Minor: Cache Key Collision Separator Could Be More Robust

**Location:** …

**Current implementation:**

```go
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Separator to avoid collisions
}
```

**Observation:** The null byte separator is good, but SHA-256 already provides collision resistance. The separator is defense-in-depth, which is fine.

**Recommendation:** No change needed, but consider documenting why the separator is there:

```go
// Hash the sorted backend IDs with separators to avoid ambiguity
// Example: ["ab", "c"] vs ["a", "bc"] produce different hashes
h := sha256.New()
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Null byte separator prevents concatenation ambiguity
}
```

**Impact:** Very low priority. Current code is correct; this is just for clarity.

#### 4. Minor: Test Coverage Gap - Cache Eviction Boundary

**Location:** …

**Observation:** The size limit test fills the cache to exactly 1000 entries, then adds one more. However, it doesn't test: …
**Recommendation** (optional enhancement):

```go
t.Run("can update existing key when at capacity", func(t *testing.T) {
	// Fill cache to capacity
	// ... existing setup ...

	// Try to update an existing entry - should succeed
	existingIdentity := &auth.Identity{Subject: "user0"}
	ctxWithExisting := auth.WithIdentity(ctx, existingIdentity)
	backends := []vmcp.Backend{newTestBackend("backend1")}
	_, err = dm.Discover(ctxWithExisting, backends)
	require.NoError(t, err)

	// Verify cache size didn't increase
	dm.cacheMu.RLock()
	finalSize := len(dm.cache)
	dm.cacheMu.RUnlock()
	assert.Equal(t, maxCacheSize, finalSize)
})
```

**Impact:** Very low priority. Current tests are already comprehensive.

### 📝 Documentation & Code Quality

**Positive:** …
**Suggestions:** …
### 🧪 Test Quality Assessment

Excellent coverage across: …

Test quality observations: …
### 🎯 Summary & Recommendation

**Overall Assessment:** ✅ Approve with minor suggestions

This is a solid implementation that correctly addresses issue #2503. The cache design is appropriate, thread safety is properly handled, and test coverage is comprehensive.

**Key Points:** …

**Recommendation:** The expired entry handling (issue #1) should be addressed before merge, but it's not blocking if you're comfortable with the current behavior under load. The other suggestions are nice-to-haves that can be addressed in future PRs.

Specifically for issue #1, I'd recommend either: …

Great work on this PR! The implementation quality is high and the tests are thorough. 🎉
Regarding Claude's review:
**jhrozek** left a comment:
two nits (one just FYI), but overall looks good!
Implements issue #2503 by adding an in-memory cache to the discovery manager that caches capability aggregation results per (user, backend-set) combination.

**Implementation Details:**
- Cache keyed by `userID:sha256(sorted-backend-ids)` for stability
- …

**Cache Behavior:** …

**Changes:**
- `pkg/vmcp/discovery/manager.go`: …
- `pkg/vmcp/discovery/manager_test.go`: …
- `pkg/vmcp/server/server.go`: `Stop` …
- Regenerated mocks for updated Manager interface
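Since the `Stop` wiring in `pkg/vmcp/server/server.go` is only named above, here is a minimal sketch of how a background cleanup goroutine with graceful shutdown typically fits together. It is illustrative only: the `manager` type, its fields, and the `newManager` constructor are assumptions rather than the PR's actual code; the one-minute interval and the `defer m.Stop()` usage follow the description and tests.

```go
package main

import (
	"sync"
	"time"
)

type manager struct {
	mu    sync.RWMutex
	cache map[string]time.Time // key -> expiry; capabilities omitted for brevity
	done  chan struct{}
	wg    sync.WaitGroup
}

func newManager() *manager {
	m := &manager{
		cache: make(map[string]time.Time),
		done:  make(chan struct{}),
	}
	m.wg.Add(1)
	go m.cleanupLoop(time.Minute) // PR describes a one-minute cleanup cadence
	return m
}

// cleanupLoop periodically removes expired entries until Stop is called.
func (m *manager) cleanupLoop(interval time.Duration) {
	defer m.wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			now := time.Now()
			m.mu.Lock()
			for key, expiresAt := range m.cache {
				if now.After(expiresAt) {
					delete(m.cache, key)
				}
			}
			m.mu.Unlock()
		case <-m.done:
			return
		}
	}
}

// Stop terminates the cleanup goroutine and waits for it to exit, so that
// server shutdown (and tests) don't leak goroutines.
func (m *manager) Stop() {
	close(m.done)
	m.wg.Wait()
}

func main() {
	m := newManager()
	defer m.Stop() // mirrors the `defer mgr.Stop()` pattern used in the tests
}
```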