
Conversation

@tgrunnagle
Contributor

Implements issue #2503 by adding an in-memory cache to the discovery
manager that caches capability aggregation results per (user, backend-set)
combination.

**Implementation Details:**

- Cache keyed by `userID:sha256(sorted-backend-ids)` for stability (derivation sketched after this list)
- 5-minute TTL per cache entry (hardcoded)
- 1000 entry maximum capacity (hardcoded)
- Simple eviction: rejects new entries when at capacity
- Background cleanup goroutine removes expired entries every minute (see the sketch after the Cache Behavior list)
- Thread-safe with sync.RWMutex protecting all cache operations
- Graceful shutdown via new Stop() method on Manager interface
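
A minimal sketch of the key derivation described above, assuming a hypothetical function name (the actual implementation lives in `pkg/vmcp/discovery/manager.go`):

```go
package discovery

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// cacheKey derives a stable key of the form userID:sha256(sorted-backend-ids).
// Sorting a copy of the IDs makes the key independent of backend order, and a
// null-byte separator keeps adjacent IDs like ["ab", "c"] and ["a", "bc"]
// from hashing identically.
func cacheKey(userID string, backendIDs []string) string {
	ids := make([]string, len(backendIDs))
	copy(ids, backendIDs)
	sort.Strings(ids)

	h := sha256.New()
	for _, id := range ids {
		h.Write([]byte(id))
		h.Write([]byte{0})
	}
	return userID + ":" + hex.EncodeToString(h.Sum(nil))
}
```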

**Cache Behavior:**

- Cache hit: Returns cached capabilities without calling aggregator
- Cache miss: Calls aggregator, caches result (if under size limit)
- Expired entries: Treated as cache miss, triggers re-aggregation
- Backend order: Hash normalized via sorting for stable keys
- User isolation: Separate cache entries per user identity
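
The background cleanup and graceful shutdown described above boil down to a ticker-driven goroutine guarded by a stop channel and a WaitGroup. A minimal sketch, assuming hypothetical type and field names rather than the exact ones in `manager.go`:

```go
package discovery

import (
	"sync"
	"time"
)

// cacheEntry pairs an aggregation result with its expiry deadline.
type cacheEntry struct {
	capabilities any // placeholder for the aggregated capabilities type
	expiresAt    time.Time
}

type DefaultManager struct {
	cacheMu sync.RWMutex
	cache   map[string]*cacheEntry
	stopCh  chan struct{}
	wg      sync.WaitGroup
}

// startCleanup launches the background sweeper: on every tick it removes
// expired entries under the write lock, until Stop closes stopCh.
func (m *DefaultManager) startCleanup(interval time.Duration) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				now := time.Now()
				m.cacheMu.Lock()
				for k, e := range m.cache {
					if now.After(e.expiresAt) {
						delete(m.cache, k)
					}
				}
				m.cacheMu.Unlock()
			case <-m.stopCh:
				return
			}
		}
	}()
}

// Stop signals the sweeper goroutine to exit and waits for it, which is what
// prevents goroutine leaks on server shutdown.
func (m *DefaultManager) Stop() {
	close(m.stopCh)
	m.wg.Wait()
}
```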

**Changes:**

- `pkg/vmcp/discovery/manager.go`:
  - Added cache infrastructure to DefaultManager
  - Modified Discover() to check cache before aggregation
  - Added Stop() method to Manager interface
  - Added background cleanup goroutine
  - Added cache management helper methods

- `pkg/vmcp/discovery/manager_test.go`:
  - Added 9 comprehensive test cases covering:
    - Cache hits and misses (user/backend variations)
    - Cache key stability (backend order independence)
    - Concurrent access thread safety
    - Expiration and cleanup
    - Size limit enforcement
    - Graceful shutdown
  - Added defer mgr.Stop() to prevent goroutine leaks

- `pkg/vmcp/server/server.go`:
  - Added call to discovery manager Stop() on shutdown

- Regenerated mocks for the updated Manager interface
@tgrunnagle tgrunnagle requested review from JAORMX and jhrozek November 17, 2025 20:10
@tgrunnagle
Contributor Author

@claude Review this PR, which addresses #2503

@codecov

codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 97.18310% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.56%. Comparing base (4ceed40) to head (d64b4dc).
⚠️ Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/vmcp/discovery/manager.go | 97.10% | 2 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2620      +/-   ##
==========================================
- Coverage   55.84%   55.56%   -0.29%
==========================================
  Files         312      312
  Lines       29541    29782     +241
==========================================
+ Hits        16498    16549      +51
- Misses      11601    11792     +191
+ Partials     1442     1441       -1
```


@claude
Contributor

claude bot commented Nov 17, 2025

Claude finished @tgrunnagle's task


Code Review for PR #2620

I've completed a thorough review of the per-user caching implementation. Overall, this is a well-designed and correctly implemented feature with excellent test coverage. Here are my findings:


✅ Strengths

  1. Solid Cache Design

    • Cache key generation (userID:sha256(sorted-backend-ids)) is well-designed with collision resistance
    • Backend order normalization via sorting ensures stable keys (tested in manager_test.go:255)
    • 5-minute TTL is reasonable for capability data
    • Simple eviction strategy (reject when at capacity) is appropriate for this use case
  2. Excellent Thread Safety

    • Proper use of sync.RWMutex with read locks for lookups and write locks for mutations
    • Background cleanup goroutine with clean shutdown via stopCh and sync.WaitGroup
    • Test coverage for concurrent access (manager_test.go:297)
  3. Comprehensive Test Coverage

    • 9 test cases covering all major scenarios (hits/misses, expiration, concurrency, size limits, shutdown)
    • Proper use of defer mgr.Stop() to prevent goroutine leaks
    • Good use of mocks and assertions
  4. Clean Integration

    • Stop() method properly added to Manager interface
    • Server correctly calls discoveryMgr.Stop() on shutdown (server.go:440-442)
    • Mocks regenerated correctly

🔍 Issues & Recommendations

1. Critical: Race Condition in Expired Entry Handling

Location: pkg/vmcp/discovery/manager.go:150-165

Issue: When getCachedCapabilities() finds an expired entry, it returns nil but does not delete the entry. This means:

  • Expired entries accumulate until background cleanup runs (up to 1 minute delay)
  • Under heavy load, cache could fill with expired entries, causing legitimate new entries to be rejected
  • The background cleanup acquires a write lock, which could cause contention

Recommendation: Consider one of these approaches:

```go
// Option 1: Delete expired entries on read (lazy deletion)
func (m *DefaultManager) getCachedCapabilities(key string) *aggregator.AggregatedCapabilities {
	m.cacheMu.RLock()
	entry, ok := m.cache[key]
	if !ok {
		m.cacheMu.RUnlock()
		return nil
	}

	// Check if entry has expired
	if time.Now().After(entry.expiresAt) {
		m.cacheMu.RUnlock()
		// Upgrade to write lock to delete. Re-check under the write lock:
		// another goroutine may have refreshed the entry between the two locks.
		m.cacheMu.Lock()
		if e, ok := m.cache[key]; ok && time.Now().After(e.expiresAt) {
			delete(m.cache, key)
		}
		m.cacheMu.Unlock()
		return nil
	}

	caps := entry.capabilities
	m.cacheMu.RUnlock()
	return caps
}

// Option 2: Make background cleanup more aggressive (run every 10-30 seconds)
const cleanupInterval = 30 * time.Second
```

Impact: Medium priority. Won't cause correctness issues but could impact cache efficiency under load.


2. Enhancement: Cache Size Limit Behavior Could Be More Intelligent

Location: pkg/vmcp/discovery/manager.go:168-185

Current behavior: When cache reaches 1000 entries, new entries are rejected (unless the key already exists).

Observations:

  • No LRU/LFU eviction - the first 1000 (user, backend-set) combinations "own" the cache
  • If you have 1001 unique users, the 1001st user will never get cached entries
  • Expired entries count toward the limit until cleanup runs

Recommendation (for future enhancement):

  • Document this behavior in code comments as intentional for simplicity
  • Consider implementing proper LRU eviction in a future PR if this becomes a bottleneck
  • For now, add a comment explaining the trade-off:
```go
// Simple eviction: reject caching when at capacity.
// Trade-off: the first N unique (user, backend-set) combinations occupy the
// cache until expiration. This is acceptable because:
// 1. 1000 entries is large enough for most deployments
// 2. 5-minute TTL ensures natural churn
// 3. Avoids the complexity of LRU/LFU tracking
if len(m.cache) >= maxCacheSize {
	_, exists := m.cache[key]
	if !exists {
		logger.Debugf("Cache at capacity (%d entries), not caching new entry", maxCacheSize)
		return
	}
}
```

Impact: Low priority. Current implementation is reasonable; this is just for future consideration.


3. Minor: Cache Key Collision Separator Could Be More Robust

Location: pkg/vmcp/discovery/manager.go:140-142

Current implementation:

```go
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Separator to avoid collisions
}
```

Observation: The null byte separator is good, but SHA-256 already provides collision resistance. The separator is defense-in-depth, which is fine.

Recommendation: No change needed, but consider documenting why the separator is there:

```go
// Hash the sorted backend IDs with separators to avoid ambiguity
// Example: ["ab", "c"] vs ["a", "bc"] produce different hashes
h := sha256.New()
for _, id := range backendIDs {
	h.Write([]byte(id))
	h.Write([]byte{0}) // Null byte separator prevents concatenation ambiguity
}
```

Impact: Very low priority. Current code is correct; this is just for clarity.


4. Minor: Test Coverage Gap - Cache Eviction Boundary

Location: pkg/vmcp/discovery/manager_test.go:446-507

Observation: The size limit test fills the cache to exactly 1000 entries, then adds one more. However, it doesn't test:

  • What happens when cache is at capacity and an existing key is re-cached (should succeed based on line 174-178)
  • Cache behavior with multiple goroutines hitting capacity simultaneously

Recommendation (optional enhancement):

```go
t.Run("can update existing key when at capacity", func(t *testing.T) {
	// Fill cache to capacity
	// ... existing setup ...

	// Try to update an existing entry - should succeed
	existingIdentity := &auth.Identity{Subject: "user0"}
	ctxWithExisting := auth.WithIdentity(ctx, existingIdentity)
	backends := []vmcp.Backend{newTestBackend("backend1")}
	_, err = dm.Discover(ctxWithExisting, backends)
	require.NoError(t, err)

	// Verify cache size didn't increase
	dm.cacheMu.RLock()
	finalSize := len(dm.cache)
	dm.cacheMu.RUnlock()
	assert.Equal(t, maxCacheSize, finalSize)
})
```

Impact: Very low priority. Current tests are already comprehensive.


📝 Documentation & Code Quality

Positive:

  • ✅ Clear comments explaining cache behavior
  • ✅ Exported errors with descriptive messages
  • ✅ Proper godoc comments on public functions
  • ✅ Follows Go conventions (receiver names, error handling)
  • ✅ Consistent with CLAUDE.md guidelines (public methods first, private after)

Suggestions:

  • Consider adding a package-level comment about cache characteristics (TTL, size, eviction policy)
  • The PR description is excellent - consider adding a similar summary to the package godoc

🧪 Test Quality Assessment

Excellent coverage across:

  • ✅ Cache hits and misses (different users, different backends)
  • ✅ Cache key stability (backend order independence)
  • ✅ Concurrent access (thread safety verification)
  • ✅ Expiration and cleanup
  • ✅ Size limit enforcement
  • ✅ Graceful shutdown (goroutine leak prevention)

Test quality observations:

  • Proper use of t.Parallel() for test isolation
  • Good use of gomock for dependency injection
  • Clear test names following Go conventions
  • Appropriate assertions with helpful failure messages

🎯 Summary & Recommendation

Overall Assessment: Approve with minor suggestions

This is a solid implementation that correctly addresses issue #2503. The cache design is appropriate, thread safety is properly handled, and test coverage is comprehensive.

Key Points:

  1. Core functionality is correct - caching works as intended
  2. Thread safety is properly implemented - no data races
  3. Resource management is clean - proper shutdown via Stop()
  4. ⚠️ One medium-priority issue - expired entry handling could be more efficient (see issue 1 above)
  5. 💡 Minor enhancements available - mostly documentation improvements

Recommendation: The expired entry handling (issue #1) should be addressed before merge, but it's not blocking if you're comfortable with the current behavior under load. The other suggestions are nice-to-haves that can be addressed in future PRs.

Specifically for issue #1: I'd recommend either:

  • Quick fix: Reduce cleanupInterval to 30 seconds (one-line change)
  • Better fix: Implement lazy deletion in getCachedCapabilities() (5-10 lines)

Great work on this PR! The implementation quality is high and the tests are thorough. 🎉


@tgrunnagle
Copy link
Contributor Author

tgrunnagle commented Nov 17, 2025

Regarding claude's review:

  1. Critical: Race Condition in Expired Entry Handling
    The expired entry is re-cached after the cache miss, even if the cache is full, so expired entries do not really pose a problem in practice; this also avoids an extra lock acquisition, since the write lock is already taken when caching the newly retrieved capabilities (see the sketch below).
  2. Enhancement: Cache Size Limit Behavior Could Be More Intelligent
    Not unfair, but this is implemented per the issue description. It can be revisited if we encounter issues.
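
To make point 1 concrete, here is a sketch of the store path being described, reusing the illustrative cacheEntry/DefaultManager types from the earlier snippet (the constants and function name are hypothetical, not the exact code in manager.go): at capacity, brand-new keys are rejected, but an existing key, expired or not, is overwritten in place.

```go
const (
	maxCacheSize = 1000            // hardcoded capacity from the PR description
	cacheTTL     = 5 * time.Minute // hardcoded TTL from the PR description
)

// setCachedCapabilities overwrites existing keys even when the cache is full,
// so an expired entry is refreshed on the very next miss for that key; only
// brand-new keys are rejected at capacity. This reuses the write lock that is
// already required to store the freshly aggregated result.
func (m *DefaultManager) setCachedCapabilities(key string, caps any) {
	m.cacheMu.Lock()
	defer m.cacheMu.Unlock()

	if _, exists := m.cache[key]; !exists && len(m.cache) >= maxCacheSize {
		return // full and the key is new: skip caching; the next call re-aggregates
	}
	m.cache[key] = &cacheEntry{capabilities: caps, expiresAt: time.Now().Add(cacheTTL)}
}
```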

@tgrunnagle tgrunnagle linked an issue (vMCP: Add per-user caching to discovery manager) on November 17, 2025 that may be closed by this pull request
jhrozek
jhrozek previously approved these changes Nov 17, 2025
Contributor

@jhrozek jhrozek left a comment


two nits (one just FYI), but overall looks good!

@tgrunnagle tgrunnagle merged commit ce9a8a5 into main Nov 18, 2025
29 checks passed
@tgrunnagle tgrunnagle deleted the issue_2503_2025-11-17 branch November 18, 2025 15:45