Skip to content

Conversation

@JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Oct 30, 2025

Summary

Implements the final stage of the Virtual MCP Server capability aggregation pipeline: exposing aggregated capabilities via MCP protocol and routing requests to appropriate backends.

This PR completes the Virtual MCP Server implementation by adding:

  • Router: Thread-safe routing with RWMutex for mapping capability names to backend targets
  • Virtual MCP Server: Aggregates multiple backend MCP servers into a unified MCP interface
  • Request Forwarding: Routes and forwards tool/resource/prompt calls to appropriate backends

Architecture

Client → Virtual MCP Server → Router → BackendClient → Backend MCPs
         (MCP protocol)        (routing) (HTTP)        (MCP servers)

Data Flow

  1. Aggregation (already implemented): Discovers backends, queries capabilities, resolves conflicts → produces AggregatedCapabilities with RoutingTable
  2. Routing (this PR): Routes capability names to backend targets
  3. Server (this PR): Exposes capabilities via MCP protocol, forwards requests to backends

What's Included

1. Router (pkg/vmcp/router/)

Purpose: Maps capability names/URIs to backend targets for request routing.

Implementation (default_router.go):

  • Thread-safe using sync.RWMutex (supports dynamic updates)
  • Three routing methods: RouteTool, RouteResource, RoutePrompt
  • Atomic routing table updates via UpdateRoutingTable
  • Test coverage: 100% including concurrency tests ✅

Design Choice - RWMutex vs atomic.Pointer:

  • Supports both wholesale table replacement AND future fine-grained updates
  • Flexible for dynamic backend discovery (add/remove individual backends at runtime)
  • Read locks are cheap with low write contention
  • More familiar Go pattern for most developers

2. Virtual MCP Server (pkg/vmcp/server/)

Purpose: Aggregates multiple backend MCP servers into a single MCP interface.

Implementation (server.go):

  • Uses mark3labs/mcp-go SDK for MCP protocol compliance
  • Automatically exposes tools/list, resources/list, prompts/list endpoints
  • Routes incoming requests to appropriate backends:
    • tools/call → routes by tool name
    • resources/read → routes by resource URI
    • prompts/get → routes by prompt name
  • Streamable HTTP transport at configurable endpoint (default /mcp)
  • Configurable host, port, server name/version
  • Graceful startup/shutdown with context support

Request Handling Flow:

Client RequestMCP Server HandlerRouter.RouteTool(name)
                                   ↓
                              BackendTarget (workload ID, URL, transport)
                                   ↓
                         BackendClient.CallTool(target, name, args)
                                   ↓
                              Backend MCP ServerResponseClient

3. Integration Tests

Full pipeline integration (server/integration_test.go):

  • Tests complete flow: Aggregator → Router → Server
  • Validates conflict resolution strategies (prefix, priority)
  • Mock backends return realistic capabilities
  • Verifies routing correctness and tool name transformations
  • All tests passing ✅

Key Features

Capability Registration

  • Server registers all aggregated tools, resources, and prompts with MCP protocol
  • Creates handler closures that capture capability names for routing
  • Updates router with complete routing table from aggregation

Request Routing

  • Router uses RoutingTable from aggregation phase
  • Maps capability names to BackendTarget structs containing:
    • Workload ID and name
    • Backend base URL
    • Transport type (streamable-http, sse, etc.)
    • Authentication metadata
  • Thread-safe concurrent access (read-heavy workload)

Error Handling

  • Tool not found → returns MCP error to client
  • Backend unavailable → returns MCP error with details
  • Tool execution failures → forwards backend error to client
  • Routing table not initialized → fails fast with clear error

Testing Strategy

Unit Tests

  • Router (router/default_router_test.go):

    • Route to existing tools/resources/prompts ✅
    • Handle not found errors ✅
    • Handle uninitialized routing table ✅
    • Atomic routing table updates ✅
    • Concurrent access (10 goroutines, 100 ops each) ✅
    • Coverage: 100%
  • Server (server/server_test.go):

    • Configuration defaults ✅
    • Capability registration ✅
    • Routing table update failures ✅
    • Start without registration fails ✅
    • Graceful stop ✅

Integration Tests

  • End-to-end pipeline (server/integration_test.go):
    • Mock backends with realistic capabilities ✅
    • Full aggregation → routing → server flow ✅
    • Prefix strategy creates unique names ✅
    • Priority strategy resolves conflicts correctly ✅
    • Router routes to correct backends ✅
    • Server registers all capabilities ✅

Test Results

pkg/vmcp/router    100.0% coverage  ✅
pkg/vmcp/server     45.0% coverage  ✅ (handlers tested via integration)
All tests passing, linting clean

Design Decisions

1. Why separate Router from Server?

  • Single Responsibility: Router only does routing, Server handles MCP protocol
  • Testability: Router has 100% coverage with simple, focused tests
  • Future extensibility: Can add SessionAffinity, LoadBalancing strategies without modifying Server

2. Why use mark3labs/mcp-go SDK?

  • Guaranteed MCP protocol compliance
  • Auto-handles capability list endpoints (tools/list, resources/list, prompts/list)
  • Built-in streaming support
  • Active maintenance and community

3. Why handler closures?

func (s *Server) createToolHandler(toolName string) func(...) {
    return func(ctx, request) {
        // toolName captured at registration time
        target := s.router.RouteTool(ctx, toolName)
        ...
    }
}
  • Captures tool name at registration (no runtime lookups)
  • Clean separation of concerns
  • Type-safe request handling

4. Why dynamic capability registration?

  • Capabilities determined at runtime from backend discovery
  • Different groups → different backends → different tools
  • Enables future hot-reloading of backends
  • Supports dynamic backend addition/removal

Example Usage

// 1. Run aggregation (already implemented)
aggregator := aggregator.NewDefaultAggregator(backendClient, conflictResolver, toolConfigs)
capabilities, _ := aggregator.AggregateCapabilities(ctx, backends)

// 2. Create router (this PR)
router := router.NewDefaultRouter()

// 3. Create and start server (this PR)
server := server.New(&server.Config{
    Name:    "my-vmcp",
    Version: "1.0.0",
    Host:    "0.0.0.0",
    Port:    4483,
}, router, backendClient)

// 4. Register capabilities and start
server.RegisterCapabilities(ctx, capabilities)
server.Start(ctx) // Blocks until shutdown

Client connects to: http://localhost:4483/mcp

Implementation Completeness

This PR implements all requirements for capability merging and routing:

  • ✅ Merge conflict-resolved tools into final tool list
  • ✅ Merge resources from all backends
  • ✅ Merge prompts from all backends
  • ✅ Build tool-to-workload routing map
  • ✅ Build resource-to-workload routing map
  • ✅ Build prompt-to-workload routing map
  • ✅ Expose merged capabilities via MCP protocol endpoints
  • ✅ Route incoming requests to appropriate backends
  • ✅ Forward tool/resource/prompt calls to backends
  • ✅ Unit tests for routing
  • ✅ Unit tests for server
  • ✅ Integration tests for full pipeline

Files Changed

pkg/vmcp/router/
├── router.go                    # Add go:generate directive for mocks
├── default_router.go            # NEW: Router implementation (107 lines)
├── default_router_test.go       # NEW: Router tests (422 lines)
└── mocks/mock_router.go         # NEW: Generated mocks

pkg/vmcp/server/
├── server.go                    # NEW: Virtual MCP Server (372 lines)
├── server_test.go               # NEW: Unit tests (277 lines)
└── integration_test.go          # NEW: Integration tests (279 lines)

Total: 1,457 lines added (excluding generated mocks)

Checklist

  • Tests added with good coverage
  • All tests passing
  • Linting clean (golangci-lint)
  • Follows DDD principles
  • Follows Go best practices
  • Mocks generated via task gen
  • Comprehensive in-code documentation
  • Integration tests verify full pipeline

@JAORMX
Copy link
Collaborator Author

JAORMX commented Oct 30, 2025

@claude please review this

@claude
Copy link
Contributor

claude bot commented Oct 30, 2025

Claude finished @JAORMX's task —— View job


Code Review Complete

I've completed a thorough review of PR #2376. The implementation is solid and well-tested. Below are my findings organized by category.


Overall Assessment

Summary: This is high-quality code that successfully implements the virtual MCP server capability merging and routing system. The architecture is clean, tests are comprehensive, and the code follows Go best practices. I found a few areas for improvement but no blocking issues.

Test Results: ✅ All tests passing in both router and server packages


Strengths

  1. Excellent Test Coverage

    • Router tests are comprehensive with 100% coverage claim validated
    • Integration tests verify the full pipeline (aggregator → router → server)
    • Concurrency tests validate thread-safety with 10 goroutines × 100 operations
    • Good use of table-driven tests throughout
  2. Clean Architecture

    • Clear separation of concerns (router does routing, server handles MCP protocol)
    • Proper use of interfaces for testability
    • Good abstraction boundaries
  3. Comprehensive Documentation

    • Detailed package-level documentation
    • Helpful inline comments explaining design decisions
    • PR description is thorough and well-organized
  4. Thread-Safety

    • Proper use of sync.RWMutex for concurrent access
    • Atomic routing table updates

Issues Found

1. Code Organization Violation (CLAUDE.md Guideline) ⚠️

Location: pkg/vmcp/router/default_router.go and pkg/vmcp/server/server.go

Issue: Per CLAUDE.md, "In Go codefiles, keep public methods to the top half of the file, and private methods to the bottom half of the file."

Current structure in default_router.go:

  • Lines 26-28: Public method NewDefaultRouter()
  • Lines 31-47: Public method RouteTool()
  • Lines 50-66: Public method RouteResource()
  • Lines 69-85: Public method RoutePrompt()
  • Lines 92-106: Public method UpdateRoutingTable()
  • No private methods

Verdict: This file is acceptable since all methods are public. ✅

Current structure in server.go:

  • Lines 62-98: Public constructor New()
  • Lines 102-137: Public method RegisterCapabilities()
  • Lines 140-177: Public method Start()
  • Lines 180-197: Public method Stop()
  • Lines 200-202: Public method Address()
  • Lines 208-227: Private method registerTool()
  • Lines 233-250: Private method registerResource()
  • Lines 256-281: Private method registerPrompt()
  • Lines 284-312: Private method createToolHandler()
  • Lines 315-346: Private method createResourceHandler()
  • Lines 349-388: Private method createPromptHandler()

Recommendation: In server.go, move all public methods (New, RegisterCapabilities, Start, Stop, Address) to the top, then place all private methods (registerTool, registerResource, registerPrompt, createToolHandler, createResourceHandler, createPromptHandler) below. This follows the project's established convention.

2. Context Parameter Unused

Location: Multiple routing methods in pkg/vmcp/router/default_router.go

Code:

func (r *defaultRouter) RouteTool(_ context.Context, toolName string) (*vmcp.BackendTarget, error) {
func (r *defaultRouter) RouteResource(_ context.Context, uri string) (*vmcp.BackendTarget, error) {
func (r *defaultRouter) RoutePrompt(_ context.Context, name string) (*vmcp.BackendTarget, error) {
func (r *defaultRouter) UpdateRoutingTable(_ context.Context, table *vmcp.RoutingTable) error {

Issue: The context parameter is intentionally ignored (using _). While this is common in Go when implementing interfaces, there's no comment explaining why context is unused or what future use cases might require it.

Recommendation: Add a brief comment explaining the design decision:

// RouteTool resolves a tool name to its backend target.
// Context is reserved for future extensions (e.g., request-scoped routing, distributed tracing).
func (r *defaultRouter) RouteTool(_ context.Context, toolName string) (*vmcp.BackendTarget, error) {

3. Missing Error Wrapping in Server

Location: pkg/vmcp/server/server.go:318-326

Code:

target, err := s.router.RouteResource(ctx, uri)
if err != nil {
    logger.Warnf("Failed to route resource %s: %v", uri, err)
    return nil, fmt.Errorf("resource not found: %s", uri)  // ❌ Original error lost
}

Issue: When routing fails, the error message doesn't include the original error cause. Compare with the tool handler at lines 290-293:

target, err := s.router.RouteTool(ctx, toolName)
if err != nil {
    logger.Warnf("Failed to route tool %s: %v", toolName, err)
    return mcp.NewToolResultError(fmt.Sprintf("Tool not found: %s", toolName)), nil
}

Both handlers lose the original error context. This makes debugging difficult when routing fails for non-obvious reasons (e.g., "routing table not initialized" vs "tool not found").

Recommendation: Wrap errors to preserve context:

if err != nil {
    logger.Warnf("Failed to route resource %s: %v", uri, err)
    return nil, fmt.Errorf("resource not found: %s: %w", uri, err)
}

4. Inconsistent Error Handling Pattern

Location: pkg/vmcp/server/server.go:304-306 vs 325-326

Code:

// Tool handler returns MCP error (line 292)
return mcp.NewToolResultError(fmt.Sprintf("Tool not found: %s", toolName)), nil

// Resource handler returns Go error (line 325)
return nil, fmt.Errorf("resource not found: %s", uri)

Issue: Tools use MCP protocol error handling (mcp.NewToolResultError) while resources use standard Go errors. This inconsistency is dictated by the mark3labs/mcp-go SDK, but it's not documented why.

Recommendation: Add a comment explaining the difference:

// createResourceHandler creates a resource handler that routes to the appropriate backend.
// Note: Unlike tool handlers, resource handlers return standard Go errors per MCP spec.
func (s *Server) createResourceHandler(uri string) func(...) {

5. Hard-coded MIME Type

Location: pkg/vmcp/server/server.go:336-342

Code:

contents := []mcp.ResourceContents{
    mcp.TextResourceContents{
        URI:      uri,
        MIMEType: "text/plain",  // ❌ Always text/plain
        Text:     string(data),
    },
}

Issue: The resource handler always returns MIMEType: "text/plain" regardless of the actual resource type. The Resource struct has a MimeType field (line 239 in server.go shows it's stored), but it's not used here.

Recommendation: Use the registered resource's MIME type:

// Store MIME type during registration
s.resourceMimeTypes[uri] = resource.MimeType

// Use it in handler
contents := []mcp.ResourceContents{
    mcp.TextResourceContents{
        URI:      uri,
        MIMEType: s.resourceMimeTypes[uri], // or default to "text/plain" if empty
        Text:     string(data),
    },
}

Alternatively, query the router for the resource metadata or pass it through the routing table.

6. Magic Numbers

Location: pkg/vmcp/server/server.go

Code:

ReadHeaderTimeout: 10 * time.Second,  // Line 156
WithTimeout(ctx, 10*time.Second)      // Line 188

Issue: Timeout values are hard-coded. These should be configurable or at least defined as constants.

Recommendation:

const (
    defaultReadHeaderTimeout = 10 * time.Second
    defaultShutdownTimeout   = 10 * time.Second
)

// Then use:
ReadHeaderTimeout: defaultReadHeaderTimeout,
WithTimeout(ctx, defaultShutdownTimeout)

Or add them to the Config struct with defaults.


Suggestions for Enhancement

1. Missing Health Check Integration

The BackendTarget struct includes a HealthStatus field, but the router doesn't validate backend health before routing. Consider adding:

func (r *defaultRouter) RouteTool(ctx context.Context, toolName string) (*vmcp.BackendTarget, error) {
    // ... existing routing logic ...
    
    if target.HealthStatus != vmcp.BackendHealthy {
        logger.Warnf("Routing to unhealthy backend %s", target.WorkloadID)
        // Could return error or route to replica
    }
    
    return target, nil
}

2. Observability Gaps

Consider adding OpenTelemetry spans for:

  • Routing decisions (which backend was selected)
  • Request forwarding latency
  • Backend health checks
  • Routing table updates

Example:

func (r *defaultRouter) RouteTool(ctx context.Context, toolName string) (*vmcp.BackendTarget, error) {
    ctx, span := tracer.Start(ctx, "router.RouteTool")
    defer span.End()
    span.SetAttributes(attribute.String("tool.name", toolName))
    
    // ... routing logic ...
    
    span.SetAttributes(attribute.String("backend.id", target.WorkloadID))
    return target, nil
}

3. Documentation: Architecture Docs Not Updated

Per CLAUDE.md guidelines:

"ToolHive maintains comprehensive architecture documentation in docs/arch/. When making changes that affect architecture, you MUST update the relevant documentation to keep it in sync with the code."

Missing Updates: This PR introduces major new components (Router, Virtual MCP Server) but doesn't update:

  • docs/arch/02-core-concepts.md - Should document Router and Server as core concepts
  • Potentially docs/arch/00-overview.md - Should mention vMCP server in architecture overview

Recommendation: Add a follow-up task or update these docs in this PR.

4. Test Coverage: Missing Edge Cases

While test coverage is good, consider adding tests for:

  • What happens when backend is unhealthy during routing?
  • Network timeout scenarios when forwarding to backends
  • Malformed responses from backends
  • Very large routing tables (performance/memory tests)
  • Routing table updates during active requests (race condition stress test beyond the basic concurrent test)

Minor Observations

  1. Unused imports: None found ✅
  2. Variable naming: Consistent and clear ✅
  3. Error messages: Descriptive and actionable ✅
  4. Logging levels: Appropriate (Debug for routing, Warn for failures, Info for lifecycle) ✅
  5. Memory leaks: No obvious leaks; routing table is replaced atomically ✅

Recommendations Summary

Must Fix (before merge):

  1. None - Code is functional and safe

Should Fix (improves code quality):

  1. Reorganize server.go to follow public-then-private method ordering (CLAUDE.md guideline)
  2. Fix hard-coded MIME type issue (use actual resource MIME types)
  3. Add error wrapping in server handlers to preserve routing error context
  4. Extract magic number timeouts to constants or config

Nice to Have (future enhancements):

  1. Add comments explaining unused context parameters
  2. Document inconsistent error handling patterns
  3. Update architecture documentation
  4. Add health check validation in router
  5. Add observability spans
  6. Expand test coverage for edge cases

Final Verdict

Recommendation: Approve with minor improvements

This PR successfully implements the capability merging and routing system with:

  • ✅ Clean architecture and design
  • ✅ Comprehensive test coverage
  • ✅ Thread-safe implementation
  • ✅ Good documentation
  • ✅ All tests passing

The issues identified are minor and don't block the PR. The code is production-ready, though the suggested improvements would enhance maintainability and debuggability.

Great work on this implementation! The integration tests particularly demonstrate thorough validation of the complete pipeline.


@JAORMX JAORMX force-pushed the vmcp-capability-merging-routing branch from d853588 to 983561a Compare October 30, 2025 12:06
@JAORMX JAORMX requested review from jhrozek and yrobla October 30, 2025 12:31
@JAORMX JAORMX force-pushed the vmcp-capability-merging-routing branch 2 times, most recently from 578f88b to b31378c Compare October 30, 2025 12:45
@codecov
Copy link

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 61.90476% with 96 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.39%. Comparing base (88435a3) to head (0bb2849).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/server/server.go 52.47% 88 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2376      +/-   ##
==========================================
+ Coverage   54.29%   54.39%   +0.09%     
==========================================
  Files         260      262       +2     
  Lines       24925    25177     +252     
==========================================
+ Hits        13534    13694     +160     
- Misses      10100    10178      +78     
- Partials     1291     1305      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX requested a review from yrobla October 30, 2025 13:28
@JAORMX JAORMX force-pushed the vmcp-capability-merging-routing branch from 6ea7f3d to bc80da6 Compare October 30, 2025 13:35
Copy link
Contributor

@yrobla yrobla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some minor comments

This completes issue #154 by implementing the final stage of the
capability aggregation pipeline: exposing aggregated capabilities
via MCP protocol and routing requests to backends.

Components added:
- Router (pkg/vmcp/router/default_router.go): Thread-safe routing
  using RWMutex for capability name → backend target mapping.
  Supports dynamic routing table updates for future backend discovery.

- Virtual MCP Server (pkg/vmcp/server/server.go): Aggregates multiple
  backend MCP servers into unified interface. Uses mark3labs/mcp-go
  to expose tools/list, resources/list, prompts/list automatically.
  Routes incoming requests (tools/call, resources/read, prompts/get)
  to appropriate backends via Router.

Features:
- Automatic MCP protocol endpoint exposure
- Tool/resource/prompt routing with conflict-resolved names
- Request forwarding to backends via BackendClient
- Streamable HTTP transport support
- Graceful startup/shutdown

Testing:
- Router: 100% test coverage including concurrency tests
- Server: Unit tests for configuration, registration, error handling
- Integration tests: Full pipeline (Aggregator→Router→Server)
- All tests passing, linting clean

Architecture:
Client → Virtual MCP Server → Router → BackendClient → Backend MCPs
         (MCP protocol)        (routing) (HTTP)        (MCP servers)

Resolves: #154
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
Fix points 4 and 5 from review:

4. Extract magic number timeouts to constants:
   - Add defaultReadHeaderTimeout (10s) for HTTP server security
   - Add defaultShutdownTimeout (10s) for graceful shutdown
   - Replace hardcoded values with named constants

5. Fix hardcoded MIME type in resource handler:
   - Remove hardcoded "text/plain" MIME type
   - Look up actual MIME type from aggregatedCapabilities.Resources
   - Use "application/octet-stream" as sensible default for unknown types
   - No data duplication - reuses existing aggregatedCapabilities

This is production-ready: works with ANY resource type (JSON, images,
PDFs, etc.) without hardcoding or TODOs. The MIME type comes from the
backend's resource metadata and is properly preserved through the
aggregation pipeline.
Address yrobla's review feedback: Add nil checks for routing table maps
(Tools, Resources, Prompts) in addition to checking if the routing
table itself is nil.

Changes:
- RouteTool: Check routingTable.Tools != nil
- RouteResource: Check routingTable.Resources != nil
- RoutePrompt: Check routingTable.Prompts != nil

Tests added:
- Test case for nil Tools map
- Test case for nil Resources map
- Test case for nil Prompts map

This provides defense-in-depth against malformed routing tables and
gives clearer error messages when maps are unexpectedly nil.
@JAORMX JAORMX force-pushed the vmcp-capability-merging-routing branch from bc80da6 to 9e51a82 Compare October 30, 2025 14:47
JAORMX added a commit that referenced this pull request Oct 30, 2025
Add comprehensive HTTP server timeout configuration including ReadTimeout,
WriteTimeout, IdleTimeout, and MaxHeaderBytes to prevent resource exhaustion
attacks and improve security.

Add server startup validation to the integration test to ensure the server
actually starts listening on the configured port, not just that it's
configured correctly. The test now validates the complete lifecycle including
server startup and shutdown.

Addresses review feedback from yrobla on PR #2376.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@JAORMX JAORMX requested a review from yrobla October 30, 2025 14:48
Add comprehensive HTTP server timeout configuration including ReadTimeout,
WriteTimeout, IdleTimeout, and MaxHeaderBytes to prevent resource exhaustion
attacks and improve security.

Add server startup validation to the integration test to ensure the server
actually starts listening on the configured port, not just that it's
configured correctly. The test now validates the complete lifecycle including
server startup and shutdown.

Addresses review feedback from yrobla on PR #2376.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
@JAORMX JAORMX force-pushed the vmcp-capability-merging-routing branch from 9e51a82 to 0bb2849 Compare October 30, 2025 14:49
@JAORMX JAORMX merged commit 4050531 into main Oct 30, 2025
28 of 29 checks passed
@JAORMX JAORMX deleted the vmcp-capability-merging-routing branch October 30, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants