Skip to content

[vMCP] RestoreSession reconstructs a partial auth.Identity (Subject only), breaks cross-pod failover with upstream-auth strategies #5336

@jhrozek

Description

@jhrozek

Summary

defaultMultiSessionFactory.RestoreSession at pkg/vmcp/session/factory.go:560-567 reconstructs an *auth.Identity from stored session metadata with only the Subject field populated. Token, UpstreamTokens, TokenType, Metadata, and PrincipalInfo are all zero-valued. That partial identity is then passed through makeBaseSessioninitOneBackendbackend.NewHTTPConnector and captured by identityRoundTripper in pkg/vmcp/session/internal/backend/mcp_session.go:271.

For deployments configured with Redis-backed session storage and replicas > 1 (an opt-in multi-replica topology supported by the operator and documented in docs/arch/13-vmcp-scalability.md), this manifests as a cross-pod failover failure: after RestoreSession runs on a different pod, every backend tool call using upstreamInject / tokenExchange / aws_sts outgoing auth goes out with an empty UpstreamTokens map and returns 401.

Single-replica deployments with the default LocalSessionDataStorage never hit RestoreSession and are unaffected by the visible symptom.

Once #5323 (and its sibling for mcp_session.go's round-tripper) lands, the captured identity is no longer used to override req.Context(), so the visible failure mode goes away. But the RestoreSession code still constructs and propagates the partial identity to anything that reads it — a landmine for any future consumer (decorator, hook, telemetry, audit log) that assumes a non-nil *auth.Identity is fully populated.

This issue is filed to fix the underlying reconstruction itself, not just its current consumer.

Code reference

pkg/vmcp/session/factory.go:560-567:

// Reconstruct a minimal identity from stored metadata. The original bearer
// token is never persisted (only its HMAC-SHA256 hash is), so Token is empty.
// The security decorator is restored from the stored hash/salt below.
var identity *auth.Identity
if subject := storedMetadata[MetadataKeyIdentitySubject]; subject != "" {
    identity = &auth.Identity{}
    identity.Subject = subject
}

The comment acknowledges Token is intentionally empty. What the comment does not mention is that the resulting half-populated *auth.Identity is then handed to consumers that have no way to know it's partial.

Downstream propagation:

  • pkg/vmcp/session/factory.go:580makeBaseSession(ctx, id, identity, …)
  • pkg/vmcp/session/factory.go:446initOneBackend(ctx, b, identity, …)
  • pkg/vmcp/session/factory.go:235f.connector(bCtx, target, identity, sessionHint) (where connector is backend.NewHTTPConnector(registry), wired at factory.go:194)
  • pkg/vmcp/session/internal/backend/mcp_session.go:271identityRoundTripper{base: base, identity: identity} (captures the partial identity)

Why this is a separate concern from #5323 and the sibling

#5323 and the sibling issue for mcp_session.go's identityRoundTripper both fix the consumption of stale/partial captured identity — they teach the round-trippers to defer to the per-request identity on req.Context() instead of overriding it.

This issue is about production of the partial identity. Even with both round-tripper fixes in:

  • The partial identity still flows through makeBaseSession and initOneBackend. Any new code on those paths that reads identity.UpstreamTokens or identity.Token will silently get empty values.
  • The MakeSessionWithID path enforces a guard at pkg/vmcp/session/factory.go:368 that rejects bound sessions where identity.Token == "":
    if (identity == nil || identity.Token == "") && !allowAnonymous {
        return nil, fmt.Errorf("invalid session configuration: cannot create bound session ...")
    }
    RestoreSession bypasses that guard. The invariant "a bound session has a non-empty Token" — explicitly enforced for new sessions — is silently violated for restored ones.
  • The *auth.Identity type carries an immutability contract in its godoc (pkg/auth/identity.go:64: "MUST NOT be mutated after the Identity is placed in the request context") but no contract about field completeness. There's no compile-time or runtime signal that a returned *auth.Identity is partial.

Risk analysis

Scenario Current behavior After #5323 + sibling After this fix
Tool call after cross-pod failover (Redis + replicas > 1, upstream auth) 401 (captured empty identity wins) Works (context identity wins) Works
Future code reads identity.UpstreamTokens from a defaultMultiSession or decorator Silently empty Silently empty Either gets nil identity (explicit) or correct identity
Future code reads identity.Token (e.g., for a new outgoing-auth strategy) Silently empty Silently empty Same
Audit log / telemetry includes identity fields Reports incomplete fields silently Reports incomplete fields silently Same

The pattern matches the project rule "Constructor Validation: Fail Loudly on Invalid Input" (.claude/rules/go-style.md) and the spirit of vMCP anti-pattern #8.

Reproduction

The existing e2e test at test/e2e/thv-operator/virtualmcp/virtualmcp_redis_session_test.go:149-243 ("Should allow a session established on pod A to be reconstructed on pod B") covers cross-pod restore but does not exercise upstream-auth tool calls after restore. Extend it:

  1. Configure a VirtualMCPServer with replicas: 2, Redis session storage, and one or more MCPServerEntry backends using upstreamInject outgoing auth against an upstream IDP.
  2. Authenticate through pod A and exercise a tool call — succeeds.
  3. Kill pod A's vMCP instance (or force the session to evict from pod A's cache).
  4. Make the same tool call against pod B (which will trigger RestoreSession).
    • Expected: pod B's TokenValidator.Middleware populates a fresh identity on req.Context(); the tool call succeeds.
    • Actual (with current code): the partial identity reconstructed by RestoreSession is captured by identityRoundTripper and overrides the fresh context; the upstream returns 401.

Suggested fix

Two options, depending on the team's preference.

Option A — Pass nil and preserve Subject metadata explicitly (recommended)

RestoreSession shouldn't fabricate an *auth.Identity. The Subject metadata can be preserved directly without round-tripping through a partial struct. makeBaseSession already guards all identity dereferences with if identity != nil checks (verified end-to-end: factory.go:491-493, identityRoundTripper.RoundTrip:67, RestoreHijackPrevention takes no identity arg).

// Don't fabricate a partial identity. The original bearer is not persisted;
// downstream consumers must read identity from req.Context() (populated by
// TokenValidator.Middleware on each incoming request).
baseSession, err := f.makeBaseSession(ctx, id, nil, filteredBackends, sessionHints)
if err != nil {
    return nil, fmt.Errorf("RestoreSession: failed to rebuild backend connections: %w", err)
}

// Preserve subject metadata across restore for audit and future re-restores.
// makeBaseSession creates a fresh transport session, so the key isn't carried
// forward implicitly — re-write it from stored metadata.
if subject := storedMetadata[MetadataKeyIdentitySubject]; subject != "" {
    baseSession.SetMetadata(MetadataKeyIdentitySubject, subject)
}

baseSession is *defaultMultiSession, which embeds transportsession.Session and exposes SetMetadata (already used at factory.go:597 for the hijack-prevention hash/salt). No API change.

Option B — Introduce an explicit RestoredIdentity type

If the Subject needs to remain accessible via the same type used in the live-session path, introduce a distinct type that signals "partial":

type RestoredIdentity struct {
    Subject string
}

…and change the signatures that need to accept either. More invasive and adds a type that mostly only RestoreSession produces. Option A is preferred unless there's a concrete consumer that needs the Subject through the same code path as a live *auth.Identity.

Either way: add a doc-comment to auth.Identity clarifying that a non-nil *auth.Identity is always fully populated for its session-binding semantics, and that anonymous / partial states are represented by nil or a distinct type — not by a struct with empty fields.

Acceptance criteria

  • RestoreSession no longer produces a half-populated *auth.Identity and is internally consistent with the Token-required invariant enforced for new sessions in pkg/vmcp/session/factory.go:368.
  • MetadataKeyIdentitySubject is preserved across restore (regression guard — verify it's set on the restored session when present in storedMetadata).
  • auth.Identity godoc states the field-completeness contract.
  • E2E test extension under test/e2e/thv-operator/virtualmcp/virtualmcp_redis_session_test.go: cross-pod restore followed by a tool call against a backend with upstreamInject outgoing auth succeeds end-to-end (depends on [vMCP] identityPropagatingRoundTripper captures stale identity, bypassing per-request upstream token refresh #5323 + sibling fix being in).
  • No path in pkg/vmcp/session/ constructs an &auth.Identity{Subject: …} with Token and UpstreamTokens unset.

Severity

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    authenticationbugSomething isn't workinggoPull requests that update go codevmcpVirtual MCP Server related issues

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions