Skip to content

Extend MCPServer reconciler and deployment builder with spec-driven replicas, stdio cap, session storage validation, and terminationGracePeriodSeconds #4217

@yrobla

Description

@yrobla

Description

Extend deploymentForMCPServer in mcpserver_controller.go to set Deployment.Spec.Replicas from spec.replicas with nil-passthrough (preserving HPA compatibility) and set terminationGracePeriodSeconds on the pod spec. Extend the reconciler to enforce the stdio transport hard cap (force to 1 and set a status condition when spec.transport == stdio && spec.replicas > 1) and emit a Warning status condition when spec.replicas > 1 but session storage is absent or set to memory. These changes implement RC-3 and RC-14 of epic THV-0047 for the MCPServer resource.

Context

Epic THV-0047 adds horizontal scaling support to the ToolHive operator. TASK-001 (#267) and TASK-002 (#269) added the Replicas, BackendReplicas, and SessionStorage fields to the CRD specs; TASK-003 (#275) regenerated the deepcopy functions and CRD YAML manifests, making the new fields available at compile time. This task wires those fields into the MCPServer reconciler and deployment builder so they are actually respected at runtime.

The existing deploymentForMCPServer in mcpserver_controller.go hardcodes replicas := int32(1) and never sets terminationGracePeriodSeconds. The existing reconciler at lines 392–407 handles the stdio cap for externally-driven scaling (HPA or kubectl scale), but has no path for spec-driven replica values. This task adds both the spec-driven replica path and the session storage validation warning.

Dependencies: #275 (TASK-003 — Regenerate deepcopy and CRD manifests after scaling type changes)
Blocks: TASK-008 (Unit Tests)

Acceptance Criteria

  • deploymentForMCPServer sets Deployment.Spec.Replicas to spec.replicas when spec.replicas is non-nil; when spec.replicas is nil, Deployment.Spec.Replicas is left nil (not set to any default), allowing external scaling tools (HPA, KEDA, kubectl scale) to manage replica counts freely
  • deploymentForMCPServer sets TerminationGracePeriodSeconds on PodSpec using the value from a named constant or a documented default (e.g., 30 seconds); the field must appear in the generated Deployment even when spec.replicas is nil
  • The reconciler enforces the stdio hard cap on the spec-driven path: if spec.transport == "stdio" and spec.replicas != nil && *spec.replicas > 1, the reconciler forces replicas to 1 in the created Deployment, sets a Warning status condition with a descriptive message, and requeues
  • The existing externally-driven stdio cap at mcpserver_controller.go lines 392–407 continues to function correctly and is not broken by this change — the two paths (spec-driven vs externally-driven) are distinct
  • The reconciler emits a Warning status condition when spec.replicas != nil && *spec.replicas > 1 and either spec.sessionStorage is nil or spec.sessionStorage.provider is "memory" — the deployment proceeds (no hard rejection); the condition is cleared when the configuration is corrected
  • On config-driven deployment updates, Spec.Replicas is not overwritten (only Spec.Template, Spec.Selector, Labels, and Annotations are updated), following the existing pattern at lines 452–469 — this preserves the nil-passthrough invariant on update
  • New status condition type constants and reason constants for the stdio cap and session storage warning are defined in cmd/thv-operator/api/v1alpha1/mcpserver_types.go, following the existing naming convention
  • go build ./cmd/thv-operator/... passes
  • go test ./cmd/thv-operator/... (excluding integration tests) passes, including all pre-existing tests in mcpserver_replicas_test.go
  • Code reviewed and approved

Technical Approach

Recommended Implementation

1. Add new condition type and reason constants to mcpserver_types.go

Before touching the controller, define the new condition types and reasons. Follow the existing const block pattern in mcpserver_types.go:

// Condition type for stdio replica cap enforcement
const (
    ConditionStdioReplicaCapped = "StdioReplicaCapped"
)
const (
    ConditionReasonStdioReplicaCapped = "StdioTransportCapAt1"
)

// Condition type for session storage warning
const (
    ConditionSessionStorageWarning = "SessionStorageWarning"
)
const (
    ConditionReasonSessionStorageMissing  = "SessionStorageMissingForReplicas"
    ConditionReasonSessionStorageOK       = "SessionStorageConfigured"
)

2. Extend deploymentForMCPServer in mcpserver_controller.go

Replace the hardcoded replicas := int32(1) at line 934 with spec-driven nil-passthrough logic, and add terminationGracePeriodSeconds to the pod spec.

For replicas, the logic is:

  • If spec.replicas != nil and spec.transport == "stdio" and *spec.replicas > 1: use int32Ptr(1) (the spec-driven stdio cap is enforced here in the builder; a status condition is set in the reconciler before this call)
  • If spec.replicas != nil: use spec.replicas directly (pass the pointer through)
  • If spec.replicas == nil: set Deployment.Spec.Replicas to nil (omit it entirely)

In the appsv1.DeploymentSpec struct literal, change Replicas: &replicas to use the resolved value, which may be nil. Add TerminationGracePeriodSeconds on the pod spec:

// In the Deployment Spec.Template.Spec:
TerminationGracePeriodSeconds: int64Ptr(defaultTerminationGracePeriodSeconds),

Define the constant and helper:

const defaultTerminationGracePeriodSeconds = int64(30)

func int64Ptr(i int64) *int64 { return &i }

3. Extend the reconciler's pre-creation / pre-update path

Insert two new validation steps in Reconcile(), after the existing image and template validation and before the deploymentForMCPServer call. These are distinct from the existing externally-driven cap at lines 392–407 (which runs after the deployment is fetched):

Step A — Spec-driven stdio cap: If spec.transport == "stdio" and spec.replicas != nil && *spec.replicas > 1:

  • Set a Warning status condition using meta.SetStatusCondition
  • Continue (allow deployment creation with replicas forced to 1 inside deploymentForMCPServer)

Step B — Session storage validation: If spec.replicas != nil && *spec.replicas > 1:

  • If spec.sessionStorage == nil or spec.sessionStorage.provider != "redis": set a Warning status condition; do not block deployment
  • Otherwise: clear/set the condition to indicate session storage is configured

4. Update the deployment update path

No changes are needed to deploymentNeedsUpdate or the deployment update logic at lines 452–469. Those lines already avoid overwriting Spec.Replicas on update, which is the correct behavior. Verify this is true after the deployment builder changes.

Patterns & Frameworks

  • Existing stdio cap patternmcpserver_controller.go lines 392–407: externally-driven cap. The new spec-driven cap is a separate code path added in deploymentForMCPServer itself and in reconciler pre-creation validation. Do not merge the two paths.
  • Status condition patternmeta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{...}) followed by r.Status().Update(ctx, mcpServer). See setCABundleRefCondition / validateCABundleRef in mcpserver_controller.go lines 553–616 for the exact pattern to follow for helper functions.
  • Nil-passthrough replicas — The deploymentNeedsUpdate function at lines 1505+ already avoids overwriting Spec.Replicas on updates; deploymentForMCPServer must produce a nil Spec.Replicas when spec.replicas == nil so this invariant holds on initial creation too.
  • int32Ptr helper — Already defined at line 1859 of mcpserver_controller.go. Add a companion int64Ptr for TerminationGracePeriodSeconds.
  • Condition update after r.Status().Update — Always call r.Status().Update immediately after meta.SetStatusCondition; log but do not return on status update errors (follow the pattern of existing condition setters).

Code Pointers

  • cmd/thv-operator/controllers/mcpserver_controller.go line 934 — Hardcoded replicas := int32(1); this is the primary change point in deploymentForMCPServer
  • cmd/thv-operator/controllers/mcpserver_controller.go lines 1191–1263 — The full Deployment struct literal in deploymentForMCPServer; Spec.Replicas and Spec.Template.Spec.TerminationGracePeriodSeconds are set here
  • cmd/thv-operator/controllers/mcpserver_controller.go lines 392–407 — Existing externally-driven stdio cap; new spec-driven logic must not disturb this block
  • cmd/thv-operator/controllers/mcpserver_controller.go lines 452–469 — Deployment update path that preserves Spec.Replicas; verify it remains correct after builder changes
  • cmd/thv-operator/controllers/mcpserver_controller.go lines 553–616 — setCABundleRefCondition and validateCABundleRef — pattern for helper condition-setter functions and status update calls
  • cmd/thv-operator/controllers/mcpserver_controller.go line 1859 — int32Ptr helper; add int64Ptr companion here
  • cmd/thv-operator/api/v1alpha1/mcpserver_types.go lines 12–69 — Existing condition type and reason constants; new constants go in new const blocks here following the same convention
  • cmd/thv-operator/controllers/mcpserver_replicas_test.go — Existing replica tests (HPA-scenario); new spec-driven tests for TASK-008 extend this file using the same fake.NewClientBuilder pattern

Component Interfaces

New condition type and reason constants (in mcpserver_types.go):

// ConditionStdioReplicaCapped indicates the stdio transport replica count
// was capped at 1 because spec.replicas was set to a value greater than 1.
const ConditionStdioReplicaCapped = "StdioReplicaCapped"

const (
    // ConditionReasonStdioReplicaCapped is set when spec.replicas > 1 for a stdio transport.
    ConditionReasonStdioReplicaCapped = "StdioTransportCapAt1"
)

// ConditionSessionStorageWarning indicates that replicas > 1 but session storage
// is not configured with a Redis backend.
const ConditionSessionStorageWarning = "SessionStorageWarning"

const (
    // ConditionReasonSessionStorageMissing is set when replicas > 1 and no Redis session storage is configured.
    ConditionReasonSessionStorageMissing = "SessionStorageMissingForReplicas"
    // ConditionReasonSessionStorageConfigured is set when replicas > 1 and Redis session storage is configured.
    ConditionReasonSessionStorageConfigured = "SessionStorageConfigured"
)

Replica resolution logic in deploymentForMCPServer (illustrative shape, not prescriptive):

// resolveDeploymentReplicas returns the replica pointer to set on Deployment.Spec.Replicas.
// Returns nil when spec.replicas is nil (hands-off mode for HPA/KEDA).
// Enforces stdio cap at 1 in the builder as defense-in-depth.
func resolveDeploymentReplicas(transport string, specReplicas *int32) *int32 {
    if specReplicas == nil {
        return nil
    }
    if transport == "stdio" && *specReplicas > 1 {
        return int32Ptr(1)
    }
    return specReplicas
}

Session storage validation call site in Reconcile() (shape only):

// After existing image/template validation, before deploymentForMCPServer call:
r.validateSessionStorageForReplicas(ctx, mcpServer)
func (r *MCPServerReconciler) validateSessionStorageForReplicas(
    ctx context.Context, mcpServer *mcpv1alpha1.MCPServer,
) {
    if mcpServer.Spec.Replicas == nil || *mcpServer.Spec.Replicas <= 1 {
        return
    }
    if mcpServer.Spec.SessionStorage == nil || mcpServer.Spec.SessionStorage.Provider != "redis" {
        meta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{
            Type:               mcpv1alpha1.ConditionSessionStorageWarning,
            Status:             metav1.ConditionTrue,
            Reason:             mcpv1alpha1.ConditionReasonSessionStorageMissing,
            Message:            "replicas > 1 but sessionStorage.provider is not redis; sessions are not shared across replicas",
            ObservedGeneration: mcpServer.Generation,
        })
    } else {
        meta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{
            Type:               mcpv1alpha1.ConditionSessionStorageWarning,
            Status:             metav1.ConditionFalse,
            Reason:             mcpv1alpha1.ConditionReasonSessionStorageConfigured,
            Message:            "Redis session storage is configured",
            ObservedGeneration: mcpServer.Generation,
        })
    }
    if err := r.Status().Update(ctx, mcpServer); err != nil {
        log.FromContext(ctx).Error(err, "Failed to update MCPServer status after session storage validation")
    }
}

Testing Strategy

Tests for this task's behaviors are written in TASK-008. The existing mcpserver_replicas_test.go covers the externally-driven HPA scenario; TASK-008 extends it with spec-driven cases. For reference, the test pattern to follow:

Unit Tests

  • TestSpecDrivenReplicasNildeploymentForMCPServer with spec.replicas == nil produces a Deployment with Spec.Replicas == nil
  • TestSpecDrivenReplicas1deploymentForMCPServer with spec.replicas = 1 produces Spec.Replicas = 1
  • TestSpecDrivenReplicas3SseTransportdeploymentForMCPServer with spec.replicas = 3, spec.transport = "sse" produces Spec.Replicas = 3
  • TestSpecDrivenStdioCapAt1 — reconciler with spec.transport = "stdio" and spec.replicas = 3 creates a Deployment with Spec.Replicas = 1 and sets the StdioReplicaCapped status condition
  • TestSessionStorageWarningConditionSet — reconciler with spec.replicas = 2 and no sessionStorage sets the SessionStorageWarning condition with status True
  • TestSessionStorageWarningConditionCleared — reconciler with spec.replicas = 2 and sessionStorage.provider = "redis" sets the SessionStorageWarning condition with status False
  • TestTerminationGracePeriodSetdeploymentForMCPServer produces a Deployment with Spec.Template.Spec.TerminationGracePeriodSeconds set to the expected default value

Integration Tests

  • Not in scope for this task; integration tests are covered in TASK-008

Edge Cases

  • Verify that the externally-driven stdio cap at lines 392–407 is not triggered when spec.replicas == nil and the live deployment has 1 replica (no spurious requeue)
  • Verify that a deployment update (config change with existing 3-replica deployment) does not overwrite Spec.Replicas when spec.replicas == nil
  • Verify that setting spec.replicas = nil after previously having spec.replicas = 2 does not set Deployment.Spec.Replicas to any value (nil passthrough on update)

Out of Scope

  • RunConfig ConfigMap population with backendReplicas and Redis config — covered by TASK-005
  • VirtualMCPServer deployment builder replica and terminationGracePeriodSeconds changes — covered by TASK-006
  • Redis password env var injection into the proxyrunner pod — covered by TASK-005 (as part of RunConfig) and TASK-006 (vMCP pod)
  • Application-level session storage implementation in proxyrunner — separate epic
  • Horizontal Pod Autoscaler (HPA) object creation or management
  • Redis deployment, provisioning, or lifecycle
  • CLI, UI, or any changes outside cmd/thv-operator/
  • Changes to mcpserver_runconfig.go, virtualmcpserver_deployment.go, or virtualmcpserver_vmcpconfig.go

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestgoPull requests that update go codekubernetesItems related to KubernetesoperatorscalabilityItems related to scalabilityvmcpVirtual MCP Server related issues

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions