Description
Extend `deploymentForMCPServer` in `mcpserver_controller.go` to set `Deployment.Spec.Replicas` from `spec.replicas` with nil-passthrough (preserving HPA compatibility) and to set `terminationGracePeriodSeconds` on the pod spec. Extend the reconciler to enforce the stdio transport hard cap (force replicas to 1 and set a status condition when `spec.transport == stdio && spec.replicas > 1`) and to emit a `Warning` status condition when `spec.replicas > 1` but session storage is absent or set to `memory`. These changes implement RC-3 and RC-14 of epic THV-0047 for the MCPServer resource.
Context
Epic THV-0047 adds horizontal scaling support to the ToolHive operator. TASK-001 (#267) and TASK-002 (#269) added the `Replicas`, `BackendReplicas`, and `SessionStorage` fields to the CRD specs; TASK-003 (#275) regenerated the deepcopy functions and CRD YAML manifests, making the new fields available at compile time. This task wires those fields into the MCPServer reconciler and deployment builder so they are actually respected at runtime.
The existing `deploymentForMCPServer` in `mcpserver_controller.go` hardcodes `replicas := int32(1)` and never sets `terminationGracePeriodSeconds`. The existing reconciler at lines 392–407 handles the stdio cap for externally-driven scaling (HPA or `kubectl scale`), but has no path for spec-driven replica values. This task adds both the spec-driven replica path and the session storage validation warning.
Dependencies: #275 (TASK-003 — Regenerate deepcopy and CRD manifests after scaling type changes)
Blocks: TASK-008 (Unit Tests)
Acceptance Criteria
- `deploymentForMCPServer` sets `Deployment.Spec.Replicas` to `spec.replicas` when `spec.replicas` is non-nil; when `spec.replicas` is nil, `Deployment.Spec.Replicas` is left nil (not set to any default), allowing external scaling tools (HPA, KEDA, `kubectl scale`) to manage replica counts freely
- `deploymentForMCPServer` sets `TerminationGracePeriodSeconds` on `PodSpec` using the value from a named constant or a documented default (e.g., 30 seconds); the field must appear in the generated `Deployment` even when `spec.replicas` is nil
- The reconciler enforces the stdio hard cap on the spec-driven path: if `spec.transport == "stdio"` and `spec.replicas != nil && *spec.replicas > 1`, the reconciler forces replicas to 1 in the created `Deployment`, sets a `Warning` status condition with a descriptive message, and requeues
- The existing externally-driven stdio cap at `mcpserver_controller.go` lines 392–407 continues to function correctly and is not broken by this change — the two paths (spec-driven vs externally-driven) are distinct
- The reconciler emits a `Warning` status condition when `spec.replicas != nil && *spec.replicas > 1` and either `spec.sessionStorage` is nil or `spec.sessionStorage.provider` is `"memory"` — the deployment proceeds (no hard rejection); the condition is cleared when the configuration is corrected
- On config-driven deployment updates, `Spec.Replicas` is not overwritten (only `Spec.Template`, `Spec.Selector`, `Labels`, and `Annotations` are updated), following the existing pattern at lines 452–469 — this preserves the nil-passthrough invariant on update
- New status condition type constants and reason constants for the stdio cap and session storage warning are defined in `cmd/thv-operator/api/v1alpha1/mcpserver_types.go`, following the existing naming convention
- `go build ./cmd/thv-operator/...` passes
- `go test ./cmd/thv-operator/...` (excluding integration tests) passes, including all pre-existing tests in `mcpserver_replicas_test.go`
- Code reviewed and approved
Technical Approach
Recommended Implementation
1. Add new condition type and reason constants to mcpserver_types.go
Before touching the controller, define the new condition types and reasons. Follow the existing const block pattern in mcpserver_types.go:
```go
// Condition type for stdio replica cap enforcement
const (
	ConditionStdioReplicaCapped = "StdioReplicaCapped"
)

const (
	ConditionReasonStdioReplicaCapped = "StdioTransportCapAt1"
)

// Condition type for session storage warning
const (
	ConditionSessionStorageWarning = "SessionStorageWarning"
)

const (
	ConditionReasonSessionStorageMissing    = "SessionStorageMissingForReplicas"
	ConditionReasonSessionStorageConfigured = "SessionStorageConfigured"
)
```

2. Extend `deploymentForMCPServer` in `mcpserver_controller.go`
Replace the hardcoded `replicas := int32(1)` at line 934 with spec-driven nil-passthrough logic, and add `terminationGracePeriodSeconds` to the pod spec.
For replicas, the logic is:
- If `spec.replicas != nil` and `spec.transport == "stdio"` and `*spec.replicas > 1`: use `int32Ptr(1)` (the spec-driven stdio cap is enforced here in the builder; a status condition is set in the reconciler before this call)
- If `spec.replicas != nil`: use `spec.replicas` directly (pass the pointer through)
- If `spec.replicas == nil`: set `Deployment.Spec.Replicas` to nil (omit it entirely)
In the `appsv1.DeploymentSpec` struct literal, change `Replicas: &replicas` to use the resolved value, which may be nil. Add `TerminationGracePeriodSeconds` on the pod spec:

```go
// In the Deployment Spec.Template.Spec:
TerminationGracePeriodSeconds: int64Ptr(defaultTerminationGracePeriodSeconds),
```

Define the constant and helper:

```go
const defaultTerminationGracePeriodSeconds = int64(30)

func int64Ptr(i int64) *int64 { return &i }
```

3. Extend the reconciler's pre-creation / pre-update path
Insert two new validation steps in `Reconcile()`, after the existing image and template validation and before the `deploymentForMCPServer` call. These are distinct from the existing externally-driven cap at lines 392–407 (which runs after the deployment is fetched):
Step A — Spec-driven stdio cap: If `spec.transport == "stdio"` and `spec.replicas != nil && *spec.replicas > 1`:
- Set a `Warning` status condition using `meta.SetStatusCondition`
- Continue (allow deployment creation with replicas forced to 1 inside `deploymentForMCPServer`)

Step B — Session storage validation: If `spec.replicas != nil && *spec.replicas > 1`:
- If `spec.sessionStorage == nil` or `spec.sessionStorage.provider != "redis"`: set a `Warning` status condition; do not block deployment
- Otherwise: clear/set the condition to indicate session storage is configured
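Step A's trigger check can be factored into a small pure helper so it is unit-testable without a fake client. A minimal sketch (`stdioCapTriggered` and its message text are hypothetical, not existing code in the repository):

```go
package main

import "fmt"

// stdioCapTriggered reports whether the spec-driven stdio cap applies:
// transport is stdio and the user asked for more than one replica. It also
// returns a human-readable message suitable for the Warning condition.
// Hypothetical helper; names and wording are illustrative only.
func stdioCapTriggered(transport string, specReplicas *int32) (bool, string) {
	if transport != "stdio" || specReplicas == nil || *specReplicas <= 1 {
		return false, ""
	}
	msg := fmt.Sprintf(
		"spec.replicas is %d but stdio transport supports at most 1 replica; forcing replicas to 1",
		*specReplicas,
	)
	return true, msg
}

func main() {
	three := int32(3)
	capped, msg := stdioCapTriggered("stdio", &three)
	fmt.Println(capped) // true
	fmt.Println(msg)

	capped, _ = stdioCapTriggered("sse", &three)
	fmt.Println(capped) // false
}
```

Keeping the trigger check pure means the reconciler only has to call `meta.SetStatusCondition` with the returned message, and the branch logic itself can be covered by plain table tests in TASK-008.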
4. Update the deployment update path
No changes are needed to `deploymentNeedsUpdate` or the deployment update logic at lines 452–469. Those lines already avoid overwriting `Spec.Replicas` on update, which is the correct behavior. Verify this is still true after the deployment builder changes.
Patterns & Frameworks
- Existing stdio cap pattern — `mcpserver_controller.go` lines 392–407: externally-driven cap. The new spec-driven cap is a separate code path added in `deploymentForMCPServer` itself and in reconciler pre-creation validation. Do not merge the two paths.
- Status condition pattern — `meta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{...})` followed by `r.Status().Update(ctx, mcpServer)`. See `setCABundleRefCondition` / `validateCABundleRef` in `mcpserver_controller.go` lines 553–616 for the exact pattern to follow for helper functions.
- Nil-passthrough replicas — The `deploymentNeedsUpdate` function at lines 1505+ already avoids overwriting `Spec.Replicas` on updates; `deploymentForMCPServer` must produce a nil `Spec.Replicas` when `spec.replicas == nil` so this invariant holds on initial creation too.
- `int32Ptr` helper — Already defined at line 1859 of `mcpserver_controller.go`. Add a companion `int64Ptr` for `TerminationGracePeriodSeconds`.
- Status update after condition set — Always call `r.Status().Update` immediately after `meta.SetStatusCondition`; log but do not return on status update errors (follow the pattern of existing condition setters).
Code Pointers
- `cmd/thv-operator/controllers/mcpserver_controller.go` line 934 — Hardcoded `replicas := int32(1)`; this is the primary change point in `deploymentForMCPServer`
- `cmd/thv-operator/controllers/mcpserver_controller.go` lines 1191–1263 — The full `Deployment` struct literal in `deploymentForMCPServer`; `Spec.Replicas` and `Spec.Template.Spec.TerminationGracePeriodSeconds` are set here
- `cmd/thv-operator/controllers/mcpserver_controller.go` lines 392–407 — Existing externally-driven stdio cap; new spec-driven logic must not disturb this block
- `cmd/thv-operator/controllers/mcpserver_controller.go` lines 452–469 — Deployment update path that preserves `Spec.Replicas`; verify it remains correct after builder changes
- `cmd/thv-operator/controllers/mcpserver_controller.go` lines 553–616 — `setCABundleRefCondition` and `validateCABundleRef` — pattern for helper condition-setter functions and status update calls
- `cmd/thv-operator/controllers/mcpserver_controller.go` line 1859 — `int32Ptr` helper; add the `int64Ptr` companion here
- `cmd/thv-operator/api/v1alpha1/mcpserver_types.go` lines 12–69 — Existing condition type and reason constants; new constants go in new `const` blocks here following the same convention
- `cmd/thv-operator/controllers/mcpserver_replicas_test.go` — Existing replica tests (HPA scenario); new spec-driven tests for TASK-008 extend this file using the same `fake.NewClientBuilder` pattern
Component Interfaces
New condition type and reason constants (in mcpserver_types.go):
```go
// ConditionStdioReplicaCapped indicates the stdio transport replica count
// was capped at 1 because spec.replicas was set to a value greater than 1.
const ConditionStdioReplicaCapped = "StdioReplicaCapped"

const (
	// ConditionReasonStdioReplicaCapped is set when spec.replicas > 1 for a stdio transport.
	ConditionReasonStdioReplicaCapped = "StdioTransportCapAt1"
)

// ConditionSessionStorageWarning indicates that replicas > 1 but session storage
// is not configured with a Redis backend.
const ConditionSessionStorageWarning = "SessionStorageWarning"

const (
	// ConditionReasonSessionStorageMissing is set when replicas > 1 and no Redis session storage is configured.
	ConditionReasonSessionStorageMissing = "SessionStorageMissingForReplicas"
	// ConditionReasonSessionStorageConfigured is set when replicas > 1 and Redis session storage is configured.
	ConditionReasonSessionStorageConfigured = "SessionStorageConfigured"
)
```

Replica resolution logic in `deploymentForMCPServer` (illustrative shape, not prescriptive):
```go
// resolveDeploymentReplicas returns the replica pointer to set on Deployment.Spec.Replicas.
// Returns nil when spec.replicas is nil (hands-off mode for HPA/KEDA).
// Enforces stdio cap at 1 in the builder as defense-in-depth.
func resolveDeploymentReplicas(transport string, specReplicas *int32) *int32 {
	if specReplicas == nil {
		return nil
	}
	if transport == "stdio" && *specReplicas > 1 {
		return int32Ptr(1)
	}
	return specReplicas
}
```

Session storage validation call site in `Reconcile()` (shape only):
```go
// After existing image/template validation, before the deploymentForMCPServer call:
r.validateSessionStorageForReplicas(ctx, mcpServer)
```

```go
func (r *MCPServerReconciler) validateSessionStorageForReplicas(
	ctx context.Context, mcpServer *mcpv1alpha1.MCPServer,
) {
	if mcpServer.Spec.Replicas == nil || *mcpServer.Spec.Replicas <= 1 {
		return
	}
	if mcpServer.Spec.SessionStorage == nil || mcpServer.Spec.SessionStorage.Provider != "redis" {
		meta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{
			Type:               mcpv1alpha1.ConditionSessionStorageWarning,
			Status:             metav1.ConditionTrue,
			Reason:             mcpv1alpha1.ConditionReasonSessionStorageMissing,
			Message:            "replicas > 1 but sessionStorage.provider is not redis; sessions are not shared across replicas",
			ObservedGeneration: mcpServer.Generation,
		})
	} else {
		meta.SetStatusCondition(&mcpServer.Status.Conditions, metav1.Condition{
			Type:               mcpv1alpha1.ConditionSessionStorageWarning,
			Status:             metav1.ConditionFalse,
			Reason:             mcpv1alpha1.ConditionReasonSessionStorageConfigured,
			Message:            "Redis session storage is configured",
			ObservedGeneration: mcpServer.Generation,
		})
	}
	if err := r.Status().Update(ctx, mcpServer); err != nil {
		log.FromContext(ctx).Error(err, "Failed to update MCPServer status after session storage validation")
	}
}
```

Testing Strategy
Tests for this task's behaviors are written in TASK-008. The existing `mcpserver_replicas_test.go` covers the externally-driven HPA scenario; TASK-008 extends it with spec-driven cases. For reference, the test pattern to follow:
Unit Tests
- `TestSpecDrivenReplicasNil` — `deploymentForMCPServer` with `spec.replicas == nil` produces a `Deployment` with `Spec.Replicas == nil`
- `TestSpecDrivenReplicas1` — `deploymentForMCPServer` with `spec.replicas = 1` produces `Spec.Replicas = 1`
- `TestSpecDrivenReplicas3SseTransport` — `deploymentForMCPServer` with `spec.replicas = 3`, `spec.transport = "sse"` produces `Spec.Replicas = 3`
- `TestSpecDrivenStdioCapAt1` — reconciler with `spec.transport = "stdio"` and `spec.replicas = 3` creates a `Deployment` with `Spec.Replicas = 1` and sets the `StdioReplicaCapped` status condition
- `TestSessionStorageWarningConditionSet` — reconciler with `spec.replicas = 2` and no `sessionStorage` sets the `SessionStorageWarning` condition with status `True`
- `TestSessionStorageWarningConditionCleared` — reconciler with `spec.replicas = 2` and `sessionStorage.provider = "redis"` sets the `SessionStorageWarning` condition with status `False`
- `TestTerminationGracePeriodSet` — `deploymentForMCPServer` produces a `Deployment` with `Spec.Template.Spec.TerminationGracePeriodSeconds` set to the expected default value
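The builder-level cases above can be dry-run outside the cluster. A minimal table-driven sketch, assuming the `resolveDeploymentReplicas` shape shown under Component Interfaces actually lands (the function is illustrative until then):

```go
package main

import "fmt"

func int32Ptr(i int32) *int32 { return &i }

// resolveDeploymentReplicas mirrors the builder logic sketched above:
// nil passes through, stdio caps at 1, everything else is returned as-is.
func resolveDeploymentReplicas(transport string, specReplicas *int32) *int32 {
	if specReplicas == nil {
		return nil
	}
	if transport == "stdio" && *specReplicas > 1 {
		return int32Ptr(1)
	}
	return specReplicas
}

func main() {
	cases := []struct {
		name      string
		transport string
		in        *int32
		want      *int32
	}{
		{"nil passthrough", "sse", nil, nil},
		{"explicit 1", "stdio", int32Ptr(1), int32Ptr(1)},
		{"sse 3 untouched", "sse", int32Ptr(3), int32Ptr(3)},
		{"stdio 3 capped", "stdio", int32Ptr(3), int32Ptr(1)},
	}
	for _, c := range cases {
		got := resolveDeploymentReplicas(c.transport, c.in)
		ok := (got == nil && c.want == nil) ||
			(got != nil && c.want != nil && *got == *c.want)
		fmt.Printf("%s: ok=%v\n", c.name, ok) // all four cases print ok=true
	}
}
```

Converting this table into a standard `testing.T` table test in `mcpserver_replicas_test.go` is mechanical; the fake-client setup is only needed for the reconciler-level cases.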
Integration Tests
- Not in scope for this task; integration tests are covered in TASK-008
Edge Cases
- Verify that the externally-driven stdio cap at lines 392–407 is not triggered when `spec.replicas == nil` and the live deployment has 1 replica (no spurious requeue)
- Verify that a deployment update (config change with an existing 3-replica deployment) does not overwrite `Spec.Replicas` when `spec.replicas == nil`
- Verify that setting `spec.replicas = nil` after previously having `spec.replicas = 2` does not set `Deployment.Spec.Replicas` to any value (nil passthrough on update)
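The second and third edge cases reduce to one invariant: the update path copies everything except `Replicas`. A stand-in sketch of that invariant (the real code operates on `appsv1.Deployment`; `deploymentStub` and `applyConfigUpdate` here are hypothetical simplifications):

```go
package main

import "fmt"

// deploymentStub stands in for appsv1.Deployment in this sketch.
type deploymentStub struct {
	Replicas *int32            // managed externally (HPA/kubectl) when spec.replicas is nil
	Template string            // stands in for Spec.Template
	Labels   map[string]string // stands in for Labels/Annotations/Selector
}

// applyConfigUpdate mirrors the update path described at lines 452–469:
// copy template and metadata from the desired state, but never touch Replicas.
func applyConfigUpdate(live, desired *deploymentStub) {
	live.Template = desired.Template
	live.Labels = desired.Labels
	// Deliberately no assignment to live.Replicas: an externally scaled
	// value (e.g. 3) survives a config-driven update when spec.replicas
	// is nil, because the desired object carries a nil Replicas.
}

func main() {
	three := int32(3)
	live := &deploymentStub{Replicas: &three, Template: "v1"}
	desired := &deploymentStub{Replicas: nil, Template: "v2"} // built with spec.replicas == nil
	applyConfigUpdate(live, desired)
	fmt.Println(*live.Replicas, live.Template) // 3 v2
}
```

The same shape covers the third edge case: flipping `spec.replicas` from 2 back to nil produces a desired object with nil `Replicas`, and the update leaves the live count untouched.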
Out of Scope
- RunConfig ConfigMap population with `backendReplicas` and Redis config — covered by TASK-005
- VirtualMCPServer deployment builder replica and `terminationGracePeriodSeconds` changes — covered by TASK-006
- Redis password env var injection into the proxyrunner pod — covered by TASK-005 (as part of RunConfig) and TASK-006 (vMCP pod)
- Application-level session storage implementation in proxyrunner — separate epic
- Horizontal Pod Autoscaler (HPA) object creation or management
- Redis deployment, provisioning, or lifecycle
- CLI, UI, or any changes outside `cmd/thv-operator/`
- Changes to `mcpserver_runconfig.go`, `virtualmcpserver_deployment.go`, or `virtualmcpserver_vmcpconfig.go`
References
- Epic THV-0047: Horizontal Scaling - CRD and Operator Changes — https://github.com/stacklok/stacklok-epics/issues/264
- Upstream TASK-003: Regenerate deepcopy and CRD manifests after scaling type changes — https://github.com/stacklok/stacklok-epics/issues/275
- RFC THV-0047: Manual Horizontal Scaling for vMCP and Proxy Runner — toolhive-rfcs#47
- Existing stdio cap logic: `cmd/thv-operator/controllers/mcpserver_controller.go` lines 392–407
- Existing deployment update logic (replica preservation): `cmd/thv-operator/controllers/mcpserver_controller.go` lines 452–469
- Existing condition setter pattern: `cmd/thv-operator/controllers/mcpserver_controller.go` lines 553–616
- Existing replica test file: `cmd/thv-operator/controllers/mcpserver_replicas_test.go`