Bug: runconfig-checksum is dropped from MCPServer proxy pod template when PodTemplateMetadataOverrides.Annotations is set
Summary
When an MCPServer resource has any value in spec.resourceOverrides.proxyDeployment.podTemplateMetadataOverrides.annotations (for example, the Vault Agent injection annotations documented under the "Vault Agent Injection" use case), the operator-managed toolhive.stacklok.dev/runconfig-checksum annotation is silently dropped from the proxy Deployment's pod template. Because the Deployment controller in Kubernetes only restarts pods when the pod template changes, this breaks the rolling-restart-on-RunConfig-change mechanism for any non-policy RunConfig edit (telemetry, inline OIDC, external auth ref, authz ConfigMap content, etc.).
A secondary symptom: the operator's deploymentNeedsUpdate comparator computes the expected pod-template annotations correctly, while the constructor produces them incorrectly — so the two never match, and the operator enters a perpetual reconcile-and-rewrite loop on every MCPServer that uses pod-template overrides. Each reconcile issues an Update against the Deployment, but the rebuilt template still doesn't carry the checksum, so no real rollout happens.
The companion MCPRemoteProxy controller does this correctly. The defect is isolated to MCPServer.
Severity
Latent for clusters that don't use podTemplateMetadataOverrides.annotations. Active for any cluster that sets that field — Vault Agent injection is the obvious community example, but any operator that injects pod-level annotations (Linkerd, Datadog, OPA Gatekeeper, etc.) is affected.
Reproduction
Minimal env
Any toolhive operator build that includes PodTemplateMetadataOverrides support (current main).
Scenario
-
Create an MCPServer with pod-template-level annotations set (here using a Vault Agent injection example, but any non-empty map reproduces the bug):
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPServer
metadata:
name: repro
namespace: default
spec:
image: ghcr.io/example/some-mcp:latest
resourceOverrides:
proxyDeployment:
podTemplateMetadataOverrides:
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "toolhive-mcp-workloads"
-
Apply, wait for the operator to converge, then inspect the proxy Deployment's pod template annotations:
kubectl get deploy repro -n default -o jsonpath='{.spec.template.metadata.annotations}'
Expected (per the operator's design intent — the comparator's expected map):
{
"toolhive.stacklok.dev/runconfig-checksum": "<some-hex>",
"vault.hashicorp.com/agent-inject": "true",
"vault.hashicorp.com/role": "toolhive-mcp-workloads"
}
Observed:
{
"vault.hashicorp.com/agent-inject": "true",
"vault.hashicorp.com/role": "toolhive-mcp-workloads"
}
-
Edit any RunConfig-affecting resource that doesn't change container.Args — e.g. modify the referenced MCPTelemetryConfig content. Observe that the proxy pod is not restarted, even though the RunConfig content checksum on the operator-managed RunConfig ConfigMap did change.
-
Watch the operator logs / kubectl get events. You'll see the operator continuously detecting drift and issuing Update calls against the Deployment that don't actually change the live pod template.
Unit-test reproduction
A self-contained unit test against (*MCPServerReconciler).deploymentForMCPServer shows the bug without a cluster. Drop this into cmd/thv-operator/controllers/ (the function is unexported; the test must live in the same package):
//go:build enterprise
package controllers
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"sigs.k8s.io/controller-runtime/pkg/client/fake"
mcpv1beta1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1beta1"
corev1 "k8s.io/api/core/v1"
"github.com/stacklok/toolhive/cmd/thv-operator/pkg/kubernetes"
"github.com/stacklok/toolhive/cmd/thv-operator/pkg/runconfig/configmap/checksum"
)
func TestDeploymentForMCPServer_ChecksumDroppedByPodTemplateOverrides_REPRO(t *testing.T) {
t.Parallel()
scheme := runtime.NewScheme()
require.NoError(t, mcpv1beta1.AddToScheme(scheme))
require.NoError(t, corev1.AddToScheme(scheme))
client := fake.NewClientBuilder().WithScheme(scheme).Build()
r := newTestMCPServerReconciler(client, scheme, kubernetes.PlatformKubernetes)
mcpServer := &mcpv1beta1.MCPServer{
ObjectMeta: metav1.ObjectMeta{Name: "repro", Namespace: "default"},
Spec: mcpv1beta1.MCPServerSpec{
Image: "test:latest",
ResourceOverrides: &mcpv1beta1.ResourceOverrides{
ProxyDeployment: &mcpv1beta1.ProxyDeploymentOverrides{
PodTemplateMetadataOverrides: &mcpv1beta1.ResourceMetadataOverrides{
Annotations: map[string]string{
"vault.hashicorp.com/agent-inject": "true",
},
},
},
},
},
}
deployment := r.deploymentForMCPServer(t.Context(), mcpServer, "C1")
require.NotNil(t, deployment)
// Currently FAILS — the checksum is missing from the pod template.
assert.Equal(t, "C1",
deployment.Spec.Template.Annotations[checksum.RunConfigChecksumAnnotation])
}
(newTestMCPServerReconciler is the existing helper used by mcpserver_resource_overrides_test.go.)
Root cause
In cmd/thv-operator/controllers/mcpserver_controller.go, in deploymentForMCPServer, around line 1221:
deploymentTemplateAnnotations := make(map[string]string)
deploymentTemplateAnnotations = checksum.AddRunConfigChecksumToPodTemplate(
deploymentTemplateAnnotations, runConfigChecksum)
// ...
if m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations != nil {
// BUG: first arg should be deploymentTemplateAnnotations (which already
// has runconfig-checksum stamped), not deploymentAnnotations (which is
// the deployment-level overrides map, typically empty).
deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(
deploymentAnnotations,
m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations,
)
}
ctrlutil.MergeAnnotations is MergeStringMaps, defined as "first-arg-wins on conflict":
// pkg/controllerutil/resources.go
func MergeStringMaps(defaultMap, overrideMap map[string]string) map[string]string {
result := make(map[string]string)
for k, v := range overrideMap {
result[k] = v
}
for k, v := range defaultMap {
result[k] = v // default takes precedence
}
return result
}
Passing the empty deploymentAnnotations as the first arg means the merged result contains only the user override map; the runconfig-checksum that was already in deploymentTemplateAnnotations is overwritten and lost.
The MCPRemoteProxy sibling at cmd/thv-operator/controllers/mcpremoteproxy_deployment.go (around line 347) gets it right:
templateAnnotations = ctrlutil.MergeAnnotations(templateAnnotations, overrides.Annotations)
The drift-check counterpart deploymentNeedsUpdate at mcpserver_controller.go:~1804 also gets it right (uses expectedPodTemplateAnnotations as the merge base), which is why the constructor/comparator asymmetry produces the perpetual-Update loop.
Suggested fix
One line in cmd/thv-operator/controllers/mcpserver_controller.go:
- deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(deploymentAnnotations,
- m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations)
+ deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(deploymentTemplateAnnotations,
+ m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations)
This brings the constructor in line with both the MCPRemoteProxy sibling and the deploymentNeedsUpdate comparator. With this change:
runconfig-checksum always survives the merge.
- User overrides are still merged in (the user can't accidentally clobber
runconfig-checksum even if they tried — first-arg-wins protects it, which is the right default).
- Constructor and comparator agree, so
deploymentNeedsUpdate no longer flips on every reconcile.
Suggested test
In cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go, the existing "with Vault Agent Injection pod template annotations" case (around line 478) currently asserts only on deployment.Labels and deployment.Annotations. Extend it to assert on deployment.Spec.Template.Annotations:
assert.Equal(t,
map[string]string{
"vault.hashicorp.com/agent-inject": "true",
"vault.hashicorp.com/role": "toolhive-mcp-workloads",
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
},
deployment.Spec.Template.Annotations,
"pod template must contain both user overrides and the runconfig-checksum",
)
That single assertion is the long-term guardrail. It catches the current bug, any future regression where a deployment-level annotation leaks into the pod template, and any future regression where the merge is inadvertently disabled.
Sister sites surveyed
| File |
Status |
cmd/thv-operator/controllers/mcpserver_controller.go (constructor) |
Buggy — fix here |
cmd/thv-operator/controllers/mcpserver_controller.go (comparator deploymentNeedsUpdate) |
Correct — no change |
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go |
Correct — use as model |
cmd/thv-operator/controllers/virtualmcpserver_deployment.go |
Out of scope (currently does not honor PodTemplateMetadataOverrides.Annotations; that is a separate gap) |
cmd/thv-operator/controllers/embeddingserver_controller.go |
Unaffected (uses maps.Copy, no RunConfig-checksum path) |
Notes for triagers
- This is a latent defect that was hidden because no in-tree code populates
podTemplateMetadataOverrides.annotations. It is reproducible with any user-facing setup that uses pod-level annotations — the Vault Agent example is the canonical one and is the easiest to verify by hand.
- The fix is purely additive in behavior (it only matters when the override map is non-empty) and is mechanically symmetric with the existing comparator and the existing
MCPRemoteProxy implementation.
- Backports recommended if release branches are maintained: any user setting Vault Agent / Linkerd / similar pod annotations on
MCPServer today is silently affected.
Bug:
runconfig-checksumis dropped from MCPServer proxy pod template whenPodTemplateMetadataOverrides.Annotationsis setSummary
When an
MCPServerresource has any value inspec.resourceOverrides.proxyDeployment.podTemplateMetadataOverrides.annotations(for example, the Vault Agent injection annotations documented under the "Vault Agent Injection" use case), the operator-managedtoolhive.stacklok.dev/runconfig-checksumannotation is silently dropped from the proxyDeployment's pod template. Because the Deployment controller in Kubernetes only restarts pods when the pod template changes, this breaks the rolling-restart-on-RunConfig-change mechanism for any non-policy RunConfig edit (telemetry, inline OIDC, external auth ref, authz ConfigMap content, etc.).A secondary symptom: the operator's
deploymentNeedsUpdatecomparator computes the expected pod-template annotations correctly, while the constructor produces them incorrectly — so the two never match, and the operator enters a perpetual reconcile-and-rewrite loop on everyMCPServerthat uses pod-template overrides. Each reconcile issues anUpdateagainst the Deployment, but the rebuilt template still doesn't carry the checksum, so no real rollout happens.The companion
MCPRemoteProxycontroller does this correctly. The defect is isolated toMCPServer.Severity
Latent for clusters that don't use
podTemplateMetadataOverrides.annotations. Active for any cluster that sets that field — Vault Agent injection is the obvious community example, but any operator that injects pod-level annotations (Linkerd, Datadog, OPA Gatekeeper, etc.) is affected.Reproduction
Minimal env
Any toolhive operator build that includes
PodTemplateMetadataOverridessupport (currentmain).Scenario
Create an
MCPServerwith pod-template-level annotations set (here using a Vault Agent injection example, but any non-empty map reproduces the bug):Apply, wait for the operator to converge, then inspect the proxy Deployment's pod template annotations:
kubectl get deploy repro -n default -o jsonpath='{.spec.template.metadata.annotations}'Expected (per the operator's design intent — the comparator's expected map):
Observed:
Edit any RunConfig-affecting resource that doesn't change
container.Args— e.g. modify the referencedMCPTelemetryConfigcontent. Observe that the proxy pod is not restarted, even though the RunConfig content checksum on the operator-managed RunConfig ConfigMap did change.Watch the operator logs /
kubectl get events. You'll see the operator continuously detecting drift and issuingUpdatecalls against the Deployment that don't actually change the live pod template.Unit-test reproduction
A self-contained unit test against
(*MCPServerReconciler).deploymentForMCPServershows the bug without a cluster. Drop this intocmd/thv-operator/controllers/(the function is unexported; the test must live in the same package):(
newTestMCPServerReconcileris the existing helper used bymcpserver_resource_overrides_test.go.)Root cause
In
cmd/thv-operator/controllers/mcpserver_controller.go, indeploymentForMCPServer, around line 1221:ctrlutil.MergeAnnotationsisMergeStringMaps, defined as "first-arg-wins on conflict":Passing the empty
deploymentAnnotationsas the first arg means the merged result contains only the user override map; the runconfig-checksum that was already indeploymentTemplateAnnotationsis overwritten and lost.The
MCPRemoteProxysibling atcmd/thv-operator/controllers/mcpremoteproxy_deployment.go(around line 347) gets it right:The drift-check counterpart
deploymentNeedsUpdateatmcpserver_controller.go:~1804also gets it right (usesexpectedPodTemplateAnnotationsas the merge base), which is why the constructor/comparator asymmetry produces the perpetual-Updateloop.Suggested fix
One line in
cmd/thv-operator/controllers/mcpserver_controller.go:This brings the constructor in line with both the
MCPRemoteProxysibling and thedeploymentNeedsUpdatecomparator. With this change:runconfig-checksumalways survives the merge.runconfig-checksumeven if they tried — first-arg-wins protects it, which is the right default).deploymentNeedsUpdateno longer flips on every reconcile.Suggested test
In
cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go, the existing "with Vault Agent Injection pod template annotations" case (around line 478) currently asserts only ondeployment.Labelsanddeployment.Annotations. Extend it to assert ondeployment.Spec.Template.Annotations:That single assertion is the long-term guardrail. It catches the current bug, any future regression where a deployment-level annotation leaks into the pod template, and any future regression where the merge is inadvertently disabled.
Sister sites surveyed
cmd/thv-operator/controllers/mcpserver_controller.go(constructor)cmd/thv-operator/controllers/mcpserver_controller.go(comparatordeploymentNeedsUpdate)cmd/thv-operator/controllers/mcpremoteproxy_deployment.gocmd/thv-operator/controllers/virtualmcpserver_deployment.goPodTemplateMetadataOverrides.Annotations; that is a separate gap)cmd/thv-operator/controllers/embeddingserver_controller.gomaps.Copy, no RunConfig-checksum path)Notes for triagers
podTemplateMetadataOverrides.annotations. It is reproducible with any user-facing setup that uses pod-level annotations — the Vault Agent example is the canonical one and is the easiest to verify by hand.MCPRemoteProxyimplementation.MCPServertoday is silently affected.