Skip to content

MCPServer: runconfig-checksum dropped from proxy pod template when PodTemplateMetadataOverrides.Annotations is set #5148

@jhrozek

Description

@jhrozek

Bug: runconfig-checksum is dropped from MCPServer proxy pod template when PodTemplateMetadataOverrides.Annotations is set

Summary

When an MCPServer resource has any value in spec.resourceOverrides.proxyDeployment.podTemplateMetadataOverrides.annotations (for example, the Vault Agent injection annotations documented under the "Vault Agent Injection" use case), the operator-managed toolhive.stacklok.dev/runconfig-checksum annotation is silently dropped from the proxy Deployment's pod template. Because the Deployment controller in Kubernetes only restarts pods when the pod template changes, this breaks the rolling-restart-on-RunConfig-change mechanism for any non-policy RunConfig edit (telemetry, inline OIDC, external auth ref, authz ConfigMap content, etc.).

A secondary symptom: the operator's deploymentNeedsUpdate comparator computes the expected pod-template annotations correctly, while the constructor produces them incorrectly — so the two never match, and the operator enters a perpetual reconcile-and-rewrite loop on every MCPServer that uses pod-template overrides. Each reconcile issues an Update against the Deployment, but the rebuilt template still doesn't carry the checksum, so no real rollout happens.

The companion MCPRemoteProxy controller does this correctly. The defect is isolated to MCPServer.

Severity

Latent for clusters that don't use podTemplateMetadataOverrides.annotations. Active for any cluster that sets that field — Vault Agent injection is the obvious community example, but any operator that injects pod-level annotations (Linkerd, Datadog, OPA Gatekeeper, etc.) is affected.

Reproduction

Minimal env

Any toolhive operator build that includes PodTemplateMetadataOverrides support (current main).

Scenario

  1. Create an MCPServer with pod-template-level annotations set (here using a Vault Agent injection example, but any non-empty map reproduces the bug):

    apiVersion: toolhive.stacklok.dev/v1beta1
    kind: MCPServer
    metadata:
      name: repro
      namespace: default
    spec:
      image: ghcr.io/example/some-mcp:latest
      resourceOverrides:
        proxyDeployment:
          podTemplateMetadataOverrides:
            annotations:
              vault.hashicorp.com/agent-inject: "true"
              vault.hashicorp.com/role: "toolhive-mcp-workloads"
  2. Apply, wait for the operator to converge, then inspect the proxy Deployment's pod template annotations:

    kubectl get deploy repro -n default -o jsonpath='{.spec.template.metadata.annotations}'

    Expected (per the operator's design intent — the comparator's expected map):

    {
      "toolhive.stacklok.dev/runconfig-checksum": "<some-hex>",
      "vault.hashicorp.com/agent-inject": "true",
      "vault.hashicorp.com/role": "toolhive-mcp-workloads"
    }
    

    Observed:

    {
      "vault.hashicorp.com/agent-inject": "true",
      "vault.hashicorp.com/role": "toolhive-mcp-workloads"
    }
    
  3. Edit any RunConfig-affecting resource that doesn't change container.Args — e.g. modify the referenced MCPTelemetryConfig content. Observe that the proxy pod is not restarted, even though the RunConfig content checksum on the operator-managed RunConfig ConfigMap did change.

  4. Watch the operator logs / kubectl get events. You'll see the operator continuously detecting drift and issuing Update calls against the Deployment that don't actually change the live pod template.

Unit-test reproduction

A self-contained unit test against (*MCPServerReconciler).deploymentForMCPServer shows the bug without a cluster. Drop this into cmd/thv-operator/controllers/ (the function is unexported; the test must live in the same package):

//go:build enterprise

package controllers

import (
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"

    mcpv1beta1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1beta1"
    corev1 "k8s.io/api/core/v1"
    "github.com/stacklok/toolhive/cmd/thv-operator/pkg/kubernetes"
    "github.com/stacklok/toolhive/cmd/thv-operator/pkg/runconfig/configmap/checksum"
)

func TestDeploymentForMCPServer_ChecksumDroppedByPodTemplateOverrides_REPRO(t *testing.T) {
    t.Parallel()
    scheme := runtime.NewScheme()
    require.NoError(t, mcpv1beta1.AddToScheme(scheme))
    require.NoError(t, corev1.AddToScheme(scheme))

    client := fake.NewClientBuilder().WithScheme(scheme).Build()
    r := newTestMCPServerReconciler(client, scheme, kubernetes.PlatformKubernetes)

    mcpServer := &mcpv1beta1.MCPServer{
        ObjectMeta: metav1.ObjectMeta{Name: "repro", Namespace: "default"},
        Spec: mcpv1beta1.MCPServerSpec{
            Image: "test:latest",
            ResourceOverrides: &mcpv1beta1.ResourceOverrides{
                ProxyDeployment: &mcpv1beta1.ProxyDeploymentOverrides{
                    PodTemplateMetadataOverrides: &mcpv1beta1.ResourceMetadataOverrides{
                        Annotations: map[string]string{
                            "vault.hashicorp.com/agent-inject": "true",
                        },
                    },
                },
            },
        },
    }

    deployment := r.deploymentForMCPServer(t.Context(), mcpServer, "C1")
    require.NotNil(t, deployment)

    // Currently FAILS — the checksum is missing from the pod template.
    assert.Equal(t, "C1",
        deployment.Spec.Template.Annotations[checksum.RunConfigChecksumAnnotation])
}

(newTestMCPServerReconciler is the existing helper used by mcpserver_resource_overrides_test.go.)

Root cause

In cmd/thv-operator/controllers/mcpserver_controller.go, in deploymentForMCPServer, around line 1221:

deploymentTemplateAnnotations := make(map[string]string)
deploymentTemplateAnnotations = checksum.AddRunConfigChecksumToPodTemplate(
    deploymentTemplateAnnotations, runConfigChecksum)
// ...
if m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations != nil {
    // BUG: first arg should be deploymentTemplateAnnotations (which already
    // has runconfig-checksum stamped), not deploymentAnnotations (which is
    // the deployment-level overrides map, typically empty).
    deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(
        deploymentAnnotations,
        m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations,
    )
}

ctrlutil.MergeAnnotations is MergeStringMaps, defined as "first-arg-wins on conflict":

// pkg/controllerutil/resources.go
func MergeStringMaps(defaultMap, overrideMap map[string]string) map[string]string {
    result := make(map[string]string)
    for k, v := range overrideMap {
        result[k] = v
    }
    for k, v := range defaultMap {
        result[k] = v // default takes precedence
    }
    return result
}

Passing the empty deploymentAnnotations as the first arg means the merged result contains only the user override map; the runconfig-checksum that was already in deploymentTemplateAnnotations is overwritten and lost.

The MCPRemoteProxy sibling at cmd/thv-operator/controllers/mcpremoteproxy_deployment.go (around line 347) gets it right:

templateAnnotations = ctrlutil.MergeAnnotations(templateAnnotations, overrides.Annotations)

The drift-check counterpart deploymentNeedsUpdate at mcpserver_controller.go:~1804 also gets it right (uses expectedPodTemplateAnnotations as the merge base), which is why the constructor/comparator asymmetry produces the perpetual-Update loop.

Suggested fix

One line in cmd/thv-operator/controllers/mcpserver_controller.go:

-                deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(deploymentAnnotations,
-                    m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations)
+                deploymentTemplateAnnotations = ctrlutil.MergeAnnotations(deploymentTemplateAnnotations,
+                    m.Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations)

This brings the constructor in line with both the MCPRemoteProxy sibling and the deploymentNeedsUpdate comparator. With this change:

  • runconfig-checksum always survives the merge.
  • User overrides are still merged in (the user can't accidentally clobber runconfig-checksum even if they tried — first-arg-wins protects it, which is the right default).
  • Constructor and comparator agree, so deploymentNeedsUpdate no longer flips on every reconcile.

Suggested test

In cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go, the existing "with Vault Agent Injection pod template annotations" case (around line 478) currently asserts only on deployment.Labels and deployment.Annotations. Extend it to assert on deployment.Spec.Template.Annotations:

assert.Equal(t,
    map[string]string{
        "vault.hashicorp.com/agent-inject":         "true",
        "vault.hashicorp.com/role":                 "toolhive-mcp-workloads",
        "toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
    },
    deployment.Spec.Template.Annotations,
    "pod template must contain both user overrides and the runconfig-checksum",
)

That single assertion is the long-term guardrail. It catches the current bug, any future regression where a deployment-level annotation leaks into the pod template, and any future regression where the merge is inadvertently disabled.

Sister sites surveyed

File Status
cmd/thv-operator/controllers/mcpserver_controller.go (constructor) Buggy — fix here
cmd/thv-operator/controllers/mcpserver_controller.go (comparator deploymentNeedsUpdate) Correct — no change
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go Correct — use as model
cmd/thv-operator/controllers/virtualmcpserver_deployment.go Out of scope (currently does not honor PodTemplateMetadataOverrides.Annotations; that is a separate gap)
cmd/thv-operator/controllers/embeddingserver_controller.go Unaffected (uses maps.Copy, no RunConfig-checksum path)

Notes for triagers

  • This is a latent defect that was hidden because no in-tree code populates podTemplateMetadataOverrides.annotations. It is reproducible with any user-facing setup that uses pod-level annotations — the Vault Agent example is the canonical one and is the easiest to verify by hand.
  • The fix is purely additive in behavior (it only matters when the override map is non-empty) and is mechanically symmetric with the existing comparator and the existing MCPRemoteProxy implementation.
  • Backports recommended if release branches are maintained: any user setting Vault Agent / Linkerd / similar pod annotations on MCPServer today is silently affected.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggoPull requests that update go codekubernetesItems related to Kubernetesoperator

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions