Gate proxyrunner StatefulSet apply by MCPServer generation #5024

Merged
JAORMX merged 3 commits into main from phase1-checksum-gate-ssa
Apr 23, 2026

@JAORMX JAORMX commented Apr 23, 2026

Summary

During a rolling update of the proxy Deployment, two proxyrunner pods are alive at the same time: the old-RS pod with the old args, and the new-RS pod with the new args. Both call DeployWorkload → server-side apply on the backing StatefulSet as the same field manager with Force: true. The stale writer can land after the fresh writer and clobber the image back to the old digest — the StatefulSet then re-rolls its pod onto the old image and stays there indefinitely. Confirmed in production on v0.24.0; details and timeline in #5023.

  • Operator stamps MCPServer.metadata.generation into the rendered RunConfig as a monotonic version.
  • Proxyrunner threads it through to DeployWorkloadOptions and into the K8s runtime.
  • Before Apply, the runtime reads the existing StatefulSet's toolhive.stacklok.dev/mcpserver-generation pod-template annotation; if it is strictly greater than ours, the apply is skipped. Otherwise our generation is stamped on the pod template and apply proceeds as today.
  • Zero generation is the backward-compat signal: bypasses both the gate and the stamp, preserving existing behavior for callers that don't pass one (old operator → new proxyrunner, Docker runtime, tests).
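
The race and the gate described above can be illustrated with a toy model. This is a minimal sketch, not the PR's code: `store`, `apply`, and `gatedApply` are hypothetical names, and an in-memory struct stands in for the StatefulSet and the forced server-side apply.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the StatefulSet: it records
// the image and the generation annotation of the last accepted write.
type store struct {
	image      string
	generation int64
}

// apply mimics a forced apply by a single field manager: last writer wins.
func (s *store) apply(image string, gen int64) {
	s.image = image
	s.generation = gen
}

// gatedApply skips the write when the stored generation is strictly newer
// than the caller's. A zero generation bypasses both the gate and the stamp.
func (s *store) gatedApply(image string, gen int64) bool {
	if gen == 0 {
		s.image = image // unversioned caller: write, but leave the stamp alone
		return true
	}
	if s.generation > gen {
		return false // a newer generation already landed; do not clobber it
	}
	s.apply(image, gen)
	return true
}

func main() {
	ungated := &store{}
	ungated.apply("img@new", 2) // fresh pod writes first
	ungated.apply("img@old", 1) // stale pod lands later and clobbers the image
	fmt.Println("ungated:", ungated.image)

	gated := &store{}
	gated.gatedApply("img@new", 2)
	applied := gated.gatedApply("img@old", 1) // stale write is skipped
	fmt.Println("gated:", gated.image, "stale applied:", applied)
}
```

The model is single-threaded on purpose; it only shows the ordering problem and the comparison that resolves it.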

Fixes #5023

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

New table-driven test TestDeployWorkload_RunConfigMCPServerGenerationGate covers the seven gate cases: absent STS, existing STS with (missing / older / equal / strictly newer / unparseable) annotation, zero options generation (backward-compat). Round-trip JSON test for MCPServerGeneration with omitempty behavior. Operator-side test that createRunConfigFromMCPServer sets the generation from the CR. Three pre-existing operator determinism tests continue to pass; they were the reason this PR pivoted from a time.Time RenderedAt field to int64 MCPServerGeneration.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

No CRD or REST API changes. The RunConfig JSON schema gains an omitempty integer field — old consumers ignore unknown fields; new consumers see zero when reading old ConfigMaps.

Does this introduce a user-facing change?

No behavior change in the common case. Users will see a new pod-template annotation toolhive.stacklok.dev/mcpserver-generation on StatefulSets backing MCPServers created by upgraded operators. Users debugging a skipped apply will find a DEBUG log line on the proxyrunner pod: skipping StatefulSet apply; newer MCPServer generation already applied.

Implementation plan

Approved implementation plan

Design

Monotonic version carried end-to-end from the CR to the StatefulSet annotation.

  1. RunConfig gains MCPServerGeneration int64 with JSON tag mcpserver_generation,omitempty. Zero is the unversioned / backward-compat signal.
  2. WithMCPServerGeneration(gen int64) builder option.
  3. Operator sets runner.WithMCPServerGeneration(m.Generation) in createRunConfigFromMCPServer.
  4. DeployWorkloadOptions.RunConfigMCPServerGeneration int64 carries it through the runtime seam.
  5. runner.Run → runtime.Setup → k8s DeployWorkload plumb the value.
  6. K8s runtime adds three helpers:
    • runConfigGeneration(options): safe extract with nil-options handling.
    • shouldSkipStatefulSetApply(ctx, ns, name, ourGen): Get the STS, parse the annotation, return true when theirs > ours. Parse errors warn and fall through. Not-found returns false. Zero ourGen returns false (no gate).
    • applyStatefulSet(...): stamps the annotation when ourGen > 0, builds the apply-config, performs SSA.
  7. New annotation constant RunConfigMCPServerGenerationAnnotation = "toolhive.stacklok.dev/mcpserver-generation" alongside serviceFieldManager. Doc comment calls out that the gate only becomes effective once proxyrunner is upgraded.
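
The plumbing in steps 1 to 6 can be sketched as follows. The types and function bodies here are assumptions written for illustration: the lowercase names mirror the identifiers listed above, but the real signatures and fields in the repository may differ.

```go
package main

import "fmt"

// runConfig is a hypothetical stand-in holding only the new field.
type runConfig struct {
	MCPServerGeneration int64
}

// option is an assumed functional-option shape for the RunConfig builder.
type option func(*runConfig) error

// withMCPServerGeneration mirrors the builder option named in step 2.
func withMCPServerGeneration(gen int64) option {
	return func(rc *runConfig) error {
		rc.MCPServerGeneration = gen
		return nil
	}
}

// newRunConfig applies options in order and stops at the first error.
func newRunConfig(opts ...option) (*runConfig, error) {
	rc := &runConfig{}
	for _, opt := range opts {
		if err := opt(rc); err != nil {
			return nil, err
		}
	}
	return rc, nil
}

// deployWorkloadOptions carries the generation across the runtime seam.
type deployWorkloadOptions struct {
	RunConfigMCPServerGeneration int64
}

// runConfigGeneration is the nil-safe extract: nil options mean "unversioned".
func runConfigGeneration(o *deployWorkloadOptions) int64 {
	if o == nil {
		return 0
	}
	return o.RunConfigMCPServerGeneration
}

func main() {
	// Operator side: stamp the CR generation into the rendered config.
	rc, err := newRunConfig(withMCPServerGeneration(3))
	if err != nil {
		panic(err)
	}

	// Proxyrunner side: thread it into the deploy options.
	opts := &deployWorkloadOptions{RunConfigMCPServerGeneration: rc.MCPServerGeneration}
	fmt.Println(runConfigGeneration(opts), runConfigGeneration(nil))
}
```

Returning zero for nil options keeps the backward-compat path and the "caller passed nothing" path identical, which is why the gate only needs a single zero check.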

Invariants

  • Ties (equal generation) → apply (idempotent).
  • No annotation on existing STS → apply + stamp.
  • Strictly newer annotation → skip, return (0, nil), log DEBUG.
  • Unparseable annotation → log WARN, apply + re-stamp with ours.
  • Zero ourGen → apply without gate, without stamp.
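
The invariants above reduce to one pure decision over the existing annotations. The sketch below is an illustration only: `shouldSkip` is a hypothetical helper that takes the pod-template annotations directly (nil when the StatefulSet does not exist) rather than calling the Kubernetes API as the real helper does.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation key as given in the design section.
const generationAnnotation = "toolhive.stacklok.dev/mcpserver-generation"

// shouldSkip reports whether the apply should be skipped. A non-nil warn
// means the annotation was unparseable; the caller logs it and applies anyway.
func shouldSkip(existing map[string]string, ourGen int64) (skip bool, warn error) {
	if ourGen == 0 {
		return false, nil // unversioned caller: no gate
	}
	raw, ok := existing[generationAnnotation] // reading a nil map is safe
	if !ok {
		return false, nil // absent STS or missing annotation: apply + stamp
	}
	theirs, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false, fmt.Errorf("unparseable %s=%q: %w", generationAnnotation, raw, err)
	}
	return theirs > ourGen, nil // ties apply; only strictly newer skips
}

func main() {
	cases := []struct {
		name     string
		existing map[string]string
		ourGen   int64
	}{
		{"absent STS", nil, 2},
		{"missing annotation", map[string]string{}, 2},
		{"older", map[string]string{generationAnnotation: "1"}, 2},
		{"equal", map[string]string{generationAnnotation: "2"}, 2},
		{"strictly newer", map[string]string{generationAnnotation: "3"}, 2},
		{"unparseable", map[string]string{generationAnnotation: "abc"}, 2},
		{"zero ourGen", map[string]string{generationAnnotation: "3"}, 0},
	}
	for _, c := range cases {
		skip, warn := shouldSkip(c.existing, c.ourGen)
		fmt.Printf("%-20s skip=%v warn=%v\n", c.name, skip, warn != nil)
	}
}
```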

Pivot during implementation

The first draft used RenderedAt time.Time = time.Now().UTC(). Two problems surfaced during test validation:

  1. time.Now() made createRunConfigFromMCPServer non-deterministic. The operator's ConfigMap content-checksum would have changed every reconcile, which would have flipped the proxy Deployment's pod-template annotation and caused the proxy to roll on every reconcile — defeating the entire system. Three pre-existing determinism tests regressed to prove it.
  2. omitempty does not drop a zero time.Time (encoding/json never treats a struct value as empty), so backward-compat output would have contained "0001-01-01T00:00:00Z".

Pivoted to int64 MCPServerGeneration sourced from MCPServer.metadata.generation. omitempty works on int64. Generation is monotonic (K8s-enforced) and only bumps on spec changes, so RunConfig rendering is deterministic across no-op reconciles. The three determinism tests pass unchanged.
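
The omitempty difference behind the pivot is easy to reproduce with the standard library. In this sketch the wrapper structs and the `rendered_at` JSON tag are hypothetical; only the field names and the `mcpserver_generation` tag come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// withTime models the first draft: a timestamp tagged omitempty.
type withTime struct {
	RenderedAt time.Time `json:"rendered_at,omitempty"`
}

// withGen models the final design: an int64 tagged omitempty.
type withGen struct {
	MCPServerGeneration int64 `json:"mcpserver_generation,omitempty"`
}

// marshal encodes v and panics on the (unexpected) error path.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// time.Time is a struct, so omitempty never considers it empty.
	fmt.Println(marshal(withTime{})) // {"rendered_at":"0001-01-01T00:00:00Z"}

	// A zero int64 is empty, so the field disappears from the output.
	fmt.Println(marshal(withGen{})) // {}
}
```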

Scope boundary

This PR fixes the image / pod-spec dimension of the race. It does not cover spec.replicas clobbering (the scaling dimension tracked in #4484). Phase 2 — moving StatefulSet ownership into the operator — is the complete architectural fix and would close both. Out of scope here.

Special notes for reviewers

Generated with Claude Code

@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label Apr 23, 2026
@JAORMX JAORMX force-pushed the phase1-checksum-gate-ssa branch from b77f444 to 5a4707d Compare April 23, 2026 08:11
@JAORMX JAORMX force-pushed the phase1-checksum-gate-ssa branch from 5a4707d to 2f45b9a Compare April 23, 2026 08:12

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.13%. Comparing base (bbe8b85) to head (dfd9944).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/container/kubernetes/client.go | 82.85% | 9 Missing and 3 partials ⚠️ |
| pkg/runtime/setup.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5024      +/-   ##
==========================================
+ Coverage   69.10%   69.13%   +0.03%     
==========================================
  Files         556      556              
  Lines       73283    73348      +65     
==========================================
+ Hits        50641    50712      +71     
+ Misses      19627    19618       -9     
- Partials     3015     3018       +3     

@yrobla yrobla left a comment

Good implementation overall — the monotonic generation gate is well-reasoned, backward-compat signal is clean, and tests cover the seven core gate cases. Four findings below (warning + nits; no blockers).

Comment thread pkg/container/kubernetes/client.go
Comment thread pkg/container/kubernetes/client.go Outdated
Comment thread pkg/container/kubernetes/client.go
Comment thread pkg/container/kubernetes/client_test.go
yrobla previously approved these changes Apr 23, 2026
During a proxy Deployment rolling update the old and new ReplicaSet
pods both run DeployWorkload, each calling server-side apply on the
backing StatefulSet with the same field manager and Force: true. The
stale pod's apply can land after the fresh pod's and clobber the
image, leaving the StatefulSet pinned to the old digest even after
MCPServer.spec.image has been bumped.

The operator now stamps MCPServer.metadata.generation into the
RunConfig as a monotonic version. Proxyrunner threads it through to
DeployWorkloadOptions. Before apply, the K8s runtime reads the
existing StatefulSet's mcpserver-generation annotation; if it is
strictly greater than ours, the apply is skipped. Otherwise our
generation is stamped on the pod template and apply proceeds.

Zero generation is the backward-compat signal and bypasses both the
gate and the stamp, preserving existing behavior for callers that
don't pass one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
@JAORMX JAORMX force-pushed the phase1-checksum-gate-ssa branch from 2f45b9a to f9cae74 Compare April 23, 2026 08:44
@JAORMX JAORMX merged commit 3d0712c into main Apr 23, 2026
18 of 19 checks passed
@JAORMX JAORMX deleted the phase1-checksum-gate-ssa branch April 23, 2026 12:40
Labels

size/M Medium PR: 300-599 lines changed

Development

Successfully merging this pull request may close these issues.

Stale proxy pod clobbers backing StatefulSet during rolling update after spec.image change

2 participants