Patch MCPServer spec instead of Update #4914

Open
jhrozek wants to merge 1 commit into main from update-to-patch-1-mcpserver

Conversation

@jhrozek (Contributor) commented Apr 17, 2026

Summary

  • The controller adds finalizers, removes finalizers, and sets the restart-processed annotation via r.Update. Update is a full PUT, so any spec field the operator does not track is zeroed on every reconcile. The most important such field is spec.authzConfig, which an external operator will soon own via server-side apply.
  • Replace the three Update call sites with an optimistic-lock merge patch (client.MergeFromWithOptions(orig, client.MergeFromWithOptimisticLock{})). The merge-patch body carries only fields the caller changed, so untouched fields never hit the wire and cannot be clobbered. MergeFromWithOptimisticLock sends resourceVersion as a precondition, giving 409-on-collision semantics for concurrent writers and defending metadata.finalizers (which has no array-merge semantics under merge-patch) against wholesale replacement when another controller is mid-flight adding its own entry.
  • First in a 12-PR migration track. PR 2+ will migrate status writes under a separate helper (Switch status writes from Update to Patch across all controllers #4633).
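
To make the "only changed fields hit the wire" point concrete, here is a minimal standard-library sketch of what a merge-patch body with a resourceVersion precondition looks like. It is not controller-runtime's implementation: mergeDiff, withOptimisticLock, and buildFinalizerPatch are hypothetical names, and the object shape and finalizer string are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// mergeDiff returns a JSON merge-patch style diff that turns orig into
// modified: changed or added keys carry the new value, removed keys carry
// null, and unchanged keys are omitted entirely.
func mergeDiff(orig, modified map[string]any) map[string]any {
	patch := map[string]any{}
	for k, mv := range modified {
		ov, ok := orig[k]
		if !ok {
			patch[k] = mv
			continue
		}
		om, oIsMap := ov.(map[string]any)
		mm, mIsMap := mv.(map[string]any)
		if oIsMap && mIsMap {
			if sub := mergeDiff(om, mm); len(sub) > 0 {
				patch[k] = sub
			}
			continue
		}
		if !reflect.DeepEqual(ov, mv) {
			patch[k] = mv // scalars and arrays are sent whole
		}
	}
	for k := range orig {
		if _, ok := modified[k]; !ok {
			patch[k] = nil // null deletes the key on the server
		}
	}
	return patch
}

// withOptimisticLock adds the snapshot's resourceVersion to the body so the
// server can reject the patch if the object changed in the meantime.
func withOptimisticLock(patch map[string]any, resourceVersion string) map[string]any {
	meta, _ := patch["metadata"].(map[string]any)
	if meta == nil {
		meta = map[string]any{}
		patch["metadata"] = meta
	}
	meta["resourceVersion"] = resourceVersion
	return patch
}

// buildFinalizerPatch simulates an "add finalizer" write (hypothetical data).
func buildFinalizerPatch() ([]byte, error) {
	orig := map[string]any{
		"metadata": map[string]any{"name": "demo", "resourceVersion": "41"},
		"spec":     map[string]any{"image": "example/server:1"},
	}
	modified := map[string]any{
		"metadata": map[string]any{
			"name":            "demo",
			"resourceVersion": "41",
			"finalizers":      []any{"example.com/finalizer"},
		},
		"spec": map[string]any{"image": "example/server:1"},
	}
	return json.Marshal(withOptimisticLock(mergeDiff(orig, modified), "41"))
}

func main() {
	body, err := buildFinalizerPatch()
	if err != nil {
		panic(err)
	}
	// The body contains metadata.finalizers and metadata.resourceVersion only;
	// spec is absent, so nothing under spec can be overwritten by this write.
	fmt.Println(string(body))
}
```

The untouched spec never appears in the body, which is the property the envtest suite checks end to end for spec.authzConfig.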

Fixes #4767.

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Unit tests: cmd/thv-operator/controllers/mcpserver_spec_patch_test.go uses a patch-recording client wrapper to assert that each of the three migrated call sites (AddFinalizer, RemoveFinalizer, RestartAnnotation) emits a merge-patch body carrying the resourceVersion precondition — a deterministic wire-level signal that MergeFromWithOptimisticLock is in effect. A regression to plain MergeFrom would drop the precondition and fail the assertion independent of the higher-level survival test.
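
The wire-level assertion described above boils down to checking the recorded body for metadata.resourceVersion. A rough sketch of that check, assuming the recorder hands back raw JSON bytes (hasResourceVersionPrecondition and patchBody are hypothetical names, not the helpers in the test file):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchBody mirrors only the part of a merge-patch body this check needs.
type patchBody struct {
	Metadata struct {
		ResourceVersion string `json:"resourceVersion"`
	} `json:"metadata"`
}

// hasResourceVersionPrecondition reports whether a recorded merge-patch body
// carries a non-empty metadata.resourceVersion. Decode errors are returned
// rather than swallowed so a malformed body cannot pass silently.
func hasResourceVersionPrecondition(body []byte) (bool, error) {
	var p patchBody
	if err := json.Unmarshal(body, &p); err != nil {
		return false, err
	}
	return p.Metadata.ResourceVersion != "", nil
}

func main() {
	locked := []byte(`{"metadata":{"finalizers":["example.com/finalizer"],"resourceVersion":"41"}}`)
	plain := []byte(`{"metadata":{"finalizers":["example.com/finalizer"]}}`)
	for _, b := range [][]byte{locked, plain} {
		ok, err := hasResourceVersionPrecondition(b)
		fmt.Println(ok, err)
	}
}
```

A body produced without the optimistic-lock option has no resourceVersion key, so the same check returns false and the regression is caught.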

Envtest integration: cmd/thv-operator/test-integration/mcp-server/mcpserver_spec_patch_integration_test.go creates an MCPServer, writes spec.authzConfig out-of-band from a second client, and asserts the field survives both the finalizer-add reconcile and the restart-annotation reconcile. Run via task operator-test-integration.

Manual: kubectl apply an MCPServer, kubectl patch --type=merge to set spec.authzConfig, wait for reconcile, then kubectl get mcpserver -o yaml → spec.authzConfig persists across reconciles.

Does this introduce a user-facing change?

No.

Implementation plan

Approved implementation plan

First PR of the r.Update → r.Patch migration tracked in #4767 and #4633. This PR covers Track A (#4767): the three MCPServer spec writes that must move to an optimistic-lock merge patch before an external operator starts writing spec.authzConfig via SSA. Status-subresource migration (#4633) is the separate Track B, starting in PR 2.

Key design decisions:

  • Use MergeFromWithOptimisticLock{} for MCPServer spec patches — preserves conflict-detection parity with Update, forces requeue on concurrent SSA instead of silent clobber, defends against metadata.finalizers array replacement.
  • Keep spec-patch migration inline (3 sites) — a helper will land for status (89 sites) in PR 2 of Track B.
  • Test strategy: envtest (real apiserver) for field-survival assertions; patch-recording fake client for wire-level optimistic-lock assertions. Two independent Kubernetes-go reviewers concurred that in-cluster (chainsaw) tests add no unique signal for patch semantics (100% apiserver-side) over envtest.

Special notes for reviewers

  • The three call sites are at cmd/thv-operator/controllers/mcpserver_controller.go:196 (RemoveFinalizer), :212 (AddFinalizer), and :768 (restart annotation). Each follows the same DeepCopy → mutate → Patch(MergeFromWithOptions) idiom.
  • mcpserver_restart_test.go renames the mock-client flag failOnMCPServerUpdate → failOnMCPServerWrite because it now intercepts both Update and Patch on MCPServer.
  • A short "Spec / metadata patching" section was added to .claude/rules/operator.md documenting the pattern for future CR writes.
  • Expect 409 Conflict reconciles to appear as routine log noise once external SSA writers land in a cluster — the optimistic-lock guard doing its job, not a bug.

Generated with Claude Code

@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label Apr 17, 2026
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.01%. Comparing base (f9b540d) to head (daf9140).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...or/controllers/mcpexternalauthconfig_controller.go | 75.00% | 1 Missing and 1 partial ⚠️ |
| ...d/thv-operator/controllers/mcpserver_controller.go | 88.23% | 0 Missing and 2 partials ⚠️ |
| .../thv-operator/controllers/toolconfig_controller.go | 75.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4914      +/-   ##
==========================================
- Coverage   69.03%   69.01%   -0.03%     
==========================================
  Files         552      553       +1     
  Lines       72996    73035      +39     
==========================================
+ Hits        50395    50407      +12     
- Misses      19601    19621      +20     
- Partials     3000     3007       +7     


@jhrozek jhrozek force-pushed the update-to-patch-1-mcpserver branch from dee0e05 to 0513e4e Compare April 20, 2026 21:12
@jhrozek jhrozek force-pushed the update-to-patch-1-mcpserver branch from 0513e4e to 3187fdc Compare April 21, 2026 11:04
jhrozek added a commit that referenced this pull request Apr 21, 2026
Relates to #4633. A shared helper that collapses the
"DeepCopy → mutate → r.Status().Patch(MergeFrom(original))" idiom
to a single call so remaining r.Status().Update sites can migrate
without each one re-implementing the DeepCopy-before-mutate
discipline by hand.

Status writes deliberately use a plain merge patch, not an
optimistic-lock one: the operator and the runtime reporter write
disjoint status fields on every reconcile and must coexist without
forcing a 409 on every overlap. Spec and metadata writes still
require optimistic locking — see #4767 (tracking) / #4914
(MCPServer migration).

The helper does not make every multi-writer pattern safe. The
Caller contract in the doc comment spells out two footguns it
cannot defend against:

- JSON merge-patch replaces arrays wholesale for CRDs, so a writer
  to Status.Conditions must be the sole owner of the entire array.
  Any concurrent writer whose Patch lands between this caller's
  Get and Patch — on any condition type, including ones this
  caller does not touch — will be erased. A fresh Get narrows but
  does not eliminate the TOCTOU window.
- A scalar re-computed from a stale snapshot that differs from the
  live value will overwrite a concurrent writer's update.
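
The array footgun can be reproduced without an apiserver. The following is a minimal sketch of RFC 7386 merge-patch application using only the standard library; applyMergePatch, conditionTypes, staleConditionsDemo, and the condition names are illustrative assumptions, not code from the helper.

```go
package main

import (
	"fmt"
)

// applyMergePatch applies an RFC 7386 JSON merge patch to target and returns
// the result. Objects merge key by key; null deletes a key; everything else,
// including arrays, replaces the target value wholesale.
func applyMergePatch(target, patch map[string]any) map[string]any {
	if target == nil {
		target = map[string]any{}
	}
	for k, pv := range patch {
		if pv == nil {
			delete(target, k)
			continue
		}
		pm, isMap := pv.(map[string]any)
		if !isMap {
			target[k] = pv // scalar or array: full replacement
			continue
		}
		tm, _ := target[k].(map[string]any)
		target[k] = applyMergePatch(tm, pm)
	}
	return target
}

// conditionTypes lists status.conditions[*].type for display.
func conditionTypes(obj map[string]any) []string {
	status, _ := obj["status"].(map[string]any)
	conds, _ := status["conditions"].([]any)
	var out []string
	for _, c := range conds {
		if m, ok := c.(map[string]any); ok {
			if t, ok := m["type"].(string); ok {
				out = append(out, t)
			}
		}
	}
	return out
}

// staleConditionsDemo applies a patch computed from a stale snapshot that
// only knew about the Ready condition (hypothetical data).
func staleConditionsDemo() map[string]any {
	live := map[string]any{
		"status": map[string]any{
			"phase": "Running",
			"conditions": []any{
				map[string]any{"type": "Ready", "status": "False"},
				// Added by a concurrent writer after the stale Get.
				map[string]any{"type": "RuntimeHealthy", "status": "True"},
			},
		},
	}
	patch := map[string]any{
		"status": map[string]any{
			"conditions": []any{
				map[string]any{"type": "Ready", "status": "True"},
			},
		},
	}
	return applyMergePatch(live, patch)
}

func main() {
	live := staleConditionsDemo()
	fmt.Println(conditionTypes(live))                     // RuntimeHealthy is gone
	fmt.Println(live["status"].(map[string]any)["phase"]) // disjoint scalar survives
}
```

The disjoint scalar (phase) survives because it is absent from the patch, while the whole conditions array is replaced, which is why a Status.Conditions writer has to own the entire array.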

The codified checklist for new call sites lives in
.claude/rules/operator.md.

Operational safeguards in the helper itself:

- No-op mutations (empty merge-patch body) short-circuit before
  the wire call; the apiserver runs admission and audit for every
  PATCH regardless of body content, so steady-state reconcilers
  must not generate {} traffic.
- A nil obj returns a descriptive error rather than panicking in
  the downstream type assertion.
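
A rough sketch of the two safeguards, assuming a simplified status type and a pluggable send function in place of Status().Patch. The names patchStatus and status are hypothetical, and for brevity the body passed to send is the full mutated status rather than a merge-patch diff.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// status is a hypothetical stand-in for a CR status struct.
type status struct {
	Phase string `json:"phase,omitempty"`
	URL   string `json:"url,omitempty"`
}

// patchStatus snapshots obj, applies mutate, and calls send only when the
// serialized form changed. It reports whether a wire call was attempted.
func patchStatus(obj *status, mutate func(*status), send func(body []byte) error) (bool, error) {
	if obj == nil {
		return false, errors.New("patchStatus: obj must not be nil")
	}
	before, err := json.Marshal(obj) // snapshot before mutating
	if err != nil {
		return false, err
	}
	mutate(obj)
	after, err := json.Marshal(obj)
	if err != nil {
		return false, err
	}
	if bytes.Equal(before, after) {
		return false, nil // no-op mutation: skip the wire call entirely
	}
	if err := send(after); err != nil {
		return true, err // returned unchanged for the caller's requeue decision
	}
	return true, nil
}

func main() {
	calls := 0
	send := func(body []byte) error { calls++; return nil }
	s := &status{Phase: "Running"}

	sent, _ := patchStatus(s, func(st *status) { st.Phase = "Running" }, send)
	fmt.Println(sent, calls) // nothing changed, nothing sent
	sent, _ = patchStatus(s, func(st *status) { st.Phase = "Failed" }, send)
	fmt.Println(sent, calls) // one wire call
	_, err := patchStatus(nil, func(*status) {}, send)
	fmt.Println(err) // descriptive error instead of a panic
}
```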

The helper lives in cmd/thv-operator/pkg/controllerutil alongside
the existing controller helpers. It may move to a shared location
later if a non-operator caller needs it.

Pure addition — no call-site changes in this PR.

Tests (cmd/thv-operator/pkg/controllerutil/status_test.go) cover:

- Happy path and DeepCopy isolation.
- No-op mutate skips the wire call.
- Disjoint-writer preservation: with a stale snapshot, a second
  writer owning disjoint scalar fields survives the patch.
- Stale snapshot clobbers conditions from another writer — guards
  the documented Caller contract so the behaviour stays load-
  bearing against future changes.
- Stale scalar computation: re-assigning the read value is a no-op
  at the wire level (concurrent writer preserved); assigning a
  differing value overwrites live state.
- Nil obj is rejected with a descriptive error, no PATCH issued.
- Error propagation: apiserver failures from Status().Patch are
  returned unchanged for the controller's requeue decision.
rdimitrov
rdimitrov previously approved these changes Apr 21, 2026
Comment thread .claude/rules/operator.md
task crdref-gen # Generate CRD API docs (run from cmd/thv-operator/)
```

## Spec / metadata patching
Collaborator

❤️

@jhrozek jhrozek force-pushed the update-to-patch-1-mcpserver branch from 3a9a6d3 to 1c137de Compare April 21, 2026 21:07
@ChrisJBurns (Collaborator) left a comment

Multi-Agent Consensus Review

Agents consulted: kubernetes-expert, go-expert-developer, code-reviewer, toolhive-expert

Consensus Summary

| # | Finding | Consensus | Severity | Action |
|---|---------|-----------|----------|--------|
| 2 | Other MCPServer-writing reconcilers still Update spec | 8/10 | HIGH | Discuss scope |
| 3 | Verbatim 4-line rationale comment duplicated 3× → helper | 7/10 | MEDIUM | Fix |
| 4 | Hardcoded finalizer string; other CRDs have exported constants | 7/10 | MEDIUM | Fix |
| 5 | Rule text misstates why optimistic lock defends finalizers | 7/10 | MEDIUM | Fix |
| 6 | envtest finalizer-add case can degenerate into a no-op | 7/10 | LOW | Fix |
| 7 | patchRecordingClient silently discards patch.Data() errors | 7/10 | LOW | Fix |
| 8 | envtest cleanupServer swallows errors; uses non-lock MergeFrom | 7/10 | LOW | Fix |

Overall

The controller migration itself is correct: DeepCopy → mutate → Patch(MergeFromWithOptions(orig, MergeFromWithOptimisticLock{})) is the right tool for the "external controller owns a disjoint spec field" problem, the finalizer-defense reasoning is sound, and MergeFromWithOptimisticLock strictly improves on Update for this use case.

The PR scope closes the clobber hazard in mcpserver_controller.go, but toolconfig_controller.go:167 and mcpexternalauthconfig_controller.go:186 still call r.Update(ctx, &server) on MCPServer — so an external SSA writer of spec.authzConfig can still lose its write on any MCPToolConfig or MCPExternalAuthConfig reconcile.

The medium findings are polish: a DRY helper for the 3-site copy-paste pattern, an exported MCPServerFinalizerName constant to match every other CRD, and a correction to how the rule file explains the optimistic lock's role in defending finalizers.

Documentation

The rule file addition in .claude/rules/operator.md is load-bearing for the convention being introduced. See finding #5 — the "merge-patch has no array-merge semantics" paragraph conflates two independent mechanisms. The optimistic lock does not change merge-patch's array-replacement behavior; it forces a requeue on staleness, which then triggers a fresh Get that observes the concurrently-added finalizer.
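
The two mechanisms can be separated in a small in-memory simulation. This is only a sketch under assumed names (store, patchFinalizers, addFinalizer, race) and made-up finalizer strings: the patch always replaces the finalizer list wholesale, and the lock only decides whether a stale writer is rejected and has to re-read.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("409 conflict: resourceVersion mismatch")

type object struct {
	ResourceVersion int
	Finalizers      []string
}

// store is a toy stand-in for the apiserver holding one object.
type store struct{ obj object }

// get returns a copy, like a fresh Get followed by DeepCopy.
func (s *store) get() object {
	cp := s.obj
	cp.Finalizers = append([]string(nil), s.obj.Finalizers...)
	return cp
}

// patchFinalizers replaces the list wholesale (merge-patch array semantics).
// With lock set, a patch based on a stale resourceVersion is rejected.
func (s *store) patchFinalizers(basedOn int, finalizers []string, lock bool) error {
	if lock && basedOn != s.obj.ResourceVersion {
		return errConflict
	}
	s.obj.Finalizers = append([]string(nil), finalizers...)
	s.obj.ResourceVersion++
	return nil
}

// addFinalizer computes the new list from the given snapshot and patches it.
func addFinalizer(s *store, snapshot object, name string, lock bool) error {
	return s.patchFinalizers(snapshot.ResourceVersion, append(snapshot.Finalizers, name), lock)
}

// race replays the scenario: the operator reads, another controller writes
// first, then the operator writes from its stale snapshot.
func race(lock bool) []string {
	s := &store{obj: object{ResourceVersion: 1}}
	stale := s.get()
	if err := addFinalizer(s, s.get(), "other.example.com/finalizer", lock); err != nil {
		panic(err)
	}
	err := addFinalizer(s, stale, "operator.example.com/finalizer", lock)
	if errors.Is(err, errConflict) {
		// Requeue: a fresh Get observes the concurrently added finalizer.
		err = addFinalizer(s, s.get(), "operator.example.com/finalizer", lock)
	}
	if err != nil {
		panic(err)
	}
	return s.get().Finalizers
}

func main() {
	fmt.Println(race(false)) // other controller's finalizer is lost
	fmt.Println(race(true))  // both finalizers survive after the retry
}
```

In the locked run the array is still replaced wholesale on the retry; the other controller's entry survives only because the retry starts from a fresh read.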


Generated with Claude Code

Comment thread cmd/thv-operator/controllers/mcpserver_controller.go
Comment thread cmd/thv-operator/controllers/mcpserver_controller.go
Comment thread cmd/thv-operator/controllers/mcpserver_controller.go Outdated
Comment thread .claude/rules/operator.md Outdated
Comment thread cmd/thv-operator/controllers/mcpserver_spec_patch_test.go
ChrisJBurns added a commit that referenced this pull request Apr 21, 2026
Relates to #4633. A shared helper that collapses the
"DeepCopy → mutate → r.Status().Patch(MergeFrom(original))" idiom
to a single call so remaining r.Status().Update sites can migrate
without each one re-implementing the DeepCopy-before-mutate
discipline by hand.

Status writes deliberately use a plain merge patch, not an
optimistic-lock one: the operator and the runtime reporter write
disjoint status fields on every reconcile and must coexist without
forcing a 409 on every overlap. Spec and metadata writes still
require optimistic locking — see #4767 (tracking) / #4914
(MCPServer migration).

The helper does not make every multi-writer pattern safe. The
Caller contract in the doc comment spells out two footguns it
cannot defend against:

- JSON merge-patch replaces arrays wholesale for CRDs, so a writer
  to Status.Conditions must be the sole owner of the entire array.
  Any concurrent writer whose Patch lands between this caller's
  Get and Patch — on any condition type, including ones this
  caller does not touch — will be erased. A fresh Get narrows but
  does not eliminate the TOCTOU window.
- A scalar re-computed from a stale snapshot that differs from the
  live value will overwrite a concurrent writer's update.

The codified checklist for new call sites lives in
.claude/rules/operator.md.

Operational safeguards in the helper itself:

- No-op mutations (empty merge-patch body) short-circuit before
  the wire call; the apiserver runs admission and audit for every
  PATCH regardless of body content, so steady-state reconcilers
  must not generate {} traffic.
- A nil obj returns a descriptive error rather than panicking in
  the downstream type assertion.

The helper lives in cmd/thv-operator/pkg/controllerutil alongside
the existing controller helpers. It may move to a shared location
later if a non-operator caller needs it.

Pure addition — no call-site changes in this PR.

Tests (cmd/thv-operator/pkg/controllerutil/status_test.go) cover:

- Happy path and DeepCopy isolation.
- No-op mutate skips the wire call.
- Disjoint-writer preservation: with a stale snapshot, a second
  writer owning disjoint scalar fields survives the patch.
- Stale snapshot clobbers conditions from another writer — guards
  the documented Caller contract so the behaviour stays load-
  bearing against future changes.
- Stale scalar computation: re-assigning the read value is a no-op
  at the wire level (concurrent writer preserved); assigning a
  differing value overwrites live state.
- Nil obj is rejected with a descriptive error, no PATCH issued.
- Error propagation: apiserver failures from Status().Patch are
  returned unchanged for the controller's requeue decision.

Co-authored-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Fixes #4767.

The controller writes finalizers, finalizer removal, and the
restart-processed annotation via r.Update. Update is a full PUT, so
any spec field the operator does not track — most importantly
spec.authzConfig, which a separate authorization controller will soon
own — is zeroed on every reconcile.

Replace the three Update call sites with an optimistic-lock merge
patch. The merge-patch body carries only fields the caller changed,
so untouched fields never hit the wire and cannot be clobbered.
MergeFromWithOptimisticLock sends resourceVersion as a precondition,
giving 409-on-collision semantics for concurrent writers and
defending metadata.finalizers (which has no array-merge semantics
under merge-patch) against wholesale replacement when another
controller is mid-flight adding its own entry.

Tests:

- Envtest suite writes spec.authzConfig out-of-band and asserts it
  survives both the finalizer-add reconcile and the
  restart-annotation reconcile.
- Unit suite uses a patch-recording client to assert each migrated
  call site emits a body carrying the resourceVersion precondition
  — a deterministic wire-level signal that
  MergeFromWithOptimisticLock is in effect. A regression to plain
  MergeFrom would drop the precondition and fail the assertion
  independent of the higher-level survival test.

Also:

- .claude/rules/operator.md: new "Spec / metadata patching" section
  documenting the pattern for future CR writes. Status patching is
  a separate follow-up (#4633).
- Rename the mock-client flag failOnMCPServerUpdate →
  failOnMCPServerWrite; it now intercepts both Update and Patch on
  MCPServer, so the name matches reality.
@jhrozek jhrozek force-pushed the update-to-patch-1-mcpserver branch from 1c137de to daf9140 Compare April 21, 2026 22:33

Labels

size/M Medium PR: 300-599 lines changed

Development

Successfully merging this pull request may close these issues.

Fix r.Update to r.Patch plus regression guard in MCPServer controller

3 participants