Skip to content

fix: add admission control to block finalizers removal not made by the operator service account#2575

Open
assafgi wants to merge 1 commit into
mainfrom
05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account
Open

fix: add admission control to block finalizers removal not made by the operator service account#2575
assafgi wants to merge 1 commit into
mainfrom
05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account

Conversation

@assafgi
Copy link
Copy Markdown
Member

@assafgi assafgi commented May 31, 2026

No description provided.

@assafgi assafgi marked this pull request as ready for review May 31, 2026 16:58
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@assafgi assafgi requested a review from a team as a code owner May 31, 2026 16:58
Copy link
Copy Markdown
Member Author

assafgi commented May 31, 2026


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented May 31, 2026

Claude finished @assafgi's task in 2m 26s —— View job


Claude Code Review

Summary

The PR wires a raw admission handler that rejects UPDATEs which drop weka.weka.io/finalizer or do-not-delete.weka.io/unsafe from WekaCluster / WekaClient / WekaContainer, unless the caller is the operator. It pairs cleanly with the do-not-delete finalizer added in #2574, and intentionally registers the new webhooks without the weka.io/skip-admission objectSelector so the guard cannot be label-bypassed. The plumbing in cmd/manager/main.go and the VWC entries in manager.go look correct and the decoder/PartialObjectMetadata approach is the right shape for a finalizer-only check.

Findings posted inline — summarised here:

# Severity Where Issue
1 High finalizer_protection.go:57-60 The bypass check is namespace-only (system:serviceaccount:<ns>:), so the maintenance SA, CSI controller SA, and CSI node SA (all co-located in the operator namespace) can also strip protected finalizers. Should match the exact operator SA name.
2 Medium (bug) finalizer_protection.go:63-68 The for _, f := range removed body unconditionally returns on the first iteration → dead loop, unreachable trailing Allowed(""), and only removed[0] is reported when multiple protected finalizers are removed at once.
3 Low finalizer_protection.go:57 OperatorPodNamespace comes from os.Getenv (not getEnvOrFail); an empty value silently widens the allowlist to system:serviceaccount::*. Fail-fast at startup or deny on empty.
4 Low (style) finalizer_protection.go:72 new shadows Go's builtin.
5 Design callout manager.go:466-546 failurePolicy: Fail + no objectSelector on three near-identical webhooks. Intentional, but a webhook-pod outage will block the operator's own deletion path, not just unauthorized actors — worth confirming the runbook + enableAdmissionControl=false escape hatch is documented. The three blocks could also collapse into a small loop.
6 Testing gap internal/admission/ No finalizer_protection_test.go. Given this is a security boundary, table tests for (allowed-SA / wrong-SA / non-protected-finalizer / multi-finalizer-removal / non-update op) would lock the behaviour in.

Recommendation: address #1 and #2 before merging — both materially affect correctness/security of the guard. The rest can be follow-ups.
· Branch: 05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account

@graphite-app
Copy link
Copy Markdown

graphite-app Bot commented May 31, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (05/31/26)

2 reviewers were added to this PR based on Anton Bykov's automation.

Comment on lines +56 to +60
// Check whether the caller is the operator's own SA.
saPrefix := fmt.Sprintf("system:serviceaccount:%s:", config.Config.OperatorPodNamespace)
if strings.HasPrefix(req.UserInfo.Username, saPrefix) {
return admission.Allowed("")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — bypass is too permissive. This prefix check matches any service account in the operator's namespace, not just the operator's own SA. The operator namespace also hosts (at minimum) {{prefix}}-maintenance (see charts/weka-operator/templates/maintenance_service_account.yaml), the CSI controller SA, and the CSI node SA. Anything running under those identities could remove the protected finalizers and defeat the guard.

Match the exact operator SA instead, e.g.:

expected := fmt.Sprintf("system:serviceaccount:%s:%s",
    config.Config.OperatorPodNamespace, config.Config.OperatorServiceAccountName)
if req.UserInfo.Username == expected {
    return admission.Allowed("")
}

(OperatorServiceAccountName would need to be plumbed through env the same way MaintenanceSaName already is — see internal/config/env.go:392.)

Comment on lines +62 to +68
// Non-operator caller removed a protected finalizer — deny.
for _, f := range removed {
return admission.Denied(
fmt.Sprintf("removal of finalizer %q is not allowed; it is managed by the weka-operator", f))
}

return admission.Allowed("")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — dead code / wrong message when multiple finalizers are removed. The for body unconditionally returns on the first iteration, so the loop is a no-op (only removed[0] is ever surfaced) and the trailing return admission.Allowed("") on line 68 is unreachable. If a caller strips both weka.weka.io/finalizer and do-not-delete.weka.io/unsafe in a single PATCH, only one of them is mentioned in the deny message.

Suggested change
// Non-operator caller removed a protected finalizer — deny.
for _, f := range removed {
return admission.Denied(
fmt.Sprintf("removal of finalizer %q is not allowed; it is managed by the weka-operator", f))
}
return admission.Allowed("")
// Non-operator caller removed a protected finalizer — deny.
return admission.Denied(fmt.Sprintf(
"removal of finalizer(s) %v is not allowed; they are managed by the weka-operator",
removed))

}

// removedFinalizers returns protected finalizers present in old but absent in new.
func removedFinalizers(old, new []string) []string {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: new shadows Go's builtin new. Rename to newer/current to keep the linter quiet and the call site readable.

}

// Check whether the caller is the operator's own SA.
saPrefix := fmt.Sprintf("system:serviceaccount:%s:", config.Config.OperatorPodNamespace)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.Config.OperatorPodNamespace is loaded via os.Getenv("POD_NAMESPACE") (internal/config/env.go:386) — there's no getEnvOrFail, so if the env is unset the prefix collapses to "system:serviceaccount::". That's defense-in-depth in the wrong direction: combined with the namespace-only check below, an unset env effectively whitelists anything whose username happens to start with system:serviceaccount:: (including arguably some malformed identities). Worth either failing fast at startup if empty, or doing an explicit empty-check here and denying.

Comment thread internal/admission/manager.go Outdated
Comment on lines +466 to +546
{
Name: "protect-finalizers.wekacluster.weka.io",
AdmissionReviewVersions: []string{"v1"},
SideEffects: &sideEffects,
FailurePolicy: &failurePolicy,
TimeoutSeconds: &timeoutSeconds,
ClientConfig: admissionregistrationv1.WebhookClientConfig{
Service: &admissionregistrationv1.ServiceReference{
Namespace: m.namespace,
Name: m.config.ServiceName,
Path: &finalizerPath,
},
CABundle: caBundle,
},
Rules: []admissionregistrationv1.RuleWithOperations{
{
Operations: []admissionregistrationv1.OperationType{
admissionregistrationv1.Update,
},
Rule: admissionregistrationv1.Rule{
APIGroups: []string{"weka.weka.io"},
APIVersions: []string{"v1alpha1"},
Resources: []string{"wekaclusters"},
},
},
},
},
{
Name: "protect-finalizers.wekaclient.weka.io",
AdmissionReviewVersions: []string{"v1"},
SideEffects: &sideEffects,
FailurePolicy: &failurePolicy,
TimeoutSeconds: &timeoutSeconds,
ClientConfig: admissionregistrationv1.WebhookClientConfig{
Service: &admissionregistrationv1.ServiceReference{
Namespace: m.namespace,
Name: m.config.ServiceName,
Path: &finalizerPath,
},
CABundle: caBundle,
},
Rules: []admissionregistrationv1.RuleWithOperations{
{
Operations: []admissionregistrationv1.OperationType{
admissionregistrationv1.Update,
},
Rule: admissionregistrationv1.Rule{
APIGroups: []string{"weka.weka.io"},
APIVersions: []string{"v1alpha1"},
Resources: []string{"wekaclients"},
},
},
},
},
{
Name: "protect-finalizers.wekacontainer.weka.io",
AdmissionReviewVersions: []string{"v1"},
SideEffects: &sideEffects,
FailurePolicy: &failurePolicy,
TimeoutSeconds: &timeoutSeconds,
ClientConfig: admissionregistrationv1.WebhookClientConfig{
Service: &admissionregistrationv1.ServiceReference{
Namespace: m.namespace,
Name: m.config.ServiceName,
Path: &finalizerPath,
},
CABundle: caBundle,
},
Rules: []admissionregistrationv1.RuleWithOperations{
{
Operations: []admissionregistrationv1.OperationType{
admissionregistrationv1.Update,
},
Rule: admissionregistrationv1.Rule{
APIGroups: []string{"weka.weka.io"},
APIVersions: []string{"v1alpha1"},
Resources: []string{"wekacontainers"},
},
},
},
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Availability risk worth calling out. All three finalizer-protection webhooks use failurePolicy: Fail and have no objectSelector escape hatch (by design — the comment on line 465 is explicit). The combination means: if the webhook pod is unreachable, every UPDATE to wekacluster/wekaclient/wekacontainer is rejected — including the operator's own status writes and finalizer-removal updates. The only documented recovery path is the operator-level enableAdmissionControl: false toggle (CleanupIfExists), which is a deliberate operator action.

This is probably the intended tradeoff, but please confirm the runbook covers it — in particular, a webhook-down scenario during a cluster teardown will block the operator's own deletion flow, not just unauthorized actors. A namespaceSelector excluding the operator's own namespace, or a separate webhook entry for the operator SA with Ignore, are options if the risk is unacceptable.

Three near-identical webhook entries also beg for a small loop — for _, res := range []string{"wekaclusters", "wekaclients", "wekacontainers"} would cut ~60 lines without losing clarity.

@assafgi assafgi changed the base branch from 05-31-fix_add_do-not-delete_finalizer_to_signal_the_finalizer_shouldn_t_be_removed_manually to graphite-base/2575 June 5, 2026 08:03
@assafgi assafgi force-pushed the graphite-base/2575 branch from 5293e20 to da86c0d Compare June 5, 2026 08:04
@assafgi assafgi force-pushed the 05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account branch from 6ab69e3 to 599708c Compare June 5, 2026 08:04
@assafgi assafgi changed the base branch from graphite-base/2575 to main June 5, 2026 08:04
@assafgi assafgi force-pushed the 05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account branch from 599708c to c9d3508 Compare June 5, 2026 08:44
Copilot AI review requested due to automatic review settings June 5, 2026 08:44
@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @assafgi's task in 1m 28s —— View job


Claude Code Re-Review (post-update)

Summary

The latest revision addresses almost all of the previous review feedback. Nice turnaround.

Prior finding Status
#1 High — namespace-only bypass Fixed. Now matches the exact SA via system:serviceaccount:<ns>:<name> using the new OperatorServiceAccountName (finalizer_protection.go:58-62).
#2 Med (bug) — dead loop / only removed[0] reported Fixed. removedFinalizers now collects all protected finalizers and the deny message reports the full %v slice (finalizer_protection.go:64-89).
#3 Low — empty env silently widens allowlist Fixed twice over. Plumbed via getEnvOrFail (env.go:394, fails fast at startup) and an explicit empty-check denies at request time (finalizer_protection.go:55-57).
#4 Low (style)new shadows builtin Fixed. Renamed to current.
#5 Design — three near-identical webhooks Addressed. Collapsed into a for _, res := range []string{...} loop (manager.go:466-498).
#6 Testing — no unit test for the security boundary ⚠️ Still open — no finalizer_protection_test.go.

Verification

  • No operator self-lockout. WEKA_OPERATOR_SERVICE_ACCOUNT_NAME is set to {{ .Values.prefix }}-controller-manager (manager.yaml:137), which matches the deployment's serviceAccountName (manager.yaml:475). So req.UserInfo.Username for the operator pod equals the expected string and its own finalizer removals are allowed. ✅
  • Singular derivation res[:len(res)-1] is correct for all three plural resources (wekaclusterswekacluster, etc.). ✅

Remaining notes (non-blocking)

  1. Test coverage (Add envtest  #6). This is a security boundary — table tests would lock in the behavior cheaply: allowed-SA, wrong-SA, empty-config→deny, non-protected-finalizer-removal→allow, multi-finalizer-removal, non-Update op→allow. Add tests →
  2. Availability tradeoff (prior [DEVOPS-1595] Create an agent pod with a working weka client #5) still stands by design. failurePolicy: Fail + no objectSelector on the finalizer webhooks means a webhook-pod outage blocks all UPDATEs to these CRs — including the operator's own status/finalizer writes during teardown. The recovery path is the operator-level enableAdmissionControl: false toggle. Worth a one-line mention in the runbook if not already there.

Recommendation: the blocking correctness/security issues are resolved. Adding the unit test would be the last thing I'd want before merge given this guards a deletion-protection boundary, but it's reasonable as a fast follow-up.
· Branch: 05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an admission-control safeguard to prevent removal of operator-managed finalizers unless the request is made by the operator’s own Kubernetes ServiceAccount, strengthening protection against forced cleanup by other identities.

Changes:

  • Introduces a new validating admission webhook endpoint (/protect-finalizers) that denies protected finalizer removal by non-operator service accounts.
  • Extends the ValidatingWebhookConfiguration to include finalizer-protection webhooks for Weka CRDs without an objectSelector escape hatch.
  • Adds a new environment/config value for the operator ServiceAccount name and wires it via the Helm chart.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
internal/config/env.go Adds config/env binding for operator ServiceAccount name used for webhook caller identity verification.
internal/admission/manager.go Adds /protect-finalizers webhook path and appends new finalizer-protection webhooks into the VWC.
internal/admission/finalizer_protection.go Implements and registers the admission handler that enforces finalizer-removal restrictions.
cmd/manager/main.go Registers the new finalizer-protection webhook when admission is enabled.
charts/weka-operator/templates/manager.yaml Injects WEKA_OPERATOR_SERVICE_ACCOUNT_NAME into the manager Deployment env.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/config/env.go
Comment on lines 391 to 395
Config.BindAddress.Metrics = getEnvOrFail("OPERATOR_METRICS_BIND_ADDRESS")
Config.BindAddress.HealthProbe = getEnvOrFail("HEALTH_PROBE_BIND_ADDRESS")
Config.MaintenanceSaName = getEnvOrFail("WEKA_OPERATOR_MAINTENANCE_SA_NAME")
Config.OperatorServiceAccountName = getEnvOrFail("WEKA_OPERATOR_SERVICE_ACCOUNT_NAME")
Config.OcpCompatibility.DriverToolkitSecretName = getEnvOrFail("WEKA_OCP_PULL_SECRET")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants