Skip to content

fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)#180

Merged
shujaatTracebloc merged 2 commits into
developfrom
fix/resource-monitor-cluster-rbac
Jun 3, 2026
Merged

fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)#180
shujaatTracebloc merged 2 commits into
developfrom
fix/resource-monitor-cluster-rbac

Conversation

@shujaatTracebloc
Copy link
Copy Markdown

@shujaatTracebloc shujaatTracebloc commented Jun 2, 2026

Problem

With clusterScope: false, templates/resource-monitor-rbac.yaml rendered only a namespace-scoped Role in the release namespace. But the resource-monitor's code needs cluster scope:

  • resource_monitor.py calls core_v1_api.list_pod_for_all_namespaces(field_selector="spec.nodeName=<node>") — a cluster-scoped list pods verb that a namespaced Role can never satisfy.
  • It also read_namespaced_pod()s its own pod, which lives in .Values.nodeAgents.namespace.name (the node-agents namespace), not .Release.Namespace — so a Role scoped to the release namespace misses it too.

Result on a live cluster: the DaemonSet 403 Forbiddens on startup and CrashLoopBackOffs (70+ restarts observed). Per-node resource metrics are dead while it's down.

Fix

Per-node monitoring is intrinsically cluster-scoped, so the resource-monitor's RBAC is deliberately decoupled from clusterScope. Always render the read-only ClusterRole + ClusterRoleBinding:

  • get/list/watch on pods, pods/log, nodes, nodes/status, namespaces, and metrics.k8s.io pods/nodes.
  • No write, exec, or secret access — strictly read visibility of pod/node metadata.

clusterScope continues to gate the training/jobs isolation footprint elsewhere; it should not cripple node telemetry by leaving the DaemonSet without permissions it cannot run without. If a deployment genuinely cannot allow any cluster-scoped read, disable the monitor entirely via resourceMonitor: false (still supported) rather than deploying it broken.

Validation (helm template)

  • clusterScope=false and clusterScope=true → both render ClusterRole + ClusterRoleBinding with the read-only rules.
  • resourceMonitor=false → 0 objects (fully disabled).
  • Document order preserved (ServiceAccount@0, ClusterRole@1, ClusterRoleBinding@2), so existing tests/resource_monitor_test.yaml assertions still hold. tests/rbac_test.yaml covers rbac.yaml (jobs-manager), which is untouched. No existing tests removed or changed.

🤖 Generated with Claude Code


Note

Medium Risk
Introduces cluster-scoped read RBAC on every install where the monitor is enabled, even when operators chose namespace-scoped isolation via clusterScope=false.

Overview
Fixes resource-monitor DaemonSet CrashLoopBackOff when clusterScope: false by always installing a read-only ClusterRole + ClusterRoleBinding instead of a release-namespace Role/RoleBinding.

The monitor needs cluster-wide list on pods (per-node via list_pod_for_all_namespaces) and access to its own pod in the node-agents namespace—permissions a namespaced Role cannot provide. clusterScope no longer gates this RBAC; it still controls training/jobs isolation elsewhere. Disabling cluster reads entirely remains resourceMonitor: false.

Helm unit tests now assert cluster-scoped RBAC even with clusterScope: false.

Reviewed by Cursor Bugbot for commit 6a869d8. Bugbot is set up for automated code reviews on this repo. Configure here.

…rom clusterScope)

Under clusterScope: false the chart rendered only a namespace-scoped Role in
the release namespace. But the resource-monitor's code:
  * calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...)
    -- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and
  * read_namespaced_pod()s its OWN pod, which lives in
    .Values.nodeAgents.namespace.name (NOT .Release.Namespace).

So with clusterScope: false the DaemonSet 403'd on startup and crashlooped
(70+ restarts observed on a live cluster). Per-node monitoring is intrinsically
cluster-scoped.

Always render the read-only ClusterRole + ClusterRoleBinding regardless of
clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write,
exec, or secret access). resourceMonitor: false still fully disables the
component. clusterScope continues to gate the training/jobs isolation footprint
elsewhere -- it must not leave the node monitor without permissions it cannot
run without.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 22 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@shujaatTracebloc shujaatTracebloc self-assigned this Jun 2, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 629773d. Configure here.

Comment thread client/templates/resource-monitor-rbac.yaml
…erScope=false

Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the
old behavior (namespaced Role + RoleBinding in the release namespace when
clusterScope=false). Update that case to assert the corrected contract -- a
ClusterRole + ClusterRoleBinding always render (with no metadata.namespace),
while the subject SA still lives in the node-agents namespace.

The clusterScope=false path stays under test; only the asserted behavior
changes to match the fix. Verified with `helm unittest` (all resource-monitor
suites pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@shujaatTracebloc
Copy link
Copy Markdown
Author

Good catch — fixed in 6a869d8.

You're right: tests/node_agents_namespace_test.yaml had a clusterScope: false case asserting the old namespaced Role/RoleBinding (in the release namespace), which this change intentionally replaces. I missed that file in my original scan (I'd only checked rbac_test.yaml and resource_monitor_test.yaml), so the "no existing tests changed" note was wrong — apologies.

Rather than delete the coverage, I updated that case to assert the corrected contract: under clusterScope: false the template now renders a ClusterRole + ClusterRoleBinding (with no metadata.namespace), while the subject ServiceAccount still lives in the node-agents namespace. The clusterScope: false path stays under test — only the asserted behavior changed to match the fix.

Verified locally with helm unittest: the three resource-monitor-related suites (node_agents_namespace_test, resource_monitor_test, rbac_test) all pass — 33/33. (Four other suites — auto-upgrade, image-refresh, network-policy, priority-class — fail with values don't meet the schema, but they fail identically on a clean develop checkout, so they're pre-existing and unrelated to this change.)

@shujaatTracebloc shujaatTracebloc merged commit b8233bc into develop Jun 3, 2026
11 checks passed
@shujaatTracebloc shujaatTracebloc deleted the fix/resource-monitor-cluster-rbac branch June 3, 2026 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants