fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope) #180
…rom clusterScope)
Under clusterScope: false the chart rendered only a namespace-scoped Role in
the release namespace. But the resource-monitor's code:
* calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...)
-- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and
* read_namespaced_pod()s its OWN pod, which lives in
.Values.nodeAgents.namespace.name (NOT .Release.Namespace).
So with clusterScope: false the DaemonSet 403'd on startup and crashlooped
(70+ restarts observed on a live cluster). Per-node monitoring is intrinsically
cluster-scoped.
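
To make the two failures concrete, here is a minimal Python sketch of how RBAC scoping applies to them. It is a simplified model for illustration, not the API server's authorizer: the `Grant` type, the `is_allowed` helper, and the namespace names `release-ns` and `node-agents` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

READ_VERBS = frozenset({"get", "list", "watch"})


@dataclass(frozen=True)
class Grant:
    """Simplified stand-in for a (Cluster)Role plus its binding."""
    verbs: frozenset
    resources: frozenset
    # Namespace of the RoleBinding; None models a ClusterRoleBinding.
    namespace: Optional[str]


def is_allowed(grant: Grant, verb: str, resource: str,
               request_namespace: Optional[str]) -> bool:
    """Return True if the grant covers the request.

    request_namespace=None models a cluster-wide request such as
    list_pod_for_all_namespaces().
    """
    if verb not in grant.verbs or resource not in grant.resources:
        return False
    if grant.namespace is None:
        # Cluster-wide binding: covers every namespace and
        # all-namespace requests.
        return True
    # Namespaced binding: only requests inside that namespace match.
    return request_namespace == grant.namespace


# Hypothetical namespace names, for illustration only.
namespaced_role = Grant(READ_VERBS, frozenset({"pods"}), "release-ns")
cluster_role = Grant(READ_VERBS, frozenset({"pods"}), None)

# 1. Listing pods across all namespaces (a cluster-wide request).
print(is_allowed(namespaced_role, "list", "pods", None))           # False
# 2. Reading the monitor's own pod in the node-agents namespace.
print(is_allowed(namespaced_role, "get", "pods", "node-agents"))   # False
# The same two requests under a cluster-wide grant.
print(is_allowed(cluster_role, "list", "pods", None))              # True
print(is_allowed(cluster_role, "get", "pods", "node-agents"))      # True
```

Both requests fall outside the release namespace, which is why no set of rules on the namespaced Role could have made them succeed.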
Always render the read-only ClusterRole + ClusterRoleBinding regardless of
clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write,
exec, or secret access). resourceMonitor: false still fully disables the
component. clusterScope continues to gate the training/jobs isolation footprint
elsewhere -- it must not leave the node monitor without permissions it cannot
run without.
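
The gating described above can be sketched in Python as a function from chart values to rendered RBAC objects. This is an illustrative model, not the chart's template: the object name `resource-monitor`, the top-level position of the `resourceMonitor` key, and its default of `True` are assumptions, and the ServiceAccount is omitted for brevity. The resource list follows the PR description.

```python
from typing import Dict, List

READ_ONLY_VERBS = ["get", "list", "watch"]
RBAC_GROUP = "rbac.authorization.k8s.io"


def resource_monitor_rbac(values: Dict) -> List[Dict]:
    """Return the RBAC objects rendered for the resource monitor.

    Only `resourceMonitor` gates the output; `clusterScope` is
    deliberately never consulted.
    """
    # Assumption: the key is top-level and defaults to enabled.
    if not values.get("resourceMonitor", True):
        return []  # component fully disabled
    name = "resource-monitor"  # hypothetical object name
    sa_namespace = values["nodeAgents"]["namespace"]["name"]
    cluster_role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},  # cluster-scoped: no namespace
        "rules": [
            {"apiGroups": [""],
             "resources": ["pods", "pods/log", "nodes",
                           "nodes/status", "namespaces"],
             "verbs": READ_ONLY_VERBS},
            {"apiGroups": ["metrics.k8s.io"],
             "resources": ["pods", "nodes"],
             "verbs": READ_ONLY_VERBS},
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {"apiGroup": RBAC_GROUP, "kind": "ClusterRole",
                    "name": name},
        # The ServiceAccount itself stays in the node-agents namespace.
        "subjects": [{"kind": "ServiceAccount", "name": name,
                      "namespace": sa_namespace}],
    }
    return [cluster_role, binding]


values = {"clusterScope": False,
          "nodeAgents": {"namespace": {"name": "node-agents"}}}
print([obj["kind"] for obj in resource_monitor_rbac(values)])
# ['ClusterRole', 'ClusterRoleBinding']
```

Flipping `clusterScope` to `True` yields the same objects, and setting `resourceMonitor` to `False` yields an empty list.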
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
👋 Heads-up: Code review queue is at 22 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 629773d.
…erScope=false
Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the old behavior (namespaced Role + RoleBinding in the release namespace when clusterScope=false). Update that case to assert the corrected contract -- a ClusterRole + ClusterRoleBinding always render (with no metadata.namespace), while the subject SA still lives in the node-agents namespace.
The clusterScope=false path stays under test; only the asserted behavior changes to match the fix.
Verified with `helm unittest` (all resource-monitor suites pass).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
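
As a rough Python analogue of the corrected contract (the real assertions live in the `helm unittest` suite, which is not reproduced here), the check below walks a list of already-parsed manifest documents. The `check_monitor_rbac` helper, the sample documents, and the names `rm`, `release-ns`, and `node-agents` are hypothetical.

```python
from typing import Dict, List


def check_monitor_rbac(docs: List[Dict], node_agents_ns: str) -> List[str]:
    """Return contract violations; an empty list means the contract holds."""
    problems = []
    by_kind = {doc["kind"]: doc for doc in docs}
    for kind in ("ClusterRole", "ClusterRoleBinding"):
        doc = by_kind.get(kind)
        if doc is None:
            problems.append(f"{kind} not rendered")
        elif "namespace" in doc.get("metadata", {}):
            # Cluster-scoped objects must not carry a namespace.
            problems.append(f"{kind} must not set metadata.namespace")
    binding = by_kind.get("ClusterRoleBinding", {})
    for subject in binding.get("subjects", []):
        if subject.get("namespace") != node_agents_ns:
            problems.append("subject SA is outside the node-agents namespace")
    return problems


# Hypothetical rendered output after the fix (clusterScope=false).
fixed = [
    {"kind": "ClusterRole", "metadata": {"name": "rm"}},
    {"kind": "ClusterRoleBinding", "metadata": {"name": "rm"},
     "subjects": [{"kind": "ServiceAccount", "name": "rm",
                   "namespace": "node-agents"}]},
]
# Hypothetical rendered output before the fix (clusterScope=false).
old = [
    {"kind": "Role", "metadata": {"name": "rm", "namespace": "release-ns"}},
    {"kind": "RoleBinding",
     "metadata": {"name": "rm", "namespace": "release-ns"}},
]

print(check_monitor_rbac(fixed, "node-agents"))  # []
print(check_monitor_rbac(old, "node-agents"))
# ['ClusterRole not rendered', 'ClusterRoleBinding not rendered']
```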
Good catch, fixed in 6a869d8. You're right: Rather than delete the coverage, I updated that case to assert the corrected contract: under Verified locally with

Problem
With `clusterScope: false`, `templates/resource-monitor-rbac.yaml` rendered only a namespace-scoped Role in the release namespace. But the resource-monitor's code needs cluster scope:
* `resource_monitor.py` calls `core_v1_api.list_pod_for_all_namespaces(field_selector="spec.nodeName=<node>")`, a cluster-scoped `list pods` verb that a namespaced `Role` can never satisfy.
* It calls `read_namespaced_pod()` on its own pod, which lives in `.Values.nodeAgents.namespace.name` (the node-agents namespace), not `.Release.Namespace`, so a Role scoped to the release namespace misses it too.

Result on a live cluster: the DaemonSet fails with `403 Forbidden` on startup and enters `CrashLoopBackOff` (70+ restarts observed). Per-node resource metrics are dead while it's down.

Fix
Per-node monitoring is intrinsically cluster-scoped, so the resource-monitor's RBAC is deliberately decoupled from `clusterScope`. Always render the read-only `ClusterRole` + `ClusterRoleBinding`:
* `get`/`list`/`watch` on `pods`, `pods/log`, `nodes`, `nodes/status`, `namespaces`, and `metrics.k8s.io` `pods`/`nodes`.

`clusterScope` continues to gate the training/jobs isolation footprint elsewhere; it should not cripple node telemetry by leaving the DaemonSet without permissions it cannot run without. If a deployment genuinely cannot allow any cluster-scoped read, disable the monitor entirely via `resourceMonitor: false` (still supported) rather than deploying it broken.

Validation (`helm template`)
* `clusterScope=false` and `clusterScope=true` → both render `ClusterRole` + `ClusterRoleBinding` with the read-only rules.
* `resourceMonitor=false` → 0 objects (fully disabled).
* Rendered document order is unchanged (`ServiceAccount`@0, `ClusterRole`@1, `ClusterRoleBinding`@2), so existing `tests/resource_monitor_test.yaml` assertions still hold.
* `tests/rbac_test.yaml` covers `rbac.yaml` (jobs-manager), which is untouched. No existing tests removed or changed.

🤖 Generated with Claude Code
Note
Medium Risk
Introduces cluster-scoped read RBAC on every install where the monitor is enabled, even when operators chose namespace-scoped isolation via clusterScope=false.
Overview
Fixes resource-monitor DaemonSet CrashLoopBackOff when `clusterScope: false` by always installing a read-only `ClusterRole` + `ClusterRoleBinding` instead of a release-namespace `Role`/`RoleBinding`.

The monitor needs cluster-wide `list` on pods (per-node via `list_pod_for_all_namespaces`) and access to its own pod in the node-agents namespace, permissions a namespaced Role cannot provide. `clusterScope` no longer gates this RBAC; it still controls training/jobs isolation elsewhere. Disabling cluster reads entirely remains `resourceMonitor: false`.

Helm unit tests now assert cluster-scoped RBAC even with `clusterScope: false`.

Reviewed by Cursor Bugbot for commit 6a869d8. Bugbot is set up for automated code reviews on this repo.