fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope) by shujaatTracebloc · Pull Request #180 · tracebloc/client

shujaatTracebloc · 2026-06-02T12:13:19Z

Problem

With clusterScope: false, templates/resource-monitor-rbac.yaml rendered only a namespace-scoped Role in the release namespace. But the resource-monitor's code needs cluster scope:

resource_monitor.py calls core_v1_api.list_pod_for_all_namespaces(field_selector="spec.nodeName=<node>") — a cluster-scoped list pods verb that a namespaced Role can never satisfy.
It also read_namespaced_pod()s its own pod, which lives in .Values.nodeAgents.namespace.name (the node-agents namespace), not .Release.Namespace — so a Role scoped to the release namespace misses it too.

Result on a live cluster: the DaemonSet 403 Forbiddens on startup and CrashLoopBackOffs (70+ restarts observed). Per-node resource metrics are dead while it's down.

Fix

Per-node monitoring is intrinsically cluster-scoped, so the resource-monitor's RBAC is deliberately decoupled from clusterScope. Always render the read-only ClusterRole + ClusterRoleBinding:

get/list/watch on pods, pods/log, nodes, nodes/status, namespaces, and metrics.k8s.io pods/nodes.
No write, exec, or secret access — strictly read visibility of pod/node metadata.

clusterScope continues to gate the training/jobs isolation footprint elsewhere; it should not cripple node telemetry by leaving the DaemonSet without permissions it cannot run without. If a deployment genuinely cannot allow any cluster-scoped read, disable the monitor entirely via resourceMonitor: false (still supported) rather than deploying it broken.

Validation (`helm template`)

clusterScope=false and clusterScope=true → both render ClusterRole + ClusterRoleBinding with the read-only rules.
resourceMonitor=false → 0 objects (fully disabled).
Document order preserved (ServiceAccount@0, ClusterRole@1, ClusterRoleBinding@2), so existing tests/resource_monitor_test.yaml assertions still hold. tests/rbac_test.yaml covers rbac.yaml (jobs-manager), which is untouched. No existing tests removed or changed.

🤖 Generated with Claude Code

Note

Medium Risk
Introduces cluster-scoped read RBAC on every install where the monitor is enabled, even when operators chose namespace-scoped isolation via clusterScope=false.

Overview
Fixes resource-monitor DaemonSet CrashLoopBackOff when clusterScope: false by always installing a read-only ClusterRole + ClusterRoleBinding instead of a release-namespace Role/RoleBinding.

The monitor needs cluster-wide list on pods (per-node via list_pod_for_all_namespaces) and access to its own pod in the node-agents namespace—permissions a namespaced Role cannot provide. clusterScope no longer gates this RBAC; it still controls training/jobs isolation elsewhere. Disabling cluster reads entirely remains resourceMonitor: false.

Helm unit tests now assert cluster-scoped RBAC even with clusterScope: false.

^{Reviewed by Cursor Bugbot for commit 6a869d8. Bugbot is set up for automated code reviews on this repo. Configure here.}

…rom clusterScope) Under clusterScope: false the chart rendered only a namespace-scoped Role in the release namespace. But the resource-monitor's code: * calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...) -- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and * read_namespaced_pod()s its OWN pod, which lives in .Values.nodeAgents.namespace.name (NOT .Release.Namespace). So with clusterScope: false the DaemonSet 403'd on startup and crashlooped (70+ restarts observed on a live cluster). Per-node monitoring is intrinsically cluster-scoped. Always render the read-only ClusterRole + ClusterRoleBinding regardless of clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write, exec, or secret access). resourceMonitor: false still fully disables the component. clusterScope continues to gate the training/jobs isolation footprint elsewhere -- it must not leave the node monitor without permissions it cannot run without. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-02T12:14:08Z

👋 Heads-up — Code review queue is at 22 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
client#179 — Sync develop → main for v1.4.3 chart release · author: @saadqbal · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned
client-runtime#68 — fix(pods-monitor): watch the release namespace, not hardcoded "default" · author: @shujaatTracebloc · reviewer: @saadqbal
data-ingestors#132 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#133 — docs: fix declarative-ingest path/column drift (issue feat(#129): parent client chart owns the shared ingestor ServiceAccount (1.3.4) #131 A-series) · author: @divyasinghds · no reviewer assigned
data-ingestors#138 — test(e2e): end-to-end ingestion equivalence suite (all modalities, real MySQL) · author: @LukasWodka · no reviewer assigned
data-ingestors#139 — fix(ingest): MLM tokenizer + relax PascalVOC difficult + reserved-id guard (Sync develop → main for v1.3.5 chart release #137, Chart publish: include ingestor subchart in release workflow + CI lint #135) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 629773d. Configure here.}

…erScope=false Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the old behavior (namespaced Role + RoleBinding in the release namespace when clusterScope=false). Update that case to assert the corrected contract -- a ClusterRole + ClusterRoleBinding always render (with no metadata.namespace), while the subject SA still lives in the node-agents namespace. The clusterScope=false path stays under test; only the asserted behavior changes to match the fix. Verified with `helm unittest` (all resource-monitor suites pass). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

shujaatTracebloc · 2026-06-02T12:36:38Z

Good catch — fixed in 6a869d8.

You're right: tests/node_agents_namespace_test.yaml had a clusterScope: false case asserting the old namespaced Role/RoleBinding (in the release namespace), which this change intentionally replaces. I missed that file in my original scan (I'd only checked rbac_test.yaml and resource_monitor_test.yaml), so the "no existing tests changed" note was wrong — apologies.

Rather than delete the coverage, I updated that case to assert the corrected contract: under clusterScope: false the template now renders a ClusterRole + ClusterRoleBinding (with no metadata.namespace), while the subject ServiceAccount still lives in the node-agents namespace. The clusterScope: false path stays under test — only the asserted behavior changed to match the fix.

Verified locally with helm unittest: the three resource-monitor-related suites (node_agents_namespace_test, resource_monitor_test, rbac_test) all pass — 33/33. (Four other suites — auto-upgrade, image-refresh, network-policy, priority-class — fail with values don't meet the schema, but they fail identically on a clean develop checkout, so they're pre-existing and unrelated to this change.)

shujaatTracebloc requested a review from saadqbal as a code owner June 2, 2026 12:13

shujaatTracebloc assigned saadqbal Jun 2, 2026

shujaatTracebloc self-assigned this Jun 2, 2026

cursor Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread client/templates/resource-monitor-rbac.yaml

shujaatTracebloc unassigned saadqbal Jun 3, 2026

saadqbal approved these changes Jun 3, 2026

View reviewed changes

shujaatTracebloc merged commit b8233bc into develop Jun 3, 2026
11 checks passed

shujaatTracebloc deleted the fix/resource-monitor-cluster-rbac branch June 3, 2026 07:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)#180

fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)#180
shujaatTracebloc merged 2 commits into
developfrom
fix/resource-monitor-cluster-rbac

shujaatTracebloc commented Jun 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

LukasWodka commented Jun 2, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

shujaatTracebloc commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

shujaatTracebloc commented Jun 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Validation (helm template)

Uh oh!

LukasWodka commented Jun 2, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shujaatTracebloc commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shujaatTracebloc commented Jun 2, 2026 •

edited by cursor Bot

Loading

Validation (`helm template`)