
feat(client): harden image pinning and credentials (v1.0.4) #53

Merged: saadqbal merged 1 commit into develop from
feature/client-image-pinning-and-credential-hardening on Apr 24, 2026.

Conversation

@saadqbal (Contributor) commented Apr 24, 2026

Summary

Address the High-severity findings from the client chart security review.

  • Digest pinning: the tracebloc.image helper now supports an optional digest field. New images.jobsManager.digest, images.podsMonitor.digest, images.mysqlClient.digest, and images.busybox.digest values render repo@sha256:... and drop imagePullPolicy to IfNotPresent for immutable, auditable rollouts.
  • Remove mysql-client:latest — now uses images.mysqlClient.tag (default "prod"). The schema rejects "latest"; operators wanting absolute pinning should set images.mysqlClient.digest.
  • Harden bare-metal mysql init-container — still runs as root (kubelet does not apply fsGroup to hostPath volumes, k8s#138411), but now with drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
  • Remove deceptive credential placeholders: clientId / clientPassword defaults are now empty strings. Both the schema and secrets.yaml reject empty values and <...>-style placeholders so deployments fail fast instead of silently base64-encoding a placeholder into the Secret.
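A minimal sketch of what the digest-aware tracebloc.image helper could look like (the parameter shape passed to the helper is an assumption; only the repo@sha256:... vs repo:tag behavior comes from this PR):

```yaml
{{- /* _helpers.tpl sketch: render repo@digest when a digest is set, else repo:tag */ -}}
{{- define "tracebloc.image" -}}
{{- if .digest -}}
{{ .repo }}@{{ .digest }}
{{- else -}}
{{ .repo }}:{{ .tag }}
{{- end -}}
{{- end -}}
```

With images.mysqlClient.digest set to a sha256:... value, the helper would emit the immutable repo@sha256:... form and the chart can safely drop imagePullPolicy to IfNotPresent.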

Chart bumped 1.0.3 → 1.0.4.

Test plan

  • helm lint client passes (CSI and bare-metal value sets)
  • helm unittest client — all 76 tests pass, plus new coverage for placeholder and empty-credential rejection
  • helm template renders correct digest form when images.*.digest is set
  • helm template with clientId: "<CLIENT_ID>" fails schema validation
  • Smoke install on a CSI cluster (EKS/AKS) with and without images.mysqlClient.digest
  • Smoke install on bare-metal with hostPath.enabled=true to confirm the hardened init-container still succeeds

Note

Medium risk: this changes rendered image references/pull policies and makes clientId/clientPassword validation fail-fast, which can break existing installs that relied on placeholders or implicit pulls.

Overview
Bumps the chart/app version to 1.0.4 and adds an images.* values block to support optional digest-pinned images via tracebloc.image (rendering repo@sha256:... when provided) and adjusts imagePullPolicy accordingly.

Removes mutable image usage by switching MySQL and init-container images to configurable tags/digests (rejecting latest for mysql-client/busybox) and tightens the bare-metal MySQL initContainer security context (minimal capabilities, no escalation, read-only rootfs, default seccomp).

Makes credentials mandatory and non-placeholder: defaults for clientId/clientPassword become empty, secrets.yaml fails on empty or <...> values, schema validation is updated to enforce this, and Helm unit tests are updated/added to cover the new failure cases.

Reviewed by Cursor Bugbot for commit eac812a.

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.
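The hardened init-container described above might render roughly as follows (the container name and chown path follow later commits in this thread; the surrounding spec is illustrative):

```yaml
initContainers:
  - name: init-mysql-data
    command: ["sh", "-c", "chown 999:999 /var/lib/mysql"]
    securityContext:
      runAsUser: 0                 # still root: kubelet skips fsGroup on hostPath (k8s#138411)
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop: ["ALL"]
        add: ["CHOWN"]             # the only capability the chown needs
```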

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@saadqbal saadqbal merged commit d0522ed into develop Apr 24, 2026
1 check passed
saadqbal added a commit that referenced this pull request Apr 24, 2026
…s upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
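The nil-safe read might look like this (template variable names are illustrative; the `default dict` chaining is the pattern the commit describes):

```yaml
{{- /* tolerate a stored-values tree missing either images or images.resourceMonitor */ -}}
{{- $rm := (.Values.images | default dict).resourceMonitor | default dict }}
{{- $digest := $rm.digest | default "" }}
```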
saadqbal added a commit that referenced this pull request Apr 24, 2026
…0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.
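The template-side guard could be as simple as this (the failure message text is illustrative):

```yaml
{{- if .Values.networkPolicy.training.enabled }}
{{- if not .Values.networkPolicy.training.clusterCidrs }}
{{- fail "networkPolicy.training.clusterCidrs must contain at least one CIDR when the training policy is enabled" }}
{{- end }}
{{- end }}
```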

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
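The draft 7 if/then shape described above, as a values.schema.json fragment (JSON is valid YAML; the surrounding property nesting is a sketch):

```yaml
"training": {
  "type": "object",
  "if":   { "properties": { "enabled": { "const": true } } },
  "then": { "properties": { "clusterCidrs": { "type": "array", "minItems": 1 } } }
}
```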
saadqbal added a commit that referenced this pull request Apr 24, 2026
* Add NetworkPolicy locking down training-pod egress

Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:

  - Denies all ingress (nothing should connect TO a training pod).
  - Allows DNS to the cluster DNS service.
  - Allows external TCP/443 only; blocks all pod-to-pod, ClusterIP, and
    in-cluster pod traffic via ipBlock with cluster-CIDR exclusions.

Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.

Per-platform defaults:
  AKS:  enabled=true  (requires Azure NPM or Calico at cluster create)
  EKS:  enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
                       to explicitly disable than silently have no effect)
  BM:   enabled=true  (requires Calico / Cilium / kube-router)
  OC:   enabled=true  (OVN-Kubernetes enforces by default; custom DNS
                       selector and OpenShift pod/service CIDRs)

The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).

- templates/network-policy-training.yaml: new policy (gated on
  networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
  rendering, ingress denial, DNS allow, external HTTPS allow, cluster
  CIDR blocking, and the OpenShift selector override

No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add optional Namespace resource with Pod Security Admission labels (#43)

* Add optional Namespace resource with Pod Security Admission labels

Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.

When namespace.create is true, the chart templates a Namespace with:

    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    helm.sh/resource-policy: keep

Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.

Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:

  - mysql init containers run as UID 0 (needed to chown the PVC
    before the main container -- UID 999 -- starts)
  - resource-monitor DaemonSet mounts hostPath /proc and /sys

Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.

helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
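Putting those pieces together, the templated Namespace might look like this sketch (the with-guard on enforce matches the "enforce omitted when empty" test case below):

```yaml
{{- if .Values.namespace.create }}
apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Release.Namespace }}
  annotations:
    helm.sh/resource-policy: keep     # survive helm uninstall
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    {{- with .Values.namespace.podSecurity.enforce }}
    pod-security.kubernetes.io/enforce: {{ . }}
    {{- end }}
{{- end }}
```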

- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
  on, keep annotation, labels, version strings, enforce omitted when
  empty, enforce present when set, baseline override, namespace name
  respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
  paths with copy-pasteable kubectl label commands

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix kubeVersion constraint to accept cloud pre-release suffixes

Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.

Surfaced while dry-run-installing PR #43 against a dev EKS cluster.
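The one-line fix in Chart.yaml:

```yaml
kubeVersion: ">=1.24.0-0"   # trailing -0 opts the range into pre-release (vendor-suffixed) versions
```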

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>

* Add consolidated SECURITY.md covering the training-pod sandbox (#44)

Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.

Sections:

  1.  Threat model: trusted platform, untrusted external-data-
      scientist submissions. Explicit in-scope / out-of-scope.
  2.  Seven design goals (G1-G7) for the training-pod sandbox,
      each mapped to current status on new-arch vs. legacy.
  3.  Architecture overview.
  4.  Defense layers -- credential isolation, network egress,
      K8s API access, container runtime hardening, storage
      isolation, cross-tenant forgeability, admission tripwire.
  5.  Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
      bare-metal/OpenShift), PSA version requirements, OpenShift
      DNS selector override, runAsUser + arbitrary UIDs, bare-
      metal hostPath note.
  6.  What operators must do themselves -- rotate secrets, verify
      CNI enforces, label existing namespaces, monitor audit,
      upgrade ordering, refactor path for enforce: restricted.
  7.  Verification -- copy-pasteable kubectl snippets for each
      defense layer.
  8.  Residual risks with explicit ownership -- global SB conn
      strings (backend), HTTPS egress (platform endgame), token
      TTL (backend), legacy arch (migration team), PSA enforce
      (chart refactor), CNI silent no-op (operator), kernel
      escape (out of scope), resource DoS (out of scope).
  9.  Compromise response playbook.
  10. Where each defense is implemented (code-path map for
      reviewers).
  11. Document history.

Also:

- README.md: add Security subsection under Deployment Guide
  linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.

No code changes; documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)

Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).

* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)

Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.

Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)

* feat(mysql): drop root init-containers, add PSA-restricted securityContext

Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.

The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:

- On existing PVCs (already owned 999:999 from the prior init-container
  chown) OnRootMismatch sees the correct root ownership and skips the
  recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
  creation.

Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
  default user, and the entrypoint skips its root-to-mysql gosu re-exec
  when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault
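Those six fields as a container-level securityContext fragment:

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 999            # default user of the mysql:5.7.41 base image
  runAsGroup: 999
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```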

Scope: client chart only (now the universal chart covering eks/aks/bm/oc).

Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
  (EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
  object-backed drivers do not; chart docs should flag this in a
  follow-up.

Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
  for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).

* fix(mysql): restore chown init-container for hostPath (bare-metal)

kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)

* Move tracebloc-resource-monitor to dedicated privileged namespace

Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would reject the DaemonSet outright,
and `warn=restricted` + `audit=restricted` already spam violations.

This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.

Changes:
- templates/node-agents-namespace.yaml: new, gated on
  nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
  node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
  (OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
  from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases

Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Mirror secrets into node-agents ns; keep namespace RBAC in release ns

Two follow-ups from review of the namespace-split change:

1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
   cannot `secretKeyRef` a Secret that only exists in the release
   namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
   CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
   secret, both of which template only into `.Release.Namespace`, so
   pods would have failed to start with CreateContainerConfigError.

   templates/secrets.yaml and templates/docker-registry-secret.yaml now
   template a second copy into `nodeAgents.namespace.name` when:
     resourceMonitor != false  AND  node-agents ns != release ns

   The mirror is skipped when the two namespaces collide (e.g. operator
   points nodeAgents.namespace.name back at the release namespace) so
   Helm does not try to create two resources with the same name.

2. When clusterScope: false, the Role must live in the RELEASE
   namespace because that is where the monitored workloads run — a
   namespace-scoped Role only grants access to its own namespace.
   Previously this PR put the Role in `tracebloc-node-agents`, which
   would have silently broken the resource-monitor for anyone not
   using ClusterRole. Role + RoleBinding are now back in
   `.Release.Namespace`; the RoleBinding subject still points at the
   ServiceAccount in the node-agents namespace (cross-namespace
   subjects in RoleBindings are valid).

Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns

Two review fixes from the PSA hardening change:

1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
   which now resolves to the node-agents namespace (where the DaemonSet
   pods live) instead of the release namespace (where the monitored
   workloads live). Replace with the literal Release.Namespace so the
   monitor continues to watch the right namespace regardless of where
   its own pods run.

2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
   release namespace if an operator set nodeAgents.namespace.name to the
   release namespace (and with namespace.create=true it would render two
   Namespace docs with the same name — a render-time collision). Add an
   equality guard so the template is a no-op in that configuration.

Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.
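The before/after for the NAMESPACE env var (surrounding container spec elided):

```yaml
# Before: Downward API resolves to wherever the pod runs (now the node-agents ns)
- name: NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
# After: pinned to the release namespace holding the monitored workloads
- name: NAMESPACE
  value: {{ .Release.Namespace }}
```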

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)

Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:

- /var/run/mysqld       pid file + unix socket
- /tmp                  temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files  default secure_file_priv dir (touched at start)

Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.
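A sketch of the three mounts (volume names are illustrative; volumeMounts sit on the mysql container, volumes on the pod spec):

```yaml
securityContext:
  readOnlyRootFilesystem: true
volumeMounts:
  - { name: run-mysqld,  mountPath: /var/run/mysqld }       # pid file + unix socket
  - { name: tmp,         mountPath: /tmp }                  # temp tables, sort buffers
  - { name: mysql-files, mountPath: /var/lib/mysql-files }  # secure_file_priv dir
# pod-level:
volumes:
  - { name: run-mysqld,  emptyDir: {} }
  - { name: tmp,         emptyDir: {} }
  - { name: mysql-files, emptyDir: {} }
```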

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)

- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
  apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
  forcing the chart to render a privileged init-mysql-data chown
  container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
  updated to document the CSI-default / bare-metal-override split.

Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(installer): slim k3d and add dev overrides for local testing (#54)

The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.

Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.

Also add two env vars for local-chart testing in install-client-helm.sh:

  TRACEBLOC_CHART_PATH    install from a local chart path instead of the
                          published tracebloc/client Helm repo (skips
                          helm repo add/update)
  TRACEBLOC_VALUES_FILE   use the caller-supplied values file as-is and
                          skip the clientId/password prompts + values.yaml
                          generation

With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): harden image pinning and credentials (v1.0.4) (#53)

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)

The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.

- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
  create args. k3s bundles metrics-server; the earlier comment claiming
  the chart "ships its own" was wrong — the DaemonSet is a consumer of
  metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
  `lookup` that fails the release up front when resourceMonitor is true
  but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
  probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
  with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
  need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
  (resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.

Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.
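The pre-flight could be shaped like this (the exact probe objects and message are assumptions; the kube-system gate keeps offline `helm template`, where lookup returns nothing, rendering cleanly):

```yaml
{{- if .Values.resourceMonitor }}
{{- /* lookup returns empty during offline helm template, so gate on a known object */ -}}
{{- if lookup "v1" "Namespace" "" "kube-system" }}
{{- if not (lookup "apiregistration.k8s.io/v1" "APIService" "" "v1beta1.metrics.k8s.io") }}
{{- fail "resourceMonitor requires metrics-server: v1beta1.metrics.k8s.io is not registered" }}
{{- end }}
{{- end }}
{{- end }}
```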

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)

The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).

kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.

Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact

* Chore/merge main into develop (#58)

* Update README.md

* Add narrow CODEOWNERS for security-sensitive paths

* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1, since the resource-monitor DaemonSet relies on the metrics API. Aligns the PowerShell installer with the earlier shell-side change that made metrics-server a requirement.

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning

fix(client): pin resource-monitor by digest (v1.0.7)

* chore: add auto-add to engineer kanban workflow (#45)

* Add auto-add to engineer kanban workflow

* fix(ci): pin actions/add-to-project to v1.0.2

@v1 is not a valid tag — action publishes full semver only. Pin to v1.0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).

---------

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
