
feat(client): harden image pinning and credentials (v1.0.4) #53

Merged: saadqbal merged 1 commit into develop from
feature/client-image-pinning-and-credential-hardening on Apr 24, 2026.

Conversation

@saadqbal (Contributor) commented Apr 24, 2026

Summary

Address the High-severity findings from the client chart security review.

  • Digest pinning: the tracebloc.image helper now supports an optional digest field. New images.jobsManager.digest, images.podsMonitor.digest, images.mysqlClient.digest, and images.busybox.digest values render repo@sha256:... and drop imagePullPolicy to IfNotPresent for immutable, auditable rollouts.
  • Remove mysql-client:latest — now uses images.mysqlClient.tag (default "prod"). The schema rejects "latest"; operators wanting absolute pinning should set images.mysqlClient.digest.
  • Harden bare-metal mysql init-container — still runs as root (kubelet does not apply fsGroup to hostPath volumes, k8s#138411), but now with drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
  • Remove deceptive credential placeholders: clientId / clientPassword defaults are now empty strings. Both the schema and secrets.yaml reject empty values and <...>-style placeholders so deployments fail fast instead of silently base64-encoding a placeholder into the Secret.
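A minimal sketch of what the digest-aware tracebloc.image helper could look like (the parameter shape passed to the helper is an assumption; only the repo@sha256:... vs repo:tag behavior comes from this PR):

```yaml
{{- /* _helpers.tpl sketch: render repo@digest when a digest is set, else repo:tag */ -}}
{{- define "tracebloc.image" -}}
{{- if .digest -}}
{{ .repo }}@{{ .digest }}
{{- else -}}
{{ .repo }}:{{ .tag }}
{{- end -}}
{{- end -}}
```

With images.mysqlClient.digest set to a sha256:... value, the helper would emit the immutable repo@sha256:... form and the chart can safely drop imagePullPolicy to IfNotPresent.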

Chart bumped 1.0.3 → 1.0.4.

Test plan

  • helm lint client passes (CSI and bare-metal value sets)
  • helm unittest client — all 76 tests pass, plus new coverage for placeholder and empty-credential rejection
  • helm template renders correct digest form when images.*.digest is set
  • helm template with clientId: "<CLIENT_ID>" fails schema validation
  • Smoke install on a CSI cluster (EKS/AKS) with and without images.mysqlClient.digest
  • Smoke install on bare-metal with hostPath.enabled=true to confirm the hardened init-container still succeeds

Note

Medium risk: this changes rendered image references/pull policies and makes clientId/clientPassword validation fail-fast, which can break existing installs that relied on placeholders or implicit pulls.

Overview
Bumps the chart/app version to 1.0.4 and adds an images.* values block to support optional digest-pinned images via tracebloc.image (rendering repo@sha256:... when provided) and adjusts imagePullPolicy accordingly.

Removes mutable image usage by switching MySQL and init-container images to configurable tags/digests (rejecting latest for mysql-client/busybox) and tightens the bare-metal MySQL initContainer security context (minimal capabilities, no escalation, read-only rootfs, default seccomp).

Makes credentials mandatory and non-placeholder: defaults for clientId/clientPassword become empty, secrets.yaml fails on empty or <...> values, schema validation is updated to enforce this, and Helm unit tests are updated/added to cover the new failure cases.

Reviewed by Cursor Bugbot for commit eac812a.

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.
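The hardened init-container described above might render roughly as follows (the container name and chown path follow later commits in this thread; the surrounding spec is illustrative):

```yaml
initContainers:
  - name: init-mysql-data
    command: ["sh", "-c", "chown 999:999 /var/lib/mysql"]
    securityContext:
      runAsUser: 0                 # still root: kubelet skips fsGroup on hostPath (k8s#138411)
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop: ["ALL"]
        add: ["CHOWN"]             # the only capability the chown needs
```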

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@saadqbal saadqbal merged commit d0522ed into develop Apr 24, 2026
1 check passed
saadqbal added a commit that referenced this pull request Apr 24, 2026
…s upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
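The nil-safe read might look like this (template variable names are illustrative; the `default dict` chaining is the pattern the commit describes):

```yaml
{{- /* tolerate a stored-values tree missing either images or images.resourceMonitor */ -}}
{{- $rm := (.Values.images | default dict).resourceMonitor | default dict }}
{{- $digest := $rm.digest | default "" }}
```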
saadqbal added a commit that referenced this pull request Apr 24, 2026
…0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.
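The template-side guard could be as simple as this (the failure message text is illustrative):

```yaml
{{- if .Values.networkPolicy.training.enabled }}
{{- if not .Values.networkPolicy.training.clusterCidrs }}
{{- fail "networkPolicy.training.clusterCidrs must contain at least one CIDR when the training policy is enabled" }}
{{- end }}
{{- end }}
```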

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
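The draft 7 if/then shape described above, as a values.schema.json fragment (JSON is valid YAML; the surrounding property nesting is a sketch):

```yaml
"training": {
  "type": "object",
  "if":   { "properties": { "enabled": { "const": true } } },
  "then": { "properties": { "clusterCidrs": { "type": "array", "minItems": 1 } } }
}
```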
saadqbal added a commit that referenced this pull request Apr 24, 2026
* Add NetworkPolicy locking down training-pod egress

Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:

  - Denies all ingress (nothing should connect TO a training pod).
  - Allows DNS to the cluster DNS service.
  - Allows external TCP/443 only; blocks all pod-to-pod, ClusterIP, and
    in-cluster pod traffic via ipBlock with cluster-CIDR exclusions.

Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.

Per-platform defaults:
  AKS:  enabled=true  (requires Azure NPM or Calico at cluster create)
  EKS:  enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
                       to explicitly disable than silently have no effect)
  BM:   enabled=true  (requires Calico / Cilium / kube-router)
  OC:   enabled=true  (OVN-Kubernetes enforces by default; custom DNS
                       selector and OpenShift pod/service CIDRs)

The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).

- templates/network-policy-training.yaml: new policy (gated on
  networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
  rendering, ingress denial, DNS allow, external HTTPS allow, cluster
  CIDR blocking, and the OpenShift selector override

No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add optional Namespace resource with Pod Security Admission labels (#43)

* Add optional Namespace resource with Pod Security Admission labels

Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.

When namespace.create is true, the chart templates a Namespace with:

    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    helm.sh/resource-policy: keep

Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.

Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:

  - mysql init containers run as UID 0 (needed to chown the PVC
    before the main container -- UID 999 -- starts)
  - resource-monitor DaemonSet mounts hostPath /proc and /sys

Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.

helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
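Putting those pieces together, the templated Namespace might look like this sketch (the with-guard on enforce matches the "enforce omitted when empty" test case below):

```yaml
{{- if .Values.namespace.create }}
apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Release.Namespace }}
  annotations:
    helm.sh/resource-policy: keep     # survive helm uninstall
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    {{- with .Values.namespace.podSecurity.enforce }}
    pod-security.kubernetes.io/enforce: {{ . }}
    {{- end }}
{{- end }}
```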

- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
  on, keep annotation, labels, version strings, enforce omitted when
  empty, enforce present when set, baseline override, namespace name
  respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
  paths with copy-pasteable kubectl label commands

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix kubeVersion constraint to accept cloud pre-release suffixes

Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.

Surfaced while dry-run-installing PR #43 against a dev EKS cluster.
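The one-line fix in Chart.yaml:

```yaml
kubeVersion: ">=1.24.0-0"   # trailing -0 opts the range into pre-release (vendor-suffixed) versions
```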

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>

* Add consolidated SECURITY.md covering the training-pod sandbox (#44)

Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.

Sections:

  1.  Threat model: trusted platform, untrusted external-data-
      scientist submissions. Explicit in-scope / out-of-scope.
  2.  Seven design goals (G1-G7) for the training-pod sandbox,
      each mapped to current status on new-arch vs. legacy.
  3.  Architecture overview.
  4.  Defense layers -- credential isolation, network egress,
      K8s API access, container runtime hardening, storage
      isolation, cross-tenant forgeability, admission tripwire.
  5.  Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
      bare-metal/OpenShift), PSA version requirements, OpenShift
      DNS selector override, runAsUser + arbitrary UIDs, bare-
      metal hostPath note.
  6.  What operators must do themselves -- rotate secrets, verify
      CNI enforces, label existing namespaces, monitor audit,
      upgrade ordering, refactor path for enforce: restricted.
  7.  Verification -- copy-pasteable kubectl snippets for each
      defense layer.
  8.  Residual risks with explicit ownership -- global SB conn
      strings (backend), HTTPS egress (platform endgame), token
      TTL (backend), legacy arch (migration team), PSA enforce
      (chart refactor), CNI silent no-op (operator), kernel
      escape (out of scope), resource DoS (out of scope).
  9.  Compromise response playbook.
  10. Where each defense is implemented (code-path map for
      reviewers).
  11. Document history.

Also:

- README.md: add Security subsection under Deployment Guide
  linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.

No code changes; documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)

Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).

* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)

Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.

Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)

* feat(mysql): drop root init-containers, add PSA-restricted securityContext

Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.

The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:

- On existing PVCs (already owned 999:999 from the prior init-container
  chown) OnRootMismatch sees the correct root ownership and skips the
  recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
  creation.

Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
  default user, and the entrypoint skips its root-to-mysql gosu re-exec
  when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault
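Those six fields as a container-level securityContext fragment:

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 999            # default user of the mysql:5.7.41 base image
  runAsGroup: 999
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```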

Scope: client chart only (now the universal chart covering eks/aks/bm/oc).

Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
  (EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
  object-backed drivers do not; chart docs should flag this in a
  follow-up.

Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
  for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).

* fix(mysql): restore chown init-container for hostPath (bare-metal)

kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)

* Move tracebloc-resource-monitor to dedicated privileged namespace

Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would reject the DaemonSet outright,
and `warn=restricted` + `audit=restricted` already spam violations.

This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.

Changes:
- templates/node-agents-namespace.yaml: new, gated on
  nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
  node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
  (OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
  from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases

Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Mirror secrets into node-agents ns; keep namespace RBAC in release ns

Two follow-ups from review of the namespace-split change:

1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
   cannot `secretKeyRef` a Secret that only exists in the release
   namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
   CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
   secret, both of which template only into `.Release.Namespace`, so
   pods would have failed to start with CreateContainerConfigError.

   templates/secrets.yaml and templates/docker-registry-secret.yaml now
   template a second copy into `nodeAgents.namespace.name` when:
     resourceMonitor != false  AND  node-agents ns != release ns

   The mirror is skipped when the two namespaces collide (e.g. operator
   points nodeAgents.namespace.name back at the release namespace) so
   Helm does not try to create two resources with the same name.

2. When clusterScope: false, the Role must live in the RELEASE
   namespace because that is where the monitored workloads run — a
   namespace-scoped Role only grants access to its own namespace.
   Previously this PR put the Role in `tracebloc-node-agents`, which
   would have silently broken the resource-monitor for anyone not
   using ClusterRole. Role + RoleBinding are now back in
   `.Release.Namespace`; the RoleBinding subject still points at the
   ServiceAccount in the node-agents namespace (cross-namespace
   subjects in RoleBindings are valid).

Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns

Two review fixes from the PSA hardening change:

1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
   which now resolves to the node-agents namespace (where the DaemonSet
   pods live) instead of the release namespace (where the monitored
   workloads live). Replace with the literal Release.Namespace so the
   monitor continues to watch the right namespace regardless of where
   its own pods run.

2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
   release namespace if an operator set nodeAgents.namespace.name to the
   release namespace (and with namespace.create=true it would render two
   Namespace docs with the same name — a render-time collision). Add an
   equality guard so the template is a no-op in that configuration.

Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.
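The before/after for the NAMESPACE env var (surrounding container spec elided):

```yaml
# Before: Downward API resolves to wherever the pod runs (now the node-agents ns)
- name: NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
# After: pinned to the release namespace holding the monitored workloads
- name: NAMESPACE
  value: {{ .Release.Namespace }}
```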

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)

Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:

- /var/run/mysqld       pid file + unix socket
- /tmp                  temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files  default secure_file_priv dir (touched at start)

Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.
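A sketch of the three mounts (volume names are illustrative; volumeMounts sit on the mysql container, volumes on the pod spec):

```yaml
securityContext:
  readOnlyRootFilesystem: true
volumeMounts:
  - { name: run-mysqld,  mountPath: /var/run/mysqld }       # pid file + unix socket
  - { name: tmp,         mountPath: /tmp }                  # temp tables, sort buffers
  - { name: mysql-files, mountPath: /var/lib/mysql-files }  # secure_file_priv dir
# pod-level:
volumes:
  - { name: run-mysqld,  emptyDir: {} }
  - { name: tmp,         emptyDir: {} }
  - { name: mysql-files, emptyDir: {} }
```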

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)

- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
  apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
  forcing the chart to render a privileged init-mysql-data chown
  container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
  updated to document the CSI-default / bare-metal-override split.

Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(installer): slim k3d and add dev overrides for local testing (#54)

The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.

Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.

Also add two env vars for local-chart testing in install-client-helm.sh:

  TRACEBLOC_CHART_PATH    install from a local chart path instead of the
                          published tracebloc/client Helm repo (skips
                          helm repo add/update)
  TRACEBLOC_VALUES_FILE   use the caller-supplied values file as-is and
                          skip the clientId/password prompts + values.yaml
                          generation

With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): harden image pinning and credentials (v1.0.4) (#53)

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)

The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.

- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
  create args. k3s bundles metrics-server; the earlier comment claiming
  the chart "ships its own" was wrong — the DaemonSet is a consumer of
  metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
  `lookup` that fails the release up front when resourceMonitor is true
  but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
  probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
  with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
  need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
  (resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.

Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.
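The pre-flight could be shaped like this (the exact probe objects and message are assumptions; the kube-system gate keeps offline `helm template`, where lookup returns nothing, rendering cleanly):

```yaml
{{- if .Values.resourceMonitor }}
{{- /* lookup returns empty during offline helm template, so gate on a known object */ -}}
{{- if lookup "v1" "Namespace" "" "kube-system" }}
{{- if not (lookup "apiregistration.k8s.io/v1" "APIService" "" "v1beta1.metrics.k8s.io") }}
{{- fail "resourceMonitor requires metrics-server: v1beta1.metrics.k8s.io is not registered" }}
{{- end }}
{{- end }}
{{- end }}
```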

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)

The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).

kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.

Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact

* Chore/merge main into develop (#58)

* Update README.md

* Add narrow CODEOWNERS for security-sensitive paths

* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1, since the resource-monitor DaemonSet relies on the metrics API. Aligns the PowerShell installer with the earlier shell-side change that made metrics-server a requirement.

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning

fix(client): pin resource-monitor by digest (v1.0.7)

* chore: add auto-add to engineer kanban workflow (#45)

* Add auto-add to engineer kanban workflow

* fix(ci): pin actions/add-to-project to v1.0.2

@v1 is not a valid tag — action publishes full semver only. Pin to v1.0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).

---------

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
