feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)
Merged

Conversation
Unblocks pod-security.kubernetes.io/enforce: restricted on the release namespace. Previously the mysql-client pod had two init-containers running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999 before mysqld started. PSA restricted rejects runAsUser: 0 on any container, so these init-containers were the last blocker to promoting the namespace from warn/audit to enforce.

The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch` at the pod level, which kubelet uses to chgrp mounted volumes on first mount. Once that is in place the init-container chowns are redundant:
- On existing PVCs (already owned 999:999 from the prior init-container chown) OnRootMismatch sees the correct root ownership and skips the recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume creation.

Also adds a container-level securityContext with all six fields PSA restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's default user, and the entrypoint skips its root-to-mysql gosu re-exec when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault

Scope: client chart only (now the universal chart covering eks/aks/bm/oc).

Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType (EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some object-backed drivers do not; chart docs should flag this in a follow-up.

Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).
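For illustration, the fields above assemble into roughly this pod-spec shape (a sketch only: the container name and surrounding Deployment fields are assumptions; the securityContext values come from the description):

```yaml
spec:
  securityContext:                        # pod level
    fsGroup: 999
    fsGroupChangePolicy: OnRootMismatch   # recursive chgrp only when the volume root mismatches
  containers:
    - name: mysql-client                  # name assumed
      securityContext:                    # container level: PSA restricted compliant
        runAsNonRoot: true
        runAsUser: 999                    # mysql:5.7.41 image default user
        runAsGroup: 999
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
```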
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8184c64.
fix(mysql): restore chown init-container for hostPath (bare-metal)
kubelet does not apply fsGroup ownership to hostPath volumes (kubernetes/kubernetes#138411), so bare-metal installs need a privileged bootstrap to chown /var/lib/mysql to 999:999 on first start. Gated on .Values.hostPath.enabled so CSI-backed deployments (EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aptracebloc approved these changes on Apr 23, 2026
saadqbal added a commit that referenced this pull request on Apr 24, 2026:
* Add NetworkPolicy locking down training-pod egress
Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:
- Denies all ingress (nothing should connect TO a training pod).
- Allows DNS to the cluster DNS service.
- Allows external TCP/443 only; blocks all pod-to-pod, ClusterIP, and
in-cluster pod traffic via ipBlock with cluster-CIDR exclusions.
Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.
Per-platform defaults:
AKS: enabled=true (requires Azure NPM or Calico at cluster create)
EKS: enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
to explicitly disable than silently have no effect)
BM: enabled=true (requires Calico / Cilium / kube-router)
OC: enabled=true (OVN-Kubernetes enforces by default; custom DNS
selector and OpenShift pod/service CIDRs)
The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).
- templates/network-policy-training.yaml: new policy (gated on
networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
rendering, ingress denial, DNS allow, external HTTPS allow, cluster
CIDR blocking, and the OpenShift selector override
No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
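For reference, a rendered policy of this shape would look roughly like the sketch below. The CIDRs are placeholders (k3s defaults); real values come from `networkPolicy.training.clusterCidrs`, and the policy name is assumed:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress                 # name assumed
spec:
  podSelector:
    matchLabels:
      tracebloc.io/workload: training
  policyTypes: [Ingress, Egress]        # Ingress listed with no rules = deny all inbound
  egress:
    - to:                               # cluster DNS only (dnsSelector fallback shown)
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - { port: 53, protocol: UDP }
        - { port: 53, protocol: TCP }
    - to:                               # external HTTPS only: world minus cluster CIDRs
        - ipBlock:
            cidr: 0.0.0.0/0
            except: [10.42.0.0/16, 10.43.0.0/16]   # placeholder pod/service CIDRs
      ports:
        - { port: 443, protocol: TCP }
```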
* Add optional Namespace resource with Pod Security Admission labels (#43)
* Add optional Namespace resource with Pod Security Admission labels
Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.
When namespace.create is true, the chart templates a Namespace with:
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
helm.sh/resource-policy: keep
Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.
Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:
- mysql init containers run as UID 0 (needed to chown the PVC
before the main container -- UID 999 -- starts)
- resource-monitor DaemonSet mounts hostPath /proc and /sys
Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.
helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
on, keep annotation, labels, version strings, enforce omitted when
empty, enforce present when set, baseline override, namespace name
respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
paths with copy-pasteable kubectl label commands
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
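Sketched output of the gated template with defaults (the namespace name is illustrative; `enforce` stays unrendered until the operator sets it):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tracebloc-templates              # illustrative; follows the release
  annotations:
    helm.sh/resource-policy: keep        # survives helm uninstall, protecting PVC-backed data
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # pod-security.kubernetes.io/enforce is rendered only when
    # namespace.podSecurity.enforce is set
```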
* Fix kubeVersion constraint to accept cloud pre-release suffixes
Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.
Surfaced while dry-run-installing PR #43 against a dev EKS cluster.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
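In Chart.yaml terms, the one-line change:

```yaml
# The trailing "-0" opts the range into pre-release matching, so
# "1.34.4-eks-f69f56f" and similar vendor-suffixed versions satisfy it.
kubeVersion: ">=1.24.0-0"
```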
* Add consolidated SECURITY.md covering the training-pod sandbox (#44)
Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.
Sections:
1. Threat model: trusted platform, untrusted external-data-
scientist submissions. Explicit in-scope / out-of-scope.
2. Seven design goals (G1-G7) for the training-pod sandbox,
each mapped to current status on new-arch vs. legacy.
3. Architecture overview.
4. Defense layers -- credential isolation, network egress,
K8s API access, container runtime hardening, storage
isolation, cross-tenant forgeability, admission tripwire.
5. Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
bare-metal/OpenShift), PSA version requirements, OpenShift
DNS selector override, runAsUser + arbitrary UIDs, bare-
metal hostPath note.
6. What operators must do themselves -- rotate secrets, verify
CNI enforces, label existing namespaces, monitor audit,
upgrade ordering, refactor path for enforce: restricted.
7. Verification -- copy-pasteable kubectl snippets for each
defense layer.
8. Residual risks with explicit ownership -- global SB conn
strings (backend), HTTPS egress (platform endgame), token
TTL (backend), legacy arch (migration team), PSA enforce
(chart refactor), CNI silent no-op (operator), kernel
escape (out of scope), resource DoS (out of scope).
9. Compromise response playbook.
10. Where each defense is implemented (code-path map for
reviewers).
11. Document history.
Also:
- README.md: add Security subsection under Deployment Guide
linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.
No code changes; documentation only.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)
Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).
* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)
Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.
Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)
* feat(mysql): drop root init-containers, add PSA-restricted securityContext
Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.
The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:
- On existing PVCs (already owned 999:999 from the prior init-container
chown) OnRootMismatch sees the correct root ownership and skips the
recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
creation.
Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
default user, and the entrypoint skips its root-to-mysql gosu re-exec
when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault
Scope: client chart only (now the universal chart covering eks/aks/bm/oc).
Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
(EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
object-backed drivers do not; chart docs should flag this in a
follow-up.
Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).
* fix(mysql): restore chown init-container for hostPath (bare-metal)
kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)
* Move tracebloc-resource-monitor to dedicated privileged namespace
Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would therefore reject the DaemonSet,
and `warn=restricted` + `audit=restricted` already spam violations.
This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.
Changes:
- templates/node-agents-namespace.yaml: new, gated on
nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
(OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases
Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
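The new namespace, sketched (default name shown; configurable via `nodeAgents.namespace.name`):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tracebloc-node-agents
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
    pod-security.kubernetes.io/audit: privileged
```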
* Mirror secrets into node-agents ns; keep namespace RBAC in release ns
Two follow-ups from review of the namespace-split change:
1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
cannot `secretKeyRef` a Secret that only exists in the release
namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
secret, both of which template only into `.Release.Namespace`, so
pods would have failed to start with CreateContainerConfigError.
templates/secrets.yaml and templates/docker-registry-secret.yaml now
template a second copy into `nodeAgents.namespace.name` when:
resourceMonitor != false AND node-agents ns != release ns
The mirror is skipped when the two namespaces collide (e.g. operator
points nodeAgents.namespace.name back at the release namespace) so
Helm does not try to create two resources with the same name.
2. When clusterScope: false, the Role must live in the RELEASE
namespace because that is where the monitored workloads run — a
namespace-scoped Role only grants access to its own namespace.
Previously this PR put the Role in `tracebloc-node-agents`, which
would have silently broken the resource-monitor for anyone not
using ClusterRole. Role + RoleBinding are now back in
`.Release.Namespace`; the RoleBinding subject still points at the
ServiceAccount in the node-agents namespace (cross-namespace
subjects in RoleBindings are valid).
Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
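A hedged sketch of the mirroring guard (the secret data keys CLIENT_ID / CLIENT_PASSWORD are from the commit; the values paths feeding them are assumptions):

```yaml
{{- /* Mirror the client Secret into the node-agents namespace.
       Skipped when the two namespaces collide so Helm never renders
       two same-named resources. */ -}}
{{- if and .Values.resourceMonitor (ne .Values.nodeAgents.namespace.name .Release.Namespace) }}
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Values.tracebloc.secretName }}
  namespace: {{ .Values.nodeAgents.namespace.name }}
type: Opaque
data:
  CLIENT_ID: {{ .Values.tracebloc.clientId | b64enc }}             # values path assumed
  CLIENT_PASSWORD: {{ .Values.tracebloc.clientPassword | b64enc }} # values path assumed
{{- end }}
```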
* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns
Two review fixes from the PSA hardening change:
1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
which now resolves to the node-agents namespace (where the DaemonSet
pods live) instead of the release namespace (where the monitored
workloads live). Replace with the literal Release.Namespace so the
monitor continues to watch the right namespace regardless of where
its own pods run.
2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
release namespace if an operator set nodeAgents.namespace.name to the
release namespace (and with namespace.create=true it would render two
Namespace docs with the same name — a render-time collision). Add an
equality guard so the template is a no-op in that configuration.
Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
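The env pin, before and after (sketch):

```yaml
# Before: the Downward API resolves to wherever the pod runs, which is
# now the node-agents namespace rather than the namespace being monitored.
# - name: NAMESPACE
#   valueFrom:
#     fieldRef:
#       fieldPath: metadata.namespace
# After: a literal, pinned to the release namespace at template time.
- name: NAMESPACE
  value: {{ .Release.Namespace | quote }}
```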
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)
Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:
- /var/run/mysqld pid file + unix socket
- /tmp temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files default secure_file_priv dir (touched at start)
Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
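Sketch of the resulting spec fragment (volume names assumed):

```yaml
containers:
  - name: mysql-client
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - { name: run-mysqld,  mountPath: /var/run/mysqld }      # pid file + unix socket
      - { name: tmp,         mountPath: /tmp }                 # temp tables, sort buffers, LOAD DATA
      - { name: mysql-files, mountPath: /var/lib/mysql-files } # secure_file_priv dir
volumes:
  - { name: run-mysqld,  emptyDir: {} }
  - { name: tmp,         emptyDir: {} }
  - { name: mysql-files, emptyDir: {} }
```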
* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)
- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
forcing the chart to render a privileged init-mysql-data chown
container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
updated to document the CSI-default / bare-metal-override split.
Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
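The override, as it would appear in ci/bm-values.yaml (exact comment wording assumed):

```yaml
# kubelet skips fsGroup on hostPath (kubernetes/kubernetes#138411); the
# resulting root chown init-container would be rejected by PSA restricted.
namespace:
  podSecurity:
    enforce: ""        # warn + audit stay restricted via chart defaults
```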
* feat(installer): slim k3d and add dev overrides for local testing (#54)
The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.
Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.
Also add two env vars for local-chart testing in install-client-helm.sh:
- TRACEBLOC_CHART_PATH: install from a local chart path instead of the
  published tracebloc/client Helm repo (skips helm repo add/update)
- TRACEBLOC_VALUES_FILE: use the caller-supplied values file as-is and
  skip the clientId/password prompts + values.yaml generation
With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(client): harden image pinning and credentials (v1.0.4) (#53)
Address the High-severity findings from the client chart security review:
- Add digest support to tracebloc.image helper and images.* values for
jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
set, the image is rendered as repo@sha256:... and imagePullPolicy drops
to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
defaults to "prod". The schema rejects "latest" outright; operators
wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
does not apply fsGroup to hostPath volumes, k8s#138411), but now with
drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
The defaults are now empty strings; the schema and template both reject
empty values and <...> placeholder patterns so deployments fail fast
instead of silently encoding a placeholder into the Secret.
Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
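A sketch of the digest-over-tag logic in the helper (the helper name is from the commit; its parameters and internals are assumptions):

```yaml
{{- /* Digest wins over tag: an immutable, auditable pin. */ -}}
{{- define "tracebloc.image" -}}
{{- if .digest -}}
{{ .repository }}@{{ .digest }}
{{- else -}}
{{ .repository }}:{{ .tag }}
{{- end -}}
{{- end -}}
```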
* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)
The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.
- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
create args. k3s bundles metrics-server; the earlier comment claiming
the chart "ships its own" was wrong — the DaemonSet is a consumer of
metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
`lookup` that fails the release up front when resourceMonitor is true
but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
(resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.
Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
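A hedged sketch of the pre-flight guard (exact expressions assumed; `lookup` returns empty under plain `helm template`, so probing kube-system first keeps offline rendering working):

```yaml
{{- if .Values.resourceMonitor }}
{{- /* Only enforce when we can see a real cluster. */ -}}
{{- if lookup "v1" "Namespace" "" "kube-system" }}
{{- if not (lookup "apiregistration.k8s.io/v1" "APIService" "" "v1beta1.metrics.k8s.io") }}
{{- fail "resourceMonitor requires metrics-server: v1beta1.metrics.k8s.io is not registered" }}
{{- end }}
{{- end }}
{{- end }}
```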
* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)
The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).
kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.
Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact
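The surviving init-container, sketched (image and chown flags assumed; securityContext fields per PR #53):

```yaml
initContainers:
  - name: init-mysql-data            # rendered only when hostPath.enabled
    image: busybox                   # illustrative
    command: ["chown", "999:999", "/var/lib/mysql"]   # chmod dropped: kubelet creates the dir 0755
    securityContext:
      runAsUser: 0                   # root needed; kubelet skips fsGroup on hostPath
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["CHOWN"]
      seccompProfile:
        type: RuntimeDefault
    volumeMounts:
      - name: mysql-data             # volume name assumed
        mountPath: /var/lib/mysql
```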
* Chore/merge main into develop (#58)
* Update README.md
* Add narrow CODEOWNERS for security-sensitive paths
* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1 so the resource-monitor DaemonSet, which relies on the metrics API, works correctly; this aligns install-k8s.ps1 with the earlier change that made metrics-server a requirement.
---------
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning
fix(client): pin resource-monitor by digest (v1.0.7)
* chore: add auto-add to engineer kanban workflow (#45)
* Add auto-add to engineer kanban workflow
* fix(ci): pin actions/add-to-project to v1.0.2
@v1 is not a valid tag; the action publishes full-semver tags only. Pin to v1.0.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)
When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.
Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure
Credit: bug bot finding.
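The template-side guard, roughly (message wording assumed):

```yaml
{{- if .Values.networkPolicy.training.enabled }}
{{- if not .Values.networkPolicy.training.clusterCidrs }}
{{- fail "networkPolicy.training.clusterCidrs must contain at least one CIDR: an empty list renders except: null and allows unrestricted port-443 egress" }}
{{- end }}
# policy body follows
{{- end }}
```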
* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade
Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.
- Read the digest via nested `default (dict)` so a missing `images` map
AND a missing `resourceMonitor` entry both fall through to "" safely.
`dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
`images: null` and asserts the DaemonSet still renders with the tag
fallback.
Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
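The nil-safe read, sketched (`dig` rejects chartutil.Values, per the message above):

```yaml
{{- $images := .Values.images | default dict -}}
{{- $rm := $images.resourceMonitor | default dict -}}
{{- $digest := $rm.digest | default "" -}}
```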
* fix(client): scope clusterCidrs minItems guard to enabled=true only
Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.
Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.
Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
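The conditional constraint, shown as YAML for readability (values.schema.json itself is JSON):

```yaml
# At the networkPolicy.training object level (JSON Schema draft 7):
if:
  properties:
    enabled: { const: true }
then:
  properties:
    clusterCidrs: { minItems: 1 }
```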
---------
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

Summary
- Drops the two `runAsUser: 0` init-containers that were chowning `/var/lib/mysql` and `/var/log/mysql` to UID 999. The pod-level `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch` already achieves the same effect, so the init-containers were redundant — and the last remaining PSA-restricted blocker on the release namespace.
- Adds a container-level `securityContext` with the six fields PSA restricted requires: `runAsNonRoot`, `runAsUser: 999`, `runAsGroup: 999`, `allowPrivilegeEscalation: false`, drop all capabilities, `seccompProfile: RuntimeDefault`.
- Scope: `client` chart only (the universal chart covering eks/aks/bm/oc variants).

Why
Tier A item in the training-pod hardening plan: promote `pod-security.kubernetes.io/{warn,audit}: restricted` (added in #43) to `enforce: restricted`. That was blocked on MySQL being the only remaining non-compliant workload in the release namespace.

Behavior change for existing installs
- Existing PVCs are already owned `999:999` from the prior init-container chown. `OnRootMismatch` sees correct root ownership and skips the recursive chgrp — mount is instant, no user-visible behavior change.
- On fresh installs kubelet applies `fsGroup` to the PVC and emptyDir before the main container starts.
- The entrypoint skips its `gosu` re-exec, running `mysqld` directly as 999.

CSI caveat
Applying `fsGroup` requires a CSI driver declaring `fsGroupPolicy: File` or `ReadWriteOnceWithFSType`. Supported: EBS, AzureDisk, GCE-PD, CephRBD, local-path. Not supported: NFS v3 and some object-backed drivers. Chart NOTES.txt warning is a follow-up.

Deferred to a separate PR
`readOnlyRootFilesystem: true` on the mysql container — needs emptyDir mounts for `/tmp`, `/run/mysqld`, `/var/lib/mysql-files`; real regression risk that deserves its own test cycle.

Test plan
- `helm template client/ --set storageClass.provisioner=kubernetes.io/no-provisioner` renders cleanly with no init-container and the container-level securityContext present
- Upgrade over an existing PVC: `OnRootMismatch` skips rechown, data preserved
- Add the `pod-security.kubernetes.io/enforce: restricted` label to the release namespace — pod still admits
- `kubectl exec mysql-client-* -- id` returns `uid=999(mysql) gid=999(mysql)`
- `kubectl exec mysql-client-* -- ls -la /var/lib/mysql | head -3` shows `999 999` ownership

Related
Note: Medium Risk
Changes MySQL pod startup/permissions by removing root-based init work in most cases and enforcing non-root, restricted PSA settings; misconfigured storage drivers or existing volume ownership could cause startup failures.
Overview
Updates the `mysql-client` Deployment to comply with Pod Security Admission (restricted) by running the main container as non-root (`runAsUser`/`runAsGroup`: 999), disabling privilege escalation, dropping all capabilities, and enabling `seccompProfile: RuntimeDefault`. Reworks volume ownership bootstrapping by removing the log chown init-container and making the remaining `/var/lib/mysql` chown init-container conditional on `hostPath.enabled` (assuming CSI-backed volumes rely on `fsGroup`/`OnRootMismatch` instead).
Reviewed by Cursor Bugbot for commit 097ce50.