fix(client): allow training pods to reach mysql-client (v1.2.1) #76

Merged
saadqbal merged 1 commit into develop from fix/training-mysql-egress on Apr 28, 2026

Conversation

@saadqbal (Contributor) commented Apr 28, 2026

Summary

  • Training pods now have a third egress rule to reach the in-namespace mysql-client pod on TCP/3306 (podSelector: {app: mysql-client}, same-namespace).
  • Previously the training-egress NetworkPolicy only allowed DNS and external TCP/443, so under an enforcing CNI training Jobs CrashLoopBackOff'd at the dataset-load step with errno 111.
  • Bumps the chart from 1.2.0 to 1.2.1. No values-schema changes.

Why

Surfaced on a fresh client install on k3d (k3s enforces NetworkPolicy via the bundled kube-router). jobs-manager could reach mysql but every Job spawned with tracebloc.io/workload=training could not — observed error in the failing pod:

Database connection failed: 2003 (HY000): Can't connect to MySQL server on 'mysql-client:3306' (111)
RuntimeError: Database connection is not available for load_dataframe_from_sql_table

The dataset-load path (core/utils/database.py::load_dataframe_from_sql_table, called from core/datasets/loaders/dataset_loader.py:35) is mandatory for every training run, so the policy as written in v1.1.0 made training non-functional on any enforcing CNI. The new rule is scoped to the chart's own mysql pod via app: mysql-client rather than opening the namespace.
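The added rule can be sketched as follows (a minimal sketch: only the `app: mysql-client` podSelector, TCP/3306, and the absence of a namespaceSelector come from this PR; the surrounding policy spec and the earlier rules are as described for v1.1.0):

```yaml
# Sketch of the new egress[2] in templates/network-policy-training.yaml
egress:
  # ... existing egress[0] (DNS) and egress[1] (external TCP/443) ...
  - to:
      - podSelector:
          matchLabels:
            app: mysql-client   # chart's own mysql pod; no namespaceSelector,
                                # so this stays same-namespace by default
    ports:
      - protocol: TCP
        port: 3306
```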

Test plan

  • helm unittest client/ — 99/99 pass (added 1 new assertion for egress[2])
  • helm template renders the new rule cleanly
  • Live verification on k3d cluster: pre-fix nc -zv mysql-client 3306 from a pod with the training label is refused; post-fix it connects (mysql-client (10.43.135.100:3306) open)
  • Real training Job that previously CrashLoopBackOff'd now passes the dataset-load step
  • Re-verify on AKS / EKS+Calico / OpenShift if those clusters are available before merging

🤖 Generated with Claude Code


Note

Low Risk
Small, isolated Helm NetworkPolicy change plus a chart version bump; main risk is unintentionally widening or breaking egress rules for training pods in clusters that enforce NetworkPolicy.

Overview
Fixes training jobs on enforcing CNIs by adding a new egress allow rule in templates/network-policy-training.yaml permitting TCP/3306 traffic from training pods to the in-namespace mysql-client pod (scoped via podSelector: app=mysql-client).

Bumps the Helm chart version/appVersion to 1.2.1 and extends client/tests/network_policy_test.yaml with assertions validating the new egress[2] rule.

Reviewed by Cursor Bugbot for commit 041f459.

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
@saadqbal saadqbal merged commit 5f8e025 into develop Apr 28, 2026
3 checks passed
@saadqbal saadqbal self-assigned this Apr 28, 2026
saadqbal added a commit that referenced this pull request Apr 28, 2026
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.
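The new helper can be sketched as a short define in _helpers.tpl (a sketch: the real definition may carry name-truncation or label logic not shown here):

```yaml
{{/* Release-scoped resource-monitor name, e.g. "stg-resource-monitor" */}}
{{- define "tracebloc.resourceMonitorName" -}}
{{- printf "%s-resource-monitor" .Release.Name -}}
{{- end -}}
```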

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.

Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size.
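A portable shape for these helpers might look like this (a sketch of the approach described above; the actual functions in migrate-tenant.sh may differ in detail):

```shell
# Portable checksum/size helpers: md5sum on Linux, md5(1) on macOS,
# and POSIX `wc -c` instead of the BSD-only `stat -f%z`.
_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

_size() {
  # some wc implementations left-pad the count, so strip whitespace
  wc -c < "$1" | tr -d '[:space:]'
}
```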

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.
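The bounded-wait shape can be sketched generically (`ready_wait` is a hypothetical name; in the script the check command would be the kubectl pod-state probe and the timeout `${READY_TIMEOUT:-600}`):

```shell
# Run a check command until it succeeds, but never past a wall-clock deadline.
# Usage: ready_wait <timeout_seconds> <check command...>
ready_wait() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ready_wait: timed out" >&2
      return 1
    fi
    sleep 1
  done
}
```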

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.
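Both pitfalls and the guard can be sketched together (`count_pods` is a hypothetical name; the real script feeds it kubectl output):

```shell
# Count non-empty lines without tripping `set -e`:
# - `grep -c .` exits 1 on zero matches, so guard the empty case first
# - `wc -l` alone over-counts the empty case, because echo "" prints a newline
count_pods() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s\n' "$1" | grep -c .
  fi
}
```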

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.
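The fixed ordering can be demonstrated in miniature (the temp file stands in for tenant-config.env; variable names follow the commit):

```shell
# Demo config that unconditionally sets EFS_FS, as tenant-config.env does.
CONFIG=$(mktemp)
echo 'EFS_FS=fs-06b3faf51675ff9f9' > "$CONFIG"

EFS_FS_OVERRIDE=fs-custom123          # operator's env override

. "$CONFIG"                           # a pre-source assignment would be clobbered here
EFS_FS="${EFS_FS_OVERRIDE:-$EFS_FS}"  # applied AFTER source, so the override wins
: "${EFS_FS:?EFS_FS must come from the config or EFS_FS_OVERRIDE}"
```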

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).
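The release-scoped references render roughly like this (a sketch: field names follow OpenShift's SecurityContextConstraints and RBAC APIs; only users[0] and the binding subject are from this commit):

```yaml
# resource-monitor-scc.yaml after the fix (abridged)
users:
  - system:serviceaccount:tracebloc-node-agents:{{ include "tracebloc.resourceMonitorName" . }}
---
kind: ClusterRoleBinding
subjects:
  - kind: ServiceAccount
    name: {{ include "tracebloc.resourceMonitorName" . }}
    namespace: tracebloc-node-agents
```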

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.
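   One way to read the live values at generate time (`chart_field` is a
   hypothetical helper; the actual parsing in generate.sh may differ):

```shell
# Pull a top-level scalar out of a Chart.yaml without a YAML parser.
# Usage: chart_field <key> <path-to-Chart.yaml>
chart_field() {
  awk -v k="$1:" '$1 == k { print $2; exit }' "$2"
}

# In generate.sh this would feed the PVC labels, e.g.
#   helm.sh/chart: client-$(chart_field version client/Chart.yaml)
```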

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line `tracebloc-
  resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely
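A case of this shape would exercise the first two bullets (a sketch: helm-unittest can assert on NOTES.txt output via the `raw` path, but the actual contents of client/tests/notes_test.yaml are assumed):

```yaml
suite: notes
templates:
  - NOTES.txt
tests:
  - it: renders the release-scoped resource-monitor name
    release:
      name: stg
    asserts:
      - matchRegex:
          path: raw
          pattern: stg-resource-monitor
```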

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
