fix(client): allow training pods to reach mysql-client (v1.2.1) #76

Merged
saadqbal merged 1 commit into develop from fix/training-mysql-egress on Apr 28, 2026

Conversation

@saadqbal (Contributor) commented Apr 28, 2026

Summary

  • Training pods now have a third egress rule to reach the in-namespace mysql-client pod on TCP/3306 (podSelector: {app: mysql-client}, same-namespace).
  • Previously the training-egress NetworkPolicy only allowed DNS and external TCP/443, so under an enforcing CNI training Jobs CrashLoopBackOff'd at the dataset-load step with errno 111.
  • Bumps the chart from 1.2.0 to 1.2.1. No values-schema changes.

Why

Surfaced on a fresh client install on k3d (k3s enforces NetworkPolicy via the bundled kube-router). jobs-manager could reach mysql but every Job spawned with tracebloc.io/workload=training could not — observed error in the failing pod:

Database connection failed: 2003 (HY000): Can't connect to MySQL server on 'mysql-client:3306' (111)
RuntimeError: Database connection is not available for load_dataframe_from_sql_table

The dataset-load path (core/utils/database.py::load_dataframe_from_sql_table, called from core/datasets/loaders/dataset_loader.py:35) is mandatory for every training run, so the policy as written in v1.1.0 made training non-functional on any enforcing CNI. The new rule is scoped to the chart's own mysql pod via app: mysql-client rather than opening the namespace.
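The added rule can be sketched as follows (a minimal sketch: only the `app: mysql-client` podSelector, TCP/3306, and the absence of a namespaceSelector come from this PR; the surrounding policy spec and the earlier rules are as described for v1.1.0):

```yaml
# Sketch of the new egress[2] in templates/network-policy-training.yaml
egress:
  # ... existing egress[0] (DNS) and egress[1] (external TCP/443) ...
  - to:
      - podSelector:
          matchLabels:
            app: mysql-client   # chart's own mysql pod; no namespaceSelector,
                                # so this stays same-namespace by default
    ports:
      - protocol: TCP
        port: 3306
```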

Test plan

  • helm unittest client/ — 99/99 pass (added 1 new assertion for egress[2])
  • helm template renders the new rule cleanly
  • Live verification on k3d cluster: pre-fix nc -zv mysql-client 3306 from a pod with the training label is refused; post-fix it connects (mysql-client (10.43.135.100:3306) open)
  • Real training Job that previously CrashLoopBackOff'd now passes the dataset-load step
  • Re-verify on AKS / EKS+Calico / OpenShift if those clusters are available before merging

🤖 Generated with Claude Code


Note

Low Risk
Small, isolated Helm NetworkPolicy change plus a chart version bump; main risk is unintentionally widening or breaking egress rules for training pods in clusters that enforce NetworkPolicy.

Overview
Fixes training jobs on enforcing CNIs by adding a new egress allow rule in templates/network-policy-training.yaml permitting TCP/3306 traffic from training pods to the in-namespace mysql-client pod (scoped via podSelector: app=mysql-client).

Bumps the Helm chart version/appVersion to 1.2.1 and extends client/tests/network_policy_test.yaml with assertions validating the new egress[2] rule.

Reviewed by Cursor Bugbot for commit 041f459.

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
@saadqbal saadqbal merged commit 5f8e025 into develop Apr 28, 2026
3 checks passed
@saadqbal saadqbal self-assigned this Apr 28, 2026
saadqbal added a commit that referenced this pull request Apr 28, 2026
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.
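The new helper can be sketched as a short define in _helpers.tpl (a sketch: the real definition may carry name-truncation or label logic not shown here):

```yaml
{{/* Release-scoped resource-monitor name, e.g. "stg-resource-monitor" */}}
{{- define "tracebloc.resourceMonitorName" -}}
{{- printf "%s-resource-monitor" .Release.Name -}}
{{- end -}}
```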

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.

Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size.
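A portable shape for these helpers might look like this (a sketch of the approach described above; the actual functions in migrate-tenant.sh may differ in detail):

```shell
# Portable checksum/size helpers: md5sum on Linux, md5(1) on macOS,
# and POSIX `wc -c` instead of the BSD-only `stat -f%z`.
_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

_size() {
  # some wc implementations left-pad the count, so strip whitespace
  wc -c < "$1" | tr -d '[:space:]'
}
```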

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.
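The bounded-wait shape can be sketched generically (`ready_wait` is a hypothetical name; in the script the check command would be the kubectl pod-state probe and the timeout `${READY_TIMEOUT:-600}`):

```shell
# Run a check command until it succeeds, but never past a wall-clock deadline.
# Usage: ready_wait <timeout_seconds> <check command...>
ready_wait() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ready_wait: timed out" >&2
      return 1
    fi
    sleep 1
  done
}
```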

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.
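Both pitfalls and the guard can be sketched together (`count_pods` is a hypothetical name; the real script feeds it kubectl output):

```shell
# Count non-empty lines without tripping `set -e`:
# - `grep -c .` exits 1 on zero matches, so guard the empty case first
# - `wc -l` alone over-counts the empty case, because echo "" prints a newline
count_pods() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s\n' "$1" | grep -c .
  fi
}
```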

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.
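The fixed ordering can be demonstrated in miniature (the temp file stands in for tenant-config.env; variable names follow the commit):

```shell
# Demo config that unconditionally sets EFS_FS, as tenant-config.env does.
CONFIG=$(mktemp)
echo 'EFS_FS=fs-06b3faf51675ff9f9' > "$CONFIG"

EFS_FS_OVERRIDE=fs-custom123          # operator's env override

. "$CONFIG"                           # a pre-source assignment would be clobbered here
EFS_FS="${EFS_FS_OVERRIDE:-$EFS_FS}"  # applied AFTER source, so the override wins
: "${EFS_FS:?EFS_FS must come from the config or EFS_FS_OVERRIDE}"
```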

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).
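The release-scoped references render roughly like this (a sketch: field names follow OpenShift's SecurityContextConstraints and RBAC APIs; only users[0] and the binding subject are from this commit):

```yaml
# resource-monitor-scc.yaml after the fix (abridged)
users:
  - system:serviceaccount:tracebloc-node-agents:{{ include "tracebloc.resourceMonitorName" . }}
---
kind: ClusterRoleBinding
subjects:
  - kind: ServiceAccount
    name: {{ include "tracebloc.resourceMonitorName" . }}
    namespace: tracebloc-node-agents
```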

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.
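   One way to read the live values at generate time (`chart_field` is a
   hypothetical helper; the actual parsing in generate.sh may differ):

```shell
# Pull a top-level scalar out of a Chart.yaml without a YAML parser.
# Usage: chart_field <key> <path-to-Chart.yaml>
chart_field() {
  awk -v k="$1:" '$1 == k { print $2; exit }' "$2"
}

# In generate.sh this would feed the PVC labels, e.g.
#   helm.sh/chart: client-$(chart_field version client/Chart.yaml)
```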

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line `tracebloc-
  resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely
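A case of this shape would exercise the first two bullets (a sketch: helm-unittest can assert on NOTES.txt output via the `raw` path, but the actual contents of client/tests/notes_test.yaml are assumed):

```yaml
suite: notes
templates:
  - NOTES.txt
tests:
  - it: renders the release-scoped resource-monitor name
    release:
      name: stg
    asserts:
      - matchRegex:
          path: raw
          pattern: stg-resource-monitor
```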

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
