docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight #71

Merged
saadqbal merged 1 commit into develop from docs/migrations-correct-option-b on Apr 28, 2026

Conversation

@saadqbal (Contributor) commented on Apr 28, 2026

Summary

The 2026-04-27 `hasan-prod` migration on `tracebloc-templates-prod` tried Option B from `docs/MIGRATIONS.md` exactly as documented — strip Helm ownership labels from the live StorageClass, then `helm uninstall`, expecting Helm to skip the un-owned resource. The strip worked (live labels visibly empty), but `helm uninstall` deleted the StorageClass anyway.

Same root cause as the 2026-04-22 keep-annotation case this doc already covers: `helm uninstall` iterates the stored release manifest's resource list and DELETEs each entry, without re-checking live ownership labels first. Live ownership matters for `helm install` adoption decisions, not uninstall deletion.

So the doc was wrong about Option B. We're now 2-for-2 on production migrations bitten by the pattern "modify live resource, expect uninstall to respect it."
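For context, a pre-flight that would have predicted the deletion looks roughly like this (a minimal sketch; the variable values are placeholders, not the actual release names):

```bash
REL="${REL:?release name}" NS="${NS:?release namespace}" SC="${SC:?StorageClass name}"

# The stored release manifest is the delete list for `helm uninstall`: if the
# StorageClass is still listed there, it gets DELETEd regardless of live labels.
helm get manifest "$REL" -n "$NS" | grep -B1 -A3 'kind: StorageClass'

# Live ownership labels/annotations only influence install-time adoption, not uninstall.
kubectl get storageclass "$SC" --show-labels
```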

Changes

  • Option B relabelled `DOES NOT WORK` and rewritten as a cautionary tale. Operators reliably reach for it first, so removing it outright would leave a gap; keeping it with a clear warning costs nothing and prevents the next person from burning a maintenance window rediscovering this.
  • Recommended paths are now A (preferred) or C (fallback). The mitigation-options intro clarifies that B is documented only as a "don't do this".
  • Pre-uninstall checklist gains item 7: verify no active workloads in the namespace beyond the release itself. The 2026-04-27 migration was caught off guard by 9 customer training Jobs holding PVCs in `Terminating` via `pvc-protection` finalizers; the existing checklist had no item to surface this (a pre-flight along these lines is sketched after this list).
  • `Case study` → `Case studies` — keeps the 2026-04-22 case verbatim, appends 2026-04-27 as a second sub-section. New intro note generalises both cases as "modify live resource, expect uninstall to respect it" failures.
  • `CLAUDE.md` updated — drops "Option A, B, or C" and points operators at A or C only, with a brief reminder that B is the trap.
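For the new checklist item, a pre-flight of roughly this shape would have surfaced the training Jobs before the uninstall (a sketch; the namespace variable is a placeholder):

```bash
NS="${NS:?release namespace}"

# Anything still running in the namespace that the release does not own?
kubectl -n "$NS" get jobs,pods --no-headers

# PVCs still mounted by live pods carry the kubernetes.io/pvc-protection
# finalizer and will hang in Terminating until every consumer exits.
kubectl -n "$NS" get pvc \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,FINALIZERS:.metadata.finalizers'
```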

Test plan

  • `grep -n '^##\|^###' docs/MIGRATIONS.md` — section structure intact, no orphan anchors
  • Top-of-doc cross-reference updated to match the renamed anchor (`#case-studies`)
  • No code changes; pure docs PR

Out of scope

  • The chart bug that caused the prod migration to be more painful than it should have been (shared-name `tracebloc-resource-monitor` resources blocking multi-release-per-cluster) is being handled in a separate PR.

Note

Low Risk
Documentation-only changes; operational risk is low and the guidance is more conservative (warns against a previously suggested but ineffective mitigation).

Overview
Clarifies Helm migration guidance to emphasize that only Option A (inject helm.sh/resource-policy: keep via helm upgrade) or Option C (rely on PV reclaimPolicy: Retain and recreate PVCs) are valid paths, and explicitly relabels Option B (stripping live Helm ownership metadata) as “DOES NOT WORK” with an explanation of why helm uninstall still deletes resources from the stored release manifest.

Extends docs/MIGRATIONS.md with a new pre-uninstall checklist item to verify there are no non-Helm workloads (e.g., Jobs) still mounting PVCs that can stall deletion via pvc-protection, and adds a second production case study (hasan-prod, 2026-04-27) capturing the Option B failure and the active-jobs gotcha. CLAUDE.md is updated to point operators at Option A/C only.
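Before the destructive step, both recommended paths can be confirmed from the CLI with something like the following (a sketch; variable values are placeholders):

```bash
REL="${REL:?release name}" NS="${NS:?release namespace}"

# Option A: the keep annotation has to be present in the *stored* manifest
# (what `helm upgrade` rewrites); annotating only the live object is not enough.
helm get manifest "$REL" -n "$NS" | grep -c 'helm.sh/resource-policy: keep'

# Option C: the backing PVs must retain their data once the PVCs are deleted.
kubectl get pv -o custom-columns='NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name'
```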

Reviewed by Cursor Bugbot for commit c4ff7f3.

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

The 2026-04-27 hasan-prod migration on tracebloc-templates-prod tried
Option B exactly as previously documented — strip Helm ownership labels
from the live StorageClass, then helm uninstall, expecting Helm to skip
the un-owned resource. The strip worked (live labels visibly empty), but
helm uninstall deleted the StorageClass anyway.

Same root cause as the 2026-04-22 keep-annotation case the doc already
covers: helm uninstall iterates the stored release manifest's resource
list and DELETEs each entry, without re-checking live ownership labels
first. Live ownership matters for `helm install` adoption decisions, not
uninstall deletion.

Changes:

- Option B is relabelled "DOES NOT WORK" and rewritten as a cautionary
  tale — operators reliably reach for it first, so leaving it in the doc
  with a clear warning is more useful than removing it.
- Recommended path is now Option A (preferred) or Option C (fallback);
  Mitigation options section intro clarifies B is documented for
  "don't do this" reasons only.
- Pre-uninstall checklist gains item 7: verify no active workloads in
  the namespace beyond the release itself. The 2026-04-27 migration was
  caught off-guard by 9 customer training Jobs holding PVCs in
  Terminating via pvc-protection finalizers; the existing checklist had
  no item to surface this.
- Case studies section now plural — keeps the 2026-04-22 case verbatim
  and appends 2026-04-27 as a second sub-section. New intro note
  generalises both cases as "modify live resource, expect uninstall to
  respect it" failures.
- Top-of-doc cross-reference updated from #case-study-... to
  #case-studies (anchor changed by the rename).
- CLAUDE.md updated to drop "Option A, B, or C" — now points operators
  at A or C only, with a brief reminder that B is the trap.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
saadqbal merged commit 7d1efce into develop on Apr 28, 2026
3 checks passed
saadqbal self-assigned this on Apr 28, 2026
saadqbal added a commit that referenced this pull request Apr 28, 2026
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.
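That side-by-side check is reproducible with something along these lines (a sketch; it assumes the chart lives at ./client, as in the commands above):

```bash
# Render both releases and compare the resource-monitor names; they must diverge.
for rel in stg hasan-prod; do
  echo "== $rel =="
  helm template "$rel" ./client \
    | grep -E 'name: .*resource-monitor|serviceAccountName:' \
    | sort -u
done
```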

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
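The before/after probe is roughly this (a sketch; the pod name, image, and namespace variable are placeholders):

```bash
NS="${NS:?client namespace}"

# Throwaway pod carrying the label the training-egress policy selects on,
# probing the in-namespace mysql service: refused pre-fix, connects post-fix.
kubectl -n "$NS" run np-probe --rm -it --restart=Never \
  --image=busybox:1.36 \
  --labels='tracebloc.io/workload=training' \
  -- nc -zv -w 3 mysql-client 3306
```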

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement (see the sketch after this list).
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.
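The explicit-context rule looks like this at every call site (a sketch; the variable names are placeholders — note that helm's flag for this is --kube-context):

```bash
CTX="${CTX:?kube context for this tenant}"
NS="${NS:?client namespace}"
REL="${REL:?release name}"

# Never rely on the ambient current-context; it drifted mid-run on 2026-04-27.
kubectl --context "$CTX" -n "$NS" get pvc
helm status "$REL" --kube-context "$CTX" -n "$NS"
```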

Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size
  (sketched after this list).

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.
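A sketch of the shape of these fixes (illustrative, not the exact script text; CTX and NS stand in for the script's own context/namespace variables, and it folds in the empty-pod-list guard from the follow-up commit below):

```bash
# Portable digest + size helpers: md5sum on Linux, md5(1) on macOS,
# `wc -c` instead of the BSD-only `stat -f%z`.
_md5() {
  if command -v md5sum >/dev/null 2>&1; then md5sum "$1" | awk '{print $1}'
  else md5 -q "$1"; fi
}
_size() { wc -c < "$1" | tr -d '[:space:]'; }

# Bounded readiness wait: give up after READY_TIMEOUT seconds (default 600)
# instead of spinning forever after the destructive uninstall.
READY_TIMEOUT="${READY_TIMEOUT:-600}"
deadline=$(( $(date +%s) + READY_TIMEOUT ))
while :; do
  state=$(kubectl --context "$CTX" -n "$NS" get pods --no-headers 2>/dev/null || true)
  if [ -n "$state" ]; then
    # grep -cv counts pods not yet Running/Completed; `|| true` because grep
    # exits 1 when the count is zero and the script runs with -e/pipefail.
    not_ready=$(printf '%s\n' "$state" | grep -cvE 'Running|Completed' || true)
    [ "$not_ready" -eq 0 ] && break
  fi
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "release not ready after ${READY_TIMEOUT}s; last seen pod state:" >&2
    printf '%s\n' "${state:-<no pods>}" >&2
    exit 1
  fi
  sleep 5
done
```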

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.
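The empty-list guard is the `[ -n "$state" ]` test in the sketch further up; the placeholder guard is a one-line case statement, roughly (illustrative shape, not the exact script text):

```bash
# A non-empty value is not enough: a literal __LEGACY_MYSQL_ROOT_PW__-style
# placeholder must also fail fast, before mysqldump ever runs.
case "$MYSQL_ROOT_PW" in
  *__*__*) echo "MYSQL_ROOT_PW still contains a placeholder" >&2; exit 1 ;;
esac
```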

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.
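The corrected ordering, roughly (a sketch of the shape, not the literal script):

```bash
source "$CONFIG"                      # config may set EFS_FS

# Env override must be applied AFTER sourcing, or the config clobbers it.
if [ -n "${EFS_FS_OVERRIDE:-}" ]; then
  EFS_FS="$EFS_FS_OVERRIDE"
fi

# No hard-coded fallback: refuse to continue without an explicit filesystem.
: "${EFS_FS:?set EFS_FS in $CONFIG or via EFS_FS_OVERRIDE}"
```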

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.
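One way to pull those two fields out of Chart.yaml at generate time without extra tooling (a sketch; the actual script may read them differently):

```bash
CHART_YAML="client/Chart.yaml"
CHART_NAME=$(awk '$1 == "name:" {print $2; exit}' "$CHART_YAML")
CHART_VERSION=$(awk '$1 == "version:" {print $2; exit}' "$CHART_YAML")

# PVC labels now track whatever chart the operator is about to install,
# e.g. helm.sh/chart: client-1.2.3 instead of a hardcoded client-1.1.0.
echo "helm.sh/chart: ${CHART_NAME}-${CHART_VERSION}"
```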

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line
  `tracebloc-resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>