docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight #71

Merged
saadqbal merged 1 commit into develop from docs/migrations-correct-option-b on Apr 28, 2026

Conversation

@saadqbal (Contributor) commented on Apr 28, 2026

Summary

The 2026-04-27 `hasan-prod` migration on `tracebloc-templates-prod` tried Option B from `docs/MIGRATIONS.md` exactly as documented — strip Helm ownership labels from the live StorageClass, then `helm uninstall`, expecting Helm to skip the un-owned resource. The strip worked (live labels visibly empty), but `helm uninstall` deleted the StorageClass anyway.

Same root cause as the 2026-04-22 keep-annotation case this doc already covers: `helm uninstall` iterates the stored release manifest's resource list and DELETEs each entry, without re-checking live ownership labels first. Live ownership matters for `helm install` adoption decisions, not uninstall deletion.

So the doc was wrong about Option B. We're now 2-for-2 on production migrations bitten by the pattern "modify live resource, expect uninstall to respect it."
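For context, a pre-flight that would have predicted the deletion looks roughly like this (a minimal sketch; the variable values are placeholders, not the actual release names):

```bash
REL="${REL:?release name}" NS="${NS:?release namespace}" SC="${SC:?StorageClass name}"

# The stored release manifest is the delete list for `helm uninstall`: if the
# StorageClass is still listed there, it gets DELETEd regardless of live labels.
helm get manifest "$REL" -n "$NS" | grep -B1 -A3 'kind: StorageClass'

# Live ownership labels/annotations only influence install-time adoption, not uninstall.
kubectl get storageclass "$SC" --show-labels
```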

Changes

  • Option B relabelled `DOES NOT WORK` and rewritten as a cautionary tale. Operators reliably reach for it first, so removing it outright would leave a gap; keeping it with a clear warning costs nothing and prevents the next person from burning a maintenance window rediscovering this.
  • Recommended paths are now A (preferred) or C (fallback). The mitigation-options intro clarifies that B is documented only as a "don't do this".
  • Pre-uninstall checklist gains item 7: verify no active workloads in the namespace beyond the release itself. The 2026-04-27 migration was caught off guard by 9 customer training Jobs holding PVCs in `Terminating` via `pvc-protection` finalizers; the existing checklist had no item to surface this (a pre-flight along these lines is sketched after this list).
  • `Case study` → `Case studies` — keeps the 2026-04-22 case verbatim, appends 2026-04-27 as a second sub-section. New intro note generalises both cases as "modify live resource, expect uninstall to respect it" failures.
  • `CLAUDE.md` updated — drops "Option A, B, or C" and points operators at A or C only, with a brief reminder that B is the trap.
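For the new checklist item, a pre-flight of roughly this shape would have surfaced the training Jobs before the uninstall (a sketch; the namespace variable is a placeholder):

```bash
NS="${NS:?release namespace}"

# Anything still running in the namespace that the release does not own?
kubectl -n "$NS" get jobs,pods --no-headers

# PVCs still mounted by live pods carry the kubernetes.io/pvc-protection
# finalizer and will hang in Terminating until every consumer exits.
kubectl -n "$NS" get pvc \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,FINALIZERS:.metadata.finalizers'
```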

Test plan

  • `grep -n '^##\|^###' docs/MIGRATIONS.md` — section structure intact, no orphan anchors
  • Top-of-doc cross-reference updated to match the renamed anchor (`#case-studies`)
  • No code changes; pure docs PR

Out of scope

  • The chart bug that caused the prod migration to be more painful than it should have been (shared-name `tracebloc-resource-monitor` resources blocking multi-release-per-cluster) is being handled in a separate PR.

Note

Low Risk
Documentation-only changes; operational risk is low and the guidance is more conservative (warns against a previously suggested but ineffective mitigation).

Overview
Clarifies Helm migration guidance to emphasize that only Option A (inject helm.sh/resource-policy: keep via helm upgrade) or Option C (rely on PV reclaimPolicy: Retain and recreate PVCs) are valid paths, and explicitly relabels Option B (stripping live Helm ownership metadata) as “DOES NOT WORK” with an explanation of why helm uninstall still deletes resources from the stored release manifest.

Extends docs/MIGRATIONS.md with a new pre-uninstall checklist item to verify there are no non-Helm workloads (e.g., Jobs) still mounting PVCs that can stall deletion via pvc-protection, and adds a second production case study (hasan-prod, 2026-04-27) capturing the Option B failure and the active-jobs gotcha. CLAUDE.md is updated to point operators at Option A/C only.
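Before the destructive step, both recommended paths can be confirmed from the CLI with something like the following (a sketch; variable values are placeholders):

```bash
REL="${REL:?release name}" NS="${NS:?release namespace}"

# Option A: the keep annotation has to be present in the *stored* manifest
# (what `helm upgrade` rewrites); annotating only the live object is not enough.
helm get manifest "$REL" -n "$NS" | grep -c 'helm.sh/resource-policy: keep'

# Option C: the backing PVs must retain their data once the PVCs are deleted.
kubectl get pv -o custom-columns='NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name'
```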

Reviewed by Cursor Bugbot for commit c4ff7f3.

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

The 2026-04-27 hasan-prod migration on tracebloc-templates-prod tried
Option B exactly as previously documented — strip Helm ownership labels
from the live StorageClass, then helm uninstall, expecting Helm to skip
the un-owned resource. The strip worked (live labels visibly empty), but
helm uninstall deleted the StorageClass anyway.

Same root cause as the 2026-04-22 keep-annotation case the doc already
covers: helm uninstall iterates the stored release manifest's resource
list and DELETEs each entry, without re-checking live ownership labels
first. Live ownership matters for `helm install` adoption decisions, not
uninstall deletion.

Changes:

- Option B is relabelled "DOES NOT WORK" and rewritten as a cautionary
  tale — operators reliably reach for it first, so leaving it in the doc
  with a clear warning is more useful than removing it.
- Recommended path is now Option A (preferred) or Option C (fallback);
  Mitigation options section intro clarifies B is documented for
  "don't do this" reasons only.
- Pre-uninstall checklist gains item 7: verify no active workloads in
  the namespace beyond the release itself. The 2026-04-27 migration was
  caught off-guard by 9 customer training Jobs holding PVCs in
  Terminating via pvc-protection finalizers; the existing checklist had
  no item to surface this.
- Case studies section now plural — keeps the 2026-04-22 case verbatim
  and appends 2026-04-27 as a second sub-section. New intro note
  generalises both cases as "modify live resource, expect uninstall to
  respect it" failures.
- Top-of-doc cross-reference updated from #case-study-... to
  #case-studies (anchor changed by the rename).
- CLAUDE.md updated to drop "Option A, B, or C" — now points operators
  at A or C only, with a brief reminder that B is the trap.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
saadqbal merged commit 7d1efce into develop on Apr 28, 2026
3 checks passed
saadqbal self-assigned this on Apr 28, 2026
saadqbal added a commit that referenced this pull request Apr 28, 2026
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.
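That side-by-side check is reproducible with something along these lines (a sketch; it assumes the chart lives at ./client, as in the commands above):

```bash
# Render both releases and compare the resource-monitor names; they must diverge.
for rel in stg hasan-prod; do
  echo "== $rel =="
  helm template "$rel" ./client \
    | grep -E 'name: .*resource-monitor|serviceAccountName:' \
    | sort -u
done
```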

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
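The before/after probe is roughly this (a sketch; the pod name, image, and namespace variable are placeholders):

```bash
NS="${NS:?client namespace}"

# Throwaway pod carrying the label the training-egress policy selects on,
# probing the in-namespace mysql service: refused pre-fix, connects post-fix.
kubectl -n "$NS" run np-probe --rm -it --restart=Never \
  --image=busybox:1.36 \
  --labels='tracebloc.io/workload=training' \
  -- nc -zv -w 3 mysql-client 3306
```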

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement (see the sketch after this list).
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.
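The explicit-context rule looks like this at every call site (a sketch; the variable names are placeholders — note that helm's flag for this is --kube-context):

```bash
CTX="${CTX:?kube context for this tenant}"
NS="${NS:?client namespace}"
REL="${REL:?release name}"

# Never rely on the ambient current-context; it drifted mid-run on 2026-04-27.
kubectl --context "$CTX" -n "$NS" get pvc
helm status "$REL" --kube-context "$CTX" -n "$NS"
```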

Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size
  (sketched after this list).

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.
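A sketch of the shape of these fixes (illustrative, not the exact script text; CTX and NS stand in for the script's own context/namespace variables, and it folds in the empty-pod-list guard from the follow-up commit below):

```bash
# Portable digest + size helpers: md5sum on Linux, md5(1) on macOS,
# `wc -c` instead of the BSD-only `stat -f%z`.
_md5() {
  if command -v md5sum >/dev/null 2>&1; then md5sum "$1" | awk '{print $1}'
  else md5 -q "$1"; fi
}
_size() { wc -c < "$1" | tr -d '[:space:]'; }

# Bounded readiness wait: give up after READY_TIMEOUT seconds (default 600)
# instead of spinning forever after the destructive uninstall.
READY_TIMEOUT="${READY_TIMEOUT:-600}"
deadline=$(( $(date +%s) + READY_TIMEOUT ))
while :; do
  state=$(kubectl --context "$CTX" -n "$NS" get pods --no-headers 2>/dev/null || true)
  if [ -n "$state" ]; then
    # grep -cv counts pods not yet Running/Completed; `|| true` because grep
    # exits 1 when the count is zero and the script runs with -e/pipefail.
    not_ready=$(printf '%s\n' "$state" | grep -cvE 'Running|Completed' || true)
    [ "$not_ready" -eq 0 ] && break
  fi
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "release not ready after ${READY_TIMEOUT}s; last seen pod state:" >&2
    printf '%s\n' "${state:-<no pods>}" >&2
    exit 1
  fi
  sleep 5
done
```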

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.
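The empty-list guard is the `[ -n "$state" ]` test in the sketch further up; the placeholder guard is a one-line case statement, roughly (illustrative shape, not the exact script text):

```bash
# A non-empty value is not enough: a literal __LEGACY_MYSQL_ROOT_PW__-style
# placeholder must also fail fast, before mysqldump ever runs.
case "$MYSQL_ROOT_PW" in
  *__*__*) echo "MYSQL_ROOT_PW still contains a placeholder" >&2; exit 1 ;;
esac
```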

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.
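The corrected ordering, roughly (a sketch of the shape, not the literal script):

```bash
source "$CONFIG"                      # config may set EFS_FS

# Env override must be applied AFTER sourcing, or the config clobbers it.
if [ -n "${EFS_FS_OVERRIDE:-}" ]; then
  EFS_FS="$EFS_FS_OVERRIDE"
fi

# No hard-coded fallback: refuse to continue without an explicit filesystem.
: "${EFS_FS:?set EFS_FS in $CONFIG or via EFS_FS_OVERRIDE}"
```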

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.
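One way to pull those two fields out of Chart.yaml at generate time without extra tooling (a sketch; the actual script may read them differently):

```bash
CHART_YAML="client/Chart.yaml"
CHART_NAME=$(awk '$1 == "name:" {print $2; exit}' "$CHART_YAML")
CHART_VERSION=$(awk '$1 == "version:" {print $2; exit}' "$CHART_YAML")

# PVC labels now track whatever chart the operator is about to install,
# e.g. helm.sh/chart: client-1.2.3 instead of a hardcoded client-1.1.0.
echo "helm.sh/chart: ${CHART_NAME}-${CHART_VERSION}"
```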

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line
  `tracebloc-resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>