CLAUDE.md: 2 changes (1 addition, 1 deletion)
@@ -12,7 +12,7 @@ The mandatory pre-flight check before any `helm uninstall` that is part of a mig
helm get manifest <release> -n <ns> | grep -B2 -A1 'resource-policy'
```

If the annotation is missing from the stored manifest for any resource you need to preserve, do not proceed with `helm uninstall` until you've applied Option A, B, or C from `docs/MIGRATIONS.md`. Do not assume live annotations will work.
If the annotation is missing from the stored manifest for any resource you need to preserve, do not proceed with `helm uninstall` until you've applied **Option A or Option C** from `docs/MIGRATIONS.md`. (Option B in the doc is a cautionary tale labelled "DOES NOT WORK" — stripping live Helm ownership labels does not prevent uninstall from deleting the resource. Both production migrations to date were bitten by variants of "modify the live resource, expect uninstall to respect it." Assume that pattern will keep failing.)

## Default branch

docs/MIGRATIONS.md: 49 changes (32 additions, 17 deletions)
@@ -12,7 +12,7 @@ Read this **before** you plan any chart migration. Skipping the pre-flight verif
>
> Running `kubectl annotate pvc X helm.sh/resource-policy=keep` on a live resource does **nothing** at `helm uninstall` time if the chart template didn't render that annotation. Helm will delete the PVC anyway.

This bit us on 2026-04-22. See [§ Case study](#case-study-tracebloc-templates-2026-04-22) at the end.
This bit us on 2026-04-22, and a related variant bit us again on 2026-04-27. See [§ Case studies](#case-studies) at the end.

---

@@ -28,9 +28,11 @@ For every PVC / StorageClass / Namespace / Secret you must keep, verify the anno

---

## Mitigation options (in order of preference)
## Mitigation options

### Option A — `helm upgrade` to inject the annotation before uninstall
There are two real options — **A** (preferred) and **C** (fallback). A third approach, stripping live Helm ownership labels (Option B), looks like it should work and is reliably the first thing operators reach for, but it does not work; we document it below as a **cautionary tale** so operators don't burn a maintenance window discovering this on a production migration. We did, twice.

### Option A — `helm upgrade` to inject the annotation before uninstall (PREFERRED)

If the old chart supports passing extra annotations on PVCs via values (or you have a post-renderer), run an upgrade that adds `helm.sh/resource-policy: keep`. This updates the stored manifest. Subsequent `helm uninstall` honours it.

@@ -44,30 +44,27 @@ helm get manifest <release> -n <ns> | grep 'resource-policy'

Only works if the chart exposes an annotations value. Many older charts don't.
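
For charts that do expose one, the injection is a single upgrade. The value path below (`persistence.annotations`) is illustrative rather than the real key for any particular chart; check the old chart's `values.yaml` before running it.

```bash
# Hypothetical values key; confirm the actual path in the old chart's values.yaml
helm upgrade <release> <old-chart> -n <ns> --reuse-values \
  --set-string 'persistence.annotations.helm\.sh/resource-policy=keep'

# Then re-run the stored-manifest check above; the live object alone proves nothing
helm get manifest <release> -n <ns> | grep 'resource-policy'
```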

### Option B — strip Helm ownership from the resource
### Option B — strip Helm ownership from the resource — DOES NOT WORK

You will be tempted to do this. Don't. It's the same class of mistake as `kubectl annotate ... helm.sh/resource-policy=keep` on a live PVC and expecting `helm uninstall` to honour it.

If the old chart does not support templating the annotation, remove the Helm ownership labels/annotations from the resource before uninstalling. Helm won't recognise it as owned and will skip it.
The intuition is: "if I remove `app.kubernetes.io/managed-by=Helm` and the `meta.helm.sh/release-name` annotation from the live resource, Helm will see it as un-owned and skip it on uninstall."

```bash
# PVC example
# PVC example — DOES NOT prevent uninstall from deleting the PVC
kubectl label pvc <pvc> -n <ns> app.kubernetes.io/managed-by-
kubectl annotate pvc <pvc> -n <ns> \
meta.helm.sh/release-name- \
meta.helm.sh/release-namespace-
```

After uninstall, re-stamp the metadata before installing the new chart so the new release can adopt it:
**Why it fails**: same root cause as the keep-annotation gotcha at the top of this doc. `helm uninstall` iterates the resource list in the **stored release manifest** (the `sh.helm.release.v1.<release>.v<N>` Secret). For each entry it issues a DELETE — without re-checking the live resource's ownership labels first. Those labels matter for `helm install`'s adoption decision, not uninstall's deletion decision.

```bash
kubectl label pvc <pvc> -n <ns> app.kubernetes.io/managed-by=Helm --overwrite
kubectl annotate pvc <pvc> -n <ns> \
meta.helm.sh/release-name=<new-release-name> \
meta.helm.sh/release-namespace=<ns> --overwrite
```
We learned this on 2026-04-27 by trying it on a StorageClass during the `hasan-prod` migration. The strip succeeded (live labels visibly empty), `helm uninstall` ran, and the StorageClass got deleted anyway. See [§ Case studies](#case-studies).

Trade-off: the resource is briefly orphaned between uninstall and re-stamp. Data is fine; only the ownership metadata is in flux.
If you find yourself reaching for Option B, you actually want Option A or Option C.
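
If you want to convince yourself of this before a maintenance window, look at what uninstall will actually iterate. `helm get manifest` is the supported view; the raw decode below is a sketch that assumes Helm 3's release Secret encoding (gzipped JSON, base64-encoded twice by the time `kubectl` prints it).

```bash
# The manifest helm uninstall walks; anything listed here without
# helm.sh/resource-policy: keep gets a DELETE, whatever its live labels say
helm get manifest <release> -n <ns>

# Same data, read straight from the stored release record (revision <N> varies)
kubectl get secret "sh.helm.release.v1.<release>.v<N>" -n <ns> \
  -o jsonpath='{.data.release}' | base64 -d | base64 -d | gzip -d
```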

### Option C — accept PVC deletion, rely on PV `reclaimPolicy: Retain`
### Option C — accept PVC deletion, rely on PV `reclaimPolicy: Retain` (FALLBACK)

If Option A isn't available and the underlying storage is `Retain`, the data survives PVC deletion. You'll need to rebuild the PVCs manually after uninstall:

@@ -114,12 +114,17 @@ Only safe if the PV's `reclaimPolicy` is `Retain`. With `Delete` you lose data t
4. Snapshot of the underlying storage (EFS / EBS / NFS) taken?
5. Translated values file rendered (`helm template ... --dry-run=server`) and reviewed?
6. PVC name + storageClassName + size in the new chart match the existing PVCs?
7. **No active workloads in the namespace beyond the release itself.** `kubectl get jobs -n <ns>` for spawned training Jobs; `kubectl get pods -n <ns>` for anything else mounting the release's PVCs. Active pods hold PVCs in `Terminating` via the `kubernetes.io/pvc-protection` finalizer (their CSI mounts keep the PVC alive until the pod releases the volume), which can stall the migration for hours. Either wait for them to drain, or — if losing in-flight work is acceptable — `kubectl delete jobs -n <ns> --all` before `helm uninstall`. We missed this during the 2026-04-27 `hasan-prod` migration; 9 customer training Jobs were live and we had to delete them mid-migration.
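
A quick way to run check 7 before committing to the uninstall. The `jq` filter is a generic sketch (any pod mounting the PVC holds the `pvc-protection` finalizer open); the PVC name is a placeholder.

```bash
# Anything in the namespace that could be holding the release's PVCs
kubectl get jobs,pods -n <ns>

# Pods currently mounting a given PVC (placeholder name <pvc>)
kubectl get pods -n <ns> -o json \
  | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc>") | .metadata.name'

# The finalizer that keeps a mounted PVC stuck in Terminating
kubectl get pvc <pvc> -n <ns> -o jsonpath='{.metadata.finalizers}{"\n"}'
```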

Only proceed to `helm uninstall` when 1 **or** 2+3+4 is satisfied, and 5+6 are verified.
Only proceed to `helm uninstall` when 1 **or** 2+3+4 is satisfied, and 5+6+7 are verified.

---

## Case study — `tracebloc-templates` migration, 2026-04-22
## Case studies

Two real migrations on the prod EKS cluster, both `eks-1.0.x` → `client-1.x`. Both ran the wrong protection trick first; both recovered via Option C. The pattern of failure has now bitten us twice on different live-resource-modification approaches; assume the next variant of "I'll just modify the live resource and helm will respect it" will also fail.

### 2026-04-22 — `tracebloc-templates` (`eks-1.0.1` → `client-1.0.3`)

- **Source:** release `tracebloc` on chart `eks-1.0.1`
- **Target:** chart `client-1.0.3`
@@ -129,6 +129,17 @@ Only proceed to `helm uninstall` when 1 **or** 2+3+4 is satisfied, and 5+6 are v
- **Recovery:** PVs had `reclaimPolicy: Retain` and EFS access points survive PV deletion, so no data was lost. We applied Option C above: recreated the StorageClass, cleared `claimRef` on the 3 PVs, re-created PVCs pointing at the retained PVs via `spec.volumeName`. `helm install <new>` then adopted the PVCs. MySQL came up with all 400K rows intact; shared and logs EFS directories intact.
- **Lesson:** never trust `kubectl annotate` alone to protect live Helm-owned resources. Always verify via `helm get manifest`.
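
For reference, the two recovery steps named above (clearing `claimRef`, re-creating a PVC bound to the retained PV) look roughly like this. Names, sizes, and access modes are placeholders; the full Option C procedure earlier in this doc is the authoritative version.

```bash
# Free the retained PV so a fresh PVC can bind to it (placeholder PV name)
kubectl patch pv <pv-name> --type=json \
  -p '[{"op":"remove","path":"/spec/claimRef"}]'

# Re-create the PVC, pinned to the retained PV via volumeName
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <ns>
spec:
  storageClassName: <storage-class>
  volumeName: <pv-name>
  accessModes: ["ReadWriteMany"]   # RWX shown because these were EFS-backed; adjust to match the PV
  resources:
    requests:
      storage: 10Gi                # must not exceed the PV capacity
EOF
```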

### 2026-04-27 — `hasan-prod` (`eks-1.0.3` → `client-1.1.0`)

- **Source:** release `hasan-prod` on chart `eks-1.0.3` in `tracebloc-templates-prod`
- **Target:** chart `client-1.1.0`
- **What we did:** read this doc, identified that the StorageClass `client-storage-class` was Helm-owned (live: `app.kubernetes.io/managed-by=Helm`, `meta.helm.sh/release-name=hasan-prod`) with no `keep` annotation in the stored manifest, then tried Option B exactly as previously documented — stripped the live ownership label and the two release annotations.
- **What happened:** strip succeeded (verified live: `managed-by=` and `release-name=` both empty). `helm uninstall` ran. **The StorageClass was deleted anyway**, the PVCs went into `Terminating`, and we discovered mid-uninstall that 9 customer training Jobs were holding `client-pvc` and `client-logs-pvc` via `pvc-protection` finalizers.
- **Root cause:** `helm uninstall` iterates the stored release manifest's resource list and DELETEs each entry. It does not re-check the live resource's ownership before deleting. The label-strip is theatrical — it changes what `helm install` would adopt, not what `helm uninstall` deletes. Same root cause as the 2026-04-22 keep-annotation case, different mistake. **This is why Option B is now documented as "DOES NOT WORK" above.**
- **Compounding finding:** the pre-uninstall checklist had no item for active workloads in the namespace. The 9 training Jobs were spawned dynamically by `jobs-manager` and were not Helm-owned, so `helm uninstall` left them alone — but their pods kept the PVCs mounted, blocking deletion.
- **Recovery:** the user accepted losing the 9 in-flight training runs, so `kubectl delete jobs -n tracebloc-templates-prod --all` released the PVCs. From there, Option C as documented: cleared `claimRef` on the 3 PVs (Retain saved us again), re-created the StorageClass from the old values, pre-created PVCs with `volumeName` and the new release's Helm ownership stamp, then `helm install hasan-prod ./client-1.1.0`. MySQL came up with 63 tables intact; the chart adopted the pre-created PVCs cleanly. Three independent backups (logical mysqldump, on-demand AWS Backup, daily automated AWS Backup) were untouched.
- **Lessons:** (1) Option B was a doc bug — fixed in this revision. (2) Pre-uninstall checklist now requires verifying no active workloads beyond the release itself. (3) When tempted to modify a live Helm-managed resource and expect uninstall to respect it, **assume it won't** until proven otherwise via `helm get manifest`.
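
The "Helm ownership stamp" in that recovery is the standard adoption metadata Helm checks at install time (the same labels that, per Option B above, are irrelevant at uninstall time). A sketch with placeholder names:

```bash
# Stamp a pre-created PVC so `helm install <new-release>` adopts it instead of erroring
kubectl label pvc <pvc> -n <ns> app.kubernetes.io/managed-by=Helm --overwrite
kubectl annotate pvc <pvc> -n <ns> \
  meta.helm.sh/release-name=<new-release> \
  meta.helm.sh/release-namespace=<ns> --overwrite
```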

---

## Verifying the new chart (`client-1.0.x` and later)