fix(warm): free DNS name on recycle + rotate stale-image warm pods + unify pod name#199

Merged
wdvr merged 1 commit into main from fix/warm-domain-orphans-and-naming
Jun 2, 2026

Conversation

wdvr (Owner) commented Jun 2, 2026

Why

This came out of diagnosing why fresh reservations kept getting -2 suffixes (e.g. bright_fox-2.devservers.io). The cause is not the birthday paradox over the ~13.7k name pool; it is a leak.

A recycled warm pod (_recycle_warm_pod) deleted the pod + SSH service but never deleted its placeholder domain mapping, which is written with a 7-day expiry and reservation_id="warm". generate_unique_name() counts every non-expired mapping as taken, so the picker collides against a growing junk pile.

Prod data at time of diagnosis: 449 non-expired mappings, every single one reservation_id="warm", vs only ~24 live warm pods. That leaves ~425 orphans, accumulating in bursts of 30–65 every ~12h (the warm-pool recycle cycles triggered by image rebuilds), each holding a name for 7 days.
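
To make the failure mode concrete, here is a minimal sketch of a picker that treats every non-expired mapping as taken. It is not the real generate_unique_name(); the helper names (taken_names, resolve_name), the record fields, and the "-2, -3, ..." suffix rule are assumptions for illustration only.

```python
import time

def taken_names(mappings, now):
    """Every non-expired mapping counts as taken, whoever owns it."""
    return {m["name"] for m in mappings if m["expires_at"] > now}

def resolve_name(candidate, taken):
    """Return the candidate, or the first free numeric suffix (-2, -3, ...)."""
    if candidate not in taken:
        return candidate
    suffix = 2
    while f"{candidate}-{suffix}" in taken:
        suffix += 1
    return f"{candidate}-{suffix}"

now = int(time.time())
week = 7 * 24 * 3600

# An orphaned placeholder: its warm pod is gone, but the mapping lives for 7 days.
orphans = [{"name": "bright_fox", "reservation_id": "warm", "expires_at": now + week}]

print(resolve_name("bright_fox", taken_names(orphans, now)))  # collides with the orphan
print(resolve_name("bright_fox", taken_names([], now)))       # free once the orphan is deleted
```

With the figures above, a uniformly random pick would hit a taken name roughly 3% of the time (449 of ~13.7k), versus about 0.2% if only the ~24 live warm pods held names.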

What

  1. Free the name on recycle: _recycle_warm_pod now calls delete_domain_mapping(warm-domain). Placeholder expiry shortened 7d → WARM_POD_MAX_AGE_HOURS + 2h, so any orphan left by a missed delete self-cleans via DynamoDB TTL within hours, not days.
  2. Rotate stale-image warm pods after a rebuild (the "rotate on apply, only if not in use" ask): reconcile_warm_pool compares each ready pod's running digest to the :latest digest in ECR and recycles at most one per type per tick (gradual; the deficit backfill recreates it on the fresh image). Never touches warm-state=claimed pods. Needs ecr:DescribeImages on the processor role (added).
  3. Unify the pod name: warm claim stamps GPU_DEV_HOSTLABEL=gpu-dev-<resid8> into the shell-ext files; the image prompt prefers it over %m/\h, so a warm-claimed pod's prompt shows the same handle you connect with (the SSH alias from #185, "fix(cli): SSH alias keys off reservation id (warm pods reachable by resid)") instead of the warm-pool hostname gpu-dev-b200-<hex>. Cold pods leave it unset (their hostname already matches).
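
The rotation guards in item 2 can be sketched as a pure selection function. This is an illustration under assumptions, not the code in reconcile_warm_pool: the function name select_pods_to_rotate and the pod fields (name, gpu_type, warm_state, digest) are hypothetical.

```python
def select_pods_to_rotate(pods, latest_digest):
    """Pick at most one stale, unclaimed warm pod per GPU type for this tick."""
    if not latest_digest:
        return []  # unknown :latest digest: rotate nothing rather than guess
    chosen = {}
    for pod in pods:
        if pod["warm_state"] != "ready":
            continue  # never touch claimed (or otherwise non-ready) pods
        if not pod.get("digest"):
            continue  # unknown running digest: skip
        if pod["digest"] == latest_digest:
            continue  # already on the fresh image
        chosen.setdefault(pod["gpu_type"], pod["name"])  # cap: one per type per tick
    return sorted(chosen.values())

pods = [
    {"name": "b200-a", "gpu_type": "b200", "warm_state": "ready", "digest": "sha256:old"},
    {"name": "b200-b", "gpu_type": "b200", "warm_state": "ready", "digest": "sha256:old"},
    {"name": "b200-c", "gpu_type": "b200", "warm_state": "claimed", "digest": "sha256:old"},
    {"name": "h100-a", "gpu_type": "h100", "warm_state": "ready", "digest": "sha256:new"},
]
print(select_pods_to_rotate(pods, "sha256:new"))
```

Because only one pod per type is recycled per tick, the pool drains onto the new image gradually while the deficit backfill replaces each recycled pod.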

Tests

+18 unit tests (recycle mapping cleanup + error swallow, digest parse/resolve helpers, rotation: rotates one / caps one-per-tick / skips on digest match / skips on unknown digest / never claimed, short placeholder expiry, HOSTLABEL stamp). Full suite: 1166 passed.

Deploy notes

  • Needs tofu apply (prod) for the lambda + IAM. The image prompt change needs the image rebuild (it rides the pending rebuild from #198, "feat(image): Codex CLI on GPT-5.5 via Bedrock (no per-user key)").
  • One-time cleanup of the existing 449 orphan mappings is separate (not in this PR) — safe to delete only the warm mappings not referenced by a live warm pod's warm-domain annotation.
  • After this lands, future applies auto-rotate idle warm pods onto the new image (no manual warm-pod kill needed).
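
The safety rule for the one-time cleanup can be sketched as a set difference. This is an illustration, not a script from this PR; the function name find_orphan_mappings and the record shapes are assumptions, and only the "warm" reservation_id and the warm-domain annotation come from the description above.

```python
def find_orphan_mappings(mappings, live_warm_pods):
    """Return names of warm placeholder mappings that no live warm pod references."""
    live_domains = {
        pod.get("annotations", {}).get("warm-domain") for pod in live_warm_pods
    }
    live_domains.discard(None)  # pods without the annotation protect nothing
    return [
        m["name"]
        for m in mappings
        if m["reservation_id"] == "warm" and m["name"] not in live_domains
    ]

mappings = [
    {"name": "bright_fox", "reservation_id": "warm"},
    {"name": "calm_otter", "reservation_id": "warm"},
    {"name": "quick_lynx", "reservation_id": "res-1234"},  # real reservation: never deleted
]
live_warm_pods = [{"name": "gpu-dev-b200-0a1b", "annotations": {"warm-domain": "calm_otter"}}]
print(find_orphan_mappings(mappings, live_warm_pods))
```

Listing candidates first and deleting in a separate step keeps the cleanup reviewable before anything is removed.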

…unify pod name

Root cause of '-2' suffixes on fresh reservations: a recycled warm pod deleted
the pod + service but never deleted its placeholder SSH domain mapping, written
with a 7-day expiry and reservation_id='warm'. generate_unique_name() counts
every non-expired mapping as taken, so the random adjective_animal picker was
colliding against a junk pile of orphans (prod: 449 non-expired mappings, ALL
'warm', vs only ~24 live warm pods).

- _recycle_warm_pod now calls delete_domain_mapping(warm-domain) so the name frees
  immediately; placeholder expiry shortened from 7d to WARM_POD_MAX_AGE_HOURS+2h
  so any missed orphan self-cleans via DynamoDB TTL within hours, not days.
- reconcile_warm_pool rotates idle warm pods off a stale image after a rebuild:
  compares each ready pod's running digest to the :latest digest in ECR and
  recycles at most ONE per type per tick (gradual; never touches claimed pods).
  Needs ecr:DescribeImages on the processor role (added in lambda.tf).
- warm claim stamps GPU_DEV_HOSTLABEL=gpu-dev-<resid8> into the shell-ext files;
  the image prompt (zshrc/bashrc) prefers it over %m/\h so a warm-claimed pod's
  prompt shows the same handle you connect with (== the SSH alias), instead of
  the warm-pool hostname. Cold pods leave it unset (hostname already matches).

Tests: +18 unit (recycle mapping cleanup, digest helpers, rotation cap/guards,
short placeholder expiry, HOSTLABEL stamp). Full suite 1166 passed.
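
A minimal sketch of the recycle cleanup and the shortened expiry described in the first bullet, under stated assumptions: the value of WARM_POD_MAX_AGE_HOURS, the injected delete callables, the deletion order, and the pod shape are all hypothetical. The only outside fact relied on is that DynamoDB TTL attributes hold an epoch timestamp in seconds.

```python
import logging

WARM_POD_MAX_AGE_HOURS = 12  # hypothetical value; the real constant is not shown in this PR

def placeholder_expiry(now_epoch, max_age_hours=WARM_POD_MAX_AGE_HOURS):
    """Epoch-seconds expiry for a warm placeholder: max pod age plus a 2h margin."""
    return int(now_epoch) + (max_age_hours + 2) * 3600

def recycle_warm_pod(pod, delete_pod, delete_service, delete_domain_mapping):
    """Delete the pod and service, then free the placeholder name.

    A failed mapping delete is logged and swallowed; the short TTL is the backstop.
    Returns True when the mapping was deleted.
    """
    delete_pod(pod["name"])
    delete_service(pod["name"])
    domain = pod.get("annotations", {}).get("warm-domain")
    if not domain:
        return False
    try:
        delete_domain_mapping(domain)
        return True
    except Exception:
        logging.warning("could not delete mapping %s; TTL will expire it", domain)
        return False
```

Swallowing the error keeps a mapping-store hiccup from blocking the recycle itself, which is why the expiry margin matters as a second line of cleanup.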
@wdvr wdvr merged commit 66e4edd into main Jun 2, 2026
3 checks passed
@wdvr wdvr deleted the fix/warm-domain-orphans-and-naming branch June 2, 2026 06:09
wdvr added a commit that referenced this pull request Jun 2, 2026
…202)

Root cause of 'new image never reaches pods' (codex stayed broken after apply) and
the warm-rotation thrash:

Pods used :latest + imagePullPolicy=IfNotPresent. After a rebuild, a node that
already has an old :latest cached does NOT re-pull — kubelet serves the stale
image to every new pod until the prepuller finishes re-pulling 27GB (~5-6min/node,
24 nodes). A cold reserve in that window gets the old (broken-codex) image; and
the #199 warm rotation recycles old-image pods that instantly come back on the
cached old :latest -> recycled again -> thrash.

Fix (the pattern #191 already uses for build jobs): pin pods to the immutable
hash tag latest-<context-hash> (local.full_image_uri). Each rebuild = a tag the
node has never seen, so IfNotPresent pulls the NEW image -> guaranteed-correct, no
stale window, and the warm rotation converges (the recycled pod can't come up on
the old cache; it pulls the new tag). Prepuller pinned to the same tag so it
pre-warms the exact ref. Tag is immutable/stable, so OOM-restart still works.

ami-baker/eks user-data keep :latest (boot-time LAYER prewarm; same digest, fast
manifest-only pod pull). No docker files changed -> no image rebuild on apply;
this is a lambda-env + prepuller-DS change only.
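
The stale-cache behavior described in this commit message can be illustrated with a toy model of a node's image cache under imagePullPolicy=IfNotPresent. This is a simplified simulation, not kubelet code; the Node class, the repo name, and the hash values are made up for the example.

```python
class Node:
    """Toy node: caches the digest it saw the first time it pulled a given image ref."""

    def __init__(self):
        self.cache = {}  # image ref -> digest pulled for that ref

    def start_pod(self, ref, registry):
        # IfNotPresent: pull only when this exact ref is not already cached.
        if ref not in self.cache:
            self.cache[ref] = registry[ref]
        return self.cache[ref]

registry = {"repo:latest": "sha256:old", "repo:latest-aaa111": "sha256:old"}
node = Node()
node.start_pod("repo:latest", registry)
node.start_pod("repo:latest-aaa111", registry)

# Rebuild: the mutable tag moves, and a new immutable hash tag appears.
registry["repo:latest"] = "sha256:new"
registry["repo:latest-bbb222"] = "sha256:new"

print(node.start_pod("repo:latest", registry))         # cached ref: stale digest is served
print(node.start_pod("repo:latest-bbb222", registry))  # unseen ref: forces a fresh pull
```

In this model a recycled warm pod on the mutable tag comes back on the old digest every time, which matches the thrash described above, while the hash tag converges after one pull.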