fix(warm): free DNS name on recycle + rotate stale-image warm pods + unify pod name by wdvr · Pull Request #199 · wdvr/osdc

wdvr · 2026-06-02T05:50:53Z

Why

This came out of diagnosing why fresh reservations kept getting -2 suffixes (e.g. bright_fox-2.devservers.io). It's not the birthday paradox over the ~13.7k name pool; it's a leak.

A recycled warm pod (_recycle_warm_pod) deleted the pod + SSH service but never deleted its placeholder domain mapping, which is written with a 7-day expiry and reservation_id="warm". generate_unique_name() counts every non-expired mapping as taken, so the picker collides against a growing junk pile.

Prod data at time of diagnosis: 449 non-expired mappings, every single one reservation_id="warm", vs only ~24 live warm pods. ~425 orphans, accumulating in 30–65-count bursts every ~12h (the warm-pool recycle cycles from image rebuilds), each holding a name for 7 days.
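To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes each pick is uniform over the pool and that landing on any non-expired mapping is what triggers a suffix; the real generate_unique_name() may differ in detail, and the helper name below is made up for illustration.

```python
# Illustrative arithmetic only. Assumes a uniform random pick over the pool
# and that any non-expired mapping counts as "taken".
POOL_SIZE = 13_700     # approximate adjective_animal pool size (from this PR)
LIVE_WARM = 24         # mappings that should exist (live warm pods)
LEAKED_TOTAL = 449     # non-expired mappings observed in prod


def collision_chance(taken: int, pool: int = POOL_SIZE) -> float:
    """Probability that a single uniform pick hits an already-taken name."""
    return taken / pool


healthy = collision_chance(LIVE_WARM)
leaky = collision_chance(LEAKED_TOTAL)
print(f"healthy: {healthy:.2%}  with orphans: {leaky:.2%}  "
      f"ratio: {leaky / healthy:.1f}x")
```

Under these assumptions the per-pick collision rate is roughly 19 times higher with the orphans in place than it would be with only the live warm pods holding names.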

What

Free the name on recycle: _recycle_warm_pod now calls delete_domain_mapping(warm-domain). Placeholder expiry is shortened from 7d to WARM_POD_MAX_AGE_HOURS + 2h, so any orphan left by a missed delete self-cleans via DynamoDB TTL within hours, not days.
Rotate stale-image warm pods after a rebuild (the "rotate on apply, only if not in use" ask): reconcile_warm_pool compares each ready pod's running digest to the :latest digest in ECR and recycles at most one per type per tick. Rotation is gradual, and the deficit backfill recreates the recycled pod on the fresh image. It never touches warm-state=claimed pods. Needs ecr:DescribeImages on the processor role (added).
Unify the pod name: warm claim stamps GPU_DEV_HOSTLABEL=gpu-dev-<resid8> into the shell-ext files, and the image prompt prefers it over %m/\h. A warm-claimed pod's prompt therefore shows the same handle you connect with (the SSH alias from #185, "fix(cli): SSH alias keys off reservation id (warm pods reachable by resid)") instead of the warm-pool hostname gpu-dev-b200-<hex>. Cold pods leave it unset (their hostname already matches).
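The rotation guard rails can be sketched roughly as below. This is not the code in reconcile_warm_pool: the WarmPod shape, the pick_stale_pods helper, and the h100 type are hypothetical, and treating a missing digest on either side as "skip" is an assumption based on the test names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarmPod:                 # hypothetical shape, not the real pod object
    name: str
    gpu_type: str
    warm_state: str            # "ready" or "claimed"
    digest: Optional[str]      # running image digest, None if unknown


def pick_stale_pods(pods: list[WarmPod],
                    latest_digest: Optional[str]) -> list[WarmPod]:
    """Return at most one stale, ready pod per GPU type for this tick."""
    if not latest_digest:
        return []              # :latest could not be resolved, rotate nothing
    chosen: dict[str, WarmPod] = {}
    for pod in pods:
        if pod.warm_state != "ready":
            continue           # never touch claimed pods
        if pod.digest is None or pod.digest == latest_digest:
            continue           # unknown or already current
        chosen.setdefault(pod.gpu_type, pod)   # cap: first stale pod per type
    return list(chosen.values())


pods = [
    WarmPod("gpu-dev-b200-aa11", "b200", "ready", "sha256:old"),
    WarmPod("gpu-dev-b200-bb22", "b200", "ready", "sha256:old"),
    WarmPod("gpu-dev-b200-cc33", "b200", "claimed", "sha256:old"),
    WarmPod("gpu-dev-h100-dd44", "h100", "ready", "sha256:new"),
    WarmPod("gpu-dev-h100-ee55", "h100", "ready", None),
]
to_recycle = pick_stale_pods(pods, "sha256:new")
print([p.name for p in to_recycle])
```

In this example only the first stale b200 pod is selected; the second stale b200 pod waits for a later tick, which is what keeps the rotation gradual.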

Tests

+18 unit tests (recycle mapping cleanup + error swallow, digest parse/resolve helpers, rotation: rotates one / caps one-per-tick / skips on digest match / skips on unknown digest / never claimed, short placeholder expiry, HOSTLABEL stamp). Full suite: 1166 passed.

Deploy notes

Needs tofu apply (prod) for the lambda + IAM. The image prompt change needs an image rebuild; it rides the pending rebuild for #198 (feat(image): Codex CLI on GPT-5.5 via Bedrock (no per-user key)).
One-time cleanup of the existing 449 orphan mappings is separate (not in this PR) — safe to delete only the warm mappings not referenced by a live warm pod's warm-domain annotation.
After this lands, future applies auto-rotate idle warm pods onto the new image (no manual warm-pod kill needed).
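For the one-time cleanup, the selection rule above amounts to a set difference. A minimal sketch, assuming mappings are available as dicts with hypothetical "domain" and "reservation_id" keys and that the live warm-domain annotation values have already been collected; the real table schema and the domain names used here are not taken from this PR.

```python
def orphan_warm_mappings(mappings, live_warm_domains):
    """Return warm placeholder domains that no live warm pod references.

    mappings: iterable of dicts with hypothetical keys
              "domain" and "reservation_id".
    live_warm_domains: warm-domain annotation values from live warm pods.
    """
    live = set(live_warm_domains)
    return sorted(
        m["domain"]
        for m in mappings
        if m["reservation_id"] == "warm" and m["domain"] not in live
    )


mappings = [
    {"domain": "bright_fox", "reservation_id": "warm"},   # orphan
    {"domain": "calm_otter", "reservation_id": "warm"},   # still live
    {"domain": "quick_lynx", "reservation_id": "r-1234"}, # real reservation
]
print(orphan_warm_mappings(mappings, ["calm_otter"]))
```

Mappings tied to a real reservation id are never candidates, and a warm mapping survives as long as some live pod's annotation still points at it.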

…unify pod name

Root cause of '-2' suffixes on fresh reservations: a recycled warm pod deleted the pod + service but never deleted its placeholder SSH domain mapping, written with a 7-day expiry and reservation_id='warm'. generate_unique_name() counts every non-expired mapping as taken, so the random adjective_animal picker was colliding against a junk pile of orphans (prod: 449 non-expired mappings, ALL 'warm', vs only ~24 live warm pods).

- _recycle_warm_pod now delete_domain_mapping(warm-domain) so the name frees immediately; placeholder expiry shortened from 7d to WARM_POD_MAX_AGE_HOURS+2h so any missed orphan self-cleans via DynamoDB TTL within hours, not days.
- reconcile_warm_pool rotates idle warm pods off a stale image after a rebuild: compares each ready pod's running digest to the :latest digest in ECR and recycles at most ONE per type per tick (gradual; never touches claimed pods). Needs ecr:DescribeImages on the processor role (added in lambda.tf).
- warm claim stamps GPU_DEV_HOSTLABEL=gpu-dev-<resid8> into the shell-ext files; the image prompt (zshrc/bashrc) prefers it over %m/\h so a warm-claimed pod's prompt shows the same handle you connect with (== the SSH alias), instead of the warm-pool hostname. Cold pods leave it unset (hostname already matches).

Tests: +18 unit (recycle mapping cleanup, digest helpers, rotation cap/guards, short placeholder expiry, HOSTLABEL stamp). Full suite 1166 passed.

…202)

Root cause of 'new image never reaches pods' (codex stayed broken after apply) and the warm-rotation thrash:

Pods used :latest + imagePullPolicy=IfNotPresent. After a rebuild, a node that already has an old :latest cached does NOT re-pull: kubelet serves the stale image to every new pod until the prepuller finishes re-pulling 27GB (~5-6min/node, 24 nodes). A cold reserve in that window gets the old (broken-codex) image; and the #199 warm rotation recycles old-image pods that instantly come back on the cached old :latest -> recycled again -> thrash.

Fix (the pattern #191 already uses for build jobs): pin pods to the immutable hash tag latest-<context-hash> (local.full_image_uri). Each rebuild = a tag the node has never seen, so IfNotPresent pulls the NEW image -> guaranteed-correct, no stale window, and the warm rotation converges (the recycled pod can't come up on the old cache; it pulls the new tag). Prepuller pinned to the same tag so it pre-warms the exact ref. Tag is immutable/stable, so OOM-restart still works.

ami-baker/eks user-data keep :latest (boot-time LAYER prewarm; same digest, fast manifest-only pod pull). No docker files changed -> no image rebuild on apply; this is a lambda-env + prepuller-DS change only.
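A toy model of the stale-cache behaviour described above, as a sketch only: it reduces a node to a tag-to-digest cache and IfNotPresent to "pull only if the tag is not cached". The Node class and the latest-abc123 tag are hypothetical stand-ins, and real kubelet image handling is more involved.

```python
class Node:
    """Toy node whose image cache behaves like imagePullPolicy=IfNotPresent."""

    def __init__(self):
        self.cache = {}                      # tag -> digest cached on the node

    def start_pod(self, tag, registry):
        if tag not in self.cache:            # pull only when the tag is unseen
            self.cache[tag] = registry[tag]
        return self.cache[tag]               # digest the pod actually runs


registry = {"latest": "sha256:old"}
node = Node()
before = node.start_pod("latest", registry)          # caches the old image

# Rebuild: :latest moves, and a new immutable hash tag appears.
registry["latest"] = "sha256:new"
registry["latest-abc123"] = "sha256:new"             # hypothetical hash tag

stale = node.start_pod("latest", registry)           # mutable tag: cache hit
pinned = node.start_pod("latest-abc123", registry)   # unseen tag: fresh pull
print(before, stale, pinned)
```

The mutable tag keeps returning the cached old digest after the rebuild, while the never-before-seen hash tag forces a pull of the new one, which is why pinning lets the warm rotation converge.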

wdvr merged commit 66e4edd into main Jun 2, 2026
3 checks passed

wdvr deleted the fix/warm-domain-orphans-and-naming branch June 2, 2026 06:09

wdvr mentioned this pull request Jun 2, 2026

fix(image): pin dev/warm pods + prepuller to immutable hash tag (stop stale :latest serving + warm thrash) #202

Merged

wdvr merged 1 commit into main from fix/warm-domain-orphans-and-naming
