fix(ondemand): wait_for_rollout=false on the build worker (stop apply failing on its cold image pull) #201

Merged: wdvr merged 1 commit into main from fix/ondemand-wait-for-rollout on Jun 2, 2026
@wdvr wdvr commented Jun 2, 2026

Symptom

tofu apply errors on:

Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
  with kubernetes_deployment_v1.pytorch_ondemand

…but the deployment is actually healthy right after (revision converges, pod 1/1 Running).

Cause

The on-demand build worker is a Recreate-strategy Deployment pinned to the build node, which has no image-prepuller. On every image change, the Recreate strategy tears down the old pod first, and the replacement pod then cold-pulls the ~28 GB image (measured 5m42s), leaving 0 replicas Ready in the gap. That exceeds the default wait_for_rollout window, so the apply fails even though the rollout finishes shortly afterwards. This will recur on every image rebuild.

Fix

Set wait_for_rollout = false on this resource. It's a background worker (requesters fall through to in-pod builds while it's restarting), so the apply shouldn't block or fail on it. This mirrors the existing pytorch-snapshot DaemonSet (git-cache.tf:55).
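
For reference, a minimal sketch of where the attribute sits on the resource. Only the resource address (kubernetes_deployment_v1.pytorch_ondemand), the Recreate strategy, the single replica, and wait_for_rollout = false come from this PR; the object name, labels, container name, and image reference below are hypothetical placeholders, and node pinning and the rest of the real spec are left out.

```hcl
resource "kubernetes_deployment_v1" "pytorch_ondemand" {
  # Don't block (or fail) the apply while the new pod cold-pulls its image.
  wait_for_rollout = false

  metadata {
    name = "pytorch-ondemand" # hypothetical name
  }

  spec {
    replicas = 1

    # Recreate tears the old pod down before starting the new one,
    # which is what leaves 0 replicas Ready during the pull.
    strategy {
      type = "Recreate"
    }

    selector {
      match_labels = { app = "pytorch-ondemand" } # hypothetical label
    }

    template {
      metadata {
        labels = { app = "pytorch-ondemand" } # hypothetical label
      }

      spec {
        container {
          name  = "worker"                          # hypothetical
          image = "registry.example/ondemand:latest" # hypothetical image reference
        }
      }
    }
  }
}
```

With this set, the apply returns once the Deployment object is updated, so rollout health has to be checked separately rather than inferred from a successful apply.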

Terraform-only change; it doesn't touch the Docker image hash.

The on-demand builder is a Recreate-strategy Deployment on the build node, which
has no image-prepuller, so every image change makes it cold-pull the ~28GB image
(~6min) with 0 replicas Ready in the gap. That exceeds the default
wait_for_rollout window and FAILS the apply ('1 replicas wanted; 0 replicas
Ready') even though the rollout converges seconds later. The worker isn't
user-facing (requesters fall through to in-pod builds while it's down), so don't
gate the apply on it — same as the pytorch-snapshot DaemonSet (git-cache.tf:55).
@wdvr wdvr merged commit 8104304 into main Jun 2, 2026
3 checks passed
@wdvr wdvr deleted the fix/ondemand-wait-for-rollout branch June 2, 2026 06:16