feat: add preStop drain and terminationGracePeriodSeconds to CaddyConfig #217
Merged
…Caddy

During rolling updates, Caddy pods receive SIGTERM while Cloudflare still holds persistent connections, causing 521 errors. Add two new CaddyConfig fields to allow operators to configure graceful drain:
- preStopSleepSeconds: injects a preStop exec sleep on all containers so load-balancer endpoint propagation completes before SIGTERM is sent
- terminationGracePeriodSeconds: pod-level override to ensure the grace period is long enough to cover the preStop sleep + Caddy shutdown

Both fields are wired through Args → SimpleContainerArgs → PodSpec and the Lifecycle hook respectively.
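The nil/zero/positive behaviour of the preStop hook can be sketched as follows. This is a minimal illustration, not the PR's code: the struct types are local stand-ins that mirror the Kubernetes lifecycle fields, and `preStopLifecycle` and `validateGracePeriod` are hypothetical names. Treating zero like "unset" is an assumption based on the test cases listed in this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// Local stand-ins for the Kubernetes lifecycle types, so the sketch runs
// without client libraries. They are NOT the real API types.
type ExecAction struct {
	Command []string
}

type LifecycleHandler struct {
	Exec *ExecAction
}

type Lifecycle struct {
	PreStop *LifecycleHandler
}

// preStopLifecycle (hypothetical helper) returns nil when no sleep is
// configured (nil or non-positive), otherwise an exec "sleep N" preStop hook.
func preStopLifecycle(sleepSeconds *int) *Lifecycle {
	if sleepSeconds == nil || *sleepSeconds <= 0 {
		return nil
	}
	return &Lifecycle{
		PreStop: &LifecycleHandler{
			Exec: &ExecAction{Command: []string{"sleep", strconv.Itoa(*sleepSeconds)}},
		},
	}
}

// validateGracePeriod (hypothetical helper) rejects a grace period that does
// not outlast the preStop sleep. The preStop hook runs inside the grace
// period, so the remainder is all the time Caddy has to shut down.
func validateGracePeriod(preStopSleep, gracePeriod int) error {
	if gracePeriod <= preStopSleep {
		return fmt.Errorf("terminationGracePeriodSeconds (%d) must be greater than preStopSleepSeconds (%d)",
			gracePeriod, preStopSleep)
	}
	return nil
}

func main() {
	sleep := 15 // illustrative value, not the production setting
	if lc := preStopLifecycle(&sleep); lc != nil {
		fmt.Println("preStop command:", lc.PreStop.Exec.Command)
	}
	fmt.Println("grace 30:", validateGracePeriod(sleep, 30))
	fmt.Println("grace 10:", validateGracePeriod(sleep, 10))
}
```

Returning nil for the unset case keeps existing deployments unchanged unless an operator opts in.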
… restarts

time.Now() was used at pulumi eval time, so caddy-updated-at always changed on every pulumi up even when the Caddyfile was identical. This dirtied the pod template on every app deployment, causing a Caddy rolling restart each time, which triggered Cloudflare 521 errors due to persistent connections being dropped before Cloudflare rerouted them.

History: the original value was the static string "latest" (PR #59 changed it to time.Now() as an "improvement"). The intent was informational, not a rollout trigger.

Fix: derive caddy-updated-at from the Caddyfile content hash (same source as caddy-update-hash). The annotation value is now stable across pulumi ups when the Caddyfile hasn't changed, so K8s sees no pod template diff → no rollout. Caddy still rolls when the Caddyfile actually changes (different hash).

Confirmed root cause via GCP Cloud Logging: all three Caddy patch events on 2026-04-10 had identical hash (03709a04d391d8ac) but different timestamps, proving time.Now() was the sole cause of every rollout.
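The difference between a time-derived and a content-derived annotation value can be sketched as follows. The helper names are hypothetical, and SHA-256 with a 16-character truncation is an assumption chosen for illustration; the commit does not state which hash or format the project uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns a short hex digest of the Caddyfile content.
// SHA-256 and the 16-character length are assumptions for this sketch.
func contentHash(caddyfile string) string {
	sum := sha256.Sum256([]byte(caddyfile))
	return hex.EncodeToString(sum[:])[:16]
}

// stableAnnotations (hypothetical helper) derives both annotation values from
// the Caddyfile content, so identical input always yields identical output
// and the pod template is not dirtied between runs.
func stableAnnotations(caddyfile string) map[string]string {
	h := contentHash(caddyfile)
	return map[string]string{
		"caddy-update-hash": h,
		"caddy-updated-at":  h, // previously time.Now(), which differed on every run
	}
}

func main() {
	first := stableAnnotations("example.com {\n\treverse_proxy app:8080\n}\n")
	second := stableAnnotations("example.com {\n\treverse_proxy app:8080\n}\n")
	changed := stableAnnotations("example.com {\n\treverse_proxy app:9090\n}\n")

	fmt.Println("same Caddyfile, same value:", first["caddy-updated-at"] == second["caddy-updated-at"])
	fmt.Println("changed Caddyfile, new value:", first["caddy-updated-at"] != changed["caddy-updated-at"])
}
```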
…revent spurious pod restarts

Root cause: caddy-updated-at/caddy-updated-by were patched into spec.template.metadata.annotations which triggers a rolling restart on every change. Combined with time.Now() being evaluated on every pulumi up, this caused Caddy to roll on every app deploy, producing Cloudflare 521 errors while the old pod was terminating.

Changes:
- DeploymentPatchArgs gains DeploymentAnnotations for metadata-only patches (no pod restart)
- caddy-updated-at and caddy-updated-by moved to DeploymentAnnotations in gke_autopilot_stack.go and kube_run.go; only caddy-update-hash (content-driven) remains in pod template annotations
- Extract buildPodTemplatePatch/buildDeploymentMetadataPatch helpers for testability
- Extract buildPreStopLifecycle helper for testability
- Add unit tests: patch target isolation, preStop lifecycle injection (nil/zero/positive)
- Fix gci import formatting in caddy.go, deployment.go, simple_container.go, deployment_patch.go
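The two patch targets can be illustrated with plain maps. This is a sketch under assumptions: `podTemplatePatch` and `deploymentMetadataPatch` are hypothetical names (the signatures of the PR's own helpers are not shown here), and the bodies are generic merge-patch shapes rather than the Pulumi arguments the project actually builds.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podTemplatePatch (hypothetical) touches only
// spec.template.metadata.annotations. A change here alters the pod template,
// so the Deployment controller rolls the pods.
func podTemplatePatch(annotations map[string]string) map[string]any {
	return map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"metadata": map[string]any{"annotations": annotations},
			},
		},
	}
}

// deploymentMetadataPatch (hypothetical) touches only metadata.annotations on
// the Deployment object itself. The pod template is untouched, so no rollout
// is triggered.
func deploymentMetadataPatch(annotations map[string]string) map[string]any {
	return map[string]any{
		"metadata": map[string]any{"annotations": annotations},
	}
}

func main() {
	rollout := podTemplatePatch(map[string]string{"caddy-update-hash": "03709a04d391d8ac"})
	audit := deploymentMetadataPatch(map[string]string{"caddy-updated-by": "example-stack"}) // illustrative value

	for name, patch := range map[string]map[string]any{"pod template": rollout, "deployment metadata": audit} {
		body, err := json.MarshalIndent(patch, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s patch:\n%s\n", name, body)
	}
}
```

Keeping the two builders separate makes the isolation property easy to assert in a unit test: the audit patch never contains a `spec` key.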
smecsia reviewed on Apr 10, 2026
smecsia approved these changes on Apr 10, 2026
Root Cause
Two Cloudflare 521 incidents on 2026-04-09 (support-bot) and 2026-04-10 (wallet) were confirmed via GCP Cloud Logging to be caused by Caddy rolling restarts triggered by unrelated app deploys.
How it happened:
- `caddy-updated-at: time.Now()` was patched into `spec.template.metadata.annotations` of the Caddy Deployment.
- Because `time.Now()` is evaluated on every `pulumi up`, Caddy rolled on every app deploy, even when the Caddyfile didn't change.

Evidence:
- `deploy-bot@payspace-475408.iam.gserviceaccount.com` patched the Caddy Deployment seconds before each 521.
- The patch events had an identical `caddy-update-hash` (`03709a04d391d8ac...`) but different `caddy-updated-at` timestamps, confirming the idempotency failure.

Fix: Two layers
Layer 1: Remove the idempotency bug (root cause)
- `caddy-updated-at` / `caddy-updated-by` are informational audit-trail annotations. They don't need to live in `spec.template.metadata` (which triggers pod restarts). They now go on `deployment.metadata` only.
- `caddy-update-hash` stays in `spec.template.metadata`. This is the only annotation that should trigger Caddy to reload, and only when the Caddyfile actually changes.

Before: every `pulumi up` patched `time.Now()` into the pod template → always dirty → always rolls
After: only a Caddyfile content change changes `caddy-update-hash` → Caddy rolls only when needed

Implementation:
- `DeploymentPatchArgs` gets a new `DeploymentAnnotations` field (maps to `metadata.annotations`, no pod restart)
- `buildPodTemplatePatch` / `buildDeploymentMetadataPatch` extracted as testable helpers
- `gke_autopilot_stack.go` and `kube_run.go` updated to route annotations correctly

Layer 2: Graceful shutdown for unavoidable rolling updates
When a Caddyfile change genuinely requires a rollout (new service, route change), the old pod should drain connections before SIGTERM. New `CaddyConfig` fields:
- `preStopSleepSeconds`: injects an `exec: sleep N` preStop lifecycle hook, giving the load balancer time to remove the pod from the backend pool before the container receives SIGTERM
- `terminationGracePeriodSeconds`: pod-level grace period; must be > `preStopSleepSeconds`

Production config (already applied in `server.yaml`):

Files changed
- `pkg/clouds/k8s/types.go`: `CaddyConfig` ← `PreStopSleepSeconds`, `TerminationGracePeriodSeconds`
- `pkg/clouds/pulumi/kubernetes/deployment_patch.go`: `DeploymentAnnotations` field + dual-patch logic + testable helpers
- `pkg/clouds/pulumi/kubernetes/deployment.go`: `Args` ← `PreStopSleepSeconds`, `TerminationGracePeriodSeconds`; extract `buildPreStopLifecycle`
- `pkg/clouds/pulumi/kubernetes/simple_container.go`: `SimpleContainerArgs` ← `TerminationGracePeriodSeconds`
- `pkg/clouds/pulumi/kubernetes/caddy.go`: `DeploySimpleContainer` call
- `pkg/clouds/pulumi/gcp/gke_autopilot_stack.go`: `caddy-updated-at`/`by` → `DeploymentAnnotations`; keep `caddy-update-hash` in pod template
- `pkg/clouds/pulumi/kubernetes/kube_run.go`
- `pkg/clouds/pulumi/kubernetes/deployment_patch_test.go`

Testing
All 8 new tests pass. All existing tests unchanged.