Skip to content

feat: add preStop drain and terminationGracePeriodSeconds to CaddyConfig#217

Merged
Cre-eD merged 4 commits intomainfrom
fix/caddy-prestop-drain
Apr 10, 2026
Merged

feat: add preStop drain and terminationGracePeriodSeconds to CaddyConfig#217
Cre-eD merged 4 commits intomainfrom
fix/caddy-prestop-drain

Conversation

@Cre-eD
Copy link
Copy Markdown
Contributor

@Cre-eD Cre-eD commented Apr 10, 2026

Root Cause

Two Cloudflare 521 incidents on 2026-04-09 (support-bot) and 2026-04-10 (wallet) were confirmed via GCP Cloud Logging to be caused by Caddy rolling restarts triggered by unrelated app deploys.

How it happened:

  1. caddy-updated-at: time.Now() was patched into spec.template.metadata.annotations of the Caddy Deployment.
  2. Kubernetes treats any change to pod-template annotations as a diff → triggers a rolling restart.
  3. Since time.Now() is evaluated on every pulumi up, Caddy rolled on every app deploy, even when the Caddyfile didn't change.
  4. During the rolling update, the old Caddy pod received SIGTERM before Cloudflare's edge finished draining persistent connections → 521 errors.

Evidence:

  • GCP audit logs: deploy-bot@payspace-475408.iam.gserviceaccount.com patched the Caddy Deployment seconds before each 521.
  • Three patches on 2026-04-10 had identical caddy-update-hash (03709a04d391d8ac...) but different caddy-updated-at timestamps — confirming idempotency failure.
  • Caddy pod SIGTERM timestamps (16:10:12, 16:10:23) match the 521 at 16:10:27; (09:58:49, 09:59:00) match the 521 at 09:59:06.

Fix: Two layers

Layer 1 — Remove the idempotency bug (root cause)

caddy-updated-at / caddy-updated-by are informational audit-trail annotations. They don't need to live in spec.template.metadata (which triggers pod restarts). They now go on deployment.metadata only.

caddy-update-hash stays in spec.template.metadata — this is the only annotation that should trigger Caddy to reload, and only when the Caddyfile actually changes.

Before: every pulumi up patched time.Now() into pod-template → always-dirty → always rolls
After: only a Caddyfile content change changes caddy-update-hash → Caddy rolls only when needed

Implementation:

  • DeploymentPatchArgs gets a new DeploymentAnnotations field (maps to metadata.annotations, no pod restart)
  • buildPodTemplatePatch / buildDeploymentMetadataPatch extracted as testable helpers
  • Both gke_autopilot_stack.go and kube_run.go updated to route annotations correctly

Layer 2 — Graceful shutdown for unavoidable rolling updates

When a Caddyfile change genuinely requires a rollout (new service, route change), the old pod should drain connections before SIGTERM. New CaddyConfig fields:

  • preStopSleepSeconds — injects exec: sleep N preStop lifecycle hook, giving the load balancer time to remove the pod from the backend pool before the container receives SIGTERM
  • terminationGracePeriodSeconds — pod-level grace period; must be > preStopSleepSeconds

Production config (already applied in server.yaml):

preStopSleepSeconds: 10
terminationGracePeriodSeconds: 30

Files changed

File Change
pkg/clouds/k8s/types.go CaddyConfigPreStopSleepSeconds, TerminationGracePeriodSeconds
pkg/clouds/pulumi/kubernetes/deployment_patch.go DeploymentAnnotations field + dual-patch logic + testable helpers
pkg/clouds/pulumi/kubernetes/deployment.go ArgsPreStopSleepSeconds, TerminationGracePeriodSeconds; extract buildPreStopLifecycle
pkg/clouds/pulumi/kubernetes/simple_container.go SimpleContainerArgsTerminationGracePeriodSeconds
pkg/clouds/pulumi/kubernetes/caddy.go Wire new fields into DeploySimpleContainer call
pkg/clouds/pulumi/gcp/gke_autopilot_stack.go Route caddy-updated-at/byDeploymentAnnotations; keep caddy-update-hash in pod template
pkg/clouds/pulumi/kubernetes/kube_run.go Same as above
pkg/clouds/pulumi/kubernetes/deployment_patch_test.go New: patch target isolation, preStop lifecycle injection tests

Testing

go test ./pkg/clouds/pulumi/kubernetes/... -run "TestBuildPodTemplatePatch|TestBuildDeploymentMetadataPatch|TestPatchTargetsSeparation|TestBuildPreStopLifecycle"

All 8 new tests pass. All existing tests unchanged.

Cre-eD added 3 commits April 10, 2026 18:37
…Caddy

During rolling updates, Caddy pods receive SIGTERM while Cloudflare still
holds persistent connections, causing 521 errors. Add two new CaddyConfig
fields to allow operators to configure graceful drain:

- preStopSleepSeconds: injects a preStop exec sleep on all containers so
  load-balancer endpoint propagation completes before SIGTERM is sent
- terminationGracePeriodSeconds: pod-level override to ensure the grace
  period is long enough to cover the preStop sleep + Caddy shutdown

Both fields are wired through Args → SimpleContainerArgs → PodSpec and the
Lifecycle hook respectively.
… restarts

time.Now() was used at pulumi eval time, so caddy-updated-at always changed
on every pulumi up even when the Caddyfile was identical. This dirtied the
pod template on every app deployment, causing a Caddy rolling restart each
time — which triggered Cloudflare 521 errors due to persistent connections
being dropped before Cloudflare rerouted them.

History: the original value was the static string "latest" (PR #59 changed
it to time.Now() as an "improvement"). The intent was informational — not a
rollout trigger.

Fix: derive caddy-updated-at from the Caddyfile content hash (same source as
caddy-update-hash). The annotation value is now stable across pulumi ups when
the Caddyfile hasn't changed, so K8s sees no pod template diff → no rollout.
Caddy still rolls when the Caddyfile actually changes (different hash).

Confirmed root cause via GCP Cloud Logging: all three Caddy patch events on
2026-04-10 had identical hash (03709a04d391d8ac) but different timestamps,
proving time.Now() was the sole cause of every rollout.
…revent spurious pod restarts

Root cause: caddy-updated-at/caddy-updated-by were patched into spec.template.metadata.annotations
which triggers a rolling restart on every change. Combined with time.Now() being evaluated on every
pulumi up, this caused Caddy to roll on every app deploy, producing Cloudflare 521 errors while
the old pod was terminating.

Changes:
- DeploymentPatchArgs gains DeploymentAnnotations for metadata-only patches (no pod restart)
- caddy-updated-at and caddy-updated-by moved to DeploymentAnnotations in gke_autopilot_stack.go
  and kube_run.go; only caddy-update-hash (content-driven) remains in pod template annotations
- Extract buildPodTemplatePatch/buildDeploymentMetadataPatch helpers for testability
- Extract buildPreStopLifecycle helper for testability
- Add unit tests: patch target isolation, preStop lifecycle injection (nil/zero/positive)
- Fix gci import formatting in caddy.go, deployment.go, simple_container.go, deployment_patch.go
Comment thread pkg/clouds/pulumi/kubernetes/kube_run.go
@Cre-eD Cre-eD merged commit bdece19 into main Apr 10, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants