Design: proactive (zero-downtime) rolling restart — pre-empt the failover before recreating the primary #231

melancholictheory · 2026-06-07T20:23:46Z

melancholictheory
Jun 7, 2026

Cross-references #198 (Replication + Sentinel design). Posting this as a separate note because the pattern spans all topologies, not just the ones in #198.

Problem

A config or image change today goes through the StatefulSet RollingUpdate strategy, which restarts pods highest-ordinal-first — including the primary. Even with HA you then eat a reactive failover-detection window (~15s in our testing) on every routine change: the pod terminates, and only afterwards does something notice and promote a replica. You paid for replication and still took a write outage on a planned restart.

The fix is to make the handover happen before the primary pod is touched, so a fresh replica is already serving by the time the old primary restarts → the planned-restart window shrinks to ~0.

We implemented and validated this independently (an out-of-tree operator) across Standalone/Replication/Sentinel/Cluster, opt-in and default-off, on kind/k3d with continuous writes (0 lost keys vs. the reactive path). Sharing the full design + the gotchas in case it's useful as you build out the rollout lifecycle — it lines up with the direction in #198.

Core sequence

Switch the managed STS to updateStrategy: OnDelete so the operator owns the order instead of the STS controller.
Detect "stale" pods exactly the way the STS controller would: compare each pod's controller-revision-hash label against the STS .status.updateRevision. This catches any pod-template change (config-hash, image, resources, a restart token), needs no separate cursor, and keeps the loop stateless / resumable across reconciles.
Roll all replicas first (restarting a replica needs no failover), one at a time, waiting for Ready between each.
When only the primary is still stale, hand it off (per-topology, below) to an already-fresh, Ready replica, then delete the old primary pod.

Per-topology handover (the part that differs)

Replication (operator-arbitrated): pick an up-to-date replica — max replication offset, master_link_status: up, not just "a replica exists" — REPLICAOF NO ONE, repoint the old primary at it, then delete. (This is the flow from our earlier comment on [Design] Valkey Replication and Sentinel Support #198.)
Sentinel: do not operator-REPLICAOF. Issue SENTINEL FAILOVER mymaster and let the quorum elect — an operator-driven promotion races Sentinel's own election and you get a double-failover (operator promotes one replica, Sentinel elects another). The Sentinels demote the old master to a replica; it then rolls as an ordinary stale replica on the next pass. Roll the Sentinel monitor pods one at a time, holding quorum (≥3). A repeated SENTINEL FAILOVER while one is already in flight returns -INPROG, which is harmless and can be swallowed — so the stateless loop is safe to re-run.
Cluster: CLUSTER FAILOVER on a fresh replica of the shard, wait for the role change to show up in CLUSTER NODES, then delete the old master pod. Go shard by shard, one at a time, respecting the PDB. On Valkey 9.1+ this pairs naturally with atomic slot migration for any concurrent resharding.

Safety gates / gotchas worth stealing

Pause the reactive failover loop while a planned rollout is in progress. Otherwise the rollout driver and the reactive promoter fight over the primary.
Only hand off to a replica that is already on the new revision and Ready. Promote a lagging/just-restarted replica and you've defeated the point.
Empty-master guard (a reactive-safety invariant the rollout depends on): a primary pod that restarts empty and masters itself must not be adopted — otherwise data-carrying replicas resync down to the empty node (silent total loss). Rank candidate masters by DBSIZE and promote the data-holding replica over an empty self-master. This was a real data-loss bug for us under chaos before we added the guard.
Plumbing: the driver needs pods: delete RBAC; RetryOnConflict around PDB updates; and terminationGracePeriodSeconds ≥ the manual-failover timeout so a SIGKILL doesn't interrupt a handover.

Honest scope

This is planned rolls only, and the handover is graceful → ~0 loss. Unplanned primary loss (node drain / eviction / preemption / kubectl delete pod) is a separate axis the reactive loop still owns, and since replication is async it can drop acked-but-unreplicated writes. As @jdheyburn noted on #123, shutdown-on-sigterm failover (Valkey 9.0, Cluster) is the right node-local safety net there — it does a graceful manual failover on SIGTERM before the process exits, and only fires while the node is still primary so it won't double-fire with an operator promotion. The two are complementary: operator drives planned rolls deterministically, shutdown-on-sigterm covers out-of-band termination.

How it maps onto the current direction

Nothing here forces the CRD shape being debated in #198 — it sits in the per-resource lifecycle. ValkeyReplication / ValkeySentinel would each drive their own STS rollout with the step above; when ValkeyNode lands, the same "stale → roll replicas → hand off primary → roll primary" step applies per node. Opt-in / default-off until soaked seems right given it's a data-correctness path.

Happy to write this up as a proper design doc or open a draft PR if there's appetite — or fold it into #198 if you'd rather keep it there. What's the preferred venue?

jdheyburn · 2026-06-10T08:19:56Z

jdheyburn
Jun 10, 2026
Maintainer

Hi @melancholictheory, thanks for taking the time to write this up.

We follow an architecture of singleton StatefulSets or Deployments (decided by the user), abstracted by a ValkeyNode CR. We did this for separation of concerns; we have documented the design decision here. Because of this abstraction, it isn't required for us to set updateStrategy: OnDelete on the StatefulSets, since the ValkeyCluster controller will reconcile the ValkeyNodes sequentially.

Replica ValkeyNodes are rolled first before the primary is, and a proactive failover happens on the primary before it is rolled. Since the replicas would have rolled by then, the failover would always happen against a pod on the new spec. Stale pod detection is achieved by comparing the full ValkeyNode spec (image, resource requests, config hash, etc.) of desired vs current.

Having said that, there is room for these improvements:

Select the replica with the highest replication offset to failover
Ensure we failover to a replica on the latest spec
- in practice our sequential rolling already guarantees this implicitly, but an explicit assertion would make it a hard safety gate rather than a relied-upon ordering invariant
Ensure primary for a shard doesn't come up empty and cause a replica to resync down to it
shutdown-on-sigterm failover config setting

0 replies

melancholictheory · 2026-06-10T08:47:41Z

melancholictheory
Jun 10, 2026
Author

this lines up almost exactly with what we landed on, good to see we converged independently. the ValkeyNode abstraction giving you sequential reconcile without OnDelete is cleaner than driving the STS updateStrategy directly. we ended up doing the equivalent, but it reads better expressed as a per-node CR.

on your four:

1 and 2 feel like one gate to me. the replica you fail over to should be selected, not just "the next one": filter to replicas that are on the desired spec AND synced (caught up), then among those pick the highest replication offset. that makes 2 the hard precondition and 1 the tiebreaker, in a single selection step rather than two separate asserts. (offset detail we went into over on #228: a graceful CLUSTER FAILOVER already pauses writes and waits for the target to catch up, so it's safe whichever you pick, the win from highest-offset is a shorter write-pause and a smaller window if the primary dies mid-pause.)

3 is the one we spent the most time on. two layers worked for us: persistence so a restarted primary loads its data instead of coming up empty, and the failover-first ordering so a recreated primary rejoins as a replica and syncs up from the current owner rather than the other way around. the dangerous path is when an empty node is treated as the slot owner outside that ordering (bootstrap, scale, recovery). cluster-allow-replica-migration no (already merged in #222) removes one auto-demote footgun there. the genuinely hard variant is losing a whole shard and recovering from empty, which looks like what #240 is now tracking.

4 is the missing piece for out-of-band termination, we wired the same thing up. shutdown-on-sigterm failover (valkey 9.0, valkey#1091) makes a primary hand off to a replica during graceful shutdown, so node drains, evictions, or preemption that the operator's planned failover never sees still fail over cleanly. two notes: version-gate it to >= 9.0, and set terminationGracePeriodSeconds >= cluster-manual-failover-timeout so SIGKILL doesn't cut the handoff short. no double-failover risk since it only fires if the node is still primary at SIGTERM. (more in #120.)

happy to help with any of these if useful.

0 replies

melancholictheory · 2026-06-11T14:12:28Z

melancholictheory
Jun 11, 2026
Author

i can put up the PR for #4 (shutdown-on-sigterm failover) if it's useful, it's a small change to the base config. one thing i'd want to settle before opening it: the failover flag is valkey 9.0+, and an older server rejects the flag at boot, so it can't go in unconditionally unless 9.0+ is the assumed floor. the docs already say "requires valkey 9.0+", so rendering it unconditionally is probably fine, but the operator doesn't have any version gating today, so i wanted to check your preference: render it unconditionally and lean on the 9.0+ floor, or add a small image-tag -> version check (which would also be reusable for #152 and the live-config work)?

the other coupling is the grace period: the handoff has to finish inside the pod's terminationGracePeriodSeconds, so if that's shorter than cluster-manual-failover-timeout, SIGKILL can cut it off. happy to either set a sane default / expose it in the same PR, or keep it doc-only and leave the pod field for a follow-up, whichever you'd rather.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design: proactive (zero-downtime) rolling restart — pre-empt the failover before recreating the primary #231

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Design: proactive (zero-downtime) rolling restart — pre-empt the failover before recreating the primary #231

Uh oh!

melancholictheory Jun 7, 2026

Problem

Core sequence

Per-topology handover (the part that differs)

Safety gates / gotchas worth stealing

Honest scope

How it maps onto the current direction

Replies: 3 comments

Uh oh!

jdheyburn Jun 10, 2026 Maintainer

Uh oh!

melancholictheory Jun 10, 2026 Author

Uh oh!

melancholictheory Jun 11, 2026 Author

melancholictheory
Jun 7, 2026

jdheyburn
Jun 10, 2026
Maintainer

melancholictheory
Jun 10, 2026
Author

melancholictheory
Jun 11, 2026
Author