Design: proactive (zero-downtime) rolling restart — pre-empt the failover before recreating the primary #231
Replies: 3 comments
-
|
Hi @melancholictheory, thanks for taking the time to write this up. We follow an architecture of singleton StatefulSets or Deployments (decided by the user), abstracted by a ValkeyNode CR. We did this for separation of concerns; we have documented the design decision here. Because of this abstraction, it isn't required for us to set Replica ValkeyNodes are rolled first before the primary is, and a proactive failover happens on the primary before it is rolled. Since the replicas would have rolled by then, the failover would always happen against a pod on the new spec. Stale pod detection is achieved by comparing the full ValkeyNode spec (image, resource requests, config hash, etc.) of desired vs current. Having said that, there is room for these improvements:
|
Beta Was this translation helpful? Give feedback.
-
|
this lines up almost exactly with what we landed on, good to see we converged independently. the ValkeyNode abstraction giving you sequential reconcile without OnDelete is cleaner than driving the STS updateStrategy directly. we ended up doing the equivalent, but it reads better expressed as a per-node CR. on your four: 1 and 2 feel like one gate to me. the replica you fail over to should be selected, not just "the next one": filter to replicas that are on the desired spec AND synced (caught up), then among those pick the highest replication offset. that makes 2 the hard precondition and 1 the tiebreaker, in a single selection step rather than two separate asserts. (offset detail we went into over on #228: a graceful CLUSTER FAILOVER already pauses writes and waits for the target to catch up, so it's safe whichever you pick, the win from highest-offset is a shorter write-pause and a smaller window if the primary dies mid-pause.) 3 is the one we spent the most time on. two layers worked for us: persistence so a restarted primary loads its data instead of coming up empty, and the failover-first ordering so a recreated primary rejoins as a replica and syncs up from the current owner rather than the other way around. the dangerous path is when an empty node is treated as the slot owner outside that ordering (bootstrap, scale, recovery). cluster-allow-replica-migration no (already merged in #222) removes one auto-demote footgun there. the genuinely hard variant is losing a whole shard and recovering from empty, which looks like what #240 is now tracking. 4 is the missing piece for out-of-band termination, we wired the same thing up. shutdown-on-sigterm failover (valkey 9.0, valkey#1091) makes a primary hand off to a replica during graceful shutdown, so node drains, evictions, or preemption that the operator's planned failover never sees still fail over cleanly. two notes: version-gate it to >= 9.0, and set terminationGracePeriodSeconds >= cluster-manual-failover-timeout so SIGKILL doesn't cut the handoff short. no double-failover risk since it only fires if the node is still primary at SIGTERM. (more in #120.) happy to help with any of these if useful. |
Beta Was this translation helpful? Give feedback.
-
|
i can put up the PR for #4 (shutdown-on-sigterm failover) if it's useful, it's a small change to the base config. one thing i'd want to settle before opening it: the failover flag is valkey 9.0+, and an older server rejects the flag at boot, so it can't go in unconditionally unless 9.0+ is the assumed floor. the docs already say "requires valkey 9.0+", so rendering it unconditionally is probably fine, but the operator doesn't have any version gating today, so i wanted to check your preference: render it unconditionally and lean on the 9.0+ floor, or add a small image-tag -> version check (which would also be reusable for #152 and the live-config work)? the other coupling is the grace period: the handoff has to finish inside the pod's terminationGracePeriodSeconds, so if that's shorter than cluster-manual-failover-timeout, SIGKILL can cut it off. happy to either set a sane default / expose it in the same PR, or keep it doc-only and leave the pod field for a follow-up, whichever you'd rather. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Cross-references #198 (Replication + Sentinel design). Posting this as a separate note because the pattern spans all topologies, not just the ones in #198.
Problem
A config or image change today goes through the StatefulSet
RollingUpdatestrategy, which restarts pods highest-ordinal-first — including the primary. Even with HA you then eat a reactive failover-detection window (~15s in our testing) on every routine change: the pod terminates, and only afterwards does something notice and promote a replica. You paid for replication and still took a write outage on a planned restart.The fix is to make the handover happen before the primary pod is touched, so a fresh replica is already serving by the time the old primary restarts → the planned-restart window shrinks to ~0.
We implemented and validated this independently (an out-of-tree operator) across Standalone/Replication/Sentinel/Cluster, opt-in and default-off, on kind/k3d with continuous writes (0 lost keys vs. the reactive path). Sharing the full design + the gotchas in case it's useful as you build out the rollout lifecycle — it lines up with the direction in #198.
Core sequence
updateStrategy: OnDeleteso the operator owns the order instead of the STS controller.controller-revision-hashlabel against the STS.status.updateRevision. This catches any pod-template change (config-hash, image, resources, a restart token), needs no separate cursor, and keeps the loop stateless / resumable across reconciles.Per-topology handover (the part that differs)
master_link_status: up, not just "a replica exists" —REPLICAOF NO ONE, repoint the old primary at it, then delete. (This is the flow from our earlier comment on [Design] Valkey Replication and Sentinel Support #198.)REPLICAOF. IssueSENTINEL FAILOVER mymasterand let the quorum elect — an operator-driven promotion races Sentinel's own election and you get a double-failover (operator promotes one replica, Sentinel elects another). The Sentinels demote the old master to a replica; it then rolls as an ordinary stale replica on the next pass. Roll the Sentinel monitor pods one at a time, holding quorum (≥3). A repeatedSENTINEL FAILOVERwhile one is already in flight returns-INPROG, which is harmless and can be swallowed — so the stateless loop is safe to re-run.CLUSTER FAILOVERon a fresh replica of the shard, wait for the role change to show up inCLUSTER NODES, then delete the old master pod. Go shard by shard, one at a time, respecting the PDB. On Valkey 9.1+ this pairs naturally with atomic slot migration for any concurrent resharding.Safety gates / gotchas worth stealing
DBSIZEand promote the data-holding replica over an empty self-master. This was a real data-loss bug for us under chaos before we added the guard.pods: deleteRBAC;RetryOnConflictaround PDB updates; andterminationGracePeriodSeconds≥ the manual-failover timeout so a SIGKILL doesn't interrupt a handover.Honest scope
This is planned rolls only, and the handover is graceful → ~0 loss. Unplanned primary loss (node drain / eviction / preemption /
kubectl delete pod) is a separate axis the reactive loop still owns, and since replication is async it can drop acked-but-unreplicated writes. As @jdheyburn noted on #123,shutdown-on-sigterm failover(Valkey 9.0, Cluster) is the right node-local safety net there — it does a graceful manual failover on SIGTERM before the process exits, and only fires while the node is still primary so it won't double-fire with an operator promotion. The two are complementary: operator drives planned rolls deterministically,shutdown-on-sigtermcovers out-of-band termination.How it maps onto the current direction
Nothing here forces the CRD shape being debated in #198 — it sits in the per-resource lifecycle.
ValkeyReplication/ValkeySentinelwould each drive their own STS rollout with the step above; whenValkeyNodelands, the same "stale → roll replicas → hand off primary → roll primary" step applies per node. Opt-in / default-off until soaked seems right given it's a data-correctness path.Happy to write this up as a proper design doc or open a draft PR if there's appetite — or fold it into #198 if you'd rather keep it there. What's the preferred venue?
Beta Was this translation helpful? Give feedback.
All reactions