[NEW] Trigger manual failover on SIGTERM to primary (cluster) #939

Open
zuiderkwast opened this issue Aug 24, 2024 · 7 comments · May be fixed by #1091
@zuiderkwast
Contributor

The problem/use-case that the feature addresses

When a primary disappears, its slots are not served until an automatic failover happens. That takes about 3 seconds (node timeout plus some seconds), which is too long for us to go without accepting writes.

If the host machine is about to shut down for any reason, the processes typically get a SIGTERM and have some time to shut down gracefully. In Kubernetes, this is 30 seconds by default.

Description of the feature

When a primary receives a SIGTERM, let it trigger a failover to one of the replicas as part of the graceful shutdown.

Alternatives you've considered

Our current solution is a wrapper process, a small script, that starts valkey. This process receives the SIGTERM and handles it. That's a workaround, though; it would be better to have this built in.
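
A minimal sketch of what such a wrapper could look like, in Python. The issue does not show the actual script, so everything here is an assumption: that "handling" the SIGTERM means asking a known replica to run CLUSTER FAILOVER before forwarding the signal, that the replica's address is known to the wrapper, and the function names `failover_command` and `run_wrapper` are hypothetical.

```python
import signal
import subprocess


def failover_command(replica_host: str, replica_port: int) -> list[str]:
    """Build the valkey-cli invocation that asks a replica to take over."""
    return ["valkey-cli", "-h", replica_host, "-p", str(replica_port),
            "CLUSTER", "FAILOVER"]


def run_wrapper(server_argv: list[str], replica_host: str, replica_port: int) -> int:
    """Start the server; on SIGTERM, request a failover, then forward the signal."""
    server = subprocess.Popen(server_argv)

    def on_sigterm(signum, frame):
        try:
            # CLUSTER FAILOVER returns before the failover has completed.
            # A real wrapper would poll until the replica has become primary.
            subprocess.run(failover_command(replica_host, replica_port),
                           timeout=10, check=False)
        except (subprocess.TimeoutExpired, OSError):
            pass  # replica unreachable or CLI missing: shut down anyway
        server.terminate()  # forward SIGTERM to the server process

    signal.signal(signal.SIGTERM, on_sigterm)
    return server.wait()
```

Something like `run_wrapper(["valkey-server", "/etc/valkey/valkey.conf"], "10.0.0.2", 6379)` would start the server and block until it exits; the config path and address are placeholders.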

Additional information

We use cluster.

@enjoy-binbin
Member

This is a good idea. In my fork, I have a similar feature (it detects a disk error and does a failover). The basic idea is that the primary picks the best replica, sends a CLUSTER FAILOVER to it, and waits for it. Do you like this approach, or do you need me to try it?

@zuiderkwast
Contributor Author

the primary will pick a best replica and send a CLUSTER FAILOVER to it (and wait for it). Do you like this approach

Yes, this is straightforward. I like it.

I have another idea, similar but maybe faster(?) and more complex(?): the primary first pauses writes, then waits for the replica to replicate everything, and then sends CLUSTER FAILOVER FORCE. This avoids step 1 below. This is from the docs of CLUSTER FAILOVER:

  1. The replica tells the master to stop processing queries from clients.

  2. The master replies to the replica with the current replication offset.

  3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues.

  4. The replica starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcasts the new configuration.

  5. The old master receives the configuration update: unblocks its clients and starts replying with redirection messages so that they’ll continue the chat with the new master.

And for FORCE:

If the FORCE option is given, the replica does not perform any handshake with the master, that may be not reachable, but instead just starts a failover ASAP starting from point 4. This is useful when we want to start a manual failover while the master is no longer reachable.

@enjoy-binbin
Member

The primary first pauses writes, then waits for the replica to replicate everything and then sends CLUSTER FAILOVER FORCE.

Yeah, this seems ok to me, and faster.

The difference, I think, is that in one the replica decides the offset is ok and starts the failover, and in the other the primary tells the replica that it can start the failover.

The CLUSTER FAILOVER one:

  1. Primary detects a SIGTERM and then picks the best replica to send CLUSTER FAILOVER to, in serverCron (runs every 100 ms).
  2. Replica receives the CLUSTER FAILOVER, tells the primary to stop writes (the primary responds with its offset), and waits for its own offset to match (in clusterCron, every 100 ms).
  3. Replica starts the failover.

primary serverCron, primary sends CLUSTER FAILOVER, replica sends an MFSTART, primary sends a PING, replica clusterCron and starts the failover:
100 ms + a command + an MFSTART + a PING + 100 ms

The CLUSTER FAILOVER FORCE one:

  1. Primary detects a SIGTERM, stops writes, then sends REPLCONF GETACK to all replicas and waits for the responses.
  2. Primary receives the REPLCONF ACK and compares replica->repl_ack_off with primary_repl_offset; if they match, it sends CLUSTER FAILOVER FORCE to that replica.
  3. Replica starts the failover.
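
The primary-side decision in these three steps can be modeled roughly as below. This is a toy simulation in Python, not the server's C code: `Replica`, `pick_caught_up_replica`, and `shutdown_failover_step` are hypothetical names, and only the field names `repl_ack_off` and `primary_repl_offset` come from the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    repl_ack_off: int  # last offset this replica reported via REPLCONF ACK


def pick_caught_up_replica(primary_repl_offset: int,
                           replicas: list[Replica]) -> Optional[Replica]:
    """Return the first replica whose acked offset equals the primary's, else None."""
    for replica in replicas:
        if replica.repl_ack_off == primary_repl_offset:
            return replica
    return None


def shutdown_failover_step(primary_repl_offset: int,
                           replicas: list[Replica],
                           writes_paused: bool) -> str:
    """Describe what the primary would do on one cron tick after SIGTERM."""
    if not writes_paused:
        # Step 1: freeze the offset first, then ask replicas to report theirs.
        return "pause writes, send REPLCONF GETACK"
    target = pick_caught_up_replica(primary_repl_offset, replicas)
    if target is None:
        # Step 2, not yet satisfied: no replica has acked the final offset.
        return "wait for REPLCONF ACK"
    # Step 2 satisfied: hand over to the caught-up replica.
    return f"send CLUSTER FAILOVER FORCE to {target.name}"
```

Pausing writes before comparing offsets matters in this model: once writes stop, `primary_repl_offset` no longer advances, so an exact match means the replica holds everything the primary accepted.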

primary serverCron, primary sends REPLCONF GETACK, replica sends REPLCONF ACK, primary sends CLUSTER FAILOVER FORCE, replica clusterCron and starts the failover:
100 ms + a command + a command + a command + 100 ms
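
As a rough check of the two sums, here is the same arithmetic in Python. The 1 ms one-way message latency is an arbitrary assumption, `worst_case_ms` is a hypothetical helper, and each cron wait is taken at its full 100 ms period.

```python
CRON_PERIOD_MS = 100.0  # serverCron / clusterCron period quoted in the thread


def worst_case_ms(messages: int, one_way_ms: float) -> float:
    """One cron wait on the primary, N one-way messages, one cron wait on the replica."""
    return CRON_PERIOD_MS + messages * one_way_ms + CRON_PERIOD_MS


# CLUSTER FAILOVER, MFSTART, PING
plain_ms = worst_case_ms(messages=3, one_way_ms=1.0)
# REPLCONF GETACK, REPLCONF ACK, CLUSTER FAILOVER FORCE
force_ms = worst_case_ms(messages=3, one_way_ms=1.0)

# Both come out at 203.0 ms under these assumptions.
```

In this simplified model both paths count three messages between two cron waits, so any real difference would have to come from details the sums do not capture, such as how long the replica waits for its offset to match.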

@zuiderkwast
Contributor Author

In shutdown, we already have a feature to stop writes and wait for replicas before shutting down. These two features will need to be combined. Let's discuss the details in a PR. :) Do you want to implement this?

@enjoy-binbin
Member

Yeah, I can do it this week.

@enjoy-binbin
Member

enjoy-binbin commented Sep 3, 2024

I don't have enough time to write the tests right now. I guess you might want to take a look at it in advance, so here is the commit: enjoy-binbin@9777d01

I did a little manual testing locally and it seems to work. I will try to find time to finish the test code and the rest of it.

@zuiderkwast
Contributor Author

Don't worry. I hope we can have it for 8.2 so you have a few months to finish it. 😸

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this issue Sep 30, 2024
When a primary disappears, its slots are not served until an automatic
failover happens. It takes about n seconds (node timeout plus some seconds).
It's too much time for us to not accept writes.

If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of the replicas as part of the graceful shutdown. This can reduce
some unavailability time. For example, the replica normally has to detect
the primary failure, which takes up to the node timeout, before initiating
an election; now it can initiate an election right away, win it, and
gossip the result.

This closes valkey-io#939.

Signed-off-by: Binbin <binloveplay1314@qq.com>