Introduce direct failure detector #8488

asias · 2021-04-15T02:55:45Z

Currently, gossip decides whether a node is up or down based on the
heartbeat updates carried in gossip messages. This means that if a node
is actually down but its gossip messages are delayed in the network,
marking the node down can be delayed as well.

For example, a node sends 20 gossip messages in 20 seconds before it
dies. Each message is delayed 15 seconds by the network for some
reason. Another node receives those delayed messages one after another.
The delayed messages prevent the dead node from being marked as down,
because each heartbeat update arrives just before the threshold for
marking a node down, around 20 seconds by default, is reached.

As a result, the dead node will not be marked as down for 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds it normally takes to
detect that a node is down.
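
The arithmetic in this example can be checked with a minimal sketch. It
assumes a simplified fixed-threshold detector rather than the real
phi_convict_threshold-based logic, and it assumes the delayed messages
arrive 15 seconds apart. The function name mark_down_time is
hypothetical and does not come from the patch.

```python
def mark_down_time(arrival_times, threshold_s):
    """Time at which a peer is marked down by a fixed-threshold detector.

    arrival_times: times (seconds) at which heartbeat updates are received.
    threshold_s: maximum allowed gap without a heartbeat update.
    """
    last_seen = 0.0
    for t in sorted(arrival_times):
        if t - last_seen > threshold_s:
            break  # the gap exceeded the threshold before this update arrived
        last_seen = t
    return last_seen + threshold_s


died_at = 20.0  # one message per second for 20 seconds, then the node dies

# No network delay: updates arrive at t = 1, 2, ..., 20.
on_time = [float(i) for i in range(1, 21)]
# Assumed delivery pattern: the 20 updates trickle in 15 seconds apart.
delayed = [15.0 * i for i in range(1, 21)]

# Seconds between the node dying and being marked down.
print(mark_down_time(on_time, 20.0) - died_at)
print(mark_down_time(delayed, 20.0) - died_at)
```

Under these assumptions the two calls print 20.0 and 300.0, matching the
~20 second and 300 second figures above.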

A direct failure detector can solve this problem and simplify the code a lot.

Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly. For example:

Node A can talk to Node B
Node B can talk to Node C
Node A cannot talk to Node C, due to network issues

Node A will not mark Node C as down because Node A can get the heartbeat
of Node C from Node B indirectly.

This indirect detection is not very useful: Node A thinks it can
communicate with Node C, but when it decides to send requests to
Node C, those requests will fail.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to probe peers directly. Gossip echo
messages are sent to peer nodes periodically. A peer node is marked as
down if a timeout threshold has been met.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.
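
The idea can be sketched as follows. This is an illustration only, not
the code from the patch: the class DirectFailureDetector, its methods,
and the injected clock are all hypothetical names. Each peer is judged
solely by the time since the last echo that this node itself completed
successfully with it.

```python
class DirectFailureDetector:
    """Toy direct detector: tracks the last successful echo per peer."""

    def __init__(self, timeout_ms, clock):
        self.timeout_ms = timeout_ms   # threshold for marking a peer down
        self.clock = clock             # callable returning the current time in ms
        self.last_ok_ms = {}           # peer -> time of last successful echo

    def add_peer(self, peer):
        # Start the timer when the peer is first tracked.
        self.last_ok_ms[peer] = self.clock()

    def on_echo_success(self, peer):
        # Only a reply received directly from the peer refreshes its timer.
        self.last_ok_ms[peer] = self.clock()

    def is_alive(self, peer):
        return self.clock() - self.last_ok_ms[peer] <= self.timeout_ms


now_ms = [0]
fd = DirectFailureDetector(timeout_ms=20000, clock=lambda: now_ms[0])
fd.add_peer("C")

now_ms[0] = 2000
fd.on_echo_success("C")      # last direct contact with C at t = 2 s

now_ms[0] = 22001            # more than 20 s without a successful echo
print(fd.is_alive("C"))      # False
```

Heartbeats relayed by a third node never touch last_ok_ms in this
sketch, so a delayed or indirect update cannot keep a peer marked as up.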

Parallel detection

The old failure detector uses shard zero only. The new failure detector
uses all the shards to perform failure detection, with each shard
handling a subset of the live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
In a 16-node cluster where each node has 16 shards, each shard will
handle only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo message traffic produced compared to the old failure detector is
negligible.
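
One way to picture the split is a round-robin assignment of live nodes
to shards. This is only a sketch: the issue does not say how the patch
maps nodes to shards, so the modulo scheme and the function name
assign_nodes_to_shards are assumptions made for illustration.

```python
def assign_nodes_to_shards(nodes, shard_count):
    """Spread nodes over shards round-robin (assumed scheme, not the patch's)."""
    assignment = {shard: [] for shard in range(shard_count)}
    for index, node in enumerate(sorted(nodes)):
        assignment[index % shard_count].append(node)
    return assignment


# 32 nodes over 16 shards: every shard watches 2 nodes.
nodes_32 = ["node-%02d" % i for i in range(32)]
per_shard = assign_nodes_to_shards(nodes_32, 16)
print(sorted({len(watched) for watched in per_shard.values()}))

# 16 nodes over 16 shards: every shard watches a single node.
nodes_16 = ["node-%02d" % i for i in range(16)]
per_shard = assign_nodes_to_shards(nodes_16, 16)
print(sorted({len(watched) for watched in per_shard.values()}))
```

Any scheme that spreads nodes evenly gives the same per-shard counts as
the example in the text; round-robin is just the simplest to show.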

Deterministic detection

Users can configure failure_detector_timeout_in_ms to set the
threshold for marking a node down. It is the maximum time between two
successful echo messages before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

Compatibility

This patch only uses the existing gossip echo message, so nodes with
and without this patch can work together.


bhalevy · 2021-06-28T09:53:17Z

@scylladb/scylla-maint although this was opened as an Enhancement, @asias mentioned it is required to fix #7570.
So I'm adding the Backport candidate label.

bhalevy · 2021-06-28T09:58:53Z

@asias, you also mentioned that 9ea57df should be backported (is it a prerequisite to 425e3b1?)

I see that we also merged 0665d9c (fixed #8712) on top of that.

Anything else that is needed to be backported for #7570?

asias · 2021-06-28T11:37:08Z

Hopefully 9ea57df is enough for #7570?

9ea57df is not a prerequisite to 425e3b1.

avikivity · 2021-11-15T11:26:42Z

Major feature, so not backporting.

slivne added feature/enhancement area/gossip labels Apr 24, 2021

slivne assigned asias Apr 24, 2021

slivne added this to the 4.x milestone Apr 24, 2021

scylladb-promoter closed this as completed in 425e3b1 May 24, 2021

bhalevy added the Backport candidate label Jun 28, 2021

avikivity removed the Backport candidate label Nov 15, 2021

tzach modified the milestones: 4.x, 4.6 Nov 17, 2021

slivne mentioned this issue Apr 26, 2022

4.6.1 not able to replace node #10337

Closed
