Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce direct failure detector #8488

Closed
asias opened this issue Apr 15, 2021 · 4 comments
Closed

Introduce direct failure detector #8488

asias opened this issue Apr 15, 2021 · 4 comments

Comments

@asias
Copy link
Contributor

asias commented Apr 15, 2021

Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

A direct failure detector can solve this problem and simplify the code a lot.

  • Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly. For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

It changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

  • Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

  • Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

  • Compatible

It only uses the existing gossip echo message. Nodes with or without
this patch can work together.

asias added a commit to asias/scylla that referenced this issue Apr 15, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
asias added a commit to asias/scylla that referenced this issue Apr 20, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
asias added a commit to asias/scylla that referenced this issue Apr 20, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
asias added a commit to asias/scylla that referenced this issue Apr 21, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
@slivne slivne added this to the 4.x milestone Apr 24, 2021
asias added a commit to asias/scylla that referenced this issue May 12, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
asias added a commit to asias/scylla that referenced this issue May 21, 2021
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes scylladb#8488
@bhalevy
Copy link
Member

bhalevy commented Jun 28, 2021

@scylladb/scylla-maint although this was opened as an Enhancement, @asias mentioned it is required to fix #7570.
So I'm adding the Backport candidate label.

@bhalevy
Copy link
Member

bhalevy commented Jun 28, 2021

@asias, you also mentioned that 9ea57df should be backported (is it a prerequisite to 425e3b1?)

I see that we also merged 0665d9c (fixed #8712) on top of that.

Anything else that is needed to be backported for #7570?

@asias
Copy link
Contributor Author

asias commented Jun 28, 2021

Hopefully 9ea57df is enough for #7570?

9ea57df is not a prerequisite to 425e3b.

@avikivity
Copy link
Member

Major feature, so not backporting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants