Repair-based streaming is "best effort" which leads to linearizability violation during nodetool removenode #7359

Closed
kostja opened this issue Oct 7, 2020 · 8 comments

@kostja
Contributor

kostja commented Oct 7, 2020

Recently Scylla introduced repair-based streaming - a feature which allows streaming a token range from all available replicas, thus increasing the consistency of the new node during streaming. However, repair-based streaming does not insist on using at least a QUORUM of replicas, which leads to lost data if streaming coincides with a temporary or permanent network failure.

How to repeat:

  1. Create a cluster of 5 nodes.
  2. Create a table with replication factor = 3.
  3. Stop or partition away node 1.
  4. Insert 100 records into the table using QUORUM or LWT, so that every write is acknowledged by a quorum of replicas (see the sketch after this list).
  5. Start the stopped node or resolve the network partitioning.
  6. Kill node 2 and remove it with nodetool removenode, concurrently with stopping or partitioning away node 3. This leads to node 1 being used exclusively to stream some ranges, thus replicating its incomplete data to what is now a quorum of nodes.
  7. Observe that removenode completes successfully, even though some token ranges have lost quorum. Linearizability is permanently violated: once removenode completes, the affected ranges begin accepting new QUORUM or LWT writes based on an empty history, which corrupts the data even if node 2 later rejoins the cluster.
  8. Read the data with SERIAL or QUORUM and observe the linearizability violation: some keys are missing.
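
For concreteness, a minimal sketch of steps 2, 4, and 6, assuming a keyspace named ks, a table named t, cqlsh reachable on node 2 at 127.0.0.2, and a placeholder host ID for node 2; all names and addresses are illustrative, not taken from the report above:

$ cqlsh 127.0.0.2 -e "CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
$ cqlsh 127.0.0.2 -e "CREATE TABLE ks.t (pk int PRIMARY KEY, v int);"
# Step 4: QUORUM writes while node 1 is down (an LWT INSERT ... IF NOT EXISTS would work as well)
$ for i in $(seq 1 100); do cqlsh 127.0.0.2 -e "CONSISTENCY QUORUM; INSERT INTO ks.t (pk, v) VALUES ($i, $i);"; done
# Step 6: remove node 2 by host ID while node 3 is down or partitioned away
$ nodetool removenode <host_id_of_node_2>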

Suggested fix:

  • nodetool removenode should block or retry until the network partition heals, and should not continue in the absence of a quorum.
  • the current behavior should be moved behind a switch, something like --best-effort

Please note that once we begin streaming from a quorum, we do not need the concept of a "mandatory" node,
i.e. removenode becomes less different from decommission, and #7087 could be fixed by reverting the change that broke it.

@kostja kostja added the area/lwt and area/jepsen labels Oct 7, 2020
@kostja kostja added this to the 4.3 milestone Oct 7, 2020
@kostja
Contributor Author

kostja commented Oct 7, 2020

@slivne slivne added the area/repair-based operations label Oct 11, 2020
@slivne
Contributor

slivne commented Oct 11, 2020

nodetool removenode supports flags today: force, status - we should add another option aligned with those (I am not sure we support the other flags - but let's not override them).

I am not sure we should insist forever on removenode - it's better to fail at some point and return failure, in my view.
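
For reference, these are the removenode subcommands that Cassandra's nodetool defines (as noted above, it is unclear whether Scylla honours the status/force forms); the last line is only the opt-in switch proposed in this issue, not an existing flag:

$ nodetool removenode <host_id>      # start removal of a dead node
$ nodetool removenode status         # show progress of an in-flight removal
$ nodetool removenode force          # force completion of a pending removal
$ nodetool removenode --best-effort <host_id>   # proposed opt-in switch, hypothetical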

@asias
Contributor

asias commented Oct 12, 2020

In your procedure above, both n2 and n3 are down. With RF = 3, consider a range that is owned by n1, n2, n3 before the removenode and by n1, n5, n3 after the removenode operation. n1 is the only live replica. The problem is not that we do not stream from a quorum of replicas - we actually ask to stream from all possible replicas. The problem is that there is only one live replica.

We should reject the removenode operation if we want to be safe. But currently the removenode operation is designed to be "best-effort": it favours removing the node from the cluster even if some of the nodes failed to stream the data. There is a lot of work to do if we want to make removenode 100% safe.

On the operations side, we should not perform a removenode operation when a node is only temporarily down. A much safer option would be running a replace node operation, or replace + decommission operations.

@asias
Contributor

asias commented Oct 12, 2020

I do agree we should be safe by default and allow the user to continue with a best-effort flag explicitly.

@asias
Contributor

asias commented Oct 12, 2020

I do not like the idea of blocking or retrying until the network heals, though. It is asking for trouble (we had a lot of retry logic for streaming and repair, and it is painful to debug and reason about). We should fail fast and allow the user to fix the network or the node and retry the operation.

@asias
Contributor

asias commented Oct 12, 2020

Alternatively, we might consider dropping the removenode command completely. It is a much more complicated operation than all the others, and it can be replaced by a replace + decommission operation.
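
A hedged sketch of that replace + decommission alternative, assuming the replace_address_first_boot mechanism, an illustrative dead-node address of 192.0.2.2, and the default scylla.yaml location; consult the Scylla documentation for the exact replace procedure on your version:

# On a fresh node, take over the dead node's tokens instead of running removenode
$ echo "replace_address_first_boot: 192.0.2.2" | sudo tee -a /etc/scylla/scylla.yaml
$ sudo systemctl start scylla-server
# After the replacement finishes bootstrapping, shrink the cluster cleanly if that was the goal
$ nodetool decommission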

asias added a commit to asias/scylla that referenced this issue Oct 19, 2020
Currently, the nodetool removenode operation runs in best-effort mode, which
means that if a node that owns a range after the removenode operation is
down during the operation, removenode will still succeed without
requiring that node to perform data syncing. This can cause data
consistency issues. For example,

Five nodes in the cluster, RF = 3. For a given range, n1, n2, n3 are the old
replicas and n2 is being removed; after the removenode operation, the new
replicas are n1, n5, n3. If n3 is down during the removenode operation,
only n1 will be used to sync data with the new owner n5. This will break
QUORUM read consistency if n1 happens to miss some writes.

To make removenode safe by default, we should fail the removenode
operation if the operation is not safe. If users want to continue the
unsafe removenode anyway, a best_effort_to_removenode option can be used
to switch to the old best-effort mode.

Fixes scylladb#7359
@kostja
Contributor Author

kostja commented Nov 11, 2020

Could we expedite reviewing & merging this pull request?

asias added a commit to asias/scylla that referenced this issue Nov 17, 2020
Currently removenode works as follows:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn that the node is in REMOVING_TOKEN status

- Existing nodes sync data for the ranges they own

- Existing nodes send a notification to the coordinator

- The coordinator node waits for the notifications and announces the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator whether the data sync succeeded or failed.

- The coordinator cannot abort the removenode operation in case of error.

- A failed removenode operation leaves the node to be removed in
  REMOVING_TOKEN status forever.

- The removenode operation runs in best-effort mode, which may cause data
  consistency issues.

  This means that if a node that owns a range after the removenode
  operation is down during the operation, the removenode operation
  will still succeed without requiring that node to perform data
  syncing. This can cause data consistency issues.

  For example: five nodes in the cluster, RF = 3. For a given range, n1, n2,
  n3 are the old replicas and n2 is being removed; after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation:

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example RESTful API call:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes.

For example, if one of the nodes fails to sync data, it makes no sense for
the other nodes to continue syncing because the whole operation will
fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to the other nodes.

Previously, there was no way for the coordinator to tell existing nodes
whether to run in strict mode or best-effort mode. Users would have to modify
the config file or run a RESTful API command on all the nodes to select strict
or best-effort mode. With this patch, the cluster-wide configuration is
eliminated.

Fixes scylladb#7359
asias added a commit to asias/scylla that referenced this issue Nov 18, 2020
asias added a commit to asias/scylla that referenced this issue Dec 1, 2020
asias added a commit to asias/scylla that referenced this issue Dec 7, 2020
asias added a commit to asias/scylla that referenced this issue Dec 10, 2020
avikivity pushed a commit that referenced this issue Dec 11, 2020
Fixes #7359

Closes #7626
@avikivity
Member

Fix present on all active branches, not backporting.
