removenode: don't stream data from the leaving node #11704

gusev-p · 2022-10-04T10:03:29Z

If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch leave_node is explicitly ignored when choosing a source for the token_range.

A warning has also been added when a removenode exception is caught. The removenode_abort logic that follows it may also throw, in which case information about the original exception would otherwise be lost.

avikivity · 2022-10-04T11:42:12Z

Does this fix a bug? If so it needs a Fixes #nnnnn.

gusev-p · 2022-10-04T12:32:56Z

Does this fix a bug? If so it needs a Fixes #nnnnn.

No, I ran into this while writing a removenode+raft test using Kostya's new pytest facilities.

avikivity · 2022-10-04T12:34:43Z

Does this fix a bug? If so it needs a Fixes #nnnnn.

No, I ran into this while writing a removenode+raft test using Kostya's new pytest facilities.

So, in non-raft it behaves correctly and only breaks under raft?

gusev-p · 2022-10-04T12:45:14Z

So, in non-raft it behaves correctly and only breaks under raft?

No, it doesn't seem to be related to raft, we just decided to write an explicit test trying to reproduce this issue.

avikivity · 2022-10-04T12:46:50Z

I'm confused. So this is not related to raft, and not a bug? Perhaps it's a bug but not a regression?

gusev-p · 2022-10-04T12:51:20Z

I'm confused. So this is not related to raft, and not a bug? Perhaps it's a bug but not a regression?

It's not a regression. I'm not sure if this is a bug or not, just a strange behaviour I found while writing the test.

avikivity · 2022-10-04T12:52:22Z

Ok. I'll let @asias decide.

scylladb-promoter · 2022-10-04T13:51:21Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2533/

denesb · 2022-10-05T04:30:02Z

I don't think we can do that when the leaving node is the solitary owner of some of the data (think RF=1). This is not recommended but users still do it.

gusev-p · 2022-10-05T06:17:13Z

I don't think we can do that when the leaving node is the solitary owner of some of the data (think RF=1). This is not recommended but users still do it.

Ok, we could give preference to other nodes, and fallback to the leaving_node if there are no other options.
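
A minimal sketch of the fallback idea proposed here, written as a small Python model rather than Scylla's actual C++ code. The function name `pick_source`, the node names, and the `is_alive` callback are all hypothetical; they only illustrate "prefer other replicas, fall back to the leaving node".

```python
from typing import Callable, Optional, Sequence


def pick_source(
    replicas: Sequence[str],
    leaving_node: str,
    is_alive: Callable[[str], bool],
) -> Optional[str]:
    """Pick a streaming source for one token range (illustrative model only).

    Prefers any live replica other than the leaving node and falls back
    to the leaving node only when no other live replica exists.
    """
    live = [r for r in replicas if is_alive(r)]
    others = [r for r in live if r != leaving_node]
    if others:
        return others[0]
    # Fallback covers the RF=1 case, where the leaving node is the sole owner.
    return leaving_node if leaving_node in live else None
```

With RF=1 the replica list contains only the leaving node, so the fallback branch is the only way to obtain a source at all, which is the case @denesb raised.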

kbr- · 2022-10-05T14:57:15Z

In this patch leave_node is explicitly ignored when choosing a source for the token_range.

If I remember correctly...
removenode can also work with nodes that are UP, and in this case it's supposed to give better safety guarantees (no data loss) - like decommission.
If the node is DOWN then we give up some guarantees (we may lose data that was only replicated to that node).

OTOH in the docs we say that removenode should only be used for DOWN nodes - it kind of makes sense, after all if the node is UP we can run decommission on it.
So perhaps the correct approach here would be to reject removenode attempt on a node that is UP.

In the test we can wait until the desired removenode coordinator notices that the node we want to remove is down, only then start removenode.
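
The waiting step suggested for the test could look roughly like the polling helper below. This is a generic sketch, not part of the pytest facilities mentioned in this thread: `wait_until_seen_down` and the `is_alive` callback (standing in for "does the removenode coordinator still see this node as UP?") are assumed names.

```python
import time
from typing import Callable


def wait_until_seen_down(
    is_alive: Callable[[str], bool],
    node: str,
    timeout: float = 30.0,
    interval: float = 0.1,
) -> None:
    """Block until `is_alive(node)` returns False, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while is_alive(node):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{node} still seen as UP after {timeout}s")
        time.sleep(interval)
```

A test would call this against the coordinator's view of the stopped node and only then issue removenode.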

gusev-p · 2022-10-05T19:02:46Z

If I remember correctly...
removenode can also work with nodes that are UP, and in this case it's supposed to give better safety guarantees (no data loss) - like decommission.

I don't see in code that the leaving_node is preferred over the others, we just iterate over all replicas for token_range in some order and choose the first live one.

So perhaps the correct approach here would be to reject removenode attempt on a node that is UP.

It might break some undocumented workflows that users do in practice, as @denesb noted above. On the other hand, I see no reason to choose leaving_node if there are other sources for token_range.

In the test we can wait until the desired removenode coordinator notices that the node we want to remove is down, only then start removenode.

Yes, we can, but this is a redundant step that does not follow from the documentation. And this actually feels like a bug - we follow the docs, but we still get an error from removenode.
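
The failure mode described in this thread can be modelled in a few lines. This is an assumption-laden toy, not Scylla code: `first_live`, the node names, and the two sets standing in for "real state" and "the coordinator's gossip view" are hypothetical.

```python
def first_live(replicas, is_alive):
    """Return the first replica the caller believes to be alive, or None."""
    return next((r for r in replicas if is_alive(r)), None)


# n1 was just stopped, but the coordinator's view has not caught up yet.
actually_up = {"n2", "n3"}
gossip_view = {"n1", "n2", "n3"}  # stale: still lists n1 as UP

source = first_live(["n1", "n2", "n3"], gossip_view.__contains__)
stream_ok = source in actually_up  # streaming from a stopped node cannot succeed
```

Because the stale view still reports the stopped node as alive, it is picked as the source and the stream fails, even though two healthy replicas were available.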

kbr- · 2022-10-06T14:59:44Z

On the other hand, I see no reason to choose leaving_node if there are other sources for token_range.

There is a reason, if we want to make removenode a safe operation in the case that the removed node is UP. That's because the removed node might be the only replica that knows about some write. For example, suppose someone performed a CL=ONE write and got an ACK, meaning that the write was successful. The write could be made to this replica and not any other replica. All subsequent CL=ALL reads must see any previous confirmed CL=ONE writes, if there were no unsafe operations (like removing a dead node) in the meantime. For example, decommission operation is safe and guarantees that - we stream data out of the decommissioned node, so if there was any write that got only replicated to this node, we will keep it after decommission on some other replica.

So it all depends on the guarantees that we want to provide with removenode.
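
The CL=ONE argument above can be illustrated with a toy model. It is only a sketch: replicas are plain Python sets, and `replace_replica`, `read_all`, and the node names are hypothetical, not Scylla APIs.

```python
def replace_replica(data, removed, source, new_owner):
    """Return replica contents after `removed` is replaced by `new_owner`,
    where `new_owner` receives a copy of the data held by `source`."""
    result = {node: set(keys) for node, keys in data.items() if node != removed}
    result[new_owner] = set(data[source])
    return result


def read_all(data):
    """CL=ALL read: merge what every current replica holds."""
    return set().union(*data.values())


# RF=3; write "w" was acknowledged at CL=ONE by n1 only.
before = {"n1": {"w"}, "n2": set(), "n3": set()}

# Decommission-like: stream from the leaving node itself.
safe = replace_replica(before, removed="n1", source="n1", new_owner="n4")
# Removenode that streams from another replica instead.
unsafe = replace_replica(before, removed="n1", source="n2", new_owner="n4")

print(read_all(safe))    # the acknowledged write survives on n4
print(read_all(unsafe))  # the acknowledged write is no longer visible
```

In the second case a later CL=ALL read no longer sees a write that was previously confirmed, which is the guarantee being discussed.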

However, if it is as you say:

I don't see in code that the leaving_node is preferred over the others, we just iterate over all replicas for token_range in some order and choose the first live one.

then it means that removenode is inherently unsafe - even if the node is UP, the streaming algorithm does not guarantee to do the safe thing here which is streaming from the removed node. It may or may not stream from it.

So, if we assume that removenode is unsafe and we don't care about making it safe, the solution you proposed makes perfect sense:

Ok, we could give preference to other nodes, and fallback to the leaving_node if there are no other options.

@asias could you please share your opinion about this?

tgrabiec · 2022-10-06T15:16:45Z

I think removenode is meant to be unsafe in the sense that whatever was owned by the node is considered lost. In the safe topology changes algorithm, the first step would be to tell all other nodes to blacklist the removed node. It would be fine to fail removenode if we detect that the node is UP and tell the admin to reconsider his choice (maybe he meant decommission?).

kostja · 2022-10-06T16:17:05Z

Agree with @tgrabiec. Also, repair-based node ops are supposed to address the issue you describe, @kbr-.

gusev-p · 2022-10-10T14:34:09Z

v2:

- exclude the warning, it's already merged in another PR
- give preference to other nodes, and fallback to the leaving_node if there are no other options

scylladb-promoter · 2022-10-10T18:36:11Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2622/

kbr- · 2022-10-11T10:04:20Z

@asias please take a look at this PR

asias · 2022-10-11T10:30:07Z

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

asias · 2022-10-11T10:37:05Z

We can do something like here:

#11362

Reject the removenode in case the node is still up.
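
A minimal sketch of such a guard, as a Python model and not the code from this PR or from #11362. `NodeStillUpError`, `check_removenode_allowed`, and the `is_alive` callback (standing in for the gossiper's view of the node) are assumed names.

```python
from typing import Callable


class NodeStillUpError(RuntimeError):
    """Hypothetical error raised when the target of removenode looks alive."""


def check_removenode_allowed(node: str, is_alive: Callable[[str], bool]) -> None:
    """Refuse to start removenode while the failure detector sees `node` as up."""
    if is_alive(node):
        raise NodeStillUpError(
            f"removenode rejected: {node} is still seen as UP; "
            "wait until it is marked DOWN, or use decommission for a live node"
        )
```

Failing early like this turns a confusing mid-operation streaming error into an immediate, explicit rejection that the operator can act on.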

asias · 2022-10-11T10:39:21Z

Note: most of the time, users can and should avoid using removenode operations.

avikivity · 2022-10-11T10:48:09Z

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

Agree.

tgrabiec · 2022-10-11T11:16:47Z

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

FWIW, also agree.

If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch we explicitly reject removenode operation if the gossiper considers the leaving node up.

scylladb-promoter · 2022-10-12T10:17:28Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2652/

gusev-p · 2022-10-12T11:37:39Z

@benipeled ld.lld: error: failed to open build/release/test/boost/hashers_test: No space left on device

scylladb-promoter · 2022-10-13T10:01:47Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2665/

gusev-p · 2022-10-13T10:05:44Z

v2:

reject the removenode operation if the leaving node is UP

If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch we explicitly reject removenode operation if the gossiper considers the leaving node up. Closes #11704

gusev-p requested a review from tgrabiec as a code owner October 4, 2022 10:03

gusev-p requested a review from kostja October 4, 2022 10:05

tgrabiec requested a review from asias October 4, 2022 10:41

This was referenced Oct 5, 2022

raft: test_topology.test_remove_node_add_column is flaky #11721

Closed

storage_service::removenode should not silence exceptions #11722

Closed

kbr- mentioned this pull request Oct 5, 2022

service: storage_service: log removenode errors #11728

Closed

gusev-p force-pushed the remove_node_fix branch from 238ed97 to c6db1c3 Compare October 10, 2022 14:32

gusev-p force-pushed the remove_node_fix branch from c6db1c3 to ad12684 Compare October 12, 2022 09:06

scylladb-promoter closed this in c76cf59 Oct 13, 2022
