removenode: don't stream data from the leaving node #11704
Does this fix a bug? If so it needs a Fixes #nnnnn. |
No, I ran into this while writing a |
So, in non-raft it behaves correctly and only breaks under raft? |
No, it doesn't seem to be related to raft, we just decided to write an explicit test trying to reproduce this issue. |
I'm confused. So this is not related to raft, and not a bug? Perhaps it's a bug but not a regression? |
It's not a regression. I'm not sure if this is a bug or not, just a strange behaviour I found while writing the test. |
Ok. I'll let @asias decide. |
I don't think we can do that when the leaving node is the solitary owner of some of the data (think RF=1). This is not recommended but users still do it. |
Ok, we could give preference to other nodes, and fallback to the leaving_node if there are no other options. |
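The preference-with-fallback idea can be sketched roughly as below. This is an illustrative Python sketch, not the project's actual code; `pick_stream_source` and its parameters are hypothetical names chosen for the example.

```python
from typing import Iterable, Optional


def pick_stream_source(replicas: Iterable[str], leaving_node: str) -> Optional[str]:
    """Pick a node to stream a token range from (hypothetical helper).

    Prefer any replica other than the leaving node; fall back to the
    leaving node only when it is the sole owner of the range (e.g. RF=1).
    """
    replicas = list(replicas)
    others = [node for node in replicas if node != leaving_node]
    if others:
        return others[0]
    if leaving_node in replicas:
        # No other replica holds this range, so the leaving node is the only option.
        return leaving_node
    return None
```

With RF=1 the fallback branch is the only one that can return a source, which is the case raised in the comment above.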
If I remember correctly... OTOH in the docs we say that In the test we can wait until the desired removenode coordinator notices that the node we want to remove is down, only then start removenode. |
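A test-side wait of this kind could look roughly like the sketch below. It is a generic polling helper in Python; `wait_until` is a hypothetical name, and the predicate that asks the coordinator whether it sees the node as down is left to the caller because that API is not shown in this thread.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 30.0, interval: float = 0.1) -> None:
    """Poll predicate() until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)
```

The test would call this with a predicate that checks the coordinator's view of the stopped node, and only then start removenode.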
I don't see in code that the
It might break some undocumented workflows that users do in practice, as @denesb noted above. On the other hand, I see no reason to choose
Yes, we can, but this is a redundant step that does not follow from the documentation. And this actually feels like a bug - we follow the docs, but we still get an error from |
There is a reason, if we want to make So it all depends on the guarantees that we want to provide with However, if it is as you say:
then it means that So, if we assume that
@asias could you please share your opinion about this? |
I think |
v2:
|
@asias please take a look at this PR |
I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here. |
We can do something like here: Reject the removenode in case the node is still up. |
Note: most of the time, users can and should avoid using removenode operations. |
Agree. |
FWIW, also agree. |
If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch we explicitly reject removenode operation if the gossiper considers the leaving node up.
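The guard described in this commit message can be sketched as follows. This is a simplified Python illustration, not the patch itself; `removenode`, `is_alive`, `do_remove` and `NodeStillUpError` are hypothetical names, with `is_alive` standing in for the gossiper's liveness view.

```python
from typing import Callable


class NodeStillUpError(RuntimeError):
    """Raised when removenode targets a node that is still considered up."""


def removenode(node: str, is_alive: Callable[[str], bool], do_remove: Callable[[str], None]) -> None:
    # is_alive is an assumption: it represents what the gossiper currently believes.
    if is_alive(node):
        # Reject early instead of failing later while streaming from the node.
        raise NodeStillUpError(f"cannot remove {node}: it is still considered up")
    do_remove(node)
```

Rejecting up front turns a late streaming failure into an immediate, explicit error that the operator can retry once the node is seen as down.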
@benipeled |
v2:
|
If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch we explicitly reject removenode operation if the gossiper considers the leaving node up. Closes #11704
If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch leave_node is explicitly ignored when choosing a source for the token_range.

A warning has also been added when a removenode exception is caught. The removenode_abort logic that follows it may also throw, in which case information about the original exception was lost.
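The reason for logging before the abort step can be illustrated with a small sketch. It is written in Python for brevity and does not reproduce the project's code; `run_with_abort`, `operation` and `abort` are hypothetical names.

```python
import logging
from typing import Callable

logger = logging.getLogger("removenode_sketch")


def run_with_abort(operation: Callable[[], None], abort: Callable[[], None]) -> None:
    """Run operation(); on failure, log the error, then run abort() and re-raise."""
    try:
        operation()
    except Exception as exc:
        # Log first: if abort() raises too, the original error is already recorded.
        logger.warning("removenode failed: %r", exc)
        abort()
        raise
```

If `abort` succeeds, the original exception propagates unchanged; if `abort` itself raises, the caller sees the abort failure, but the warning still records what went wrong first.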