Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

removenode: don't stream data from the leaving node #11704

Closed
wants to merge 1 commit into from

Conversation

gusev-p
Copy link

@gusev-p gusev-p commented Oct 4, 2022

If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch leave_node is explicitly ignored when choosing a source for the token_range.

A warning has also been added when a removenode exception is caught. The removenode_abort logic that follows it may also throw, in which case information about the original exception was lost.

@gusev-p gusev-p requested a review from tgrabiec as a code owner October 4, 2022 10:03
@gusev-p gusev-p requested a review from kostja October 4, 2022 10:05
@tgrabiec tgrabiec requested a review from asias October 4, 2022 10:41
@avikivity
Copy link
Member

Does this fix a bug? If so it needs a Fixes #nnnnn.

@gusev-p
Copy link
Author

gusev-p commented Oct 4, 2022

Does this fix a bug? If so it needs a Fixes #nnnnn.

No, I ran into this while writing a removenode+raft test using new Kostya's pytest facilities.

@avikivity
Copy link
Member

Does this fix a bug? If so it needs a Fixes #nnnnn.

No, I ran into this while writing a removenode+raft test using new Kostya's pytest facilities.

So, in non-raft it behaves correctly and only breaks under raft?

@gusev-p
Copy link
Author

gusev-p commented Oct 4, 2022

So, in non-raft it behaves correctly and only breaks under raft?

No, it doesn't seem to be related to raft, we just decided to write an explicit test trying to reproduce this issue.

@avikivity
Copy link
Member

I'm confused. So this is not related to raft, and not a bug? Perhaps it's a bug but not a regression?

@gusev-p
Copy link
Author

gusev-p commented Oct 4, 2022

I'm confused. So this is not related to raft, and not a bug? Perhaps it's a bug but not a regression?

It's not a regression. I'm not sure if this is a bug or not, just a strange behaviour I found while writing the test.

@avikivity
Copy link
Member

Ok. I'll let @asias decide.

@scylladb-promoter
Copy link
Contributor

@denesb
Copy link
Contributor

denesb commented Oct 5, 2022

I don't think we can do that when the leaving node is the solitary owner of some of the data (think RF=1). This is not recommended but users still do it.

@gusev-p
Copy link
Author

gusev-p commented Oct 5, 2022

I don't think we can do that when the leaving node is the solitary owner of some of the data (think RF=1). This is not recommended but users still do it.

Ok, we could give preference to other nodes, and fallback to the leaving_node if there are no other options.

@kbr-
Copy link
Contributor

kbr- commented Oct 5, 2022

In this patch leave_node is explicitly ignored when choosing a source for the token_range.

If I remember correctly...
removenode can also work with nodes that are UP, and in this case it's supposed to give better safety guarantees (no data loss) - like decommission.
If the node is DOWN then we give up some guarantees (we may lose data that was only replicated to that node).

OTOH in the docs we say that removenode should only be used for DOWN nodes - it kind of makes sense, after all if the node is UP we can run decommission on it.
So perhaps the correct approach here would be to reject removenode attempt on a node that is UP.

In the test we can wait until the desired removenode coordinator notices that the node we want to remove is down, only then start removenode.

@gusev-p
Copy link
Author

gusev-p commented Oct 5, 2022

If I remember correctly...
removenode can also work with nodes that are UP, and in this case it's supposed to give better safety guarantees (no data loss) - like decommission.

I don't see in code that the leaving_node is preferred over the others, we just iterate over all replicas for token_range in some order and choose the first live one.

So perhaps the correct approach here would be to reject removenode attempt on a node that is UP.

It might break some undocumented workflows that users do in practice, as @denesb noted above. On the other hand, I see no reason to choose leaving_node if there are other sources for token_range.

In the test we can wait until the desired removenode coordinator notices that the node we want to remove is down, only then start removenode.

Yes, we can, but this is a redundant step that does not follow from the documentation. And this actually feels like a bug - we follow the docs, but we still get an error from removenode.

@kbr-
Copy link
Contributor

kbr- commented Oct 6, 2022

On the other hand, I see no reason to choose leaving_node if there are other sources for token_range.

There is a reason, if we want to make removenode a safe operation in the case that the removed node is UP. That's because the removed node might be the only replica that knows about some write. For example, suppose someone performed a CL=ONE write and got an ACK, meaning that the write was successful. The write could be made to this replica and not any other replica. All subsequent CL=ALL reads must see any previous confirmed CL=ONE writes, if there were no unsafe operations (like removing a dead node) in the meantime. For example, decommission operation is safe and guarantees that - we stream data out of the decommissioned node, so if there was any write that got only replicated to this node, we will keep it after decommission on some other replica.

So it all depends on the guarantees that we want to provide with removenode.

However, if it is as you say:

I don't see in code that the leaving_node is preferred over the others, we just iterate over all replicas for token_range in some order and choose the first live one.

then it means that removenode is inherently unsafe - even if the node is UP, the streaming algorithm does not guarantee to do the safe thing here which is streaming from the removed node. It may or may not stream from it.

So, if we assume that removenode is unsafe and we don't care about making it safe, the solution you proposed makes perfect sense:

Ok, we could give preference to other nodes, and fallback to the leaving_node if there are no other options.

@asias could you please share your opinion about this?

@tgrabiec
Copy link
Contributor

tgrabiec commented Oct 6, 2022

I think removenode is meant to be unsafe in a sense that whatever was owned by the node is considered lost. In the safe topology changes algorithm, the first step would be to tell all other nodes to blacklist the removed node. It would be fine to fail removenode if we detect that the node is UP and tell the admin to reconsider his choice (maybe he meant decommission?).

@kostja
Copy link
Contributor

kostja commented Oct 6, 2022

Agree with @tgrabiec plus repair based node ops are supposed to address the issue you describe @kbr-

@gusev-p
Copy link
Author

gusev-p commented Oct 10, 2022

v2:

  • exclude the warning, it's already merged in another PR
  • give preference to other nodes, and fallback to the leaving_node if there are no other options

@scylladb-promoter
Copy link
Contributor

@kbr-
Copy link
Contributor

kbr- commented Oct 11, 2022

@asias please take a look at this PR

@asias
Copy link
Contributor

asias commented Oct 11, 2022

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

@asias
Copy link
Contributor

asias commented Oct 11, 2022

We can do something like here:

#11362

Reject the removenode in case the node is still up.

@asias
Copy link
Contributor

asias commented Oct 11, 2022

Note: most of the times, users can and should avoid using removenode operations.

@avikivity
Copy link
Member

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

Agree.

@tgrabiec
Copy link
Contributor

I think we'd better just abort the removenode operation if the node is still up. It is better not to play tricks. I would prefer safety here.

FWIW, also agree.

If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.

In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.
@scylladb-promoter
Copy link
Contributor

@gusev-p
Copy link
Author

gusev-p commented Oct 12, 2022

@benipeled ld.lld: error: failed to open build/release/test/boost/hashers_test: No space left on device

@scylladb-promoter
Copy link
Contributor

@gusev-p
Copy link
Author

gusev-p commented Oct 13, 2022

v2:

  • reject decomission operation if the living node is UP

kbr- pushed a commit that referenced this pull request Oct 13, 2022
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.

In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.

Closes #11704
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants