Wait for node to be down in replace #12887

Open

wants to merge 2 commits into base: master
Conversation

@asias (Contributor) commented Feb 16, 2023

We can only replace a dead node. The replace procedure requires the user
to check that the node being replaced is down before performing the replace
operation.

With the node_ops_cmd infrastructure, we can easily check on all nodes
participating in the replace operation that the node being replaced has been
marked as down.

Fixes #12862

Wait for nodes to be down, with a timeout.
We can only replace a dead node. The replace procedure requires the user
to check that the node being replaced is down before performing the replace
operation.

With the node_ops_cmd infrastructure, we can easily check on all nodes
participating in the replace operation that the node being replaced has been
marked as down.

Fixes scylladb#12862
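
For illustration only (not code from this PR): a minimal, standalone C++ sketch of the check described above, polling every node participating in the operation until the replaced node is reported as down or a timeout expires. The `is_marked_down_fn` callback and all names are hypothetical stand-ins, not Scylla's node_ops_cmd API.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for "ask node `participant` whether it currently
// marks `replaced` as down"; in Scylla this would go over the node_ops RPC.
using is_marked_down_fn = std::function<bool(const std::string& participant,
                                             const std::string& replaced)>;

// Poll every participating node until all of them report the replaced node
// as down, or give up after `timeout`.
void wait_for_node_down_on_all(const std::vector<std::string>& participants,
                               const std::string& replaced,
                               const is_marked_down_fn& is_marked_down,
                               std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        bool all_down = true;
        for (const auto& p : participants) {
            if (!is_marked_down(p, replaced)) {
                all_down = false;
                break;
            }
        }
        if (all_down) {
            return; // safe to proceed with the replace operation
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw std::runtime_error("replace: node " + replaced +
                                     " is still seen as up by some nodes");
        }
        std::this_thread::sleep_for(std::chrono::seconds(1)); // retry interval
    }
}
```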
@bhalevy (Member) commented Feb 16, 2023

Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?

@asias (Contributor, Author) commented Feb 16, 2023

Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?

They should, but they may not follow the procedure, especially the tests. After this patch, even if they do not, Scylla core will ensure that the node to be replaced is dead and will otherwise reject the replace operation.

@bhalevy (Member) commented Feb 16, 2023

Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?

They should, but they may not follow the procedure, especially the tests. After this patch, even if they do not, Scylla core will ensure that the node to be replaced is dead and will otherwise reject the replace operation.

I think that the current way, where we reject the operation without waiting, is good enough, and the test code should be fixed instead. In practice, replace is run either after a node went down inadvertently or after the admin took it down on purpose. In both cases the documented process calls for verifying that the node is down using nodetool status on all live nodes (IIRC the documentation already instructs running it on all remaining nodes, and if not, the docs should be fixed to say so).

@kbr-scylla (Contributor):

#12015

@kostja (Contributor) commented Feb 16, 2023

This will slow down the node operations by the gossip convict threshold. The node state, up or down, should be taken from group0, not from gossiper, where it can change right after it was queried. Can you trust the local view anyway?
@gleb-cloudius - this is why we need to have current state and target state in the topology state machine for each node.

@asias (Contributor, Author) commented Feb 17, 2023

This will slow down the node operations by the gossip convict threshold.

I like tests to run fast, but test speed is not the primary goal here. A slower test is much better than a flaky test.

If we stop the node to be replaced gracefully, the other nodes will mark it as down immediately, so there is no time increase in the test.

The node state, up or down, should be taken from group0, not from gossiper, where it can change right after it was queried. Can you trust the local view anyway? @gleb-cloudius - this is why we need to have current state and target state in the topology state machine for each node.

As long as we still use gossip to mark nodes up and down and use the gossip callbacks, we need to wait for gossip to mark the node as down. The up/down state of a peer node is the local node's view of that peer.

@asias (Contributor, Author) commented Feb 17, 2023

@kbr-scylla @kostja Please help with the raft test failure

https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4406/

What was wrong?

@asias (Contributor, Author) commented Feb 17, 2023

Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?

They should, but they may not follow the procedure, especially the tests. After this patch, even if they do not, Scylla core will ensure that the node to be replaced is dead and will otherwise reject the replace operation.

I think that the current way, where we reject the operation without waiting, is good enough, and the test code should be fixed instead. In practice, replace is run either after a node went down inadvertently or after the admin took it down on purpose. In both cases the documented process calls for verifying that the node is down using nodetool status on all live nodes (IIRC the documentation already instructs running it on all remaining nodes, and if not, the docs should be fixed to say so).

From the user's point of view, it is best if the procedure has fewer steps. If we can do the check in Scylla core, I think we should, to simplify the procedure. Btw, once we have the feature to block the replaced node from rejoining the cluster, we do not really need the user to check that a node is really down; we could even replace a live node.

@kbr-scylla (Contributor):
What was wrong?

https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4406/artifact/testlog/x86_64/debug/scylla-155.log
It hung during shutdown of the hints_manager service:

INFO  2023-02-16 07:53:03,380 [shard 0] hints_manager - Asked to stop
INFO  2023-02-16 07:53:03,380 [shard 0] hints_manager - Asked to stop
INFO  2023-02-16 07:53:03,380 [shard 0] hints_manager - Stopped
INFO  2023-02-16 07:53:03,381 [shard 1] hints_manager - Asked to stop
INFO  2023-02-16 07:53:03,381 [shard 1] hints_manager - Asked to stop
INFO  2023-02-16 07:53:03,381 [shard 1] hints_manager - Stopped
INFO  2023-02-16 07:53:03,382 [shard 1] hints_manager - Stopped

it's a known issue: #8079
I restarted CI.

@kostja (Contributor) commented Feb 17, 2023

This will slow down the node operations by the gossip convict threshold.

I like tests to run fast, but test speed is not the primary goal here. A slower test is much better than a flaky test.

If we stop the node to be replaced gracefully, the other nodes will mark it as down immediately, so there is no time increase in the test.

The node state, up or down, should be taken from group0, not from gossiper, where it can change right after it was queried. Can you trust the local view anyway? @gleb-cloudius - this is why we need to have current state and target state in the topology state machine for each node.

As long as we still use gossip to mark nodes up and down and use the gossip callbacks, we need to wait for gossip to mark the node as down. The up/down state of a peer node is the local node's view of that peer.

The gossip store is unreliable. Trusting node state in gossip just sweeps the problem under the carpet. Checking once before the operation is more a courtesy than a real service: if you miss the timing of the check, you get the same behaviour. The bug should never have been acknowledged as a bug; it is documented that one should run nodetool status before executing the operation.
So the patch is enshrining legacy behaviour that strongly consistent topology changes will need to redo, retest, and, on top of that, preserve compatibility with the old code.
The issue should simply be reassigned to @gleb-cloudius and he will address it in his series for the new topology changes.

@kbr-scylla (Contributor):
The node state, up or down, should be taken from group0

Marking a node as UP or DOWN is failure detection, it doesn't make sense for it to require consensus, it's a local decision made by each node separately.

@kostja (Contributor) commented Feb 17, 2023

The node state, up or down, should be taken from group0

Marking a node as UP or DOWN is failure detection, it doesn't make sense for it to require consensus, it's a local decision made by each node separately.

I am not discussing how we detect failure; I am discussing how we store the information about it. It is not stored in linearizable storage, so we cannot trust it.

Now, to detection. There are many ways to detect that a node is down. Moreover, the whole concept of node-down is far less necessary if you use centralized topology changes: you can allow removenode of a running node safely, you just fence it off. So the check, the concept, the behaviour are legacy. Now, let's imagine we respect this legacy and would like to preserve it. In that case we should record the state of a node that has shut down gracefully in the group0 state. This patch doesn't do that. Another option would be to initiate a node shutdown on removenode if the node is running.
My main point is that this patch is not a bugfix, it's an enhancement. And we should not accept enhancements without proper planning and discussion of the desired UX. Currently we're having this discussion after the fact, which is inevitably biased.

@kostja (Contributor) commented Feb 17, 2023

The node state, up or down, should be taken from group0

Marking a node as UP or DOWN is failure detection, it doesn't make sense for it to require consensus, it's a local decision made by each node separately.

Now, with this statement you actually contradict the purpose of the patch. The goal of the safety net provided by this patch is to bar removenode of a node which is not really down (e.g. some nodes think it is UP).
Well, suppose you look at the status of the node, even globally, and you see it's really DOWN. We discussed that this is impossible today - you linked a pull request. And a few moments later the node is back up. It's a classic read-then-write scenario; it should go through group0 state.

@kostja (Contributor) commented Feb 17, 2023

One more comment about detection. I have been advocating that graceful state changes (a node going up or down) should be reflected in group0, and that the group0 view of the node state should trump the gossiper view of it. This issue specifically, providing a predictable and safe user experience when trying to perform a topology change on a down node, is exactly the reason we need it.
For nodes which have been gracefully shut down by the DBA, we just take the state from group0 and perform the operation immediately.
For nodes which are healthy and UP according to both group0 and gossip, we reject the operation.
For nodes which are UP in group0 but DOWN in gossip, we can either fence the node off or wait for the node to be down, depending on the removenode options.

As you can see, the optimal user experience is not as straightforward as this patch suggests.
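
For illustration, a hedged sketch of the decision table just described, encoded as a small standalone C++ function. The enums and the policy are hypothetical simplifications, not Scylla types.

```cpp
// Hypothetical, simplified encoding of the three cases listed above.
enum class group0_state { up, shutdown };   // graceful state recorded in group0
enum class gossip_state { up, down };       // local failure-detector view

enum class node_op_action { proceed, reject, fence_or_wait };

node_op_action decide_remove_or_replace(group0_state g0, gossip_state fd) {
    if (g0 == group0_state::shutdown) {
        return node_op_action::proceed;       // gracefully shut down: act immediately
    }
    if (fd == gossip_state::up) {
        return node_op_action::reject;        // healthy and UP everywhere: refuse
    }
    return node_op_action::fence_or_wait;     // UP in group0 but DOWN in gossip
}
```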

@bhalevy (Member) commented Feb 19, 2023

Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?

They should, but they may not follow the procedure, especially the tests. After this patch, even if they do not, Scylla core will ensure that the node to be replaced is dead and will otherwise reject the replace operation.

I think that the current way, where we reject the operation without waiting, is good enough, and the test code should be fixed instead. In practice, replace is run either after a node went down inadvertently or after the admin took it down on purpose. In both cases the documented process calls for verifying that the node is down using nodetool status on all live nodes (IIRC the documentation already instructs running it on all remaining nodes, and if not, the docs should be fixed to say so).

From the user's point of view, it is best if the procedure has fewer steps. If we can do the check in Scylla core, I think we should, to simplify the procedure.

The utmost simplification would be automating topology changes using raft.
We need to continue to stabilize the existing node-ops implementation for the sake of older versions that will not support topology changes over raft, but IMO we shouldn't add more features, like the one proposed here.
(This PR suggests waiting for the node to be down in replace. What about removenode? Why not wait there too?)

Btw, once we have the feature to block the replaced node from rejoining the cluster, we do not really need the user to check that a node is really down; we could even replace a live node.

Agreed, but then again, I think this is something we need to consider with the new topology-changes implementation over raft.

@bhalevy (Member) commented Feb 19, 2023

This will slow down the node operations by the gossip convict threshold.

I like tests to run fast, but test speed is not the primary goal here. A slower test is much better than a flaky test.
If we stop the node to be replaced gracefully, the other nodes will mark it as down immediately, so there is no time increase in the test.

The node state, up or down, should be taken from group0, not from gossiper, where it can change right after it was queried. Can you trust the local view anyway? @gleb-cloudius - this is why we need to have current state and target state in the topology state machine for each node.

As long as we still use gossip to mark nodes up and down and use the gossip callbacks, we need to wait for gossip to mark the node as down. The up/down state of a peer node is the local node's view of that peer.

The gossip store is unreliable. Trusting node state in gossip just sweeps the problem under the carpet. Checking once before the operation is more a courtesy than a real service: if you miss the timing of the check, you get the same behaviour. The bug should never have been acknowledged as a bug; it is documented that one should run nodetool status before executing the operation.

If nodetool status is unreliable, e.g. due to unstable state across shards, we should fix that.

So the patch is enshrining legacy behaviour that strongly consistent topology changes will need to redo, retest, and, on top of that, preserve compatibility with the old code. The issue should simply be reassigned to @gleb-cloudius and he will address it in his series for the new topology changes.

Topology changes over raft will not be backported to existing releases.
For those we need to maintain the present infrastructure and continue to fix issues with it (though not add more features).

@gleb-cloudius (Contributor):

This will slow down the node operations by the gossip convict threshold.

Of course it will. If a node disappeared without a proper shutdown message, it either takes the convict threshold to mark it dead, or a way to fence it off the cluster if we do not want to wait. In fact the latter is needed anyway, since there is no guarantee the node is not just partitioned away or will not be booted later. So the only thing really needed is fencing a particular node from the cluster. With raft topology we can do it at the beginning of the replace operation. We need to add a fencing mechanism (not the same fencing as we are adding for requests, but a different kind of fencing that will not allow a fenced node to connect to the cluster at all).
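
For illustration, a minimal standalone C++ sketch of that kind of connection-level fencing: a registry of fenced host IDs consulted on the connection-accept path, so a fenced node cannot talk to the cluster at all. The class and its names are hypothetical, not Scylla's actual API.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Illustrative registry of fenced host ids, checked before accepting a
// connection from a peer.
class fence_registry {
    std::unordered_set<std::string> _fenced;
    mutable std::mutex _mtx;
public:
    // Called, e.g., by the topology coordinator at the start of a replace.
    void fence(const std::string& host_id) {
        std::lock_guard<std::mutex> g(_mtx);
        _fenced.insert(host_id);
    }
    // Called on the connection-accept path: a fenced node is refused before
    // it can rejoin or communicate with the cluster at all.
    bool may_connect(const std::string& host_id) const {
        std::lock_guard<std::mutex> g(_mtx);
        return _fenced.count(host_id) == 0;
    }
};
```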

The node state, up or down, should be taken from group0, not from gossiper, where it can change right after it was queried. Can you trust the local view anyway? @gleb-cloudius - this is why we need to have current state and target state in the topology state machine for each node.

Repeating it will not make it true :) 1. UP/DOWN is a local view. 2. Storing it in group0 does not mean it will not change after it is queried. 3. To get up-to-date data from group0 a read barrier is required, and you do not want to do that on each request (which is how this info is accessed now). I probably forgot 3 more reasons.
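
To make the read-barrier point concrete, here is a hedged, standalone C++ sketch contrasting a cheap local read with a linearizable read that pays for a barrier first. The `raft_group0` type here is a hypothetical stand-in, not Scylla's actual service.

```cpp
#include <map>
#include <string>

// Illustrative stand-in for a Raft-replicated state machine (group0).
struct raft_group0 {
    std::map<std::string, std::string> state;   // locally applied entries

    // In a real implementation this confirms with a quorum that all
    // committed entries have been applied locally: at least one round trip.
    void read_barrier() { /* network round trip elided in this sketch */ }

    // Fast, but may lag behind what group0 has already committed.
    std::string read_possibly_stale(const std::string& key) const {
        auto it = state.find(key);
        return it == state.end() ? std::string{} : it->second;
    }

    // Linearizable, but too expensive to pay on every request.
    std::string read_linearizable(const std::string& key) {
        read_barrier();
        auto it = state.find(key);
        return it == state.end() ? std::string{} : it->second;
    }
};
```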

@kostja (Contributor) commented Feb 20, 2023

Repeating it will not make it true :) 1. UP/DOWN is a local view.

You wish to keep it that way by all means, but that's your choice. First of all, if you know how failure conviction works, it is a global conviction, not a local one. The node is declared down when it is seen as down by the cluster and this knowledge has propagated to all nodes. This is what gossip is responsible for in the first place. So it's not a local view, it's a global view. But more importantly, it becomes a global view once you store the conviction result in group0.

Once you store it in group0, you can no longer flip it back easily. You require the node to connect to the majority and change the majority's view of itself to be truly back online. Otherwise it's fenced. This is how you achieve the desired result: you cannot just flip a node back and forth on some nodes but not others; all node transitions, including up and down transitions, are linearized.

  2. Storing it in group0 does not mean it will not change after it is queried.

I just wrote above how it does.

  3. To get up-to-date data from group0 a read barrier is required, and you do not want to do that on each request (which is how this info is accessed now).

Not requiring a read barrier was your requirement; neither I nor Avi share it. Your current patch uses a clever trick to avoid unnecessary read barriers; it could be extended.

I probably forgot 3 more reasons.

You didn't forget any good reason, because there are no good reasons not to keep it in group0. Besides, you missed reason #3 and jumped right to #4.

@gleb-cloudius (Contributor):

Repeating it will not make it true :) 1. UP/DOWN is a local view.

You wish to keep it that way by all means, but that's your choice.

The only sane one.

First of all, if you know how failure conviction works, it is a global conviction, not a local one. The node is declared down when it is seen as down by the cluster and this knowledge has propagated to all nodes. This is what gossip is responsible for in the first place. So it's not a local view, it's a global view. But more importantly, it becomes a global view once you store the conviction result in group0.

No. We use a direct pinger. The view is local. And even before that, the view was sort of local, since each node could see something different depending on the state of its connections. And we moved to the direct pinger because having a not-so-local view caused complete havoc as well.

Once you store it in group0, you can no longer flip it back easily. You require the node to connect to the majority and change the majority's view of itself to be truly back online. Otherwise it's fenced. This is how you achieve the desired result: you cannot just flip a node back and forth on some nodes but not others; all node transitions, including up and down transitions, are linearized.

Nothing here makes sense to me. For one, if a minority is partitioned away, it continues seeing the entire cluster as alive because it cannot write into group0. How can that make any sense?

  2. Storing it in group0 does not mean it will not change after it is queried.

I just wrote above how it does.

You did not write anything like that. The moment you query something from group0, it may change in group0.

  3. To get up-to-date data from group0 a read barrier is required, and you do not want to do that on each request (which is how this info is accessed now).

Not requiring a read barrier was your requirement; neither I nor Avi share it. Your current patch uses a clever trick to avoid unnecessary read barriers; it could be extended.

Avoiding a read barrier on each request is not my requirement; it is a common-sense requirement. Node states are checked on each request.

My patch uses a common trick, optimistic locking IIRC, in one place where it is possible (see the sketch below). It is possible when a state can change only once; that is not the case for a node's state.

I probably forgot 3 more reasons.

You didn't forget any good reason, because there are no good reasons not to keep it in group0. Besides, you missed reason #3 and jumped right to #4.

Well, I probably forgot reason 3. But those that I listed are enough.
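
For illustration of the optimistic-locking trick mentioned above, a minimal standalone C++ sketch of the general pattern: read a version, do the work, and commit only if the version has not changed. Whether this can stand in for a read barrier depends on the observed state changing at most once, as noted; all names here are hypothetical.

```cpp
#include <atomic>

// Illustrative optimistic-concurrency wrapper around some piece of state.
struct versioned_state {
    std::atomic<unsigned> version{0};

    // Prepare a change optimistically, then commit only if nobody bumped the
    // version since we observed it. Returns false if the caller must retry.
    template <typename Work>
    bool try_apply(Work&& work) {
        unsigned observed = version.load();
        work();  // tentative work based on the observed state
        return version.compare_exchange_strong(observed, observed + 1);
    }
};
```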

@kostja (Contributor) commented Feb 20, 2023

Repeating it will not make it true :) 1. UP/DOWN is a local view.

You wish to keep it that way by all means, but that's your choice.

The only sane one.

First of all, if you know how failure conviction works, it is a global conviction, not a local one. The node is declared down when it is seen as down by the cluster and this knowledge has propagated to all nodes. This is what gossip is responsible for in the first place. So it's not a local view, it's a global view. But more importantly, it becomes a global view once you store the conviction result in group0.

No. We use a direct pinger. The view is local. And even before that, the view was sort of local, since each node could see something different depending on the state of its connections. And we moved to the direct pinger because having a not-so-local view caused complete havoc as well.

Well, maybe the havoc was caused by not implementing an epidemic failure detector correctly, one which has theoretically proven convergence in logarithmic time? In any case, the fact that our failure detection is broken doesn't mean everything else should be broken as well; we need to fix the failure detection instead.

@kostja (Contributor) commented Feb 20, 2023

No. We use a direct pinger. The view is local. And even before that, the view was sort of local, since each node could see something different depending on the state of its connections. And we moved to the direct pinger because having a not-so-local view caused complete havoc as well.

Even if it is the local view of the topology coordinator, it is sufficient to provide linearizability of state transitions. And that's exactly what the goal is here.

@kostja (Contributor) commented Feb 20, 2023

Once you store it in group0, you can no longer flip it back easily. You require the node to connect to the majority and change the majority's view of itself to be truly back online. Otherwise it's fenced. This is how you achieve the desired result: you cannot just flip a node back and forth on some nodes but not others; all node transitions, including up and down transitions, are linearized.

Nothing here makes sense to me. For one, if a minority is partitioned away, it continues seeing the entire cluster as alive because it cannot write into group0. How can that make any sense?

A minority that is partitioned away may see the old topology. UP and DOWN states are part of the topology. How can one part make sense to you and the other not?

Besides, we have an ambiguous definition of the word "see". I do not suggest replacing the failure detector running locally. I am suggesting augmenting its information with the information from group0. And I suggest that topology operations should be based on the information stored in group0. Query routing can continue to be based on the gossiper, or take group0 information into account as well.

You seem to think that if we switch the storage of UP and DOWN states to group0, then not just topology operations but routing operations will have to switch as well; I did not suggest that.

@gleb-cloudius (Contributor):

No. We use a direct pinger. The view is local. And even before that, the view was sort of local, since each node could see something different depending on the state of its connections. And we moved to the direct pinger because having a not-so-local view caused complete havoc as well.

Even if it is the local view of the topology coordinator, it is sufficient to provide linearizability of state transitions. And that's exactly what the goal is here.

First, why do you want to force the coordinator's view on the entire cluster? And second, even that is not possible, since if there is a network partition, the smaller partition will never get the update and will see the entire cluster as alive (?!).

@gleb-cloudius (Contributor) commented Feb 20, 2023

Once you store it in group0, you can no longer flip it back easily. You require the node to connect to the majority and change the majority's view of itself to be truly back online. Otherwise it's fenced. This is how you achieve the desired result: you cannot just flip a node back and forth on some nodes but not others; all node transitions, including up and down transitions, are linearized.

Nothing here makes sense to me. For one, if a minority is partitioned away, it continues seeing the entire cluster as alive because it cannot write into group0. How can that make any sense?

A minority that is partitioned away may see the old topology. UP and DOWN states are part of the topology. How can one part make sense to you and the other not?

UP/DOWN is not topology; it is node state. You are mixing those up, and that is why you propose something that does not make sense. The minority would see all nodes as alive and continue sending data to them, which causes requests to pile up until they time out. Nodes have to see which other nodes they can communicate with, regardless of which side of the partition they are on.

Besides, we have an ambiguous definition of the word "see". I do not suggest replacing the failure detector running locally. I am suggesting augmenting its information with the information from group0. And I suggest that topology operations should be based on the information stored in group0. Query routing can continue to be based on the gossiper, or take group0 information into account as well.

Then you need to define what this up/down of yours means. Rename it to something that makes sense, and explain how and for what exactly you propose to use it. Otherwise the conversation is meaningless. UP/DOWN have a well-defined meaning that does not match the way you use them.

You seem to think that if we switch the storage of UP and DOWN states to group0, then not just topology operations but routing operations will have to switch as well; I did not suggest that.

Then define exactly what you are suggesting, without using well-established terms incorrectly.

@kbr-scylla (Contributor):

I am not discussing how we detect failure; I am discussing how we store the information about it. It is not stored in linearizable storage, so we cannot trust it.

Failure detection is unreliable; you can never "trust" it. You might think a node is dead even though it is really alive. That's why you never depend on failure-detection information to ensure safety properties; you only use it to improve liveness.

I really recommend the paper that introduced the theory of failure detectors; it's one of the best papers in Computer Science in my opinion:
https://dl.acm.org/doi/pdf/10.1145/226643.226647

Now, the whole idea of checking that a node is DOWN before deciding to replace it or remove it from the cluster is of course broken; I agree with that. It is an attempt to rely on failure detection for safety, which can never work. But we already have this broken design, inherited from Cassandra. We can only make the user experience a bit better.

The real, correct solution, as you all noticed, is to make the remove/replace operation safe regardless of whether the node is dead or alive (by using fencing).
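
For illustration, a hedged standalone C++ sketch of version-based fencing of the kind alluded to here: each request carries the topology version its sender believes in, and replicas reject requests fenced by a newer version, so the operation stays safe even if the replaced node is still alive. The names and structure are hypothetical, not Scylla's actual fencing implementation.

```cpp
#include <cstdint>
#include <stdexcept>

// A fencing token carried by every request.
struct fencing_token {
    uint64_t topology_version;
};

class replica {
    uint64_t _fence_version = 0;   // raised when a topology change fences older versions
public:
    void advance_fence(uint64_t v) {
        if (v > _fence_version) {
            _fence_version = v;
        }
    }
    void handle_write(const fencing_token& token /*, mutation, ... */) {
        if (token.topology_version < _fence_version) {
            // The sender operates on a stale topology (e.g. a node that is
            // being replaced but is still alive): refuse rather than apply.
            throw std::runtime_error("request fenced: stale topology version");
        }
        // ... apply the write ...
    }
};
```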

@kostja (Contributor) commented Feb 20, 2023

@gleb-cloudius and @kbr-scylla, I will respond to both of your comments. Gleb has asked me to define the UP and DOWN states: what they are and how I plan to use them.

They are additional information used by the topology coordinator exactly to provide some user-friendliness in cases like the one described above. I do not suggest using the UP or DOWN information on any path that is related to the liveness or safety of the coordinator algorithm, but having this information available will make the choices of the algorithm more user-friendly.

@gleb-cloudius (Contributor):

@kostja, but please do not call them UP/DOWN. Those terms are already taken and have their meaning in Scylla.

@mykaul (Contributor) commented May 3, 2023

What's the latest here? Who owns the next action item to make progress?

@bhalevy (Member) commented May 3, 2023

Marked as an enhancement, as #12862 was already fixed in another way.

Successfully merging this pull request may close these issues.

Scylla sometimes fails to boot with node replace