Wait for node to be down in replace #12887
base: master
Conversation
Wait for nodes to be down with a timeout.
We can only replace a dead node. The replace procedure requires the user to check that the replaced node is down before performing the replace operation. With the node_ops_cmd infrastructure, we can easily check on all nodes participating in the replace operation that the node being replaced has been marked as down. Fixes scylladb#12862
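A minimal sketch of the idea, in plain C++ rather than Scylla's actual seastar-based code: poll every node participating in the replace until all of them report the replaced node as down, or reject the operation after a timeout. The `is_marked_down_on` callable is a hypothetical stand-in for the real node_ops_cmd query to a peer.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

using namespace std::chrono;

// Poll all peers until every one of them reports the replaced node as
// DOWN, or fail once the timeout expires. `is_marked_down_on` is a
// hypothetical stand-in for the real per-peer query.
void wait_for_node_down(
        const std::vector<std::string>& peers,
        const std::string& replaced_node,
        const std::function<bool(const std::string&, const std::string&)>& is_marked_down_on,
        seconds timeout = seconds(120)) {
    const auto deadline = steady_clock::now() + timeout;
    while (steady_clock::now() < deadline) {
        bool down_everywhere = true;
        for (const auto& peer : peers) {
            if (!is_marked_down_on(peer, replaced_node)) {
                down_everywhere = false;
                break;
            }
        }
        if (down_everywhere) {
            return; // every participant agrees: safe to start the replace
        }
        std::this_thread::sleep_for(seconds(1)); // poll interval
    }
    throw std::runtime_error("replace rejected: node " + replaced_node +
                             " is still seen as UP by at least one peer");
}
```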
Shouldn't the user/test just wait until the replaced node is DN on all nodes before starting the replacing node?
They should, but they may not follow the procedure, especially the tests. After this patch, even if they do not, the Scylla core will ensure that the node to be replaced is dead, and otherwise reject the replace operation.
I think that the current way, where we reject the operation without waiting, is good enough, and the test code should be fixed instead. In practice, replace is run either after a node went down inadvertently, or after the admin took it down on purpose. In both cases the documented process calls for verifying that the node is down using nodetool status on all live nodes (IIRC the documentation already instructs running it on all remaining nodes, and if not, the docs should be fixed to say that).
This will slow down node operations by the gossip convict threshold. The node state, up or down, should be taken from group0, not from the gossiper, where it can change right after it is queried. Can you trust the local view anyway?
I like tests to run fast, but test speed is not the primary goal here. A slower test is much better than a flaky test. If we stop the node to be replaced gracefully, the other nodes will mark the node as down immediately. No time increase in the test.
As long as we still use gossip to mark nodes up and down and use the gossip callbacks, we need to wait for gossip to mark the node as down. The up/down state of a peer node is the local view of the local node.
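For illustration, a simplified sketch of what "waiting for gossip to mark the node as down" could look like, assuming the gossiper invokes registered subscribers when it convicts a peer (the real subscriber API differs):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

// Waits for a single node to be convicted by the local gossiper.
class down_waiter {
    std::mutex _m;
    std::condition_variable _cv;
    bool _down = false;
public:
    // Would be registered with the gossiper as an on_down subscriber
    // for the node being replaced.
    void on_down(const std::string& /*endpoint*/) {
        std::lock_guard<std::mutex> g(_m);
        _down = true;
        _cv.notify_all();
    }
    // Returns true if the node was marked DOWN before the timeout.
    bool wait(std::chrono::seconds timeout) {
        std::unique_lock<std::mutex> lk(_m);
        return _cv.wait_for(lk, timeout, [this] { return _down; });
    }
};
```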
@kbr-scylla @kostja Please help with the raft test failure https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4406/ What was wrong?
From the user's point of view, it is best if the procedure has fewer steps. If we can do the check in the Scylla core, I think we should, to simplify the procedure. Btw, once we have the feature to block the replaced node from rejoining the cluster, we do not really need the user to check that a node is really down; we could even replace a live node.
https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4406/artifact/testlog/x86_64/debug/scylla-155.log
it's a known issue: #8079
The gossip store is an unreliable store. Trusting the node state in gossip just sweeps the problem under the carpet. Checking once before the operation is more a courtesy than a real service: if you miss the timing of the check, you get the same behaviour. The bug should never have been acknowledged as a bug; it's documented that one should run nodetool status before executing the operation.
Marking a node as UP or DOWN is failure detection; it doesn't make sense for it to require consensus. It's a local decision made by each node separately.
I am not discussing how we detect failure, I am discussing how we store the information about it. It is not stored in linearizable storage, so we cannot trust it. Now, to detection. There are many ways to detect that a node is down. Moreover, the whole concept of node-down is much less necessary once you use centralized topology changes: you can safely allow removenode of a running node, you just fence it off. So the check, the concept, the behaviour are legacy. Now, let's imagine we respect this legacy and would like to preserve it. In that case we should store the state of a node which has shut down gracefully in the group0 state. This patch doesn't do that. Another option would be to initiate a node shutdown on removenode, if the node is running.
Now, with this statement you actually contradict the purpose of the patch. The goal of the safety net provided by this patch is to bar removenode of a node which is not really down (e.g. some nodes think it is UP).
One more comment about detection. I have been advocating earlier that graceful state changes (a node going up or down) should be reflected in group0, and that the group0 view of the node state should trump the gossiper view of it. This issue, specifically providing a predictable and safe user experience when trying to perform a topology change on a down node, is exactly the reason we need it. As you can see, the optimal user experience is not as straightforward as this patch suggests.
The utmost simplification would be automating topology changes using raft.
Agreed, but then again, I think this is something we need to consider with the new topology changes implementation over raft.
If nodetool status is unreliable, e.g. due to unstable state across shards, we should fix that.
Topology changes over raft will not be backported to existing releases.
Of course it is. If a node disappeared without a proper shutdown message, it either takes the convict threshold to mark it dead, or a way to fence it off the cluster if we do not want to wait. In fact, the latter is needed anyway, since there is no guarantee that the node is not just partitioned out or will not be booted later. So the only thing really needed is fencing a particular node off the cluster. With raft topology we can do it at the beginning of the replace operation. We need to add a fencing mechanism (not the same fencing as we are adding for requests, but a different kind of fencing that will not allow a fenced node to connect to the cluster at all). A rough sketch of this idea follows below.
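A rough sketch of this kind of node-level fencing, assuming each inbound connection handshake can be checked against a set of fenced host IDs recorded when the topology operation starts; all names here are illustrative, not the actual Scylla API:

```cpp
#include <mutex>
#include <set>
#include <string>

// Bars fenced nodes from connecting to the cluster at all.
class connection_gate {
    std::mutex _m;
    std::set<std::string> _fenced_hosts; // host IDs barred from the cluster
public:
    // Called when a replace/removenode operation begins, before any
    // data movement: after this, the victim can no longer rejoin.
    void fence(const std::string& host_id) {
        std::lock_guard<std::mutex> g(_m);
        _fenced_hosts.insert(host_id);
    }
    // Called on every new inbound connection handshake.
    bool admit(const std::string& host_id) {
        std::lock_guard<std::mutex> g(_m);
        return _fenced_hosts.count(host_id) == 0;
    }
};
```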
Repeating it will not make it true :) 1. UP/DOWN is a local view. 2. Storing it in group0 does not mean it will not change after it is queried. 3. To get up-to-date data from group0, a read barrier is required, and you do not want to do that on each request (which is how this info is accessed now). I probably forgot 3 more reasons.
You wish to keep it that way by all means, but that's your choice. First of all, if you know how failure conviction works, it is a global conviction, not a local one. The node is declared down when it is seen as down by the cluster and this knowledge has propagated to all nodes. This is what gossip is responsible for in the first place. So it's not a local view, it's a global view. But more importantly, it becomes a global view once you store the conviction result in group0. Once you store it there, you can no longer flip it back easily. You require the node to connect to the majority and change the majority's view of itself to be truly back online. Otherwise it's fenced. This is how you achieve the desired result: you cannot just flip a node back and forth on some nodes but not others; all node transitions, including up and down transitions, are linearized.
I just wrote above how it does.
Not requiring a read barrier was your requirement; neither I nor Avi share it. Your current patch already uses a clever trick to avoid unnecessary read barriers; it could be extended.
You didn't forget any good reason, because there are no good reasons not to keep it in group0. Besides, you missed reason #3 and jumped right to #4.
No. We use the direct pinger. The view is local. And even before that, the view was sort of local, since each node can see something different depending on the state of its connections. And we moved to the direct pinger because having a not-so-local view caused complete havoc as well.
Nothing here makes sense to me. For one, if a minority is partitioned away, it continues seeing the entire cluster as alive because it cannot write into group0. How can that make any sense?
You did not write anything like that. The moment you query something from group0, it may already have changed in group0.
Avoiding a read barrier on each request is not my requirement. It is a common-sense requirement. Node states are checked on each request. My patch uses a common trick, called opportunistic locking IIRC, in one place where it is possible. It is possible there because that state can change only once. That is not the case for a node's up/down state.
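For reference, a small self-contained sketch of the opportunistic (optimistic) locking trick being discussed: read a versioned value without a barrier, do the work, and commit only if the version is unchanged. The class and method names are illustrative, not Scylla code:

```cpp
#include <cstdint>
#include <mutex>
#include <utility>

// A value guarded by a version counter: readers take a cheap snapshot,
// and writers commit only if nobody changed the value since the
// snapshot was taken.
template <typename T>
class versioned_state {
    mutable std::mutex _m;
    uint64_t _version = 0;
    T _value{};
public:
    // Snapshot the value together with its version.
    std::pair<T, uint64_t> read() const {
        std::lock_guard<std::mutex> g(_m);
        return { _value, _version };
    }
    // Commit succeeds only if the state is unchanged since read();
    // otherwise the caller must retry or give up.
    bool commit_if_unchanged(uint64_t observed, const T& new_value) {
        std::lock_guard<std::mutex> g(_m);
        if (_version != observed) {
            return false;
        }
        _value = new_value;
        ++_version;
        return true;
    }
};
```

This is safe when the state can change at most once between read and commit; for a node's up/down state, which can flip back and forth, the commit would spuriously fail or require unbounded retries, which is the objection raised above.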
Well, I probably forgot reason 3. But those that I listed are enough.
Well, maybe the havoc was caused by not implementing an epidemic failure detector correctly, which has theoretically proven convergence in logarithmic time? In any case, the fact that our failure detection is broken doesn't mean everything else should be broken as well; we need to fix the failure detection instead.
Even if it is the local view of the topology coordinator, it is sufficient to provide linearizability of state transitions. And that's exactly what the goal is here.
A minority that is partitioned away may see the old topology. UP or DOWN states are part of the topology. How can one part make sense to you and the other not? Besides, we have an ambiguous definition of the word "see". I do not suggest replacing the failure detector running locally. I am suggesting augmenting its information with the information from group0. And I suggest that topology operations should be based on the information stored in group0. Query routing can continue to be based on the gossiper, or take group0 information into account as well. You for some reason think that if we switch the storage of UP and DOWN states to group0, not just topology operations but routing operations will have to switch as well; I did not suggest that.
First, why do you want to force the coordinator's view on the entire cluster? And second, even that is not possible, since if there is a network partition, the smaller partition will never get the update and will see the entire cluster as alive (?!).
UP/DOWN is not topology. It is node state. You are mixing those, and that is why you propose something that does not make sense. The minority sees all nodes as alive and continues sending data to them, which causes the data to pile up until it times out. Nodes have to see the other nodes they can communicate with, regardless of which side of the partition they are on.
Then you need to define what this up/down of yours means. Rename it to make some sense. Explain how and for what exactly you propose to use it. Otherwise the conversation is meaningless. UP/DOWN have a well-defined meaning that does not make sense the way you use it.
Then define what exactly you are suggesting, without using well-established terms incorrectly.
Failure detection is unreliable; you can never "trust" it. You might think that a node is dead even though it's really alive. That's why you never depend on failure-detection information to ensure safety properties. You only use it to improve liveness. I really recommend the paper that introduced the theory of failure detectors; it's one of the best papers in Computer Science in my opinion. Now, the whole idea of checking that a node is DOWN before deciding to replace it or remove it from the cluster is of course broken, I agree with that. It is an attempt to rely on failure detection for safety, which can never work. But we already have this broken design, inherited from Cassandra. We can only make the user experience a bit better. The real correct solution, as you all noticed, is to make the remove/replace operation safe regardless of the node being dead or alive (by using fencing).
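A toy sketch of request-level fencing, the complementary mechanism to the node-level fencing sketched earlier: each request carries the topology version it was planned under, and replicas reject requests older than the topology they know about. Everything below is illustrative, not the actual Scylla implementation:

```cpp
#include <atomic>
#include <cstdint>

struct request {
    uint64_t topology_version; // version the coordinator planned under
    // ... payload ...
};

class replica {
    std::atomic<uint64_t> _current_version{0};
public:
    // Bumped when a topology change (e.g. a replace) commits in group0.
    void apply_topology(uint64_t v) {
        _current_version.store(v, std::memory_order_release);
    }
    // A node that was replaced keeps issuing requests tagged with a
    // stale version; they are rejected instead of corrupting data,
    // regardless of whether the failure detector thought it was dead.
    bool accept(const request& r) const {
        return r.topology_version >=
               _current_version.load(std::memory_order_acquire);
    }
};
```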
@gleb-cloudius and @kbr-scylla I will respond to both of your comments. Gleb has asked me to define the UP and DOWN states, what they are, and how I plan to use them. They are additional information used by the topology coordinator exactly to provide some user-friendliness in cases like the one described above. I do not suggest using UP or DOWN information on any path that is related to liveness or safety of the coordinator algorithm, but having this information available will make the algorithm's choices more user-friendly.
@kostja but please do not call them UP/DOWN. Those terms are taken and already have their meaning in Scylla.
What's the latest here? Who owns the next action item to make progress? |
Marked as enhancement as #12862 was already fixed in another way. |