Declare readiness only if the node and its peers consider at most one node not-UN #1108
Conversation
Generally we should also try moving away from the current implementation of
@tnozicka - let's update with today's meeting discussion on what the right way to do it is.
@rzetelskik FYI: Hence waiting only for a gossip state is not sufficient. The solution we use in our Ansible Role is probing the CQL (TCP) port instead of the gossip state. And implementing this is trivial in Ansible, which I suspect should also be the case in K8S. @asias Please correct me if I'm missing something in the above.
I guess we need to check CQL in addition to UN, but we need UN most importantly from the other nodes, not the node itself. After talking with Avi we have a go-ahead to implement the N^2 algorithm, where a node considers itself ready only if a quorum of nodes considers it ready. Since quorum depends on the current state of keyspaces and we don't have a way to enforce a minimal quorum, the quorum is always N-1 for us.
(e.g. all the nodes can serve CQL and not be able to talk to each other, serving only C_ONE while we need to have C_QUORUM)
This should not be possible - a node has to at least be able to contact the seed nodes, which are not itself. The bottom line: it does make sense to make sure that the restarting node is UN from the other nodes' perspective; however, in certain cases this should not be a hard requirement for a RR. And, still, you need to check the CQL port in addition to the UN check.
The more I think of what you wrote above, the more I think it's simply wrong to implement this on the Operator level, @tnozicka. Let's start with the last thing you wrote - you are not going to allow the next node restart until ALL nodes (except for this node, that is) consider this node as UN. This is definitely going to backfire in the cases I mentioned above. But let's take a step back for a moment and recall the problem you want to avoid with this PR, according to the opening message: a "split brain" situation, in order to allow QUORUM queries. Why QUORUM? Why not ALL, or LOCAL_QUORUM, or LOCAL_ONE, or EACH_QUORUM? I think this distinction is... strange because there are too many assumptions:
A more understandable distinction is that we want to ensure that there is not more than a single node DOWN during a rolling restart. Why? Because this is the best we can do. And it has nothing to do with a particular CL. So it all comes down to restarting the next node only after the current node is fully operational - including joining the ring. But isn't this ("joining the ring" part) the problem Raft is meant to resolve? - And, surprise, it IS!

You may THINK you did, but I've just shown you a few examples which you probably missed and which are not going to be covered by your "N-1" idea. I think this is mathematically impossible unless you give up on high availability or implement Raft on the Operator level, and I really hope this is not what you want to do. Clearly this has to be resolved on the scylla level (and I believe it's already done, since according to @kostja Raft is fully operational) and the nodes should not move to … And I believe this is the direction, right, Kostja? If "yes" then there is no need to re-invent the wheel.
@vladzcloudius thanks for the input, I think you've raised quite a few valid points. Let me try and address some of them.
That's true and in the current implementation of the readiness probe we are checking for
Which doesn't prevent it from the scenario mentioned above, does it?
I believe the confusion is coming from the issue that this PR originated from - #1077. What you've written right below
is what this PR is trying to achieve, i.e. for our readiness probes to only allow for one non-UN node in the entire cluster, but according to all the nodes. So the readiness probe of each node is going to state whether the entire cluster is ready for a topology change, according to this node.
I believe we're all aware of that. The thing is, this issue breaks HA during updates, upgrades, scaling, etc. And afaik 5.1 does not yet handle topology changes with Raft. I'd personally consider these changes only a temporary workaround until support for topology changes is released.
@rzetelskik This is not supposed to be racy, and scylla is not supposed to listen on the CQL port before the node considers itself UN. If that's not the case, I think it's a bug in scylla, and that's where it's supposed to be fixed - not here.
Again, the CQL port is not supposed to become available until gossip has settled and Operator (or an Ansible Role for the matter) is not supposed to be the place to ensure that - scylla is.
Yes, and on the way it breaks HA - see my previous comments on the matter.
My point was that you are not going to be able to resolve this issue on the Operator level, and as a result you are always going to either break HA or break some other corner case - therefore you shouldn't even try. Requiring all nodes to be up and not allowing any partitioning during a rolling restart is too harsh a requirement. See more examples of why it is terrible here: #1108 (comment) In short, my points were:
I'll try to explain a bit more, but this thread has branched a bit, so if you feel a particular point was missed I am happy to get back to it. Also, we have talked about the readiness probe a lot this week in the operator team, with the intention to unstick the fix / adjust its direction, so even some of our previous views written here might have changed. We have gone through various approaches and reasoning, but what we ended up with is that, for the purposes of the scylla operator and k8s, readiness needs to mean "I am up and I am part of the quorum (aka N/2+1)". This primarily comes from the need to tolerate voluntary disruptions like rolling restarts of Kubernetes nodes, where we don't have any control except using a PDB, contrary to, say, a topology change or any other change where we can instrument an action. During a Kubernetes node upgrade / rolling restart we need to maintain the ability for users to use scylladb with at least consistency quorum. The only way to do that is using a PDB (currently max 1 unavailable for RF=3), and PDBs are bound to readiness - so a pod may become ready only if it is able to serve CL_QUORUM, or the next one could go down sooner and we'd break the invariant. So the /readyz algorithm will be something like:
For voluntary topology changes we won't use readiness but rather instrument a hook that checks that all the nodes in the cluster are UN - if it were to be done before Raft, it would mean "everyone thinks about everyone that they are UN". Regarding a rolling restart, that gets blocked in some cases even today. E.g. StatefulSets won't trigger an update when there are broken nodes, so this PR won't change that. Then, obviously, if a quorum of nodes goes down with the new readiness approach, all the rest will go unready as well. The question that seems separate from this PR is whether a rolling restart is required as an unstuck procedure - there are differences, like some of the configs being auto-reloaded and failed containers being auto-restarted, and there are other options like …
@tnozicka You seem to have ignored a lot of what I wrote before. Here (#1108 (comment)) in particular. This "readiness" check is not supposed to be used in a RR of a healthy cluster according to the description (if it is - please update the description). And as such it's supposed to work and not cause "dead-locks" in any possible situation. Please explain to me how you will be able to start 7 out of 9 Scylla PODs if they crash, using the algorithm you've described?
I've tried to take a bit of a fresh look; for the record, it wasn't meant to discard anyone's input, but it was hard to reply to all given that the conversation has branched and the input (algorithm / our approach) has changed as well. The PR title and description are outdated, and before we reach a consensus I'd prefer we focus on the proposed solution described in #1108 (comment). When we have that consensus, let's update the PR, but it felt better not to oscillate the PR itself - though I don't feel that strongly about it. (E.g. PR description / title updates retrigger the CI every time [about 4h of compute time].)
I don't think it does. It may block k8s node upgrades; for rolling updates we still have the control. Even today a "single node down" blocks a rolling update / restart. It doesn't mean there is a "dead-lock"; as I've mentioned in #1108 (comment), there are other ways to deal with it.
The 9 pods are started and restarted (on failure) independently, and readiness is not used as a precondition to join the cluster. Also, I've been thinking a bit more about the algorithm I described over the weekend, and while I still think it would work if nodes were what determined the quorum, with scylladb that's actually the keyspace. With many keyspaces scattered over the nodes this may become challenging. I guess in the worst case the keyspaces can be distributed over the nodes in a way that you can only ever take down 1 node, even with RF>=5. So I think we need to come up with something better. I am also happy to invite you to the next arch session we have on the topic, if you'd be interested.
I wasn't talking about bootstrapping, @tnozicka.
Indeed. CQL availability is completely defined by the CLs that are used in a specific use case.
I wasn't talking about bootstrapping either. As I've mentioned in the comment before, a rolling restart may not be the only way out, and it isn't always possible even today. So you'd say
I think the ability to sustain C_QUORUM is an important invariant we have to keep so the apps keep working correctly. I wouldn't say we have invented it; recall it was mentioned by people higher in the chain than me. Making updates non-disruptive is important for the users as well; I believe that's what scylla cloud and our docs try to achieve too. It's just that, in the presence of all the keyspace combinations, the intersection of those internal quorums is …
Which part of CQL protocol defines this "important invariant"?
Updates with CL=ONE, LOCAL_ONE, LOCAL_QUORUM in a multi-DC config, or TWO do not require the strong guarantee you are trying to achieve. So, again, could you clarify what exactly you are talking about and which docs you refer to? AFAIK none of our docs specify that "only CL=QUORUM is going to guarantee whatever".
"join the cluster" usually means bootstrapping in Cassandra/Scylla lingo. ;)
How come? How will you define that the node has started successfully? You have to have a corresponding condition.
Are you suggesting that "people higher in the chain than me" cannot be mistaken? ;) In any case, this conversation has begun going in circles. My point stands as it was before: checking UN on a node that has just started and then checking the CQL port is a procedure that is more than enough. It's the DB's responsibility to ensure uninterrupted service when a single node is being restarted. When a node starts listening on CQL, that denotes its readiness to serve CQL requests. If those requests are going to fail - it's a DB (Scylla) bug, and you can't and should not fix it on the Operator level. Any other complications you are trying to introduce are (1) not obvious / baseless and (2) not supposed to be implemented on the Operator level.
Apologies for the confusion, as we only do single DC with the operator I was technically talking about CL=LOCAL_QUORUM. Would you agree that's an invariant we should maintain?
I don't think that's what I said. I've said that people beyond my team shared that view. Any and all of us can be mistaken ;)
I am happy to have the call, but I feel like we'd need an external opinion on whether a rolling restart should keep CL=LOCAL_QUORUM working all the time. That would be a good starting point to talk about the right way to solve it. In the meantime, I've also filed scylladb/scylladb#13156 to have a discussion on the procedure that's independent of scylla-operator, e.g. scylla cloud is affected in the same way.
Of course not. My argument is not multi-DC dependent.
Your call. I've already commented on the GH issue above. We can continue there. In general, like I said before, you are ignoring my arguments, and this makes this discussion very counterproductive and cyclic.
Closing as stale until there's any progress in scylladb/scylladb#13156 |
Description of your changes:
Currently our readiness probe only checks whether the node considers itself UN. We've hit issues related to it many times. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain. For example, it's quite easy to hit the following sequence:
1. `{1: {1: UN, 2: UN, 3: UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}`
2. `{1: {1: UN, 2: UN, 3: DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}`
3. `{1: {1: UN, 2: DN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}`
At this point you'll get errors from queries with quorum consistency regardless of which node you hit.
So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change.
Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.
This PR implements an approach suggested by @tnozicka: each node declares readiness when it and its peers consider at most one (and the same) node not-UN.
Which issue is resolved by this Pull Request:
Resolves #1077