
Declare readiness only if the node and its peers consider at most one node not-UN #1108

Closed
wants to merge 1 commit

Conversation

@rzetelskik (Member) commented Nov 30, 2022

Description of your changes:
Currently our readiness probe only checks whether the node considers itself UN. We've hit issues related to it many times. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain. For example, it's quite easy to hit the following sequence:

  1. Entire 3 node cluster up: {1: {1: UN, 2: UN, 3: UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
  2. Restart node 3 and wait for it to come back up and declare readiness.
  3. Node 3 is up but the cluster has a split-brain: {1: {1: UN, 2: UN, 3: DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
  4. Restart node 2: {1: {1: UN, 2: DN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}

At this point you'll get errors from queries with quorum consistency regardless of which node you hit.

So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change.
Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.

This PR implements an approach suggested by @tnozicka: each node declares readiness when it and its peers consider at most one (and the same) node not-UN.
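
For illustration, a minimal Go sketch of this criterion (the types and names are hypothetical, not the actual code in this PR): each node's view of the cluster is reduced to which hosts it sees as UN, and readiness requires that across all views at most one host is not-UN and that it is the same host in every view.

```go
// Hypothetical sketch of the readiness criterion described above; HostStatus
// and ClusterReadyForTopologyChange are illustrative names, not this PR's code.

// HostStatus maps a node address to whether that node is seen as UN.
type HostStatus map[string]bool

// ClusterReadyForTopologyChange returns true when, across the local node's view
// and every peer's view, at most one node is not-UN and it is the same node in
// every view.
func ClusterReadyForTopologyChange(views []HostStatus) bool {
	notUN := "" // the single node allowed to be not-UN, once observed
	for _, view := range views {
		for host, un := range view {
			if un {
				continue
			}
			if notUN == "" {
				notUN = host
			} else if notUN != host {
				// Two different nodes reported as not-UN: potential split brain.
				return false
			}
		}
	}
	return true
}
```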

Which issue is resolved by this Pull Request:
Resolves #1077

@rzetelskik rzetelskik force-pushed the topology_probe branch 3 times, most recently from 5610c31 to 5fbd0df on November 30, 2022 17:21
@rzetelskik rzetelskik changed the title [WIP] Modify readiness probe to wait for N-1 UP NORMAL nodes [WIP] Modify readiness probe to declare readiness when the node and its peers consider at most one node not-UN Dec 1, 2022
@rzetelskik rzetelskik changed the title [WIP] Modify readiness probe to declare readiness when the node and its peers consider at most one node not-UN [WIP] Declare readiness only if the node and its peers consider at most one node not-UN Dec 1, 2022
@rzetelskik rzetelskik force-pushed the topology_probe branch 3 times, most recently from 43c6a81 to 2935c85 on December 1, 2022 15:52
@rzetelskik rzetelskik changed the title [WIP] Declare readiness only if the node and its peers consider at most one node not-UN Declare readiness only if the node and its peers consider at most one node not-UN Dec 1, 2022
@rzetelskik rzetelskik added the kind/bug Categorizes issue or PR as related to a bug. label Dec 1, 2022
@rzetelskik rzetelskik added this to the v1.8 milestone Dec 1, 2022
@rzetelskik rzetelskik marked this pull request as ready for review December 1, 2022 15:54
@rzetelskik rzetelskik added kind/bug Categorizes issue or PR as related to a bug. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 1, 2022
@rzetelskik (Member, Author)

Generally we should also try moving away from the current implementation of scyllaClient.Status, which makes 5 calls to the Scylla API, and use /failure_detector/endpoints instead. Looks like it would require some ugly parsing, but it would also let us resolve #769 (as suggested here: scylladb/scylladb#8275 (comment)). @tnozicka lmk what you think and if it should land separately or in this PR.
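
For reference, a hedged sketch of what a single call to that endpoint could look like (the listen address, the port 10000 assumption, and the raw-body handling are illustrative; the "ugly parsing" of the returned endpoint states is deliberately left out):

```go
package scyllaclient

import (
	"fmt"
	"io"
	"net/http"
)

// fetchFailureDetectorEndpoints returns the raw response body of
// /failure_detector/endpoints from the node's Scylla API (assumed on port 10000).
func fetchFailureDetectorEndpoints(host string) ([]byte, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s:10000/failure_detector/endpoints", host))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```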

@mykaul (Contributor) commented Dec 1, 2022

@tnozicka - let's update with today's meeting discussion on what the right way to do it is.
Thanks.

@vladzcloudius

@rzetelskik FYI: the UN check is not good enough, because what you really care about is UN and the node listening on the CQL port. And AFAIR our bootup code allows the latter only after the former.

Hence, if you only wait for the gossip state (which is what UN is about) before restarting the following node, you allow a race where 2 nodes will be unreachable to clients.

The solution we use in our Ansible Role is probing the CQL (TCP) port instead of the gossip state. And implementing this is trivial in Ansible, which I suspect should also be the case in K8S.

@asias Please, correct me if I'm missing something in the above.
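
For context, a minimal sketch of the TCP-probe idea mentioned above (assuming the default CQL port 9042 and an arbitrary timeout; this is not the operator's code, just roughly the kind of check described for the Ansible Role):

```go
package probe

import (
	"net"
	"time"
)

// cqlPortOpen reports whether something is accepting TCP connections on the
// node's CQL port.
func cqlPortOpen(host string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "9042"), 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```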

@tnozicka (Member) commented Dec 2, 2022

I guess we need to check CQL in addition to UN, but we need UN, most importantly from the other nodes, not just the node itself. After talking with Avi we have a go-ahead to implement the N^2 algorithm where a node considers itself ready only if a quorum of nodes considers it ready. As the quorum depends on the current state of keyspaces and we don't have a way to enforce a minimal quorum, the quorum is always N-1 for us.
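
A rough sketch of that criterion, assuming the quorum degenerates to N-1 as described (i.e. every other node must report this node as UN); the function and its inputs are illustrative only:

```go
// readyPerPeers reports readiness of the local node given, for each peer,
// whether that peer currently sees the local node as UN.
func readyPerPeers(peerSeesSelfUN map[string]bool) bool {
	for _, un := range peerSeesSelfUN {
		if !un {
			// With the quorum fixed at N-1, no peer may see us as not-UN.
			return false
		}
	}
	return true
}
```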

@tnozicka (Member) commented Dec 2, 2022

(e.g. all the nodes can serve CQL and not be able to talk to each other, serving only C_ONE while we need to have C_QUORUM)

@tnozicka (Member) commented Dec 2, 2022

@tnozicka lmk what you think and if it should land separately or in this PR.

  • separately
  • whether to do it now or plan it depends on a time estimate - for now, I can live with N==5 for something that grows O(N^2)

@vladzcloudius

(e.g. all the nodes can serve CQL and not be able to talk to each other, serving only C_ONE while we need to have C_QUORUM)

This should not be able to happen - a node will have to at least be able to contact seeders, which are not itself.
And what if the cluster IS partitioned? You won't allow a rolling restart?
What if your cluster is messed up, multiple nodes crashed, and to bring them back you need to execute a RR?

The bottom line: it does make sense to make sure that the restarting node is UN from the other nodes' perspective; however, in certain cases this should not be a hard requirement for a RR.

And, still, you need to check the CQL port in addition to the UN check.

@vladzcloudius commented Dec 2, 2022

I guess we need to check CQL in addition to UN, but we need UN, most importantly from the other nodes, not just the node itself. After talking with Avi we have a go-ahead to implement the N^2 algorithm where a node considers itself ready only if a quorum of nodes considers it ready. As the quorum depends on the current state of keyspaces and we don't have a way to enforce a minimal quorum, the quorum is always N-1 for us.

The more I think about what you wrote above, the more I think it's simply wrong to implement this on the Operator level, @tnozicka.

Let's start with the last thing you wrote - you are not going to allow the next node to restart until ALL nodes (except for this node, that is) consider this node UN.

This is definitely going to backfire in the cases I mentioned above.
So, it's already problematic.

But let's take a step back for a moment and recall what problem you want to avoid with this PR according to the opening message: a "split brain" situation, in order to allow QUORUM queries. Why QUORUM? Why not ALL, or LOCAL_QUORUM, or LOCAL_ONE, or EACH_QUORUM? I think this distinction is... strange because there are too many assumptions:

  • For RF<3 even a single node DOWN is not going to allow QUORUM.
  • Who said the user cares about the CL=QUORUM?

A more understandable distinction is that we want to ensure that there is not more than a single node DOWN during a rolling restart. Why? Because this is the best we can do. And it has nothing to do with a particular CL.

So it all comes down to restarting the next node only after the current node is fully operational - including joining the ring.

But isn't this ("joining the ring" part) the problem Raft is meant to resolve? - And, surprise, it IS!
And we know that this is not trivial. So, what makes you think you will be able to resolve it (in its completeness!) on the Operator level?

You may THINK you did, but I've just shown you a few examples which you probably missed and which are not going to be covered by your "N-1" idea.

I think this is mathematically impossible unless you give up on high availability or implement Raft on the Operator level and I really hope this is not what you want to do.

Clearly this has to be resolved on the scylla level (and I believe it's already done, since according to @kostja Raft is fully operational) and the nodes should not move to the UN state until they have properly joined the cluster.

And I believe this is the direction, right, Kostja?

If "yes" then no need to re-invent the Raft wheel and you'd better follow my original proposal (probe the TCP port) which is going to simplify your solution significantly.

@rzetelskik (Member, Author) commented Dec 5, 2022

@vladzcloudius thanks for the input, I think you've raised quite a few valid points. Let me try and address some of them.

FYI: the UN check is not good enough, because what you really care about is UN and the node listening on the CQL port. And AFAIR our bootup code allows the latter only after the former.

That's true and in the current implementation of the readiness probe we are checking for is_native_transport_running. It's racy, it doesn't guarantee that CQL server is up and we should move away from it - we have an issue for it and I've linked it in one of my previous comments (issue: #769, approach suggested by Asias: scylladb/scylladb#8275 (comment)). However, getting an OK from is_native_transport_running means that the same prerequisites were met as when you're probing the CQL port - specifically, what we care about here - the gossip has settled.

This should not be able to happen - a node will have to at least be able to contact seeders, which are not itself.

Which doesn't prevent the scenario mentioned above, does it?

But let's take a step back for a moment and recall what problem you want to avoid with this PR according to the opening message: a "split brain" situation, in order to allow QUORUM queries. Why QUORUM?

I believe the confusion is coming from the issue that this PR originated from - #1077.

What you've written right below

A more understandable distinction is that we want to ensure that there is not more than a single node DOWN during a rolling restart. Why? Because this is the best we can do. And it has nothing to do with a particular CL.
So it all comes down to restarting the next node only after the current node is fully operational - including joining the ring.

is what this PR is trying to achieve, i.e. for our readiness probes to only allow for one non-UN node in the entire cluster, but according to all the nodes. So the readiness probe of each node is going to state whether the entire cluster is ready for a topology change, according to this node.

But isn't this ("joining the ring" part) the problem Raft is meant to resolve? - And, surprise, it IS!

I believe we're all aware of that. The thing is this issue breaks HA during updates, upgrades, scaling etc. And afaik 5.1 does not yet handle topology changes with raft. I'd personally consider these changes only a temporary workaround until the support for topology changes is released.

@vladzcloudius

That's true and in the current implementation of the readiness probe we are checking for is_native_transport_running. It's racy, it doesn't guarantee that CQL server is up and we should move away from it - we have an issue for it and I've linked it in one of my previous comments (issue: #769, approach suggested by Asias: scylladb/scylladb#8275 (comment)).

@rzetelskik This is not supposed to be racy, and scylla is not supposed to listen on the CQL port before the node considers itself UN. If that's not the case, I think it's a bug in scylla, and that is where it's supposed to be fixed - not here.

However, getting an OK from is_native_transport_running means that the same prerequisites were met as when you're probing the CQL port - specifically, what we care about here - the gossip has settled.

Again, the CQL port is not supposed to become available until gossip has settled and Operator (or an Ansible Role for the matter) is not supposed to be the place to ensure that - scylla is.

A more understandable distinction is that we want to ensure that there is not more than a single node DOWN during a rolling restart. Why? Because this is the best we can do. And it has nothing to do with a particular CL.
So it all comes down to restarting the next node only after the current node is fully operational - including joining the ring.

is what this PR is trying to achieve, i.e. for our readiness probes to only allow for one non-UN node in the entire cluster, but according to all the nodes. So the readiness probe of each node is going to state whether the entire cluster is ready for a topology change, according to this node.

Yes, and on the way it breaks HA - see my previous comments on the matter.

But isn't this ("joining the ring" part) the problem Raft is meant to resolve? - And, surprise, it IS!

I believe we're all aware of that. The thing is this issue breaks HA during updates, upgrades, scaling etc. And afaik 5.1 does not yet handle topology changes with raft. I'd personally consider these changes only a temporary workaround until the support for topology changes is released.

My point was that you are not going to be able to resolve this issue on the Operator level and as a result you are always going to either break the HA or break some other corner case - therefore you shouldn't even try.

Requiring all nodes to be up and not allowing any partitioning during a rolling restart is too harsh a requirement. See more examples of why it is terrible here: #1108 (comment)

In short my points were:

  1. Don't try to solve the scylla problem on the Operator level.
  2. Use the correct probing criteria - checking for UN is the wrong one because it's not the last thing that happens in the initialization path (see the @asias comment you referenced above).

@rzetelskik rzetelskik modified the milestones: v1.8, v1.9 Jan 4, 2023
@tnozicka (Member)

I'll try to explain a bit more, but this thread has branched a bit, so if you feel a particular point was missed I am happy to get back to it. Also, we have talked about the readiness probe a lot this week in the operator team with the intention to unstick the fix / adjust its direction, so even some of our previous views written here might have changed.

We have gone through various approaches and reasoning, but what we ended up with is that, for the purposes of the scylla operator and k8s, the readiness needs to mean "I am up and I am part of the quorum (aka N/2+1)". This primarily comes from a need to tolerate voluntary disruptions like rolling restarts of Kubernetes nodes, where we don't have any control except using a PDB, contrary to, say, a topology or any other change where we can instrument an action. During a Kubernetes node upgrade / rolling restart we need to maintain the ability for users to use scylladb with at least consistency quorum. The only way to do that is using a PDB (currently max 1 unavailable for RF=3), and PDBs are bound to readiness - so a pod may become ready only if it is able to serve CL_QUORUM, or the next one could go down too soon and we'd break the invariant.

So the /readyz algorithm will be something like:

  1. check you are UN
  2. check what everyone else thinks about you
  3. if SUM(UN) > N/2 => ready
    with raft:
  4. ask the local node to ask the leader whether you are part of the quorum
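
A sketch of the pre-raft part of this flow (steps 1-3); how the local state and the peers' views are obtained is left abstract, and the names are hypothetical:

```go
// readyz: ready iff the local node is UN and a strict majority of the
// cluster's nodes (including itself) consider it UN.
func readyz(selfUN bool, peersConsiderSelfUN []bool, clusterSize int) bool {
	if !selfUN {
		return false // step 1: we must consider ourselves UN
	}
	unVotes := 1 // our own vote
	for _, un := range peersConsiderSelfUN {
		if un {
			unVotes++ // step 2: collect what everyone else thinks about us
		}
	}
	return unVotes > clusterSize/2 // step 3: SUM(UN) > N/2
}
```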

For voluntary topology changes we won't use readiness but rather instrument a hook that checks that all the nodes in the cluster are UN - if it were to be done before raft it would mean "everyone thinks about everyone that they are UN".

Regarding a rolling restart, that gets blocked in some cases even today. E.g. StatefulSets won't trigger an update when there are broken nodes, so this PR won't change that. Then obviously if a quorum of nodes goes down with the new readiness approach, all the rest will go unready as well. The question that seems separate from this PR would be whether a rolling restart is required as an unstuck procedure - there are differences like some of the configs being auto reloaded, failed containers auto restart and there are other options like kubectl delete pods with cluster selector (it's not rolling but those pods are probably not working either).

@vladzcloudius

I'll try to explain a bit more, but this thread has branched a bit, so if you feel a particular point was missed I am happy to get back to it. Also, we have talked about the readiness probe a lot this week in the operator team with the intention to unstick the fix / adjust its direction, so even some of our previous views written here might have changed.

We have gone through various approaches and reasoning, but what we ended up with is that, for the purposes of the scylla operator and k8s, the readiness needs to mean "I am up and I am part of the quorum (aka N/2+1)". This primarily comes from a need to tolerate voluntary disruptions like rolling restarts of Kubernetes nodes, where we don't have any control except using a PDB, contrary to, say, a topology or any other change where we can instrument an action. During a Kubernetes node upgrade / rolling restart we need to maintain the ability for users to use scylladb with at least consistency quorum. The only way to do that is using a PDB (currently max 1 unavailable for RF=3), and PDBs are bound to readiness - so a pod may become ready only if it is able to serve CL_QUORUM, or the next one could go down too soon and we'd break the invariant.

So the /readyz algorithm will be something like:

1. check you are UN

2. check what everyone else thinks about you

3. if SUM(UN) > N/2 => ready
   with raft:

4. ask the local node to ask the leader whether you are part of the quorum

For voluntary topology changes we won't use readiness but rather instrument a hook that checks that all the nodes in the cluster are UN - if it were to be done before raft it would mean "everyone thinks about everyone that they are UN".

Regarding a rolling restart, that gets blocked in some cases even today. E.g. StatefulSets won't trigger an update when there are broken nodes, so this PR won't change that. Then obviously if a quorum of nodes goes down with the new readiness approach, all the rest will go unready as well. The question that seems separate from this PR would be whether a rolling restart is required as an unstuck procedure - there are differences like some of the configs being auto reloaded, failed containers auto restart and there are other options like kubectl delete pods with cluster selector (it's not rolling but those pods are probably not working either).

@tnozicka You seem to have ignored a lot of what I wrote before. Here (#1108 (comment)) in particular.

This "readiness" check is not supposed to be used in a RR of a healthy cluster according to a description (if it is - please, update the description).

And as such it's supposed to work and not cause "dead-locks" in any possible situation.

Please, explain to me how you will be able to start 7 out of 9 Scylla PODs if they crash using the algorithm you've described?

@tnozicka (Member)

@tnozicka You seem to have ignored a lot of what I wrote before. Here (#1108 (comment)) in particular.
This "readiness" check is not supposed to be used in a RR of a healthy cluster according to a description (if it is - please, update the description).

I've tried to take a bit of a fresh look - for the record, it wasn't meant to discard anyone's input, but it was hard to reply to everything given that the conversation has branched and the input (algorithm / our approach) has changed as well.

The PR title and description are outdated, and before we reach a consensus I'd prefer we focus on the proposed solution described in #1108 (comment). When we have that consensus let's update the PR - it felt better not to oscillate the PR itself, but I don't feel that strongly about it. (e.g. PR description / title updates retrigger the CI every time [about 4h of compute time].)

And as such it's supposed to work and not cause "dead-locks" in any possible situation.

I don't think it does. It may block k8s node upgrades; for rolling updates we still have control. Even today the "single node down" blocks a rolling update / restart. It doesn't mean there is a "deadlock" - as I've mentioned in #1108 (comment), there are other ways to deal with it.

Please, explain to me how you will be able to start 7 out of 9 Scylla PODs if they crash using the algorithm you've described?

The 9 pods are started and restarted (on failure) independently and readiness is not used as a precondition to join the cluster.

Also I've been thinking a bit more about the algorithm I described over the weekend and while I still think it would work if nodes were what determines the quorum, with scylladb that's actually the keyspace. With many keyspaces scattered over the nodes this may become challenging. I guess in the worst case the keyspaces can be distributed over the nodes in a way that you can only ever take down 1 node even with RF>=5. So I think we need to come up with something better. I am also happy to invite you for the next arch session we have on the topic, if you'd be interested.
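
To illustrate the keyspace point with a back-of-the-envelope formula (a rough sketch, not operator code): a single keyspace with replication factor rf tolerates only rf - (rf/2 + 1) replica failures at CL=QUORUM, and because different keyspaces may place replicas on different nodes, the set of nodes that can be down simultaneously is constrained by every keyspace at once, which in the worst case shrinks to a single node.

```go
// Illustrative helper only: how many replica failures a single keyspace with
// replication factor rf can tolerate at CL=QUORUM.
func quorumFaultTolerance(rf int) int {
	quorum := rf/2 + 1 // e.g. rf=3 -> 2, rf=5 -> 3
	return rf - quorum // e.g. rf=3 -> 1, rf=5 -> 2, rf=2 -> 0
}
```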

@vladzcloudius

The 9 pods are started and restarted (on failure) independently and readiness is not used as a precondition to join the cluster.

I wasn't talking about bootstrapping, @tnozicka.
I was referring to the situation when 7 out of 9 PODs crashed.
Then you fixed the reason for the crash (including the nodes that didn't crash) and now want to execute a RR procedure.
Your algorithm would not allow that.

Also I've been thinking a bit more about the algorithm I described over the weekend and while I still think it would work if nodes were what determines the quorum, with scylladb that's actually the keyspace. With many keyspaces scattered over the nodes this may become challenging. I guess in the worst case the keyspaces can be distributed over the nodes in a way that you can only ever take down 1 node even with RF>=5. So I think we need to come up with something better. I am also happy to invite you for the next arch session we have on the topic, if you'd be interested.

Indeed.
On top of that - "QUORUM" requirement in the context of CQL availability is synthetic (not generic). You, guys, invented this requirement based on not clear what and now work hard trying to achieve that.

CQL availability is completely defined by CLs that are used in a specific use case.
And QUORUM is neither the only one nor the most commonly used one.

@tnozicka (Member)

I wasn't talking about bootstrapping, @tnozicka.
I was referring to the situation when 7 out of 9 PODs crashed.
Then you fixed the reason for the crash (including the nodes that didn't crash) and now want to execute a RR procedure.
Your algorithm would not allow that.

I wasn't talking about bootstrapping either. As I've mentioned in the comment before, rolling restart may not be the only way out, and it isn't possible even today. So you'd, say, kubectl delete the DN pods and kube would recreate them. It also technically doesn't have to depend on the readiness - for voluntary operations we have options beyond readiness probes.

You, guys, invented this requirement based on not clear what and now work hard trying to achieve that.

I think the ability to sustain C_QUORUM is an important invariant we have to keep so the apps keep working correctly. I wouldn't say we have invented it - I recall this being mentioned by people higher in the chain than me. Making updates not disruptive is important for the users as well, I believe that's what scylla cloud and our docs try to achieve too. It's just that in the presence of all the keyspace combinations, the intersection of those internal quorums is N-1 instead of N/2-1, which is what the docs say, except it doesn't say how to reliably make sure the new node is UN (+ considered UN by the others).

@vladzcloudius commented Mar 13, 2023

I think the ability to sustain C_QUORUM is an important invariant we have to keep so the apps keep working correctly.

Which part of CQL protocol defines this "important invariant"?

Making updates not disruptive is important for the users as well, I believe that's what scylla cloud and our docs try to achieve too.

Updates with CL=ONE, LOCAL_ONE, LOCAL_QUORUM in a multi-DC config, TWO do not require that strong guarantee you are trying to achieve. So, again, could you clarify what you are talking about exactly and which docs you refer to? AFAIK none of our docs specify that "only CL=QUORUM is going to guarantee whatever".

I wasn't talking about bootstrapping either.

"join the cluster" usually means bootstrapping in Cassandra/Scylla lingo. ;)

As I've mentioned in the comment before, rolling restart may not be the only way out, and it isn't possible even today. So you'd, say, kubectl delete the DN pods and kube would recreate them. It also technically doesn't have to depend on the readiness.

How come? How will you define that the node has started successfully? You have to have a corresponding condition.

You, guys, invented this requirement based on not clear what and now work hard trying to achieve that.
I wouldn't say we have invented it - I recall this being mentioned by people higher in the chain than me.

Are you suggesting that "people higher in the chain than me" cannot be mistaken? ;)
In any case - it's you who's representing a specific point of view claiming some not-very-obvious assumptions are "invariants". Own it! ;)

In any case, this conversation began going in circles.
I suggest we have a call and discuss this if you want.

My point stands as it was before: checking UN on a node that just started and then checking the CQL port is a procedure that is more than enough. It's the DB's responsibility to ensure uninterrupted service when a single node is being restarted. A node starting to listen on the CQL port denotes its readiness to serve CQL requests. If those requests are going to fail, it's a DB (Scylla) bug and you can't and should not fix it on the Operator level.

Any other complications you are trying to introduce are (1) not obvious / baseless and (2) not supposed to be implemented on the Operator level.

@tnozicka (Member)

Updates with CL=ONE, LOCAL_ONE, LOCAL_QUORUM in a multi-DC config, TWO do not require that strong guarantee you are trying to achieve. So, again, could you clarify what you are talking about exactly and which docs you refer to? AFAIK none of our docs specify that "only CL=QUORUM is going to guarantee whatever".

Apologies for the confusion, as we only do single DC with the operator I was technically talking about CL=LOCAL_QUORUM. Would you agree that's an invariant we should maintain?

Which part of CQL protocol defines this "important invariant"?

Not the protocol, but saying your cluster is HA with RF>=3 kind of implies it from the user's point of view.

How come? How will you define that the node has started successfully? You have to have a corresponding condition.

We can have other conditions / statuses. The only thing that is strictly bound to readiness is PDBs. While we are technically able to not use readiness for it, we'd try to avoid that, if possible.

You, guys, invented this requirement based on not clear what and now work hard trying to achieve that.
I wouldn't say we have invented it - I recall this being mentioned by people higher in the chain than me.

Are you suggesting that "people higher in the chain than me" cannot be mistaken? ;)
In any case - it's you who's representing a specific point of view claiming some not-very-obvious assumptions are "invariants". Own it! ;)

I don't think that's what I said. I've said that people beyond my team shared that view. Any and all of us can be mistaken ;)

In any case, this conversation began going in circles.
I suggest we have a call and discuss this if you want.
My point stands as it was before: checking UN on a node that just started and then checking the CQL port is a procedure that is more than enough. It's the DB's responsibility to ensure uninterrupted service when a single node is being restarted. A node starting to listen on the CQL port denotes its readiness to serve CQL requests. If those requests are going to fail, it's a DB (Scylla) bug and you can't and should not fix it on the Operator level.

I am happy to have the call, but I feel like we'd need an external opinion on whether a rolling restart should keep CL=LOCAL_QUORUM working all the time. That would be a good starting point to talk about the right way to solve it.

In the meantime, I've also filed scylladb/scylladb#13156 to have a discussion on the procedure that's independent of scylla-operator, e.g. scylla cloud is affected in the same way.

@vladzcloudius

Updates with CL=ONE, LOCAL_ONE, LOCAL_QUORUM in a multi-DC config, TWO do not require that strong guarantee you are trying to achieve. So, again, could you clarify what you are talking about exactly and which docs you refer to? AFAIK none of our docs specify that "only CL=QUORUM is going to guarantee whatever".

Apologies for the confusion, as we only do single DC with the operator I was technically talking about CL=LOCAL_QUORUM. Would you agree that's an invariant we should maintain?

Of course not. My argument is not multi-DC dependent.
And you haven't explained which "docs" this "invariant" is based on. I guess this is because there aren't any. Exactly as I suspected and how it's supposed to be.

In any case, this conversation began going in circles.
I suggest we have a call and discuss this if you want.
My point stands as it was before: checking UN on a node that just started and then checking the CQL port is a procedure that is more than enough. It's the DB's responsibility to ensure uninterrupted service when a single node is being restarted. A node starting to listen on the CQL port denotes its readiness to serve CQL requests. If those requests are going to fail, it's a DB (Scylla) bug and you can't and should not fix it on the Operator level.

I am happy to have the call, but I feel like we'd need an external opinion on whether a rolling restart should keep CL=LOCAL_QUORUM working all the time. That would be a good starting point to talk about the right way to solve it.

In the meantime, I've also filed scylladb/scylladb#13156 to have a discussion on the procedure that's independent of scylla-operator, e.g. scylla cloud is affected in the same way.

Your call. I've already commented on the GH issue above. We can continue there.

In general, like I said before, you are ignoring my arguments, and this makes this discussion very counterproductive and cyclic.

@tnozicka tnozicka removed this from the v1.9 milestone Aug 16, 2023
@scylla-operator-bot (Contributor)

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke | c0b4e54 | link | true | /test e2e-gke |
| ci/prow/docs | c0b4e54 | link | true | /test docs |
| ci/prow/verify-deps | c0b4e54 | link | true | /test verify-deps |
| ci/prow/verify | c0b4e54 | link | true | /test verify |
| ci/prow/images | c0b4e54 | link | true | /test images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rzetelskik (Member, Author)

Closing as stale until there's any progress in scylladb/scylladb#13156

@rzetelskik rzetelskik closed this Aug 17, 2023