[KRaft] Mixed nodes cluster is not able to recover from misconfiguration affecting the whole cluster #9426

Open
scholzj opened this issue Dec 4, 2023 · 3 comments

scholzj (Member) commented Dec 4, 2023

Strimzi currently does not seem to be able to recover a mixed-nodes KRaft cluster when it runs into an initial misconfiguration. This can be easily reproduced with the following steps:

  1. Deploy a new KRaft mixed-nodes cluster with misconfigured node affinity.
    For example, set the affinity to a non-existent zone:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
               - matchExpressions:
                 - key: topology.kubernetes.io/zone
                   operator: In
                   values:
                   - zone-1
  2. All of the Pods should be stuck in the Pending state and the deployment should be failing. The reconciliation should fail after the 5-minute timeout because the pods never become ready.
  3. Fix the affinity by using the correct zone or by removing it (see the example after these steps).
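
For reference, a minimal sketch of the fix from step 3, assuming the cluster actually has nodes in a zone that really exists (the zone name zone-1a below is only a placeholder for whatever zone your cluster uses); removing the whole affinity block from the pod template works as well:

       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
               - matchExpressions:
                 - key: topology.kubernetes.io/zone
                   operator: In
                   values:
                   - zone-1a   # placeholder: use a zone that actually exists in the cluster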

Expected result:

  • The operator rolls the nodes and fixes the cluster

Actual result:

  • The operator tries to roll the first pod, but it never becomes ready and starts crash looping. Because of that, the operator never moves on to the other nodes.

Some notable points:

  • This might seem unimportant for a fresh deployment of a misconfigured cluster, as it can simply be deleted and redeployed with a correct configuration. The operator should also not normally let you bring all pods to Pending by misconfiguring the Kafka cluster, because it should not roll the next pod before the previous one becomes ready. However, there are situations where this can happen to existing clusters:
    • While the operator might not roll the next pod when the first one is Pending, the pods might be deleted by some unrelated disaster and the new node affinity would then be applied to all of them.
    • You might have configured the cluster correctly, but then lose a whole availability zone to some disaster and need to reschedule the pods.
  • There seems to be a workaround: manually deleting the pods.
  • This affects only mixed nodes. Dedicated controller nodes do not seem to be affected.
    • Does this indicate an issue in our readiness probes, where the mixed nodes are stopped by us?
    • Or is this a Kafka issue, where Kafka exits the mixed node without waiting for the quorum to form?
    • It seems to be the second: Kafka shuts down. So to fix this in Strimzi only, we would need to detect situations like this and roll all controllers. A fix in Kafka would certainly be nicer.
  • Most production clusters are expected to have a dedicated controller quorum, so they should not be affected (a sketch of such a topology follows this list). There is also a reasonably simple workaround. So this might not be a blocker to GA KRaft support?
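
For context, a minimal sketch of the unaffected topology mentioned in the last point, i.e. a dedicated controller quorum declared through a KafkaNodePool. The cluster name my-cluster, the replica count and the storage size are placeholders for illustration, not values taken from this report:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaNodePool
    metadata:
      name: controllers
      labels:
        strimzi.io/cluster: my-cluster   # placeholder: name of the Kafka resource this pool belongs to
    spec:
      replicas: 3
      roles:
        - controller                     # controller role only, no broker role, so these are not mixed nodes
      storage:
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
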
katheris (Contributor) commented Dec 7, 2023

Hey @scholzj, @tinaselenge and I have taken a look at this and we think this is related to an oversight in the KafkaRoller code, rather than a problem with the readiness probes. In the original proposal for the readiness probes for KRaft we stated that "If a combined pod has not become ready, KafkaRoller should check that all other combined pods have been scheduled (i.e. not in pending state) before rolling the pod or waiting for it to become ready." (https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md)

We've both tried to reproduce this issue and found that the KafkaRoller does eventually move on to the other pods; it just takes ~10 minutes to do so because it tries (and times out) 10 times first.

Ideally we should update the KafkaRoller so that it doesn't wait for the pod to become ready if all the other pods are combined nodes that are currently pending. The tricky thing is that the current KafkaRoller considers each pod independently, so we think the changes are probably not straightforward. We've identified two possible places where we could make the change in KafkaRoller, but we would be interested if you have other suggestions. In parallel, Tina is also looking into how tricky this would be to do in the new KafkaRoller.

One place to make the change would be to check for pending pods in the rollingRestart method, where we order the pods. Then, if there are multiple pending pods and all pods are combined nodes, we could flag that somehow so that later, when we restart the first pod, we know not to wait. This would fully address the problem, but it adds extra complexity to the KafkaRoller.

Another option would be to make a change in the restartIfNecessary method so that, if checkIfRestartOrReconfigureRequired returns with forceRestart set to true and we are restarting a combined node, we don't restart and await readiness, but instead restart and throw a ForceableProblem so that the roller tries this pod again later and therefore moves on to the others. This would probably be easier to follow in the code, but it means we don't wait for readiness even if the other pods are up. A further option would be to add logic at this point to check the state of the other pods.

Either approach results in extra calls to fetch the other pods and check their state, so we need to decide whether we want to make this sort of change, or whether the current delay of ~10 minutes is acceptable for now.

scholzj (Member, Author) commented Dec 7, 2023

As I said before, I do not think this is a blocker. But I personally also do not think that rolling them after ~10 minutes is enough to call it a non-issue. So I would say this issue stands. Whether it has to be fixed now or one day in some new roller is more up for discussion, I guess. I don't think I have any opinion on how to fix it, TBH; I don't think my understanding of the KafkaRoller is really good enough for that 😉.

scholzj (Member, Author) commented Dec 14, 2023

Discussed in the community call on 14.12.2023: this should be fixed.

tinaselenge self-assigned this Jan 23, 2024