Introducing new KafkaRoller #103
base: main
Conversation
Force-pushed from 8c79a95 to 9c6154b
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 9c6154b to c74f0b4
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from ca68601 to 5abafe6
Force-pushed from 56d7a24 to 4baf73a
Made some improvements on the structure Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 4baf73a to 33ec40e
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
- `CrashLoopBackOff`
- `ImagePullBackOff`
- `ContainerCreating`
- `Pending` and `Unschedulable`
Is it pending, or pending for more than X ms?
Actually we don't check how long it's been in Pending state. If we find that the pod is in Pending state and has an old revision, we restart it, otherwise the pod is skipped anyway.
Sorry, I linked my comment to the wrong row. It was more appropriate for the previous row, where we mention the "ContainerCreating" state.
Currently, we consider a Pod stuck when it has "Pending" status and at least one Kafka container in "ContainerCreating" state (which is the default for containers). This is not necessarily true, because this state includes the time spent downloading container images over the network.
By restarting the pod immediately, we may actually make the situation worse. That's why I was suggesting to add a timeout after which we consider the Pod stuck.
Yes, I agree with this. I think with the current roller, the assumption is that we restarted the pod and waited 5 minutes for it to become ready, then timed out. On timeout, the pod is retried, and if it is still not ready and in ContainerCreating state, we consider it stuck. Also, a stuck pod is only restarted if it has an old revision (e.g. the pod image is updated, so we may need to restart it anyway).
With the new roller, we do wait for pods in ContainerCreating state. I did not specifically explain this, but I will make it clearer in the proposed changes section.
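The timeout-based stuck-pod detection discussed above could be sketched as follows. This is a minimal illustration, not the actual Strimzi implementation; the class and method names are hypothetical:

```java
import java.time.Duration;
import java.time.Instant;

public class StuckPodCheck {
    /**
     * A pod is only considered stuck once it has been Pending (e.g. with the
     * Kafka container in ContainerCreating) for longer than the threshold,
     * so time legitimately spent pulling images over the network is not
     * mistaken for a hang.
     */
    public static boolean isStuckPending(Instant pendingSince, Instant now, Duration threshold) {
        return Duration.between(pendingSince, now).compareTo(threshold) > 0;
    }
}
```

With a 5-minute threshold, a pod that has been Pending for 10 minutes would be treated as stuck, while one Pending for 3 minutes (possibly still pulling its image) would not.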
Force-pushed from c060c24 to 433316f
Tidy up Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 433316f to 4f91a5a
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 97bdef2 to 941fe43
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.
Nice proposal. Thanks for it 👍 .
STs POV:
I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but for sure that's not 100% coverage. So maybe we should have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? For sure I can imagine that we would see a significant improvement on RollingUpdates of multiple nodes when we use the batching mechanism...
06x-new-kafka-roller.md
Outdated
- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
12. If there are no exceptions thrown at this point, the reconciliation completes successfully. If there were `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exceptions thrown, the reconciliation fails.
Hmm, seeing this algorithm makes me think that it would be great to have a TLA+ expert who would model KafkaRoller and formally prove the correctness and safety of this algorithm using some safety and liveness properties (e.g., no node should exceed max retries (safety), no simultaneous restart of all controller nodes (safety), or all nodes should eventually be SERVING (liveness)) and more...
@fvaleri might be able to help with this :)
TLA+ is certainly a useful tool and this would be a great use case. The problem is that it has a steep learning curve and it requires a lot of practice to really get the benefit out of it. I can write a simple spec, but I lack the experience and time to write anything like this.
Things I don't like much about TLA+: the model you create is so distant from the actual implementation that it doesn't protect you from nasty bugs (see KRaft and Kafka replication protocol issues), it is really easy to hit the state space explosion so you have to know some mitigation techniques (still it may require big machines to verify a complex model over multiple days).
Co-authored-by: Maros Orsak <maros.orsak159@gmail.com> Signed-off-by: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for the foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and the foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and the foo-0 partition now has ISR [4], which is below the configured minimum in-sync replicas of 2, resulting in producers with acks=all no longer being able to produce to this partition.
In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO), maybe you can mention this.
The roller should be able to call the CC's user_tasks endpoint, and check if there is any pending task. In that case, the roller has two options: wait for all tasks completion, or continue as today with the potential issue you describe here. You can't really stop the tasks because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.
I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to the "force" policy, then the roller would behave like today. Wdyt?
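The wait-vs-force policy described above could be sketched roughly like this. This is an illustrative sketch only: the class, the `Policy` enum, and the task-count supplier (standing in for a poll of Cruise Control's `user_tasks` endpoint) are all hypothetical names, not the actual operator API:

```java
import java.util.function.Supplier;

public class CruiseControlGate {
    public enum Policy { WAIT, FORCE }

    /**
     * Decide whether the roller may proceed with a restart while Cruise
     * Control may have tasks in flight. Under WAIT, the roller polls the
     * active-task count (simulated here by the supplier) until it reaches
     * zero or the poll budget is exhausted; under FORCE it proceeds
     * immediately, accepting the race condition described above.
     */
    public static boolean mayProceed(Policy policy, Supplier<Integer> activeTaskCount, int maxPolls) {
        if (policy == Policy.FORCE) {
            return true;
        }
        for (int i = 0; i < maxPolls; i++) {
            if (activeTaskCount.get() == 0) {
                return true;
            }
        }
        return false; // still busy after maxPolls; the caller would retry in the next reconciliation
    }
}
```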
Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.
This should have a dedicated proposal IMO, but let's start by logging an issue.
06x-new-kafka-roller.md
Outdated
- KafkaRoller takes a long time to reconcile combined nodes if they are all in `Pending` state. This is because a combined node does not become ready until the quorum is formed, and KafkaRoller waits for a pod to become ready before it attempts to restart other nodes. In order for the quorum to form, at least the majority of controller nodes need to be running at the same time. This is not easy to solve in the current KafkaRoller without introducing major changes, because it processes each node individually and there is no mechanism to restart multiple nodes in parallel. More information can be found [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9426).
- The quorum health check is based on the `controller.quorum.fetch.timeout.ms` configuration, which it reads from the desired configurations passed from the Reconciler. However, `CaReconciler` and manual rolling updates pass a null value for the desired configurations because, in both cases, the nodes don't need reconfiguration. This results in the quorum health check being performed against the hard-coded default value of `controller.quorum.fetch.timeout.ms` rather than the actual configured value when doing a manual rolling update or rolling nodes for certificate renewal.
I know this issue, but the description is not clear IMO.
I have tried rewording the paragraph, do you think it's clearer now?
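The quorum-formation deadlock described in the first bullet above comes down to simple arithmetic: when all controllers are `Pending`, none can become ready until a majority is running simultaneously, so a one-node-at-a-time, wait-for-ready loop can never make progress. A minimal sketch of that arithmetic (illustrative names, not the operator's API):

```java
public class QuorumMath {
    /** Majority needed for a KRaft quorum of the given size. */
    public static int majority(int quorumSize) {
        return quorumSize / 2 + 1;
    }

    /**
     * When every controller is Pending and none can become ready until a
     * quorum forms, at least a majority of them must be (re)started in
     * parallel. A roller that restarts one node and blocks waiting for it
     * to become ready therefore deadlocks for any quorum size > 1.
     */
    public static int minParallelRestarts(int quorumSize) {
        return majority(quorumSize);
    }
}
```

For a typical 3-controller quorum, 2 nodes must come up together; the new roller's ability to restart groups of nodes in parallel is what breaks the deadlock.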
06x-new-kafka-roller.md
Outdated
5. It then considers restarting `controller-1` and checks if the quorum health would be impacted. The operator sends a request to the active controller to describe the quorum replication state. It finds that the majority of the follower controllers have caught up with the quorum leader within `controller.quorum.fetch.timeout.ms`.
6. The roller restarts `controller-1` as it has no impact on the quorum health. When it has `SERVING` state, the roller repeats the quorum check and restarts `controller-2` and then `controller-0`.
7. It then considers restarting `combined-4`, so it performs the quorum health check and then the availability check. Both checks pass, therefore `combined-4` is restarted. The same is repeated for `combined-5`.
8. All controller and combined nodes have `SERVING` state and `numRestartAttempts` set to 1, therefore the roller checks the broker nodes.
Why numRestartAttempts matters here? Should we reset to 0?
This is how we make sure all the controller nodes have been restarted, before we proceed to the broker nodes. I will make this clearer.
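The quorum health check used in steps 5-7 of the quoted example could be sketched as follows. This is a simplified illustration under stated assumptions (the class and method names are hypothetical, and `lagMs` stands in for `now - lastCaughtUpTimestamp` per controller as returned by the Admin client's describe-quorum call):

```java
import java.util.Map;

public class QuorumHealthCheck {
    /**
     * A controller can be safely restarted if, excluding that controller
     * itself, a majority of the full quorum has caught up with the leader
     * within controller.quorum.fetch.timeout.ms, so that taking this node
     * down still leaves a functioning majority.
     *
     * @param nodeToRestart  id of the controller being considered
     * @param lagMs          controller id -> time since it last caught up with the leader
     * @param fetchTimeoutMs the controller.quorum.fetch.timeout.ms value
     */
    public static boolean canRestart(int nodeToRestart, Map<Integer, Long> lagMs, long fetchTimeoutMs) {
        long caughtUp = lagMs.entrySet().stream()
                .filter(e -> e.getKey() != nodeToRestart)
                .filter(e -> e.getValue() < fetchTimeoutMs)
                .count();
        int majority = lagMs.size() / 2 + 1;
        return caughtUp >= majority;
    }
}
```

With three healthy controllers, restarting any one still leaves two caught-up nodes (a majority of three), so the check passes; if another follower is already lagging beyond the fetch timeout, the check fails and the restart is deferred.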
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
Force-pushed from 931adbd to 1060fee
Add possible transitions Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 1060fee to e56d1f8
Two small nits but otherwise looks good to me
Hi @tinaselenge. Thanks for the updates. I think this is definitely the right direction, but I left some more comments for you to consider. It may be that I'm missing some detail, so feel free to correct me.
06x-new-kafka-roller.md
Outdated
- Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationalTimeoutMs` is reached.
- If the timeout is reached and a node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment their `numRetries` and repeat from step 2.
- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
What about this?
A Kafka background thread ensures that the leader role is shifted to the preferred replica once it's in sync and a configured imbalance threshold is reached. This is enabled by default (see auto.leader.rebalance.enable). I think this may be enough.
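The retry rule in the quoted steps above could be sketched as the following per-node decision. This is illustrative only: `numRetries` and `maxRetries` are the per-node counters from the proposal, but the class, method, and result names are hypothetical:

```java
public class RetryDecision {
    public enum Outcome { FAIL, RETRY }

    /**
     * Applied when waiting for a node to reach SERVING times out: a node
     * whose numRetries has already reached maxRetries fails the
     * reconciliation (TimeoutException); otherwise its numRetries is
     * incremented and the algorithm repeats from step 2.
     */
    public static Outcome onTimeout(int numRetries, int maxRetries) {
        return numRetries >= maxRetries ? Outcome.FAIL : Outcome.RETRY;
    }
}
```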
Added flow diagram for state transitions Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 02e63f7 to 6842473
Thanks everyone who reviewed the proposal!
For more implementation details, the POC implementation code can be found in RackRolling.java. All the related classes are in the same package, `rolling`.
The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.
The logic for switching to the new roller is in the KafkaReconciler.java class.