Introducing new KafkaRoller #103
base: main
Conversation
Force-pushed from 8c79a95 to 9c6154b
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 9c6154b to c74f0b4
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from ca68601 to 5abafe6
Force-pushed from 56d7a24 to 4baf73a
Made some improvements on the structure Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 4baf73a to 33ec40e
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
- `CrashLoopBackOff`
- `ImagePullBackOff`
- `ContainerCreating`
- `Pending` and `Unschedulable`
Is it pending, or pending for more than X ms?
Actually we don't check how long it's been in Pending state. If we find that the pod is in Pending state and has an old revision, we restart it, otherwise the pod is skipped anyway.
Sorry, I linked my comment to the wrong row. It was more appropriate for the previous row, where we mention the "ContainerCreating" state.
Currently, we consider a Pod stuck when it has "Pending" status and at least one Kafka container in "ContainerCreating" state (which is the default for containers). This is not necessarily true, because this state includes the time spent downloading container images over the network.
By restarting the pod immediately, we may actually make the situation worse. That's why I was suggesting to add a timeout after which we consider the Pod stuck.
Yes, I agree with this. I think with the current roller, the assumption is that we restarted the pod and waited 5 minutes for it to become ready, then timed out. On timeout, the pod is retried, and if it is still not ready and in ContainerCreating state, we consider it stuck. Also, a stuck pod is only restarted if it has an old revision (e.g. the pod image is updated, so we may need to restart it anyway).
With the new roller, we do wait for pods in ContainerCreating state. I did not specifically explain this, but I will make it clearer in the proposed changes section.
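The timeout-based stuck-pod detection discussed above could be sketched as follows. This is a minimal illustration, not the actual Strimzi implementation; the class and method names are hypothetical:

```java
import java.time.Duration;
import java.time.Instant;

public class StuckPodCheck {
    /**
     * A pod is only considered stuck once it has been Pending (e.g. with the
     * Kafka container in ContainerCreating) for longer than the threshold,
     * so time legitimately spent pulling images over the network is not
     * mistaken for a hang.
     */
    public static boolean isStuckPending(Instant pendingSince, Instant now, Duration threshold) {
        return Duration.between(pendingSince, now).compareTo(threshold) > 0;
    }
}
```

With a 5-minute threshold, a pod that has been Pending for 10 minutes would be treated as stuck, while one Pending for 3 minutes (possibly still pulling its image) would not.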
Force-pushed from c060c24 to 433316f
Tidy up Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 433316f to 4f91a5a
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 97bdef2 to 941fe43
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.
Nice proposal. Thanks for it 👍 .
STs POV:
I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but for sure that's not 100% coverage. So maybe we should have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? For sure I can imagine that we would see a significant improvement on RollingUpdates of multiple nodes when we use the batching mechanism...
06x-new-kafka-roller.md
Outdated
- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
12. If there are no exceptions thrown at this point, the reconciliation completes successfully. If there were `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exceptions thrown, the reconciliation fails.
Hmm, seeing this algorithm makes me think that it would be great to have a TLA+ expert who would model KafkaRoller and formally prove the correctness and safety of this algorithm using some safety and liveness properties (e.g., no node should exceed max retries (safety), no simultaneous restart of all controller nodes (safety), or all nodes should eventually be SERVING (liveness)) and more...
@fvaleri might be able to help with this :)
TLA+ is certainly a useful tool and this would be a great use case. The problem is that it has a steep learning curve and it requires a lot of practice to really get the benefit out of it. I can write a simple spec, but I lack the experience and time to write anything like this.
Things I don't like much about TLA+: the model you create is so distant from the actual implementation that it doesn't protect you from nasty bugs (see KRaft and Kafka replication protocol issues), it is really easy to hit the state space explosion so you have to know some mitigation techniques (still it may require big machines to verify a complex model over multiple days).
Co-authored-by: Maros Orsak <maros.orsak159@gmail.com> Signed-off-by: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for the foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and the foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and the foo-0 partition now has ISR [4], which is below the configured minimum in-sync replicas of 2, resulting in producers with acks=all no longer being able to produce to this partition.
In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO), maybe you can mention this.
The roller should be able to call the CC's user_tasks endpoint, and check if there is any pending task. In that case, the roller has two options: wait for all tasks completion, or continue as today with the potential issue you describe here. You can't really stop the tasks because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.
I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to the "force" policy, then the roller would behave like today. Wdyt?
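The wait-vs-force policy described above could be sketched roughly like this. This is an illustrative sketch only: the class, the `Policy` enum, and the task-count supplier (standing in for a poll of Cruise Control's `user_tasks` endpoint) are all hypothetical names, not the actual operator API:

```java
import java.util.function.Supplier;

public class CruiseControlGate {
    public enum Policy { WAIT, FORCE }

    /**
     * Decide whether the roller may proceed with a restart while Cruise
     * Control may have tasks in flight. Under WAIT, the roller polls the
     * active-task count (simulated here by the supplier) until it reaches
     * zero or the poll budget is exhausted; under FORCE it proceeds
     * immediately, accepting the race condition described above.
     */
    public static boolean mayProceed(Policy policy, Supplier<Integer> activeTaskCount, int maxPolls) {
        if (policy == Policy.FORCE) {
            return true;
        }
        for (int i = 0; i < maxPolls; i++) {
            if (activeTaskCount.get() == 0) {
                return true;
            }
        }
        return false; // still busy after maxPolls; the caller would retry in the next reconciliation
    }
}
```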
Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.
This should have a dedicated proposal IMO, but let's start by logging an issue.
06x-new-kafka-roller.md
Outdated
- KafkaRoller takes a long time to reconcile combined nodes if they are all in `Pending` state. This is because a combined node does not become ready until the quorum is formed, and KafkaRoller waits for a pod to become ready before it attempts to restart other nodes. In order for the quorum to form, at least the majority of controller nodes need to be running at the same time. This is not easy to solve in the current KafkaRoller without introducing major changes, because it processes each node individually and there is no mechanism to restart multiple nodes in parallel. More information can be found [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9426).
- The quorum health check is based on the `controller.quorum.fetch.timeout.ms` configuration, which it reads from the desired configurations passed from the Reconciler. However, `CaReconciler` and manual rolling updates pass a null value for the desired configurations because, in both cases, the nodes don't need reconfiguration. This results in the quorum health check being performed against the hard-coded default value of `controller.quorum.fetch.timeout.ms` rather than the actual configured value when doing a manual rolling update or rolling nodes for certificate renewal.
I know this issue, but the description is not clear IMO.
I have tried rewording the paragraph, do you think it's clearer now?
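The quorum-formation deadlock described in the first bullet above comes down to simple arithmetic: when all controllers are `Pending`, none can become ready until a majority is running simultaneously, so a one-node-at-a-time, wait-for-ready loop can never make progress. A minimal sketch of that arithmetic (illustrative names, not the operator's API):

```java
public class QuorumMath {
    /** Majority needed for a KRaft quorum of the given size. */
    public static int majority(int quorumSize) {
        return quorumSize / 2 + 1;
    }

    /**
     * When every controller is Pending and none can become ready until a
     * quorum forms, at least a majority of them must be (re)started in
     * parallel. A roller that restarts one node and blocks waiting for it
     * to become ready therefore deadlocks for any quorum size > 1.
     */
    public static int minParallelRestarts(int quorumSize) {
        return majority(quorumSize);
    }
}
```

For a typical 3-controller quorum, 2 nodes must come up together; the new roller's ability to restart groups of nodes in parallel is what breaks the deadlock.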
06x-new-kafka-roller.md
Outdated
5. It then considers restarting `controller-1` and checks if the quorum health would be impacted. The operator sends a request to the active controller to describe the quorum replication state. It finds that the majority of the follower controllers have caught up with the quorum leader within `controller.quorum.fetch.timeout.ms`.
6. The roller restarts `controller-1` as it has no impact on the quorum health. When it has `SERVING` state, the roller repeats the quorum check and restarts `controller-2` and then `controller-0`.
7. It then considers restarting `combined-4`, so it performs the quorum health check and then the availability check. Both checks pass, therefore `combined-4` is restarted. The same is repeated for `combined-5`.
8. All controller and combined nodes have `SERVING` state and `numRestartAttempts` set to 1, therefore the roller checks the broker nodes.
Why numRestartAttempts matters here? Should we reset to 0?
This is how we make sure all the controller nodes have been restarted, before we proceed to the broker nodes. I will make this clearer.
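The quorum health check used in steps 5-7 of the quoted example could be sketched as follows. This is a simplified illustration under stated assumptions (the class and method names are hypothetical, and `lagMs` stands in for `now - lastCaughtUpTimestamp` per controller as returned by the Admin client's describe-quorum call):

```java
import java.util.Map;

public class QuorumHealthCheck {
    /**
     * A controller can be safely restarted if, excluding that controller
     * itself, a majority of the full quorum has caught up with the leader
     * within controller.quorum.fetch.timeout.ms, so that taking this node
     * down still leaves a functioning majority.
     *
     * @param nodeToRestart  id of the controller being considered
     * @param lagMs          controller id -> time since it last caught up with the leader
     * @param fetchTimeoutMs the controller.quorum.fetch.timeout.ms value
     */
    public static boolean canRestart(int nodeToRestart, Map<Integer, Long> lagMs, long fetchTimeoutMs) {
        long caughtUp = lagMs.entrySet().stream()
                .filter(e -> e.getKey() != nodeToRestart)
                .filter(e -> e.getValue() < fetchTimeoutMs)
                .count();
        int majority = lagMs.size() / 2 + 1;
        return caughtUp >= majority;
    }
}
```

With three healthy controllers, restarting any one still leaves two caught-up nodes (a majority of three), so the check passes; if another follower is already lagging beyond the fetch timeout, the check fails and the restart is deferred.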
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
Force-pushed from 931adbd to 1060fee
Add possible transitions Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 1060fee to e56d1f8
Two small nits but otherwise looks good to me
Hi @tinaselenge. Thanks for the updates. I think this is definitely the right direction, but I left some more comments for you to consider. It may be that I'm missing some detail, so feel free to correct me.
06x-new-kafka-roller.md
Outdated
- Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationalTimeoutMs` is reached.
- If the timeout is reached and a node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment their `numRetries` and repeat from step 2.
- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
What about this?
A Kafka background thread ensures that the leader role is shifted to the preferred replica once it's in sync and a configured imbalance threshold is reached. This is enabled by default (see auto.leader.rebalance.enable). I think this may be enough.
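The retry rule in the quoted steps above could be sketched as the following per-node decision. This is illustrative only: `numRetries` and `maxRetries` are the per-node counters from the proposal, but the class, method, and result names are hypothetical:

```java
public class RetryDecision {
    public enum Outcome { FAIL, RETRY }

    /**
     * Applied when waiting for a node to reach SERVING times out: a node
     * whose numRetries has already reached maxRetries fails the
     * reconciliation (TimeoutException); otherwise its numRetries is
     * incremented and the algorithm repeats from step 2.
     */
    public static Outcome onTimeout(int numRetries, int maxRetries) {
        return numRetries >= maxRetries ? Outcome.FAIL : Outcome.RETRY;
    }
}
```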
Added flow diagram for state transitions Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Force-pushed from 02e63f7 to 6842473
Thanks everyone who reviewed the proposal!
For more implementation details, the POC implementation code can be found in RackRolling.java. All the related classes are in the same package, `rolling`.
The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.
The logic for switching to the new roller is in the KafkaReconciler.java class.