Introducing new KafkaRoller #103

Open
wants to merge 9 commits into base: main from kafka-roller-2

Conversation

Contributor

@tinaselenge tinaselenge commented Jan 2, 2024

For more implementation details, the POC implementation code can be found in RackRolling.java. All the related classes are in the same package, rolling.

The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.

The logic for switching to the new roller is in the KafkaReconciler.java class.

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Made some improvements on the structure

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Contributor

@fvaleri fvaleri left a comment

Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?

- `CrashLoopBackOff`
- `ImagePullBackOff`
- `ContainerCreating`
- `Pending` and `Unschedulable`
Contributor

Is it pending, or pending for more than X ms?

Contributor Author

Actually we don't check how long it's been in Pending state. If we find that the pod is in Pending state and has an old revision, we restart it, otherwise the pod is skipped anyway.

Contributor

Sorry, I linked my comment to the wrong row. It was more appropriate for the previous row, where we mention the "ContainerCreating" state.

Currently, we consider a Pod stuck when it has "Pending" status and at least one Kafka container in the "ContainerCreating" state (which is the default for containers). This is not necessarily accurate, because this state includes the time spent downloading container images over the network.

By restarting the pod immediately, we may actually make the situation worse. That's why I was suggesting adding a timeout after which we consider the Pod stuck.
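As an illustration of that suggestion, here is a minimal sketch of a grace-period check against the Fabric8 `Pod` model; the `creatingTimeout` parameter and the use of the pod's creation timestamp are assumptions made for this sketch, not the actual KafkaRoller logic:

```java
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodStatus;

import java.time.Duration;
import java.time.Instant;
import java.util.List;

/**
 * Minimal sketch of a stuck-pod check with a grace period. The creatingTimeout
 * parameter and the use of the pod's creation timestamp are illustrative
 * assumptions, not the actual KafkaRoller behaviour.
 */
public class StuckPodCheck {

    public static boolean isStuck(Pod pod, Duration creatingTimeout, Instant now) {
        PodStatus status = pod.getStatus();
        if (status == null || !"Pending".equals(status.getPhase())) {
            return false;
        }

        List<ContainerStatus> containers = status.getContainerStatuses();
        boolean containerCreating = containers != null && containers.stream()
                .anyMatch(cs -> cs.getState() != null
                        && cs.getState().getWaiting() != null
                        && "ContainerCreating".equals(cs.getState().getWaiting().getReason()));

        // Only treat the pod as stuck once it has been in this state longer than the
        // grace period, so a slow image pull is not mistaken for a stuck pod.
        Instant created = Instant.parse(pod.getMetadata().getCreationTimestamp());
        return containerCreating && Duration.between(created, now).compareTo(creatingTimeout) > 0;
    }
}
```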

Contributor Author

Yes, I agree with this. With the current roller, the assumption is that we restarted the pod and waited 5 minutes for it to become ready before timing out. After the timeout, the pod is retried, and if it is still not ready and is in the ContainerCreating state, we consider it stuck. Also, a stuck pod is only restarted if it has an old revision (e.g. the pod image was updated, so we may need to restart it anyway).

With the new roller, we do wait for pods in the ContainerCreating state. I did not specifically explain this, but I will make it clearer in the proposed changes section.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from c060c24 to 433316f on March 15, 2024 12:11
Tidy up

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge tinaselenge marked this pull request as ready for review March 15, 2024 12:29
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge
Contributor Author

@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.

Member

@see-quick see-quick left a comment

Nice proposal. Thanks for it 👍 .


STs POV:

I think we would also need to design multiple tests to cover all the states which KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage. So we should maybe have a meeting to talk about this...

Side note about performance:

What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement in rolling updates of multiple nodes when we use the batching mechanism...

- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.


12. If there are no exceptions thrown at this point, the reconciliation completes successfully. If `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exception was thrown, the reconciliation fails.
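For illustration, the preferred leader election mentioned in the excerpt above can be triggered with the Kafka Admin client roughly as sketched below; the bootstrap address is a placeholder and error handling is deliberately simplified, so this is not the operator's actual code:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;

import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of triggering preferred leader elections with the Kafka Admin client.
 * The bootstrap address is a placeholder and error handling is simplified.
 */
public class PreferredLeaderElection {

    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(
                AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092"))) {

            // Passing null for the partition set asks the brokers to run a preferred
            // leader election for all partitions not currently led by their preferred replica.
            admin.electLeaders(ElectionType.PREFERRED, null)
                 .all()
                 .get(60, TimeUnit.SECONDS);
        }
    }
}
```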
Member

Hmm, seeing this algorithm makes me think that it would be great to have a TLA+ expert who would model the KafkaRoller and formally prove the correctness and safety of this algorithm using safety and liveness properties (e.g., no node should exceed max retries (safety), no simultaneous restart of all controller nodes (safety), or all nodes should eventually be SERVING (liveness)) and more...

Contributor Author

@fvaleri might be able to help with this :)

Contributor

@fvaleri fvaleri Apr 26, 2024

TLA+ is certainly a useful tool and this would be a great use case. The problem is that it has a steep learning curve and it requires a lot of practice to really get the benefit out of it. I can write a simple spec, but I lack the experience and time to write anything like this.

Things I don't like much about TLA+: the model you create is so distant from the actual implementation that it doesn't protect you from nasty bugs (see the KRaft and Kafka replication protocol issues), and it is really easy to hit state space explosion, so you have to know some mitigation techniques (and it may still require big machines to verify a complex model over multiple days).

Co-authored-by: Maros Orsak <maros.orsak159@gmail.com>
Signed-off-by: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
Contributor

@fvaleri fvaleri left a comment

@tinaselenge thanks for the example, it really helps.

I left some comments, let me know if something is not clear or you want to discuss further.

- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for the foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and the foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and the foo-0 partition now has ISR [4], which is below the configured minimum in-sync replicas of 2, resulting in producers with acks=all no longer being able to produce to this partition.
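For readers unfamiliar with the availability check referred to in this example, the sketch below shows the general idea using the Kafka Admin client: a broker is considered safe to restart only if removing it from the ISR would not drop any partition below `min.insync.replicas`. Passing `minIsr` as a parameter (instead of fetching it via `describeConfigs`) is a simplification, and this is not the actual KafkaRoller code:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collection;
import java.util.Map;

/**
 * Sketch of an availability check: would restarting the given broker leave any
 * partition with fewer in-sync replicas than min.insync.replicas?
 */
public class AvailabilityCheck {

    public static boolean canRestart(Admin admin, int brokerId, Collection<String> topics, int minIsr)
            throws Exception {
        Map<String, TopicDescription> descriptions =
                admin.describeTopics(topics).allTopicNames().get();

        for (TopicDescription topic : descriptions.values()) {
            for (TopicPartitionInfo partition : topic.partitions()) {
                long isrWithoutBroker = partition.isr().stream()
                        .map(Node::id)
                        .filter(id -> id != brokerId)
                        .count();
                // If removing this broker from the ISR would drop it below min.insync.replicas,
                // restarting the broker could block acks=all producers.
                if (partition.isr().stream().anyMatch(n -> n.id() == brokerId)
                        && isrWithoutBroker < minIsr) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

As this thread points out, such a point-in-time ISR check cannot see a `removingReplicas` request that has been accepted but not yet processed, which is exactly the race condition described above.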
Contributor

@fvaleri fvaleri Apr 25, 2024

In addition to rebalance, we have the same race condition with the replication factor change (the new integration between CC and the TO); maybe you can mention this.

The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.

I think we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to a "force" policy, then the roller would behave like today. Wdyt?
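A rough sketch of such a check against Cruise Control's `user_tasks` endpoint is shown below. The host name is a placeholder, TLS and authentication (which Strimzi's Cruise Control deployment normally uses) are ignored, and the string-based status matching is a crude stand-in for proper JSON parsing; the response field values are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Rough sketch of asking Cruise Control whether any task is still running
 * before rolling a broker. A real implementation would parse the JSON response
 * properly and honour a configurable wait/force policy.
 */
public class CruiseControlTaskCheck {

    public static boolean hasPendingTasks() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://my-cluster-cruise-control:9090/kafkacruisecontrol/user_tasks?json=true"))
                .GET()
                .build();

        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // Crude check: treat the presence of an active or in-execution task status as "pending".
        return body.contains("\"Active\"") || body.contains("\"InExecution\"");
    }
}
```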

Contributor Author

Should this perhaps be included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion, I think.

Contributor

This should have a dedicated proposal IMO, but let's start by logging an issue.


- KafkaRoller takes a long time to reconcile combined nodes if they are all in `Pending` state. This is because the combined node does not become ready until the quorum is formed and KafkaRoller waits for a pod to become ready before it attempts to restart other nodes. In order for the quorum to form, at least the majority of controller nodes need to be running at the same time. This is not easy to solve in the current KafkaRoller without introducing some major changes because it processes each node individually and there is no mechanism to restart multiple nodes in parallel. More information can be found [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9426).

- The quorum health check is based on the `controller.quorum.fetch.timeout.ms` configuration, which it reads from the desired configurations passed from the Reconciler. However, `CaReconciler` and manual rolling updates pass a null value for the desired configurations because in both cases the nodes don't need reconfiguration. This results in the quorum health check being performed with the hard-coded default value of `controller.quorum.fetch.timeout.ms` rather than the actual configured value when doing a manual rolling update or rolling nodes for certificate renewal.
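For illustration, a quorum health check along these lines might look like the sketch below, using the `describeMetadataQuorum` Admin API and the `controller.quorum.fetch.timeout.ms` value; it ignores which node is being restarted and other details, so it is only an approximation of the described behaviour:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

/**
 * Sketch of a quorum health check: is a majority of voters caught up with the
 * quorum leader recently enough that restarting one voter is considered safe?
 */
public class QuorumHealthCheck {

    public static boolean canRollController(Admin admin, long quorumFetchTimeoutMs) throws Exception {
        QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();

        long leaderLastFetch = quorum.voters().stream()
                .filter(v -> v.replicaId() == quorum.leaderId())
                .mapToLong(v -> v.lastFetchTimestamp().orElse(0))
                .findFirst()
                .orElse(0);

        // A voter counts as healthy if it has fetched from the leader recently enough.
        long caughtUpVoters = quorum.voters().stream()
                .filter(v -> leaderLastFetch - v.lastFetchTimestamp().orElse(0) < quorumFetchTimeoutMs)
                .count();

        // Restarting one voter is considered safe only if a majority would still be caught up.
        return caughtUpVoters - 1 > quorum.voters().size() / 2;
    }
}
```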
Contributor

I know this issue, but the description is not clear IMO.

Contributor Author

I have tried rewording the paragraph. Do you think it's clearer now?

5. It then considers restarting `controller-1` and checks if the quorum health would be impacted. The operator sends a request to the active controller to describe the quorum replication state. It finds that the majority of the follower controllers have caught up with the quorum leader within `controller.quorum.fetch.timeout.ms`.
6. The roller restarts `controller-1` as it has no impact on the quorum health. When it is in the `SERVING` state, the roller repeats the quorum check and restarts `controller-2` and then `controller-0`.
7. It then considers restarting `combined-4`, so it performs the quorum health check and then the availability check. Both checks pass, therefore `combined-4` is restarted. The same is repeated for `combined-5`.
8. All controller and combined nodes have the `SERVING` state and `numRestartAttempts` set to 1, therefore the roller moves on to the broker nodes.
Contributor

Why does numRestartAttempts matter here? Should we reset it to 0?

Contributor Author

This is how we make sure all the controller nodes have been restarted before we proceed to the broker nodes. I will make this clearer.
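A hypothetical sketch of that gate is shown below; the `NodeContext` record and its fields are illustrative only and do not correspond to actual classes in the PR:

```java
import java.util.List;

/**
 * Hypothetical sketch: only move on to broker nodes once every controller or
 * combined node is SERVING and has been restarted at least once.
 */
public class ControllerRollGate {

    enum State { UNKNOWN, RESTARTED, SERVING, LEADING_ALL_PREFERRED }

    record NodeContext(int nodeId, boolean controller, State state, int numRestartAttempts) { }

    static boolean controllersDone(List<NodeContext> nodes) {
        return nodes.stream()
                .filter(NodeContext::controller)
                .allMatch(n -> n.state() == State.SERVING && n.numRestartAttempts() >= 1);
    }
}
```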

Contributor

@katheris katheris left a comment

Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first-hand how tricky it is to debug the existing code.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 931adbd to 1060fee on April 30, 2024 13:58
Add possible transitions

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Contributor

@katheris katheris left a comment

Two small nits but otherwise looks good to me

Contributor

@fvaleri fvaleri left a comment

Hi @tinaselenge. Thanks for the updates. I think this is definitely the right direction, but I left some more comments for you to consider. It may be that I'm missing some detail, so feel free to correct me.

- Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationalTimeoutMs` is reached.
- If the timeout is reached and a node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment their `numRetries` and repeat from step 2.
- After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
Contributor

What about this?

A Kafka background thread ensures that the leader role is shifted to the preferred replica once it's in sync and a configured imbalance threshold is reached. This is enabled by default (see auto.leader.rebalance.enable). I think this may be enough.

Added flow diagram for state transitions

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge
Contributor Author

Thanks everyone who reviewed the proposal!
@tombentley @ppatierno @scholzj , would you be able to take a look at the proposal as well when you get a chance? Thank you.
