
Allow reconciliation to continue when scale-down is blocked due to brokers still in use #9585

Conversation

scholzj
Member

@scholzj scholzj commented Jan 24, 2024

Type of change

  • Enhancement / new feature

Description

Currently, when a user tries to scale down the Kafka cluster but the brokers that would be removed are still in use, we just throw an exception and fail the reconciliation. This works fine in general, but it makes it hard to build any automation on top of it since it essentially blocks all other operational tasks. For example, if a user scales down the Kafka cluster by mistake, the operator starts throwing errors and will not proceed with other work such as renewing certificates.

This PR tries to address it by reverting the scale-down changes and allowing the reconciliation to proceed as usual. It takes a pragmatic approach and reverts all scale-downs, even if some of the brokers are already empty and could be shut down. Since Kafka has no fencing, this could mean that the brokers that were empty get new topics assigned. But it simplifies the logic significantly because it does not need to deal with situations such as a user scaling down brokers 3 and 4 where broker 3 is empty and broker 4 is still in use. In that case, we cannot simply correct the replica count by 1, as that would mean broker 3 (the empty one) is kept and broker 4 is removed => so it would still be blocked with an exception.

From the technical perspective, this PR introduces a new class KafkaClusterCreator that contains the logic. This class is used to:

  • Create the Kafka cluster model
  • Check if it has any scale-down issues
  • If there are any scale-down issues, it tries to fix them and recursively creates the Kafka cluster model again (this time, any remaining scale-down issues are raised as an error, in case they were not fixed by the first attempt)
  • It is written in a way to make it easier to extend it in the next steps to check against removing the broker role and possibly other checks (we might consider moving the storage checks to it from the KafkaCluster.fromCrd(...) method).
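The flow above can be sketched in simplified Java. All class and method names here are illustrative stand-ins, not the actual Strimzi API: the cluster model is reduced to a list of broker IDs, and the "broker still in use" check is a plain set lookup instead of the asynchronous Kafka Admin API check the operator performs.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the revert-and-retry logic described above.
// The real KafkaClusterCreator builds a full KafkaCluster model and
// checks broker usage asynchronously; none of these names are real API.
public class ClusterCreatorSketch {
    /** Simplified stand-in for the Kafka cluster model: just broker IDs. */
    public record Cluster(List<Integer> brokers) { }

    /**
     * Builds the desired cluster. If the scale-down would remove brokers
     * that still host partition replicas, revert to the current broker set
     * and build again. On the second attempt, any remaining scale-down
     * problem is raised as an error instead of being fixed.
     */
    public static Cluster createCluster(List<Integer> current,
                                        List<Integer> desired,
                                        Set<Integer> brokersInUse) {
        return createCluster(current, desired, brokersInUse, true);
    }

    private static Cluster createCluster(List<Integer> current,
                                         List<Integer> desired,
                                         Set<Integer> brokersInUse,
                                         boolean tryToFix) {
        // Brokers that the desired state would remove
        List<Integer> removed = current.stream()
                .filter(id -> !desired.contains(id))
                .collect(Collectors.toList());
        boolean blocked = removed.stream().anyMatch(brokersInUse::contains);

        if (!blocked) {
            return new Cluster(desired);
        } else if (tryToFix) {
            // Pragmatic fix: revert the whole scale-down, even if some of
            // the removed brokers were empty, and create the model again.
            return createCluster(current, current, brokersInUse, false);
        } else {
            throw new IllegalStateException(
                    "Cannot scale down brokers " + removed + " that are still in use");
        }
    }
}
```

On the first pass a blocked scale-down is "fixed" by reverting to the current broker set; on the recursive second pass `tryToFix` is false, so any problem that was not fixed fails the reconciliation, mirroring the behavior described above.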

The KafkaClusterCreator is used by the KafkaAssemblyOperator, and the created KafkaCluster instance is used to create the KafkaReconciler. I considered some other options as well, but rejected them in the end:

  • Doing it in the constructor of KafkaReconciler => I rejected it because it seemed to push a lot of logic into the constructor and also had issues with asynchronous code. The advantage of the current approach is that it keeps the constructor clean with only assignments of parameters to object fields.
  • Doing it as the first step of the KafkaReconciler. This would work quite nicely. But it would mean that the KafkaCluster object is not final in the KafkaReconciler class. That seemed like something that can be tricky to maintain in the future and is something to avoid.
  • Keeping the original KafkaCluster object and just modifying the node assignments inside it. This would probably be easier for the initial implementation. But it might be harder to maintain, as it would mean the fields inside the KafkaCluster would not be final, would be subject to change, and we would need to be careful about which parts are changed, and where and when they are used. Creating a brand new KafkaCluster object seemed like a better approach.

This is currently a draft used for a preliminary review of the code. It does not contain any new tests, nor does it update existing tests for the new code.

This should resolve #9232.

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md

@scholzj scholzj added this to the 0.40.0 milestone Jan 24, 2024
@scholzj
Member Author

scholzj commented Jan 24, 2024

@ppatierno Would you mind having a quick look if the solution for this makes sense to you before I write all the tests etc.?

Member

@ppatierno ppatierno left a comment


The overall logic seems to be reasonable to me. I left a couple of questions.
I will start crying about conflicts I see with my migration PR later ...

@scholzj
Member Author

scholzj commented Jan 25, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

…okers still in use - Closes strimzi#9232

Signed-off-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj scholzj force-pushed the allow-reconciliation-to-continue-when-scale-down-is-prevented-due-to-brokers-in-use branch from 418fe7d to ff5eb9f on January 26, 2024 18:56
Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj
Member Author

scholzj commented Jan 26, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Member Author

scholzj commented Jan 26, 2024

/azp run feature-gates-regression

@scholzj scholzj marked this pull request as ready for review January 26, 2024 23:57

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj requested a review from im-konge January 26, 2024 23:57
@scholzj
Member Author

scholzj commented Jan 27, 2024

/azp run kraft-regression


Azure Pipelines successfully started running 1 pipeline(s).

Member

@ppatierno ppatierno left a comment


LGTM. I saw multiple TODOs which should be removed.

Member

@im-konge im-konge left a comment


LGTM, thanks

Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj
Member Author

scholzj commented Jan 27, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj merged commit 35a39fb into strimzi:main Jan 28, 2024
21 checks passed
@scholzj scholzj deleted the allow-reconciliation-to-continue-when-scale-down-is-prevented-due-to-brokers-in-use branch January 28, 2024 13:19
john-mcpeek pushed a commit to john-mcpeek/strimzi-kafka-operator that referenced this pull request Jan 28, 2024
…okers still in use (strimzi#9585)

Signed-off-by: John McPeek <juan.mcpeek@gmail.com>
adamj-m pushed a commit to jamesfarlow/strimzi-kafka-operator that referenced this pull request Feb 1, 2024
…okers still in use (strimzi#9585)

Signed-off-by: Jakub Scholz <www@scholzj.com>
Successfully merging this pull request may close these issues.

Revert blocked scale-down and continue reconciliation