
Allow reconciliation to continue when scale-down is blocked due to brokers still in use #9585

Conversation

scholzj
Member

@scholzj scholzj commented Jan 24, 2024

Type of change

  • Enhancement / new feature

Description

Currently, when a user tries to scale down the Kafka cluster but the brokers that would be removed are still in use, we just throw an exception and fail the reconciliation. This works fine in general, but it makes it hard to build any automation on top of it since it essentially blocks all other operational tasks. For example, if a user scales down the Kafka cluster by mistake, the operator starts throwing errors and will not proceed with other work such as renewing certificates.

This PR tries to address it by reverting the scale-down changes and allowing the reconciliation to proceed as usual. It takes a pragmatic approach and reverts all scale-downs, even if some of the brokers are already empty and could be shut down. Since Kafka has no fencing, this could mean that the brokers that were empty get new topics assigned. But it simplifies the logic significantly because it does not need to deal with situations such as a user scaling down brokers 3 and 4 where broker 3 is empty and broker 4 is still in use. In that case, we cannot simply correct the replica count by 1, as that would mean broker 3 (the empty one) is kept and broker 4 is removed => so it would still be blocked with an exception.

From the technical perspective, this PR introduces a new class KafkaClusterCreator that contains the logic. This class is used to:

  • Create the Kafka cluster model
  • Check if it has any scale-down issues
  • If there are any scale-down issues, it tries to fix them and recursively creates the Kafka cluster model again (this time, any remaining scale-down issues are raised as an error, in case they were not fixed by the first attempt)
  • It is written in a way to make it easier to extend it in the next steps to check against removing the broker role and possibly other checks (we might consider moving the storage checks to it from the KafkaCluster.fromCrd(...) method).
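The flow above can be sketched in simplified Java. All class and method names here are illustrative stand-ins, not the actual Strimzi API: the cluster model is reduced to a list of broker IDs, and the "broker still in use" check is a plain set lookup instead of the asynchronous Kafka Admin API check the operator performs.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the revert-and-retry logic described above.
// The real KafkaClusterCreator builds a full KafkaCluster model and
// checks broker usage asynchronously; none of these names are real API.
public class ClusterCreatorSketch {
    /** Simplified stand-in for the Kafka cluster model: just broker IDs. */
    public record Cluster(List<Integer> brokers) { }

    /**
     * Builds the desired cluster. If the scale-down would remove brokers
     * that still host partition replicas, revert to the current broker set
     * and build again. On the second attempt, any remaining scale-down
     * problem is raised as an error instead of being fixed.
     */
    public static Cluster createCluster(List<Integer> current,
                                        List<Integer> desired,
                                        Set<Integer> brokersInUse) {
        return createCluster(current, desired, brokersInUse, true);
    }

    private static Cluster createCluster(List<Integer> current,
                                         List<Integer> desired,
                                         Set<Integer> brokersInUse,
                                         boolean tryToFix) {
        // Brokers that the desired state would remove
        List<Integer> removed = current.stream()
                .filter(id -> !desired.contains(id))
                .collect(Collectors.toList());
        boolean blocked = removed.stream().anyMatch(brokersInUse::contains);

        if (!blocked) {
            return new Cluster(desired);
        } else if (tryToFix) {
            // Pragmatic fix: revert the whole scale-down, even if some of
            // the removed brokers were empty, and create the model again.
            return createCluster(current, current, brokersInUse, false);
        } else {
            throw new IllegalStateException(
                    "Cannot scale down brokers " + removed + " that are still in use");
        }
    }
}
```

On the first pass a blocked scale-down is "fixed" by reverting to the current broker set; on the recursive second pass `tryToFix` is false, so any problem that was not fixed fails the reconciliation, mirroring the behavior described above.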

The KafkaClusterCreator is used by the KafkaAssemblyOperator, and the created KafkaCluster instance is used to create the KafkaReconciler. I considered some other options as well, but rejected them in the end:

  • Doing it in the constructor of KafkaReconciler => I rejected it because it seemed to push a lot of logic into the constructor and also had issues with asynchronous code. The advantage of the current approach is that it keeps the constructor clean with only assignments of parameters to object fields.
  • Doing it as the first step of the KafkaReconciler. This would work quite nicely. But it would mean that the KafkaCluster object is not final in the KafkaReconciler class. That seemed like something that can be tricky to maintain in the future and is something to avoid.
  • Keeping the original KafkaCluster object and just modifying the node assignments inside it. This would probably be easier for the initial implementation. But it might be harder to maintain, as it would mean the fields inside the KafkaCluster would not be final, would be subject to change, and we would need to be careful about which parts are changed, and where and when they are used. Creating a brand new KafkaCluster object seemed like a better approach.

This is currently a draft used for a preliminary review of the code. It does not contain any new tests, nor does it update existing tests for the new code.

This should resolve #9232.

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md

@scholzj scholzj added this to the 0.40.0 milestone Jan 24, 2024
@scholzj
Member Author

scholzj commented Jan 24, 2024

@ppatierno Would you mind having a quick look if the solution for this makes sense to you before I write all the tests etc.?

Member

@ppatierno ppatierno left a comment


The overall logic seems to be reasonable to me. I left a couple of questions.
I will start crying about conflicts I see with my migration PR later ...

@scholzj
Member Author

scholzj commented Jan 25, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

…okers still in use - Closes strimzi#9232

Signed-off-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj scholzj force-pushed the allow-reconciliation-to-continue-when-scale-down-is-prevented-due-to-brokers-in-use branch from 418fe7d to ff5eb9f on January 26, 2024 18:56
Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj
Member Author

scholzj commented Jan 26, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Member Author

scholzj commented Jan 26, 2024

/azp run feature-gates-regression

@scholzj scholzj marked this pull request as ready for review January 26, 2024 23:57

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj requested a review from im-konge January 26, 2024 23:57
@scholzj
Member Author

scholzj commented Jan 27, 2024

/azp run kraft-regression


Azure Pipelines successfully started running 1 pipeline(s).

Member

@ppatierno ppatierno left a comment


LGTM. I saw multiple TODOs which should be removed.

Member

@im-konge im-konge left a comment


LGTM, thanks

Signed-off-by: Jakub Scholz <www@scholzj.com>
@scholzj
Member Author

scholzj commented Jan 27, 2024

/azp run regression


Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj merged commit 35a39fb into strimzi:main Jan 28, 2024
21 checks passed
@scholzj scholzj deleted the allow-reconciliation-to-continue-when-scale-down-is-prevented-due-to-brokers-in-use branch January 28, 2024 13:19
john-mcpeek pushed a commit to john-mcpeek/strimzi-kafka-operator that referenced this pull request Jan 28, 2024
…okers still in use (strimzi#9585)

Signed-off-by: John McPeek <juan.mcpeek@gmail.com>
adamj-m pushed a commit to jamesfarlow/strimzi-kafka-operator that referenced this pull request Feb 1, 2024
…okers still in use (strimzi#9585)

Signed-off-by: Jakub Scholz <www@scholzj.com>
Successfully merging this pull request may close these issues.

Revert blocked scale-down and continue reconciliation