Fixed the KafkaRebalance resource reconciliation getting stuck if CC pod is restarted during rebalance #10224
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
Thanks @ShubhamRwt. I left a few comments, but overall it LGTM.
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
LGTM. Left some nits.
.compose(v -> {
    // Sets the user task to empty
    cruiseControlServer.setupUserTasktoEmpty();

Please remove the empty line, and there is another one at the start of this method.
I added them intentionally for some visual separation, but I will remove them.
    assertState(context, client, namespace, RESOURCE_NAME, KafkaRebalanceState.Rebalancing);
}))
.compose(v -> {
    // Sets the user task to empty
Suggested change:
- // Sets the user task to empty
+ // Sets a user_tasks response with an empty task list simulating CC restart
LGTM. Should be reviewed by @ppatierno as an SME.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
} else {
    if (cruiseControlResponse.getJson().isEmpty()) {
        // Cruise Control restarted: reset the state because the tasks queue is not persisted
        // this may also happen when the tasks' retention time expires, or the cache becomes full
I would not state outright that CC was restarted. Let's be generic: we received an empty task queue from CC, or the task we are asking for no longer exists. A CC restart could be one reason, but maybe not the only one.
CC restart will be the most common cause, but I think the second line clarifies that it's not the only possible cause for this.
so "Cruise Control restarted" should be used as a use case, not as a strong statement
So something like:
This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full
fine with me
Great, Thanks Paolo
    default:
        throw new IllegalStateException("Unexpected user task status: " + taskStatus);
if (userTasks.isEmpty()) {
    // Cruise Control restarted
Same as before: the cause might not be only a CC restart.
Here we should say "Cruise Control restarted, or the task response expired, or it was deleted because the internal cache was full".
And it should be used even for the other comment I think.
Same as above?
This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full
Yep
@ShubhamRwt I had a pass and left comments, but also
@ShubhamRwt have you tried if the issue happens when we are in
Reading through the code, @ppatierno, I reached the conclusion that this issue can happen in case of
@ShubhamRwt No need to investigate. I was curious whether you had run such a test, because I noticed that we behave differently between asking for a proposal/checking the status and rebalancing/checking the status.
Once the regression test completes, I will push the changes that address the new comments.
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Nice work @ShubhamRwt
Thanks @ShubhamRwt
@ppatierno I don't remember any specific reason why we would do this on purpose, I believe this was an oversight.
Type of change
Select the type of your PR
Description
This PR fixes #10091. It ensures that if the CC pod is restarted in the middle of a rebalance, a new optimization proposal is generated, which the user can then review.
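
As a rough illustration of the behaviour described above, here is a minimal Java sketch of the decision taken when the Cruise Control user tasks response comes back empty during a rebalance. It is not the code from this PR: the class `RebalanceRestartSketch`, the enums `State` and `TaskStatus`, and the method `onUserTasks` are hypothetical names invented for the example, and the real operator works on `KafkaRebalance` resources and Cruise Control API responses rather than plain lists.

```java
import java.util.List;

/** Illustrative only: every name here is hypothetical, not a Strimzi class. */
final class RebalanceRestartSketch {

    /** Simplified stand-in for the outcome of one status check during a rebalance. */
    enum State { NEW_PROPOSAL_NEEDED, REBALANCING, READY }

    /** Simplified stand-in for the status of a Cruise Control user task. */
    enum TaskStatus { ACTIVE, IN_EXECUTION, COMPLETED }

    /** Decides what to do next from the user tasks returned while rebalancing. */
    static State onUserTasks(List<TaskStatus> userTasks) {
        if (userTasks.isEmpty()) {
            // This may happen if:
            // 1. Cruise Control restarted, so the state is reset because the tasks queue is not persisted
            // 2. The task's retention time expired, or the cache has become full
            // Either way the task is gone, so ask for a fresh proposal for the user to review.
            return State.NEW_PROPOSAL_NEEDED;
        }
        TaskStatus status = userTasks.get(0);
        switch (status) {
            case COMPLETED:
                return State.READY;
            case ACTIVE:
            case IN_EXECUTION:
                return State.REBALANCING;
            default:
                throw new IllegalStateException("Unexpected user task status: " + status);
        }
    }

    public static void main(String[] args) {
        // A task that is still running keeps the resource in the rebalancing state.
        System.out.println(onUserTasks(List.of(TaskStatus.IN_EXECUTION)));
        // An empty list (for example after a CC restart) triggers a new proposal.
        System.out.println(onUserTasks(List.of()));
    }
}
```

The point of the sketch is that an empty task list is treated as a normal outcome with its own transition instead of leaving the reconciliation waiting on a task that no longer exists.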
Checklist
Please go through this checklist and make sure all applicable tasks have been done