Fixed the KafkaRebalance resource reconciliation getting stuck if CC pod is restarted during rebalance #10224

ShubhamRwt · 2024-06-13T13:10:11Z

Type of change

Select the type of your PR

Bugfix

Description

This PR fixes #10091. If the CC pod is restarted in the middle of a rebalance, the operator now generates a new optimization proposal, which the user can then review.

Checklist

Please go through this checklist and make sure all applicable tasks have been done

Write tests
Make sure all tests pass
Update documentation
Check RBAC rights for Kubernetes / OpenShift roles
Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
Reference relevant issue(s) and close them after merging
Update CHANGELOG.md
Supply screenshots for visual changes, such as Grafana dashboards

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

systemtest/src/test/java/io/strimzi/systemtest/cruisecontrol/CruiseControlST.java

scholzj · 2024-06-14T07:49:16Z

/azp run regression

azure-pipelines · 2024-06-14T07:49:28Z

Azure Pipelines successfully started running 1 pipeline(s).

fvaleri

Thanks @ShubhamRwt. Left a few comments, but overall it LGTM.

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

systemtest/src/test/java/io/strimzi/systemtest/cruisecontrol/CruiseControlST.java

...t/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperatorTest.java

...test/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/MockCruiseControl.java

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

...t/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperatorTest.java

systemtest/src/test/java/io/strimzi/systemtest/cruisecontrol/CruiseControlST.java

...n/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

fvaleri

LGTM. Left some nits.

fvaleri · 2024-06-27T13:26:17Z

...t/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperatorTest.java

+                .compose(v -> {
+                    // Sets the user task to empty
+                    cruiseControlServer.setupUserTasktoEmpty();
+


Please remove the empty line, and there is another one at the start of this method.

I added them intentionally for visual separation, but I will remove them.

fvaleri · 2024-06-27T13:27:59Z

...t/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperatorTest.java

+                    assertState(context, client, namespace, RESOURCE_NAME, KafkaRebalanceState.Rebalancing);
+                }))
+                .compose(v -> {
+                    // Sets the user task to empty


Suggested change

- // Sets the user task to empty
+ // Sets a user_tasks response with an empty task list simulating CC restart

scholzj

LGTM. Should be reviewed by @ppatierno as an SME.

scholzj · 2024-06-27T16:45:04Z

/azp run regression

azure-pipelines · 2024-06-27T16:45:15Z

Azure Pipelines successfully started running 1 pipeline(s).

ppatierno · 2024-06-28T07:33:41Z

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

-                            } else {
+                    if (cruiseControlResponse.getJson().isEmpty()) {
+                        // Cruise Control restarted: reset the state because the tasks queue is not persisted
+                        // this may also happen when the tasks' retention time expires, or the cache becomes full


I would not state outright that CC was restarted. Let's be generic: we received an empty task queue from CC, or the task we are asking for no longer exists. A CC restart could be the reason, but it may not be the only one.

CC restart will be the most common cause, but I think the second line clarifies that it's not the only possible cause for this.

So "Cruise Control restarted" should be given as one possible cause, not stated as a fact.

So something like:

This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full

fine with me

Great, Thanks Paolo
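The behaviour agreed in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration, not the operator's actual code: the class `EmptyTaskResponseSketch`, its `State` enum, and the method `nextStateWhileRebalancing` are hypothetical names, and the choice of `PENDING_PROPOSAL` as the follow-up state is an assumption based on the PR description (a new proposal is generated for the user to review).

```java
import java.util.Map;

/** Hypothetical stand-in for the operator's state handling; not Strimzi's actual classes. */
final class EmptyTaskResponseSketch {

    /** A small subset of rebalance states, named after those mentioned in this thread. */
    enum State { PENDING_PROPOSAL, REBALANCING }

    /**
     * Decides the next state while a rebalance is in progress.
     *
     * An empty response may happen if:
     * 1. Cruise Control restarted (the tasks queue is not persisted)
     * 2. The task's retention time expired, or the cache has become full
     */
    static State nextStateWhileRebalancing(Map<String, ?> userTaskJson) {
        if (userTaskJson.isEmpty()) {
            // The task is gone: assume we request a fresh proposal for the user to review.
            return State.PENDING_PROPOSAL;
        }
        // The task still exists: keep polling its status.
        return State.REBALANCING;
    }
}
```

The point of the sketch is only that an empty response is treated as a normal, recoverable outcome rather than an error, whatever its cause.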

ppatierno · 2024-06-28T07:34:04Z

...n/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java

-                                        default:
-                                            throw new IllegalStateException("Unexpected user task status: " + taskStatus);
+                                    if (userTasks.isEmpty()) {
+                                        // Cruise Control restarted


Same as before: a CC restart may not be the only cause.

Here we should say "Cruise Control restarted, or the task response expired, or it was deleted because the internal cache was full".

And it should be used even for the other comment I think.

Same as above?

This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full
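On the API client side, the same idea might look like the sketch below. It is an illustration under stated assumptions, not the code in `CruiseControlApiImpl`: the class `UserTasksSketch` and the method `firstTaskStatus` are hypothetical, and the `Status` key is assumed to be the field that carries a task's status in the user tasks response.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; it only illustrates handling an empty user task list. */
final class UserTasksSketch {

    /**
     * Returns the status of the first user task, or an empty Optional when the list is empty.
     *
     * An empty list may happen if:
     * 1. Cruise Control restarted (the tasks queue is not persisted)
     * 2. The task's retention time expired, or the cache has become full
     */
    static Optional<String> firstTaskStatus(List<Map<String, String>> userTasks) {
        if (userTasks.isEmpty()) {
            // Report "no task" to the caller instead of throwing.
            return Optional.empty();
        }
        // "Status" is an assumed field name for the task status.
        return Optional.ofNullable(userTasks.get(0).get("Status"));
    }
}
```

Returning an empty `Optional` leaves the decision about what to do next (for example, requesting a new proposal) to the caller.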

...n/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java

ppatierno · 2024-06-28T07:46:34Z

@ShubhamRwt I had a pass and left comments, but also testKafkaCCAndRebalanceWithMultipleCOs seems to fail. Could you also "resolve conversation" for the feedback you addressed please? (to shrink the long page here, thanks!).

scholzj · 2024-06-28T08:34:11Z

testKafkaCCAndRebalanceWithMultipleCOs is IIRC flaky, so it might be unrelated.

ppatierno · 2024-06-28T09:06:49Z

@ShubhamRwt have you checked whether the issue also happens when we are in PendingProposal and not Rebalancing? Because I see your code fixes the scenario in onRebalancing only.

ShubhamRwt · 2024-06-28T09:15:05Z

@ShubhamRwt have you checked whether the issue also happens when we are in PendingProposal and not Rebalancing? Because I see your code fixes the scenario in onRebalancing only.

@ppatierno Reading through our code, I reached the conclusion that this issue can happen only in the Rebalancing state. The error logs state that the error happens when the userTasks array is empty, and our code uses the getUserTaskStatus method (and user task handling in general) only in the Rebalancing state. I can try the use case you suggested and see whether it produces a completely different issue.

ppatierno · 2024-06-28T09:32:36Z

@ShubhamRwt No need to investigate. I was just curious whether you had run such a test, because I noticed that we behave differently when asking for a proposal and checking its status than when rebalancing and checking its status.
When we ask for a proposal, we hit the rebalance endpoint with a POST, but we do the same to check the status (which to me looks like asking for a new proposal, not checking the status of the one already requested).
When we ask for a rebalance, we first hit the rebalance endpoint with a POST and then the user tasks endpoint with a GET to check the status.
This matches what you say, but it also rings an alarm bell for me: why do we have this different behaviour? Why don't we use the same POST and GET (for the status) for a proposal? In the end, they differ only in the dryrun flag. Maybe we should investigate this further, unless there is an obvious reason (I don't remember). @kyguy any thoughts?
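The two request sequences described in this comment can be laid out side by side in a short sketch. It only builds descriptive strings and makes no HTTP calls. The class `RebalanceFlowsSketch` and its methods are hypothetical; the `/kafkacruisecontrol` path prefix and the `dryrun` and `user_task_ids` parameter names are assumed to match Cruise Control's REST API, and the task ID is a placeholder.

```java
import java.util.List;

/** Hypothetical illustration of the two request flows; it performs no network calls. */
final class RebalanceFlowsSketch {

    /** Describes a call to the rebalance endpoint; dryrun=true means "proposal only". */
    static String rebalanceCall(boolean dryRun) {
        return "POST /kafkacruisecontrol/rebalance?dryrun=" + dryRun;
    }

    /** Describes a status check for one user task (the ID is a placeholder). */
    static String userTasksCall(String userTaskId) {
        return "GET /kafkacruisecontrol/user_tasks?user_task_ids=" + userTaskId;
    }

    /** Proposal flow as described above: the same POST is used to ask and to check the status. */
    static List<String> proposalFlow() {
        return List.of(rebalanceCall(true), rebalanceCall(true));
    }

    /** Rebalance flow as described above: a POST to start, then a GET on user tasks for the status. */
    static List<String> rebalanceFlow(String userTaskId) {
        return List.of(rebalanceCall(false), userTasksCall(userTaskId));
    }
}
```

Seen this way, only the rebalance flow depends on the user tasks endpoint, which is consistent with the fix touching the Rebalancing state only.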

ShubhamRwt · 2024-06-28T11:48:28Z

I am waiting for the regression tests to complete; I will then push the changes that address the new comments.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

kyguy

Nice work @ShubhamRwt

scholzj · 2024-07-01T17:45:33Z

Thanks @ShubhamRwt

kyguy · 2024-07-01T17:51:38Z

This matches what you say, but it also rings an alarm bell for me: why do we have this different behaviour? Why don't we use the same POST and GET (for the status) for a proposal? In the end, they differ only in the dryrun flag. Maybe we should investigate this further, unless there is an obvious reason (I don't remember). @kyguy any thoughts?

@ppatierno I don't remember any specific reason why we would do this on purpose; I believe this was an oversight.

ppatierno · 2024-07-02T12:53:20Z

@kyguy I opened issue #10294. I think it's worth investigating and getting the code right.

ShubhamRwt requested review from scholzj, ppatierno, kyguy and fvaleri June 13, 2024 13:12

ShubhamRwt added this to the 0.42.0 milestone Jun 13, 2024

scholzj reviewed Jun 14, 2024

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java Outdated

systemtest/src/test/java/io/strimzi/systemtest/cruisecontrol/CruiseControlST.java Outdated

fvaleri reviewed Jun 18, 2024

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java Outdated

systemtest/src/test/java/io/strimzi/systemtest/cruisecontrol/CruiseControlST.java Outdated

scholzj added the cluster-balancing label Jun 19, 2024

fvaleri reviewed Jun 24, 2024

scholzj reviewed Jun 25, 2024

ShubhamRwt added 3 commits June 27, 2024 12:51

Fix the CC bug

b1539d8

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Added the unit test

314fe48

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Added suggestions by Fede and Jakub

5857d12

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

ShubhamRwt force-pushed the fixCCbug branch from 6fa9871 to 5857d12 Compare June 27, 2024 10:55

fvaleri approved these changes Jun 27, 2024

scholzj approved these changes Jun 27, 2024

ppatierno reviewed Jun 28, 2024

ppatierno approved these changes Jun 28, 2024

Added suggestions by Paolo

fc92b4c

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj removed this from the 0.42.0 milestone Jul 1, 2024

scholzj modified the milestones: 0.43.0, 0.42.0 Jul 1, 2024

kyguy approved these changes Jul 1, 2024

scholzj merged commit 822b4ac into strimzi:main Jul 1, 2024
13 checks passed
