Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed the KafkaRebalance resource reconciliation getting stuck if CC pod is restarted during rebalance #10224

Merged
merged 4 commits into from
Jul 1, 2024

Conversation

ShubhamRwt
Copy link
Contributor

@ShubhamRwt ShubhamRwt commented Jun 13, 2024

Type of change

Select the type of your PR

  • Bugfix

Description

This PR fixes #10091 . This PR makes sure that in case the CC pod is restarted in middle of a rebalance, then we generate a new optimization proposal and then this newly generated proposal can be reviewed by the user.

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@ShubhamRwt ShubhamRwt added this to the 0.42.0 milestone Jun 13, 2024
@scholzj
Copy link
Member

scholzj commented Jun 14, 2024

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ShubhamRwt. Left few comments, but overall it LGTM.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some nits.

.compose(v -> {
// Sets the user task to empty
cruiseControlServer.setupUserTasktoEmpty();

Copy link
Contributor

@fvaleri fvaleri Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the empty line, and there is another one at the start of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally added them to have some differentiation but I will remove it.

assertState(context, client, namespace, RESOURCE_NAME, KafkaRebalanceState.Rebalancing);
}))
.compose(v -> {
// Sets the user task to empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Sets the user task to empty
// Sets a user_tasks response with an empty task list simulating CC restart

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Should be reviewed by @ppatierno as an SME.

@scholzj
Copy link
Member

scholzj commented Jun 27, 2024

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

} else {
if (cruiseControlResponse.getJson().isEmpty()) {
// Cruise Control restarted: reset the state because the tasks queue is not persisted
// this may also happen when the tasks' retention time expires, or the cache becomes full
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not strictly say the CC was restarted. Let's be generic about receiving an empty task queue from CC, or anyway the task we are asking for doesn't exist anymore. Of course CC restarted could be a reason but maybe not the only one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC restart will be the most common cause, but I think the second line clarifies that it's not the only possible cause for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so "Cruise Control restarted" should be used as a use case, not as a strong statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So something like:

This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine with me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, Thanks Paolo

default:
throw new IllegalStateException("Unexpected user task status: " + taskStatus);
if (userTasks.isEmpty()) {
// Cruise Control restarted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto as before, it could be not just about CC restarted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should say "Cruise Control restarted, or the task response expired, or it was deleted because the internal cache was full".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should be used even for the other comment I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above ?

This may happen if:
1. Cruise Control restarted so resetting the state because the tasks queue is not persisted
2. Task's retention time expired, or the cache has become full

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@ppatierno
Copy link
Member

@ShubhamRwt I had a pass and left comments, but also testKafkaCCAndRebalanceWithMultipleCOs seems to fail. Could you also "resolve conversation" for the feedback you addressed please? (to shrink the long page here, thanks!).

@scholzj
Copy link
Member

scholzj commented Jun 28, 2024

testKafkaCCAndRebalanceWithMultipleCOs is IIRC flaky, so it might be unrelated.

@ppatierno
Copy link
Member

@ShubhamRwt have you tried if the issue happens when we are in PendingProposal and not Rebalancing? Because I see your code fixes the scenario in onRebalancing only.

@ShubhamRwt
Copy link
Contributor Author

@ShubhamRwt have you tried if the issue happens when we are in PendingProposal and not Rebalancing? Because I see your code fixes the scenario in onRebalancing only.

Reading through the code we have @ppatierno I reached to the conclusion that this issue can happen in case of Rebalancing state only. The error logs states that the error happens when the usertasks array is empty and going through the our code we are using the getUserTaskStatus method or you can say the userTask related stuff only in the Rebalancing state. I can try the use case you suggested and see if we get some completely different issue with it

@ppatierno
Copy link
Member

ppatierno commented Jun 28, 2024

@ShubhamRwt No need to investigate, I was curious if you did such a test, because I noticed a different behaviour we apply between asking for a proposal/checking the status and rebalancing/checking the status.
When we ask for a proposal, we hit the rebalance endpoint with a POST but we do the same even for checking the status (which to me sounds asking for a new proposal, not checking the status of the already asked one).
When we ask for the rebalancing, we hit the rebalance endpoint with a POST first and then the user tasks endpoint with a GET for checking the status.
This reflects what you say but at the same time rings an alarm to me ... why do we have this different behaviour? Why we don't use the same POST and GET (for the status) for a proposal? In the end they are different for the dryrun flag. Maybe we should investigate more on this, unless there is an obvious reason (I don't remember). @kyguy any thoughts?

@ShubhamRwt
Copy link
Contributor Author

Waiting for the regression test to complete, I will push the new comments related changes w.r.t it.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj scholzj removed this from the 0.42.0 milestone Jul 1, 2024
@scholzj scholzj modified the milestones: 0.43.0, 0.42.0 Jul 1, 2024
Copy link
Member

@kyguy kyguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @ShubhamRwt

@scholzj scholzj merged commit 822b4ac into strimzi:main Jul 1, 2024
13 checks passed
@scholzj
Copy link
Member

scholzj commented Jul 1, 2024

Thanks @ShubhamRwt

@kyguy
Copy link
Member

kyguy commented Jul 1, 2024

This reflects what you say but at the same time rings an alarm to me ... why do we have this different behaviour? Why we don't use the same POST and GET (for the status) for a proposal? In the end they are different for the dryrun flag. Maybe we should investigate more on this, unless there is an obvious reason (I don't remember). @kyguy any thoughts?

@ppatierno I don't remember any specific reason why we would do this on purpose, I believe this was an oversight.

@ppatierno
Copy link
Member

ppatierno commented Jul 2, 2024

@kyguy I opened this issue #10294 I think it's worth investigating and getting the code right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: KafkaRebalance resource reconciliation gets stuck if CC pod is restarted during rebalance
5 participants