Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CPU estimation in the Cruise Control capacity configuration #8576

Closed
scholzj opened this issue May 29, 2023 · 3 comments
Closed

Improve CPU estimation in the Cruise Control capacity configuration #8576

scholzj opened this issue May 29, 2023 · 3 comments
Assignees
Labels

Comments

@scholzj
Copy link
Member

scholzj commented May 29, 2023

The Cruise Control Capacity configuration generator seems to completely ignore the resource request today. That seems to result in weird situations. For example:

  1. When the request is set to 5 CPU and the limit is set to 10 CPU, Strimzi decides to ignore both and sets the CPU capacity to 1.0 which seems to be clearly wrong. While I think preferring the limit might be questionable and subjective, setting the CPU to 1.0 seems clearly wrong since it is well below both the limit and request.

  2. When the request is set to 5 CPU and the limit is not set, Strimzi decides to ignore the request and sets the CPU capacity to 1.0 which seems to be clearly wrong. While the current code seems to prefer a limit instead of a request and in this case, the limit is unknown ... setting 1.0 as the capacity seems clearly wrong since the request of 5 CPUs is clearly much higher and these CPUs will be always available.

We should update the logic to either take the request configuration properly into account or at least not misconfigure it by setting the CPU capacity to 1.0 when we cannot decide.

@kyguy
Copy link
Member

kyguy commented May 30, 2023

Yes, having cores == 1 is not great or correct default value. We have discussed in the past we can only be guaranteed accurate balances based on CPU when the following:

(A) resource.requests == resource.limits
(B) User provides CPU overrides per broker

To summarize the reasoning from the past concerning (A)

(1) When requests != limits e.g  requests=1 and limits=3, we are not guaranteed to have accurate CPU balances, since every broker’s real CPU capacity would be somewhere between 1 and 3 cores.
(2) When requests == <_value_> and limits == null we are not guaranteed to have accurate CPU balances, since the broker load could be underestimated when a broker uses more CPU resources than requested. Kubernetes will reserve the cores requested for the pod but still allow the pod to use all of the resources available to the node. e.g.  When a pod with a 2 core request is using 4 cores, Cruise Control underestimates the load that moves from that broker pod to another since Cruise Control thinks it is moving only 2 cores worth of load when it is really moving 4 cores worth of load.
(3) When requests == null and limits == <_value_>:  We are guaranteed to have accurate CPU balances, since K8s will set the requests value equal to the limit value as detailed in k8s documentation. [1] Therefore, it would work in the same way as setting resource.requests == resource.limits.

Cruise Control can take CPU capacity config in two forms:
(A) Percentage of available CPU resources
(B) CPU Cores

Originally, we used (A) exclusively, but I updated the code [2] to use (B) for better UX. However, when I did this update I accidentally removed the usage of (A) which was intended to be used in circumstances like this. (e.g. instead of setting CPU limit to 1 core, it is supposed to set the CPU limit to 100 percent of available CPU resources)

Like setting the capacity limit to requests or limits alone, the usage of (A) does not guarantee accurate rebalances, but I agree that it would be better UX than setting the cores == 1 by default.

Short term recommendation:
(1) Add this information to the strange code (so we don't forget why it is the way it is)
(2) Update the default to usage of (A) (though I am open to setting it to requests or limits in the situations described above too) and maybe even log a warning about balance accuracy.

Long term recommendation:
We create a custom BrokerCapacityResolver plugin for Strimzi Cruise Control for detecting and setting this capacity settings accurately and automatically

[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits
[2] #6892

@tombentley
Copy link
Member

Triaged on 15/06/2023: Kyle is going to work on implementing the short term recommendation described above.

@kyguy
Copy link
Member

kyguy commented Jun 15, 2023

Plan is to move forward with Jakub's recommendation: For the cases where requests != limits or requests == <value> and limits == null we should set the CPU capacity == requests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants