Add CPU overrides for CC capacity config #6892

kyguy · 2022-06-02T23:59:42Z

Type of change

Enhancement / new feature

Description

For accurate rebalances between brokers running on nodes with heterogeneous CPU resources, Cruise Control must know the CPU capacity limit of individual brokers. This PR allows users to specify capacity limit overrides for lists of individual Kafka brokers in the overrides property in Kafka.spec.cruiseControl.brokerCapacity. This PR also bumps the Cruise Control version to pick up an enhancement [1] which allows us to specify milliCPU and fractional core CPU capacity values.

This PR addresses the CPU capacity issues of #6265 and the UI issues of #5951.

apiVersion: {KafkaApiVersion}
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ...
  cruiseControl:
    # ...
    brokerCapacity:
      cpu: "4"
      overrides:
      - brokers: [0]
        cpu: 1000m
      - brokers: [1, 2]
        cpu: "1.555"

[1] linkedin/cruise-control#1831
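To illustrate how the `overrides` list in the example above could map to per-broker values, here is a minimal sketch. The class `CpuOverrideSketch`, the type `CpuOverride`, and the method `cpuFor` are hypothetical names for illustration, not part of the Strimzi API, and the "first matching override wins" rule is an assumption rather than confirmed operator behavior.

```java
import java.util.List;

final class CpuOverrideSketch {

    /** Hypothetical stand-in for one entry of the overrides list. */
    static final class CpuOverride {
        final List<Integer> brokers;
        final String cpu;

        CpuOverride(List<Integer> brokers, String cpu) {
            this.brokers = brokers;
            this.cpu = cpu;
        }
    }

    /**
     * Returns the CPU capacity for a broker: the value of the first override
     * that lists the broker ID (an assumption), otherwise the default value.
     */
    static String cpuFor(int brokerId, String defaultCpu, List<CpuOverride> overrides) {
        for (CpuOverride override : overrides) {
            if (override.brokers.contains(brokerId)) {
                return override.cpu;
            }
        }
        return defaultCpu;
    }
}
```

With the values from the example, brokers 0, 1 and 2 would get `1000m`, `1.555` and `1.555`, and any other broker would fall back to the default of `4`.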

Checklist

Please go through this checklist and make sure all applicable tasks have been done

Write tests
Make sure all tests pass
Update documentation
Check RBAC rights for Kubernetes / OpenShift roles
Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
Reference relevant issue(s) and close them after merging
Update CHANGELOG.md
Supply screenshots for visual changes, such as Grafana dashboards

scholzj

I left some comments. Mostly about the naming.

But I also wonder a bit about the use case for this. If the broker resources are set, we should always set them automatically (and your docs changes suggest we do).

So, what is the use case? The only situation where I can see it being useful is when you run on dedicated hosts and don't set the CPU request, which seems a bit niche. It might be good to explain in the docs when you would use this.

scholzj · 2022-06-06T07:15:46Z

api/src/main/java/io/strimzi/api/kafka/model/balancing/BrokerCapacity.java

 @EqualsAndHashCode
 public class BrokerCapacity implements UnknownPropertyPreserving, Serializable {

    private static final long serialVersionUID = 1L;

    private String disk;
    private Integer cpuUtilization;
+    private String cpuCores;


Kubernetes uses just cpu, not cpuCores. What is the reason for using a different name?

I wanted to distinguish this from how we previously set the CPU resource with cpuUtilization (which is now deprecated).

It is still different from cpuUtilization. And if I understand it right, you use the same format as the cpu field in Kubernetes.

@ppatierno @tombentley WDYT?

If we can stick with naming and terminology that users already know from built-in Kubernetes features, it's much better imho. My +1 for using cpu.

scholzj · 2022-06-06T07:16:35Z

api/src/main/java/io/strimzi/api/kafka/model/balancing/BrokerCapacity.java

+    @Description("Broker capacity for CPU resource in cores or milliCPU. " +
+            "For example, 1, 1.500, 1500m.")


Suggested change

-    @Description("Broker capacity for CPU resource in cores or milliCPU. " +
-            "For example, 1, 1.500, 1500m.")
+    @Description("Broker capacity for CPU resource in cores or millicores. " +
+            "For example, 1, 1.500, 1500m.")

Should we link here to Kubernetes docs for the details about the units?

scholzj · 2022-06-06T07:16:48Z

api/src/main/java/io/strimzi/api/kafka/model/balancing/BrokerCapacityOverride.java

 @EqualsAndHashCode
 public class BrokerCapacityOverride implements UnknownPropertyPreserving, Serializable {
    private static final long serialVersionUID = 1L;

    private List<Integer> brokers;
+    private String cpuCores;


Same as above.

scholzj · 2022-06-06T07:18:08Z

...r-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/BrokerCapacity.java

@@ -9,20 +9,19 @@ public class BrokerCapacity {
    // CC designates the id of this default broker entry as "-1".
    public static final int DEFAULT_BROKER_ID = -1;
    public static final String DEFAULT_BROKER_DOC = "This is the default capacity. Capacity unit used for disk is in MiB, cpu is in percentage, network throughput is in KiB.";
-
-    public static final String DEFAULT_CPU_UTILIZATION_CAPACITY = "100";  // as a percentage (0-100)
+    public static final String DEFAULT_CPU_CORE_CAPACITY = "1";  // as a percentage (0-100)


So, is this 1 CPU core? Or is it 100% of available cores? The name suggests the former to me, but the comment suggests the latter. It would be great to make this clearer from the name.

I think the comment is just a leftover from copying the previous line. It should be just 1 core. The comment is wrong.

scholzj · 2022-06-06T07:21:57Z

documentation/api/io.strimzi.api.kafka.model.CruiseControlSpec.adoc

+NOTE: Disk capacity limits are automatically generated by Strimzi, so you do not need to set them.

-[NOTE]
-====
-In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
+NOTE: CPU capacity limits are automatically generated by Strimzi when you set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
 That way, all CPU resources are reserved upfront and are always available.
 This configuration allows Cruise Control to properly evaluate the CPU utilization when preparing the rebalance proposals based on CPU goals.


Can we somehow join both notes into one? I think it looks a bit weird to have the two notes right after each other.

scholzj · 2022-06-06T07:23:54Z

/azp run regression

azure-pipelines · 2022-06-06T07:24:05Z

Azure Pipelines successfully started running 1 pipeline(s).

ppatierno · 2022-06-06T08:24:58Z

...r-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/BrokerCapacity.java

@@ -9,20 +9,19 @@ public class BrokerCapacity {
    // CC designates the id of this default broker entry as "-1".
    public static final int DEFAULT_BROKER_ID = -1;
    public static final String DEFAULT_BROKER_DOC = "This is the default capacity. Capacity unit used for disk is in MiB, cpu is in percentage, network throughput is in KiB.";
-
-    public static final String DEFAULT_CPU_UTILIZATION_CAPACITY = "100";  // as a percentage (0-100)
+    public static final String DEFAULT_CPU_CORE_CAPACITY = "1";  // as a percentage (0-100)


I think the comment is just a leftover from copying the previous line. It should be just 1 core. The comment is wrong.

ppatierno · 2022-06-06T08:27:08Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/CpuCapacity.java

+        this.cores = milliCputoCpu(Quantities.parseCpuAsMilliCpus(cores));
+    }
+
+    public static String milliCputoCpu(int milliCPU) {


just a nit ... milliCpuToCpu with capital T
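For context on what a helper like `milliCpuToCpu` does, here is a minimal standalone sketch of converting between the two CPU notations. It is an independent re-implementation for illustration only: the class name `CpuQuantitySketch` is hypothetical, and the parsing and output formatting are assumptions that need not match Strimzi's `Quantities.parseCpuAsMilliCpus` or the method under review.

```java
import java.math.BigDecimal;

final class CpuQuantitySketch {

    /** Parses a CPU quantity such as "1", "1.5" or "1500m" into milliCPU. */
    static int parseCpuAsMilliCpus(String cpu) {
        if (cpu.endsWith("m")) {
            // "1500m" is already expressed in milliCPU
            return Integer.parseInt(cpu.substring(0, cpu.length() - 1));
        }
        // "1.5" cores -> 1500 milliCPU; throws if more than 3 decimals are given
        return new BigDecimal(cpu).movePointRight(3).intValueExact();
    }

    /** Formats milliCPU as a core count, e.g. 1500 -> "1.5" and 1000 -> "1". */
    static String milliCpuToCpu(int milliCpu) {
        return BigDecimal.valueOf(milliCpu, 3).stripTrailingZeros().toPlainString();
    }
}
```

Using `BigDecimal` rather than floating-point division keeps values such as `1.555` exact in both directions.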

ppatierno · 2022-06-06T08:28:01Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/Capacity.java

+    public static CpuCapacity processCpu(io.strimzi.api.kafka.model.balancing.BrokerCapacity bc, BrokerCapacityOverride override, String cpuBasedOnRequirements) {
+        if (cpuBasedOnRequirements != null) {
+            if ((override != null && override.getCpuCores() != null) || (bc != null && bc.getCpuCores() != null)) {
+                LOGGER.warnOp("Ignoring CPU capacity override settings since they are set automatically set to resource limits");


too many "set" ?
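The precedence implied by the hunk above can be sketched as follows. Only the first branch (resource-derived CPU wins and a warning is logged) is visible in the diff. The fallback order of override, then `brokerCapacity`, then a default of `1` is an assumption. The class name `CpuPrecedenceSketch` and the plain `String` parameters are hypothetical simplifications of the real types.

```java
final class CpuPrecedenceSketch {

    // Default taken from the DEFAULT_CPU_CORE_CAPACITY constant discussed in this PR
    static final String DEFAULT_CPU_CORE_CAPACITY = "1";

    /**
     * Picks the CPU capacity for a broker.
     * A value derived from the Kafka resource requirements takes priority;
     * the order of the remaining fallbacks is assumed, not shown in the diff.
     */
    static String resolveCpu(String cpuBasedOnRequirements, String overrideCpu, String brokerCapacityCpu) {
        if (cpuBasedOnRequirements != null) {
            if (overrideCpu != null || brokerCapacityCpu != null) {
                // Stand-in for LOGGER.warnOp(...) in the operator code
                System.err.println("Ignoring CPU capacity override settings since they are set automatically to resource limits");
            }
            return cpuBasedOnRequirements;
        }
        if (overrideCpu != null) {
            return overrideCpu;
        }
        if (brokerCapacityCpu != null) {
            return brokerCapacityCpu;
        }
        return DEFAULT_CPU_CORE_CAPACITY;
    }
}
```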

ppatierno · 2022-06-06T08:34:15Z

documentation/api/io.strimzi.api.kafka.model.CruiseControlSpec.adoc

@@ -140,19 +140,17 @@ You specify capacity limits for Kafka broker resources in the `brokerCapacity` p
 They are enabled by default and you can change their default values.
 Capacity limits can be set for the following broker resources:

+* `cpuCores`        - CPU resource in milliCPU or CPU cores (Default: 1)


"millicores"

mimaison · 2022-06-07T09:03:49Z

cluster-operator/src/test/java/io/strimzi/operator/cluster/model/CruiseControlTest.java

-        p2.setId(1);
-        volumes.add(p2);
+        Map<String, Quantity> requests = new HashMap<>(1);
+        requests.put(Capacity.RESOURCE_TYPE, new Quantity("400m"));


We could use Collections.singletonMap() here and a few places below

I think that in new code we tend to use Map.of(...), if the map can be immutable.
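For comparison, the three ways of building a single-entry map mentioned in this thread look like this. This is a sketch using `String` values instead of the `Quantity` type from the test, and the class name `SingleEntryMapSketch` is hypothetical. `Map.of` requires Java 9 or later.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SingleEntryMapSketch {

    /** Mutable map, as in the original test code. */
    static Map<String, String> withHashMap() {
        Map<String, String> requests = new HashMap<>(1);
        requests.put("cpu", "400m");
        return requests;
    }

    /** Immutable single-entry map, available since Java 1.3. */
    static Map<String, String> withSingletonMap() {
        return Collections.singletonMap("cpu", "400m");
    }

    /** Immutable map, available since Java 9; rejects null keys and values. */
    static Map<String, String> withMapOf() {
        return Map.of("cpu", "400m");
    }
}
```

All three maps compare equal with `equals`. Only the `HashMap` variant accepts later `put` calls, so the immutable forms fit only when the test never modifies the map.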

ShubhamRwt · 2022-06-07T12:02:36Z

LGTM. The test seems to be failing due to some uncommitted files

scholzj · 2022-06-07T17:32:59Z

/azp run regression

azure-pipelines · 2022-06-07T17:33:10Z

Azure Pipelines successfully started running 1 pipeline(s).

scholzj

LGTM. Should be reviewed by SMEs.

ppatierno

I left some comments about examples that need fixing: they still use cpuCores instead of cpu. The same goes for the description of this PR, which should be updated to use cpu.
After the above changes, I will approve it.

ppatierno · 2022-06-07T19:20:59Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/Capacity.java

 *         outboundNetwork: 40000KB/s
 *       - brokers: [1, 2]
+ *         cpuCores: 4000m


above example should have cpu instead of cpuCores

ppatierno · 2022-06-07T19:22:12Z

documentation/api/io.strimzi.api.kafka.model.CruiseControlSpec.adoc

        inboundNetwork: 20000KiB/s
        outboundNetwork: 20000KiB/s
      - brokers: [1, 2]
+        cpuCores: 3000m


above cpuCores should be cpu

PaulRMellor

Looks great! 👍

PaulRMellor · 2022-06-08T14:34:02Z

documentation/api/io.strimzi.api.kafka.model.CruiseControlSpec.adoc

 * `inboundNetwork`  - Inbound network throughput in byte units per second (Default: 10000KiB/s)
 * `outboundNetwork` - Outbound network throughput in byte units per second (Default: 10000KiB/s)

 For network throughput, use an integer value with standard Kubernetes byte units (K, M, G) or their bibyte (power of two) equivalents (Ki, Mi, Gi) per second.

 NOTE: Disk and CPU capacity limits are automatically generated by Strimzi, so you do not need to set them.
-
-[NOTE]
-====
 In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.


Suggested change

- In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
+ In order to guarantee accurate rebalance proposals when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.

PaulRMellor · 2022-06-08T14:42:04Z

api/src/main/java/io/strimzi/api/kafka/model/balancing/BrokerCapacityOverride.java

+    @Pattern("^[0-9]+([.][0-9]{0,3}|[m]?)$")
+    @Description("Broker capacity for CPU resource in cores or millicores. " +
+            "For example, 1, 1.500, 1500m. " +
+            "For more details on valid CPU resource units see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu")


Suggested change

-            "For more details on valid CPU resource units see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu")
+            "For more information on valid CPU resource units, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu.")
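To see which values the `@Pattern` expression in the hunk above accepts, the regex can be exercised with `java.util.regex`. This is only a sketch: the class name `CpuPatternSketch` is hypothetical, and the CRD validation itself is performed by the Kubernetes API server rather than by Java, although for a pattern this simple the results are expected to agree.

```java
import java.util.regex.Pattern;

final class CpuPatternSketch {

    // Same expression as in the @Pattern annotation under review
    static final Pattern CPU = Pattern.compile("^[0-9]+([.][0-9]{0,3}|[m]?)$");

    /** Returns true if the whole value matches the CPU capacity pattern. */
    static boolean isValid(String value) {
        return CPU.matcher(value).matches();
    }
}
```

The documented examples `1`, `1.500` and `1500m` match, while `1.5555` (four decimals) and `1.5m` (decimal combined with the `m` suffix) do not. A trailing dot such as `1.` also matches, because `{0,3}` allows zero digits after the dot.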

tomncooper · 2022-06-09T15:03:00Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/Capacity.java

+
+    private static Integer getResourceRequirement(ResourceRequirements resources, ResourceRequirementType requirementType) {
+        if (resources != null) {
+            Map<String, Quantity> resourceRequirement = requirementType == ResourceRequirementType.REQUEST ? resources.getRequests() : resources.getLimits();


If you want to be really fancy, you can add a method to the enum to do this. Then you could just do Map<String, Quantity> resourceRequirement = requirementType.getResourceMap(resources).

tomncooper · 2022-06-09T15:04:28Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/Capacity.java

+        if (resources != null) {
+            Map<String, Quantity> resourceRequirement = requirementType == ResourceRequirementType.REQUEST ? resources.getRequests() : resources.getLimits();
+            if (resourceRequirement != null) {
+                Quantity quantity = resourceRequirement.get(RESOURCE_TYPE);


In fact, you could put this logic (with the step above) in the enum too and have Quantity quantity = requirementType.getQuantity(resources).

@tomncooper Can you double check to make sure my refactoring makes sense?

I do want to be fancy
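The enum refactoring suggested in this thread could look roughly like the sketch below. It is not the merged code: `Resources` is a hypothetical stand-in for the Kubernetes `ResourceRequirements` model, quantities are plain strings instead of `Quantity` objects, and the method names are modelled on the reviewer's suggestion.

```java
import java.util.Map;
import java.util.function.Function;

final class ResourceRequirementSketch {

    static final String RESOURCE_TYPE = "cpu";

    /** Hypothetical stand-in for the Kubernetes ResourceRequirements model class. */
    static final class Resources {
        private final Map<String, String> requests;
        private final Map<String, String> limits;

        Resources(Map<String, String> requests, Map<String, String> limits) {
            this.requests = requests;
            this.limits = limits;
        }

        Map<String, String> getRequests() { return requests; }
        Map<String, String> getLimits() { return limits; }
    }

    /** Each constant knows which map (requests or limits) to read. */
    enum ResourceRequirementType {
        REQUEST(Resources::getRequests),
        LIMIT(Resources::getLimits);

        private final Function<Resources, Map<String, String>> accessor;

        ResourceRequirementType(Function<Resources, Map<String, String>> accessor) {
            this.accessor = accessor;
        }

        /** Returns the requests or limits map, or null if no resources are set. */
        Map<String, String> getResourceMap(Resources resources) {
            return resources == null ? null : accessor.apply(resources);
        }

        /** Returns the CPU quantity for this requirement type, or null if unset. */
        String getQuantity(Resources resources) {
            Map<String, String> map = getResourceMap(resources);
            return map == null ? null : map.get(RESOURCE_TYPE);
        }
    }
}
```

Moving the lookup into the enum removes the ternary on `requirementType` at the call site and keeps both null checks in one place.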

tomncooper

LGTM

scholzj · 2022-06-09T19:13:31Z

/azp run regression

azure-pipelines · 2022-06-09T19:13:42Z

Azure Pipelines successfully started running 1 pipeline(s).

scholzj · 2022-06-10T17:57:12Z

@kyguy The regression error seems to be CC related:

[ERROR] Errors: 
[ERROR] io.strimzi.systemtest.operators.MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs(ExtensionContext)
[ERROR]   Run 1: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait
[ERROR]   Run 2: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait
[ERROR]   Run 3: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait

So maybe this is related to your changes?

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

scholzj · 2022-06-14T14:48:04Z

/azp run regression

azure-pipelines · 2022-06-14T14:48:26Z

Azure Pipelines successfully started running 1 pipeline(s).

kyguy force-pushed the cc-cpu-capacity-override branch 2 times, most recently from e77dfdc to f6e5eb6 Compare June 3, 2022 22:00

kyguy added this to the 0.30.0 milestone Jun 6, 2022

kyguy marked this pull request as ready for review June 6, 2022 04:33

kyguy requested review from ShubhamRwt, im-konge, ppatierno, mimaison, tomncooper and scholzj June 6, 2022 04:33

scholzj reviewed Jun 6, 2022


ppatierno reviewed Jun 6, 2022


im-konge approved these changes Jun 7, 2022


mimaison reviewed Jun 7, 2022


kyguy requested a review from PaulRMellor June 7, 2022 15:24

scholzj approved these changes Jun 7, 2022


ppatierno requested changes Jun 7, 2022


ppatierno approved these changes Jun 8, 2022


kyguy requested a review from mimaison June 8, 2022 13:08

PaulRMellor approved these changes Jun 8, 2022


ShubhamRwt approved these changes Jun 9, 2022


tomncooper reviewed Jun 9, 2022


tomncooper approved these changes Jun 9, 2022


mimaison approved these changes Jun 9, 2022


kyguy added 11 commits June 14, 2022 08:28

Add CPU overrides for CC capacity config

c223a16

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Addressing comments - JS, PP

fa955f7

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Generate heml charts

f206a67

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Fix CRD property validation + doc example

0f04e04

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Refactoring + change property cpuCores -> cpu

b16c1eb

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Update ST

abe5847

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Fixing examples

81e37fb

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Addressing comments - PM

ffe12a5

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Generate helm charts

94e9440

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Addressing feedback - TC

d1ce2a3

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

Update default CC config doc

00a9527

Signed-off-by: Kyle Liberti <kliberti@redhat.com>

kyguy force-pushed the cc-cpu-capacity-override branch from 787d628 to 00a9527 Compare June 14, 2022 12:31

scholzj merged commit 2627f85 into strimzi:main Jun 14, 2022

kyguy deleted the cc-cpu-capacity-override branch June 15, 2022 21:29

kyguy mentioned this pull request Jul 8, 2022

CruiseControl CPU Utilization Computation #5951

Closed

kyguy mentioned this pull request May 30, 2023

Improve CPU estimation in the Cruise Control capacity configuration #8576

Closed
