Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CPU overrides for CC capacity config #6892

Merged
merged 11 commits into from Jun 14, 2022

Conversation

kyguy
Copy link
Member

@kyguy kyguy commented Jun 2, 2022

Type of change

  • Enhancement / new feature

Description

For accurate rebalances between brokers running on nodes with heterogeneous CPU resources, Cruise Control must know the CPU capacity limit of individual brokers. This PR allows users to specify capacity limit overrides for lists of individual Kafka brokers in the overrides property in Kafka.spec.cruiseControl.brokerCapacity. This PR also bumps the Cruise Control version to pick up an enhancement [1] which allows us to specify milliCPU and fractional core CPU capacity values.

This PR addresses the CPU capacity issues of #6265 and the UI issues of #5951

apiVersion: {KafkaApiVersion}
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ...
  cruiseControl:
    # ...
    brokerCapacity:
      cpu: "4"
      overrides:
      - brokers: [0]
        cpu: 1000m
      - brokers: [1, 2]
        cpu: "1.555"

[1] linkedin/cruise-control#1831

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@kyguy kyguy force-pushed the cc-cpu-capacity-override branch 2 times, most recently from e77dfdc to f6e5eb6 Compare June 3, 2022 22:00
@kyguy kyguy added this to the 0.30.0 milestone Jun 6, 2022
@kyguy kyguy marked this pull request as ready for review June 6, 2022 04:33
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. Mostly about the naming.

But I also wonder a bit about the use-case for this. If the broker resources are set, we should always set the automatically (and your docs changes suggest we do).

So, what is the use-case? When you run on dedicated hosts and don't set the CPU request is the only situation when I might see it useful. So it seems a bit niche. It might be good to explain it in the docs when would you use this.

@EqualsAndHashCode
public class BrokerCapacity implements UnknownPropertyPreserving, Serializable {

private static final long serialVersionUID = 1L;

private String disk;
private Integer cpuUtilization;
private String cpuCores;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubernetes is using just cpu. Not cpuCores. What is the reason for using different name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to draw a distinction between how we were setting the cpu resource before with cpuUtilization (which is now deprecated).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still different from cpuUtilization. And if I understand it right, you use the same format as the cpu field in Kubernetes.

@ppatierno @tombentley WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can stick with naming and terminology that users already know from builtin Kubernetes features, it's much better imho. My +1 for using cpu.

Comment on lines 79 to 80
@Description("Broker capacity for CPU resource in cores or milliCPU. " +
"For example, 1, 1.500, 1500m.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Description("Broker capacity for CPU resource in cores or milliCPU. " +
"For example, 1, 1.500, 1500m.")
@Description("Broker capacity for CPU resource in cores or millicores. " +
"For example, 1, 1.500, 1500m.")

Should we link here to Kubernetes docs for the details about the units?

@EqualsAndHashCode
public class BrokerCapacityOverride implements UnknownPropertyPreserving, Serializable {
private static final long serialVersionUID = 1L;

private List<Integer> brokers;
private String cpuCores;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -9,20 +9,19 @@ public class BrokerCapacity {
// CC designates the id of this default broker entry as "-1".
public static final int DEFAULT_BROKER_ID = -1;
public static final String DEFAULT_BROKER_DOC = "This is the default capacity. Capacity unit used for disk is in MiB, cpu is in percentage, network throughput is in KiB.";

public static final String DEFAULT_CPU_UTILIZATION_CAPACITY = "100"; // as a percentage (0-100)
public static final String DEFAULT_CPU_CORE_CAPACITY = "1"; // as a percentage (0-100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, is this 1 CPU core? Or is it 100% of available cores? The name suggests to me it is the first one. But the comment suggests the other. It would be great to have it more clear from the name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is just a left over of copying the previous line. It should be just 1 core. The comment is wrong.

Comment on lines 149 to 152
NOTE: Disk capacity limits are automatically generated by Strimzi, so you do not need to set them.

[NOTE]
====
In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
NOTE: CPU capacity limits are automatically generated by Strimzi when you set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
That way, all CPU resources are reserved upfront and are always available.
This configuration allows Cruise Control to properly evaluate the CPU utilization when preparing the rebalance proposals based on CPU goals.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we somehow join both notes into one? I think it looks a bit weird to have the two notes right after each other.

@scholzj
Copy link
Member

scholzj commented Jun 6, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -9,20 +9,19 @@ public class BrokerCapacity {
// CC designates the id of this default broker entry as "-1".
public static final int DEFAULT_BROKER_ID = -1;
public static final String DEFAULT_BROKER_DOC = "This is the default capacity. Capacity unit used for disk is in MiB, cpu is in percentage, network throughput is in KiB.";

public static final String DEFAULT_CPU_UTILIZATION_CAPACITY = "100"; // as a percentage (0-100)
public static final String DEFAULT_CPU_CORE_CAPACITY = "1"; // as a percentage (0-100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is just a left over of copying the previous line. It should be just 1 core. The comment is wrong.

this.cores = milliCputoCpu(Quantities.parseCpuAsMilliCpus(cores));
}

public static String milliCputoCpu(int milliCPU) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit ... milliCpuToCpu with capital T

public static CpuCapacity processCpu(io.strimzi.api.kafka.model.balancing.BrokerCapacity bc, BrokerCapacityOverride override, String cpuBasedOnRequirements) {
if (cpuBasedOnRequirements != null) {
if ((override != null && override.getCpuCores() != null) || (bc != null && bc.getCpuCores() != null)) {
LOGGER.warnOp("Ignoring CPU capacity override settings since they are set automatically set to resource limits");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many "set" ?

@@ -140,19 +140,17 @@ You specify capacity limits for Kafka broker resources in the `brokerCapacity` p
They are enabled by default and you can change their default values.
Capacity limits can be set for the following broker resources:

* `cpuCores` - CPU resource in milliCPU or CPU cores (Default: 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"millicores"

p2.setId(1);
volumes.add(p2);
Map<String, Quantity> requests = new HashMap<>(1);
requests.put(Capacity.RESOURCE_TYPE, new Quantity("400m"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use Collections.singletonMap() here and a few places below

Copy link
Member

@scholzj scholzj Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in a new code we tend to use Map.of(...). -if it can be immutable.

@ShubhamRwt
Copy link
Contributor

LGTM. The test seems to be failing due to some uncommitted files

@kyguy kyguy requested a review from PaulRMellor June 7, 2022 15:24
@scholzj
Copy link
Member

scholzj commented Jun 7, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Should be reviewed by SMEs.

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments about examples to fix, they are still using cpuCores instead of cpu. The same for the description of this PR, it would be better fixing with cpu.
After the above changes I will approve it.

* outboundNetwork: 40000KB/s
* - brokers: [1, 2]
* cpuCores: 4000m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above example should have cpu instead of cpuCores

inboundNetwork: 20000KiB/s
outboundNetwork: 20000KiB/s
- brokers: [1, 2]
cpuCores: 3000m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above cpuCores should be cpu

@kyguy kyguy requested a review from mimaison June 8, 2022 13:08
Copy link
Contributor

@PaulRMellor PaulRMellor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 👍

* `inboundNetwork` - Inbound network throughput in byte units per second (Default: 10000KiB/s)
* `outboundNetwork` - Outbound network throughput in byte units per second (Default: 10000KiB/s)

For network throughput, use an integer value with standard Kubernetes byte units (K, M, G) or their bibyte (power of two) equivalents (Ki, Mi, Gi) per second.

NOTE: Disk and CPU capacity limits are automatically generated by Strimzi, so you do not need to set them.

[NOTE]
====
In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to guarantee accurate rebalance proposal when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.
In order to guarantee accurate rebalance proposals when using CPU goals, you can set CPU requests equal to CPU limits in `Kafka.spec.kafka.resources`.

@Pattern("^[0-9]+([.][0-9]{0,3}|[m]?)$")
@Description("Broker capacity for CPU resource in cores or millicores. " +
"For example, 1, 1.500, 1500m. " +
"For more details on valid CPU resource units see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"For more details on valid CPU resource units see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu")
"For more information on valid CPU resource units, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu.")


private static Integer getResourceRequirement(ResourceRequirements resources, ResourceRequirementType requirementType) {
if (resources != null) {
Map<String, Quantity> resourceRequirement = requirementType == ResourceRequirementType.REQUEST ? resources.getRequests() : resources.getLimits();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wonna be really fancy you can add a method to the enum to do this then you could just do Map<String, Quantity> resourceRequirement = requirementType.getResouceMap(resources).

if (resources != null) {
Map<String, Quantity> resourceRequirement = requirementType == ResourceRequirementType.REQUEST ? resources.getRequests() : resources.getLimits();
if (resourceRequirement != null) {
Quantity quantity = resourceRequirement.get(RESOURCE_TYPE);
Copy link
Contributor

@tomncooper tomncooper Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact you put this logic (with the step above) in the enum too and have Quantity quantity = requirementType.getQuantity(resources)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomncooper Can you double check to make sure my refactoring makes sense?

I do want to be fancy

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scholzj
Copy link
Member

scholzj commented Jun 9, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Jun 10, 2022

@kyguy The regression error seems to be CC related:

[ERROR] Errors: 
[ERROR] io.strimzi.systemtest.operators.MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs(ExtensionContext)
[ERROR]   Run 1: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait
[ERROR]   Run 2: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait
[ERROR]   Run 3: MultipleClusterOperatorsIsolatedST.testKafkaCCAndRebalanceWithMultipleCOs:217 » Wait

So maybe this is related to your changes?

kyguy added 11 commits June 14, 2022 08:28
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
@scholzj
Copy link
Member

scholzj commented Jun 14, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj merged commit 2627f85 into strimzi:main Jun 14, 2022
@kyguy kyguy deleted the cc-cpu-capacity-override branch June 15, 2022 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants