fix: bump Cruise Control to 2.5.123 for cgroup v2 compatibility #2359

Merged
bert-e merged 1 commit into development/2.14 from bugfix/ZENKO-5227/fix-cc-crashloop-cgroupv2
Mar 20, 2026

delthas (Contributor) commented Mar 19, 2026

Summary

The Cruise Control pod (end2end-base-queue-cruisecontrol) enters CrashLoopBackOff on hosts with cgroup v2 (e.g. kernel 6.19+). This PR bumps the Cruise Control image from 2.5.101 to 2.5.123 to resolve it.

Fixes ZENKO-5227

Scope of impact

This only affects local development environments and the Zenko CI pipeline — anywhere the cluster runs on a host kernel that defaults to cgroup v2 (e.g. Arch Linux, Fedora, Ubuntu 22.04+, or any kernel 5.2+ with cgroup v2 enabled). ARTESCA production clusters are not affected: ARTESCA deploys on Rocky Linux 8.10 (kernel 4.18), which uses cgroup v1. The JDK's CgroupV2Subsystem code path is never reached on cgroup v1 hosts, so the bug cannot trigger there.
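To check which side of this split a given host falls on, a minimal sketch follows. It relies on the fact that the unified (v2) hierarchy exposes a `cgroup.controllers` file at its root, while a v1 layout mounts one directory per controller. The function name and the `root` parameter are illustrative only, not part of this change.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1", or "unknown" for the cgroup hierarchy mounted at root."""
    base = Path(root)
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # A v1 (or hybrid) layout mounts one directory per controller, e.g. memory/ or cpu/.
    if (base / "memory").is_dir() or (base / "cpu").is_dir():
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_version())
```

Run on the node (or inside the container, which sees the same hierarchy type), this should report v2 on the affected hosts listed above and v1 on a Rocky Linux 8 host with default settings.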

That said, this would become a blocker if ARTESCA ever moves to RHEL 9 / Rocky 9 (kernel 5.14+, cgroup v2 by default).

The error

The pod crashes on startup with a fatal JVM abort:

Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.lang.NullPointerException
    at jdk.internal.platform.cgroupv2.CgroupV2Subsystem.getInstance(Unknown Source)
    at jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
    ...
    at java.lang.management.ManagementFactory.getOperatingSystemMXBean(Unknown Source)
    at io.prometheus.jmx.shaded.io.prometheus.client.hotspot.StandardExports.<init>(StandardExports.java:43)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:30)

FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed

The JMX Prometheus Java agent (-javaagent) runs during JVM startup. It calls DefaultExports.initialize() → ManagementFactory.getOperatingSystemMXBean() → CgroupV2Subsystem.getInstance(), which throws a NullPointerException when parsing cgroup v2 filesystem entries. Since Java treats agent premain failures as fatal, the entire JVM aborts before Cruise Control can start.
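When triaging a CrashLoopBackOff, it may help to confirm that the pod log matches this exact signature rather than some other startup failure. The sketch below checks for the three marker strings taken from the trace above; the function name is hypothetical and the match is deliberately simple substring search.

```python
def is_cgroup_v2_agent_crash(log_text: str) -> bool:
    """True when the log shows both the cgroup v2 NPE and the fatal agent abort."""
    npe_in_cgroup_v2 = (
        "java.lang.NullPointerException" in log_text
        and "CgroupV2Subsystem.getInstance" in log_text
    )
    # The JVM prints this line when an agent's premain fails and it gives up.
    agent_abort = "processing of -javaagent failed" in log_text
    return npe_in_cgroup_v2 and agent_abort
```

Feeding it the output of the previous container's log (for example, saved to a file first) should return True for this bug and False for unrelated crashes.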

Root cause

The Cruise Control image 2.5.101 ships JDK 11.0.16.1 (Eclipse Temurin), which has a bug in its cgroup v2 support. The CgroupV2Subsystem.getInstance() method fails with an NPE on newer kernels (verified on 6.19.8-arch1-1 with cgroup2 mounts). This is reproducible even without the JMX agent — simply running java -XshowSettings:system in the container triggers the same crash.
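The agent-free reproduction can be scripted roughly as follows. This is a sketch under assumptions: it assumes `docker` (or a compatible runtime) is available on a cgroup v2 host and that `java` is on the image's PATH. The image reference in the example is a placeholder, since the registry path is not stated here.

```python
import shlex


def repro_command(image: str, runtime: str = "docker") -> list[str]:
    """Build the argv that runs `java -XshowSettings:system` in a throwaway container."""
    return [
        runtime, "run", "--rm",
        "--entrypoint", "java",  # bypass the image's own start script
        image,
        "-XshowSettings:system",
    ]


if __name__ == "__main__":
    # Placeholder image reference: substitute the registry path your cluster pulls from.
    print(shlex.join(repro_command("example.invalid/cruise-control:2.5.101")))
```

Running the printed command against the 2.5.101 image on an affected host should show the same NullPointerException, with no JMX agent involved.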

This is not a JMX exporter version issue. The bug is in the JDK itself.

The fix

Bump cruise-control from 2.5.101 to 2.5.123 in solution/deps.yaml. The 2.5.123 image ships JDK 17.0.7 (Eclipse Temurin), which handles cgroup v2 correctly. Verified by running java -XshowSettings:system in a 2.5.123 container — it reads cgroup v2 metrics without error.
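For illustration, the bump amounts to changing one tag value. The sketch below rewrites the `tag:` under a top-level component key in YAML text; the helper name is hypothetical, and the layout it expects (a top-level `cruise-control:` key with an indented `tag:` field) is an assumption, not copied from the real solution/deps.yaml.

```python
def bump_tag(deps_yaml: str, component: str, new_tag: str) -> str:
    """Return deps_yaml with the `tag:` under the top-level `component:` key replaced."""
    out, in_component = [], False
    for line in deps_yaml.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new top-level entry.
            in_component = line.rstrip() == f"{component}:"
        elif in_component and line.strip().startswith("tag:"):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}tag: {new_tag}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Assumed fragment, for demonstration only.
    sample = "cruise-control:\n  image: cruise-control\n  tag: 2.5.101\n"
    print(bump_tag(sample, "cruise-control", "2.5.123"), end="")
```

Only the named component's tag is touched; other entries pass through unchanged.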

The upstream changes between Cruise Control 2.5.101 and 2.5.123 (LinkedIn's cruise-control) are patch-level: CVE dependency bumps (snakeyaml, scala, Netty, org.json), bug fixes (leader CPU util, offline partitions, concurrency adjuster NPE), and non-breaking additions (partition movement metrics, per-broker concurrency adjuster). No config format changes, no removed APIs. The docker-cruise-control Dockerfile diff between the two tags is just a JDK 11→17 bump, a Node 16→20 bump (build-time only), and OCI labels.

Alternatives considered

  1. Add -XX:-UseContainerSupport to KAFKA_OPTS: This disables the JDK's container/cgroup detection entirely, sidestepping the crash. Confirmed working. Rejected because it also disables memory/CPU limit awareness — the JVM would ignore container resource constraints, which could cause OOM kills or CPU overuse.

  2. Upgrade only the JMX exporter (from 0.16.1 to 0.17.1+): Initially suspected as the fix, but the bug is in the JDK, not the exporter. Upgrading the exporter alone would not help since CgroupV2Subsystem.getInstance() is called by the JDK's ManagementFactory, not by the exporter directly.

  3. Pin to a patched JDK 11 build: JDK 11.0.19+ has cgroup v2 fixes. However, there is no cruise-control image built with a patched JDK 11 — Banzai moved to JDK 17 starting with 2.5.113. Building a custom image would add maintenance burden for no benefit over using the upstream 2.5.123.
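The version reasoning in the third alternative can be written down as a small check. This sketch encodes only what this PR states (JDK 11 builds before 11.0.19 lack the cgroup v2 fixes, JDK 17.0.7 works); it is not a complete list of affected JDK releases, and both function names are hypothetical.

```python
def parse_jdk_version(version: str) -> tuple[int, ...]:
    """Turn "11.0.16.1" into (11, 0, 16, 1); drops any "+build" or "-suffix" part."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def needs_cgroup_v2_fix(version: str) -> bool:
    """True for JDK 11 builds older than 11.0.19, the threshold this PR cites."""
    parsed = parse_jdk_version(version)
    return parsed[0] == 11 and parsed < (11, 0, 19)


if __name__ == "__main__":
    for v in ("11.0.16.1", "11.0.19", "17.0.7"):
        print(v, needs_cgroup_v2_fix(v))
```

With the versions mentioned in this PR, the JDK shipped in 2.5.101 is flagged and the one shipped in 2.5.123 is not.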

bert-e (Contributor) commented Mar 19, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

bert-e (Contributor) commented Mar 19, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

delthas requested review from a team, benzekrimaha and maeldonn March 19, 2026 14:05

francoisferrand (Contributor) left a comment

should we also switch to a "newer" (if any) adobe fork of this image?

delthas (Contributor, Author) commented Mar 20, 2026

> should we also switch to a "newer" (if any) adobe fork of this image?

Adobe has newer versions:

┌──────────────────────────┬────────────┬──────────────────────────────┐
│           Tag            │ CC Version │             Date             │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 3.0.3-adbe-20251008      │ 3.0.3      │ Oct 2025                     │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 3.0.3-adbe-20250804      │ 3.0.3      │ Aug 2025 (koperator default) │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20250818-rc │ 2.5.133    │ Aug 2025 (RC)                │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20240806-rc │ 2.5.133    │ Aug 2024 (RC)                │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20240313    │ 2.5.133    │ Mar 2024                     │
└──────────────────────────┴────────────┴──────────────────────────────┘

But the switch to 3.0.3 sounds more involved. I'd suggest keeping the small increment for this PR, enough to get rid of the crash loop in Zenko CI, and consider switching to the latest image when working on the move to the Adobe koperator fork.

delthas (Contributor, Author) commented Mar 20, 2026

/approve

bert-e (Contributor) commented Mar 20, 2026

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.14

The following branches will NOT be impacted:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

scality deleted a comment from bert-e Mar 20, 2026

bert-e (Contributor) commented Mar 20, 2026

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/2.14

The following branches have NOT changed:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue ZENKO-5227.

Goodbye delthas.

bert-e merged commit b466cea into development/2.14 Mar 20, 2026
51 of 56 checks passed
bert-e deleted the bugfix/ZENKO-5227/fix-cc-crashloop-cgroupv2 branch March 20, 2026 12:27