
fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug #2728

Open
delthas wants to merge 5 commits into development/9.4 from bugfix/BB-760/bump-node-rdkafka

Conversation


@delthas delthas commented Mar 26, 2026

Summary

Bumps node-rdkafka from ^2.12.0 (librdkafka 2.3.0) to ^3.6.0 (currently resolving to 3.6.1 / librdkafka 2.12.0) as a maintenance upgrade, picking up bugfixes accumulated across librdkafka 2.3.0→2.12.0.

Also fixes a latent race condition in BackbeatConsumer._bootstrapConsumer() exposed by librdkafka 2.10.0+.

Context

This was initially motivated by investigating a flaky CI failure in the "Kafka Cleaner" test (Zenko CI run), where the backbeat-metrics topic had unconsumed messages after rapid pod replacement. While the original hypothesis (cooperative-sticky rebalance bug confluentinc/librdkafka#4908) turned out not to apply — BackbeatConsumer uses eager rebalance (range,roundrobin), not cooperative-sticky — the upgrade is still worthwhile for the accumulated bugfixes, and exposed a real bug in the bootstrap code that needed fixing.

Bootstrap consumer fix

The librdkafka upgrade exposed a latent race condition in _bootstrapConsumer(). The bootstrap previously used setInterval(200ms) to call consume(1, consumeCb), dispatching C++ async workers to the libuv thread pool (each with a 1000ms timeout). Up to 5 workers could be in flight concurrently.

When the bootstrap match was found, the old code called clearInterval then immediately unsubscribe(). But clearInterval only prevents new consume calls — workers already in the C++ thread pool continue running. With librdkafka < 2.10.0, unsubscribe() effectively invalidated these stale workers (they'd return empty or error). With librdkafka >= 2.10.0 ("Enhanced handling for subscribe/unsubscribe edge cases"), these workers survive across the unsubscribe→subscribe transition and dequeue messages from the next subscription, delivering them to the bootstrap's consumeCb which silently ignores non-bootstrap messages.
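The failure mode can be sketched in plain Node (illustrative only; `fakeConsume`, `runBootstrap`, and `onStray` are made-up names standing in for node-rdkafka's `consume(1, cb)` dispatch, not Backbeat's actual API):

```javascript
// Illustrative sketch, not Backbeat code: clearInterval stops new
// dispatches, but work already in flight still completes afterwards,
// just like the C++ async workers dispatched by consume(1, cb).
function fakeConsume(cb) {
    // completes 50ms later, like a worker with a pending timeout
    setTimeout(() => cb(null, ['message']), 50);
}

function runBootstrap(onStray) {
    let stopped = false;
    const timer = setInterval(() => fakeConsume((err, messages) => {
        if (stopped) {
            // this worker outlived clearInterval: with librdkafka
            // >= 2.10.0, such a worker can dequeue messages meant
            // for the next subscription
            onStray(messages);
        }
    }), 10);
    // stop after a few ticks, while several workers are in flight
    setTimeout(() => {
        clearInterval(timer);
        stopped = true;
    }, 35);
}
```

Running this shows several callbacks firing after the interval is cleared, which is exactly the window in which librdkafka >= 2.10.0 lets stale workers steal from the next subscription.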

This was confirmed by local reproduction and bisection:

| node-rdkafka | librdkafka | `_bootstrapConsumer` test |
|---|---|---|
| 3.3.1 | 2.8.0 | PASS |
| 3.4.0 | 2.10.0 | FAIL |
| 3.6.1 | 2.12.0 | FAIL |

The fix replaces setInterval with chained setTimeout: each consume(1) is only scheduled after the previous one completes, guaranteeing at most one C++ async worker is in flight at a time. This makes unsubscribe() safe to call directly from the callback with no drain/polling needed.
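The chained-setTimeout pattern can be sketched as follows (simplified; `fakeConsume`, `bootstrapLoop`, `isMatch`, and `onDone` are illustrative names, not Backbeat's actual API):

```javascript
// Sketch of the fix: each consume attempt is scheduled only after the
// previous one completes, so at most one async worker is ever in
// flight, and stopping the loop from inside the callback is race-free.
function fakeConsume(cb) {
    setTimeout(() => cb(null, ['message']), 10);
}

function bootstrapLoop(isMatch, onDone) {
    function consume() {
        fakeConsume((err, messages) => {
            if (!err && messages.some(isMatch)) {
                // safe to unsubscribe here: no other worker exists
                onDone();
                return;
            }
            // schedule the next attempt only now that this one is done
            setTimeout(consume, 200);
        });
    }
    consume();
}
```

Because the recursion happens inside the callback, the "at most one worker in flight" invariant holds by construction, with no counter or drain polling.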

Why ^3.6.0

| node-rdkafka | librdkafka |
|---|---|
| 2.18.0 (latest 2.x) | 2.3.0 |
| 3.0.0 | 2.3.0 |
| 3.4.0 | 2.10.0 |
| 3.6.1 | 2.12.0 |

We set the floor to ^3.6.0 (librdkafka 2.12.0). KIP-848 (new consumer group protocol) is opt-in (group.protocol=consumer must be explicitly set) and does not affect the default classic protocol.
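In `package.json` terms, the floor looks like this (a caret range, so any 3.x release >= 3.6.0 — currently 3.6.1 — is accepted):

```json
{
  "dependencies": {
    "node-rdkafka": "^3.6.0"
  }
}
```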

Upgrade safety

librdkafka 2.3.0 → 2.12.0: The librdkafka CHANGELOG shows no consumer-breaking changes in this range. The only notable breaking change is in v2.4.0 where INVALID_RECORD producer errors became non-retriable — this does not affect consumers. The metadata recovery behavior change in 2.10.0 (brokers not in metadata responses are removed and clients re-bootstrap) is a minor behavioral difference that should not impact normal operation.

node-rdkafka 2.x → 3.x: The 3.0.0 release only dropped support for EOL Node.js versions — no API changes. Backbeat requires Node >= 20 and runs on Node 22.14.0.

Issue: BB-760


bert-e commented Mar 26, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description |
|---|---|
| `/after_pull_request` | Wait for the given pull request id to be merged before continuing with the current one. |
| `/bypass_author_approval` | Bypass the pull request author's approval. |
| `/bypass_build_status` | Bypass the build and test status. |
| `/bypass_commit_size` | Bypass the check on the size of the changeset. (TBA) |
| `/bypass_incompatible_branch` | Bypass the check on the source branch prefix. |
| `/bypass_jira_check` | Bypass the Jira issue check. |
| `/bypass_peer_approval` | Bypass the pull request peers' approval. |
| `/bypass_leader_approval` | Bypass the pull request leaders' approval. |
| `/approve` | Instruct Bert-E that the author has approved the pull request. ✍️ |
| `/create_pull_requests` | Allow the creation of integration pull requests. |
| `/create_integration_branches` | Allow the creation of integration branches. |
| `/no_octopus` | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead. |
| `/unanimity` | Change review acceptance criteria from one reviewer at least to all reviewers. |
| `/wait` | Instruct Bert-E not to run until further notice. |

Available commands

| name | description |
|---|---|
| `/help` | Print Bert-E's manual in the pull request. |
| `/status` | Print Bert-E's current status in the pull request. (TBA) |
| `/clear` | Remove all comments from Bert-E from the history. (TBA) |
| `/retry` | Re-start a fresh build. (TBA) |
| `/build` | Re-start a fresh build. (TBA) |
| `/force_reset` | Delete integration branches & pull requests, and restart merge process from the beginning. |
| `/reset` | Try to remove integration branches unless there are commits on them which do not appear on the source branch. |

Status report is not available.


bert-e commented Mar 26, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers


claude bot commented Mar 26, 2026

  • ^3.5.0 semver range allows 3.6.0+ (librdkafka 2.12.0), contradicting the PR description's goal of avoiding 2.12.0 metadata recovery changes. Pin tighter or clarify intent.

    Review by Claude Code


codecov bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.43%. Comparing base (79e1ace) to head (0a24e53).
⚠️ Report is 2 commits behind head on development/9.4.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/BackbeatConsumer.js | 90.90% | 2 Missing ⚠️ |
Additional details and impacted files


| Files with missing lines | Coverage Δ |
|---|---|
| lib/queuePopulator/KafkaLogConsumer/LogConsumer.js | 90.17% <ø> (ø) |
| lib/BackbeatConsumer.js | 92.83% <90.90%> (-1.00%) ⬇️ |

... and 2 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 80.37% <ø> (ø) |
| Core Library | 80.51% <90.90%> (-0.04%) ⬇️ |
| Ingestion | 70.53% <ø> (-0.62%) ⬇️ |
| Lifecycle | 79.01% <ø> (ø) |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 59.61% <ø> (ø) |
| Bucket Scanner | 85.76% <ø> (ø) |
@@                 Coverage Diff                 @@
##           development/9.4    #2728      +/-   ##
===================================================
- Coverage            74.50%   74.43%   -0.07%     
===================================================
  Files                  200      200              
  Lines                13610    13614       +4     
===================================================
- Hits                 10140    10134       -6     
- Misses                3460     3470      +10     
  Partials                10       10              
| Flag | Coverage Δ |
|---|---|
| api:retry | 9.13% <0.00%> (-0.01%) ⬇️ |
| api:routes | 8.95% <0.00%> (-0.01%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 9.07% <0.00%> (+0.01%) ⬆️ |
| ingestion | 12.49% <0.00%> (-0.06%) ⬇️ |
| lib | 7.60% <90.90%> (-0.01%) ⬇️ |
| lifecycle | 18.74% <86.36%> (-0.11%) ⬇️ |
| notification | 1.02% <0.00%> (-0.01%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| replication | 18.48% <86.36%> (+<0.01%) ⬆️ |
| unit | 51.10% <0.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.



claude bot commented Mar 26, 2026

LGTM — the dependency bump is clean with no code changes required. The node-rdkafka 3.x release has no API breaking changes (only dropped EOL Node.js support), and the librdkafka upgrade path from 2.3.0 to 2.12.0 has no consumer-breaking changes. The root cause analysis in the PR body is thorough.

- PR body says ^3.5.0 but package.json has ^3.6.0 — update one to match the other

Review by Claude Code

@delthas delthas force-pushed the bugfix/BB-760/bump-node-rdkafka branch from f528083 to 1735ada Compare March 26, 2026 16:17

claude bot commented Mar 26, 2026

Excellent investigation and write-up. The dependency bump itself is clean — only package.json and yarn.lock touched, no API changes in node-rdkafka 3.x, and the librdkafka upgrade path is well-justified.

- PR title says ^3.5.0 but the code sets the floor to ^3.6.0 — title should be updated to match

Review by Claude Code

@delthas delthas changed the title fix(deps): bump node-rdkafka to ^3.5.0 to fix cooperative-sticky rebalance bug fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug Mar 26, 2026

claude bot commented Mar 30, 2026

LGTM

The fix is well-investigated and the code change is minimal and correct. The in-flight consume tracking with deferred unsubscribe is a clean solution to the race condition exposed by librdkafka 2.10.0+. The counter is properly incremented before each consume(1, consumeCb) call and decremented at the top of consumeCb (including on error paths), so it should always drain to 0 after clearInterval stops new consume calls.

Review by Claude Code

@delthas delthas requested a review from francoisferrand April 1, 2026 07:36

delthas commented Apr 7, 2026

On hold: it is not yet known whether this PR actually fixes the flaky CI failure.

@delthas delthas marked this pull request as draft April 7, 2026 08:35

bert-e commented Apr 7, 2026

Jira issue not found

The Jira issue BB-760 was not found.

@delthas delthas marked this pull request as ready for review April 7, 2026 12:57
delthas added 2 commits April 7, 2026 15:05
Upgrades node-rdkafka to ^3.6.0 (resolving to 3.6.1 / librdkafka 2.12.0)
to fix a cooperative-sticky partition assignor bug that causes partitions
to become orphaned during consumer group rebalances.

Issue: BB-760
…scribe

The bootstrap consumer uses setInterval(200ms) to call consume(1, cb),
creating multiple concurrent C++ async workers (each with a 1000ms
timeout). Since librdkafka 2.10.0, these workers survive an
unsubscribe→subscribe transition and can dequeue messages from the next
subscription, causing them to be lost to the normal consume pipeline.

Track in-flight workers with a counter and defer unsubscribe() until
all have completed.

Issue: BB-760
@delthas delthas force-pushed the bugfix/BB-760/bump-node-rdkafka branch from aa79ee9 to 11fe6ee Compare April 7, 2026 13:05
@delthas delthas changed the base branch from development/9.3 to development/9.4 April 7, 2026 13:05

claude bot commented Apr 7, 2026

LGTM

Review by Claude Code

@scality scality deleted a comment from bert-e Apr 7, 2026

bert-e commented Apr 7, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

Pin 'range,roundrobin' (the current librdkafka default) explicitly
so that future version updates cannot silently change the rebalance
strategy. Our consumers rely on eager rebalancing.

Issue: BB-760
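Such pinning might look like the following in a node-rdkafka consumer config (a sketch with illustrative values — the group id and broker address are hypothetical, and the actual Backbeat config plumbing differs):

```javascript
// Illustrative consumer config: pin the eager assignors explicitly so
// a future librdkafka default change (e.g. toward cooperative-sticky)
// cannot silently alter the rebalance strategy.
const consumerConfig = {
    'group.id': 'backbeat-example-group',       // hypothetical group id
    'metadata.broker.list': 'localhost:9092',   // hypothetical broker
    'partition.assignment.strategy': 'range,roundrobin',
};
```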

claude bot commented Apr 8, 2026

LGTM

The in-flight consume tracking in _finishBootstrap correctly handles the race between clearInterval and already-dispatched C++ workers. The partition.assignment.strategy pinning in both BackbeatConsumer and LogConsumer is a good defensive measure against future default changes.

Review by Claude Code

…trap

Address review feedback: instead of tracking in-flight consume workers
and polling for drain, eliminate concurrent workers entirely by using
chained setTimeout. Each consume(1) call is only scheduled after the
previous one completes, guaranteeing at most one C++ async worker is
in flight. This makes unsubscribe() safe to call immediately from
the callback.

Issue: BB-760

claude bot commented Apr 9, 2026

LGTM

Review by Claude Code

@delthas delthas requested review from a team, benzekrimaha and francoisferrand April 9, 2026 14:29
Comment on lines +888 to +893
if (err || bootstrapDone) {
    if (!bootstrapDone) {
        setTimeout(consumeNext, 200);
    }
    return undefined;
}

Suggested change
if (err || bootstrapDone) {
    if (!bootstrapDone) {
        setTimeout(consumeNext, 200);
    }
    return undefined;
}
if (bootstrapDone) {
    return undefined;
}
if (err) {
    setTimeout(consumeNext, 200);
    return undefined;
}

return undefined;
}
let matched = false;
messages.forEach(message => {

instead of forEach and a variable, best to use find():

  • so we don't even try to process messages beyond the bootstrap one
  • makes the code more readable
Suggested change
messages.forEach(message => {
const receivedBootstrap = messages.find(message => {
    const bootstrapId = JSON.parse(message.value).bootstrapId;
    self._log.info('bootstrapping backbeat consumer: received bootstrap message', {
        bootstrapId, topic: self._topic, groupId: self._groupId,
    });
    return bootstrapId === lastBootstrapId;
});
if (!receivedBootstrap) {
    return setTimeout(consumeNext, 200);
}
self._log.info('backbeat consumer is bootstrapped', { topic: self._topic, groupId: self._groupId });
bootstrapDone = true;
clearInterval(producerTimer);
self._consumer.offsetsStore([{
    topic: self._topic,
    partition: receivedBootstrap.partition,
    offset: receivedBootstrap.offset + 1,
}]);
self._consumer.commit();
self._consumer.unsubscribe();
producer.close(() => {
    self._bootstrapping = false;
    self._onReady();
});

self._log.info('backbeat consumer is bootstrapped',
{ topic: self._topic, groupId: self._groupId });
matched = true;
bootstrapDone = true;

is this variable still needed?

  • we update it when processing the "result" of consume()
  • since we are now guaranteed to have a single call to consume(), it is the current one
  • thus no guard needed

{ topic: self._topic, groupId: self._groupId });
matched = true;
bootstrapDone = true;
clearInterval(producerTimer);

producerTimer is a local variable, and does not seem to be initialized?

// since librdkafka 2.10.0 they survive an
// unsubscribe→subscribe transition, stealing messages from
// the next subscription.
function consumeNext() {

since it runs the timer repeatedly until bootstrap received, we can't really call this function consume next... so maybe just consume ?

- Rename consumeNext to consume
- Remove bootstrapDone flag (unnecessary with chained setTimeout
  since only one consume worker is ever in flight)
- Use find() instead of forEach + matched flag
- Flatten the match handling out of the forEach callback

Issue: BB-760

claude bot commented Apr 10, 2026

LGTM

Review by Claude Code
