
fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug #2728

Open
delthas wants to merge 5 commits into development/9.4 from bugfix/BB-760/bump-node-rdkafka

Conversation


@delthas delthas commented Mar 26, 2026

Summary

Bumps node-rdkafka from ^2.12.0 (librdkafka 2.3.0) to ^3.6.0 (currently resolving to 3.6.1 / librdkafka 2.12.0) as a maintenance upgrade, picking up bugfixes accumulated across librdkafka 2.3.0→2.12.0.

Also fixes a latent race condition in BackbeatConsumer._bootstrapConsumer() exposed by librdkafka 2.10.0+.

Context

This was initially motivated by investigating a flaky CI failure in the "Kafka Cleaner" test (Zenko CI run), where the backbeat-metrics topic had unconsumed messages after rapid pod replacement. While the original hypothesis (cooperative-sticky rebalance bug confluentinc/librdkafka#4908) turned out not to apply — BackbeatConsumer uses eager rebalance (range,roundrobin), not cooperative-sticky — the upgrade is still worthwhile for the accumulated bugfixes, and exposed a real bug in the bootstrap code that needed fixing.

Bootstrap consumer fix

The librdkafka upgrade exposed a latent race condition in _bootstrapConsumer(). The bootstrap previously used setInterval(200ms) to call consume(1, consumeCb), dispatching C++ async workers to the libuv thread pool (each with a 1000ms timeout). Up to 5 workers could be in flight concurrently.

When the bootstrap match was found, the old code called clearInterval then immediately unsubscribe(). But clearInterval only prevents new consume calls — workers already in the C++ thread pool continue running. With librdkafka < 2.10.0, unsubscribe() effectively invalidated these stale workers (they'd return empty or error). With librdkafka >= 2.10.0 ("Enhanced handling for subscribe/unsubscribe edge cases"), these workers survive across the unsubscribe→subscribe transition and dequeue messages from the next subscription, delivering them to the bootstrap's consumeCb which silently ignores non-bootstrap messages.
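The failure mode can be sketched in plain Node (illustrative only; `fakeConsume`, `runBootstrap`, and `onStray` are made-up names standing in for node-rdkafka's `consume(1, cb)` dispatch, not Backbeat's actual API):

```javascript
// Illustrative sketch, not Backbeat code: clearInterval stops new
// dispatches, but work already in flight still completes afterwards,
// just like the C++ async workers dispatched by consume(1, cb).
function fakeConsume(cb) {
    // completes 50ms later, like a worker with a pending timeout
    setTimeout(() => cb(null, ['message']), 50);
}

function runBootstrap(onStray) {
    let stopped = false;
    const timer = setInterval(() => fakeConsume((err, messages) => {
        if (stopped) {
            // this worker outlived clearInterval: with librdkafka
            // >= 2.10.0, such a worker can dequeue messages meant
            // for the next subscription
            onStray(messages);
        }
    }), 10);
    // stop after a few ticks, while several workers are in flight
    setTimeout(() => {
        clearInterval(timer);
        stopped = true;
    }, 35);
}
```

Running this shows several callbacks firing after the interval is cleared, which is exactly the window in which librdkafka >= 2.10.0 lets stale workers steal from the next subscription.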

This was confirmed by local reproduction and bisection:

| node-rdkafka | librdkafka | `_bootstrapConsumer` test |
|---|---|---|
| 3.3.1 | 2.8.0 | PASS |
| 3.4.0 | 2.10.0 | FAIL |
| 3.6.1 | 2.12.0 | FAIL |

The fix replaces setInterval with chained setTimeout: each consume(1) is only scheduled after the previous one completes, guaranteeing at most one C++ async worker is in flight at a time. This makes unsubscribe() safe to call directly from the callback with no drain/polling needed.
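The chained-setTimeout pattern can be sketched as follows (simplified; `fakeConsume`, `bootstrapLoop`, `isMatch`, and `onDone` are illustrative names, not Backbeat's actual API):

```javascript
// Sketch of the fix: each consume attempt is scheduled only after the
// previous one completes, so at most one async worker is ever in
// flight, and stopping the loop from inside the callback is race-free.
function fakeConsume(cb) {
    setTimeout(() => cb(null, ['message']), 10);
}

function bootstrapLoop(isMatch, onDone) {
    function consume() {
        fakeConsume((err, messages) => {
            if (!err && messages.some(isMatch)) {
                // safe to unsubscribe here: no other worker exists
                onDone();
                return;
            }
            // schedule the next attempt only now that this one is done
            setTimeout(consume, 200);
        });
    }
    consume();
}
```

Because the recursion happens inside the callback, the "at most one worker in flight" invariant holds by construction, with no counter or drain polling.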

Why ^3.6.0

| node-rdkafka | librdkafka |
|---|---|
| 2.18.0 (latest 2.x) | 2.3.0 |
| 3.0.0 | 2.3.0 |
| 3.4.0 | 2.10.0 |
| 3.6.1 | 2.12.0 |

We set the floor to ^3.6.0 (librdkafka 2.12.0). KIP-848 (new consumer group protocol) is opt-in (group.protocol=consumer must be explicitly set) and does not affect the default classic protocol.
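In `package.json` terms, the floor looks like this (a caret range, so any 3.x release >= 3.6.0 — currently 3.6.1 — is accepted):

```json
{
  "dependencies": {
    "node-rdkafka": "^3.6.0"
  }
}
```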

Upgrade safety

librdkafka 2.3.0 → 2.12.0: The librdkafka CHANGELOG shows no consumer-breaking changes in this range. The only notable breaking change is in v2.4.0 where INVALID_RECORD producer errors became non-retriable — this does not affect consumers. The metadata recovery behavior change in 2.10.0 (brokers not in metadata responses are removed and clients re-bootstrap) is a minor behavioral difference that should not impact normal operation.

node-rdkafka 2.x → 3.x: The 3.0.0 release only dropped support for EOL Node.js versions — no API changes. Backbeat requires Node >= 20 and runs on Node 22.14.0.

Issue: BB-760


bert-e commented Mar 26, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description |
|---|---|
| `/after_pull_request` | Wait for the given pull request id to be merged before continuing with the current one. |
| `/bypass_author_approval` | Bypass the pull request author's approval. |
| `/bypass_build_status` | Bypass the build and test status. |
| `/bypass_commit_size` | Bypass the check on the size of the changeset. (TBA) |
| `/bypass_incompatible_branch` | Bypass the check on the source branch prefix. |
| `/bypass_jira_check` | Bypass the Jira issue check. |
| `/bypass_peer_approval` | Bypass the pull request peers' approval. |
| `/bypass_leader_approval` | Bypass the pull request leaders' approval. |
| `/approve` | Instruct Bert-E that the author has approved the pull request. ✍️ |
| `/create_pull_requests` | Allow the creation of integration pull requests. |
| `/create_integration_branches` | Allow the creation of integration branches. |
| `/no_octopus` | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead. |
| `/unanimity` | Change review acceptance criteria from one reviewer at least to all reviewers. |
| `/wait` | Instruct Bert-E not to run until further notice. |

Available commands

| name | description |
|---|---|
| `/help` | Print Bert-E's manual in the pull request. |
| `/status` | Print Bert-E's current status in the pull request. (TBA) |
| `/clear` | Remove all comments from Bert-E from the history. (TBA) |
| `/retry` | Re-start a fresh build. (TBA) |
| `/build` | Re-start a fresh build. (TBA) |
| `/force_reset` | Delete integration branches & pull requests, and restart merge process from the beginning. |
| `/reset` | Try to remove integration branches unless there are commits on them which do not appear on the source branch. |

Status report is not available.


bert-e commented Mar 26, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers


claude bot commented Mar 26, 2026

  • ^3.5.0 semver range allows 3.6.0+ (librdkafka 2.12.0), contradicting the PR description's goal of avoiding 2.12.0 metadata recovery changes. Pin tighter or clarify intent.

    Review by Claude Code


codecov bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.43%. Comparing base (79e1ace) to head (0a24e53).
⚠️ Report is 2 commits behind head on development/9.4.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/BackbeatConsumer.js | 90.90% | 2 Missing ⚠️ |
Additional details and impacted files


| Files with missing lines | Coverage Δ |
|---|---|
| lib/queuePopulator/KafkaLogConsumer/LogConsumer.js | 90.17% <ø> (ø) |
| lib/BackbeatConsumer.js | 92.83% <90.90%> (-1.00%) ⬇️ |

... and 2 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 80.37% <ø> (ø) |
| Core Library | 80.51% <90.90%> (-0.04%) ⬇️ |
| Ingestion | 70.53% <ø> (-0.62%) ⬇️ |
| Lifecycle | 79.01% <ø> (ø) |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 59.61% <ø> (ø) |
| Bucket Scanner | 85.76% <ø> (ø) |
@@                 Coverage Diff                 @@
##           development/9.4    #2728      +/-   ##
===================================================
- Coverage            74.50%   74.43%   -0.07%     
===================================================
  Files                  200      200              
  Lines                13610    13614       +4     
===================================================
- Hits                 10140    10134       -6     
- Misses                3460     3470      +10     
  Partials                10       10              
| Flag | Coverage Δ |
|---|---|
| api:retry | 9.13% <0.00%> (-0.01%) ⬇️ |
| api:routes | 8.95% <0.00%> (-0.01%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 9.07% <0.00%> (+0.01%) ⬆️ |
| ingestion | 12.49% <0.00%> (-0.06%) ⬇️ |
| lib | 7.60% <90.90%> (-0.01%) ⬇️ |
| lifecycle | 18.74% <86.36%> (-0.11%) ⬇️ |
| notification | 1.02% <0.00%> (-0.01%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| replication | 18.48% <86.36%> (+<0.01%) ⬆️ |
| unit | 51.10% <0.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.



claude bot commented Mar 26, 2026

LGTM — the dependency bump is clean with no code changes required. The node-rdkafka 3.x release has no API breaking changes (only dropped EOL Node.js support), and the librdkafka upgrade path from 2.3.0 to 2.12.0 has no consumer-breaking changes. The root cause analysis in the PR body is thorough.

- PR body says ^3.5.0 but package.json has ^3.6.0 — update one to match the other

Review by Claude Code

@delthas delthas force-pushed the bugfix/BB-760/bump-node-rdkafka branch from f528083 to 1735ada Compare March 26, 2026 16:17

claude bot commented Mar 26, 2026

Excellent investigation and write-up. The dependency bump itself is clean — only package.json and yarn.lock touched, no API changes in node-rdkafka 3.x, and the librdkafka upgrade path is well-justified.

- PR title says ^3.5.0 but the code sets the floor to ^3.6.0 — title should be updated to match

Review by Claude Code

@delthas delthas changed the title fix(deps): bump node-rdkafka to ^3.5.0 to fix cooperative-sticky rebalance bug fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug Mar 26, 2026

claude bot commented Mar 30, 2026

LGTM

The fix is well-investigated and the code change is minimal and correct. The in-flight consume tracking with deferred unsubscribe is a clean solution to the race condition exposed by librdkafka 2.10.0+. The counter is properly incremented before each consume(1, consumeCb) call and decremented at the top of consumeCb (including on error paths), so it should always drain to 0 after clearInterval stops new consume calls.

Review by Claude Code

@delthas delthas requested a review from francoisferrand April 1, 2026 07:36

delthas commented Apr 7, 2026

On hold: it is not yet known whether this PR actually fixes the flaky CI failure.

@delthas delthas marked this pull request as draft April 7, 2026 08:35

bert-e commented Apr 7, 2026

Jira issue not found

The Jira issue BB-760 was not found.

@delthas delthas marked this pull request as ready for review April 7, 2026 12:57
delthas added 2 commits April 7, 2026 15:05
Upgrades node-rdkafka to ^3.6.0 (resolving to 3.6.1 / librdkafka 2.12.0)
to fix a cooperative-sticky partition assignor bug that causes partitions
to become orphaned during consumer group rebalances.

Issue: BB-760
…scribe

The bootstrap consumer uses setInterval(200ms) to call consume(1, cb),
creating multiple concurrent C++ async workers (each with a 1000ms
timeout). Since librdkafka 2.10.0, these workers survive an
unsubscribe→subscribe transition and can dequeue messages from the next
subscription, causing them to be lost to the normal consume pipeline.

Track in-flight workers with a counter and defer unsubscribe() until
all have completed.

Issue: BB-760
@delthas delthas force-pushed the bugfix/BB-760/bump-node-rdkafka branch from aa79ee9 to 11fe6ee Compare April 7, 2026 13:05
@delthas delthas changed the base branch from development/9.3 to development/9.4 April 7, 2026 13:05

claude bot commented Apr 7, 2026

LGTM

Review by Claude Code

@scality scality deleted a comment from bert-e Apr 7, 2026

bert-e commented Apr 7, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

Pin 'range,roundrobin' (the current librdkafka default) explicitly
so that future version updates cannot silently change the rebalance
strategy. Our consumers rely on eager rebalancing.

Issue: BB-760
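Such pinning might look like the following in a node-rdkafka consumer config (a sketch with illustrative values — the group id and broker address are hypothetical, and the actual Backbeat config plumbing differs):

```javascript
// Illustrative consumer config: pin the eager assignors explicitly so
// a future librdkafka default change (e.g. toward cooperative-sticky)
// cannot silently alter the rebalance strategy.
const consumerConfig = {
    'group.id': 'backbeat-example-group',       // hypothetical group id
    'metadata.broker.list': 'localhost:9092',   // hypothetical broker
    'partition.assignment.strategy': 'range,roundrobin',
};
```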

claude bot commented Apr 8, 2026

LGTM

The in-flight consume tracking in _finishBootstrap correctly handles the race between clearInterval and already-dispatched C++ workers. The partition.assignment.strategy pinning in both BackbeatConsumer and LogConsumer is a good defensive measure against future default changes.

Review by Claude Code

…trap

Address review feedback: instead of tracking in-flight consume workers
and polling for drain, eliminate concurrent workers entirely by using
chained setTimeout. Each consume(1) call is only scheduled after the
previous one completes, guaranteeing at most one C++ async worker is
in flight. This makes unsubscribe() safe to call immediately from
the callback.

Issue: BB-760

claude bot commented Apr 9, 2026

LGTM

Review by Claude Code

@delthas delthas requested review from a team, benzekrimaha and francoisferrand April 9, 2026 14:29
Comment on lines +888 to +893
if (err || bootstrapDone) {
    if (!bootstrapDone) {
        setTimeout(consumeNext, 200);
    }
    return undefined;
}

Suggested change
if (err || bootstrapDone) {
    if (!bootstrapDone) {
        setTimeout(consumeNext, 200);
    }
    return undefined;
}
if (bootstrapDone) {
    return undefined;
}
if (err) {
    setTimeout(consumeNext, 200);
    return undefined;
}

return undefined;
}
let matched = false;
messages.forEach(message => {

instead of forEach and a variable, best to use find():

  • so we don't even try to process messages beyond the bootstrap one
  • makes the code more readable
Suggested change
messages.forEach(message => {
const receivedBootstrap = messages.find(message => {
    const bootstrapId = JSON.parse(message.value).bootstrapId;
    self._log.info('bootstrapping backbeat consumer: received bootstrap message', {
        bootstrapId, topic: self._topic, groupId: self._groupId,
    });
    return bootstrapId === lastBootstrapId;
});
if (!receivedBootstrap) {
    return setTimeout(consumeNext, 200);
}
self._log.info('backbeat consumer is bootstrapped', { topic: self._topic, groupId: self._groupId });
bootstrapDone = true;
clearInterval(producerTimer);
self._consumer.offsetsStore([{
    topic: self._topic,
    partition: receivedBootstrap.partition,
    offset: receivedBootstrap.offset + 1,
}]);
self._consumer.commit();
self._consumer.unsubscribe();
producer.close(() => {
    self._bootstrapping = false;
    self._onReady();
});

self._log.info('backbeat consumer is bootstrapped',
{ topic: self._topic, groupId: self._groupId });
matched = true;
bootstrapDone = true;

is this variable still needed?

  • we update it when processing the "result" of consume()
  • since we are now guaranteed to have a single call to consume(), it is the current one
  • thus no guard needed

{ topic: self._topic, groupId: self._groupId });
matched = true;
bootstrapDone = true;
clearInterval(producerTimer);

producerTimer is a local variable, and does not seem to be initialized?

// since librdkafka 2.10.0 they survive an
// unsubscribe→subscribe transition, stealing messages from
// the next subscription.
function consumeNext() {

since it runs the timer repeatedly until bootstrap received, we can't really call this function consume next... so maybe just consume ?

- Rename consumeNext to consume
- Remove bootstrapDone flag (unnecessary with chained setTimeout
  since only one consume worker is ever in flight)
- Use find() instead of forEach + matched flag
- Flatten the match handling out of the forEach callback

Issue: BB-760

claude bot commented Apr 10, 2026

LGTM

Review by Claude Code
