
Conversation

@mkuratczyk (Contributor) commented Nov 27, 2025

Why

We use a Khepri projection to compute a graph for bindings that have a topic exchange as their source. This allows more efficient queries during routing. This graph is not stored in Khepri, only in the projection ETS table.
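
The projection mechanism itself is generic: a projection pairs an ETS table with a function that Khepri calls whenever a matching tree node changes. Below is a minimal, hedged sketch of how such an ETS-backed projection could be registered, assuming `khepri_projection:new/2` accepts the 4-arity "extended" projection fun; the module name, the `StoreId`/`PathPattern` arguments and `update_trie/4` are illustrative stand-ins, not the actual code from `rabbit_db_topic_exchange`.

```erlang
-module(topic_trie_projection_sketch).
-export([register_projection/2]).

%% Hedged sketch: register an ETS-backed Khepri projection. StoreId and
%% PathPattern are supplied by the caller; update_trie/4 below is a
%% stand-in for the real trie maintenance code.
register_projection(StoreId, PathPattern) ->
    %% 4-arity "extended" projection fun: Khepri hands it the
    %% projection's ETS table and the old/new node properties, and the
    %% fun performs the ETS inserts and deletes itself.
    ProjectionFun = fun update_trie/4,
    Projection = khepri_projection:new(rabbit_khepri_topic_trie_v2,
                                       ProjectionFun),
    khepri:register_projection(StoreId, PathPattern, Projection).

%% Stand-in trie maintenance: the real code walks the binding key and
%% inserts or deletes one trie edge per word.
update_trie(Table, Path, _OldProps, NewProps) ->
    ets:insert(Table, {Path, NewProps}).
```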

When a binding is deleted, we need to clean up the graph. However, the pattern used to match the trie edges to delete was incorrect, leading to "orphaned" trie edges. The accumulation of these leftovers caused a memory leak.

How

The pattern was fixed to correctly match the appropriate trie edges.
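
The class of bug is straightforward to illustrate: the projection removes trie edges with an ETS match pattern, and a pattern whose shape does not mirror the rows that were inserted silently matches nothing, so the edges stay behind as orphans. The sketch below uses approximate record shapes, not the exact ones from RabbitMQ's headers.

```erlang
%% Approximate records for illustration only; the real trie-edge rows
%% in RabbitMQ are shaped differently.
-record(trie_edge, {exchange_name, node_id, word}).
-record(topic_trie_edge, {trie_edge, node_id}).

delete_trie_edge(Table, XName, FromNode, Word) ->
    %% The fix boils down to making the match pattern mirror the rows
    %% that were inserted: the nested record must be spelled out, and
    %% fields that should match anything must be '_' wildcards. A
    %% pattern whose fields do not line up matches nothing, so the
    %% edge is never deleted and leaks.
    Pattern = #topic_trie_edge{trie_edge = #trie_edge{exchange_name = XName,
                                                      node_id = FromNode,
                                                      word = Word},
                               node_id = '_'},
    true = ets:match_delete(Table, Pattern).
```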

However, this fix alone is effective only for new deployments of RabbitMQ, where the projection function is registered for the first time. We also need to handle updating already registered projections in existing clusters.

To achieve that, we first renamed the projection from rabbit_khepri_topic_trie to rabbit_khepri_topic_trie_v2 to distinguish the broken projection from the fixed one. Updated RabbitMQ nodes in an existing cluster will use this new projection, while out-of-date nodes will continue to use the old one. Because both projections continue to exist, the cluster is still affected by the memory leak at this point.

Then, each node verifies on startup whether all other cluster members support the new projection. If they do, it unregisters the old projection. Therefore, once all nodes in a cluster are up-to-date and use the new projection, the old one goes away and the leaked memory is reclaimed.
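
A hedged sketch of what such a startup check can look like follows; the helper names (`rabbit_nodes:list_members/0`, `has_new_topic_trie_projection/0`) and the exact `khepri:unregister_projections/2` call are assumptions about the shape of the real code rather than quotes from this PR.

```erlang
%% Hedged sketch of the startup check described above.
maybe_drop_old_topic_trie_projection(StoreId) ->
    Members = rabbit_nodes:list_members(),  %% assumption: cluster member list
    {Replies, BadNodes} =
        rpc:multicall(Members, ?MODULE, has_new_topic_trie_projection, []),
    AllSupportV2 = BadNodes =:= [] andalso
                   lists:all(fun(Reply) -> Reply =:= true end, Replies),
    case AllSupportV2 of
        true ->
            %% Every member registers rabbit_khepri_topic_trie_v2, so the
            %% old projection (and its leaked ETS rows) can be dropped.
            khepri:unregister_projections(StoreId,
                                          [rabbit_khepri_topic_trie]);
        false ->
            %% Some node still relies on the old projection; keep it for now.
            ok
    end.

%% Exported on every node so peers can probe for support of the v2 projection.
has_new_topic_trie_projection() ->
    true.
```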

This startup check could have been made simpler with a feature flag. We went with a custom check instead, in case a user tries to upgrade from, for instance, a 4.1.x release that has the fix to a 4.2.x release that does not. A feature flag would have prevented that upgrade path.

Fixes #15024.

@jdennison-iforium

Thanks very much for this.

I note the 'backport-v4.2.x' label. Will this also be back-ported for v4.1.x?

@dumbbell (Collaborator)

This will only fix the problem for new deployments, not existing ones. The reason is that the projection function is stored in Khepri the first time it is registered. This is mandatory to ensure the behaviour is consistent across a cluster.

Unfortunately, the current code in rabbit_khepri doesn’t try to manage updates to this function.

@michaelklishin (Collaborator)

@dumbbell @mkuratczyk so, should we merge this as a first step and then see what can be done to work around the "persistent" nature of projections in Khepri?

@dumbbell (Collaborator)

I think we should have a way forward for upgrades first.

@dumbbell dumbbell force-pushed the fix-topic-binding-deletion branch 3 times, most recently from 5b60eeb to 42cddec Compare December 3, 2025 14:35
@dumbbell dumbbell marked this pull request as draft December 3, 2025 14:35
@dumbbell dumbbell force-pushed the fix-topic-binding-deletion branch from 42cddec to cc3d23f Compare December 4, 2025 08:54
@dumbbell dumbbell changed the title Fix topic binding deletion leak (Khepri-only) rabbit_khepri: Fix topic binding deletion leak Dec 4, 2025
@dumbbell dumbbell marked this pull request as ready for review December 4, 2025 09:44
@mkuratczyk mkuratczyk merged commit 76dcd92 into main Dec 4, 2025
576 of 577 checks passed
@mkuratczyk mkuratczyk deleted the fix-topic-binding-deletion branch December 4, 2025 14:15
mergify bot pushed a commit that referenced this pull request Dec 4, 2025
(cherry picked from commit 76dcd92)
michaelklishin added a commit that referenced this pull request Dec 4, 2025
rabbit_khepri: Fix topic binding deletion leak (backport #15025)

Successfully merging this pull request may close these issues:

Khepri projection memory footprint keeps growing with high binding churn