rabbit_khepri: Fix topic binding deletion leak #15025
Conversation
Thanks very much for this. I note the 'backport-v4.2.x' label. Will this also be back-ported for v4.1.x?
This will only fix the problem for new deployments, not existing ones. The reason is that the projection function is stored in Khepri the first time it is registered. This is mandatory to ensure the behaviour is consistent across a cluster. The current code in
@dumbbell @mkuratczyk so, should we merge this as a first step and then see what can be done to work around the "persistent" nature of projections in Khepri?
I think we should have a way forward for upgrades first. |
Force-pushed from 5b60eeb to 42cddec.
Force-pushed from 42cddec to cc3d23f (cherry picked from commit 76dcd92).
rabbit_khepri: Fix topic binding deletion leak (backport #15025)
Why
We use a Khepri projection to compute a graph for bindings that have a topic exchange as their source. This allows more efficient queries during routing. This graph is not stored in Khepri, only in the projection ETS table.
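For context, the sketch below shows the general shape of such a projection registration. It is illustrative only and assumes Khepri's "simple" projection API (`khepri_projection:new/2` taking the ETS table name and a `fun(Path, Payload)` that returns the object to store, and `khepri:register_projection/3`); the path pattern and the `binding_to_trie_edge/1` helper are placeholders, not the actual RabbitMQ code.

```erlang
%% Illustrative sketch only -- not the actual RabbitMQ code.
-module(topic_trie_projection_sketch).
-include_lib("khepri/include/khepri.hrl").
-export([register_topic_trie_projection/1]).

register_topic_trie_projection(StoreId) ->
    %% Placeholder path pattern; the real binding path in RabbitMQ differs.
    PathPattern = [rabbitmq, bindings, ?KHEPRI_WILDCARD_STAR],
    %% The projection fun turns one stored binding into an ETS object that
    %% is part of the routing graph.
    ProjectionFun = fun(_Path, Binding) ->
                            binding_to_trie_edge(Binding)
                    end,
    Projection = khepri_projection:new(rabbit_khepri_topic_trie_v2,
                                       ProjectionFun),
    %% The fun above is sent to the store with this command and persisted
    %% there, so later code changes to it do not reach a store where a
    %% projection was already registered under the same name.
    khepri:register_projection(StoreId, PathPattern, Projection).

%% Placeholder helper: the real code derives trie-edge records from the
%% binding's routing key, one edge per word.
binding_to_trie_edge(Binding) ->
    {topic_trie_edge, Binding}.
```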
When a binding is deleted, we need to clean up the graph. However, the pattern used to match the trie edges to delete was incorrect, leading to "orphaned" trie edges. The accumulation of these leftovers caused a memory leak.
How
The pattern was fixed to correctly match the appropriate trie edges.
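To illustrate the failure mode (this is not the actual diff from this commit): the trie edges live in the projection's ETS table as records along the lines of `#topic_trie_edge{}` from `rabbit.hrl` (simplified here), and a delete whose match pattern does not cover all the rows it should simply leaves the non-matching rows behind.

```erlang
%% Illustration only -- simplified records in the spirit of rabbit.hrl,
%% not the exact pattern changed by this commit.
-record(trie_edge, {exchange_name, node_id, word}).
-record(topic_trie_edge, {trie_edge, node_id}).

%% Deleting the trie edges associated with a removed binding. Every field
%% that may vary across the rows to remove must be '_' in the match
%% pattern; a field that is accidentally pinned, or left at the record
%% default `undefined`, makes ets:match_delete/2 silently skip those rows,
%% and the leftover "orphaned" edges accumulate -- the leak described above.
delete_edges(Table, XName, Word) ->
    Pattern = #topic_trie_edge{trie_edge = #trie_edge{exchange_name = XName,
                                                      node_id       = '_',
                                                      word          = Word},
                               node_id   = '_'},
    true = ets:match_delete(Table, Pattern).
```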
However, this fix alone is effective for new deployments of RabbitMQ only, when the projection function is registered for the first time. We also need to handle the update of already registered projections in existing clusters.
To achieve that, first, we renamed the projection from `rabbit_khepri_topic_trie` to `rabbit_khepri_topic_trie_v2` to distinguish the bad one from the good one. Any updated RabbitMQ nodes in an existing cluster will use this new projection. Other existing out-of-date nodes will continue to use the old projection. Because both projections continue to exist, the cluster will still be affected by the memory leak.
Then, each node will verify on startup if all other cluster members support the new projection. If that is the case, they will unregister the old projection. Therefore, once all nodes in a cluster are up-to-date and use the new projection, the old one will go away and the leaked memory will be reclaimed.
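A rough sketch of what such a startup check can look like (assumed names and calls, not the exact code from this PR): `supports_new_topic_trie/0` is a hypothetical probe exported by the updated code, and the cleanup assumes Khepri's `khepri:unregister_projections/2`, which takes a list of projection names to remove.

```erlang
%% Sketch of the upgrade path described above.
maybe_unregister_old_projection(StoreId, Members) ->
    %% Ask every cluster member whether it runs a version that registers
    %% the new `rabbit_khepri_topic_trie_v2` projection.
    Replies = erpc:multicall(Members, rabbit_khepri,
                             supports_new_topic_trie, []),
    AllSupport = lists:all(fun({ok, true}) -> true;
                              (_Other)     -> false
                           end, Replies),
    case AllSupport of
        true ->
            %% Everyone uses the new projection: the old, leaking one can go.
            khepri:unregister_projections(StoreId,
                                          [rabbit_khepri_topic_trie]);
        false ->
            %% Mixed-version cluster: keep both projections for now.
            ok
    end.
```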
This startup check could have been made simpler with a feature flag. We decided to go with a custom check in case a user tries to upgrade, for instance, from a 4.1.x release that has the fix to a 4.2.x release that does not. A feature flag would have prevented that upgrade path.
Fixes #15024.