Upgrade to 4.3.0 can fail when the tie_binding_to_dest_with_keep_while_cond feature flag is enabled
#16587
Community Support Policy
RabbitMQ version used: other (please specify)
Erlang version used: 27.3.x
Operating system (distribution) used: Ubuntu 24.04.4 LTS
How is RabbitMQ deployed? Kubernetes Operator(s) from Team RabbitMQ
rabbitmq-diagnostics status output
Logs from node 1 (with sensitive values edited out)
Logs from node 2 (if applicable, with sensitive values edited out)
Logs from node 3 (if applicable, with sensitive values edited out)
rabbitmq.conf
RabbitMQ is deployed using the RabbitMQ Kubernetes Cluster Operator on Kubernetes. No custom Khepri, feature flag, exchange, queue, or binding configuration has been applied beyond the configuration shown above.
Steps to deploy RabbitMQ cluster
Deploy RabbitMQ using the RabbitMQ Kubernetes Cluster Operator.
Steps to reproduce the behavior in question
Steps to reproduce the behavior
Additional observations
What problem are you trying to solve?
After upgrading RabbitMQ from 4.2.6 to 4.3.0, I want to complete the upgrade by enabling all required feature flags and removing the warning shown in the Management UI. The cluster is healthy and fully operational, but the feature flag tie_binding_to_dest_with_keep_while_cond cannot be enabled because it consistently fails with a function_clause exception. I am trying to understand whether this is a known issue with the upgrade path from 4.2.x to 4.3.x, whether there is a supported workaround, or whether the feature flag migration is encountering an unexpected metadata state that requires remediation.
Replies: 5 comments 2 replies
It would be extremely helpful if you detailed the exact upgrade steps you took (https://www.rabbitmq.com/docs/upgrade). For instance, did you ensure that all feature flags were enabled prior to the upgrade (https://www.rabbitmq.com/docs/feature-flags#how-to-enable-feature-flags)?
A proposed fix: #16590.
I can’t reproduce this with the instructions you shared. If you can reproduce it, could you please give more details, such as exactly which exchanges, queues, and bindings you create? I need to understand why you end up with an exchange tree node in the metadata store without an exchange stored underneath.
@dumbbell I have some additional details.
Khepri uses […]
Three code paths can leave a node at […]
Topic Permissions
The exchange field of […] The Definition import and […]
Binding Creation
Binding creation seems to have a genuine race with the deletion of the source exchange. Before […]
Exchange Serial Bump
The […]
Possible Reproduction Example
Consider a cluster on […] Operator wanted user app to publish on any topic exchange in dev.cyferd and ran:
rabbitmqctl set_topic_permissions -p dev.cyferd app ".*" ".*" ".*"
The exchange-name argument (the first […] The record lands at […]
Definitions import that carries the same topic permissions can have the same effect and is […]
Things work well until […]
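To make the topic-permission path concrete, here is a minimal Python sketch of a tree-shaped store in which writing a child record implicitly creates its parent nodes. It is a toy model, not Khepri's API: the `Node` class, the `put` and `get` helpers, and the path components are all hypothetical, since the real paths are not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Node:
    """A tree node that may or may not carry a payload."""
    payload: Optional[Any] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def put(root: Node, path: Tuple[str, ...], payload: Any) -> None:
    """Store payload at path, creating missing parents without a payload."""
    node = root
    for part in path:
        node = node.children.setdefault(part, Node())
    node.payload = payload


def get(root: Node, path: Tuple[str, ...]) -> Optional[Node]:
    """Return the node at path, or None if it does not exist."""
    node = root
    for part in path:
        node = node.children.get(part)
        if node is None:
            return None
    return node


root = Node()

# Hypothetical layout: the topic permission is stored *under* the exchange
# node, and the exchange name is the literal string ".*" from the command.
put(
    root,
    ("vhosts", "dev.cyferd", "exchanges", ".*", "user_permissions", "app"),
    {"write": ".*", "read": ".*"},
)

# The exchange node now exists in the tree but holds no exchange record.
exchange_node = get(root, ("vhosts", "dev.cyferd", "exchanges", ".*"))
```

In this model, anything that later walks the exchange nodes and assumes each one carries an exchange record would hit a node whose payload is `None`.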
The simplest explanation I can think of:
Filtering out data-less nodes should be enough.
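As a rough illustration of what filtering out data-less nodes could look like, the Python sketch below drops every listed node that has no payload before a migration step touches it. The `nodes_with_data` helper, the path tuples, and the listing shape are assumptions made for this example; they are not taken from RabbitMQ or Khepri.

```python
from typing import Any, Dict, Optional, Tuple

Path = Tuple[str, ...]


def nodes_with_data(listing: Dict[Path, Optional[Any]]) -> Dict[Path, Any]:
    """Keep only the nodes that actually carry a payload."""
    return {
        path: payload
        for path, payload in listing.items()
        if payload is not None
    }


# Hypothetical listing: one real exchange, and one node that exists only
# because something was stored underneath it.
listing = {
    ("exchanges", "orders"): {"type": "topic"},
    ("exchanges", ".*"): None,
}

to_migrate = nodes_with_data(listing)
```

The migration would then iterate over `to_migrate` only, so a pattern match that expects an exchange record never sees the empty node.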
Topic permissions: indeed, nothing enforces that the exchange exists before a topic permission is set on it. I checked the Mnesia-based code and it behaves the same way, so we have always allowed setting a topic permission on a non-existent exchange.
Binding creation: this is a known issue; we should put a FIXME there until we fix it.
Exchange serial bump: as with topic permissions, the Mnesia-based code didn’t check that the exchange exists first. I wonder if we should change the put to ensure the exchange exists. Unfortunately, the next_serial/1 function always returns a serial: it has no room for error handling.
With this in mind, I think your proposed patch is fine. I will review it in #1…
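The serial-bump trade-off mentioned above can be pictured with a small Python sketch. It contrasts a counter that always returns a serial with a guarded variant that needs a way to signal a missing exchange. Both functions and their data structures are hypothetical stand-ins; the real next_serial/1 is Erlang and its implementation is not shown in this thread.

```python
from typing import Dict, Optional, Set


def next_serial(serials: Dict[str, int], exchange: str) -> int:
    """Always returns a serial, creating the counter if it is missing."""
    serial = serials.get(exchange, 0) + 1
    serials[exchange] = serial
    return serial


def next_serial_if_exists(
    exchanges: Set[str], serials: Dict[str, int], exchange: str
) -> Optional[int]:
    """Guarded variant: refuses to create a counter for an unknown exchange."""
    if exchange not in exchanges:
        return None  # callers must now handle this extra case
    return next_serial(serials, exchange)


exchanges = {"orders"}
serials: Dict[str, int] = {}

unguarded = next_serial(serials, "deleted-exchange")
guarded = next_serial_if_exists(exchanges, serials, "another-deleted-exchange")
```

The unguarded call leaves a counter behind for an exchange that does not exist, while the guarded one changes the return type, which is the "no room for error handling" problem in miniature.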