3.12.8 Shovel plugin crashes after upgrade with existing shovels #9894
Comments
Originally introduced in 5f0981c5a3b.
The change in ID format comes from the migration to Khepri, in particular the need to avoid storing function references in the schema data store. Now we have to support both formats for a period of time.
I upgraded from 3.12.7 and I'm getting the same error.
During a rolling upgrade, all cluster nodes collectively may (and usually will, due to Shovel migration during node restarts) contain `mirrored_supervisor` children with IDs that use two different formats (see referenced commits below). The old format should not trip up node startup, so new nodes must accept it in a few places and try to use these older values during dynamic Shovel spec cleanup. References ccc22cb, 5f0981c, #9785. See #9894.
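The compatibility approach in this commit message (accept both ID formats, and fall back to the older one during dynamic Shovel spec cleanup) can be modeled with a short Python sketch. It is not the Erlang code from #9909: the issue does not show the actual child ID terms, so the two shapes used here, a plain `(vhost, name)` pair for the old format and a tagged `("v2", vhost, name)` tuple for the new one, are hypothetical, as are the helper names `parse_child_id` and `delete_dynamic_shovel`.

```python
from typing import Dict, Tuple

ShovelName = Tuple[str, str]  # (vhost, shovel name)


def current_id(vhost: str, name: str) -> tuple:
    # Hypothetical new format, written by upgraded nodes.
    return ("v2", vhost, name)


def legacy_id(vhost: str, name: str) -> tuple:
    # Hypothetical old format, still present while older nodes are in the cluster.
    return (vhost, name)


def parse_child_id(child_id: tuple) -> ShovelName:
    """Accept both formats instead of failing on the one the new code does not expect."""
    if len(child_id) == 3 and child_id[0] == "v2":
        return (child_id[1], child_id[2])
    if len(child_id) == 2:
        return (child_id[0], child_id[1])
    raise ValueError(f"unrecognized child ID: {child_id!r}")


def delete_dynamic_shovel(children: Dict[tuple, str], vhost: str, name: str) -> bool:
    """Cleanup: try the current ID first, then fall back to the legacy one."""
    for child_id in (current_id(vhost, name), legacy_id(vhost, name)):
        if child_id in children:
            del children[child_id]
            return True
    return False


# Example: one spec written by an old node, one by an upgraded node.
children = {legacy_id("/", "s1"): "spec-a", current_id("/", "s2"): "spec-b"}
print(sorted(parse_child_id(cid) for cid in children))  # [('/', 's1'), ('/', 's2')]
print(delete_dynamic_shovel(children, "/", "s1"), delete_dynamic_shovel(children, "/", "s2"))  # True True
print(children)  # {}
```

Trying the new format first and the old one second means cleanup still finds specs written by nodes on the previous version, without the new code having to guess which node wrote them.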
This cannot be reproduced with a single node, only in a mixed-version cluster. @gomoripeti, can you please give #9909 a try?
Thanks for the quick fix! I will test the upgrade with the patch.
The same change was cherry-picked from commit 2ce0307 (conflicts: `deps/rabbitmq_shovel/src/rabbit_shovel_dyn_worker_sup_sup.erl`).
[Why] An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8 was shared in issue #9894 to demonstrate that the change of child ID format broke rolling upgrades when there are existing dynamic shovels.

[How] The testcase uses 4 nodes:

* one reference node
* one node to host the source and target queues
* one "old" node
* one "new" node

The reference node uses the new version, to observe which child ID format it produces. The node hosting the queues uses the old version, though this may not be relevant to this test. The testcase uses the old node to create the dynamic shovel, then uses the new node to simulate an upgrade by clustering it with the old node and stopping the old one.
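The scenario this testcase exercises can be modeled as a toy Python simulation. The replicated child specs are represented as one dictionary shared by the cluster, and the child ID shapes and function names are hypothetical (the issue does not show the real Erlang terms); the point is only that the new node has to interpret an ID written by the old node once the old node stops and its shovel migrates.

```python
from typing import Callable, Dict, Tuple

ShovelName = Tuple[str, str]  # (vhost, shovel name)
Parser = Callable[[tuple], ShovelName]


def legacy_id(vhost: str, name: str) -> tuple:
    # Hypothetical old child ID format, written by the node on the older version.
    return (vhost, name)


def strict_parse(child_id: tuple) -> ShovelName:
    """Upgraded code that only understands the hypothetical new ("v2", vhost, name) format."""
    if len(child_id) == 3 and child_id[0] == "v2":
        return (child_id[1], child_id[2])
    raise ValueError(f"unexpected child ID format: {child_id!r}")


def tolerant_parse(child_id: tuple) -> ShovelName:
    """Upgraded code that also accepts the old (vhost, name) format."""
    if len(child_id) == 2:
        return (child_id[0], child_id[1])
    return strict_parse(child_id)


def simulate_rolling_upgrade(new_node_parse: Parser) -> Dict[tuple, str]:
    # Replicated child specs that survive node restarts: child ID -> owning node.
    children: Dict[tuple, str] = {}

    # 1. The "old" node creates a dynamic shovel, storing an old-format ID.
    children[legacy_id("/", "my-shovel")] = "old"

    # 2. The "new" node joins, so the cluster now runs mixed versions.
    # 3. The old node is stopped; its shovel migrates to the new node, which
    #    must interpret the stored child ID in order to restart it.
    for child_id, owner in list(children.items()):
        if owner == "old":
            new_node_parse(child_id)  # raises ValueError if the format is not understood
            children[child_id] = "new"
    return children
```

In this model, `simulate_rolling_upgrade(strict_parse)` raises `ValueError`, while `simulate_rolling_upgrade(tolerant_parse)` hands the shovel over to the new node. The failing path only runs when a node has to take over children written by a node on the older version, which is consistent with the crash requiring a mixed-version cluster.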
Describe the bug
Commit ccc22cb changed the ID format of the children in the mirrored supervisor `rabbit_shovel_dyn_worker_sup`. However, the child spec of a mirrored supervisor is stored in Mnesia and survives a rolling restart. During an upgrade with existing dynamic shovels, the crash below was observed on the first node to be upgraded, because the new code encountered the old ID format. For the record, on another node:
I upgraded from 3.11.24, but I think the problem occurs when upgrading from any version prior to 3.12.8.
EDIT: I believe it only happens on multi-node clusters.
Reproduction steps
Expected behavior
Existing shovels should still work after upgrading to 3.12.8, possibly via a database migration that converts the child IDs.
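The migration suggested here could look roughly like the Python sketch below, which re-keys every old-format child ID into the new format while keeping its child spec. The ID shapes are hypothetical (a `(vhost, name)` pair for the old format, a `("v2", vhost, name)` tuple for the new one), since the issue does not show the real Erlang terms, and `migrate_child_ids` is an illustrative name, not a RabbitMQ function.

```python
from typing import Dict


def is_legacy(child_id: tuple) -> bool:
    # Hypothetical old format: a plain (vhost, name) pair.
    return len(child_id) == 2


def to_current(child_id: tuple) -> tuple:
    # Hypothetical new format: ("v2", vhost, name).
    vhost, name = child_id
    return ("v2", vhost, name)


def migrate_child_ids(children: Dict[tuple, object]) -> int:
    """Rewrite old-format child IDs in place and return how many were converted."""
    legacy = [child_id for child_id in children if is_legacy(child_id)]
    for child_id in legacy:
        # Keep the child spec, re-key it under the new-format ID.
        children[to_current(child_id)] = children.pop(child_id)
    return len(legacy)


children = {("/", "a"): "spec-a", ("v2", "/", "b"): "spec-b"}
print(migrate_child_ids(children))  # 1
print(migrate_child_ids(children))  # 0 (already migrated)
```

The function is idempotent, so running it twice is harmless. In a rolling upgrade, however, nodes still on the older version would then encounter IDs in the new format, which may be why the fix for this issue accepts both formats for a period of time instead.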