Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.12.8 Shovel plugin crashes after upgrade with existing shovels #9894

Closed
gomoripeti opened this issue Nov 7, 2023 · 5 comments
Closed

3.12.8 Shovel plugin crashes after upgrade with existing shovels #9894

gomoripeti opened this issue Nov 7, 2023 · 5 comments
Labels
Milestone

Comments

@gomoripeti
Copy link
Contributor

gomoripeti commented Nov 7, 2023

Describe the bug

Commit ccc22cb changed the id format of the children in the mirrored supervisor rabbit_shovel_dyn_worker_sup. However the child spec of a mirrored supervisor is stored in Mnesia and survives a rolling restart. During an upgrade with existing dynamic shovels the below crash was observed on the first node that is upgraded because of the new code hitting old id format.

BOOT FAILED
===========
Error during startup: {error,
                       {rabbitmq_shovel,
                        {{shutdown,
                          {failed_to_start_child,
                           rabbit_shovel_dyn_worker_sup_sup,
                           {'EXIT',
                            {function_clause,
                             [{rabbit_shovel_dyn_worker_sup_sup,id,
                               [<<"shovel1">>],
                               [{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
                                {line,100}]},
                              {rabbit_shovel_dyn_worker_sup_sup,
                               '-cleanup_specs/0-fun-2-',2,
                               [{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
                                {line,90}]},
                              {sets,fold_bucket,3,
                               [{file,"sets.erl"},{line,503}]},
                              {sets,fold_seg,4,[{file,"sets.erl"},{line,499}]},
                              {sets,fold_segs,4,
                               [{file,"sets.erl"},{line,495}]},
                              {rabbit_shovel_dyn_worker_sup_sup,
                               cleanup_specs,0,
                               [{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
                                {line,93}]},
                              {rabbit_shovel_dyn_worker_sup_sup,start_child,
                               2,
                               [{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
                                {line,42}]},
                              {rabbit_shovel_dyn_worker_sup_sup,
                               '-start_link/0-lc$^0/1-0-',1,
                               [{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
                                {line,28}]}]}}}},
                         {rabbit_shovel,start,[normal,[]]}}}}

For the record on another node:

1> mirrored_supervisor:which_children(rabbit_shovel_dyn_worker_sup_sup).
[{{<<"/">>,<<"shovel1">>},
  <49058.4184.0>,worker,
  [rabbit_shovel_dyn_worker_sup]}]

I upgraded from 3.11.24 but I think one can start from any version prior to 3.12.8.

EDIT: I believe it only happens on multi-node clusters.

Reproduction steps

  1. Create a multi-node cluster with version prior to 3.12.8 (eg 3.11.24)
  2. Create a dynamic shovel that is not auto-deleted (eg shovelling between 2 local queues)
  3. Upgrade first node to 3.12.8 and restart.
  4. Observer the shovel plugin crash on the first node which prevents boot

Expected behavior

Existing shovels should still work after upgrade to 3.12.8, possibly by executing a DB migration converting the child IDs.

Additional context

No response

@michaelklishin
Copy link
Member

Originally introduced in 5f0981c5a3b

@michaelklishin
Copy link
Member

The change in ID format comes from the migration to Khepri, in particular, the need to avoid storing function references in the schema data store.

Now we have to support both formats for a period of time.

@htve
Copy link

htve commented Nov 9, 2023

I upgraded from 3.12.7, I'm getting the same error.

@michaelklishin michaelklishin added this to the 3.12.9 milestone Nov 9, 2023
michaelklishin added a commit that referenced this issue Nov 13, 2023
During a rolling upgrade, all cluster nodes collectively
may (and usually will, due to Shovel migration during node restarts)
contain mirrored_supervisor children with IDs that use two different
parameters (see referenced commits below).

The old format should not trip up node startup, so new
nodes must accept it in a few places, and try to use
these older values during dynamic Shovel spec cleanup.

References ccc22cb, 5f0981c, #9785.

See #9894.
@michaelklishin
Copy link
Member

This cannot be reproduced with a single node, only in a mixed version cluster.

@gomoripeti can you please give #9909 a try?

@gomoripeti
Copy link
Contributor Author

thanks for the quick fix, I will test the upgrade with the patch

mergify bot pushed a commit that referenced this issue Nov 14, 2023
During a rolling upgrade, all cluster nodes collectively
may (and usually will, due to Shovel migration during node restarts)
contain mirrored_supervisor children with IDs that use two different
parameters (see referenced commits below).

The old format should not trip up node startup, so new
nodes must accept it in a few places, and try to use
these older values during dynamic Shovel spec cleanup.

References ccc22cb, 5f0981c, #9785.

See #9894.

(cherry picked from commit 2ce0307)

# Conflicts:
#	deps/rabbitmq_shovel/src/rabbit_shovel_dyn_worker_sup_sup.erl
gomoripeti pushed a commit to cloudamqp/rabbitmq-server that referenced this issue Dec 8, 2023
During a rolling upgrade, all cluster nodes collectively
may (and usually will, due to Shovel migration during node restarts)
contain mirrored_supervisor children with IDs that use two different
parameters (see referenced commits below).

The old format should not trip up node startup, so new
nodes must accept it in a few places, and try to use
these older values during dynamic Shovel spec cleanup.

References ccc22cb, 5f0981c, rabbitmq#9785.

See rabbitmq#9894.

(cherry picked from commit 2ce0307)

# Conflicts:
#	deps/rabbitmq_shovel/src/rabbit_shovel_dyn_worker_sup_sup.erl
gomoripeti pushed a commit to cloudamqp/rabbitmq-server that referenced this issue Dec 8, 2023
dumbbell added a commit that referenced this issue Feb 7, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
dumbbell added a commit that referenced this issue Feb 8, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
dumbbell added a commit that referenced this issue Feb 8, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
dumbbell added a commit that referenced this issue Feb 12, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
dumbbell added a commit that referenced this issue Feb 13, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
michaelklishin pushed a commit that referenced this issue Feb 29, 2024
[Why]
An upgrade scenario going from RabbitMQ 3.11.24 to the upcoming 3.12.8
was shared in issue #9894 to demonstrate that the change of child ID
format broke rolling upgrades when there are existing dynamic shovels.

[How]
The testcase uses 4 nodes:
* one reference node
* one node to host source and target queues
* one "old" node
* one "new" node

The reference node is using the new version to see what format it uses.

The node hosting queues is using the old version but it is not relevant
for this one?

The testcase uses the old node to create the dynamic shovel, then the
new node to simulate an upgrade by clustering it with the old node and
stopping the old one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants