Upgrading a cluster with quorum queues to OTP26 fails with checksum mismatch #8057

Closed
mkuratczyk opened this issue Apr 28, 2023 · 3 comments · Fixed by #8143

@mkuratczyk
Contributor

Describe the bug

NOTE: RabbitMQ does not support OTP26 yet. This issue should not affect any users.

When performing a rolling upgrade from OTP25 to OTP26, the first node running OTP26 to re-join the cluster will not be able to accept Ra snapshots:

** Reason for termination = error:{badmatch,1522447362}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{ra_log_snapshot,complete_accept,2,
                      [{file,"ra_log_snapshot.erl"},{line,85}]},
     {ra_snapshot,accept_chunk,4,[{file,"ra_snapshot.erl"},{line,281}]},
     {ra_server,handle_receive_snapshot,2,
                [{file,"ra_server.erl"},{line,1227}]},
     {ra_server_proc,handle_receive_snapshot,2,
                     [{file,"ra_server_proc.erl"},{line,1052}]},
     {ra_server_proc,receive_snapshot,3,
                     [{file,"ra_server_proc.erl"},{line,805}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1377}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
** Time-outs: {2,
               [{state_timeout,receive_snapshot_timeout},
                {{timeout,tick},tick_timeout}]}
** Client <18278.7917.0> is remote on node 'rabbit@qq-s1000-server-0.qq-s1000-nodes.qq'

  crasher:
    initial call: ra_server_proc:init/1
    pid: <0.632.0>
    registered_name: '%2F_fivers-8'
    exception error: no match of right hand side value 1522447362
      in function  ra_log_snapshot:complete_accept/2 (ra_log_snapshot.erl, line 85)
      in call from ra_snapshot:accept_chunk/4 (ra_snapshot.erl, line 281)
      in call from ra_server:handle_receive_snapshot/2 (ra_server.erl, line 1227)
      in call from ra_server_proc:handle_receive_snapshot/2 (ra_server_proc.erl, line 1052)
      in call from ra_server_proc:receive_snapshot/3 (ra_server_proc.erl, line 805)
      in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1377)
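
The `{badmatch,1522447362}` reason is what an Erlang match assertion raises when a freshly computed value does not equal one that is already bound. The following is a minimal sketch of that pattern, not Ra's actual code: the module and function names are hypothetical, and it only assumes that the receiver compares a stored CRC32 against one it recomputes.

```erlang
-module(crc_check_sketch).
-export([verify/2]).

%% Hypothetical helper, not part of Ra.
%% Expected is already bound when the match runs, so `=` acts as an
%% assertion: if the recomputed CRC differs, the call fails with
%% error:{badmatch, RecomputedCrc}.
verify(Expected, Bin) when is_integer(Expected), is_binary(Bin) ->
    Expected = erlang:crc32(Bin),
    ok.
```

In this sketch, a matching checksum returns `ok`, while a mismatch crashes the calling process with a `badmatch` error that carries the right-hand side value, which is the same shape as the crash report above.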

Full log file:
upgrade.log.gz

Reproduction steps

  1. Deploy RabbitMQ with OTP25 (either the main branch or 3.11; I will use main)
  2. Deploy some quorum queue workload (I'm using perf-test -x 10 -y 10 -r 5000 -c 500 -qp fivers-%d -qpf 1 -qpt 10 -qa x-max-length=1000000)
  3. Perform a rolling upgrade to OTP26

Expected behavior

Successful upgrade :)

Additional context

No response

@mkuratczyk
Contributor Author

mkuratczyk commented Apr 28, 2023

I've put together a script that reproduces this locally:

#!/bin/bash

# start a 3-node cluster with OTP25
source ~/.kerl/25.3.1/activate
bazel clean
bazel run start-cluster

# stop rabbit-0
rabbitmqctl -n rabbit-0 shutdown

# publish some messages
java -jar perf-test-dev.jar -H amqp://localhost:5673 -qq -u qq -c 500 -ms -z 30

# start rabbit-0 on OTP26
source ~/.kerl/26.0-rc3/activate
bazel run start-cluster NODES=1

@mkuratczyk
Contributor Author

mkuratczyk commented May 1, 2023

The problem is caused by a change in map key ordering in OTP26. Ra snapshot metadata is a map that is serialized with term_to_binary, and the resulting binary is part of the data the checksum is calculated over. Because OTP26 orders the map's keys differently, the key-value pairs are written in a different order, which produces a different checksum.

OTP 25.3.1

1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{index => 611960,term => 1,
  cluster =>
      [{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-1@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
  machine_version => 3}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,119,5,105,110,100,101,120,98,0,9,86,120,
  119,4,116,101,114,109,97,1,119,7,99,...>>
3> erlang:crc32(MetaBin).
2066562623

OTP 26.0-rc3

1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{cluster =>
      [{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-1@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
  index => 611960,machine_version => 3,term => 1}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,100,0,7,99,108,117,115,116,101,114,108,
  0,0,0,3,104,2,100,0,6,37,50,70,...>>
3> erlang:crc32(MetaBin).
3828560182
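
To make the dependence on key order concrete, here is a minimal sketch that contrasts a checksum over the serialized map with one over a key-sorted list of pairs. The module and function names are hypothetical. This is not necessarily what #8143 does, and it only removes the dependence on the order in which top-level keys are written.

```erlang
-module(meta_checksum_sketch).
-export([naive/1, canonical/1]).

%% CRC over the external term format of the map itself.
%% The pairs are written in whatever order the runtime uses for the map,
%% so the result can change when that order changes.
naive(Meta) when is_map(Meta) ->
    erlang:crc32(term_to_binary(Meta)).

%% CRC over a list of {Key, Value} pairs sorted by key.
%% Lists keep their element order when serialized, so the pair order no
%% longer depends on how the runtime orders map keys.
%% Assumption: values contain no nested maps; those are not canonicalized.
canonical(Meta) when is_map(Meta) ->
    Pairs = lists:keysort(1, maps:to_list(Meta)),
    erlang:crc32(term_to_binary(Pairs)).
```

Note that the two binaries in the sessions above also tag atoms differently (119 in one, 100 in the other), so sorting keys on its own may not be enough to make checksums match across releases.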

@michaelklishin
Member

#8143 makes rolling upgrades to Erlang 26 succeed under a constant load involving QQs.

@michaelklishin michaelklishin added this to the 3.12.0 milestone May 11, 2023