Skip to content

Conversation

luos
Copy link
Contributor

@luos luos commented Mar 11, 2020

Proposed Changes

Hi,

We have a case with 3.7.15 where some queues are crashed and not restarted in case of full partition (all nodes stopping). I tested 3.7.23 which also has this bug and 3.8.2.

This affects a three node pause_minority cluster with ha-sync-mode=automatic.

The root cause of the issue is the following crash:

** {{badmatch,{error,{queue_supervisor_not_found,[]}}},[{rabbit_amqqueue_sup_sup,start_queue_process,3,[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,45}]},{rabbit_mirror_queue_misc,'-add_mirror/3-fun-0-',4,[{file,"src/rabbit_mirror_queue_misc.erl"},{line,251}]},{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,514}]},{rabbit_mirror_queue_misc,'-add_mirrors/3-lc$^0/1-0-',3,[{file,"src/rabbit_mirror_queue_misc.erl"},{line,239}]},{rabbit_mirror_queue_misc,add_mirrors,3,[{file,"src/rabbit_mirror_queue_misc.erl"},{line,239}]},{rabbit_mirror_queue_master,init_with_existing_bq,3,[{file,"src/rabbit_mirror_queue_master.erl"},{line,127}]},{rabbit_mirror_queue_master,init,3,[{file,"src/rabbit_mirror_queue_master.erl"},{line,102}]},{rabbit_amqqueue_process,init_it2,3,[{file,"src/rabbit_amqqueue_process.erl"},{line,207}]}]}

Which starts a bunch of crashes, which in turn crashes the queue losing all messages.
After these crashes there are some more crashes like duplicate_live_mater or Stopping queue because of not synchronised slave.

This line can also fail with badmatch on QPid:
https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_mirror_queue_slave.erl#L125

I tested the fix and in my tests with the original code it was around 1 in 5 chances of at least one queue breaking out of 500, with the fix it is zero while testing an "all nodes paused" scenario.
Tested 3.7.23 and 3.8.2.

I am open to refactor start_queue_process to return {ok, Pid} | {error, Error} as well if you'd prefer that.

This was discussed in the past in the context of starting all RabbitMQ at the same time, however in this case it's a built in mechanism of RabbitMQ which can not be avoided if we want reliable message delivery.

Thanks

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

Further Comments

If this is a relatively large or complex change, kick off the discussion by
explaining why you chose the solution you did and what alternatives you
considered, etc.

A starting master can create mirrors on other nodes while that is not fully
running as there is a race condition between vhost sup and queue sup sup
starting. this crashes the queues with error

{{badmatch,{error,{queue_supervisor_not_found,[]}}

This error is now caught in a try catch.
@michaelklishin michaelklishin changed the base branch from v3.8.x to master March 11, 2020 17:57
@michaelklishin michaelklishin changed the base branch from master to v3.8.x March 11, 2020 17:58
@michaelklishin
Copy link
Collaborator

@luos thank you. Please submit PRs against master in the future.

Adding error handling of this kind makes sense, even if all it does is "containing the damage".

@michaelklishin
Copy link
Collaborator

@luos also, I assume this can be backported to v3.7.x?

@luos
Copy link
Contributor Author

luos commented Mar 11, 2020

Hi,

Yes, please backport it.

I tried to build it against master but I could not compile it run RabbitMQ because it was crashing, I'll try again tomorrow and open a PR against it if you need it.

Thanks

@michaelklishin michaelklishin merged commit 3c1b317 into rabbitmq:v3.8.x Mar 11, 2020
michaelklishin added a commit that referenced this pull request Mar 11, 2020
After network partition some classic queues are crashed, do not have mirrors

(cherry picked from commit 3c1b317)
michaelklishin added a commit that referenced this pull request Mar 11, 2020
After network partition some classic queues are crashed, do not have mirrors

(cherry picked from commit 3c1b317)
@michaelklishin
Copy link
Collaborator

Backported to v3.7.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants