Deadlock while syncing mirrored queues #714
Comments
The logs report numerous errors related to the GM (see below), which seem to be the cause of this error. As we have previously seen, a 'master' retrieves an incorrect database state in which it appears as a slave too. This test environment has 3 nodes clustered with HA queues: rabbit04, rabbit05, rabbit06. A partial partition is caused by dropping the connection from rabbit05 to rabbit04. Following the error traces, and with some additional logging, we can see the changes to the GM group:
rabbit05 detects the death of rabbit04 (through the partial partition) and stores the new GM group as r05 and r06. rabbit06 sees this and, instants later, detects rabbit05 in pause_minority, updating the GM group to r06 only. During the next gm:check_neighbours call that rabbit04 executes, it crashes because the GM group it retrieves from Mnesia does not contain r04. Yet r04 is alive and connected to r06, so it receives all database updates. It seems that 'DOWN' messages during partial network partitions are not properly handled (or recovered from afterwards), which creates an inconsistent state across the cluster. See here.
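The sequence of membership updates above can be modeled as follows. This is a hypothetical sketch, not RabbitMQ code: the dictionary stands in for the GM group record stored in Mnesia, and the node names follow the test environment described above.

```python
# Shared view of the GM group, analogous to the group record in Mnesia.
group = {"members": ["rabbit04", "rabbit05", "rabbit06"]}

def record_death(dead, observer):
    """Observer removes a member it believes is dead and writes the new view."""
    if dead in group["members"]:
        group["members"].remove(dead)
        print(f"{observer}: group is now {group['members']}")

# rabbit05 sees rabbit04 die (only the r05->r04 link is actually down).
record_death("rabbit04", "rabbit05")
# rabbit06 then sees rabbit05 enter pause_minority and removes it too.
record_death("rabbit05", "rabbit06")

# rabbit04 is still alive and connected to rabbit06, so it reads this view,
# finds itself missing, and its next check_neighbours call crashes.
assert "rabbit04" not in group["members"]
```

The point of the model is that each observer writes its own partial view into the shared record, so a node that is still alive can end up absent from the group it reads back.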
I have a patch currently under testing that solves the deadlock and apparently allows the cluster to recover. In this patch, every GM process verifies that it is still part of the group before the call. The patch needs longer rounds of testing, as it now eventually hits #545 (on Ubuntu), which seems to cause mismatches in the delta calculations (as seen in https://groups.google.com/forum/#!topic/rabbitmq-users/3QKj-UBqz-g by @johnfoldager). The delta error has been seen before, so it might be an independent problem. Also, the terminate of the stopped queues can lead to further errors.
I will try to publish the patch next week once I am more confident in it.
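The membership check the patch describes could look roughly like this. This is an illustrative Python sketch under assumed names (check_neighbours, read_group_members), not the actual Erlang code: the idea is simply to read the group view back and bail out cleanly when the process is no longer listed, so the mirror can re-join later instead of crashing.

```python
def check_neighbours(self_name, read_group_members):
    """Recalculate neighbours, but only if we are still in the group.

    read_group_members is a callable returning the current group view
    (e.g. the record stored in Mnesia).
    """
    members = read_group_members()
    if self_name not in members:
        # We were removed during a partial partition: stop cleanly and
        # let the queue mirror restart and re-join the group.
        return "terminate_and_rejoin"
    # ... normal neighbour recalculation would go here ...
    return "ok"

# rabbit04 reads a view that no longer contains it, per the scenario above.
print(check_neighbours("rabbit04", lambda: ["rabbit06"]))  # terminate_and_rejoin
print(check_neighbours("rabbit06", lambda: ["rabbit06"]))  # ok
```

The clean-termination path is what allows the cluster to recover: the removed process does not crash mid-update, and a fresh mirror can join the surviving group.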
@dcorbacho 👍 on the idea of GM processes cleanly terminating to re-join later.
Would it be possible to have the patch/fix for 3.5.7 as well?
The tests for #676 exposed a deadlock in the syncing of mirrored queues. Occasionally while the #676 bug is present, and always once the Mnesia patch is successfully applied to Erlang 18.3, a queue master enters a deadlock.
The output of
rabbit_diagnostics:maybe_stuck/0
shows how the master is waiting for the syncer in a receive clause, while the syncer is waiting for the slaves in another receive clause. The problem here is the list of slaves retrieved by the master in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_mirror_queue_master.erl#L154, which contains the master itself. Thus, the master is waiting for the syncer, which is at the same time waiting for the master (as a slave) and will never answer.
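The wait-for cycle described above can be made concrete with a small model. This is an illustrative sketch, not RabbitMQ code: processes are just names, and each edge records which process a blocked process is waiting on.

```python
def find_cycle(waits_for):
    """Return a wait-for cycle if one exists, following edges from each node."""
    for start in waits_for:
        seen, node = [], start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):] + [node]
    return None

# Buggy slave list: it contains the master itself, so the syncer ends up
# waiting for an acknowledgement from the master, which is itself blocked
# waiting for the syncer.
waits = {"master": "syncer", "syncer": "master"}
print(find_cycle(waits))  # ['master', 'syncer', 'master']

# With the master filtered out of the slave list, the syncer only waits on
# real slaves and no cycle exists.
print(find_cycle({"syncer": "slave_a"}))  # None
```

This is why excluding the master from the retrieved slave list breaks the deadlock: the receive in the syncer then only depends on processes that are actually able to answer.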
Some added debug logs showing the problem:
In fact, the cluster is still blocked, but it seems a new master for that queue has been elected: