Deadlock while syncing mirrored queues #714

Closed
dcorbacho opened this issue Mar 30, 2016 · 5 comments

@dcorbacho (Contributor)

The tests for #676 exposed a deadlock in the syncing of mirrored queues. Occasionally while the #676 bug is present, and always once the Mnesia patch is successfully applied to Erlang 18.3, a queue master enters a deadlock.

The output of rabbit_diagnostics:maybe_stuck/0 shows how the master is waiting for the syncer in a receive clause, while the syncer is waiting for the slaves in another receive clause.
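
One way to run this check on a live node is via rabbitmqctl eval:

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'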

2016-03-30 07:02:53 [{pid,<5588.1224.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{rabbit_mirror_queue_sync,master_go,8,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,94}]},
                          {rabbit_mirror_queue_master,sync_mirrors,3,
                              [{file,"src/rabbit_mirror_queue_master.erl"},
                               {line,159}]},
                          {rabbit_amqqueue_process,handle_call,3,
                              [{file,"src/rabbit_amqqueue_process.erl"},
                               {line,1112}]},
                          {gen_server2,handle_msg,2,
                              [{file,"src/gen_server2.erl"},{line,1028}]},
                          {proc_lib,wake_up,3,
                              [{file,"proc_lib.erl"},{line,250}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,1771},
                     {links,
                         [<5588.1223.0>,<5588.1373.0>,<5588.6588.0>,
                          #Port<5588.8500>]},
                     {monitors,[]},
                     {monitored_by,
                         [<5588.923.0>,<5588.911.0>,<5588.7969.0>,
                          <5588.8274.0>,<5588.7038.0>,<5588.919.0>,
                          <5588.6588.0>,<5588.931.0>,<5588.826.0>,
                          <5588.934.0>,<5588.7930.0>,<5588.7000.0>,
                          <5588.6587.0>]},
                     {heap_size,833026}]
2016-03-30 07:02:53 [{pid,<5588.6588.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{rabbit_mirror_queue_sync,
                              '-await_slaves/2-lc$^0/1-0-',2,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,210}]},
                          {rabbit_mirror_queue_sync,syncer,4,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,197}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,1},
                     {links,[<5588.1224.0>]},
                     {monitors,[{process,<5588.1224.0>}]},
                     {monitored_by,[]},
                     {heap_size,233}]

The problem here is the list of slaves retrieved by the master in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_mirror_queue_master.erl#L154, which contains the master itself (see the sketch after the list below). Thus the master is waiting for the syncer, while the syncer is in turn waiting for the master (as one of its slaves); the master will never answer because:

  1. It thinks it is a master
  2. It is blocked in the receive
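
A minimal sketch of the missing guard (module and function names are illustrative, not the actual fix): the master must not appear in the slave pid list it hands to the syncer, otherwise the syncer waits for an answer from a process that is itself blocked waiting for the syncer.

-module(sync_deadlock_sketch).
-export([drop_self/1]).

%% Remove the calling process from a list of mirror pids before passing it
%% to the syncer, so the master can never end up waiting on itself.
drop_self(SPids) ->
    [SPid || SPid <- SPids, SPid =/= self()].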

Some debug logs were added that show the problem:

=INFO REPORT==== 30-Mar-2016::08:51:18 ===
Mirrored queue 'myQuueue_a_3' in vhost '/': Slave <rabbit@t-srv-rabbit06.1.1258.0> saw deaths of mirrors <rabbit@t-srv-rabbit05.1.1303.0> <rabbit@t-srv-rabbit04.3.870.0>

=INFO REPORT==== 30-Mar-2016::08:51:18 ===
Mirrored queue 'myQuueue_a_3' in vhost '/': Promoting slave <rabbit@t-srv-rabbit06.1.1258.0> to master

=WARNING REPORT==== 30-Mar-2016::08:51:19 ===
(<0.1258.0>) MASTER rabbit_mirror_queue_sync:master_go({resource,<<"/">>,
                                                        queue,
                                                        <<"myQuueue_a_3">>}) syncer <0.1429.0> slaves []
(<0.1258.0>) MASTER messages {messages,[]}

=WARNING REPORT==== 30-Mar-2016::08:51:19 ===
(<0.1429.0>) SYNCER rabbit_mirror_queue_sync:await_slaves []

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.1258.0>) MASTER rabbit_mirror_queue_sync:master_go({resource,<<"/">>,
                                                        queue,
                                                        <<"myQuueue_a_3">>}) syncer <0.11496.0> slaves [<0.1258.0>,
                                                                                                        <6676.1990.0>]
(<0.1258.0>) MASTER messages {messages,[]}

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.11496.0>) SYNCER rabbit_mirror_queue_sync:await_slaves [{<0.1258.0>,
                                                             'rabbit@t-srv-rabbit06'},
                                                            {<6676.1990.0>,
                                                             'rabbit@t-srv-rabbit05'}]

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.11496.0>) SYNCER rabbit_mirror_queue_sync:await_slaves receive <0.1258.0>

In fact, the cluster is still blocked but it seems a new master for that queue has been elected:

root@t-srv-rabbit06:~# rabbitmqctl eval 'rabbit_amqqueue:lookup({resource,<<"/">>,
>                                                         queue,
>                                                         <<"myQuueue_a_3">>}).'
{ok,{amqqueue,{resource,<<"/">>,queue,<<"myQuueue_a_3">>},
              true,false,none,[],<5588.14250.0>,
              [<6426.21064.0>,<6427.10558.3>],
              [],
              ['rabbit@t-srv-rabbit04'],
              [{vhost,<<"/">>},
               {name,<<"all">>},
               {pattern,<<>>},
               {'apply-to',<<"all">>},
               {definition,[{<<"ha-mode">>,<<"all">>},
                            {<<"ha-sync-mode">>,<<"automatic">>}]},
               {priority,0}],
              [{<6427.10559.3>,<6427.10558.3>},
               {<6426.21066.0>,<6426.21064.0>},
               {<5588.14251.0>,<5588.14250.0>}],
              [],live}}
rabbitmqctl eval 'erlang:process_info(list_to_pid("<0.14250.0>")).'
[{current_function,{gen_server2,process_next_msg,1}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<5588.14249.0>,<5588.15384.0>,#Port<5588.9658>]},
 {dictionary,
     [{{credit_from,<5588.960.0>},1721},
      {delegate,delegate_8},
      {{credit_to,<6431.6063.0>},50},
      {{"/var/lib/rabbitmq/mnesia/rabbit@t-srv-rabbit06/queues/A9XGIAWLC9TKPRDRRNC482GB8/journal.jif",
        fhc_file},
       {file,1,true}},
      {'$ancestors',
          [<5588.14249.0>,rabbit_amqqueue_sup_sup,rabbit_sup,<5588.826.0>]},
      {{credit_to,<6435.13156.0>},10},
      {fhc_age_tree,
          {1,{{-576459687147248595,#Ref<5588.0.2.48637>},true,nil,nil}}},
      {process_name,
          {rabbit_amqqueue_process,
              {resource,<<"/">>,queue,<<"myQuueue_a_3">>}}},
      {{#Ref<5588.0.2.48637>,fhc_handle},
       {handle,
           {file_descriptor,prim_file,{#Port<5588.9658>,89}},
           #Ref<5588.0.2.48637>,0,false,0,infinity,[],<<>>,0,0,0,0,0,true,
           "/var/lib/rabbitmq/mnesia/rabbit@t-srv-rabbit06/queues/A9XGIAWLC9TKPRDRRNC482GB8/journal.jif",
           [write,binary,raw,read],
           [{write_buffer,infinity}],
           true,true,-576459687147248595}},
      {{credit_to,<6435.13150.0>},10},
      {credit_blocked,[]},
      {{credit_to,<6431.6062.0>},50},
      {credit_flow_default_credit,{200,50}},
      {{credit_to,<6431.6047.0>},49},
      {guid,{{829587482,95375708,1576193894,3592586924},1}},
      {'$initial_call',{gen,init_it,6}},
      {{credit_to,<6435.13113.0>},10}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<5588.825.0>},
 {total_heap_size,196650},
 {heap_size,196650},
 {stack_size,7},
 {reductions,1884012},
 {garbage_collection,
     [{min_bin_vheap_size,46422},
      {min_heap_size,233},
      {fullsweep_after,65535},
      {minor_gcs,0}]},
 {suspending,[]}]
@dcorbacho self-assigned this Mar 30, 2016
@michaelklishin added this to the 3.6.2 milestone Mar 30, 2016
@dcorbacho (Contributor, Author)

The logs report plenty of errors related to the GM (see below), which seem to be the cause of this problem. As previously seen, a 'master' retrieves an incorrect database state in which it is also listed as a slave.

This test environment has 3 nodes clustered with HA queues: rabbit04, rabbit05, rabbit06. A partial partition is caused by dropping the connection from rabbit05 to rabbit04.
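
For reproduction, one way to induce such a partial partition in a test cluster (illustrative only, assuming Linux with iptables; not necessarily how it was done in this environment) is to drop only the traffic between the two affected nodes:

# on rabbit05: block traffic to and from rabbit04 only, leaving rabbit06 reachable
iptables -A INPUT  -s t-srv-rabbit04 -j DROP
iptables -A OUTPUT -d t-srv-rabbit04 -j DROP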

Following the error traces, and with some additional logs, we can see the changes to the GM group:

            T0 ----------------T1----------------------------- T2 --------------------- T3
rabbit04     All                All                           All                     All -> Crashes!
rabbit05     All   partition    {death, r04}, r05, r06        pause_minority          pause_minority
rabbit06     All                {death, r04} through r05      {death,r05}, r06         r06

rabbit05 detects the death of rabbit04 (through the partial partition) and stores the new GM group as r05 and r06. rabbit06 sees this and, moments later, detects rabbit05 entering pause_minority, updating the GM group to r06 only. During the next gm:check_neighbours call that rabbit04 executes, it crashes because the GM group it retrieves from Mnesia no longer contains r04. Indeed, r04 is alive and connected to r06, so it receives all database updates.
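
The crash shape is easy to reproduce in isolation (an illustration, not taken from the logs): orddict:fetch/2 has no clause for a missing key, so a GM member that looks itself up in a view it has already been removed from fails with function_clause, exactly as in the reports below.

%% In an Erlang shell:
View = orddict:from_list([{other_member, some_view_member}]).
orddict:fetch(missing_member, View).   %% raises function_clause in orddict:fetch/2 (orddict.erl, line 80)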

It seems that 'DOWN' messages during partial network partitions are not properly handled (or recovered from afterwards), which creates an inconsistent state across the cluster. See here.

=ERROR REPORT==== 31-Mar-2016::07:24:37 ===
** Generic server <0.917.0> terminating
** Last message in was {'$gen_cast',{'$gm',5,check_neighbours}}
** When Server state == {state,
                         {0,<0.917.0>},
                         {{0,<21912.1217.0>},#Ref<0.0.2.709>},
                         {{1,<21910.1249.0>},#Ref<0.0.2.799>},
                         {resource,<<"/">>,queue,<<"myQuueue_a_5">>},
                         rabbit_mirror_queue_coordinator,
                         {2,
                          [{{0,<0.917.0>},
                            {view_member,
                             {0,<0.917.0>},
                             [],
                             {0,<21912.1217.0>},
                             {1,<21910.1249.0>}}},
                           {{0,<21912.1217.0>},
                            {view_member,
                             {0,<21912.1217.0>},
                             [],
                             {1,<21910.1249.0>},
                             {0,<0.917.0>}}},
                           {{1,<21910.1249.0>},
                            {view_member,
                             {1,<21910.1249.0>},
                             [],
                             {0,<0.917.0>},
                             {0,<21912.1217.0>}}}]},
                         1617,
....
** Reason for termination == 
** {function_clause,
       [{orddict,fetch,
            [{0,<0.917.0>},
             [{{1,<21910.1249.0>},
               {view_member,
                   {1,<21910.1249.0>},
                   [],
                   {1,<21910.1249.0>},
                   {1,<21910.1249.0>}}}]],
            [{file,"orddict.erl"},{line,80}]},
        {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1267}]},
        {gm,change_view,2,[{file,"src/gm.erl"},{line,1439}]},
        {gm,handle_cast,2,[{file,"src/gm.erl"},{line,649}]},
        {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

and

=ERROR REPORT==== 31-Mar-2016::07:46:33 ===
** Generic server <0.14462.0> terminating
** Last message in was {'DOWN',#Ref<0.0.2.59115>,process,<21908.8418.0>,
                               noconnection}
** When Server state == {state,
                            {55,<0.14462.0>},
                            {{60,<21908.8418.0>},#Ref<0.0.2.59115>},
                            {{60,<21908.8418.0>},#Ref<0.0.2.59116>},
                            {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                            rabbit_mirror_queue_coordinator,
                            {62,
                             [{{55,<0.14462.0>},
                               {view_member,
                                   {55,<0.14462.0>},
                                   [],
                                   {60,<21908.8418.0>},
                                   {60,<21908.8418.0>}}},
                              {{60,<21908.8418.0>},
                               {view_member,
                                   {60,<21908.8418.0>},
                                   [],
                                   {55,<0.14462.0>},
                                   {55,<0.14462.0>}}}]},
                            0,
                            [{{55,<0.14462.0>},{member,{[],[]},-1,-1}},
                             {{60,<21908.8418.0>},{member,{[],[]},30,30}}],
                            [<0.14589.0>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            false}
** Reason for termination == 
** {function_clause,
       [{orddict,fetch,
            [{55,<0.14462.0>},
             [{{62,<21909.16396.0>},
               {view_member,
                   {62,<21909.16396.0>},
                   [],
                   {43,<21908.7319.0>},
                   {43,<21908.7319.0>}}}]],
            [{file,"orddict.erl"},{line,80}]},
        {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1267}]},
        {gm,change_view,2,[{file,"src/gm.erl"},{line,1439}]},
        {gm,handle_info,2,[{file,"src/gm.erl"},{line,760}]},
        {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}

and

=ERROR REPORT==== 31-Mar-2016::07:27:15 ===
** Generic server <0.10502.0> terminating
** Last message in was {'$gen_cast',
                           {gm_deaths,[<21908.4277.0>,<21908.4108.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"myQuueue_a_8">>},
                                true,false,none,[],<0.10466.0>,[],[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"all">>},
                                 {pattern,<<>>},
                                 {'apply-to',<<"all">>},
                                 {definition,
                                     [{<<"ha-mode">>,<<"all">>},
                                      {<<"ha-sync-mode">>,<<"automatic">>}]},
                                 {priority,0}],
                                [{<21909.6330.0>,<21909.6326.0>}],
                                [],live},
                            <0.10467.0>,
                            {state,
                                {dict,2,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],
                                      [[<21909.6570.0>|#Ref<0.0.2.36114>]],
                                      [],
                                      [[<21908.850.0>|#Ref<0.0.1.37197>]],
                                      [],[]}}},
                                erlang},
                            #Fun<rabbit_mirror_queue_master.7.98604090>,
                            #Fun<rabbit_mirror_queue_master.8.98604090>}
** Reason for termination == 
** {{case_clause,{ok,<21909.6678.0>,[],[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
         [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

@dcorbacho (Contributor, Author)

I have a patch currently under testing that solves the deadlock and apparently allows the cluster to recover.

In this patch, every GM process verifies that it is still in the group before calling check_neighbours. If it is not, it stops. This allows the process to stop gracefully and rejoin the cluster afterwards, so queue synchronization and master/slave election happen as usual.
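
A rough sketch of the idea (module and function names are illustrative, not the actual patch): before recomputing its neighbours from a new group view, the member first checks that it still appears in that view, and stops normally if it does not, instead of crashing in orddict:fetch/2 as in the traces above.

-module(gm_membership_sketch).
-export([maybe_check_neighbours/3]).

maybe_check_neighbours(Self, View, State) ->
    case orddict:is_key(Self, View) of
        true ->
            %% Still a member: safe to look ourselves up and update neighbours.
            {ok, orddict:fetch(Self, View), State};
        false ->
            %% No longer in the group view: terminate cleanly so this mirror
            %% can be restarted and rejoin the group later.
            {stop, normal, State}
    end.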

The patch needs longer rounds of testing, as it now eventually hits #545 (on Ubuntu), which seems to cause mismatches in the delta calculations (as seen in https://groups.google.com/forum/#!topic/rabbitmq-users/3QKj-UBqz-g by @johnfoldager). The delta error has been seen before, so it might be an independent problem. Also, during the termination of the stopped queues it can lead to:

** Reason for termination == 
** {function_clause,
       [{rabbit_variable_queue,d,
            [{delta,21045,3504,21045}],
            [{file,"src/rabbit_variable_queue.erl"},{line,1100}]},
        {rabbit_variable_queue,maybe_deltas_to_betas,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,2469}]},
        {rabbit_variable_queue,purge_betas_and_deltas,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,1667}]},
        {rabbit_variable_queue,purge1,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,1638}]},
        {rabbit_variable_queue,purge_and_index_reset,1,
            [{file,"src/rabbit_variable_queue.erl"},{line,1619}]},
        {rabbit_variable_queue,delete_and_terminate,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,545}]},
        {rabbit_priority_queue,delete_and_terminate,2,
            [{file,"src/rabbit_priority_queue.erl"},{line,181}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1146}]}]}
** In 'terminate' callback with reason ==
** normal

I will try to publish the patch next week, once I am more confident in it.

@michaelklishin (Member)

@dcorbacho 👍 on the idea of GM processes cleanly terminating to re-join later.

@johnfoldager

Would it be possible to have the patch/fix for 3.5.7 as well?

@michaelklishin (Member)

No more 3.5.x releases are planned at the moment. This particular change is probably easy to backport but there are several related issues that aren't.
