Deadlock while syncing mirrored queues #714

Closed
dcorbacho opened this issue Mar 30, 2016 · 5 comments

@dcorbacho (Contributor)

The tests for #676 exposed a deadlock in the syncing of mirrored queues. Occasionally while the #676 bug is present, and always once the Mnesia patch is successfully applied to Erlang 18.3, a queue master enters a deadlock.

The output of rabbit_diagnostics:maybe_stuck/0 shows how the master is waiting for the syncer in a receive clause, while the syncer is waiting for the slaves in another receive clause.
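
One way to run this check on a live node is via rabbitmqctl eval:

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'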

2016-03-30 07:02:53 [{pid,<5588.1224.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{rabbit_mirror_queue_sync,master_go,8,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,94}]},
                          {rabbit_mirror_queue_master,sync_mirrors,3,
                              [{file,"src/rabbit_mirror_queue_master.erl"},
                               {line,159}]},
                          {rabbit_amqqueue_process,handle_call,3,
                              [{file,"src/rabbit_amqqueue_process.erl"},
                               {line,1112}]},
                          {gen_server2,handle_msg,2,
                              [{file,"src/gen_server2.erl"},{line,1028}]},
                          {proc_lib,wake_up,3,
                              [{file,"proc_lib.erl"},{line,250}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,1771},
                     {links,
                         [<5588.1223.0>,<5588.1373.0>,<5588.6588.0>,
                          #Port<5588.8500>]},
                     {monitors,[]},
                     {monitored_by,
                         [<5588.923.0>,<5588.911.0>,<5588.7969.0>,
                          <5588.8274.0>,<5588.7038.0>,<5588.919.0>,
                          <5588.6588.0>,<5588.931.0>,<5588.826.0>,
                          <5588.934.0>,<5588.7930.0>,<5588.7000.0>,
                          <5588.6587.0>]},
                     {heap_size,833026}]
2016-03-30 07:02:53 [{pid,<5588.6588.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{rabbit_mirror_queue_sync,
                              '-await_slaves/2-lc$^0/1-0-',2,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,210}]},
                          {rabbit_mirror_queue_sync,syncer,4,
                              [{file,"src/rabbit_mirror_queue_sync.erl"},
                               {line,197}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,1},
                     {links,[<5588.1224.0>]},
                     {monitors,[{process,<5588.1224.0>}]},
                     {monitored_by,[]},
                     {heap_size,233}]

The problem here is the list of slaves retrieved by the master in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_mirror_queue_master.erl#L154, which contains the master itself (see the sketch after the list below). Thus the master is waiting for the syncer, while the syncer is in turn waiting for the master (as one of its slaves); the master will never answer because:

  1. It thinks it is a master
  2. It is blocked in the receive
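
A minimal sketch of the missing guard (module and function names are illustrative, not the actual fix): the master must not appear in the slave pid list it hands to the syncer, otherwise the syncer waits for an answer from a process that is itself blocked waiting for the syncer.

-module(sync_deadlock_sketch).
-export([drop_self/1]).

%% Remove the calling process from a list of mirror pids before passing it
%% to the syncer, so the master can never end up waiting on itself.
drop_self(SPids) ->
    [SPid || SPid <- SPids, SPid =/= self()].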

Some debug logs were added that show the problem:

=INFO REPORT==== 30-Mar-2016::08:51:18 ===
Mirrored queue 'myQuueue_a_3' in vhost '/': Slave <rabbit@t-srv-rabbit06.1.1258.0> saw deaths of mirrors <rabbit@t-srv-rabbit05.1.1303.0> <rabbit@t-srv-rabbit04.3.870.0>

=INFO REPORT==== 30-Mar-2016::08:51:18 ===
Mirrored queue 'myQuueue_a_3' in vhost '/': Promoting slave <rabbit@t-srv-rabbit06.1.1258.0> to master

=WARNING REPORT==== 30-Mar-2016::08:51:19 ===
(<0.1258.0>) MASTER rabbit_mirror_queue_sync:master_go({resource,<<"/">>,
                                                        queue,
                                                        <<"myQuueue_a_3">>}) syncer <0.1429.0> slaves []
(<0.1258.0>) MASTER messages {messages,[]}

=WARNING REPORT==== 30-Mar-2016::08:51:19 ===
(<0.1429.0>) SYNCER rabbit_mirror_queue_sync:await_slaves []

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.1258.0>) MASTER rabbit_mirror_queue_sync:master_go({resource,<<"/">>,
                                                        queue,
                                                        <<"myQuueue_a_3">>}) syncer <0.11496.0> slaves [<0.1258.0>,
                                                                                                        <6676.1990.0>]
(<0.1258.0>) MASTER messages {messages,[]}

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.11496.0>) SYNCER rabbit_mirror_queue_sync:await_slaves [{<0.1258.0>,
                                                             'rabbit@t-srv-rabbit06'},
                                                            {<6676.1990.0>,
                                                             'rabbit@t-srv-rabbit05'}]

=WARNING REPORT==== 30-Mar-2016::08:52:56 ===
(<0.11496.0>) SYNCER rabbit_mirror_queue_sync:await_slaves receive <0.1258.0>

In fact, the cluster is still blocked but it seems a new master for that queue has been elected:

root@t-srv-rabbit06:~# rabbitmqctl eval 'rabbit_amqqueue:lookup({resource,<<"/">>,
>                                                         queue,
>                                                         <<"myQuueue_a_3">>}).'
{ok,{amqqueue,{resource,<<"/">>,queue,<<"myQuueue_a_3">>},
              true,false,none,[],<5588.14250.0>,
              [<6426.21064.0>,<6427.10558.3>],
              [],
              ['rabbit@t-srv-rabbit04'],
              [{vhost,<<"/">>},
               {name,<<"all">>},
               {pattern,<<>>},
               {'apply-to',<<"all">>},
               {definition,[{<<"ha-mode">>,<<"all">>},
                            {<<"ha-sync-mode">>,<<"automatic">>}]},
               {priority,0}],
              [{<6427.10559.3>,<6427.10558.3>},
               {<6426.21066.0>,<6426.21064.0>},
               {<5588.14251.0>,<5588.14250.0>}],
              [],live}}
rabbitmqctl eval 'erlang:process_info(list_to_pid("<0.14250.0>")).'
[{current_function,{gen_server2,process_next_msg,1}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<5588.14249.0>,<5588.15384.0>,#Port<5588.9658>]},
 {dictionary,
     [{{credit_from,<5588.960.0>},1721},
      {delegate,delegate_8},
      {{credit_to,<6431.6063.0>},50},
      {{"/var/lib/rabbitmq/mnesia/rabbit@t-srv-rabbit06/queues/A9XGIAWLC9TKPRDRRNC482GB8/journal.jif",
        fhc_file},
       {file,1,true}},
      {'$ancestors',
          [<5588.14249.0>,rabbit_amqqueue_sup_sup,rabbit_sup,<5588.826.0>]},
      {{credit_to,<6435.13156.0>},10},
      {fhc_age_tree,
          {1,{{-576459687147248595,#Ref<5588.0.2.48637>},true,nil,nil}}},
      {process_name,
          {rabbit_amqqueue_process,
              {resource,<<"/">>,queue,<<"myQuueue_a_3">>}}},
      {{#Ref<5588.0.2.48637>,fhc_handle},
       {handle,
           {file_descriptor,prim_file,{#Port<5588.9658>,89}},
           #Ref<5588.0.2.48637>,0,false,0,infinity,[],<<>>,0,0,0,0,0,true,
           "/var/lib/rabbitmq/mnesia/rabbit@t-srv-rabbit06/queues/A9XGIAWLC9TKPRDRRNC482GB8/journal.jif",
           [write,binary,raw,read],
           [{write_buffer,infinity}],
           true,true,-576459687147248595}},
      {{credit_to,<6435.13150.0>},10},
      {credit_blocked,[]},
      {{credit_to,<6431.6062.0>},50},
      {credit_flow_default_credit,{200,50}},
      {{credit_to,<6431.6047.0>},49},
      {guid,{{829587482,95375708,1576193894,3592586924},1}},
      {'$initial_call',{gen,init_it,6}},
      {{credit_to,<6435.13113.0>},10}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<5588.825.0>},
 {total_heap_size,196650},
 {heap_size,196650},
 {stack_size,7},
 {reductions,1884012},
 {garbage_collection,
     [{min_bin_vheap_size,46422},
      {min_heap_size,233},
      {fullsweep_after,65535},
      {minor_gcs,0}]},
 {suspending,[]}]
@dcorbacho self-assigned this Mar 30, 2016
@michaelklishin added this to the 3.6.2 milestone Mar 30, 2016
@dcorbacho (Contributor, Author)

The logs report plenty of errors related to the GM (see below), which seem to be the cause of this problem. As previously seen, a 'master' retrieves an incorrect database state in which it is also listed as a slave.

This test environment has 3 nodes clustered with HA queues: rabbit04, rabbit05, rabbit06. A partial partition is caused by dropping the connection from rabbit05 to rabbit04.
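
For reproduction, one way to induce such a partial partition in a test cluster (illustrative only, assuming Linux with iptables; not necessarily how it was done in this environment) is to drop only the traffic between the two affected nodes:

# on rabbit05: block traffic to and from rabbit04 only, leaving rabbit06 reachable
iptables -A INPUT  -s t-srv-rabbit04 -j DROP
iptables -A OUTPUT -d t-srv-rabbit04 -j DROP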

Following the error traces, and with some additional logs, we can see the changes to the GM group:

            T0 ----------------T1----------------------------- T2 --------------------- T3
rabbit04     All                All                           All                     All -> Crashes!
rabbit05     All   partition    {death, r04}, r05, r06        pause_minority          pause_minority
rabbit06     All                {death, r04} through r05      {death,r05}, r06         r06

rabbit05 detects the death of rabbit04 (through the partial partition) and stores the new GM group as r05 and r06. rabbit06 sees this and, moments later, detects rabbit05 entering pause_minority, updating the GM group to r06 only. During the next gm:check_neighbours call that rabbit04 executes, it crashes because the GM group it retrieves from Mnesia no longer contains r04. Indeed, r04 is alive and connected to r06, so it receives all database updates.
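
The crash shape is easy to reproduce in isolation (an illustration, not taken from the logs): orddict:fetch/2 has no clause for a missing key, so a GM member that looks itself up in a view it has already been removed from fails with function_clause, exactly as in the reports below.

%% In an Erlang shell:
View = orddict:from_list([{other_member, some_view_member}]).
orddict:fetch(missing_member, View).   %% raises function_clause in orddict:fetch/2 (orddict.erl, line 80)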

It seems that 'DOWN' messages during partial network partitions are not properly handled (or recovered from afterwards), which creates an inconsistent state across the cluster. See here.

=ERROR REPORT==== 31-Mar-2016::07:24:37 ===
** Generic server <0.917.0> terminating
** Last message in was {'$gen_cast',{'$gm',5,check_neighbours}}
** When Server state == {state,
                         {0,<0.917.0>},
                         {{0,<21912.1217.0>},#Ref<0.0.2.709>},
                         {{1,<21910.1249.0>},#Ref<0.0.2.799>},
                         {resource,<<"/">>,queue,<<"myQuueue_a_5">>},
                         rabbit_mirror_queue_coordinator,
                         {2,
                          [{{0,<0.917.0>},
                            {view_member,
                             {0,<0.917.0>},
                             [],
                             {0,<21912.1217.0>},
                             {1,<21910.1249.0>}}},
                           {{0,<21912.1217.0>},
                            {view_member,
                             {0,<21912.1217.0>},
                             [],
                             {1,<21910.1249.0>},
                             {0,<0.917.0>}}},
                           {{1,<21910.1249.0>},
                            {view_member,
                             {1,<21910.1249.0>},
                             [],
                             {0,<0.917.0>},
                             {0,<21912.1217.0>}}}]},
                         1617,
....
** Reason for termination == 
** {function_clause,
       [{orddict,fetch,
            [{0,<0.917.0>},
             [{{1,<21910.1249.0>},
               {view_member,
                   {1,<21910.1249.0>},
                   [],
                   {1,<21910.1249.0>},
                   {1,<21910.1249.0>}}}]],
            [{file,"orddict.erl"},{line,80}]},
        {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1267}]},
        {gm,change_view,2,[{file,"src/gm.erl"},{line,1439}]},
        {gm,handle_cast,2,[{file,"src/gm.erl"},{line,649}]},
        {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

and

=ERROR REPORT==== 31-Mar-2016::07:46:33 ===
** Generic server <0.14462.0> terminating
** Last message in was {'DOWN',#Ref<0.0.2.59115>,process,<21908.8418.0>,
                               noconnection}
** When Server state == {state,
                            {55,<0.14462.0>},
                            {{60,<21908.8418.0>},#Ref<0.0.2.59115>},
                            {{60,<21908.8418.0>},#Ref<0.0.2.59116>},
                            {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                            rabbit_mirror_queue_coordinator,
                            {62,
                             [{{55,<0.14462.0>},
                               {view_member,
                                   {55,<0.14462.0>},
                                   [],
                                   {60,<21908.8418.0>},
                                   {60,<21908.8418.0>}}},
                              {{60,<21908.8418.0>},
                               {view_member,
                                   {60,<21908.8418.0>},
                                   [],
                                   {55,<0.14462.0>},
                                   {55,<0.14462.0>}}}]},
                            0,
                            [{{55,<0.14462.0>},{member,{[],[]},-1,-1}},
                             {{60,<21908.8418.0>},{member,{[],[]},30,30}}],
                            [<0.14589.0>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            false}
** Reason for termination == 
** {function_clause,
       [{orddict,fetch,
            [{55,<0.14462.0>},
             [{{62,<21909.16396.0>},
               {view_member,
                   {62,<21909.16396.0>},
                   [],
                   {43,<21908.7319.0>},
                   {43,<21908.7319.0>}}}]],
            [{file,"orddict.erl"},{line,80}]},
        {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1267}]},
        {gm,change_view,2,[{file,"src/gm.erl"},{line,1439}]},
        {gm,handle_info,2,[{file,"src/gm.erl"},{line,760}]},
        {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}

and

=ERROR REPORT==== 31-Mar-2016::07:27:15 ===
** Generic server <0.10502.0> terminating
** Last message in was {'$gen_cast',
                           {gm_deaths,[<21908.4277.0>,<21908.4108.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"myQuueue_a_8">>},
                                true,false,none,[],<0.10466.0>,[],[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"all">>},
                                 {pattern,<<>>},
                                 {'apply-to',<<"all">>},
                                 {definition,
                                     [{<<"ha-mode">>,<<"all">>},
                                      {<<"ha-sync-mode">>,<<"automatic">>}]},
                                 {priority,0}],
                                [{<21909.6330.0>,<21909.6326.0>}],
                                [],live},
                            <0.10467.0>,
                            {state,
                                {dict,2,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],
                                      [[<21909.6570.0>|#Ref<0.0.2.36114>]],
                                      [],
                                      [[<21908.850.0>|#Ref<0.0.1.37197>]],
                                      [],[]}}},
                                erlang},
                            #Fun<rabbit_mirror_queue_master.7.98604090>,
                            #Fun<rabbit_mirror_queue_master.8.98604090>}
** Reason for termination == 
** {{case_clause,{ok,<21909.6678.0>,[],[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
         [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

@dcorbacho (Contributor, Author)

I have a patch currently under testing that solves the deadlock and apparently allows the cluster to recover.

In this patch, every GM process verifies that it is still in the group before calling check_neighbours. If it is not, it stops. This allows the process to stop gracefully and rejoin the cluster afterwards, so queue synchronization and master/slave election happen as usual.
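
A rough sketch of the idea (module and function names are illustrative, not the actual patch): before recomputing its neighbours from a new group view, the member first checks that it still appears in that view, and stops normally if it does not, instead of crashing in orddict:fetch/2 as in the traces above.

-module(gm_membership_sketch).
-export([maybe_check_neighbours/3]).

maybe_check_neighbours(Self, View, State) ->
    case orddict:is_key(Self, View) of
        true ->
            %% Still a member: safe to look ourselves up and update neighbours.
            {ok, orddict:fetch(Self, View), State};
        false ->
            %% No longer in the group view: terminate cleanly so this mirror
            %% can be restarted and rejoin the group later.
            {stop, normal, State}
    end.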

The patch needs longer rounds of testing, as it now eventually hits #545 (on Ubuntu), which seems to cause mismatches in the delta calculations (as seen in https://groups.google.com/forum/#!topic/rabbitmq-users/3QKj-UBqz-g by @johnfoldager). The delta error has been seen before, so it might be an independent problem. Also, during the termination of the stopped queues it can lead to:

** Reason for termination == 
** {function_clause,
       [{rabbit_variable_queue,d,
            [{delta,21045,3504,21045}],
            [{file,"src/rabbit_variable_queue.erl"},{line,1100}]},
        {rabbit_variable_queue,maybe_deltas_to_betas,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,2469}]},
        {rabbit_variable_queue,purge_betas_and_deltas,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,1667}]},
        {rabbit_variable_queue,purge1,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,1638}]},
        {rabbit_variable_queue,purge_and_index_reset,1,
            [{file,"src/rabbit_variable_queue.erl"},{line,1619}]},
        {rabbit_variable_queue,delete_and_terminate,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,545}]},
        {rabbit_priority_queue,delete_and_terminate,2,
            [{file,"src/rabbit_priority_queue.erl"},{line,181}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1146}]}]}
** In 'terminate' callback with reason ==
** normal

I will try to publish the patch next week, once I am more confident in it.

@michaelklishin (Member)

@dcorbacho 👍 on the idea of GM processes cleanly terminating to re-join later.

@johnfoldager

Would it be possible to have the patch/fix for 3.5.7 as well?

@michaelklishin (Member)

No more 3.5.x releases are planned at the moment. This particular change is probably easy to backport but there are several related issues that aren't.
