Mirrored queue crash with out of sync ACKs #749

Closed
dcorbacho opened this issue Apr 14, 2016 · 4 comments

@dcorbacho (Contributor)

Using the patch for #714, in a 3-node cluster configured to test #545, the GM might eventually crash when processing an activity message:

** {{case_clause,
        {{{value,
              {33059,
               {publish,<9042.1151.1>,flow,
                   {message_properties,undefined,false,2048},
                   {basic_message,
                       {resource,<<"/">>,exchange,<<"testExchange">>},
                       [<<>>],
                       {content,60,
....
    [{gm,find_common,3,[{file,"src/gm.erl"},{line,1369}]},
     {gm,'-handle_msg/2-fun-2-',7,[{file,"src/gm.erl"},{line,881}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1386}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1262}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,871}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,661}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}

which I believe leads to the following on the other nodes:

=ERROR REPORT==== 12-Apr-2016::16:18:05 ===
** Generic server <0.19461.2> terminating
** Last message in was {'$gen_cast',join}
** When Server state == {state,{9,<0.19461.2>},
                               {{9,<0.19461.2>},undefined},
                               {{9,<0.19461.2>},undefined},
                               {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                               rabbit_mirror_queue_slave,undefined,-1,
                               undefined,
                               [<0.19460.2>],
                               {[],[]},
                               [],0,undefined,
                               #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                               false}
** Reason for termination == 
** {{bad_return_value,
        {bad_flying_ets_update,1,2,
            {<<212,124,127,183,143,75,237,208,132,9,251,34,112,92,244,166>>,
             <<202,95,0,178,134,57,152,103,126,177,128,73,15,248,54,106>>}}},
    {gen_server2,call,
        [<5629.28766.2>,{add_on_right,{9,<0.19461.2>}},infinity]}}

and

=ERROR REPORT==== 12-Apr-2016::16:16:38 ===
** Generic server <0.27327.2> terminating
** Last message in was go
** When Server state == {not_started,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                                true,false,none,[],<0.26808.2>,[],[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"all">>},
                                 {pattern,<<>>},
                                 {'apply-to',<<"all">>},
                                 {definition,
                                     [{<<"ha-mode">>,<<"all">>},
                                      {<<"ha-sync-mode">>,<<"automatic">>}]},
                                 {priority,0}],
                                [{<32227.15062.2>,<32227.14697.2>},
                                 {<0.26809.2>,<0.26808.2>}],
                                [],live}}
** Reason for termination == 
** {duplicate_live_master,'rabbit@t-srv-rabbit04'}

This is not suspected to have been introduced by #714, but rather to be a consequence of the deadlock being resolved: with pause_minority, the system keeps running through partial partitions and eventually reaches an inconsistent state.

@dcorbacho (Contributor, Author)

The problem arises when the following situation occurs:

  • A -> B -> C is a GM group ring
  • A partial partition happens between A and B
  • A erases B as a member and updates Mnesia
  • A few seconds later, B, unaware of this change, records A as dead and adds itself to the group again.
  • C sees these changes of the member to its left in the ring, but continues as usual, assuming they are normal; in fact, a dead member has come back to life. An inconsistency while processing the queues then triggers {gm,find_common,3,[{file,"src/gm.erl"},{line,1369}]}: when the process compares both queues, unexpected elements show up in the middle of the common block (see the sketch after this list).
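
To make the failure mode concrete, here is a minimal sketch of the kind of comparison that blows up. It is not the actual gm:find_common/3 (which takes three arguments and operates on the members' pub/ack queues); it only illustrates how an entry that one side never saw falls through every expected pattern and surfaces as a case_clause:

%% Hedged sketch, not gm.erl: walk two queues of {MsgId, Payload} entries,
%% expecting the heads to match; an entry injected by the "resurrected"
%% member matches no expected pattern and surfaces as a case_clause.
find_common(QA, QB) ->
    case {queue:out(QA), queue:out(QB)} of
        {{empty, _}, {empty, _}} ->
            in_sync;
        {{{value, {Id, _}}, RestA}, {{value, {Id, _}}, RestB}} ->
            %% heads agree, keep walking the common block
            find_common(RestA, RestB);
        Other ->
            %% an out-of-sync element in the middle of the common block;
            %% this is where the real code raises {case_clause, ...}
            erlang:error({case_clause, Other})
    end.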

I'm still investigating whether this is a race condition in the synchronization of the database (B must read A's updates through node C) or an implementation problem. The nodes hosting A and B are indeed stopping, but the GM keeps processing events until the stop is finally reflected in the logs.
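
For reference, the membership read in question looks roughly like the sketch below. The gm_group table and its {gm_group, Name, Version, Members} record shape are my reading of gm.erl; treat them as assumptions if they differ in your version:

%% Hedged sketch: read the group membership back from Mnesia, which is how
%% B would have to observe A's update (replicated via the surviving node C).
read_group_members(GroupName) ->
    {atomic, Members} =
        mnesia:transaction(
          fun () ->
                  case mnesia:read({gm_group, GroupName}) of
                      [{gm_group, _Name, _Version, Members0}] -> Members0;
                      []                                      -> []
                  end
          end),
    Members.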

Maybe we could do something with the unexpected elements in the queue? I don't know yet.
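
One possible shape for "doing something with the unexpected elements" is to skip whatever only one side has instead of crashing. This is only a sketch of the idea under that assumption, not the patch that was eventually proposed:

%% Hedged sketch: on a head mismatch, drop this side's entry and keep going
%% instead of raising case_clause; the cost is that such entries may never
%% reach every mirror.
find_common_tolerant(QA, QB) ->
    case {queue:out(QA), queue:out(QB)} of
        {{empty, _}, _} ->
            done;
        {_, {empty, _}} ->
            done;
        {{{value, {Id, _}}, RestA}, {{value, {Id, _}}, RestB}} ->
            find_common_tolerant(RestA, RestB);
        {{{value, _Unexpected}, RestA}, _} ->
            find_common_tolerant(RestA, QB)
    end.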

@dcorbacho (Contributor, Author)

I created a PR that fixes the crash, but not the root cause. I couldn't find a way to repair the broken ring: the nodes in the partial partition can't see the update through the 'live' node. It may be a timing issue, or Mnesia might already be inconsistent and no longer updating, as the inconsistent-database event is triggered just afterwards.
With this change it might happen that a small number of messages do not end up in all queues, but the system remains functional. @michaelklishin thoughts?

@michaelklishin (Member)

When some ring members are unreachable, ignoring log operations for them is probably about the best we can do.

Eventually gm will be replaced with Raft-based mirroring, which has a well-understood solution for logs getting out of sync.
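
For contrast, here is a toy illustration (plain Erlang, not RabbitMQ or any Raft library) of the log-matching rule that makes reconciliation well-defined in Raft: a follower only appends when the leader's previous {Index, Term} entry matches its own log, otherwise it rejects and the leader retries from an earlier index:

%% Toy sketch of Raft's log-matching check. Log is a list of {Index, Term}
%% entries; Entries are new entries to append after {PrevIndex, PrevTerm}.
accept_append(_Log, 0, _PrevTerm, Entries) ->
    {ok, Entries};
accept_append(Log, PrevIndex, PrevTerm, Entries) ->
    case lists:keyfind(PrevIndex, 1, Log) of
        {PrevIndex, PrevTerm} ->
            %% keep the agreed prefix, replace any divergent suffix
            Prefix = lists:filter(fun({I, _}) -> I =< PrevIndex end, Log),
            {ok, Prefix ++ Entries};
        _ ->
            %% mismatch: reject; the leader will retry with an earlier index
            reject
    end.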

I will take a look at the specifics in a bit.

@michaelklishin added this to the 3.6.2 milestone Apr 30, 2016
@michaelklishin (Member)

This issue is a bit too "inside baseball" => not including it in the release notes.
