Mirrored queue crash with out of sync ACKs #749

Closed
dcorbacho opened this issue Apr 14, 2016 · 4 comments

@dcorbacho (Contributor)

Using the patch for #714, in a 3-node cluster configured to test #545, the GM might eventually crash when processing an activity message:

** {{case_clause,
        {{{value,
              {33059,
               {publish,<9042.1151.1>,flow,
                   {message_properties,undefined,false,2048},
                   {basic_message,
                       {resource,<<"/">>,exchange,<<"testExchange">>},
                       [<<>>],
                       {content,60,
....
    [{gm,find_common,3,[{file,"src/gm.erl"},{line,1369}]},
     {gm,'-handle_msg/2-fun-2-',7,[{file,"src/gm.erl"},{line,881}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1386}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1262}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,871}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,661}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}

which I believe leads to the following on the other nodes:

=ERROR REPORT==== 12-Apr-2016::16:18:05 ===
** Generic server <0.19461.2> terminating
** Last message in was {'$gen_cast',join}
** When Server state == {state,{9,<0.19461.2>},
                               {{9,<0.19461.2>},undefined},
                               {{9,<0.19461.2>},undefined},
                               {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                               rabbit_mirror_queue_slave,undefined,-1,
                               undefined,
                               [<0.19460.2>],
                               {[],[]},
                               [],0,undefined,
                               #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                               false}
** Reason for termination == 
** {{bad_return_value,
        {bad_flying_ets_update,1,2,
            {<<212,124,127,183,143,75,237,208,132,9,251,34,112,92,244,166>>,
             <<202,95,0,178,134,57,152,103,126,177,128,73,15,248,54,106>>}}},
    {gen_server2,call,
        [<5629.28766.2>,{add_on_right,{9,<0.19461.2>}},infinity]}}

and

=ERROR REPORT==== 12-Apr-2016::16:16:38 ===
** Generic server <0.27327.2> terminating
** Last message in was go
** When Server state == {not_started,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                                true,false,none,[],<0.26808.2>,[],[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"all">>},
                                 {pattern,<<>>},
                                 {'apply-to',<<"all">>},
                                 {definition,
                                     [{<<"ha-mode">>,<<"all">>},
                                      {<<"ha-sync-mode">>,<<"automatic">>}]},
                                 {priority,0}],
                                [{<32227.15062.2>,<32227.14697.2>},
                                 {<0.26809.2>,<0.26808.2>}],
                                [],live}}
** Reason for termination == 
** {duplicate_live_master,'rabbit@t-srv-rabbit04'}

This is not suspected to have been introduced by #714, but rather to be a consequence of the deadlock being resolved: with pause_minority, the system keeps running through partial partitions and eventually reaches an inconsistent state.

@dcorbacho (Contributor, Author)

The problem arises when the following situation occurs:

  • A -> B -> C is a GM group ring
  • A partial partition happens between A and B
  • A erases B as a member and updates Mnesia
  • A few seconds later, B, unaware of this change, records A as dead and adds itself to the group again.
  • C sees these changes of the member to its left in the ring, but continues as usual, assuming they are normal; in fact, a dead member has come back to life. An inconsistency while processing the queues then triggers {gm,find_common,3,[{file,"src/gm.erl"},{line,1369}]}: when the process compares both queues, unexpected elements show up in the middle of the common block (see the sketch after this list).
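
To make the failure mode concrete, here is a minimal sketch of the kind of comparison that blows up. It is not the actual gm:find_common/3 (which takes three arguments and operates on the members' pub/ack queues); it only illustrates how an entry that one side never saw falls through every expected pattern and surfaces as a case_clause:

%% Hedged sketch, not gm.erl: walk two queues of {MsgId, Payload} entries,
%% expecting the heads to match; an entry injected by the "resurrected"
%% member matches no expected pattern and surfaces as a case_clause.
find_common(QA, QB) ->
    case {queue:out(QA), queue:out(QB)} of
        {{empty, _}, {empty, _}} ->
            in_sync;
        {{{value, {Id, _}}, RestA}, {{value, {Id, _}}, RestB}} ->
            %% heads agree, keep walking the common block
            find_common(RestA, RestB);
        Other ->
            %% an out-of-sync element in the middle of the common block;
            %% this is where the real code raises {case_clause, ...}
            erlang:error({case_clause, Other})
    end.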

I'm still investigating whether this is a race condition in the synchronization of the database (B must read A's updates through node C) or an implementation problem. The nodes hosting A and B are indeed stopping, but the GM keeps processing events until the stop is finally reflected in the logs.
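
For reference, the membership read in question looks roughly like the sketch below. The gm_group table and its {gm_group, Name, Version, Members} record shape are my reading of gm.erl; treat them as assumptions if they differ in your version:

%% Hedged sketch: read the group membership back from Mnesia, which is how
%% B would have to observe A's update (replicated via the surviving node C).
read_group_members(GroupName) ->
    {atomic, Members} =
        mnesia:transaction(
          fun () ->
                  case mnesia:read({gm_group, GroupName}) of
                      [{gm_group, _Name, _Version, Members0}] -> Members0;
                      []                                      -> []
                  end
          end),
    Members.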

Maybe we could do something with the unexpected elements in the queue? I don't know yet.
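
One possible shape for "doing something with the unexpected elements" is to skip whatever only one side has instead of crashing. This is only a sketch of the idea under that assumption, not the patch that was eventually proposed:

%% Hedged sketch: on a head mismatch, drop this side's entry and keep going
%% instead of raising case_clause; the cost is that such entries may never
%% reach every mirror.
find_common_tolerant(QA, QB) ->
    case {queue:out(QA), queue:out(QB)} of
        {{empty, _}, _} ->
            done;
        {_, {empty, _}} ->
            done;
        {{{value, {Id, _}}, RestA}, {{value, {Id, _}}, RestB}} ->
            find_common_tolerant(RestA, RestB);
        {{{value, _Unexpected}, RestA}, _} ->
            find_common_tolerant(RestA, QB)
    end.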

@dcorbacho (Contributor, Author)

I created a PR that fixes the crash, but not the root cause. I couldn't find a way to repair the broken ring: the nodes in the partial partition can't see the update through the 'live' node. It may be a timing issue, or Mnesia might already be inconsistent and no longer updating, as the inconsistent-database event is triggered just afterwards.
With this change it might happen that a small number of messages do not end up in all queues, but the system remains functional. @michaelklishin thoughts?

@michaelklishin (Member)

When some ring members are unreachable, ignoring log operations for them is probably about the best we can do.

Eventually gm will be replaced with Raft-based mirroring, which has a well-understood solution for logs getting out of sync.
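
For contrast, here is a toy illustration (plain Erlang, not RabbitMQ or any Raft library) of the log-matching rule that makes reconciliation well-defined in Raft: a follower only appends when the leader's previous {Index, Term} entry matches its own log, otherwise it rejects and the leader retries from an earlier index:

%% Toy sketch of Raft's log-matching check. Log is a list of {Index, Term}
%% entries; Entries are new entries to append after {PrevIndex, PrevTerm}.
accept_append(_Log, 0, _PrevTerm, Entries) ->
    {ok, Entries};
accept_append(Log, PrevIndex, PrevTerm, Entries) ->
    case lists:keyfind(PrevIndex, 1, Log) of
        {PrevIndex, PrevTerm} ->
            %% keep the agreed prefix, replace any divergent suffix
            Prefix = lists:filter(fun({I, _}) -> I =< PrevIndex end, Log),
            {ok, Prefix ++ Entries};
        _ ->
            %% mismatch: reject; the leader will retry with an earlier index
            reject
    end.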

I will take a look at the specifics in a bit.

@michaelklishin added this to the 3.6.2 milestone Apr 30, 2016
@michaelklishin (Member)

This issue is a bit too "inside baseball" => not including it in the release notes.
