Cluster blocked while performing continuous restarts #675

Closed
dcorbacho opened this issue Mar 4, 2016 · 3 comments

@dcorbacho (Contributor)

Environment: 3 Ubuntu machines with RabbitMQ 3.6.1

t-srv-rabbit04
t-srv-rabbit05 (up and down script)
t-srv-rabbit06

We found two new problems while running regression tests for #581 in the 3.6.1 release.

Problem number 1:

Node 4 is blocked here:

t-srv-rabbit04 # rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
2016-03-01 09:25:19 [{pid,<5588.27566.47>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{gen,do_call,4,[{file,"gen.erl"},{line,168}]},
                          {gen_server2,call,3,
                              [{file,"src/gen_server2.erl"},{line,336}]},
                          {rabbit_misc,execute_mnesia_transaction,1,
                              [{file,"src/rabbit_misc.erl"},{line,530}]},
                          {rabbit_misc,execute_mnesia_tx_with_tail,1,
                              [{file,"src/rabbit_misc.erl"},{line,572}]},
                          {rabbit_amqqueue_process,
                              '-terminate_delete/3-fun-1-',6,
                              [{file,"src/rabbit_amqqueue_process.erl"},
                               {line,258}]},
                          {rabbit_amqqueue_process,terminate_shutdown,2,
                              [{file,"src/rabbit_amqqueue_process.erl"},
                               {line,277}]},
                          {gen_server2,terminate,3,
                              [{file,"src/gen_server2.erl"},{line,1146}]},
                          {gen_server2,handle_msg,2,
                              [{file,"src/gen_server2.erl"},{line,1041}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,4},
                     {links,[<5588.27562.47>]},
                     {monitors,[{process,<5588.149.0>}]},
                     {monitored_by,
                         [<5588.144.0>,<5588.194.0>,<5588.182.0>,
                          <5588.27766.47>,<5588.8117.53>]},
                     {heap_size,196650}]

Full log here

The problem seems to be on node 4: this transaction is blocked: https://github.com/rabbitmq/rabbitmq-common/blob/stable/src/rabbit_amqqueue.erl#L767

and mnesia seems to be restarting a transaction: https://github.com/erlang/otp/blob/maint/lib/mnesia/src/mnesia_tm.erl#L878 https://github.com/erlang/otp/blob/maint/lib/mnesia/src/mnesia_tm.erl#L914

From the output of maybe_stuck on node 4 it seems that two operations might be going on over the same queue: info() and terminate(). These need more investigation; there might be a lock here.
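
One way to confirm whether a lock is involved (just a sketch, using only standard Mnesia introspection calls, nothing RabbitMQ-specific) is to dump Mnesia's lock state on node 4 and look for entries on the queue tables:

t-srv-rabbit04 # rabbitmqctl eval '{mnesia:system_info(held_locks), mnesia:system_info(lock_queue)}.'

held_locks lists the locks currently owned on that node and lock_queue the requests waiting behind them.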

 t-srv-rabbit05 # rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
    ....
        2016-03-01 07:41:25 [{pid,<5588.598.68>},
                             {registered_name,[]},
                             {current_stacktrace,
                                 [{gen_server2,rec_nodes,7,
                                      [{file,"src/gen_server2.erl"},{line,884}]},
                                  {delegate,invoke,2,
                                      [{file,"src/delegate.erl"},{line,119}]},
                                  {delegate,invoke,2,
                                      [{file,"src/delegate.erl"},{line,97}]}]},
                             {initial_call,{erlang,apply,2}},
                             {message_queue_len,0},
                             {links,[]},
                             {monitors,
                                 [{process,{delegate_13,'rabbit@t-srv-rabbit04'}}]},
                             {monitored_by,[]},
                             {heap_size,233}]

All the processes stuck on node 5 are waiting for delegate_13 on node 4 (see the monitors), and that process has a very large message queue:

root@t-srv-rabbit04:/tmp# rabbitmqctl eval 'erlang:process_info(whereis(delegate_13), message_queue_len).'
                                            {message_queue_len,180828}
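
For completeness, a quick sweep over all delegates on node 4 (a sketch; it assumes the default set of 16 delegate processes, delegate_0 .. delegate_15) shows whether the backlog is limited to delegate_13:

root@t-srv-rabbit04:/tmp# rabbitmqctl eval '
    [{D, case whereis(D) of
             undefined -> undefined;
             P -> element(2, erlang:process_info(P, message_queue_len))
         end}
     || D <- [list_to_atom("delegate_" ++ integer_to_list(N)) || N <- lists:seq(0, 15)]].'

If only delegate_13 stands out, that is consistent with everything on node 5 queueing up behind the same stuck call.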

Problem number 2 (related to 1):

TCP connections are not released:

Node 5:
tcp6   80585      0 10.100.0.85:5672        10.100.0.121:5406       CLOSE_WAIT  off (0.00/0/0)
tcp6   80585      0 10.100.0.85:5672        10.100.0.121:3543       CLOSE_WAIT  off (0.00/0/0)
tcp6   80585      0 10.100.0.85:5672        10.100.0.121:6342       CLOSE_WAIT  off (0.00/0/0)
tcp6   80585      0 10.100.0.85:5672        10.100.0.121:11764      CLOSE_WAIT  off (0.00/0/0)
tcp6   87208      0 10.100.0.85:5672        10.100.0.121:3717       CLOSE_WAIT  off (0.00/0/0)
……

netstat -nao |grep CLOSE_W*  | wc -l
240

root@t-srv-rabbit05:/tmp# rabbitmqctl list_connections | wc -l
524

There are no clients connected to the cluster and the server at 10.100.0.121 has been stopped, but the connections are still there.
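
A quick way to cross-check what RabbitMQ itself still believes about those sockets (a sketch; pid/peer_host/peer_port/state are standard list_connections columns) is to put the OS-level count next to the broker-level one:

root@t-srv-rabbit05:/tmp# netstat -nao | grep ':5672' | grep CLOSE_WAIT | wc -l
root@t-srv-rabbit05:/tmp# rabbitmqctl list_connections pid peer_host peer_port state | grep 10.100.0.121 | wc -l

If the reader processes behind those connections are alive but blocked, the second count stays high even though the peer has long been gone.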

list_queues is stuck on node 4, and this seems to be the stuck queue:

 {amqqueue,{resource,<<"/">>,queue,
                     <<"queue_6d42bbef-c119-4a2f-be28-bba3be301c40">>},
           false,true,none,[],<5588.27566.47>,[],[],[],
           [{vhost,<<"/">>},
            {name,<<"all">>},
            {pattern,<<>>},
            {'apply-to',<<"all">>},
            {definition,[{<<"ha-mode">>,<<"all">>},
                         {<<"ha-sync-mode">>,<<"automatic">>}]},
            {priority,0}],
           [],[],live},

This is the queue blocked on terminate in the maybe_stuck output above (4th process, <5588.27566.47>). All other processes are waiting for it, while the queue is using a worker (process ) to run a mnesia transaction (process 6) which is being restarted (mnesia_tm:restart). The transaction is being executed here: https://github.com/rabbitmq/rabbitmq-common/blob/rabbitmq_v3_6_1_rc2/src/rabbit_misc.erl#L534 and is this one: https://github.com/rabbitmq/rabbitmq-common/blob/rabbitmq_v3_6_1_rc2/src/rabbit_amqqueue.erl#L768
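
To keep an eye on that specific process, its state can be pulled directly on node 4 (a sketch; it assumes the remote pid <5588.27566.47> corresponds to <0.27566.47> when resolved locally, since the leading number is only the printing node's reference to rabbit@t-srv-rabbit04):

root@t-srv-rabbit04:/tmp# rabbitmqctl eval 'erlang:process_info(list_to_pid("<0.27566.47>"), [current_stacktrace, message_queue_len, status]).'

If current_stacktrace still shows rabbit_misc:execute_mnesia_transaction/1, the queue process has not moved, and the held_locks / lock_queue output above should say which transaction it is stuck behind.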

@michaelklishin (Member)

I suggest using at least 3.6.1 GA.

A general note: we've seen a number of (at least on the surface) similar Mnesia issues in the last couple of months. Several were reported to the OTP team and are supposed to be fixed in OTP master.

@dcorbacho (Contributor, Author)

One of the Mnesia issues was fixed in Erlang 18.3 (OTP-13284, http://erlang.org/download/otp_src_18.3.readme).

The second issue was reported at the same time this bug was filed, and is going through the last round of testing right now #676 (comment). I will see if #675 still happens with this patch or needs further investigation.

@michaelklishin (Member)

Fixed in #676 (Erlang/OTP 18.3.1). We discovered #714 as part of the investigation and will continue there.
