
Crash when trying to reload rabbitmq-management plugin after removing a node #298

Closed
evanccnyc opened this issue Sep 4, 2015 · 7 comments

@evanccnyc

Hi there,

I ran into a RabbitMQ crash in the following scenario:

RabbitMQ 3.5.3
Ubuntu 14.04
3 servers: 1c, 1d, 1e

There was a network blip and 1e briefly disconnected/reconnected, at which point the management GUI showed it as not connected to the cluster. We drained the queues and then removed the 1e node, at which point we lost all information in the management GUI on the remaining two servers: the interface showed lots of ????'s and error messages like "Statistics database could not be contacted.".

We then disabled the management plugin on one of the servers (1c) via the command line and tried to re-enable it with "rabbitmq-plugins enable rabbitmq_management".

At that point 1c crashed with the following debug output:

=CRASH REPORT==== 4-Sep-2015::14:23:27 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.12775.6568>
    registered_name: []
    exception exit: {function_clause,
                        [{rabbit_channel,handle_info,2,[]},
                         {gen_server2,handle_msg,2,[]},
                         {rabbit_mgmt_wm_overview,module_info,1,[]}]}
      in function  gen_server2:terminate/3
      in call from rabbit_mgmt_wm_overview:module_info/1
    ancestors: [<0.28642.6557>,<0.3525.6568>,<0.12250.6567>,<0.4055.6567>,
                  rabbit_tcp_client_sup,rabbit_sup,<0.4863.0>]
    messages: []
    links: [<0.28642.6557>]
    dictionary: [{{queue_stats,{resource,<<"/">>,queue,<<"djjob.quick">>}},
                   [{deliver,1},{redeliver,1}]},
                  {credit_blocked,[]},
                  {delegate,delegate_3},
                  {msg_size_for_gc,2077},
                  {process_name,
                      {rabbit_channel,
                          {<<"10.178.189.57:59650 -> 10.0.11.174:5672">>,1}}}]
    trap_exit: true
    status: running
    heap_size: 4185
    stack_size: 27
    reductions: 8423
  neighbours:

=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.29726.6380>,rabbit_channel_sup}
     Context:    child_terminated
     Reason:     {function_clause,
                     [{rabbit_channel,handle_info,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {rabbit_mgmt_wm_overview,module_info,1,[]}]}
     Offender:   [{pid,<0.22944.6380>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.10639.6372>,<0.8494.6369>,<0.10639.6372>,
                           <<"10.140.163.98:52524 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"uberxxx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20398.6380>,<0.3201.6381>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.29726.6380>,rabbit_channel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.22944.6380>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.10639.6372>,<0.8494.6369>,<0.10639.6372>,
                           <<"10.140.163.98:52524 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"uberxxx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20398.6380>,<0.3201.6381>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 4-Sep-2015::14:23:56 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.7322.5965>
    registered_name: []
    exception exit: {function_clause,
                        [{rabbit_channel,handle_info,2,[]},
                         {gen_server2,handle_msg,2,[]},
                         {rabbit_mgmt_wm_overview,module_info,1,[]}]}
      in function  gen_server2:terminate/3
      in call from rabbit_mgmt_wm_overview:module_info/1
    ancestors: [<0.4950.5965>,<0.24355.5952>,<0.7810.5965>,<0.8862.5947>,
                  rabbit_tcp_client_sup,rabbit_sup,<0.4863.0>]
    messages: []
    links: [<0.4950.5965>]
    dictionary: [{delegate,delegate_9},
                  {guid_secure,
                      {{11,'rabbit@rabbitmq-us-east-1c',#Ref<0.0.3333.114301>},
                       0}},
                  {msg_size_for_gc,695},
                  {{credit_from,<9876.5461.0>},196},
                  {process_name,
                      {rabbit_channel,
                          {<<"10.178.189.57:42651 -> 10.0.11.174:5672">>,1}}},
                  {{credit_from,<0.5465.0>},196},
                  {{credit_to,<0.19404.5943>},48},
                  {{xtype_to_module,direct},rabbit_exchange_type_direct},
                  {credit_blocked,[]},
                  {guid,{{829018877,3024599086,2624743100,3587034967},1}},
                  {{exchange_stats,{resource,<<"/">>,exchange,<<"cronx">>}},
                   [{publish,2}]}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 5867
  neighbours:

=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.4950.5965>,rabbit_channel_sup}
     Context:    child_terminated
     Reason:     {function_clause,
                     [{rabbit_channel,handle_info,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {rabbit_mgmt_wm_overview,module_info,1,[]}]}
     Offender:   [{pid,<0.7322.5965>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.19404.5943>,<0.27371.5962>,<0.19404.5943>,
                           <<"10.178.189.57:42651 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"cronx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,[],<0.5646.5965>,<0.7247.5965>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.4950.5965>,rabbit_channel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.7322.5965>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.19404.5943>,<0.27371.5962>,<0.19404.5943>,
                           <<"10.178.189.57:42651 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"cronx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,[],<0.5646.5965>,<0.7247.5965>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]
@michaelklishin
Member

I edited out 90% of the log because it had the same exception for every channel.

If the stats DB cannot be reached, channels won't be able to emit stats to it. There is only one node that collects statistics, and it keeps everything entirely in RAM; all of the data it has is transient by design. Enabling and disabling the plugin is unnecessary: there are plenty of threads on the mailing list that demonstrate how to make the stats DB terminate with rabbitmqctl. It will then be restarted on another node.
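
For illustration, the invocation those threads describe boils down to something like this; treat it as a sketch rather than a supported command (it assumes a 3.5.x-era management plugin, where the stats DB is a locally registered process named rabbit_mgmt_db, and uses an arbitrary exit reason):

rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'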

It looks like the stats DB became unreachable right after it queried a bunch of channels. I'm not sure how much can be improved here.

@michaelklishin michaelklishin self-assigned this Sep 4, 2015
@evanccnyc
Author

A couple of things:

  1. Is there documentation on the website for what to do if the stats DB becomes unreachable? I didn't see anything in my brief look. Perhaps that should be documented there?

  2. Even though you shouldn't have to disable/enable the plugin, shouldn't the fact that it crashed while doing this be a bit of a problem?

@michaelklishin
Member

  1. https://groups.google.com/d/msg/rabbitmq-users/P5g8CRI1Owg/T01ob_Bn9oAJ
  2. I don't see any evidence of a "plugin crash". The errors you see in the log are from channel processes that couldn't emit stats because they couldn't contact the stats DB. Eventually their supervisor was restarted.
    Even though there is "crash" in the name, these are non-fatal terminations and only affect a single unfortunate connection.

@evanccnyc
Author

Thanks, #2 makes sense. I still think #1 should be documented on the website somewhere; I was searching for a while and couldn't find it.

@michaelklishin
Member

@evanccnyc there should be a rabbitmqctl command for that. To make that possible, we'd first have to look into making rabbitmqctl extensible from plugins. The plan is to do that after 3.6.0.

I'll add a note to the site, which is open source, by the way :)

@evanccnyc
Author

Thanks. Ah, next time I'll just submit a pull request for that.

@dumbbell dumbbell added this to the n/a milestone Sep 11, 2015
@binarin
Contributor

binarin commented Aug 5, 2016

That stacktrace is very interesting in itself. I've seen a similar one, and I can't grasp what's happening. There are two things that I can't explain:

  • Somebody is sending unasked-for messages to the channel process.
  • Stacktraces of dead processes contain strange calls to module_info in different, completely unrelated modules.

Everything starts with a mnesia event about an inconsistent database.
Then the channel process receives a strange message:

{{#Ref<0.0.23.89627>,'rabbit@messaging-node-202'},
                        [{ok,<8705.5665.45>,ok}]}

which it doesn't know how to handle (there isn't even a remotely similar clause of handle_info/2).

And it results in an even stranger stacktrace:

** {function_clause,[{rabbit_channel,handle_info,2,[]},
                     {gen_server2,handle_msg,2,[]},
                     {ssl_config,module_info,1,[]}]}

How did ssl_config:module_info/1 end up there? A normal stacktrace contains proc_lib:wake_up/3 in that position.
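
The first oddity, at least, is easy to reproduce in isolation: a gen_server whose handle_info/2 has no catch-all clause exits with exactly this kind of function_clause when it receives a message it was never written to expect. A minimal, hypothetical sketch (made-up module name, plain gen_server instead of RabbitMQ's gen_server2):

%% Demonstrates how an unexpected message kills a gen_server whose
%% handle_info/2 only matches the messages it was written to expect.
-module(no_catchall_demo).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, []}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Only 'timeout' is matched; a stray delegate-style reply such as
%% {{Ref, Node}, [{ok, Pid, ok}]} matches no clause, so the process
%% exits with {function_clause, [{no_catchall_demo,handle_info,2,_} | _]}.
handle_info(timeout, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

Start it with no_catchall_demo:start_link() and send the resulting pid any unexpected term, and it dies with a function_clause in handle_info/2; a supervisor above it then reports child_terminated, much like the reports above. That still doesn't explain the module_info frames, though.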

And on another occasion I've seen a queue slave process terminated with

** {function_clause,[{orddict,fetch,2,[]},
                     {dtls_connection,module_info,1,[]},
                     {gen_server2,handle_msg,2,[]},
                     {dtls_connection,module_info,1,[]}]}

What do a queue slave process and DTLS (which is not even used anywhere) have in common? And what is responsible for all those module_info calls?
