
Crash when trying to reload rabbitmq-management plugin after removing a node #298

Closed
evanccnyc opened this issue Sep 4, 2015 · 7 comments

@evanccnyc

Hi there,

I ran into a RabbitMQ crash in the following scenario:

RabbitMQ 3.5.3
Ubuntu 14.04
3 servers: 1c, 1d, 1e

There was a network blip and 1e briefly disconnected/reconnected, at which point the management GUI showed it as not connected to the cluster. We drained the queues and then removed the 1e node, at which point we lost all information in the management GUI on the remaining two servers: the interface showed lots of ????'s and error messages like "Statistics database could not be contacted.".

We then disabled the management plugin on one of the servers (1c) via the command line and tried to re-enable it with "rabbitmq-plugins enable rabbitmq_management".

At that point 1c crashed with the following debug output:

=CRASH REPORT==== 4-Sep-2015::14:23:27 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.12775.6568>
    registered_name: []
    exception exit: {function_clause,
                        [{rabbit_channel,handle_info,2,[]},
                         {gen_server2,handle_msg,2,[]},
                         {rabbit_mgmt_wm_overview,module_info,1,[]}]}
      in function  gen_server2:terminate/3
      in call from rabbit_mgmt_wm_overview:module_info/1
    ancestors: [<0.28642.6557>,<0.3525.6568>,<0.12250.6567>,<0.4055.6567>,
                  rabbit_tcp_client_sup,rabbit_sup,<0.4863.0>]
    messages: []
    links: [<0.28642.6557>]
    dictionary: [{{queue_stats,{resource,<<"/">>,queue,<<"djjob.quick">>}},
                   [{deliver,1},{redeliver,1}]},
                  {credit_blocked,[]},
                  {delegate,delegate_3},
                  {msg_size_for_gc,2077},
                  {process_name,
                      {rabbit_channel,
                          {<<"10.178.189.57:59650 -> 10.0.11.174:5672">>,1}}}]
    trap_exit: true
    status: running
    heap_size: 4185
    stack_size: 27
    reductions: 8423
  neighbours:

=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.29726.6380>,rabbit_channel_sup}
     Context:    child_terminated
     Reason:     {function_clause,
                     [{rabbit_channel,handle_info,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {rabbit_mgmt_wm_overview,module_info,1,[]}]}
     Offender:   [{pid,<0.22944.6380>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.10639.6372>,<0.8494.6369>,<0.10639.6372>,
                           <<"10.140.163.98:52524 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"uberxxx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20398.6380>,<0.3201.6381>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.29726.6380>,rabbit_channel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.22944.6380>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.10639.6372>,<0.8494.6369>,<0.10639.6372>,
                           <<"10.140.163.98:52524 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"uberxxx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20398.6380>,<0.3201.6381>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 4-Sep-2015::14:23:56 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.7322.5965>
    registered_name: []
    exception exit: {function_clause,
                        [{rabbit_channel,handle_info,2,[]},
                         {gen_server2,handle_msg,2,[]},
                         {rabbit_mgmt_wm_overview,module_info,1,[]}]}
      in function  gen_server2:terminate/3
      in call from rabbit_mgmt_wm_overview:module_info/1
    ancestors: [<0.4950.5965>,<0.24355.5952>,<0.7810.5965>,<0.8862.5947>,
                  rabbit_tcp_client_sup,rabbit_sup,<0.4863.0>]
    messages: []
    links: [<0.4950.5965>]
    dictionary: [{delegate,delegate_9},
                  {guid_secure,
                      {{11,'rabbit@rabbitmq-us-east-1c',#Ref<0.0.3333.114301>},
                       0}},
                  {msg_size_for_gc,695},
                  {{credit_from,<9876.5461.0>},196},
                  {process_name,
                      {rabbit_channel,
                          {<<"10.178.189.57:42651 -> 10.0.11.174:5672">>,1}}},
                  {{credit_from,<0.5465.0>},196},
                  {{credit_to,<0.19404.5943>},48},
                  {{xtype_to_module,direct},rabbit_exchange_type_direct},
                  {credit_blocked,[]},
                  {guid,{{829018877,3024599086,2624743100,3587034967},1}},
                  {{exchange_stats,{resource,<<"/">>,exchange,<<"cronx">>}},
                   [{publish,2}]}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 5867
  neighbours:

=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.4950.5965>,rabbit_channel_sup}
     Context:    child_terminated
     Reason:     {function_clause,
                     [{rabbit_channel,handle_info,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {rabbit_mgmt_wm_overview,module_info,1,[]}]}
     Offender:   [{pid,<0.7322.5965>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.19404.5943>,<0.27371.5962>,<0.19404.5943>,
                           <<"10.178.189.57:42651 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"cronx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,[],<0.5646.5965>,<0.7247.5965>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]


=SUPERVISOR REPORT==== 4-Sep-2015::14:23:56 ===
     Supervisor: {<0.4950.5965>,rabbit_channel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.7322.5965>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.19404.5943>,<0.27371.5962>,<0.19404.5943>,
                           <<"10.178.189.57:42651 -> 10.0.11.174:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"cronx">>,[],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"/">>,[],<0.5646.5965>,<0.7247.5965>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]
@michaelklishin
Member

I edited out 90% of the log because it had the same exception for every channel.

If the stats DB cannot be reached, channels won't be able to emit stats to it. There is only one node that collects statistics, and it keeps everything entirely in RAM; all of the data it has is transient by design. Enabling and disabling the plugin is unnecessary: there are plenty of threads on the mailing list that demonstrate how to make the stats DB terminate with rabbitmqctl. It will then be restarted on another node.
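
For illustration, the invocation those threads describe boils down to something like this; treat it as a sketch rather than a supported command (it assumes a 3.5.x-era management plugin, where the stats DB is a locally registered process named rabbit_mgmt_db, and uses an arbitrary exit reason):

rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'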

It looks like the stats DB became unreachable right after it queried a bunch of channels. I'm not sure how much can be improved here.

@michaelklishin michaelklishin self-assigned this Sep 4, 2015
@evanccnyc
Author

A couple of things:

  1. Is there documentation on the website for what to do if the stats DB becomes unreachable? I didn't see anything in my brief look. Perhaps that should be documented there?

  2. Even though you shouldn't have to disable/enable the plugin, shouldn't the fact that it crashed while doing this be a bit of a problem?

@michaelklishin
Member

  1. https://groups.google.com/d/msg/rabbitmq-users/P5g8CRI1Owg/T01ob_Bn9oAJ
  2. I don't see any evidence of a "plugin crash". The errors you see in the log are from channel processes that couldn't emit stats because they couldn't contact the stats DB. Eventually their supervisor was restarted.
    Even though there is "crash" in the name, these are non-fatal terminations and only affect a single unfortunate connection.

@evanccnyc
Author

Thanks, #2 makes sense. I still think #1 should be documented on the website somewhere; I was searching for a while and couldn't find it.

@michaelklishin
Member

@evanccnyc there should be a rabbitmqctl command for that. To make that possible, we'd first have to look into making rabbitmqctl extensible from plugins. The plan is to do that after 3.6.0.

I'll add a note to the site, which is open source, by the way :)

@evanccnyc
Author

Thanks. Ah, next time I'll just submit a pull request for that.

@dumbbell dumbbell added this to the n/a milestone Sep 11, 2015
@binarin
Contributor

binarin commented Aug 5, 2016

That stacktrace is very interesting in itself. I've seen a similar one, and I can't grasp what's happening. There are two things that I can't explain:

  • Somebody is sending unasked-for messages to the channel process.
  • Stacktraces of dead processes contain strange calls to module_info in different, completely unrelated modules.

Everything starts with a mnesia event about an inconsistent database.
Then the channel process receives a strange message:

{{#Ref<0.0.23.89627>,'rabbit@messaging-node-202'},
                        [{ok,<8705.5665.45>,ok}]}

which it doesn't know how to handle (there isn't even a remotely similar clause of handle_info/2).

And it results in an even stranger stacktrace:

** {function_clause,[{rabbit_channel,handle_info,2,[]},
                     {gen_server2,handle_msg,2,[]},
                     {ssl_config,module_info,1,[]}]}

How did ssl_config:module_info/1 end up there? A normal stacktrace contains proc_lib:wake_up/3 in that position.
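
The first oddity, at least, is easy to reproduce in isolation: a gen_server whose handle_info/2 has no catch-all clause exits with exactly this kind of function_clause when it receives a message it was never written to expect. A minimal, hypothetical sketch (made-up module name, plain gen_server instead of RabbitMQ's gen_server2):

%% Demonstrates how an unexpected message kills a gen_server whose
%% handle_info/2 only matches the messages it was written to expect.
-module(no_catchall_demo).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, []}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Only 'timeout' is matched; a stray delegate-style reply such as
%% {{Ref, Node}, [{ok, Pid, ok}]} matches no clause, so the process
%% exits with {function_clause, [{no_catchall_demo,handle_info,2,_} | _]}.
handle_info(timeout, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

Start it with no_catchall_demo:start_link() and send the resulting pid any unexpected term, and it dies with a function_clause in handle_info/2; a supervisor above it then reports child_terminated, much like the reports above. That still doesn't explain the module_info frames, though.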

And on another occasion I've seen a queue slave process terminated with

** {function_clause,[{orddict,fetch,2,[]},
                     {dtls_connection,module_info,1,[]},
                     {gen_server2,handle_msg,2,[]},
                     {dtls_connection,module_info,1,[]}]}

What do a queue slave process and DTLS (which is not even used anywhere) have in common? And what is responsible for all those module_info calls?
