Skip to content
This repository has been archived by the owner on May 13, 2022. It is now read-only.

Management UI starts responding with 500s after 1-2 hours of operation #10

Closed
badrshs opened this issue Nov 15, 2019 · 13 comments
Closed
Labels
question A question of RCA request

Comments

@badrshs
Copy link

badrshs commented Nov 15, 2019

Hi,

i'm having an issue with rabbitMQ, everything is fine on my machine but it's not on the others like ( VM)
after installation it works for 1 hour then it crash , and i need to reinstall again to get it back .

when i open the rabbitmq-management-agent I got Error 500

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.1\sbin>rabbitmqctl status
Status of node rabbit@IE-DEV-CNTRCT02 ...
Runtime
OS PID: 1956
OS: Windows
Uptime (seconds): 55473
RabbitMQ version: 3.8.1
Node name: rabbit@IE-DEV-CNTRCT02
Erlang configuration: Erlang/OTP 22 [erts-10.5] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64]
Erlang processes: 402 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 60

Plugins

Enabled plugin file: C:/Users/XXXXX/AppData/Roaming/RabbitMQ/enabled_plugins
Enabled plugins:

 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * cowboy
 * cowlib
 * amqp_client

Data directory

Node data directory: c:/Users/XXXXXXX/AppData/Roaming/RabbitMQ/db/rabbit@IE-DEV-CNTRCT02-mnesia

Config files

 * c:/Users/XXXXX/AppData/Roaming/RabbitMQ/advanced.config

Log file(s)

 * C:/Users/XXXXXXX/AppData/Roaming/RabbitMQ/log/rabbit@IE-DEV-CNTRCT02.log
 * C:/Users/XXXXXXX/AppData/Roaming/RabbitMQ/log/rabbit@IE-DEV-CNTRCT02_upgrade.log

Alarms

(none)

Memory

Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 3.4358 gb
other_proc: 0.0328 gb (37.76 %)
code: 0.0297 gb (34.24 %)
other_system: 0.0112 gb (12.93 %)
allocated_unused: 0.0074 gb (8.49 %)
other_ets: 0.0033 gb (3.75 %)
atom: 0.0015 gb (1.75 %)
plugins: 0.0003 gb (0.35 %)
metrics: 0.0002 gb (0.24 %)
binary: 0.0002 gb (0.23 %)
mnesia: 0.0001 gb (0.09 %)
quorum_ets: 0.0 gb (0.05 %)
mgmt_db: 0.0 gb (0.04 %)
msg_index: 0.0 gb (0.04 %)
queue_procs: 0.0 gb (0.03 %)
connection_other: 0.0 gb (0.0 %)
connection_channels: 0.0 gb (0.0 %)
connection_readers: 0.0 gb (0.0 %)
connection_writers: 0.0 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)
quorum_queue_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)

File Descriptors

Total: 2, limit: 65439
Sockets: 0, limit: 58893

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 20.163 gb

Totals

Connection count: 0
Queue count: 1
Virtual host count: 1

Listeners

Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: 0.0.0.0, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: 0.0.0.0, port: 15672, protocol: http, purpose: HTTP API

then stop the server by

> C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.1\sbin>rabbitmqctl stop_app
> .Stopping rabbit application on node rabbit@IE-DEV-CNTRCT02 ...

start the server again

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.1\sbin>rabbitmqctl start_app
Starting node rabbit@IE-DEV-CNTRCT02 ...
Error:
{:could_not_start, :rabbitmq_management_agent, {:rabbitmq_management_agent, {{:shutdown, {:failed_to_start_child, :rabbit_mgmt_agent_sup, {:shutdown, {:failed_to_start_child, :rabbit_mgmt_external_stats, {:badarg, [{:erlang, :port_command, [#Port<10510.10551>, []], [file: 'erlang.erl', line: 3143]}, {:os, :cmd, 2, [file: 'os.erl', line: 278]}, {:rabbit_mgmt_external_stats, :get_used_fd, 1, [file: 'src/rabbit_mgmt_external_stats.erl', line: 137]}, {:rabbit_mgmt_external_stats, :get_used_fd, 0, [file: 'src/rabbit_mgmt_external_stats.erl', line: 65]}, {:rabbit_mgmt_external_stats, :"-infos/2-lc$^0/1-0-", 2, [file: 'src/rabbit_mgmt_external_stats.erl', line: 181]}, {:rabbit_mgmt_external_stats, :emit_update, 1, [file: 'src/rabbit_mgmt_external_stats.erl', line: 385]}, {:rabbit_mgmt_external_stats, :init, 1, [file: 'src/rabbit_mgmt_external_stats.erl', line: 363]}, {:gen_server, :init_it, 2, [file: 'gen_server.erl', line: 374]}]}}}}}, {:rabbit_mgmt_agent_app, :start, [:normal, []]}}}}

anyone got this issue before ?

i'm using RabbitMQ on VM

debug file

2019-11-15 09:43:57.649 [info] <0.28747.1> Stopping message store for directory 'c:/Users/xxxxxxxxx/AppData/Roaming/RabbitMQ/db/rabbit@IE-DEV-CNTRCT02-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2019-11-15 09:43:57.707 [info] <0.28747.1> Message store for directory 'c:/Users/xxxxxxxxx/AppData/Roaming/RabbitMQ/db/rabbit@IE-DEV-CNTRCT02-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2019-11-15 09:43:57.707 [info] <0.28744.1> Stopping message store for directory 'c:/Users/xxxxxxxxx/AppData/Roaming/RabbitMQ/db/rabbit@IE-DEV-CNTRCT02-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2019-11-15 09:43:57.751 [info] <0.28744.1> Message store for directory 'c:/Users/xxxxxxxxx/AppData/Roaming/RabbitMQ/db/rabbit@IE-DEV-CNTRCT02-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2019-11-15 09:43:57.773 [info] <0.43.0> Application rabbit exited with reason: stopped
2019-11-15 09:43:57.787 [info] <0.43.0> Application ra exited with reason: stopped
2019-11-15 09:43:57.797 [info] <0.43.0> Application cowboy exited with reason: stopped
2019-11-15 09:43:57.797 [info] <0.43.0> Application cowlib exited with reason: stopped
2019-11-15 09:43:57.806 [info] <0.43.0> Application amqp_client exited with reason: stopped
2019-11-15 09:43:57.806 [info] <0.43.0> Application rabbit_common exited with reason: stopped
2019-11-15 09:43:57.817 [info] <0.43.0> Application os_mon exited with reason: stopped
2019-11-15 09:43:57.828 [info] <0.43.0> Application sysmon_handler exited with reason: stopped
2019-11-15 09:43:57.847 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-11-15 09:43:57.848 [error] <0.28496.1> 
Error description:
    rpc:'-handle_call_call/6-fun-0-'/5 line 197
    rabbit:start_it/1 line 465
    rabbit:broker_start/1 line 341
    rabbit:start_loaded_apps/2 line 591
    app_utils:manage_applications/6 line 126
    lists:foldl/3 line 1263
    rabbit:'-handle_app_error/1-fun-0-'/3 line 714
throw:{could_not_start,rabbitmq_management_agent,
       {rabbitmq_management_agent,
        {{shutdown,
          {failed_to_start_child,rabbit_mgmt_agent_sup,
           {shutdown,
            {failed_to_start_child,rabbit_mgmt_external_stats,
             {badarg,
              [{erlang,port_command,
                [#Port<0.10551>,[]],
                [{file,"erlang.erl"},{line,3143}]},
               {os,cmd,2,[{file,"os.erl"},{line,278}]},
               {rabbit_mgmt_external_stats,get_used_fd,1,
                [{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},
               {rabbit_mgmt_external_stats,get_used_fd,0,
                [{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},
               {rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,
                [{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},
               {rabbit_mgmt_external_stats,emit_update,1,
                [{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},
               {rabbit_mgmt_external_stats,in...

@michaelklishin michaelklishin transferred this issue from rabbitmq/rabbitmq-management-agent Nov 15, 2019
@michaelklishin
Copy link
Member

michaelklishin commented Nov 15, 2019

There are no details or evidence as to why the node might have been stopped. Consider posting full server logs and inspecting Windows event log. We only have a snippet explained below that says one of the management plugin components failed to start.

rabbitmqctl stop_app does not terminate the VM, it "pauses" RabbitMQ for certain operations that require it to be in a stopped state with the assumption that the operator will perform more operations shortly before starting it back. Perhaps you want rabbitmqctl shutdown? Why do you have to stop the node if you say that it "crashed"? What specifically does "crashed" mean here, the exception with the word "crash" that you see in the logs? That is simply an Erlang term for an unhandled exception and does not mean that the node was stopped or wasn't available.

When you try to start the node again, something prevents the runtime from opening a subprocess which tries to compute how many file descriptors its OS proceses uses.

@michaelklishin michaelklishin added the question A question of RCA request label Nov 15, 2019
@michaelklishin michaelklishin changed the title RabbitMQ stopped working after 1-2 hours from installation Management UI starts responding with 500s after 1-2 hours of operation Nov 15, 2019
@michaelklishin
Copy link
Member

michaelklishin commented Nov 15, 2019

This reminds me of a couple of known Erlang bugs specific to Windows. One is ERL-533 (which we have no shared evidence of). It would terminate the VM in a way that would not manifest in the RabbitMQ logs.

Another is an Erlang bug where it consumes close to 100% of a single core when running in a single core VM on Windows, and this line

Erlang configuration: Erlang/OTP 22 [erts-10.5] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64]

confirms that this node has only one core available to it.

Here is one rabbitmq-users thread.
The only workaround we are aware of is moving to a two or more core machine. This can result in some of the tools killing the OS process.

Those are just guesses. Maybe there are traces of something else in the logs.

@michaelklishin
Copy link
Member

Other observations from the information we have: file descriptors limit is not an issue (Total: 2, limit: 65439), there is next to no data on this node (34% of memory is used by code modules: code: 0.0297 gb (34.24 %)).

So please share logs and move to a two or more core VM.

@badrshs
Copy link
Author

badrshs commented Nov 15, 2019

Hi , Thanks

I typed the following command rabbitmqctl shutdown

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.1\sbin>rabbitmqctl shutdown
Shutting down RabbitMQ node rabbit@IE-DEV-CNTRCT02 running at PID 1956
Waiting for PID 1956 to terminate
RabbitMQ node rabbit@IE-DEV-CNTRCT02 running at PID 1956 successfully shut down

then I tried to start the server but it failed without an error

RASH REPORT :

2019-11-15 10:13:21 =CRASH REPORT====
  crasher:
    initial call: rabbit_mgmt_external_stats:init/1
    pid: <0.473.0>
    registered_name: []
    exception error: bad argument: [{erlang,port_command,[#Port<0.23>,[]],[{file,"erlang.erl"},{line,3143}]},{os,cmd,2,[{file,"os.erl"},{line,278}]},{rabbit_mgmt_external_stats,get_used_fd,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},{rabbit_mgmt_external_stats,get_used_fd,0,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},{rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},{rabbit_mgmt_external_stats,emit_update,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},{rabbit_mgmt_external_stats,init,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,363}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]}]
    ancestors: [rabbit_mgmt_agent_sup,rabbit_mgmt_agent_sup_sup,<0.469.0>]
    message_queue_len: 1
    messages: [{'DOWN',#Ref<0.2801192893.508821505.130117>,port,#Port<0.23>,noproc}]
    links: [<0.471.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 2986
  neighbours:
2019-11-15 10:13:21 =SUPERVISOR REPORT====
     Supervisor: {local,rabbit_mgmt_agent_sup}
     Context:    start_error
     Reason:     {badarg,[{erlang,port_command,[#Port<0.23>,[]],[{file,"erlang.erl"},{line,3143}]},{os,cmd,2,[{file,"os.erl"},{line,278}]},{rabbit_mgmt_external_stats,get_used_fd,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},{rabbit_mgmt_external_stats,get_used_fd,0,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},{rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},{rabbit_mgmt_external_stats,emit_update,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},{rabbit_mgmt_external_stats,init,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,363}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]}]}
     Offender:   [{pid,undefined},{id,rabbit_mgmt_external_stats},{mfargs,{rabbit_mgmt_external_stats,start_link,[]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]

2019-11-15 10:13:21 =SUPERVISOR REPORT====
     Supervisor: {local,rabbit_mgmt_agent_sup_sup}
     Context:    start_error
     Reason:     {shutdown,{failed_to_start_child,rabbit_mgmt_external_stats,{badarg,[{erlang,port_command,[#Port<0.23>,[]],[{file,"erlang.erl"},{line,3143}]},{os,cmd,2,[{file,"os.erl"},{line,278}]},{rabbit_mgmt_external_stats,get_used_fd,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},{rabbit_mgmt_external_stats,get_used_fd,0,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},{rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},{rabbit_mgmt_external_stats,emit_update,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},{rabbit_mgmt_external_stats,init,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,363}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]}]}}}
     Offender:   [{pid,undefined},{name,rabbit_mgmt_agent_sup},{mfargs,{rabbit_mgmt_agent_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]

2019-11-15 10:13:21 =CRASH REPORT====
  crasher:
    initial call: application_master:init/4
    pid: <0.468.0>
    registered_name: []
    exception exit: {{{shutdown,{failed_to_start_child,rabbit_mgmt_agent_sup,{shutdown,{failed_to_start_child,rabbit_mgmt_external_stats,{badarg,[{erlang,port_command,[#Port<0.23>,[]],[{file,"erlang.erl"},{line,3143}]},{os,cmd,2,[{file,"os.erl"},{line,278}]},{rabbit_mgmt_external_stats,get_used_fd,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},{rabbit_mgmt_external_stats,get_used_fd,0,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},{rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},{rabbit_mgmt_external_stats,emit_update,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},{rabbit_mgmt_external_stats,init,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,363}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]}]}}}}},{rabbit_mgmt_agent_app,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
    ancestors: [<0.467.0>]
    message_queue_len: 1
    messages: [{'EXIT',<0.469.0>,normal}]
    links: [<0.467.0>,<0.43.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 184
  neighbours:

Then I restarted the VM and everything was fine .
i'm sure it will fail again, when it does I will post the full logs here ( because i removed it by mistake 😅 )

@michaelklishin
Copy link
Member

So it's the same exception in a slightly different place. Something prevents the node from starting a subprocess. It could be a piece of anti-virus software or something of that nature.

We will try to reproduce to make sure the management agent does not terminate the plugin and HTTP API. Instead, only one metric should be unavailable (how many file descriptors the node uses). I will file a more focused issue for that.

I see no evidence that the node itself terminates.

@michaelklishin
Copy link
Member

Can you make sure that your machine has handle.exe in PATH and restart the node? Do you have a message that says "Could not find handle.exe" logged?

@michaelklishin
Copy link
Member

There is apparently a Pivotal tech article on the topic, and it references this rabbitmq-users thread from a couple of years ago.

@badrshs
Copy link
Author

badrshs commented Nov 15, 2019

as i told you it will fail again :)

like you said : i found this in my log file
Could not find handle.exe

2019-11-15 10:21:09.569 [error] <0.466.0> Supervisor rabbit_mgmt_agent_sup had child rabbit_mgmt_external_stats started with rabbit_mgmt_external_stats:start_link() at {restarting,<0.468.0>} exit with reason bad argument in call to erlang:port_command(#Port<0.43>, []) line 3143 in os:cmd/2 line 278 in context start_error
2019-11-15 10:21:09.581 [warning] <0.645.0> Could not find handle.exe, please install from sysinternals
2019-11-15 10:21:14.622 [error] <0.645.0> ** Generic server rabbit_mgmt_external_stats terminating 
** Last message in was emit_update
** When Server state == {state,65536,[{{io_sync,count},0},{{queue_index_read,count},0},{{msg_store_read,count},0},{{io_read,count},1},{{io_file_handle_open_attempt,count},10},{{queue_index_write,count},0},{{queue_index_journal_write,count},0},{{msg_store_write,count},0},{{io_write,count},2},{{io_seek,time},0},{{mnesia_ram_tx,count},11},{{mnesia_disk_tx,count},0},{{io_read,bytes},1},{{io_write,bytes},5},{{io_seek,count},0},{{io_sync,time},0},{{io_reopen,count},0},{{io_read,time},0},{{io_file_handle_open_attempt,time},16000},{{io_write,time},0}],{set,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},undefined,5000}
** Reason for termination ==
** {badarg,[{erlang,port_command,[#Port<0.46>,[]],[{file,"erlang.erl"},{line,3143}]},{os,cmd,2,[{file,"os.erl"},{line,278}]},{rabbit_mgmt_external_stats,get_used_fd,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,137}]},{rabbit_mgmt_external_stats,get_used_fd,0,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,65}]},{rabbit_mgmt_external_stats,'-infos/2-lc$^0/1-0-',2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,181}]},{rabbit_mgmt_external_stats,emit_update,1,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,385}]},{rabbit_mgmt_external_stats,handle_info,2,[{file,"src/rabbit_mgmt_external_stats.erl"},{line,372}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]}]}

the log file now is around 50000 Lines ,I don't know what to share from that 😅😅

@michaelklishin
Copy link
Member

@badrshs well, we have made that function extra defensive and next patch releases of 3.8 and 3.7 should avoid plugin failure. Note that we could not reproduce the issue. Please install handle.exe and see if the exception goes away.

@michaelklishin
Copy link
Member

Our recommendation of not using single core Windows VM due to known Erlang peculiarities in such environments also still stands.

@lukebakken
Copy link

@badrshs - We could not reproduce this issue, but have committed a fix here:

rabbitmq/rabbitmq-management-agent#87

Would you be willing to test a custom build containing this fix?

@badrshs
Copy link
Author

badrshs commented Nov 19, 2019

@lukebakken ok , I will try that tomorrow , but not sure about the result because IT department already upgrade my machine to dual core.
anyway i will give it try.
thank you guys 😁

@badrshs
Copy link
Author

badrshs commented Nov 20, 2019

Hi. seems like everything is fine. I'm not facing the issue anymore
but again I'm not sure if it's because of your changes or because i upgrade my vm to dual process
thanks again

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question A question of RCA request
Projects
None yet
Development

No branches or pull requests

3 participants