Parallelise statistics DB collector #41

Closed
michaelklishin opened this Issue Apr 26, 2015 · 47 comments

@michaelklishin
Member

michaelklishin commented Apr 26, 2015

With a high number of stats-emitting entities in a cluster (primarily connections, channels, and queues), the statistics collector process gets overwhelmed, begins to use a higher-than-usual amount of RAM, and (intentionally) drops stats. This has no correlation with message rates and is only triggered by a large number of stats-emitting entities.

There are two workarounds:

  • Disabling message rate collection (rates_mode = none)
  • Increasing the stats emission interval to 20, 30, or 60 seconds (30 is usually sufficient). This means that the management UI will only be updated that often, but stats DB load will drop (the default emission interval is 5 seconds). See the config sketch below.
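
For reference, a minimal rabbitmq.config sketch of both workarounds (assuming the classic Erlang-terms config format; the interval is in milliseconds and both values are examples, adjust to taste):

[
  {rabbit,              [{collect_statistics_interval, 30000}]},  %% emit stats every 30s instead of every 5s
  {rabbitmq_management, [{rates_mode, none}]}                     %% disable message rate collection
].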

We should look into parallelising it, or at least the parts that are not order-dependent.

A longer term plan is to make the stats DB distributed, store and aggregate results from multiple nodes. This is out of scope of this issue.

UPD: the stats DB has a couple of known issues leading to high(er) RAM usage in 3.6.2. Even though they are fixed in later releases, the single stats node architecture can only go so far, so we've started working on #236.

@michaelklishin michaelklishin self-assigned this Apr 26, 2015

@michaelklishin michaelklishin added this to the 3.6.0 milestone Apr 26, 2015

@michaelklishin

Member

michaelklishin commented Apr 26, 2015

Some early thoughts on what we can do here.

We do have order-dependent bits: for example, if a connection is opened and immediately closed, we'd emit two stats events. They have to be processed in order. However, such events for separate connections can be processed in parallel. Our generic work pool won't be a good fit here: it has no hashing of any kind.

So we need a work pool with hashing. We identify things by:

  • Pids (connections, channels)
  • #resource{} records (queues, exchanges)
  • Binaries (vhosts, nodes)
  • Tuples (consumers)

Our list of worker processes will be fixed in size, so we can use either consistent hashing or a modulo-based one.

ETS may still be our ultimate bottleneck. It has basic concurrency controls. We need to profile to see if all of the above would make much difference, or whether it would need to be combined with a different key/value store backend.
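
For illustration, a minimal sketch of modulo-based dispatch over a fixed pool of collector workers (the module, function, and worker names are made up for this example, not taken from the plugin). erlang:phash2/2 hashes any Erlang term, so it works for all of the key types listed above:

-module(stats_worker_pool).
-export([pick_worker/2, cast/3]).

%% Map an event key (a pid, #resource{} record, binary or tuple) onto one of
%% PoolSize workers. erlang:phash2/2 returns a value in 0..PoolSize-1.
pick_worker(Key, PoolSize) ->
    erlang:phash2(Key, PoolSize) + 1.

%% Events for the same key always land on the same worker, preserving
%% per-entity ordering while unrelated entities are processed in parallel.
cast(Key, PoolSize, Event) ->
    N = pick_worker(Key, PoolSize),
    Worker = list_to_atom("stats_worker_" ++ integer_to_list(N)),
    gen_server:cast(Worker, {event, Key, Event}).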

@michaelklishin

Member

michaelklishin commented Apr 27, 2015

Another idea: we could use one collector per vhost. Aggregation then becomes more involved. Cluster-wide supervision may get more complicated as well.
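
For illustration only, a hedged sketch of what "one collector per vhost" could look like: a simple_one_for_one supervisor that starts one collector child per vhost, so events from different vhosts never share a mailbox. The vhost_collector module name is hypothetical:

-module(vhost_collector_sup).
-behaviour(supervisor).
-export([start_link/0, start_collector/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start a dedicated collector process for a vhost.
start_collector(VHost) ->
    supervisor:start_child(?MODULE, [VHost]).

init([]) ->
    {ok, {{simple_one_for_one, 10, 10},
          [{vhost_collector, {vhost_collector, start_link, []},
            transient, 5000, worker, [vhost_collector]}]}}.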

@gmr

Contributor

gmr commented Aug 21, 2015

Re one collector per vhost -- that doesn't buy you much if you only use one. I'm running into rate issues at the moment with only one.

@michaelklishin

Member

michaelklishin commented Aug 21, 2015

This is true. One collector per vhost could still be combined with other parallelisation strategies, though.

@carlhoerberg

Contributor

carlhoerberg commented Aug 21, 2015

Agreed, we're running hundreds of vhosts on some clusters, so for us it would probably help a lot.

@michaelklishin michaelklishin removed this from the 3.6.0 milestone Aug 31, 2015

@rtraschke

rtraschke commented Oct 7, 2015

Some initial results from profiling the stats db:

  1. The first one is staring us in the face, but I'd not thought of it: rabbit_mgmt_db is a gen_server that does two things: store event data (via handle_cast) and query metric data (via handle_call). When running a profile and keeping a Mgmt UI open, the query operations completely swamp the profile data. From a cursory glance at the code, I assume this is because the query operations do quite a bit of heavy lifting to convert the event data samples into metrics.
  2. Taking the UI out of the picture now ends up splitting the remaining profile further into two: ~90% GC and ~10% event handling. In this case, GC is the code that prunes the aggregated_stats ETS table (see rabbit_mgmt_db:gc_batch/1). I am slightly confused by this high impact though, since the GC is supposedly only run every 5 seconds.

The first point essentially means that having a Mgmt UI open makes the problem worse, and having multiple Mgmt UIs open makes things worse still. So one thing we can look at is splitting the event handling and the metric queries into two separate processes.

Next, do the same thing for the GC of the aggregated_stats table.

A final point is about the hibernation in the rabbit_mgmt_db module. The profile turns up a lot of suspensions of the process. Seeing that there was a lot of activity in my test run, I believe that the high number of suspends comes from the return of hibernate in the handle_call and handle_cast functions. I am unsure of the reasons behind placing the stats db server into hibernation; to my mind it doesn't make an awful lot of sense, but I'm probably missing something. Anyone have any insight into that? (@michaelklishin notes that he believes hibernation is not a good idea for the event collector.)

Any further thoughts from anyone?

Next step for me is to get a better handle on the timings split between metric queries, GC and event handling, and on how they can be made into independent processes (ultimately making ETS the bottleneck; that's my plan anyway).

Additionally, the suggestion of splitting into per-vhost collection would be good to get in as well.
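
To make the "split readers from the event collector" idea concrete, here is a hedged, self-contained sketch (module, table, and function names are invented for this example, not the plugin's actual code): the collector gen_server only handles writes to a public ETS table created with read_concurrency, and queries read ETS directly from the caller's process, so a busy Mgmt UI no longer competes with event handling in the collector's mailbox:

-module(stats_store).
-behaviour(gen_server).
-export([start_link/0, record/2, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {table}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Writer API: events are cast to the collector, preserving per-key ordering.
record(Key, Sample) ->
    gen_server:cast(?MODULE, {event, Key, Sample}).

%% Reader API: runs in the caller's process and never touches the collector.
lookup(Key) ->
    case ets:lookup(stats_store_tab, Key) of
        [{Key, Sample}] -> {ok, Sample};
        []              -> not_found
    end.

init([]) ->
    Tab = ets:new(stats_store_tab,
                  [set, public, named_table,
                   {read_concurrency, true}, {write_concurrency, true}]),
    {ok, #state{table = Tab}}.

handle_cast({event, Key, Sample}, State = #state{table = Tab}) ->
    ets:insert(Tab, {Key, Sample}),
    {noreply, State}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_info(_Info, State) ->
    {noreply, State}.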


@dcorbacho

Contributor

dcorbacho commented Nov 27, 2015

Some enhancements are available in #89, which depends on rabbitmq/rabbitmq-management-agent#8

@michaelklishin

Member

michaelklishin commented Nov 27, 2015

Sorry, moved QA discussions to the pull request.

@dcorbacho

Contributor

dcorbacho commented Dec 15, 2015

The problem in #89 is that it creates one ETS table per queue per stat (e.g. messages, messages_ready, ...), so it quickly reaches the system limit on the number of ETS tables.

We are now validating the alternative of creating one ETS table per event type (e.g. queue_stats, vhost_stats, etc.). The current implementation is functionally tested, but the performance is a bit behind, as the new ETS tables are very large. We are investigating how to aggregate the data within the ETS tables.
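
For illustration only, a hedged sketch of the "one table per event type" layout described above (table name, key shape, and counter names are examples, not the plugin's actual schema): all counters for a queue live in a single row keyed by the queue's identifier, instead of one ETS table per queue per counter:

QueueStats = ets:new(queue_stats, [set, public, {write_concurrency, true}]),
Key = {resource, <<"/">>, queue, <<"orders">>},
ets:insert(QueueStats, {Key, [{messages, 10}, {messages_ready, 7}, {messages_unacknowledged, 3}]}),
[{Key, Counters}] = ets:lookup(QueueStats, Key).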

@dcorbacho

Contributor

dcorbacho commented Jan 6, 2016

Issues in #89 solved by aggregating the ETS tables as explained in #101. Depends also on rabbitmq/rabbitmq-management-agent#10.
Parallelism and optimisation of some calls yield much better response times and reduce message queue backlogs.

@michaelklishin michaelklishin referenced this issue in rabbitmq/rabbitmq-server Jan 15, 2016

Closed

Prevent unbounded gc interval backoff #100

@dcorbacho dcorbacho referenced this issue in rabbitmq/rabbitmq-common Feb 8, 2016

Merged

Optimisation #51

@dcorbacho

Contributor

dcorbacho commented Feb 8, 2016

Hi @michaelklishin, we have a new version in:

#101
rabbitmq/rabbitmq-management-agent#10
rabbitmq/rabbitmq-common#51

Tested with 56484 queues on a MacBook:
time curl -X GET -u guest:guest http://127.0.0.1:15672/api/queues\?page\=10\&page_size\=100\&name\=\&use_regex\=false

Branch #41

real    0m5.930s
user    0m0.003s
sys     0m0.005s

Branch stable

real    0m11.437s
user    0m0.003s
sys     0m0.004s

Additional testing by @Gsantomaggio https://gist.github.com/Gsantomaggio/0b32a0eb9a08e2316051

@michaelklishin

Member

michaelklishin commented Feb 9, 2016

@dcorbacho things look better in terms of query efficiency, and with some modifications to account for the higher concurrency of the collectors, rabbithole tests pass most of the time.

However, sometimes there are failures, typically in channel listing (I may be wrong about this), and there's a gen_server crash logged:

=ERROR REPORT==== 10-Feb-2016::01:05:32 ===
** Generic server aggr_queue_stats_queue_msg_counts_gc terminating
** Last message in was gc
** When Server state == {state,500,#Ref<0.0.5767169.19943>,
                            aggr_queue_stats_queue_msg_counts,
                            aggr_queue_stats_queue_msg_counts_key_index,
                            {resource,<<"/">>,queue,
                                <<"amq.gen-iWjpHEDEvQJzEu7btTQByg">>}}
** Reason for termination ==
** {badarg,[{ets,next,
                 [aggr_queue_stats_queue_msg_counts_key_index,
                  {resource,<<"/">>,queue,
                            <<"amq.gen-iWjpHEDEvQJzEu7btTQByg">>}],
                 []},
            {rabbit_mgmt_stats_gc,gc_batch,3,
                                  [{file,"src/rabbit_mgmt_stats_gc.erl"},
                                   {line,128}]},
            {rabbit_mgmt_stats_gc,handle_info,2,
                                  [{file,"src/rabbit_mgmt_stats_gc.erl"},
                                   {line,80}]},
            {gen_server2,handle_msg,2,
                         [{file,"src/gen_server2.erl"},{line,1049}]},
            {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}
@michaelklishin

Member

michaelklishin commented Feb 9, 2016

Yes, so it seems to affect all ETS table GC processes:

=ERROR REPORT==== 10-Feb-2016::01:06:51 ===
** Generic server aggr_connection_stats_coarse_conn_stats_gc terminating
** Last message in was gc
** When Server state == {state,500,#Ref<0.0.5767169.85497>,
                               aggr_connection_stats_coarse_conn_stats,
                               aggr_connection_stats_coarse_conn_stats_key_index,
                               <0.13733.73>}
** Reason for termination ==
** {badarg,[{ets,next,
                 [aggr_connection_stats_coarse_conn_stats_key_index,
                  <0.13733.73>],
                 []},
            {rabbit_mgmt_stats_gc,gc_batch,3,
                                  [{file,"src/rabbit_mgmt_stats_gc.erl"},
                                   {line,128}]},
            {rabbit_mgmt_stats_gc,handle_info,2,
                                  [{file,"src/rabbit_mgmt_stats_gc.erl"},
                                   {line,80}]},
            {gen_server2,handle_msg,2,
                         [{file,"src/gen_server2.erl"},{line,1049}]},
            {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}
@michaelklishin

Member

michaelklishin commented Feb 9, 2016

=CRASH REPORT==== 10-Feb-2016::01:09:53 ===
  crasher:
    initial call: gen:init_it/7
    pid: <0.9822.75>
    registered_name: aggr_queue_stats_queue_msg_counts_gc
    exception exit: {badarg,
                        [{ets,next,
                             [aggr_queue_stats_queue_msg_counts_key_index,
                              {resource,<<"/">>,queue,
                                  <<"amq.gen-R_4zL9Je4ufHoJNJS39yjg">>}],
                             []},
                         {rabbit_mgmt_stats_gc,gc_batch,3,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,128}]},
                         {rabbit_mgmt_stats_gc,handle_info,2,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,80}]},
                         {gen_server2,handle_msg,2,
                             [{file,"src/gen_server2.erl"},{line,1049}]},
                         {proc_lib,wake_up,3,
                             [{file,"proc_lib.erl"},{line,250}]}]}
      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1160)
    ancestors: [<0.325.0>,rabbit_mgmt_sup,rabbit_mgmt_sup_sup,<0.304.0>]
    messages: []
    links: [<0.325.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 7387
  neighbours:

=SUPERVISOR REPORT==== 10-Feb-2016::01:09:53 ===
     Supervisor: {<0.325.0>,mirrored_supervisor_sups}
     Context:    child_terminated
     Reason:     {badarg,
                     [{ets,next,
                          [aggr_queue_stats_queue_msg_counts_key_index,
                           {resource,<<"/">>,queue,
                               <<"amq.gen-R_4zL9Je4ufHoJNJS39yjg">>}],
                          []},
                      {rabbit_mgmt_stats_gc,gc_batch,3,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,128}]},
                      {rabbit_mgmt_stats_gc,handle_info,2,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,80}]},
                      {gen_server2,handle_msg,2,
                          [{file,"src/gen_server2.erl"},{line,1049}]},
                      {proc_lib,wake_up,3,
                          [{file,"proc_lib.erl"},{line,250}]}]}
     Offender:   [{pid,<0.9822.75>},
                  {name,aggr_queue_stats_queue_msg_counts_gc},
                  {mfargs,
                      {rabbit_mgmt_stats_gc,start_link,
                          [aggr_queue_stats_queue_msg_counts]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 10-Feb-2016::01:09:56 ===
  crasher:
    initial call: gen:init_it/7
    pid: <0.19158.75>
    registered_name: aggr_connection_stats_coarse_conn_stats_gc
    exception exit: {badarg,
                        [{ets,next,
                             [aggr_connection_stats_coarse_conn_stats_key_index,
                              <0.12926.75>],
                             []},
                         {rabbit_mgmt_stats_gc,gc_batch,3,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,128}]},
                         {rabbit_mgmt_stats_gc,handle_info,2,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,80}]},
                         {gen_server2,handle_msg,2,
                             [{file,"src/gen_server2.erl"},{line,1049}]},
                         {proc_lib,wake_up,3,
                             [{file,"proc_lib.erl"},{line,250}]}]}
      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1160)
    ancestors: [<0.325.0>,rabbit_mgmt_sup,rabbit_mgmt_sup_sup,<0.304.0>]
    messages: []
    links: [<0.325.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 560
  neighbours:

=SUPERVISOR REPORT==== 10-Feb-2016::01:09:56 ===
     Supervisor: {<0.325.0>,mirrored_supervisor_sups}
     Context:    child_terminated
     Reason:     {badarg,
                     [{ets,next,
                          [aggr_connection_stats_coarse_conn_stats_key_index,
                           <0.12926.75>],
                          []},
                      {rabbit_mgmt_stats_gc,gc_batch,3,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,128}]},
                      {rabbit_mgmt_stats_gc,handle_info,2,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,80}]},
                      {gen_server2,handle_msg,2,
                          [{file,"src/gen_server2.erl"},{line,1049}]},
                      {proc_lib,wake_up,3,
                          [{file,"proc_lib.erl"},{line,250}]}]}
     Offender:   [{pid,<0.19158.75>},
                  {name,aggr_connection_stats_coarse_conn_stats_gc},
                  {mfargs,
                      {rabbit_mgmt_stats_gc,start_link,
                          [aggr_connection_stats_coarse_conn_stats]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 10-Feb-2016::01:10:03 ===
  crasher:
    initial call: gen:init_it/7
    pid: <0.20347.75>
    registered_name: aggr_queue_stats_queue_msg_counts_gc
    exception exit: {badarg,
                        [{ets,next,
                             [aggr_queue_stats_queue_msg_counts_key_index,
                              {resource,<<"rabbit/hole">>,queue,<<"q2">>}],
                             []},
                         {rabbit_mgmt_stats_gc,gc_batch,3,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,128}]},
                         {rabbit_mgmt_stats_gc,handle_info,2,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,80}]},
                         {gen_server2,handle_msg,2,
                             [{file,"src/gen_server2.erl"},{line,1049}]},
                         {proc_lib,wake_up,3,
                             [{file,"proc_lib.erl"},{line,250}]}]}
      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1160)
    ancestors: [<0.325.0>,rabbit_mgmt_sup,rabbit_mgmt_sup_sup,<0.304.0>]
    messages: []
    links: [<0.325.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 487
  neighbours:

=SUPERVISOR REPORT==== 10-Feb-2016::01:10:03 ===
     Supervisor: {<0.325.0>,mirrored_supervisor_sups}
     Context:    child_terminated
     Reason:     {badarg,
                     [{ets,next,
                          [aggr_queue_stats_queue_msg_counts_key_index,
                           {resource,<<"rabbit/hole">>,queue,<<"q2">>}],
                          []},
                      {rabbit_mgmt_stats_gc,gc_batch,3,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,128}]},
                      {rabbit_mgmt_stats_gc,handle_info,2,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,80}]},
                      {gen_server2,handle_msg,2,
                          [{file,"src/gen_server2.erl"},{line,1049}]},
                      {proc_lib,wake_up,3,
                          [{file,"proc_lib.erl"},{line,250}]}]}
     Offender:   [{pid,<0.20347.75>},
                  {name,aggr_queue_stats_queue_msg_counts_gc},
                  {mfargs,
                      {rabbit_mgmt_stats_gc,start_link,
                          [aggr_queue_stats_queue_msg_counts]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 10-Feb-2016::01:11:08 ===
  crasher:
    initial call: gen:init_it/7
    pid: <0.22946.75>
    registered_name: aggr_queue_stats_queue_msg_counts_gc
    exception exit: {badarg,
                        [{ets,next,
                             [aggr_queue_stats_queue_msg_counts_key_index,
                              {resource,<<"/">>,queue,
                                  <<"amq.gen-kBwTFG-MTuiEMWE94WZ0BA">>}],
                             []},
                         {rabbit_mgmt_stats_gc,gc_batch,3,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,128}]},
                         {rabbit_mgmt_stats_gc,handle_info,2,
                             [{file,"src/rabbit_mgmt_stats_gc.erl"},
                              {line,80}]},
                         {gen_server2,handle_msg,2,
                             [{file,"src/gen_server2.erl"},{line,1049}]},
                         {proc_lib,wake_up,3,
                             [{file,"proc_lib.erl"},{line,250}]}]}
      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1160)
    ancestors: [<0.325.0>,rabbit_mgmt_sup,rabbit_mgmt_sup_sup,<0.304.0>]
    messages: []
    links: [<0.325.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 9773
  neighbours:

=SUPERVISOR REPORT==== 10-Feb-2016::01:11:08 ===
     Supervisor: {<0.325.0>,mirrored_supervisor_sups}
     Context:    child_terminated
     Reason:     {badarg,
                     [{ets,next,
                          [aggr_queue_stats_queue_msg_counts_key_index,
                           {resource,<<"/">>,queue,
                               <<"amq.gen-kBwTFG-MTuiEMWE94WZ0BA">>}],
                          []},
                      {rabbit_mgmt_stats_gc,gc_batch,3,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,128}]},
                      {rabbit_mgmt_stats_gc,handle_info,2,
                          [{file,"src/rabbit_mgmt_stats_gc.erl"},{line,80}]},
                      {gen_server2,handle_msg,2,
                          [{file,"src/gen_server2.erl"},{line,1049}]},
                      {proc_lib,wake_up,3,
                          [{file,"proc_lib.erl"},{line,250}]}]}
     Offender:   [{pid,<0.22946.75>},
                  {name,aggr_queue_stats_queue_msg_counts_gc},
                  {mfargs,
                      {rabbit_mgmt_stats_gc,start_link,
                          [aggr_queue_stats_queue_msg_counts]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]

(for the record, not much new info there)

@michaelklishin

Member

michaelklishin commented Feb 9, 2016

I can confirm that whenever there are no crashes during a rabbithole test suite run, all tests pass (I'm about to push my test suite changes).

Here's how I run the suite multiple times in a row:

./bin/ci/before_script.sh
make
for i in {1..30}; do go test -v; done

michaelklishin added a commit to michaelklishin/rabbit-hole that referenced this issue Feb 9, 2016

Adjust test suite for higher event collector concurrency
See rabbitmq/rabbitmq-management#41. With the parallel
collector there, there's a natural race condition between
certain events recorded by the management DB and queries.

While sleeps aren't great, developing an awaiting version
of every read function in the client is an overkill.

While at it, don't assert on more volatile node metrics.
@dcorbacho

Contributor

dcorbacho commented Feb 11, 2016

The previous crash is solved by using ordered_set for the key indexes: if an entry has been deleted while iterating, the GC can still get the next element (ets:next/2 always succeeds on ordered_set tables). Fixed in f9fd8b2

While testing options for this fix, I tried ets:select/3 and ets:select/1, which are guaranteed to always succeed. However, ets:select/1 would occasionally fail with a badarg while running the rabbit_hole suite. This happens in Erlang 17.5 but not 18.x. After checking with a member of the OTP team, such a fix doesn't seem to be in the release notes. They'll check it and may create a regression test from our call sequence. @michaelklishin, we may have found another OTP bug ¯\_(ツ)_/¯
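
A small illustration of the ordered_set behaviour being relied on here (table and key names are made up for the example):

T = ets:new(key_index, [ordered_set]),
ets:insert(T, [{a, 1}, {b, 1}, {c, 1}]),
ets:delete(T, b),
ets:next(T, b).
%% On an ordered_set this returns c, the next key in term order, even though
%% b no longer exists. On a plain set table the same call raises badarg
%% (unless the table is fixed with ets:safe_fixtable/2), which is exactly the
%% crash seen in the GC processes above.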

@michaelklishin

Member

michaelklishin commented Feb 11, 2016

rabbithole test suite now succeeds 50 and 100 times in a row:

for i in {1..50}; do go test -v | grep Passed; done
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.52s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.00s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.41s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (15.16s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (14.52s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.64s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.14s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.75s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.08s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.92s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.51s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.66s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.79s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.03s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.05s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.61s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.60s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.64s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.99s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.20s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.66s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.61s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.58s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (15.54s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.12s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.77s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.06s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.45s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.65s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.14s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.08s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.83s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.88s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.25s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.00s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.73s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.53s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.15s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (15.02s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.94s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.10s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.31s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.52s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.78s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (14.09s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.42s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (13.11s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.83s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.99s)
SUCCESS! -- 51 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestRabbitHole (12.82s)

Median and 99th percentile test run time is also noticeably lower now.

@michaelklishin michaelklishin added this to the 3.6.2 milestone Feb 19, 2016

@michaelklishin michaelklishin added the bug label Feb 19, 2016

@michaelklishin

Member

michaelklishin commented Feb 22, 2016

After a few rounds of QA, this seems to be ready and will be shipped to the community for testing in 3.6.2 Milestone 1 or so.

@dcorbacho

Contributor

dcorbacho commented Feb 29, 2016

There are 3 new commits in this PR, fixing 2 issues that @Gsantomaggio found accidentally while testing a different issue with PerfTest on a cluster patched with #41 and detailed stats.

  • dd97139 restores the ETS table for queue_msg_rates. These stats aren't always generated, so its absence had gone unnoticed.
  • 46e53a5 handles the floating-point numbers passed in the io_read_avg_time, io_write_avg_time, io_sync_avg_time and io_seek_avg_time statistics. These stats are not generated when the load is very low (~0), so the bug had gone unnoticed. Its visible side-effect was that the node stats (memory, sockets, etc.) would drop to 0 or negative values in the UI.
  • 7364a2a reverts the timestamp performance improvement (<5%), after evaluating whether it could have been the cause of the previous issue. It is not, but it might cause instant rates of 0 when that is not the case.

^^ @michaelklishin

@michaelklishin

Member

michaelklishin commented Mar 2, 2016

Did another round of QA with multiple test suites, manual testing, stress testing with 100K queues, and so on. Things are looking good. Merging.

@michaelklishin michaelklishin changed the title from Statistics DB collector is a bottleneck to Parallelise statistics DB collector Mar 2, 2016

@michaelklishin

Member

michaelklishin commented Mar 2, 2016

@dcorbacho now we need to merge this into master. Interestingly there's only one conflict (in the dispatcher module), so hopefully it won't be a gargantuan amount of work.

@michaelklishin

Member

michaelklishin commented Mar 2, 2016

@dcorbacho thank you for your work on this. It took months and several different approaches but the results are promising.

michaelklishin added a commit to rabbitmq/rabbitmq-mqtt that referenced this issue Mar 8, 2016

Emit stats unconditionally
...of connection (flow control) state.

This makes it much easier to reason about flow control
state when looking at the management UI or monitoring tools
that poll HTTP API.
Now that rabbitmq/rabbitmq-management#41 is merged, there are
few arguments against always emitting stats.

Fixes #71.

michaelklishin added a commit to rabbitmq/rabbitmq-stomp that referenced this issue Mar 8, 2016

Emit stats unconditionally
of connection (flow control) state.

This makes it much easier to reason about flow control
state when looking at the management UI or monitoring tools
that poll HTTP API.
Now that rabbitmq/rabbitmq-management#41 is merged, there are
few arguments against always emitting stats.

Fixes #70.

michaelklishin added a commit to rabbitmq/rabbitmq-common that referenced this issue Mar 8, 2016

Emit stats unconditionally
...of connection (flow control) state.

This makes it much easier to reason about flow control
state when looking at the management UI or monitoring tools
that poll HTTP API.
Now that rabbitmq/rabbitmq-management#41 is merged, there are
few arguments against always emitting stats.

Fixes #679.

@michaelklishin michaelklishin referenced this issue in rabbitmq/rabbitmq-common Mar 8, 2016

Merged

Emit stats unconditionally #70

@jippi

jippi commented Apr 25, 2016

Any timeframe for a release with this fix?

@michaelklishin

Member

michaelklishin commented Apr 25, 2016

@jippi no promises. Several milestone releases are already available.

@jippi

jippi commented Apr 25, 2016

Okay, is it intended that the 3.6.2 M5 release has all its files named rabbitmq-server-3.6.1.905* rather than rabbitmq-server-3.6.2.*?

@michaelklishin

Member

michaelklishin commented Apr 25, 2016

@jippi err, I meant to say that 3.6.2 is not yet out, so versions are 3.6.1.X, where X is high enough to hint that this is not a typical release.

@jippi

jippi commented Apr 25, 2016

@michaelklishin got it! I've upgraded my cluster to M5 and will continue to monitor it for quirks! :) Thanks for the help and speedy replies - much appreciated.

@noahhaon

noahhaon commented Apr 25, 2016

@jippi Please report back and let me know how the 3.6.2 milestones affect this issue for you! We ran into the same thing when upgrading to 3.6.1 and were forced to downgrade.

I suspect this issue was somehow introduced in 3.6.x - we ran into it immediately with no other changes besides the upgrade, as you did - but @michaelklishin does not seem to share my opinion. There has been a lot of work on the rabbitmq-management plugin and related issues in 3.6.2, so I'm hoping the root cause has either been mitigated by the many management plugin improvements in 3.6.2 or has otherwise been fixed by accident as a result of the extensive refactoring.

@jippi

jippi commented Apr 26, 2016

@noahhaon it has run for ~15h now without issues. It's been through the busiest time of day (morning) without breaking a sweat, so I would say it totally fixed the issue for me :) running M5.

I saw the issue across all 3.6.x releases, and no problems at all under 3.5.7 (or below).

cc @michaelklishin TL;DR the issue seems to have been resolved

@michaelklishin

Member

michaelklishin commented Apr 26, 2016

Thank you for the update, Christian!

@jippi

jippi commented Apr 26, 2016

@michaelklishin one thing I just noticed - 100k messages got (successfully) consumed, but they don't show up as any kind of rate in the chart below - is that a bug or not?

[screenshot of the management UI rate chart]

@michaelklishin

Member

michaelklishin commented Apr 26, 2016

Please post questions to rabbitmq-users or Stack Overflow. RabbitMQ uses GitHub issues for specific actionable items engineers can work on, not questions. Thank you.

Messages could expire due to TTL, and so on. Unless this can be reproduced, I doubt it is a bug.

@michaelklishin

Member

michaelklishin commented Apr 26, 2016

3.6.2 RC1 is out. Let's move all support questions there; this thread is already long and discusses all kinds of things.

@michaelklishin

Member

michaelklishin commented Jun 6, 2016

Follow-up issues worth mentioning: #214, #216.

@michaelklishin

Member

michaelklishin commented Jun 24, 2016

This issue is worth an update.

We've fixed a couple of leaks in 3.6.2 and they make things work better for some users. We've also updated the docs to mention the workarounds. For some environments, however, a single-node collector isn't enough.

So we will be looking into a completely new plugin for 3.7.0, if time permits. It will be distributed: stats will be kept on all cluster nodes and aggregated by HTTP API handlers. No promises on the ETA.
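As a very rough illustration of the "keep stats per node, aggregate on request" idea described above (an assumption for illustration, not the actual design of the new plugin): each node could answer queries against its own local tables, and an HTTP API handler could merge the per-node results.

```erlang
%% Illustrative sketch only -- not the actual rabbitmq-management code.
-module(cluster_stats_sketch).
-export([lookup_everywhere/2]).

%% Ask every running node for its local rows for Key in Table and concatenate
%% the results; a real implementation would also aggregate/merge the samples.
lookup_everywhere(Table, Key) ->
    Nodes = [node() | nodes()],
    {Replies, _BadNodes} = rpc:multicall(Nodes, ets, lookup, [Table, Key], 5000),
    lists:append([R || R <- Replies, is_list(R)]).  %% drop badrpc errors
```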

@beasurajitroy

beasurajitroy commented Aug 26, 2016

We have followed both of the workarounds on version 3.5.7 but did not see a difference. We set rates_mode = none and the stats collection interval to 60,000 ms with no improvement. Currently we have disabled the management plugin to work around the issue.

Is there any other way to work around the issue?
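For reference, the two workarounds mentioned above expressed in the classic rabbitmq.config format (key names as documented for the 3.5.x/3.6.x series; double-check them against the docs for your release):

```erlang
%% rabbitmq.config -- the workarounds discussed in this thread.
[
  {rabbit, [
    %% emit stats every 60 seconds instead of the default 5
    {collect_statistics_interval, 60000}
  ]},
  {rabbitmq_management, [
    %% do not collect message rates at all
    {rates_mode, none}
  ]}
].
```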

@hoalequang

hoalequang commented Oct 13, 2016

I have the same issue on version 3.6.5 but have not found the right solution yet :(

@nickjones

nickjones commented Oct 13, 2016

FWIW, I have a crontab terminating the stats process every 24 hrs. The node assigned to stats has been up for 21 days now instead of the usual 5-7 days.
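A sketch of what such a periodic reset might evaluate via `rabbitmqctl eval` (the registered process name `rabbit_mgmt_db` is an assumption here; the exact, supported command is the documented one referred to below and varies by version):

```erlang
%% Illustration only -- use the documented command for your RabbitMQ version.
%% Invoked as: rabbitmqctl eval '...'
case erlang:whereis(rabbit_mgmt_db) of
    undefined -> ok;                            %% collector not running on this node
    Pid       -> exit(Pid, please_terminate)    %% supervisor restarts it with empty tables
end.
```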

@michaelklishin

Member

michaelklishin commented Oct 13, 2016

The command is documented. Let's not turn issues into yet another support channel.

On 14 Oct 2016, at 06:38, beasurajitroy notifications@github.com wrote:

@nickjones what is the command you use to terminate the stats? Can you give me the steps? I can also try at my end. In my case I have 10,000 VMs on that env and it dies pretty quick... let's say 12 hrs.

@chiarishow

chiarishow commented Oct 14, 2016

Same problem here (3.6.5, Erlang 18.3.4): stats slowly grow until memory is exhausted. I have tried lowering stats_event_max_backlog and collect_statistics_interval. For now I will use @nickjones's solution.

@beasurajitroy

beasurajitroy commented Mar 10, 2017

Is there a permanent fix for this anywhere, apart from @nickjones's workaround? We have been facing this issue for a while.

@michaelklishin

Member

michaelklishin commented Mar 10, 2017

@beasurajitroy this is not a support venue. Please direct questions to rabbitmq-users.

#236 is a more fundamental solution and it will be available in 3.6.7. I don't know if it's "permanent enough", as there are always ways to improve stats storage and aggregation, but it avoids the problem of a single node taking all the stats-related load.

@rplanteras

rplanteras commented Jun 28, 2018

@nickjones May I know the exact name of the process that you reset every 24 hrs in your crontab?

@rplanteras

rplanteras commented Jun 28, 2018

@chiarishow May I ask which process you terminate when you follow @nickjones's workaround? Thank you.

@rplanteras

rplanteras commented Jun 28, 2018

@michaelklishin Is it not possible to share the command here?

The command is documented. Let's not turn issues into yet another support channel.

On 14 Oct 2016, at 06:38, beasurajitroy notifications@github.com wrote:

@nickjones what is the command you use to terminate the stats? Can you give me the steps? I can also try at my end. In my case I have 10,000 VMs on that env and it dies pretty quick... let's say 12 hrs.

@rabbitmq rabbitmq locked as resolved and limited conversation to collaborators Jun 28, 2018
