
Slow increase in management db memory use when rates_mode = basic #214

Closed
spatula75 opened this issue Jun 2, 2016 · 17 comments

@spatula75

With RMQ 3.6.2, on a system that is fairly close to idle most of the time, we nonetheless see slow but consistent memory creep from the management DB, but only when rates_mode is basic. Setting it to 'none' seems to eliminate the problem. The slope of the curve is highly consistent (about 15 MB/hr), even if we turn on scripts that try to create lots of channels, messages, etc.:

[chart: management DB memory use climbing steadily at roughly 15 MB/hr]

Most of this space is being used by the management database table:

[screenshot: memory breakdown, with the management database table accounting for most of the usage]

Management DB configuration for this host (we've adjusted to shorter retention periods):

 {rabbitmq_management,
     [{basic,[{305,5},{1860,60},{3900,300}]},
      {cors_allow_origins,[]},
      {cors_max_age,1800},
      {detailed,[{10,5}]},
      {global,[{305,5},{1860,60},{3900,300}]},
      {http_log_dir,none},
      {listener,[{port,15672}]},
      {load_definitions,none},
      {process_stats_gc_timeout,300000},
      {rates_mode,basic},
      {sample_retention_policies,
          [{global,[{605,5},{3660,60},{29400,600},{86400,1800}]},
           {basic,[{605,5},{3600,60}]},
           {detailed,[{10,5}]}]}]},

On a different node, which is identically configured except for setting rates_mode=none (and which is actually busier), we don't see the memory creep:

[chart: flat memory use on the rates_mode = none node]

Manually forcing an Erlang GC (rabbitmqctl eval '[garbage_collect(P) || P <- processes()].') has no effect on the memory consumption.

Our workaround is to set rates_mode=none for now.
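
For anyone else hitting this, the change boils down to the following in rabbitmq.config (a minimal sketch of the workaround; the rest of the rabbitmq_management configuration shown above stays as it is):

 {rabbitmq_management,
     [{rates_mode, none}]}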

@michaelklishin
Member

@spatula75 ETS tables are not cleaned up by forcing a runtime GC.

Can you enable rabbitmq_top and see which processes consume the most RAM?
How many connections, channels, and queues does this system have?
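
For reference, assuming the plugin is available on your node, enabling it is a one-liner:

 rabbitmq-plugins enable rabbitmq_top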

@spatula75
Author

RAM usage per process isn't too dramatic. Sorry for the screenshots, but that seemed the easiest way to keep the formatting intact:

[screenshots: rabbitmq_top output showing per-process memory usage]

@michaelklishin
Member

@spatula75 thank you. On the chart above we see memory growth over the course of 24 or so hours. Default retention policies collect some samples for up to 24 hours. Does the growth level out after that?

Can you try tweaking the retention policies to, say, 4 hours to compare? Thank you.
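
For example, something along these lines (a sketch only, not recommended values, just shorter windows to compare against):

 {rabbitmq_management,
     [{sample_retention_policies,
         [{global,  [{14400,60}]},
          {basic,   [{14400,60}]},
          {detailed,[{10,5}]}]}]}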

@spatula75
Author

Our longest retention period is set to just over an hour, and unfortunately the increase we're seeing never really changes, even after 24 hours.

@michaelklishin
Member

@spatula75 OK, good to know. I guess we'll add ETS table info to the UI or rabbitmq-top to make investigating such issues easier.

What kind of workload do you run against this node? Is connection, channel, queue, or binding churn relatively high? We are looking for a way to reproduce.
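
In the meantime, here is a rough way to list the largest ETS tables from the CLI, in the same spirit as the GC one-liner above (a sketch; note that ets:info/2 reports sizes in words, not bytes):

 rabbitmqctl eval 'lists:sublist(lists:reverse(lists:keysort(2, [{ets:info(T, name), ets:info(T, memory)} || T <- ets:all()])), 10).'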

@spatula75
Author

The workload is mostly RPC-like, so once a minute a job will kick off, create channels, create consumers, publish a bunch of messages, close the channels used for publishing, and the consumers will await responses.

On this cluster in particular, there are many messages which will time out, though we see similar growth on a cluster with fewer timeouts.

Probably the most notable thing is a somewhat high amount of channel churn. I tried to make the problem worse by writing a script that repeatedly created a huge number of channels and then closed them, but this did not seem to affect the growth rate.

@michaelklishin
Member

@spatula75 OK, thanks, that's also helpful. We'll start by introducing #215 and then using it to see if we can trigger unusual table size growth.

@spatula75
Author

Cool. If/when you have a build containing #215, I'd be happy to run it on this cluster and point our normal workload at it to see which table(s) are causing trouble.

@michaelklishin
Member

OK, it looks like #217 largely addresses this: per @gmr's comment, his test tool can no longer reproduce the issue even after running 10 times as long.

We will produce a milestone build for @spatula75 to try. Closing this, but feel free to reopen if #217 isn't sufficient for your workload (and please provide a way to reproduce!).

@spatula75
Author

Sounds promising. We'll evaluate it just as soon as the milestone build is available.

@gmr
Contributor

gmr commented Jun 6, 2016

@spatula75 I included a pre-built drop-in for the plugin on ticket #217 if you wanted to test with that.

@spatula75
Author

@gmr thanks, giving it a try now.

@spatula75
Author

@gmr: is it expected that this build of the plugin will report many node-specific statistics (e.g. memory, I/O, descriptors, space) as zeros when it's loaded in 3.6.2?

@michaelklishin
Member

@spatula75 most node-wide stats are reported by rabbitmq_management_agent, which is a separate plugin. This change cannot affect those stats.

@spatula75
Author

Looks like it happens if I select a date range for the graphs that's longer than the available history. If I choose something like "last minute" then it's fine. Probably unrelated and not a big deal.

@spatula75
Author

So far this looks encouraging. The management DB has been running for an hour in our environment with no bloat.
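
For the record, a quick way to keep an eye on the management DB's share of memory from the CLI between UI checks (a sketch; it assumes this build still exposes rabbit_vm:memory/0 with an mgmt_db entry, as the 3.6.x memory breakdown does):

 rabbitmqctl eval 'proplists:get_value(mgmt_db, rabbit_vm:memory()).'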

@michaelklishin
Member

3.6.3 Milestone 1 includes this.
