
Slow increase in management db memory use when rates_mode = basic #214

Closed
spatula75 opened this issue Jun 2, 2016 · 17 comments

@spatula75

With RMQ 3.6.2, on a system that is fairly close to idle most of the time, we nonetheless see slow but consistent memory creep from the management DB, but only when rates_mode is basic. Setting it to 'none' seems to eliminate the problem. The slope of the curve is highly consistent (about 15 MB/hr), even if we turn on scripts that try to create lots of channels, messages, etc.:

[chart: management DB memory use climbing steadily at roughly 15 MB/hr]

Most of this space is being used by the management database table:

[screenshot: memory breakdown, with the management database table accounting for most of the usage]

Management DB configuration for this host (we've adjusted to shorter retention periods):

 {rabbitmq_management,
     [{basic,[{305,5},{1860,60},{3900,300}]},
      {cors_allow_origins,[]},
      {cors_max_age,1800},
      {detailed,[{10,5}]},
      {global,[{305,5},{1860,60},{3900,300}]},
      {http_log_dir,none},
      {listener,[{port,15672}]},
      {load_definitions,none},
      {process_stats_gc_timeout,300000},
      {rates_mode,basic},
      {sample_retention_policies,
          [{global,[{605,5},{3660,60},{29400,600},{86400,1800}]},
           {basic,[{605,5},{3600,60}]},
           {detailed,[{10,5}]}]}]},

On a different node, which is identically configured except for setting rates_mode=none (and which is actually busier), we don't see the memory creep:

[chart: flat memory use on the rates_mode = none node]

Manually forcing an Erlang GC (rabbitmqctl eval '[garbage_collect(P) || P <- processes()].') has no effect on the memory consumption.

Our workaround is to set rates_mode=none for now.
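
For anyone else hitting this, the change boils down to the following in rabbitmq.config (a minimal sketch of the workaround; the rest of the rabbitmq_management configuration shown above stays as it is):

 {rabbitmq_management,
     [{rates_mode, none}]}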

@michaelklishin
Member

@spatula75 ETS tables are not cleaned up by forcing a runtime GC.

Can you enable rabbitmq_top and see which processes consume the most RAM?
How many connections, channels, and queues does this system have?
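
For reference, assuming the plugin is available on your node, enabling it is a one-liner:

 rabbitmq-plugins enable rabbitmq_top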

@spatula75
Author

RAM usage per process isn't too dramatic. Sorry for the screenshots, but that seemed the easiest way to keep the formatting intact:

[screenshots: rabbitmq_top output showing per-process memory usage]

@michaelklishin
Member

@spatula75 thank you. On the chart above we see memory growth over the course of 24 or so hours. Default retention policies collect some samples for up to 24 hours. Does the growth level out after that?

Can you try tweaking the retention policies to, say, 4 hours to compare? Thank you.
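
For example, something along these lines (a sketch only, not recommended values, just shorter windows to compare against):

 {rabbitmq_management,
     [{sample_retention_policies,
         [{global,  [{14400,60}]},
          {basic,   [{14400,60}]},
          {detailed,[{10,5}]}]}]}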

@spatula75
Author

Our longest retention period is set to just over an hour, and unfortunately the increase we're seeing never really changes, even after 24 hours.

@michaelklishin
Member

@spatula75 OK, good to know. I guess we'll add ETS table info to the UI or rabbitmq-top to make investigating such issues easier.

What kind of workload do you run against this node? Is connection, channel, queue, or binding churn relatively high? We are looking for a way to reproduce.
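
In the meantime, here is a rough way to list the largest ETS tables from the CLI, in the same spirit as the GC one-liner above (a sketch; note that ets:info/2 reports sizes in words, not bytes):

 rabbitmqctl eval 'lists:sublist(lists:reverse(lists:keysort(2, [{ets:info(T, name), ets:info(T, memory)} || T <- ets:all()])), 10).'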

@spatula75
Author

The workload is mostly RPC-like, so once a minute a job will kick off, create channels, create consumers, publish a bunch of messages, close the channels used for publishing, and the consumers will await responses.

On this cluster in particular, there are many messages which will time out, though we see similar growth on a cluster with fewer timeouts.

Probably the most notable thing is a somewhat high amount of channel churn. I tried to make the problem worse by writing a script that repeatedly created a huge number of channels and then closed them, but this did not seem to affect the growth rate.

@michaelklishin
Member

@spatula75 OK, thanks, that's also helpful. We'll start by introducing #215 and then using it to see if we can trigger unusual table size growth.

@spatula75
Author

Cool. If/when you have a build containing #215, I'd be happy to run it on this cluster and point our normal workload at it to see which table(s) are causing trouble.

@michaelklishin
Member

OK, it looks like #217 largely addresses this: per @gmr's comment, his test tool can no longer reproduce the issue even after running 10 times as long.

We will produce a milestone build for @spatula75 to try. Closing this, but feel free to reopen if #217 isn't sufficient for your workload (and please provide a way to reproduce!).

@spatula75
Author

Sounds promising. We'll evaluate it just as soon as the milestone build is available.

@gmr
Contributor

gmr commented Jun 6, 2016

@spatula75 I included a pre-built drop-in for the plugin on ticket #217 if you wanted to test with that.

@spatula75
Author

@gmr thanks, giving it a try now.

@spatula75
Author

@gmr: is it expected that this build of the plugin will report many node-specific statistics (e.g. memory, I/O, descriptors, space) as zeros when it's loaded in 3.6.2?

@michaelklishin
Member

@spatula75 most node-wide stats are reported by rabbitmq_management_agent, which is a separate plugin. This change cannot affect those stats.

@spatula75
Author

Looks like it happens if I select a date range for the graphs that's longer than the available history. If I choose something like "last minute" then it's fine. Probably unrelated and not a big deal.

@spatula75
Author

So far this looks encouraging. The management DB has been running for an hour in our environment with no bloat.
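
For the record, a quick way to keep an eye on the management DB's share of memory from the CLI between UI checks (a sketch; it assumes this build still exposes rabbit_vm:memory/0 with an mgmt_db entry, as the 3.6.x memory breakdown does):

 rabbitmqctl eval 'proplists:get_value(mgmt_db, rabbit_vm:memory()).'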

@michaelklishin
Member

3.6.3 Milestone 1 includes this.
