
Exclude prometheus plugin from default plugins #309

Closed
timbrd opened this issue Sep 3, 2020 · 6 comments

timbrd commented Sep 3, 2020

I have noticed that the prometheus plugin consumes far too much CPU. On a completely idle RabbitMQ instance (no connections, no queues, etc.), the CPU load is between 600m and 1 core:

[Screenshot: CPU usage chart with rabbitmq_prometheus enabled]

After disabling the plugin, the CPU load drops to nearly 0:

[Screenshot: CPU usage chart with rabbitmq_prometheus disabled]

Please exclude the plugin from the default list so that it can be added only when it is really needed.
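
For reference, the plugin can be disabled on a running node with the standard plugins CLI:

rabbitmq-plugins disable rabbitmq_prometheus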

mkuratczyk (Collaborator) commented

That sounds more like a bug in the prometheus plugin - it should not consume any resources unless you actually query the Prometheus endpoint. We'll look into that. Thanks for reporting!

michaelklishin (Member) commented Sep 7, 2020

This is not a bug in the Prometheus plugin. The plugin, if enabled, has to collect metric samples emitted by RabbitMQ entities periodically.
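
The emission interval is configurable; on a mostly idle node it can be raised in rabbitmq.conf to reduce collection overhead. A sketch (10000 ms is just an illustrative value):

# rabbitmq.conf: raise the stats emission interval from the 5000 ms default
collect_statistics_interval = 10000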

For better or worse, Prometheus is the Kubernetes community's monitoring standard. Most users would like to use it and therefore have it enabled. It would be great to have a way to opt out, but this plugin is exactly the kind that we do want enabled by default. Not having monitoring data when it's time to troubleshoot an issue is a significantly worse outcome than having to opt out.

Equally importantly, the charts above don't really explain what exactly consumed the resources. There are CLI commands that provide the relevant metrics, as well as steps to reduce CPU context switching and scheduler "busy-waiting" on mostly idle systems.

I cannot reproduce this "one full core" behavior on a node that has multiple plugins enabled, including the Prometheus one, with no connections and only a few virtual hosts and queues. My guess is that this environment has very constrained resources (as far as data services such as RabbitMQ go) and that the default Erlang scheduler settings cause enough CPU context switching to register as "one full core" of usage in Kubernetes metrics. See the links above to learn more and to reduce busy-waiting.
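
For illustration, a minimal sketch of reducing scheduler busy-waiting via Erlang VM flags (the values are a starting point, not a recommendation):

# set in the node's environment before it starts
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"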

@mkuratczyk if there is already a way to opt out, I suggest we close this as a wontfix. This plugin really should be enabled by default because having monitoring is essential.

michaelklishin (Member) commented

I cannot reproduce the "one full core" behavior (outside of Kubernetes, on 3-4 year old hardware). With a handful of virtual hosts and queues, the default (5s) stats emission interval, and no connections, the node's OS process hovers around 9% of a single core. Here's how scheduler time is spent:

rabbitmq-diagnostics runtime_thread_stats --sample-interval 20

        Type      aux check_io emulator       gc    other     port    sleep

         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     scheduler    0.00%    0.00%    0.10%    0.00%    1.08%    0.00%   98.80%

Just over 1% of scheduler time is spent doing any work at all; almost 99% is spent in sleep.

I am not using reduced busy-wait settings; in fact, there are no scheduler flags of any kind:

rabbitmq-diagnostics os_env

Listing RabbitMQ-specific environment variables defined on node rabbit@warp10...
ADVANCED_CONFIG_FILE=/path/to/rabbitmq/generic/etc/rabbitmq/advanced.config
CONFIG_FILE=/path/to/rabbitmq/generic/etc/rabbitmq/rabbitmq
ENABLED_PLUGINS_FILE=/path/to/rabbitmq/generic/etc/rabbitmq/enabled_plugins
MNESIA_BASE=/path/to/rabbitmq/generic/var/lib/rabbitmq/mnesia
PLUGINS_DIR=/path/to/rabbitmq/generic/plugins
RABBITMQ_HOME=/path/to/rabbitmq/generic

So no VM flags or rabbitmq-env.conf file are involved.

gerhard (Contributor) commented Sep 7, 2020

Given 2 nodes rmq0 & rmq1
And rmq0 has the rabbitmq_prometheus plugin enabled
And rmq1 has the rabbitmq_prometheus plugin disabled
We observe the following CPU stats over 30 seconds:

rmq0 - rabbitmq_prometheus enabled:
root@rmq0:/# dstat -cpy 1 30
--total-cpu-usage-- ---procs--- ---system--
usr sys idl wai stl|run blk new| int   csw
  1   1  99   0   0|  0   0 1.7|1140  1975
  1   0  99   0   0|  0   0   0|1048  1759
  0   0  99   0   0|  0   0   0| 827  1395
  0   0 100   0   0|  0   0   0| 798  1337
  0   0 100   0   0|  0   0 2.0| 822  1305
  0   0  99   0   0|1.0   0   0|1084  1868
  0   0  99   0   0|  0   0  12|1203  1879
  0   0  99   0   0|1.0   0   0| 689  1132
  0   0 100   0   0|  0   0   0| 769  1301
  2   2  96   0   0|  0   0  46|2207  4174
  0   0  99   0   0|  0   0   0| 993  1668
  1   0  99   0   0|  0   0   0|1045  1667
  0   0 100   0   0|  0   0   0| 786  1270
  2   1  96   0   0|  0   0  46|2593  4636
  0   0 100   0   0|  0   0 2.0| 913  1431
  1   0  99   0   0|  0   0   0|1065  1791
  0   1  99   0   0|  0   0 4.0|1089  1654
  0   0  99   0   0|  0   0   0| 816  1370
  0   0 100   0   0|1.0   0   0| 755  1250
  0   0  99   0   0|4.0   0   0| 826  1379
  1   1  99   0   0|  0   0   0|1163  1971
  1   0  99   0   0|2.0   0   0|1034  1671
  0   0  99   0   0|  0   0   0| 777  1293
  0   0 100   0   0|  0   0   0| 718  1166
  0   0  99   0   0|  0   0 2.0|1003  1544
  0   0  99   0   0|  0   0   0|1129  1878
  1   1  99   0   0|1.0   0  12|1511  2263
  0   0  99   0   0|1.0   0   0| 963  1578
  0   0  99   0   0|  0   0   0| 831  1388
  0   0  99   0   0|1.0   0   0| 723  1228
        
rmq1 - rabbitmq_prometheus disabled:

root@rmq1:/# dstat -cpy 1 30
--total-cpu-usage-- ---procs--- ---system--
usr sys idl wai stl|run blk new| int   csw
  1   1  99   0   0|1.0   0 1.7|1140  1975
  1   0  99   0   0|  0   0   0|1063  1766
  0   0 100   0   0|  0   0   0| 823  1387
  0   0  99   0   0|  0   0   0| 799  1344
  0   0 100   0   0|  0   0 2.0| 822  1305
  0   0  99   0   0|1.0   0   0|1101  1902
  0   0  99   0   0|  0   0  12|1190  1848
  0   0  99   0   0|1.0   0   0| 702  1163
  0   0 100   0   0|  0   0   0| 747  1260
  2   1  96   0   0|  0   0  46|2211  4181
  0   1  99   0   0|  0   0   0| 995  1668
  1   0  99   0   0|  0   0   0|1046  1667
  0   0 100   0   0|  0   0   0| 788  1276
  2   1  96   0   0|  0   0  46|2593  4636
  0   0 100   0   0|  0   0 2.0| 910  1426
  1   0  99   0   0|  0   0   0|1045  1746
  0   0  99   0   0|  0   0 4.0|1094  1676
  0   0  99   0   0|  0   0   0| 820  1374
  0   0 100   0   0|1.0   0   0| 782  1302
  0   0  99   0   0|4.0   0   0| 826  1379
  1   1  99   0   0|  0   0   0|1134  1915
  1   0  99   0   0|2.0   0   0|1056  1707
  0   0  99   0   0|  0   0   0| 772  1283
  0   0 100   0   0|  0   0   0| 709  1153
  0   0  99   0   0|  0   0 2.0|1009  1556
  0   0  99   0   0|  0   0   0|1127  1878
  1   1  99   0   0|1.0   0  12|1528  2294
  0   0  99   0   0|2.0   0   0| 939  1535
  0   0  99   0   0|  0   0   0| 832  1387
  0   0  99   0   0|1.0   0   0| 746  1272
        

Notice that CPU utilisation & breakdown, system processes, interrupts & context switches are almost identical.

Let's take 3 minutes to re-create this and confirm that rabbitmq_prometheus does not consume any extra CPU when enabled (YouTube video):

[Video: Confirming that rabbitmq_prometheus does not use extra CPU when enabled]

Notice towards the end of the above recording how the number of process reductions inside the Erlang VM is almost identical for both nodes. Given 16 Erlang schedulers (16 CPUs), only 1% of 1 CPU is being utilised on both nodes (the equivalent of 10m CPU in K8s). In my example I am using Ubuntu 18.04.4 LTS with an Intel(R) Xeon(R) CPU E5-1410 v2 @ 2.80GHz and Docker 19.03.12.
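
If you want to compare reductions yourself without the recording, a rough sketch (run it twice a few seconds apart on each node and compare the deltas):

rabbitmqctl eval 'erlang:statistics(reductions).'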

I would be curious to know what is different about your environment. Can you share the same stats as above, but for 2 separate RabbitMQ nodes, one with rabbitmq_prometheus enabled and one without?
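
Something along these lines would do; a sketch assuming Docker and the community image (dstat is not part of the image, so install it inside each container first):

docker run -d --name rmq0 rabbitmq:3.8
docker run -d --name rmq1 rabbitmq:3.8
# enable the plugin on rmq0 only
docker exec rmq0 rabbitmq-plugins enable rabbitmq_prometheus
docker exec rmq1 rabbitmq-plugins disable rabbitmq_prometheus
# capture 30 seconds of CPU stats from each node
docker exec rmq0 dstat -cpy 1 30
docker exec rmq1 dstat -cpy 1 30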

FWIW, rabbitmq_prometheus does not consume any CPU cycles unless it gets queried for metrics. Unlike the rabbitmq_management plugin, rabbitmq_prometheus reads data structures from memory, converts them to the Prometheus format, and serves them via HTTP when an HTTP request is made. Just to re-emphasise: unless some component in your infrastructure queries TCP port 15692, rabbitmq_prometheus will not use any CPU. If it did, we would not have enabled it by default.
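
To see the endpoint do its work on demand, a quick manual scrape (default port 15692):

# metrics are generated only while this request is being served
curl -s http://localhost:15692/metrics | head -n 20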

Let us know how your investigation goes @timbrd.

michaelklishin (Member) commented

@timbrd what is the tool that produced those charts and what kind of pod resource limits are used in this environment? We cannot reproduce the resource usage you are demonstrating, so our conclusion is that it is something environment-specific.

As for what kind of defaults this Operator uses, @mkuratczyk and I so far agree on the following:

  • The Operator will have two sets of default plugins: essential (you won't be able to disable these) and "recommended"
  • The latter will be overridable using an "additional plugins" key that can be set to, say, an empty list

However, there is another moving part that must be considered: the community Docker image this Operator currently uses.
The image has recently enabled Prometheus by default, which includes disabling management UI-specific metric collection (something that would help your case by conserving CPU resources). So there is very strong interest in having the Prometheus scraping endpoint enabled by default.

This happens at image build time, so the Operator has the final say in what set of plugins is pre-enabled.

If rabbitmq_prometheus is moved from the required plugins list to the default additional plugins list, you would be able to disable it by setting additional plugins explicitly to an empty list. I will file a new issue that recommends that we do that.
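
To illustrate what that could look like on the consuming side, a sketch of a manifest (the additionalPlugins field name is an assumption about the eventual API, not a documented key):

cat <<'EOF' | kubectl apply -f -
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example
spec:
  rabbitmq:
    # hypothetical: explicitly opt out of all non-essential default plugins
    additionalPlugins: []
EOF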

gerhard (Contributor) commented Sep 7, 2020

@timbrd not having rabbitmq_prometheus enabled is a step backwards for understanding how RabbitMQ & Erlang are behaving. When problems hit, it's impossible to know why they are happening, and enabling the plugin at that point may not help, since you don't know what has changed.

rabbitmq_prometheus doesn't use any CPU when enabled if management_agent.disable_metrics_collector is set; this is the default in the official RabbitMQ Docker image. I suspect that you don't have management_agent.disable_metrics_collector = true set and that you have many objects in your RabbitMQ deployment (connections, channels, queues, etc.), which keeps rabbitmq_management_agent busy tracking metrics for all of them. If you could confirm this, it would help us conclude that disabling rabbitmq_prometheus is an indirect way of solving your CPU usage problem.
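
For reference, a minimal sketch of checking the effective value on a running node, and of the rabbitmq.conf line that sets it:

# check what the management agent currently has configured
rabbitmqctl environment | grep -i disable_metrics_collector
# to set it, add this line to rabbitmq.conf and restart the node:
#   management_agent.disable_metrics_collector = true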

Thanks!
