Exclude prometheus plugin from default plugins #309
Comments
That sounds more like a bug in the Prometheus plugin - it should not consume any resources unless you actually query the Prometheus endpoint. We'll look into that. Thanks for reporting!
This is not a bug in the Prometheus plugin. The plugin, if enabled, has to collect metric samples emitted by RabbitMQ entities periodically. For better or worse, Prometheus is the Kubernetes community's monitoring standard. Most users would like to use it and therefore have it enabled. It would be great to have a way to opt out, but this plugin is exactly the kind that we do want enabled by default: not having monitoring data when it's time to troubleshoot an issue is a significantly worse outcome than having to opt out.

Equally importantly, the charts above really don't explain what exactly consumed the resources. There are CLI commands that provide relevant metrics, and there are steps to reduce CPU context switching and "busy-waiting" on mostly idle systems. I cannot reproduce this "one full core" behavior on a node that has multiple plugins enabled, including the Prometheus one, has no connections, and has a few virtual hosts and queues. So my guess is that this environment has very constrained resources (as far as data services such as RabbitMQ go) and default Erlang scheduler settings cause enough CPU context switching to register as "one full core" of usage in Kubernetes metrics. See the links above to learn more and reduce busy-waiting.

@mkuratczyk if there is a way to opt out already, I suggest that we close this as a
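For reference, busy-waiting on mostly idle nodes can be reduced with Erlang scheduler flags. The values below are a common suggestion from the RabbitMQ runtime tuning guidance, not something specific to this environment:

```shell
# Reduce scheduler busy-waiting on a mostly idle node.
# These flags trade a small amount of scheduling latency for
# much lower idle CPU usage reported by the OS / Kubernetes.
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"
```

These can be set in `rabbitmq-env.conf` or as a container environment variable before the node starts.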
I cannot reproduce the "one full core" behavior (outside of Kubernetes, with 3-4 year old hardware). With a handful of virtual hosts and queues, the default (5s) stats emission interval, and no connections, the node OS process hovers around 9% of a single core. Here's how the scheduler time is spent:
1.08% is spent executing code, almost 99% in sleep. I am not using reduced busy-wait settings; in fact, no scheduler flags of any kind:
So, no VM flags or a
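The scheduler time breakdown quoted above can be collected from a running node with `rabbitmq-diagnostics` (output and percentages will vary by environment):

```shell
# Sample runtime (scheduler) thread activity on a running node:
# shows time spent in emulator code, GC, sleep, and so on.
rabbitmq-diagnostics runtime_thread_stats --sample-interval 5
```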
Given 2 nodes, rmq0 & rmq1:
Notice that CPU utilisation & breakdown, system processes, interrupts & context switches are almost identical. Let's take 3 minutes to re-create this and confirm that rabbitmq_prometheus does not consume any extra CPU when enabled (it's a YouTube video, click to play):

Notice towards the end of the above recording how the number of process reductions inside the Erlang VM is almost identical for both nodes. Given 16 Erlang schedulers (16 CPUs), only 1% of 1 CPU is being utilised on both nodes (the equivalent of 10m CPU in K8S). In my example I am using Ubuntu 18.04.4 LTS with Intel(R) Xeon(R) CPU E5-1410 v2 @ 2.80GHz & Docker 19.03.12.

I would be curious to know what is different about your environment. Can you share the same stats as I have, but for 2 separate RabbitMQ nodes, one with rabbitmq_prometheus enabled & one without?

FWIW, rabbitmq_prometheus does not consume any CPU cycles unless it gets queried for metrics. Unlike the rabbitmq_management plugin, rabbitmq_prometheus reads data structures from memory, converts them to Prometheus format and then serves them via HTTP when an HTTP request is made. Just to re-emphasise: unless some component in your infrastructure queries TCP port 15692, rabbitmq_prometheus will not use any CPU. If it did, we would not have enabled it by default. Let us know how your investigation goes @timbrd.
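To see what a scrape actually looks like, the endpoint can be queried manually (assuming the default port 15692 on localhost); this request is the only time rabbitmq_prometheus does per-request work:

```shell
# Scrape the Prometheus endpoint of a running RabbitMQ node once
# and show the first few exposition-format lines.
curl -s http://localhost:15692/metrics | head -n 5
```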
@timbrd what is the tool that produced those charts, and what kind of pod resource limits are used in this environment? We cannot reproduce the resource usage you are demonstrating, so our conclusion is that it is something environment-specific. As for what kind of defaults this Operator uses, @mkuratczyk and I so far agree on the following:
However, there is another moving part that must be considered: the community Docker image this Operator currently uses. This happens at image build time, so the Operator has a final say in what set of plugins is pre-enabled. If
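For context, the set of pre-enabled plugins lives in an `enabled_plugins` file, which holds a single Erlang list term. A hypothetical example follows; the actual list shipped in the community image or generated by the Operator may differ:

```erlang
%% /etc/rabbitmq/enabled_plugins - a single Erlang list terminated by a period.
%% Removing rabbitmq_prometheus from this list would opt out of the
%% Prometheus endpoint on that node.
[rabbitmq_management,rabbitmq_peer_discovery_k8s,rabbitmq_prometheus].
```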
@timbrd not having rabbitmq_prometheus enabled is a step backwards for understanding how RabbitMQ & Erlang are behaving. When problems hit, it's impossible to know why they are happening, and enabling the plugin at that point in time may not help, since you don't know what changed. rabbitmq_prometheus doesn't use any CPU when enabled if it is not queried. Thanks!
I have noticed that the Prometheus plugin consumes far too much CPU. On a completely idle RabbitMQ instance (no connections, no queues, etc.), the CPU load is between 600m and 1 full core:
After disabling the plugin, the CPU load drops to nearly 0:
Please exclude the plugin from the default list so that it can be enabled only when it is really needed.
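The per-pod CPU figures above can be cross-checked with `kubectl top`; the label selector and namespace here are assumptions, so adjust them to match your deployment:

```shell
# Compare CPU/memory usage across RabbitMQ pods (requires metrics-server).
kubectl top pod -l app.kubernetes.io/name=rabbitmq --namespace rabbitmq
```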