
Inconsistent node disk space and memory with metric aggregation #2513

Closed
logicbomb421 opened this issue Jun 23, 2020 · 4 comments

@logicbomb421

I am seeing some oddness when I look at the disk space (rabbitmq_disk_space_available_bytes) and memory (rabbitmq_process_resident_memory_bytes) metrics in a multi-node cluster. The value returned appears to hop between nodes, which makes charting it rather odd. After reading the documentation on metric aggregation and enabling prometheus.return_per_object_metrics, I expected these metrics to expand into one series per node; however, that does not appear to be the case.

I would be very grateful if someone could explain how to view these two metrics per node.

Node Configuration

[Screenshot: node configuration in the RabbitMQ management UI]

Example of rabbitmq_disk_space_available_bytes

[Screenshot: Google Cloud Platform monitoring chart]

Example of rabbitmq_process_resident_memory_bytes

[Screenshot: Google Cloud Platform monitoring chart]

Thanks!

@michaelklishin
Member

Node metrics should be excluded from aggregation, as 100K nodes is not a feasible scenario. @gerhard @dcorbacho could we be aggregating these as well?

References #2512.

@gerhard
Contributor

gerhard commented Jun 24, 2020

rabbitmq_disk_space_available_bytes and rabbitmq_process_resident_memory_bytes are per-node metrics; there is nothing to aggregate.

The problem that I suspect you are hitting is the lack of labels on those metrics that would distinguish them between nodes. This is the correct approach in Prometheus, where we use an on(instance) group_left rabbitmq_identity_info operation on all metrics that need to be grouped by node and cluster. To be more exact, this is how the rabbitmq_disk_space_available_bytes metric gets queried:

rabbitmq_disk_space_available_bytes * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster"}

The {{rabbitmq_node}} in the Legend splits all metrics per node when visualised in Grafana. This is what that looks like:

[Screenshot: Grafana chart with one series per node]
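For reference, the same join pattern would apply to the memory metric; a sketch, assuming the same $rabbitmq_cluster Grafana dashboard variable as in the query above:

```
rabbitmq_process_resident_memory_bytes * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster"}
```

With {{rabbitmq_node}} in the Legend, this likewise splits into one series per node.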

I am not familiar with the service that you are using for visualising metrics, but I would recommend using an equivalent PromQL query to get the metrics into the right format. We are implementing Prometheus exporter best practices, even though it was hard work at the time: prometheus/docs#1414

If this answer addresses your question, please close the issue @logicbomb421. Thanks!

@michaelklishin
Member

We don't think these are aggregated; they may be missing labels. @logicbomb421, has the above suggestion from @gerhard helped? Are the charts from GCP (Google Cloud Platform)? If so, it sounds like GCP is unaware that these metrics are node-specific.

@gerhard
Contributor

gerhard commented Feb 2, 2021

I am not sure how you are querying those metrics, but if you are going through a load balancer or a service rather than querying the nodes directly, the values will change based on which node services the request.
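One way to avoid this is to scrape every node directly instead of a balanced endpoint, so each series keeps its own instance label. A minimal Prometheus scrape config sketch, assuming three hypothetical node hostnames and the rabbitmq_prometheus plugin's default port 15692:

```
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      # Hypothetical hostnames; list each cluster node explicitly
      # rather than a load-balanced address.
      - targets:
          - rabbit-0.example.com:15692
          - rabbit-1.example.com:15692
          - rabbit-2.example.com:15692
```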

prometheus.return_per_object_metrics applies to multiple metrics within a single node, such as channels, queues, etc. There will be a single rabbitmq_process_resident_memory_bytes metric per node, regardless of what prometheus.return_per_object_metrics is set to.
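For completeness, this is how that switch is typically set in rabbitmq.conf (a sketch; it only affects per-object metrics, not the per-node ones discussed here):

```
# rabbitmq.conf
# Emit one metric per object (queue, channel, connection, etc.)
# instead of aggregated totals. Per-node metrics such as
# rabbitmq_process_resident_memory_bytes are unaffected.
prometheus.return_per_object_metrics = true
```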

I'm assuming that this issue is solved for you @logicbomb421 👍🏻
