Aggregated queue_messages_published_total metric violates Prometheus expectations about counters #2783

Closed
michaelklishin opened this issue Feb 2, 2021 · 8 comments

Comments

@michaelklishin
Member

See #2781 for the background.

Some metrics, e.g. queue_messages_published_total, are computed as a basic sum aggregation of samples from a local ETS table, e.g. channel_queue_exchange_metrics. When a channel is closed, its samples are removed from the table,
decreasing the sum. This violates an expectation for counter metrics in Prometheus: they can only increment, stay flat, or reset to 0, but never decrease.

The issue is not present when per-object metrics are used: when a channel is closed, all of its metrics go away, which is what the user expects to happen.
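
To make the failure mode concrete, here is a minimal, self-contained Erlang sketch of the kind of sum aggregation described above. The table name echoes channel_queue_exchange_metrics, but the key and value shapes are simplified for illustration and do not match the real schema:

```erlang
%% Illustrative only: shows why a sum over per-channel ETS rows cannot
%% behave like a Prometheus counter. This is not the actual plugin code.
-module(aggregation_sketch).
-export([demo/0]).

demo() ->
    T = ets:new(channel_queue_exchange_metrics, [set]),
    %% two open channels, each with a publish count for the same queue/exchange
    true = ets:insert(T, {{ch1, {q, x}}, 100}),
    true = ets:insert(T, {{ch2, {q, x}}, 50}),
    150 = sum(T),                        %% exported as ..._published_total
    %% channel ch1 closes and its row is deleted
    true = ets:delete(T, {ch1, {q, x}}),
    50 = sum(T).                         %% the exported "counter" just decreased

%% the kind of sum aggregation the collector performs per metric
sum(T) ->
    ets:foldl(fun({_Key, Count}, Acc) -> Acc + Count end, 0, T).
```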

This is a side-effect of our quick-and-dirty switch to aggregated metrics. We need to retain this historical total or delegate
to the Prometheus client library, which will do most of the aggregation work and handle resets.

This is node-local state, so we can address it even with a significant rework of the Prometheus plugin and still ship the fix in a 3.8.x release.

Per discussion with @dcorbacho @gerhard @kjnilsson.

@michaelklishin
Member Author

michaelklishin commented Feb 2, 2021

References #2512, #2513.

@gerhard
Contributor

gerhard commented Feb 2, 2021

Will be picking this one up first thing with @mkuratczyk, and working with @dcorbacho on the review so that she can continue focusing on her in-flight work.

@dcorbacho
Contributor

dcorbacho commented Feb 3, 2021

Luckily, we're not tracking totals on the channel but doing increments; see https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L205

So when queue_exchange_stats are increased, we can use that value to increase a new publish counter that tracks totals per node (or even per exchange). Here: https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L2123

This should be a very simple addition. Tracking this metric per node means that no garbage collection is needed; the new table only needs to be included in the set of tables to be reset (API command) and ignored everywhere else. The Prometheus collector could stop scraping channel_queue_exchange_metrics when per_object is selected and use the new metric.
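
A rough sketch of that idea, with made-up table and function names (the real change would sit next to the existing increment in rabbit_channel.erl):

```erlang
%% Hypothetical sketch of a node-local, increment-only publish counter.
%% Table name, key shape and function names are illustrative only.
-module(publish_totals_sketch).
-export([init/0, bump/2, total/1]).

init() ->
    ets:new(queue_messages_published_totals,
            [named_table, public, set, {write_concurrency, true}]).

%% called from the same code path that bumps channel_queue_exchange_metrics;
%% rows are never deleted when a channel closes, so the value never drops
bump(QueueExchange, Incr) ->
    ets:update_counter(queue_messages_published_totals, QueueExchange,
                       Incr, {QueueExchange, 0}).

total(QueueExchange) ->
    case ets:lookup(queue_messages_published_totals, QueueExchange) of
        [{_, Count}] -> Count;
        []           -> 0
    end.
```

The only maintenance such a table would need is the reset mentioned above (e.g. ets:delete_all_objects/1 when the API command asks for a reset), which is exactly the "reset to 0" behaviour Prometheus counters allow.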

gerhard added a commit that referenced this issue Feb 12, 2021
This is meant to address
#2783, which began as
#2781

This is just an experiment; don't get too excited. The first impression
is that the OpenTelemetry API is really hard to work with, and the
concepts involved don't make a lot of sense. We (+@mkuratczyk) will
continue as soon as we have dealt with
#2785

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this issue Jun 16, 2021
This way we can show how many messages were received via a
certain protocol (stream is the second real protocol besides the default
amqp091 one), as well as by queue type, which is something that many
people have requested for a really long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming messages is
non-destructive, and deep queue backlogs - think billions of messages -
are normal). Alerting and consumer scaling due to deep backlogs will now
work correctly, as we can distinguish between regular queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
#3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through.

We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, rabbitmq_global_, and
continue building on top of it. The idea is to build in parallel and
slowly transition to the new metrics, because the changes since streams
are semantically too big, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is the
least disruptive and... simple.

While at it, we removed redundant empty return value handling in the
channel; the function being called no longer returns that value.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho
(this is multiple commits squashed into one)

Next steps:
- Create a new PR and ask @mkuratczyk and @ansd for review - fresh 👀 👀
- This new PR closes #3045
- Back-port to 3.9.x as is (to my knowledge, this is the only feature
  missing before we can code freeze and cut an RC)
- Back-port parts of this to 3.8.x so that we can finally address
  #2783
- Fix & publish new version of the RabbitMQ-Overview Grafana dashboard
- Fix & finally publish the new RabbitMQ-Streams Grafana dashboard

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
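
The global counters approach described in this commit message can be illustrated, very roughly, with Erlang's counters module; the module name, label pairs, and storage via persistent_term below are assumptions made for the sake of the example, not the actual rabbitmq-server implementation:

```erlang
%% Rough illustration (assumed names) of node-global, increment-only
%% counters labelled by protocol and queue type, backed by the 'counters'
%% module and looked up through persistent_term.
-module(global_counters_sketch).
-export([init/1, messages_received/3, read/2]).

%% e.g. init([{amqp091, classic}, {stream, stream}])
init(ProtocolQueueTypePairs) ->
    lists:foreach(
      fun(Pair) ->
              persistent_term:put({?MODULE, Pair},
                                  counters:new(1, [write_concurrency]))
      end, ProtocolQueueTypePairs).

%% never decremented and never deleted, so a Prometheus collector can
%% expose each value directly as a monotonically increasing counter
messages_received(Protocol, QueueType, Incr) ->
    counters:add(persistent_term:get({?MODULE, {Protocol, QueueType}}), 1, Incr).

read(Protocol, QueueType) ->
    counters:get(persistent_term:get({?MODULE, {Protocol, QueueType}}), 1).
```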
gerhard added a commit that referenced this issue Jun 21, 2021
@karanlik

karanlik commented Jul 7, 2021

Hi All,

Apologies if I'm posting in the wrong place.

We're currently seeing behaviour similar to this issue: our "rabbitmq_channel_messages_published_total" metric (with aggregation enabled) decreases, which causes false alarms and misleading graphs. I guess this is some sort of bug, or at least something that does not play well with Prometheus. What is the proposed workaround or solution for this kind of situation? Switching to per-object metrics?
Once again, sorry if this is the wrong place to ask.
Thanks...

@gerhard
Contributor

gerhard commented Jul 7, 2021

The fix is already available in what will ship as RabbitMQ 3.9.0, via #3127.

You will notice that the PR has 5 more outstanding tasks, the next one being a back-port to v3.8.x. Once this is done, the issue will be fixed for the 3.8 release series via an update to the dashboards (also one of those 5 tasks).

Between RabbitMQ Summit - will you be joining us for today's pre-conference meetup @karanlik? 😉 - and a few other things in flight (3.9.0 being the most significant one by far), this back-port is going slower than I would like. The important take-away is that it's coming and it definitely fixes the issue (we have been testing it for weeks in our 3.9 long-running environment).

Hope that helps 👍🏻

@karanlik

karanlik commented Jul 7, 2021

Great! Thanks for your quick feedback. I'll give per-object metrics a try until this is available. Best regards. (PS: I've subscribed to the RabbitMQ Summit mailing list :-) )

@luos
Contributor

luos commented Nov 3, 2021

Hi,

Do you think this will get backported to 3.8.x?

Thanks

@MirahImage
Member

Closing, as this is available in 3.9 and 3.10, and 3.8 is in extended support.
