Aggregated queue_messages_published_total metric violates Prometheus expectations about counters #2783

Closed
michaelklishin opened this issue Feb 2, 2021 · 8 comments

Comments

@michaelklishin
Member

See #2781 for the background.

Some metrics, e.g. queue_messages_published_total, are computed as a basic sum aggregation of samples from a local ETS table, e.g. channel_queue_exchange_metrics. When a channel is closed, its samples are removed from the table,
decreasing the sum. This violates an expectation for counter metrics in Prometheus: they can only increment, stay flat, or reset to 0, but never decrease.

The issue is not present when per-object metrics are used: when a channel is closed, all of its metrics go away, which is what the user expects to happen.
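
To make the failure mode concrete, here is a minimal, self-contained Erlang sketch of the kind of sum aggregation described above. The table name echoes channel_queue_exchange_metrics, but the key and value shapes are simplified for illustration and do not match the real schema:

```erlang
%% Illustrative only: shows why a sum over per-channel ETS rows cannot
%% behave like a Prometheus counter. This is not the actual plugin code.
-module(aggregation_sketch).
-export([demo/0]).

demo() ->
    T = ets:new(channel_queue_exchange_metrics, [set]),
    %% two open channels, each with a publish count for the same queue/exchange
    true = ets:insert(T, {{ch1, {q, x}}, 100}),
    true = ets:insert(T, {{ch2, {q, x}}, 50}),
    150 = sum(T),                        %% exported as ..._published_total
    %% channel ch1 closes and its row is deleted
    true = ets:delete(T, {ch1, {q, x}}),
    50 = sum(T).                         %% the exported "counter" just decreased

%% the kind of sum aggregation the collector performs per metric
sum(T) ->
    ets:foldl(fun({_Key, Count}, Acc) -> Acc + Count end, 0, T).
```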

This is a side-effect of our quick-and-dirty switch to aggregated metrics. We need to retain this historical total or delegate
to the Prometheus client library, which will do most of the aggregation work and handle resets.

This is node-local state, so we can address it even with a significant rework of the Prometheus plugin and still ship the fix in a 3.8.x release.

Per discussion with @dcorbacho @gerhard @kjnilsson.

@michaelklishin
Member Author

michaelklishin commented Feb 2, 2021

References #2512, #2513.

@gerhard
Contributor

gerhard commented Feb 2, 2021

Will be picking this one up first thing with @mkuratczyk, and working with @dcorbacho on the review so that she can continue focusing on her in-flight work.

@dcorbacho
Contributor

dcorbacho commented Feb 3, 2021

Luckily, we're not tracking totals on the channel but doing increments; see https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L205

So when queue_exchange_stats are increased, we can use that value to increase a new publish counter that tracks totals per node (or even per exchange). Here: https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L2123

This should be a very simple addition. Tracking this metric per node means that no garbage collection is needed; the new table only needs to be included in the set of tables to be reset (API command) and ignored everywhere else. The Prometheus collector could stop scraping channel_queue_exchange_metrics when per_object is selected and use the new metric.
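
A rough sketch of that idea, with made-up table and function names (the real change would sit next to the existing increment in rabbit_channel.erl):

```erlang
%% Hypothetical sketch of a node-local, increment-only publish counter.
%% Table name, key shape and function names are illustrative only.
-module(publish_totals_sketch).
-export([init/0, bump/2, total/1]).

init() ->
    ets:new(queue_messages_published_totals,
            [named_table, public, set, {write_concurrency, true}]).

%% called from the same code path that bumps channel_queue_exchange_metrics;
%% rows are never deleted when a channel closes, so the value never drops
bump(QueueExchange, Incr) ->
    ets:update_counter(queue_messages_published_totals, QueueExchange,
                       Incr, {QueueExchange, 0}).

total(QueueExchange) ->
    case ets:lookup(queue_messages_published_totals, QueueExchange) of
        [{_, Count}] -> Count;
        []           -> 0
    end.
```

The only maintenance such a table would need is the reset mentioned above (e.g. ets:delete_all_objects/1 when the API command asks for a reset), which is exactly the "reset to 0" behaviour Prometheus counters allow.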

gerhard added a commit that referenced this issue Feb 12, 2021
This is meant to address
#2783, which began as
#2781

This is just an experiment; don't get too excited. The first impression
is that the OpenTelemetry API is really hard to work with, and the
concepts involved don't make a lot of sense. We (+@mkuratczyk) will
continue as soon as we have dealt with
#2785

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this issue Jun 16, 2021
This way we can show how many messages were received via a
certain protocol (stream is the second real protocol besides the default
amqp091 one), as well as by queue type, which is something that many
people have requested for a really long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming messages is
non-destructive, and deep queue backlogs - think billions of messages -
are normal). Alerting and consumer scaling due to deep backlogs will now
work correctly, as we can distinguish between regular queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
#3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through.

We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, rabbitmq_global_, and
continue building on top of it. The idea is to build in parallel and
slowly transition to the new metrics, because the changes since streams
are semantically too big, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is the
least disruptive and... simple.

While at it, we removed redundant empty return value handling in the
channel; the function being called no longer returns that value.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho
(this is multiple commits squashed into one)

Next steps:
- Create a new PR and ask @mkuratczyk and @ansd for review - fresh 👀 👀
- This new PR closes #3045
- Back-port to 3.9.x as is (to my knowledge, this is the only feature
  missing before we can code freeze and cut an RC)
- Back-port parts of this to 3.8.x so that we can finally address
  #2783
- Fix & publish new version of the RabbitMQ-Overview Grafana dashboard
- Fix & finally publish the new RabbitMQ-Streams Grafana dashboard

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
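
The global counters approach described in this commit message can be illustrated, very roughly, with Erlang's counters module; the module name, label pairs, and storage via persistent_term below are assumptions made for the sake of the example, not the actual rabbitmq-server implementation:

```erlang
%% Rough illustration (assumed names) of node-global, increment-only
%% counters labelled by protocol and queue type, backed by the 'counters'
%% module and looked up through persistent_term.
-module(global_counters_sketch).
-export([init/1, messages_received/3, read/2]).

%% e.g. init([{amqp091, classic}, {stream, stream}])
init(ProtocolQueueTypePairs) ->
    lists:foreach(
      fun(Pair) ->
              persistent_term:put({?MODULE, Pair},
                                  counters:new(1, [write_concurrency]))
      end, ProtocolQueueTypePairs).

%% never decremented and never deleted, so a Prometheus collector can
%% expose each value directly as a monotonically increasing counter
messages_received(Protocol, QueueType, Incr) ->
    counters:add(persistent_term:get({?MODULE, {Protocol, QueueType}}), 1, Incr).

read(Protocol, QueueType) ->
    counters:get(persistent_term:get({?MODULE, {Protocol, QueueType}}), 1).
```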
gerhard added a commit that referenced this issue Jun 21, 2021
@karanlik

karanlik commented Jul 7, 2021

Hi All,

Apologies if I'm posting in the wrong place.

We're currently seeing behaviour similar to this issue: our "rabbitmq_channel_messages_published_total" metric (with aggregation enabled) decreases, which causes false alarms and misleading graphs. I guess this is some sort of bug, or at least something that does not play well with Prometheus. What is the proposed workaround or solution for this kind of situation? Switching to per-object metrics?
Once again, sorry if this is the wrong place to ask.
Thanks...

@gerhard
Contributor

gerhard commented Jul 7, 2021

The fix is already available in what will ship as RabbitMQ 3.9.0, via #3127.

You will notice that the PR has 5 more outstanding tasks, the next one being a back-port to v3.8.x. Once this is done, the issue will be fixed for the 3.8 release series via an update to the dashboards (also one of those 5 tasks).

Between RabbitMQ Summit - will you be joining us for today's pre-conference meetup @karanlik? 😉 - and a few other things in flight (3.9.0 being the most significant one by far), this back-port is going slower than I would like. The important take-away is that it's coming and it definitely fixes the issue (we have been testing it for weeks in our 3.9 long-running environment).

Hope that helps 👍🏻

@karanlik

karanlik commented Jul 7, 2021

Great! Thanks for your quick feedback. I'll give per-object metrics a try until this is available. Best regards. (PS: I've subscribed to the RabbitMQ Summit mailing list :-) )

@luos
Contributor

luos commented Nov 3, 2021

Hi,

Do you think this will get backported to 3.8.x?

Thanks

@MirahImage
Member

Closing, as this is available in 3.9 and 3.10, and 3.8 is in extended support.
