This repository has been archived by the owner on Nov 14, 2020. It is now read-only.

Option to aggregate channel, queue and connection metrics #28

Merged
merged 4 commits into master from metrics-aggregation on Feb 4, 2020

Conversation

@dcorbacho (Contributor) commented Jan 10, 2020

New option to aggregate channel, queue and connection metrics:

`prometheus.enable_metric_aggregation = true`

(exposed in the final configuration schema as `prometheus.return_per_object_metrics = false`)

Closes #26, see #24 and #25 for the background.
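As a minimal usage sketch (option names as they appear in this PR; the runtime toggle is the same rabbitmqctl command exercised in the testing below):

# rabbitmq.conf - aggregate instead of returning one metric per channel/queue/connection
prometheus.return_per_object_metrics = false

# or flip it at runtime on a running node, no restart needed
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).'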
@gerhard (Contributor) commented Jan 15, 2020

Picking this one up now.

@gerhard (Contributor) commented Jan 15, 2020

Given 1k queues on rmq2, scrape duration is 2-3s. Running rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).' on rmq2 brings it down to 250ms, similar to what the other nodes are doing:

[screenshot, 2020-01-15 18:57: scrape duration on rmq2 dropping from 2-3s to ~250ms after enabling metric aggregation]

RabbitMQ-Overview dashboard is not affected by these changes.

I will test the RabbitMQ-Quorum-Queues-Raft dashboard tomorrow; I expect a few changes will be needed there.

I will increase the number of queues all the way to 80k and see if this still holds. The last phase is to increase the number of connections & channels to 80k each and check whether this optimisation still holds.
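For reference, a sketch of how the before/after comparison can be reproduced against a single node; localhost and the default 15692 port are assumptions, and the two commands are the ones already used in this thread:

# scrape once with per-object metrics, toggle aggregation, then scrape again
curl -s -o /dev/null -w 'time_total:%{time_total} size_bytes:%{size_download}\n' http://localhost:15692/metrics
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).'
curl -s -o /dev/null -w 'time_total:%{time_total} size_bytes:%{size_download}\n' http://localhost:15692/metrics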

#24 (comment)

Before we can test the effectiveness of the fix in
#26 against a replica of the
environment in which this was initially reported, we are missing
the load app deployment that would generate all the connections and queues.

It would be helpful to know whether
https://github.com/coreos/kube-prometheus was used for the
Prometheus & Grafana deployment.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 3, 2020

I am picking this one up again, deploying 50k queues, 50k connections & 50k channels.

@gerhard (Contributor) commented Feb 4, 2020

Tested on:

gcloud compute instances create-with-container tgir-s01e01-gerhard-rmq1-server \
  --public-dns --boot-disk-type=pd-ssd --labels=namespace=rabbitmq-prometheus-28-gerhard --container-stdin --container-tty \
  --machine-type=n1-standard-32 \
  --create-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,size=200GB,type=pd-ssd,auto-delete=yes \
  --container-mount-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,mount-path=/var/lib/rabbitmq \
  --container-env RABBITMQ_ERLANG_COOKIE=rabbitmq-prometheus-28-gerhard \
  --container-image=pivotalrabbitmq/rabbitmq-prometheus:3.9.0-alpha.203-2020.02.03

with curl -s -o /dev/null -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' http://34.89.10.130:15692/metrics

50k queues, first without and then with metric aggregation enabled:

200 time_total:60.395880 size_bytes:62322787
200 time_total:1.023527 size_bytes:347759

When I had 50k connections on top of the 50k queues, the metrics request would time out after 60s:

000 time_total:60.105977 size_bytes:0

With metric aggregation enabled, and then additionally with -H "Accept-Encoding: gzip":

200 time_total:1.705673 size_bytes:348528
200 time_total:1.607329 size_bytes:21323

:shipit:

@gerhard gerhard merged commit 82fafae into master Feb 4, 2020
@gerhard gerhard deleted the metrics-aggregation branch February 4, 2020 12:43
gerhard added a commit that referenced this pull request Feb 5, 2020
Option to aggregate channel, queue and connection metrics

(cherry picked from commit 82fafae)
gerhard added a commit that referenced this pull request Feb 5, 2020
It behaves as a gauge when metrics are not aggregated, which is not what
we want. It's either a histogram or it's a gauge, but it doesn't change
type based on whether we are aggregating metrics or not. To be honest, I
rushed #28
acceptance and didn't check this metric type properly. Hoping to
pair up with @dcorbacho on this and get the histogram back. For now,
let's keep it a gauge and go forward with the metric aggregation
back-port into v3.8.x.

Because the change from millis to micros was a breaking change in
rabbitmq/ra#160, it was reverted, so we had to
fix it (missed the undefined in one of the merges - whoops) and revert to
millis.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 5, 2020

@dcorbacho can we pair-up on this tomorrow? 5caa419

gerhard added a commit that referenced this pull request Feb 6, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re #28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this pull request Feb 6, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re #28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit 3a24c4a)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 10, 2020

FWIW, 378da2f enables metrics aggregation by default; the reasoning is captured in the README. The follow-up commit is 8b0c7c4.
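A sketch of how to opt back into per-object metrics after this default change, assuming the config key from the PR description and that the app-env key matches it (the README referenced above is the authoritative source):

# rabbitmq.conf - return one metric per channel/queue/connection again
prometheus.return_per_object_metrics = true

# or at runtime
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, return_per_object_metrics, true).'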

pjk25 pushed a commit to rabbitmq/rabbitmq-monorepo that referenced this pull request Sep 9, 2020
It behaves as a gauge when metrics are not aggregated, which is not what
we want. It's either a histogram or it's a gauge, but it doesn't change
type based on whether we are aggregating metrics or not. To be honest, I
rushed rabbitmq/rabbitmq-prometheus#28
acceptance and didn't check this metric type properly. Hoping to
pair up with @dcorbacho on this and get the histogram back. For now,
let's keep it a gauge and go forward with the metric aggregation
back-port into v3.8.x.

Because the change from millis to micros was a breaking change in
rabbitmq/ra#160, it was reverted, so we had to
fix it (missed the undefined in one of the merges - whoops) and revert to
millis.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
pjk25 pushed a commit to rabbitmq/rabbitmq-monorepo that referenced this pull request Sep 9, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re rabbitmq/rabbitmq-prometheus#28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit 3a24c4a7b44e3cb4c1b60918c85052bf667a053e)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Successfully merging this pull request may close these issues.

A way to reduce the number of metric lines emitted