This repository has been archived by the owner on Nov 14, 2020. It is now read-only.

Option to aggregate channel, queue and connection metrics #28

Merged
merged 4 commits into master from metrics-aggregation on Feb 4, 2020

Conversation

@dcorbacho (Contributor) commented Jan 10, 2020

New option to aggregate channel, queue and connection metrics:

`prometheus.enable_metric_aggregation = true`

(exposed in the final configuration schema as `prometheus.return_per_object_metrics = false`)

Closes #26, see #24 and #25 for the background.
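As a minimal usage sketch (option names as they appear in this PR; the runtime toggle is the same rabbitmqctl command exercised in the testing below):

# rabbitmq.conf - aggregate instead of returning one metric per channel/queue/connection
prometheus.return_per_object_metrics = false

# or flip it at runtime on a running node, no restart needed
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).'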
@gerhard (Contributor) commented Jan 15, 2020

Picking this one up now.

@gerhard (Contributor) commented Jan 15, 2020

Given 1k queues on rmq2, scrape duration is 2-3s. Running rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).' on rmq2 brings it down to 250ms, similar to what the other nodes are doing:

[screenshot, 2020-01-15 18:57: scrape duration on rmq2 dropping from 2-3s to ~250ms after enabling metric aggregation]

RabbitMQ-Overview dashboard is not affected by these changes.

I will test the RabbitMQ-Quorum-Queues-Raft dashboard tomorrow; I expect a few changes will be needed there.

I will increase the number of queues all the way to 80k and see if this still holds. The last phase is to increase the number of connections & channels to 80k each and check whether this optimisation still holds.
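For reference, a sketch of how the before/after comparison can be reproduced against a single node; localhost and the default 15692 port are assumptions, and the two commands are the ones already used in this thread:

# scrape once with per-object metrics, toggle aggregation, then scrape again
curl -s -o /dev/null -w 'time_total:%{time_total} size_bytes:%{size_download}\n' http://localhost:15692/metrics
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).'
curl -s -o /dev/null -w 'time_total:%{time_total} size_bytes:%{size_download}\n' http://localhost:15692/metrics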

#24 (comment)

Before we can test the effectiveness of the fix in
#26 against a replica of the
environment in which this was initially reported, we are missing
the load app deployment that would generate all the connections and queues.

It would be helpful to know whether
https://github.com/coreos/kube-prometheus was used for the
Prometheus & Grafana deployment.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 3, 2020

I am picking this one up again, deploying 50k queues, 50k connections & 50k channels.

@gerhard (Contributor) commented Feb 4, 2020

Tested on:

gcloud compute instances create-with-container tgir-s01e01-gerhard-rmq1-server \
  --public-dns --boot-disk-type=pd-ssd --labels=namespace=rabbitmq-prometheus-28-gerhard --container-stdin --container-tty \
  --machine-type=n1-standard-32 \
  --create-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,size=200GB,type=pd-ssd,auto-delete=yes \
  --container-mount-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,mount-path=/var/lib/rabbitmq \
  --container-env RABBITMQ_ERLANG_COOKIE=rabbitmq-prometheus-28-gerhard \
  --container-image=pivotalrabbitmq/rabbitmq-prometheus:3.9.0-alpha.203-2020.02.03

with curl -s -o /dev/null -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' http://34.89.10.130:15692/metrics

50k queues, first without and then with metric aggregation enabled:

200 time_total:60.395880 size_bytes:62322787
200 time_total:1.023527 size_bytes:347759

When I had 50k connections on top of the 50k queues, the metrics request would time out after 60s:

000 time_total:60.105977 size_bytes:0

With metric aggregation enabled, and then additionally with -H "Accept-Encoding: gzip":

200 time_total:1.705673 size_bytes:348528
200 time_total:1.607329 size_bytes:21323

:shipit:

@gerhard gerhard merged commit 82fafae into master Feb 4, 2020
@gerhard gerhard deleted the metrics-aggregation branch February 4, 2020 12:43
gerhard added a commit that referenced this pull request Feb 5, 2020
Option to aggregate channel, queue and connection metrics

(cherry picked from commit 82fafae)
gerhard added a commit that referenced this pull request Feb 5, 2020
It behaves as a gauge when metrics are not aggregated, which is not what
we want. It's either a histogram or it's a gauge, but it doesn't change
type based on whether we are aggregating metrics or not. To be honest, I
rushed #28
acceptance and didn't check this metric type properly. Hoping to
pair up with @dcorbacho on this and get the histogram back. For now,
let's keep it a gauge and go forward with the metric aggregation
back-port into v3.8.x.

Because the change from millis to micros was a breaking change in
rabbitmq/ra#160, it was reverted, so we had to
fix it (missed the undefined in one of the merges - whoops) and revert to
millis.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 5, 2020

@dcorbacho can we pair-up on this tomorrow? 5caa419

gerhard added a commit that referenced this pull request Feb 6, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re #28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this pull request Feb 6, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re #28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit 3a24c4a)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard (Contributor) commented Feb 10, 2020

FWIW, 378da2f enables metrics aggregation by default; the reasoning is captured in the README. The follow-up commit is 8b0c7c4.
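A sketch of how to opt back into per-object metrics after this default change, assuming the config key from the PR description and that the app-env key matches it (the README referenced above is the authoritative source):

# rabbitmq.conf - return one metric per channel/queue/connection again
prometheus.return_per_object_metrics = true

# or at runtime
rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, return_per_object_metrics, true).'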

pjk25 pushed a commit to rabbitmq/rabbitmq-monorepo that referenced this pull request Sep 9, 2020
It behaves as a gauge when metrics are not aggregated, which is not what
we want. It's either a histogram or it's a gauge, but it doesn't change
type based on whether we are aggregating metrics or not. To be honest, I
rushed rabbitmq/rabbitmq-prometheus#28
acceptance and didn't check this metric type properly. Hoping to
pair up with @dcorbacho on this and get the histogram back. For now,
let's keep it a gauge and go forward with the metric aggregation
back-port into v3.8.x.

Because the change from millis to micros was a breaking change in
rabbitmq/ra#160, it was reverted, so we had to
fix it (missed the undefined in one of the merges - whoops) and revert to
millis.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
pjk25 pushed a commit to rabbitmq/rabbitmq-monorepo that referenced this pull request Sep 9, 2020
We want to keep the same metric type regardless of whether we aggregate or
not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue, which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re rabbitmq/rabbitmq-prometheus#28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit 3a24c4a7b44e3cb4c1b60918c85052bf667a053e)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Successfully merging this pull request may close these issues.

A way to reduce the number of metric lines emitted