Prepare a Grafana Dashboard for `internal_metrics` gathered via Prometheus #4838

MOZGIII · 2020-11-02T11:05:16Z

Motivation

In Kubernetes, Prometheus is usually used to gather metrics, and Grafana is used alongside it to view them.

We want to provide an option to expose our internal metrics (the internal_metrics source) via the prometheus metrics sink out of the box when deploying Vector into a Kubernetes environment (#3799). A natural extension to that would be shipping a Grafana dashboard out of the box as well.

The end goal is that when a user deploys Vector with internal metrics enabled, a dashboard with all Vector metrics immediately appears in Grafana. On a preconfigured cluster (one where Grafana dashboard gathering is enabled), this should require zero configuration besides opting in to exposing internal_metrics and picking how to hook into Prometheus scraping (Prometheus-native annotations or a prometheus-operator-powered PodMonitor/ServiceMonitor).
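
As a rough illustration of what that hook-up could involve, here is a minimal Python sketch that builds the two Kubernetes fragments mentioned above: the Prometheus-native scrape annotations and a ConfigMap carrying the dashboard JSON. The `prometheus.io/*` annotations are a widely used convention rather than something Prometheus enforces. The `grafana_dashboard: "1"` label is assumed to be what the cluster's Grafana dashboard sidecar watches for. The function names and port 9598 are hypothetical; none of this is taken from the Vector Helm charts.

```python
import json


def scrape_annotations(port, path="/metrics"):
    """Pod annotations for annotation-based Prometheus scraping (a convention)."""
    return {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": str(port),
        "prometheus.io/path": path,
    }


def dashboard_configmap(name, dashboard, label_key="grafana_dashboard", label_value="1"):
    """Wrap a dashboard dict in a ConfigMap that a dashboard sidecar could pick up.

    The label key/value are assumptions; match them to the cluster's sidecar config.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "labels": {label_key: label_value}},
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be fed to `kubectl apply -f -`.
    print(json.dumps(dashboard_configmap("vector-dashboard", {"title": "Vector"}), indent=2))
    print(json.dumps(scrape_annotations(9598), indent=2))
```

With prometheus-operator, a PodMonitor or ServiceMonitor would replace the annotations; the ConfigMap part would stay the same.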

Design

There are a few unknowns so far:

What metrics to include in the dashboard?
- It would definitely make sense to expose processed_events_total. But what else?
- We could probably build a very advanced and sophisticated dashboard covering all component types, with specialized UIs per component type - but it might be hard to maintain, and we'll have our own UI soon - so is exposing everything worth it?
- Even if we ship our own UI - we should provide at least a bare-bones dashboard for Grafana.
How to organize it (in the context of Helm charts)?
- Add a specialized dashboard for each of the vector-agent, vector-aggregator, etc charts?
- Add a single vector-grafana-dashboard chart with a common dashboard, and make other charts depend on it?
Basically, either way can work, and we can pick one after we figure out how we want to architect the dashboard itself. Some design constraints may be dictated by the Helm charts layout too, so additional investigation is necessary here.

We should reuse the dashboards from https://github.com/timberio/vector-grafana and make them work great for this use case. This should reduce the estimate, but we need to discuss it.

spencergilbert · 2020-11-02T16:30:08Z

I think in a Grafana only world - ignoring the Vector UI - you could have a "Vector System" dashboard with metrics from everything, and have builtin drillins/links to agent/aggregator dashboards with details specific to them.

I don't know how different the metrics would be between those though, it might make more sense for it to be a selector on a single dashboard. Default show all, drop down to change that to agent or aggregator, additional drop down to select a single instance.
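
A minimal sketch of the single-dashboard variant, written as a Python script that emits Grafana dashboard JSON with two chained drop-downs (role, then instance), both defaulting to "All". The `vector_role` label is hypothetical; the `pod` and `component_name` labels and the `vector_events_in_total` metric are borrowed from queries posted later in this thread and may not match every setup.

```python
import json


def build_dashboard(title, role_label="vector_role", instance_label="pod"):
    """Return a dashboard dict with 'role' and 'instance' template variables."""
    show_all = {"text": "All", "value": "$__all"}
    variables = [
        {
            "name": "role",
            "label": "Role",
            "type": "query",
            # label_values() is the Prometheus data source's variable query helper.
            "query": f"label_values(vector_events_in_total, {role_label})",
            "includeAll": True,
            "allValue": ".*",
            "multi": False,
            "current": show_all,
        },
        {
            "name": "instance",
            "label": "Instance",
            "type": "query",
            # Chained variable: only list instances of the selected role.
            "query": (
                f'label_values(vector_events_in_total{{{role_label}=~"$role"}}, '
                f"{instance_label})"
            ),
            "includeAll": True,
            "allValue": ".*",
            "multi": False,
            "current": show_all,
        },
    ]
    selector = f'{role_label}=~"$role", {instance_label}=~"$instance"'
    panel = {
        "title": "Events in by component",
        "type": "timeseries",
        "targets": [
            {
                "expr": "sum(rate(vector_events_in_total{" + selector + "}[5m])) by (component_name)",
                "legendFormat": "{{component_name}}",
            }
        ],
    }
    return {"title": title, "templating": {"list": variables}, "panels": [panel]}


if __name__ == "__main__":
    print(json.dumps(build_dashboard("Vector System"), indent=2))
```

Every panel that filters on `$role` and `$instance` then follows the two selectors, which avoids maintaining separate agent and aggregator dashboards as long as their metrics overlap.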

I'd be happy to work on/collaborate on this since I expect we'll primarily use Grafana for viz.

nivekuil · 2020-12-03T15:52:18Z

Hi, I noticed that 0.11 added a host metrics source, which I think is capable of replacing node_exporter. I would love to do so but having this popular dashboard https://grafana.com/grafana/dashboards/1860 available out of the box is really nice. I think it would help adoption of that source in particular if there were a similarly provided dashboard for vector.

jamtur01 · 2020-12-03T15:53:38Z

@nivekuil Thanks! That's on our agenda to look at soon!

MOZGIII · 2020-12-22T00:25:31Z

We've set up the k8s dev environment that unblocks this.

The only remaining caveat for working on the dashboard is potential data loss until we implement vectordotdev/vector-k8s-dev-env#7. Until cluster-wide backups are in place, please back up your Grafana dashboards manually, in a way that would survive removal of the whole EKS cluster (i.e. keep a local copy of the dashboard JSON).
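
One possible way to script that manual backup, as a sketch only: fetch the dashboard through Grafana's HTTP API (`GET /api/dashboards/uid/<uid>`) and write the `dashboard` object to a local file. The base URL, UID, and token are placeholders you would supply; the helper names are hypothetical and not part of the dev environment.

```python
import json
import urllib.request
from pathlib import Path


def fetch_dashboard(base_url, uid, token):
    """Fetch one dashboard by UID; the response wraps it as {"meta": ..., "dashboard": ...}."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def save_dashboard(payload, directory):
    """Write the dashboard JSON to <directory>/<uid>.json and return the path."""
    dashboard = payload["dashboard"]
    path = Path(directory) / f"{dashboard['uid']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dashboard, indent=2, sort_keys=True))
    return path


# Example (placeholders, not real values):
# payload = fetch_dashboard("http://localhost:3000", "<dashboard-uid>", "<api-token>")
# print(save_dashboard(payload, "grafana-backups"))
```

Sorting the keys keeps the saved files diff-friendly if they are committed to a repository.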

zadunn · 2021-01-20T19:12:21Z

> I think in a Grafana only world - ignoring the Vector UI - you could have a "Vector System" dashboard with metrics from everything, and have builtin drillins/links to agent/aggregator dashboards with details specific to them.
>
> I don't know how different the metrics would be between those though, it might make more sense for it to be a selector on a single dashboard. Default show all, drop down to change that to agent or aggregator, additional drop down to select a single instance.
>
> I'd be happy to work on/collaborate on this since I expect we'll primarily use Grafana for viz.

I really like this idea. It would close a feature gap we found in our recent comparison of Vector and Fluent Bit, as Fluent Bit has a Grafana dashboard.

bygui86 · 2021-01-23T18:25:20Z

Taking inspiration from the node_exporter dashboard is a good idea!
But as the Vector dashboard has to be created from scratch, I think it would be a good idea to base it on the latest Grafana version (7.x) in order to benefit from new features!

zadunn · 2021-01-25T14:12:01Z

I think that a good part of the work here would be identifying what metrics need to be displayed to understand the health and performance of Vector in its different roles. For our use case, we are principally looking to have Vector act as a log shipper on Kubernetes nodes to S3. We are going to start going down the route of figuring out what metrics make sense in this case, but if anyone has any suggestions I'd love to get their insights. Once we've figured something out, I'll update this issue either way.

spencergilbert · 2021-01-25T15:40:04Z

I think having a single dashboard to cover all usecases might be tricky. Including basic things about memory usage/cpu usage/disk usage for buffer(?) would be easy enough.

Most/all stages should have a similar bytes/events processed (or at least there are outstanding issues to standardize some of the generic metrics)

zadunn · 2021-01-25T15:43:37Z

Right, I was sort of envisioning something like three dashboards to cover the three general use cases of shipping, conversion, and indexing. IDK exactly what any of that would look like yet :D

spencergilbert · 2021-01-25T18:12:00Z

> Right, I was sort of envisioning something like three dashboards to cover the three general use cases of shipping, conversion, and indexing. IDK exactly what any of that would look like yet :D

I think there still might be some awkwardness if there's a need for metrics that may be unique to particular sinks/sources

bygui86 · 2021-01-26T07:58:05Z

In the next couple of weeks I will try to create a prototype :)

Relating also to feature request #5363, can someone give me an example of host_metrics?
I'm not sure whether host_metrics are also exported by Vector or require, for example, the prometheus-operator infrastructure (e.g. node_exporter).

fpytloun · 2021-06-22T12:06:00Z

Some searches that I am using:

input events by source

sum(rate(vector_events_in_total{component_kind="source"}[5m])) by (component_name)

output events by sink

sum(rate(vector_events_out_total{component_kind="sink"}[5m])) by (component_name)

Kafka consumed messages by instance

sum(rate(vector_kafka_consumed_messages_total[5m])) by (pod)

Kafka consumed bytes by instance

sum(rate(vector_kafka_consumed_messages_bytes_total[5m])) by (pod)
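
To sanity-check queries like these outside Grafana, here is a small Python sketch that collects them in one place and builds URLs for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The base URL is a placeholder and the helper name is hypothetical; the metric and label names are simply copied from the queries above and depend on the Vector version and scrape setup.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# PromQL expressions copied from the comment above, keyed by panel title.
QUERIES = {
    "input events by source": 'sum(rate(vector_events_in_total{component_kind="source"}[5m])) by (component_name)',
    "output events by sink": 'sum(rate(vector_events_out_total{component_kind="sink"}[5m])) by (component_name)',
    "Kafka consumed messages by instance": "sum(rate(vector_kafka_consumed_messages_total[5m])) by (pod)",
    "Kafka consumed bytes by instance": "sum(rate(vector_kafka_consumed_messages_bytes_total[5m])) by (pod)",
}


def instant_query_url(base_url, expr):
    """Build a URL for Prometheus's instant query API with the expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for title, expr in QUERIES.items():
        print(title)
        print("  " + instant_query_url("http://localhost:9090", expr))
```

The same dictionary could feed a dashboard generator, so the query text lives in one reviewed place instead of being pasted into panels by hand.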

There are some more use-cases that I have but there are no related metrics or I don't understand existing ones:

- monitor adaptive concurrency (there are some metrics but I don't know how to read them yet)
- monitor buffer size (I didn't find metrics for this 😞)
- monitor Kafka topics lag by consumed vs. latest offset (kafka sink doesn't expose these metrics 😞)

spencergilbert · 2021-06-22T12:39:23Z

> monitor adaptive concurrency (there are some metrics but I don't know how to read them yet)

Related: #7971

> monitor buffer size (I didn't find metrics for this 😞)

Related: #7185 #5654

> monitor Kafka topics lag by consumed vs. latest offset (kafka sink doesn't expose these metrics 😞)

I suspect today you'd need a third-party exporter (kafka-exporter, kminion, lag-exporter). Though given how frequently Kafka seems to be used in Vector setups, it may be nice to have a kafka_metrics source to expose those without an additional application.

~~I don't see an issue yet but I'll open one for that~~ @fpytloun I actually found this one #870

fpytloun · 2021-06-24T09:19:27Z

@spencergilbert thank you for the references.

Regarding monitoring offset lag, I think it makes more sense to have this metric on the client side (e.g. Kafka Connect also exposes this metric as a Kafka client). A similar case might also apply to other sources (e.g. AWS SQS). To make it simple to determine queue status independently of the source, it could be a single metric for all sources.

bengoldenberg · 2022-03-27T16:10:18Z

Do we have any update regarding a full dashboard?

zamazan4ik · 2022-09-08T12:49:04Z

@jszwedko Is anyone from the Vector team working on this? Do you need help here?

spencergilbert · 2022-09-08T12:58:09Z

> @jszwedko Is anyone from the Vector team working on this? Do you need help here?

Hey @zamazan4ik - this issue isn't currently on our roadmap, but we'd be happy to review a community-built dashboard.

mjperrone · 2022-09-09T17:43:45Z

I'd love to see this for datadog

spencergilbert · 2022-09-09T17:46:17Z

> I'd love to see this for datadog

Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's a Vector integration (dashboard, monitors, metrics)!

mjperrone · 2022-09-09T17:59:32Z

> > I'd love to see this for datadog
>
> Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's a Vector integration (dashboard, monitors, metrics)!

That's exciting @spencergilbert. I can't wait!

zamazan4ik · 2022-09-09T18:01:55Z

> > I'd love to see this for datadog
>
> Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's a Vector integration (dashboard, monitors, metrics)!

Are you working on an example dashboard for Vector? Just curious whether the dashboard you're working on right now could be useful for internal_metrics.

spencergilbert · 2022-09-09T19:52:05Z

> Are you working on an example dashboard for Vector? Just curious whether the dashboard you're working on right now could be useful for internal_metrics.

@zamazan4ik I'm not 100% sure what you're asking, but the dashboard is currently populated by metrics available from the internal_metrics source.

zamazan4ik · 2022-09-11T16:22:53Z

I have written an initial (early work in progress) Grafana dashboard here: #14369

Please check it and leave your thoughts about it there.

gaby · 2023-04-16T08:34:19Z

Has there been any traction with this issue?

bruceg · 2023-04-19T20:32:16Z

We would prefer to have this kind of work live in a separate community repository. As such, this issue will only be resolved there.

Ref: #14369 (comment)

gaby · 2023-04-20T02:10:07Z

@bruceg Creating vectordotdev/vector-community has been mentioned since Oct 2022, still no action. Is there a timeline for this?

bruceg · 2023-04-20T23:54:42Z

No, there is not.

MOZGIII added the `type: task` (Generic non-code related tasks), `domain: administration` (Anything related to administration/operation), `domain: metrics` (Anything related to Vector's metrics events), and `platform: kubernetes` (Anything `kubernetes` platform related) labels Nov 2, 2020

MOZGIII mentioned this issue Nov 2, 2020

Add an option to expose internal metrics via Prometheus to the Helm Chart #3799

Closed

MOZGIII self-assigned this Nov 20, 2020

MOZGIII added this to the 2020-11-23: Pseudo-chitin armor milestone Nov 20, 2020

MOZGIII mentioned this issue Nov 27, 2020

fix(observability): Aggregate metrics #5174

Closed

lucperkins mentioned this issue Nov 28, 2020

enhancement(config): Generate Grafana dashboards using CUE #5263

Closed

MOZGIII mentioned this issue Dec 3, 2020

Prepare a Grafana Dashboard for host_metrics gathered via Prometheus #5363

Open

MOZGIII removed this from the 2020-11-23: Pseudo-chitin armor milestone Dec 4, 2020

binarylogic unassigned MOZGIII Feb 24, 2021

zamazan4ik linked a pull request Sep 11, 2022 that will close this issue

feat(grafana): add initial grafana dashboard #14369

Draft
