Prepare a Grafana Dashboard for internal_metrics gathered via Prometheus #4838

Open
MOZGIII opened this issue Nov 2, 2020 · 27 comments · May be fixed by #14369
Labels
domain: administration Anything related to administration/operation domain: metrics Anything related to Vector's metrics events platform: kubernetes Anything `kubernetes` platform related type: task Generic non-code related tasks

Comments

@MOZGIII
Contributor

MOZGIII commented Nov 2, 2020

Motivation

In Kubernetes, Prometheus is usually used to gather metrics, and Grafana is used alongside it to view them.

We want to provide an option to expose our internal metrics (internal_metrics source) via the prometheus metrics sink out of the box when deploying Vector into a Kubernetes environment (#3799), and a natural extension of that would be shipping a Grafana dashboard out of the box as well.

The end goal is that when a user deploys Vector with internal metrics enabled, a dashboard with all Vector metrics immediately appears in Grafana. This should require zero configuration (on a preconfigured cluster where Grafana dashboard gathering is enabled) besides opting in to exposing internal_metrics and picking the way to hook into Prometheus scraping (prometheus-native annotations or a prometheus-operator-powered PodMonitor/ServiceMonitor).
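The two scraping hooks mentioned above could look roughly like the following. This is only a sketch: the port 9598, the `metrics` port name, the `vector-agent` name, and the `app.kubernetes.io/name` label are assumptions for illustration, and the `prometheus.io/*` annotations are a convention that works only if the cluster's Prometheus scrape config is set up to honor them.

```python
import json

# Assumption: the port on which Vector's Prometheus sink is exposed.
# Adjust this to whatever address the sink is configured with.
METRICS_PORT = 9598


def scrape_annotations(port: int = METRICS_PORT, path: str = "/metrics") -> dict:
    """Pod annotations following the common prometheus.io/* convention."""
    return {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": str(port),
        "prometheus.io/path": path,
    }


def pod_monitor(name: str, match_labels: dict, port_name: str = "metrics") -> dict:
    """A PodMonitor manifest for clusters that run prometheus-operator."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name},
        "spec": {
            # Select the Vector pods by label (hypothetical label below).
            "selector": {"matchLabels": match_labels},
            # `port` refers to a named container port on the pod.
            "podMetricsEndpoints": [{"port": port_name, "path": "/metrics"}],
        },
    }


annotations = scrape_annotations()
monitor = pod_monitor("vector-agent", {"app.kubernetes.io/name": "vector-agent"})
manifest_json = json.dumps(monitor, indent=2)
```

A chart would presumably render only one of the two, depending on which hook the user picks.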

Design

There are a few unknowns so far:

  1. What metrics to include in the dashboard?

    • It would definitely make sense to expose processed_event_total. But what else?
    • We could probably build a very advanced and sophisticated dashboard for all of the component types with specialized UIs per component type, but it might be hard to maintain, and we'll have our own UI soon. So is exposing everything worth it?
    • Even if we ship our own UI - we should provide at least a bare-bones dashboard for Grafana.
  2. How to organize it (in the context of Helm charts)?

    • Add a specialized dashboard for each of the vector-agent, vector-aggregator, etc charts?
    • Add a single vector-grafana-dashboard chart with a common dashboard, and make other charts depend on it?

    Basically, we can go either way, and we can pick which one to use after we figure out how we want to architect the dashboard itself. Some design constraints may be dictated by the Helm charts layout too, so additional investigation is necessary here.


We should reuse the dashboard from https://github.com/timberio/vector-grafana and make it work great for this use case. This should reduce the estimate, but we need to discuss it.

@MOZGIII MOZGIII added type: task Generic non-code related tasks domain: administration Anything related to administration/operation domain: metrics Anything related to Vector's metrics events platform: kubernetes Anything `kubernetes` platform related labels Nov 2, 2020
@spencergilbert
Contributor

I think in a Grafana only world - ignoring the Vector UI - you could have a "Vector System" dashboard with metrics from everything, and have builtin drillins/links to agent/aggregator dashboards with details specific to them.

I don't know how different the metrics would be between those though, it might make more sense for it to be a selector on a single dashboard. Default show all, drop down to change that to agent or aggregator, additional drop down to select a single instance.
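A minimal sketch of what such chained drop-downs might look like in a dashboard's JSON model, built here with Python. It assumes that agents and aggregators are scraped under different Prometheus `job` labels and that `vector_events_in_total` (a metric quoted later in this thread) is present; both are assumptions, not something the charts guarantee.

```python
def label_variable(name: str, label: str, metric: str, depends_on: str = "") -> dict:
    """Grafana query variable listing the values of `label` found on `metric`."""
    # Optional selector so that one drop-down narrows the next one.
    selector = f"{{{depends_on}}}" if depends_on else ""
    return {
        "type": "query",
        "name": name,
        "query": f"label_values({metric}{selector}, {label})",
        "includeAll": True,   # adds an "All" option to the drop-down
        "multi": False,
        "refresh": 1,         # re-run the query when the dashboard loads
    }


# First drop-down: which deployment (assumed to map to the `job` label).
# Second drop-down: a single instance within the selected deployment.
templating = {
    "list": [
        label_variable("job", "job", "vector_events_in_total"),
        label_variable("instance", "instance", "vector_events_in_total", 'job=~"$job"'),
    ]
}
```

Panels would then filter on `{job=~"$job", instance=~"$instance"}` in their queries.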

I'd be happy to work on/collaborate on this since I expect we'll primarily use Grafana for viz.

@nivekuil

nivekuil commented Dec 3, 2020

Hi, I noticed that 0.11 added a host metrics source, which I think is capable of replacing node_exporter. I would love to do so but having this popular dashboard https://grafana.com/grafana/dashboards/1860 available out of the box is really nice. I think it would help adoption of that source in particular if there were a similarly provided dashboard for vector.

@jamtur01
Contributor

jamtur01 commented Dec 3, 2020

@nivekuil Thanks! That's on our agenda to look at soon!

@MOZGIII
Contributor Author

MOZGIII commented Dec 22, 2020

We've set up the k8s dev environment that unblocks this.

The only remaining caveat for working on the dashboard is potential data loss until we implement vectordotdev/vector-k8s-dev-env#7. So please keep manual backups of your Grafana dashboards that would survive removal of the whole EKS cluster (i.e. a local copy of the dashboard JSON) until we implement cluster-wide backups.

@zadunn

zadunn commented Jan 20, 2021

I think in a Grafana only world - ignoring the Vector UI - you could have a "Vector System" dashboard with metrics from everything, and have builtin drillins/links to agent/aggregator dashboards with details specific to them.

I don't know how different the metrics would be between those though, it might make more sense for it to be a selector on a single dashboard. Default show all, drop down to change that to agent or aggregator, additional drop down to select a single instance.

I'd be happy to work on/collaborate on this since I expect we'll primarily use Grafana for viz.

I really like this idea. It would close a feature gap we found in our recent comparison of Vector and Fluent Bit, as Fluent Bit has a Grafana dashboard.

@bygui86

bygui86 commented Jan 23, 2021

Taking inspiration from the node_exporter dashboard is a good idea!
But as the Vector dashboard has to be created from scratch, I think it would be a good idea to base it on the latest Grafana version (7.x) in order to benefit from the new features!

@zadunn

zadunn commented Jan 25, 2021 via email

@spencergilbert
Contributor

I think having a single dashboard to cover all usecases might be tricky. Including basic things about memory usage/cpu usage/disk usage for buffer(?) would be easy enough.

Most/all stages should have a similar bytes/events processed (or at least there are outstanding issues to standardize some of the generic metrics)

@zadunn

zadunn commented Jan 25, 2021 via email

@spencergilbert
Contributor

Right, I was sort of envisioning something like three dashboards to cover the three general use cases of shipping, conversation, and indexing. IDK exactly what any of that would look like yet :D

I think there still might be some awkwardness if there's a need for metrics that may be unique to particular sinks/sources

@bygui86

bygui86 commented Jan 26, 2021

In the next couple of weeks I will try to create a prototype :)

Regarding feature request #5363 as well, can someone give me an example of host_metrics?
I'm not sure whether host_metrics are also exported by Vector or require, for example, the prometheus-operator infrastructure (e.g. node_exporter).

@fpytloun
Contributor

fpytloun commented Jun 22, 2021

Some queries that I am using:

  • input events by source
sum(rate(vector_events_in_total{component_kind="source"}[5m])) by (component_name)
  • output events by sink
sum(rate(vector_events_out_total{component_kind="sink"}[5m])) by (component_name)
  • Kafka consumed messages by instance
sum(rate(vector_kafka_consumed_messages_total[5m])) by (pod)
  • Kafka consumed bytes by instance
sum(rate(vector_kafka_consumed_messages_bytes_total[5m])) by (pod)
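As a rough illustration, the four queries above could be turned into a bare-bones dashboard JSON along these lines. The panel type `timeseries`, the two-column layout, and the dashboard title are assumptions, and a real importable dashboard would likely need more fields (data source, schema version, and so on).

```python
import json

# The PromQL expressions are copied verbatim from the list above.
QUERIES = {
    "Input events by source":
        'sum(rate(vector_events_in_total{component_kind="source"}[5m])) by (component_name)',
    "Output events by sink":
        'sum(rate(vector_events_out_total{component_kind="sink"}[5m])) by (component_name)',
    "Kafka consumed messages by instance":
        "sum(rate(vector_kafka_consumed_messages_total[5m])) by (pod)",
    "Kafka consumed bytes by instance":
        "sum(rate(vector_kafka_consumed_messages_bytes_total[5m])) by (pod)",
}


def build_panel(index: int, title: str, expr: str) -> dict:
    """One panel per query, laid out two per row on Grafana's 24-column grid."""
    return {
        "id": index + 1,
        "type": "timeseries",  # assumption: older Grafana versions use "graph"
        "title": title,
        "gridPos": {"h": 8, "w": 12, "x": 12 * (index % 2), "y": 8 * (index // 2)},
        "targets": [{"expr": expr, "refId": "A"}],
    }


def build_dashboard(title: str, queries: dict) -> dict:
    """Assemble a minimal dashboard model from a title-to-query mapping."""
    panels = [build_panel(i, t, e) for i, (t, e) in enumerate(queries.items())]
    return {"title": title, "panels": panels}


dashboard = build_dashboard("Vector internal metrics", QUERIES)
dashboard_json = json.dumps(dashboard, indent=2)
```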

There are some more use-cases that I have but there are no related metrics or I don't understand existing ones:

  • monitor adaptive concurrency (there are some metrics but I don't know how to read them yet)
  • monitor buffer size (I didn't find metrics for this 😞)
  • monitor Kafka topics lag by consumed vs. latest offset (kafka sink doesn't expose these metrics 😞)

@spencergilbert
Contributor

spencergilbert commented Jun 22, 2021

  • monitor adaptive concurrency (there are some metrics but I don't know how to read them yet)

Related: #7971

  • monitor buffer size (I didn't find metrics for this 😞)

Related: #7185 #5654

  • monitor Kafka topics lag by consumed vs. latest offset (kafka sink doesn't expose these metrics 😞)

I suspect today you'd need a third-party exporter (kafka-exporter, kminion, lag-exporter). Though given how frequently Kafka seems to be used in Vector setups, it may be nice to have a kafka_metrics source to expose those without an additional application.

I don't see an issue yet but I'll open one for that @fpytloun I actually found this one #870

@fpytloun
Contributor

fpytloun commented Jun 24, 2021

@spencergilbert thank you for references.

Regarding monitoring offset lag, I think it makes more sense to have this metric on the client side (e.g. Kafka Connect also exposes this metric as a Kafka client). A similar case might also apply to other sources (e.g. AWS SQS). To make it simple to determine the status of a queue independently of the source, it could be a single metric for all sources.

@bengoldenberg

Do we have any update regarding a full dashboard?

@zamazan4ik
Contributor

@jszwedko Is anyone from the Vector team working on this? Do you need help here?

@spencergilbert
Contributor

@jszwedko Is anyone from the Vector team working on this? Do you need help here?

Hey @zamazan4ik - this issue isn't currently on our roadmap, but we'd be happy to review a community built dashboard.

@mjperrone
Contributor

I'd love to see this for datadog

@spencergilbert
Contributor

I'd love to see this for datadog

Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's Vector Integration (dashboard, monitors, metrics)!

@mjperrone
Contributor

I'd love to see this for datadog

Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's Vector Integration (dashboard, monitors, metrics)!

That's exciting @spencergilbert. I can't wait!

@zamazan4ik
Contributor

I'd love to see this for datadog

Hey @mjperrone we're actually working on that right now, hopefully it won't be long until there's Vector Integration (dashboard, monitors, metrics)!

Are you working on an example dashboard for Vector? Just curious: could the dashboard you are working on right now be useful for internal_metrics as well?

@spencergilbert
Contributor

Are you working on an example dashboard for Vector? Just curious: could the dashboard you are working on right now be useful for internal_metrics as well?

@zamazan4ik I'm not 100% sure what you're asking, but the dashboard is currently populated by metrics available from the internal_metrics source.

@zamazan4ik zamazan4ik linked a pull request Sep 11, 2022 that will close this issue
@zamazan4ik
Contributor

I have written an initial (early work in progress) Grafana dashboard here: #14369

Please check it and leave your thoughts about it there.

@gaby

gaby commented Apr 16, 2023

Has there been any traction with this issue?

@bruceg
Member

bruceg commented Apr 19, 2023

We would prefer to have this kind of work live in a separate community repository. As such, this issue will only be resolved there.

Ref: #14369 (comment)

@gaby

gaby commented Apr 20, 2023

@bruceg Creating vectordotdev/vector-community has been mentioned since Oct 2022, still no action. Is there a timeline for this?

@bruceg
Member

bruceg commented Apr 20, 2023

No, there is not.
