feat(grafana): add initial grafana dashboard #14369

Draft · wants to merge 1 commit into master

Conversation

zamazan4ik
Contributor

@zamazan4ik commented Sep 11, 2022

Hi!

In this PR I want to provide an initial Grafana dashboard for Vector. It resolves #4838.

Below I explain several choices I made while working on the dashboard. I want to gather feedback from the maintainers and other people before taking further steps with the dashboard. The current state of this dashboard is an early draft.

Implementation details

Inspiration

As an example, I used the Node Exporter dashboard: https://grafana.com/grafana/dashboards/1860-node-exporter-full/ (as mentioned here: #4838 (comment)).

Which metrics should be tracked on the dashboard?

I think the base dashboard should cover as many metrics as we can. Users can choose which panels are relevant to their particular cases and remove or disable unneeded panels on their own. Also, if we can identify common use cases from Vector users, we should include those in the dashboard as well. E.g. see this comment.

Which Grafana/Prometheus versions should be supported?

I am not a Grafana/Prometheus guru, so I do not know much about the differences between versions. I suggest that at least the first version target the latest available Grafana/Prometheus features. Later, if Vector users express a desire for backporting to older versions, we can think about it.

Rows on the dashboard

I have grouped related panels into rows (right now there are Kafka, Kubernetes, and SQS groups as examples; the General group is for metrics without an explicit group). This should help users navigate the dozens of metrics in a more guided way.

Filtering capabilities

This is an interesting topic to discuss. You can see the following variables added to the dashboard:

  • DS_PROMETHEUS - a hack I found in the Node Exporter dashboard. It allows selecting different data sources for the dashboard. Seems to be kinda useful, I guess.
  • component_* (id, type, kind) - since almost all metrics have these labels, I think it would be useful to be able to filter dashboards via them (see the variable sketch after this list).
  • host, instance - the same reasoning as for the component_* group: some metrics have these labels, so it will be convenient to filter over them.
  • job - an additional label added by Prometheus (actually VictoriaMetrics, since I use it locally) during configuration. It simply adds the label to all metrics from the data source. Maybe it would be a good idea to eliminate the DS_PROMETHEUS variable and just keep job?
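A rough sketch of how these variables could be defined as Prometheus-datasource template queries in Grafana (the metric name vector_component_sent_events_total is just an illustrative choice here; any metric carrying the component_* labels would work):

```
component_id:   label_values(vector_component_sent_events_total, component_id)
component_type: label_values(vector_component_sent_events_total, component_type)
component_kind: label_values(vector_component_sent_events_total, component_kind)
host:           label_values(vector_component_sent_events_total, host)
instance:       label_values(vector_component_sent_events_total, instance)
```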

But there is at least one more point for additional configuration: the metric prefix. Vector's internal_metrics source adds a namespace to all metrics. In real life, a user may want to gather metrics with different prefixes in one dashboard and be able to filter based on this prefix. That is why for fetching metrics I use the { __name__ =~ ".*_metric_name" } notation - it looks ugly, but it allows fetching all prefixes. However, there is currently no way to choose exactly which prefix the user is interested in, and I do not know how to implement it properly. Any ideas? :)
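For example, a panel query that ignores the namespace prefix could look roughly like this (component_received_events_total and the $component_id variable are only illustrative; $__rate_interval is Grafana's built-in rate window):

```
# matches vector_component_received_events_total, myprefix_component_received_events_total, ...
rate({__name__=~".*component_received_events_total", component_id=~"$component_id"}[$__rate_interval])
```

One caveat with this approach: if two differently prefixed metrics ever carry an identical label set, Prometheus can reject the result of rate() over such a name regex, which is another argument for the relabeling idea below.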

UPD: I have talked with colleagues and they suggested trying Prometheus relabeling capabilities. With relabeling we would be able to add a specific label based on the metric prefix, and then, based on that label, add a corresponding variable to the dashboard. Sounds good, but in this case users would be required to set up the relabeling somehow on their own (see the sketch below).
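A minimal sketch of what such a relabeling rule might look like in a Prometheus scrape config (the metric_prefix label name, the regex, and the scrape target are all assumptions for illustration, not something this PR ships):

```yaml
scrape_configs:
  - job_name: vector
    static_configs:
      - targets: ["vector:9598"]   # illustrative prometheus_exporter address
    metric_relabel_configs:
      # copy the leading "<prefix>_" segment of every metric name into a metric_prefix label
      - source_labels: [__name__]
        regex: "([^_]+)_.*"
        target_label: metric_prefix
        replacement: "$1"
```

With something like this in place, the dashboard could expose a metric_prefix variable and filter on it like any other label.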

Alerting

No alerting is implemented yet. I think it is out of scope for this PR. However, some sample alerts could be added later in other PRs.

Additional details

After the dashboard is merged, I suggest publishing it to the Grafana site. Also, since the dashboard will definitely evolve in the future, we need to establish a process for regularly publishing it to the Grafana site (some kind of CI). Since ownership of the dashboard will rest with the Vector team, I guess you need to implement that somehow :)

Also, when Vector introduces new metrics or changes existing ones, I think checking the Grafana dashboard should be added to some sort of checklist, since the dashboard should stay up to date.

And don't forget to somehow mention this dashboard in the Vector documentation :)

If you have any sort of ideas/feedback/whatever else - please share it here!

Thanks in advance!

- add initial draft for Grafana full dashboard

Tested:
- Local tests
@netlify

netlify bot commented Sep 11, 2022

Deploy Preview for vector-project ready!

🔨 Latest commit: d84e56b
🔍 Latest deploy log: https://app.netlify.com/sites/vector-project/deploys/631e0b21056feb0008962f09
😎 Deploy Preview: https://deploy-preview-14369--vector-project.netlify.app

@github-actions

Soak Test Results

Baseline: 0083d15
Comparison: d84e56b
Total Vector CPUs: 4

Explanation

A soak test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, if vector performance is changed and to what degree by a pull request. Where appropriate, units are scaled per-core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their throughput performance between the baseline and comparison SHAs, with 90.0% confidence, OR have been detected as newly erratic. Negative values mean that the baseline is faster, positive that the comparison is. Results that do not exhibit more than a ±8.87% change in mean throughput are discarded. An experiment is erratic if its coefficient of variation is greater than 0.3. The abbreviated table will be omitted if no interesting changes are observed.
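As a worked illustration of the erratic criterion, using the first row of the detailed table below: CoV = stdev / mean = 189.24 KiB / 22.69 MiB = 189.24 / (22.69 × 1024) ≈ 0.0081, which matches the reported baseline CoV of 0.00814191 and is far below the 0.3 threshold. Only http_to_http_acks, with a CoV of roughly 0.43, crosses that threshold and is declared erratic.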

No interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean >= ±8.87%:

Fine details of change detection per experiment.
experiment Δ mean Δ mean % confidence baseline mean baseline stdev baseline stderr baseline outlier % baseline CoV comparison mean comparison stdev comparison stderr comparison outlier % comparison CoV erratic declared erratic
socket_to_socket_blackhole 881.27KiB 3.79 100.00% 22.69MiB 189.24KiB 3.86KiB 0 0.00814191 23.55MiB 214.12KiB 4.37KiB 0 0.00887557 False False
http_text_to_http_json 622.05KiB 1.58 100.00% 38.33MiB 894.02KiB 18.25KiB 0 0.0227727 38.94MiB 874.94KiB 17.86KiB 0 0.021939 False False
splunk_hec_route_s3 178.83KiB 0.96 99.06% 18.22MiB 2.41MiB 50.19KiB 0 0.132311 18.39MiB 2.25MiB 47.16KiB 0 0.122566 False False
syslog_log2metric_splunk_hec_metrics 164.18KiB 0.9 100.00% 17.73MiB 656.83KiB 13.38KiB 0 0.0361671 17.89MiB 680.73KiB 13.87KiB 0 0.0371472 False False
syslog_regex_logs2metric_ddmetrics 102.3KiB 0.8 100.00% 12.45MiB 626.4KiB 12.75KiB 0 0.0491355 12.55MiB 584.37KiB 11.91KiB 0 0.0454734 False False
syslog_humio_logs 127.28KiB 0.75 100.00% 16.61MiB 127.39KiB 2.6KiB 0 0.00748682 16.74MiB 151.64KiB 3.11KiB 0 0.0088464 False False
syslog_splunk_hec_logs 74.29KiB 0.44 99.99% 16.59MiB 734.25KiB 14.94KiB 0 0.0432022 16.67MiB 593.93KiB 12.11KiB 0 0.0347937 False False
datadog_agent_remap_blackhole_acks 197.93KiB 0.31 93.89% 61.91MiB 4.3MiB 89.63KiB 0 0.0695194 62.1MiB 2.67MiB 55.94KiB 0 0.0430389 False False
datadog_agent_remap_blackhole 165.39KiB 0.27 89.11% 60.28MiB 3.85MiB 80.23KiB 0 0.0638503 60.44MiB 3.11MiB 64.81KiB 0 0.0514008 False False
splunk_hec_to_splunk_hec_logs_noack 14.65KiB 0.06 78.73% 23.82MiB 463.03KiB 9.45KiB 0 0.0189759 23.84MiB 342.49KiB 6.99KiB 0 0.0140274 False False
splunk_hec_to_splunk_hec_logs_acks 9.97KiB 0.04 31.78% 23.75MiB 866.55KiB 17.63KiB 0 0.0356241 23.76MiB 825.23KiB 16.79KiB 0 0.0339115 False False
enterprise_http_to_http 2.39KiB 0.01 25.55% 23.84MiB 252.76KiB 5.16KiB 0 0.0103502 23.85MiB 253.69KiB 5.2KiB 0 0.0103869 False False
splunk_hec_indexer_ack_blackhole -3.29KiB -0.01 9.58% 23.74MiB 935.47KiB 19.02KiB 0 0.0384692 23.74MiB 963.69KiB 19.6KiB 0 0.0396352 False False
http_pipelines_blackhole -818.49B -0.05 26.81% 1.67MiB 14.85KiB 310.62B 0 0.00868182 1.67MiB 113.52KiB 2.31KiB 0 0.0663908 False False
file_to_blackhole -55.74KiB -0.06 52.43% 95.36MiB 2.53MiB 52.44KiB 0 0.0265218 95.3MiB 2.79MiB 57.94KiB 0 0.0292415 False False
http_pipelines_blackhole_acks -2.01KiB -0.16 54.30% 1.22MiB 109.7KiB 2.23KiB 0 0.0879584 1.22MiB 74.36KiB 1.52KiB 0 0.0597166 False False
http_to_http_json -45.58KiB -0.19 99.88% 23.85MiB 332.85KiB 6.79KiB 0 0.0136282 23.8MiB 603.88KiB 12.32KiB 0 0.0247715 False False
http_to_http_noack -111.36KiB -0.46 99.99% 23.82MiB 607.12KiB 12.41KiB 0 0.0248861 23.71MiB 1.26MiB 26.24KiB 0 0.0531183 False False
fluent_elasticsearch -385.64KiB -0.47 100.00% 79.47MiB 53.18KiB 1.07KiB 0 0.000653295 79.1MiB 4.12MiB 84.64KiB 0 0.0520736 False False
syslog_loki -94.34KiB -0.63 100.00% 14.67MiB 385.14KiB 7.88KiB 0 0.0256386 14.57MiB 727.9KiB 14.8KiB 0 0.0487617 False False
http_pipelines_no_grok_blackhole -85.05KiB -0.77 100.00% 10.84MiB 113.27KiB 2.31KiB 0 0.0102043 10.75MiB 1008.57KiB 20.52KiB 0 0.0915604 False False
datadog_agent_remap_datadog_logs -495.6KiB -0.81 100.00% 59.93MiB 1.59MiB 33.36KiB 0 0.0265282 59.44MiB 3.67MiB 76.59KiB 0 0.0617949 False False
datadog_agent_remap_datadog_logs_acks -536.96KiB -0.88 100.00% 59.7MiB 3.61MiB 75.53KiB 0 0.0604882 59.18MiB 4.72MiB 98.19KiB 0 0.0796942 False False
syslog_log2metric_humio_metrics -139.81KiB -1.07 100.00% 12.81MiB 256.58KiB 5.24KiB 0 0.0195524 12.68MiB 548.13KiB 11.16KiB 0 0.0422189 False False
http_to_http_acks -452.93KiB -2.49 96.34% 17.73MiB 7.56MiB 157.97KiB 0 0.426175 17.29MiB 7.1MiB 148.22KiB 0 0.410856 True True

@zamazan4ik
Contributor Author

@spencergilbert could you please take a look at it?

@spencergilbert
Contributor

I'll definitely try to find time to review this week. I imagine it'll be somewhat of a pain reviewing without having access to the UI - so we probably want to consider how to support that locally 🤔

cc @binarylogic who's recently done some dashboard work and probably has some opinions.

@spencergilbert
Contributor

Thinking about it, I'm also not sure if we want to be the owners of the dashboard as (to my knowledge) none of us are using Grafana. It may be better to have a vector-community project or an individual's which we link to.

@jtweaver

In its current state, this dashboard is useless. There are no Vector metrics. All I see are SQS, Kafka, and Kubernetes metrics. And even those don't work.

@zamazan4ik
Contributor Author

zamazan4ik commented Sep 13, 2022

There are no Vector metrics.

They are located in the General panel, and the General tab is tested locally with actual metrics from Vector. However, not all Vector metrics are added right now, since at first I want to get feedback.

All I see are SQS, Kafka, and Kubernetes metrics. And even those don't work.

Do you have the corresponding metrics in your Prometheus setup? I do not; I have just inserted them into the dashboard as an example of grouping different metric groups on the dashboard. If this approach is good for the Vector devs, I will continue working on it.

@zamazan4ik
Contributor Author

@spencergilbert since you have no desire to maintain it in the main repo, could you please discuss internally with your team how we should deal with community-supported tooling around Vector? From my perspective, an additional vector-community repo under the vectordotdev organization should be fine. I do not like the idea of linking to external projects in this case, since then you need to somehow track which fork is current and up-to-date, plus other boring and error-prone stuff. vectordotdev is a good source of truth.

Once you have discussed it, please write a note about the Vector dev team's decision here and we will move on with the dashboard.

Thank you.

@zamazan4ik
Contributor Author

While waiting for your decision, I have continued the work here: https://github.com/zamazan4ik/vector-community

Current state: I have added initial panel versions for all metrics from the internal_metrics source to the dashboard.

@zamazan4ik
Contributor Author

@spencergilbert Did you discuss this PR somewhere internally? Are you still interested in this dashboard in the upstream?

@spencergilbert
Contributor

@spencergilbert Did you discuss this PR somewhere internally? Are you still interested in this dashboard in the upstream?

Discussing with @jszwedko tomorrow, we'd prefer not to start the repo/org without having some plans in place for how it would be managed/admin/etc. - hopefully can get a rough plan in place soon.

@zamazan4ik
Contributor Author

Update regarding the current state of the dashboard: I have just uploaded a big batch of changes. The dashboard now has all Vector metrics inside it, organized into meaningful groups. All counter-based sample panels were transformed into rate-based ones, and filtering support based on dashboard variables has been added.

The biggest issue is that I cannot test all metrics locally, because right now Vector has no ability to generate fake metrics, so I have only tested a subset of them, and they work. If anyone is interested in the dashboard, you are welcome to test it. Just download the dashboard, import it into your Grafana, select a data source, and it should work.

@neuronull
Contributor

I discussed this briefly with @jszwedko and @spencergilbert .

The preference is to eventually have it live in something like vectordotdev/vector-community , but we need to do a bit of prep before we get to that point.

In the short term, keeping it in your repo and re-directing users there is good 👍

@neuronull
Contributor

(FYI) We created a task internally to track the work to prep for a vectordotdev/vector-community so that it doesn't get left behind 😄

jszwedko self-assigned this Dec 2, 2022
@gaby

gaby commented Apr 16, 2023

@zamazan4ik @neuronull any progress on this?

@zamazan4ik
Contributor Author

No progress from my side since this PR is not going to be merged by Vector devs :)


Successfully merging this pull request may close these issues.

Prepare a Grafana Dashboard for internal_metrics gathered via Prometheus
6 participants