
mixin: Add Thanos Ruler alerts #1963

Merged: 4 commits into thanos-io:master on Jan 15, 2020
Conversation

kakkoyun (Member) commented Jan 8, 2020

This PR adds missing Ruler alerts to thanos-mixin.

Alerts are inspired by https://thanos.io/components/rule.md/#must-have-essential-ruler-alerts

Signed-off-by: Kemal Akkoyun kakkoyun@gmail.com

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Adds Ruler alerts
  • Adds alert queue metric panels to Thanos Rule dashboards
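
The added alerts follow the mixin's jsonnet conventions. As a rough sketch of what such an addition looks like (the alert name, the `thanos_alert_sender_alerts_dropped_total` metric, the 5m windows, and the `$._config` object here are illustrative assumptions, not necessarily what this PR contains):

```jsonnet
// Hypothetical sketch of a thanos-mixin Ruler alert. The %(selector)s
// placeholder is filled from the mixin's config object, matching the
// pattern used by the other alerts in this repository.
{
  alert: 'ThanosRuleSenderIsFailingAlerts',
  expr: |||
    rate(thanos_alert_sender_alerts_dropped_total{%(selector)s}[5m]) > 0
  ||| % $._config,
  'for': '5m',  // quoted: `for` is a jsonnet keyword
  labels: {
    severity: 'critical',
  },
  annotations: {
    message: 'Thanos Rule {{$labels.job}} is failing to send alerts to Alertmanager.',
  },
}
```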

Verification

For rules:

  • make example-rules-lint
  • Manual tests using the Query UI against a running cluster.

For dashboards:

  • Imported generated dashboard JSON to a Grafana with live data.

kakkoyun (Member, Author) commented Jan 8, 2020

cc @metalmatze @bwplotka

metalmatze (Member) left a comment:

Nice work!
When creating/changing dashboards it would be nice to post some screenshots of that. 😊

mixin/thanos/alerts/ruler.libsonnet (outdated comment, resolved)
Comment on lines +78 to +81
sum by (job, pod, rule_group) (prometheus_rule_group_last_duration_seconds{%(selector)s})
>
sum by (job, pod, rule_group) (prometheus_rule_group_interval_seconds{%(selector)s})
)
Member
I'm not too sure about this rule, as it's always evaluation for us, even though everything seems fine.

Member Author

This still indicates a problem, though. Its severity is set to warning. We might also apply a coefficient to the interval, but I think we should keep this alert.

With this alert you can detect, for example, that you have too many rules in a group, or that your queries take too long.

It's not very critical, but I still see value in this.
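
Putting the quoted expression into mixin form, the discussed alert might look roughly like this (the alert name, the 5m `for` duration, and the `$._config` object are assumptions; only the expression shape and the warning severity come from the thread):

```jsonnet
// Sketch of the "evaluation slower than interval" alert discussed above.
// A coefficient on the interval, as suggested in the thread, could be
// multiplied into the right-hand side; it is omitted here.
{
  alert: 'ThanosRuleRuleEvaluationsSlow',
  expr: |||
    (
      sum by (job, pod, rule_group) (prometheus_rule_group_last_duration_seconds{%(selector)s})
      >
      sum by (job, pod, rule_group) (prometheus_rule_group_interval_seconds{%(selector)s})
    )
  ||| % $._config,
  'for': '5m',
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: 'Thanos Rule {{$labels.job}} evaluates rule group {{$labels.rule_group}} slower than its configured interval.',
  },
}
```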

kakkoyun (Member Author) commented Jan 9, 2020

@metalmatze ping

kakkoyun (Member Author)

@metalmatze Could you merge? Or should we wait for another opinion?

squat merged commit 46a97fd into thanos-io:master on Jan 15, 2020
3 participants