monitoring: add "for" to CephOSDDownHigh alert #12731
Conversation
As part of a normal restart, an OSD can go down just as Prometheus tries to scrape it, causing this alert to fire. Alerts shouldn't fire as part of normal operations. This change requires the OSD to be down for 5 consecutive minutes before the alert will fire.
Signed-off-by: Chris Jones <chris@cjones.org>
@@ -89,6 +89,7 @@ groups:
  description: "{{ $value | humanize }}% or {{ with query \"count(ceph_osd_up == 0)\" }}{{ . | first | value }}{{ end }} of {{ with query \"count(ceph_osd_up)\" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%). The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
  summary: "More than 10% of OSDs are down"
  expr: "count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10"
+ for: "5m"
Any specific reason for choosing 5m?
Not really. It's a constant that appears in other places in this file. It should be longer than the expected time for the old OSD pod to terminate plus the time for the new OSD pod to become ready, in a healthy Kubernetes cluster.
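To make the effect of the window concrete, here is a hedged sketch of a `promtool test rules` unit test for this alert. The rule file path, the input series, and the severity label are assumptions for illustration; the rule's actual static labels would have to be filled in for the test to pass:

```yaml
# osd-down-high-test.yaml -- illustrative sketch only; paths and labels are assumptions.
rule_files:
  - prometheus-ceph-rules.yaml   # assumed name of the rule file touched by this PR
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One of two OSDs is down for the whole test window (50% >= 10%).
      - series: 'ceph_osd_up{ceph_daemon="osd.0"}'
        values: '0x10'
      - series: 'ceph_osd_up{ceph_daemon="osd.1"}'
        values: '1x10'
    alert_rule_test:
      # Inside the 5m "for" window: the alert is only pending, so nothing fires.
      - eval_time: 3m
        alertname: CephOSDDownHigh
        exp_alerts: []
      # After 5 consecutive minutes of the condition holding, the alert fires.
      - eval_time: 8m
        alertname: CephOSDDownHigh
        exp_alerts:
          - exp_labels:
              severity: critical   # placeholder; must match the rule's real labels
```

A transient OSD restart that resolves within the 5-minute window would never reach the firing state, which is exactly the behavior the PR is after.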
I see most of the other alerts have the same 5m time, so this seems reasonable.
Looks good, could you also open a PR in the ceph repo for this? We sometimes refresh the latest ceph alerts by picking up the file from the ceph repo here.
monitoring: add "for" to CephOSDDownHigh alert (backport #12731)
As part of a normal restart, an OSD can go down just as prometheus tries to scrape it, causing this alert to fire. Alerts shouldn't fire as part of normal operations. This change requires the OSD to be down for 5 consecutive minutes before the alert will fire.