monitoring: add "for" to CephOSDDownHigh alert #12731
Conversation
As part of a normal restart, an OSD can go down just as Prometheus tries to scrape it, causing this alert to fire. Alerts shouldn't fire as part of normal operations. This change requires the OSD to be down for 5 consecutive minutes before the alert will fire.
Signed-off-by: Chris Jones <chris@cjones.org>
@@ -89,6 +89,7 @@ groups:
  description: "{{ $value | humanize }}% or {{ with query \"count(ceph_osd_up == 0)\" }}{{ . | first | value }}{{ end }} of {{ with query \"count(ceph_osd_up)\" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%). The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
  summary: "More than 10% of OSDs are down"
  expr: "count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10"
+ for: "5m"
Any specific reason for choosing 5m?
Not really. It's a constant that appears in other places in this file. It should be longer than the expected time for the old OSD pod to terminate plus the time for the new OSD pod to become ready, in a healthy Kubernetes cluster.
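To make the effect of the window concrete, here is a hedged sketch of a `promtool test rules` unit test for this alert. The rule file path, the input series, and the severity label are assumptions for illustration; the rule's actual static labels would have to be filled in for the test to pass:

```yaml
# osd-down-high-test.yaml -- illustrative sketch only; paths and labels are assumptions.
rule_files:
  - prometheus-ceph-rules.yaml   # assumed name of the rule file touched by this PR
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One of two OSDs is down for the whole test window (50% >= 10%).
      - series: 'ceph_osd_up{ceph_daemon="osd.0"}'
        values: '0x10'
      - series: 'ceph_osd_up{ceph_daemon="osd.1"}'
        values: '1x10'
    alert_rule_test:
      # Inside the 5m "for" window: the alert is only pending, so nothing fires.
      - eval_time: 3m
        alertname: CephOSDDownHigh
        exp_alerts: []
      # After 5 consecutive minutes of the condition holding, the alert fires.
      - eval_time: 8m
        alertname: CephOSDDownHigh
        exp_alerts:
          - exp_labels:
              severity: critical   # placeholder; must match the rule's real labels
```

A transient OSD restart that resolves within the 5-minute window would never reach the firing state, which is exactly the behavior the PR is after.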
I see most of the other alerts have the same 5m time, so this seems reasonable.
Looks good, could you also open a PR in the ceph repo for this? We sometimes refresh the latest ceph alerts by picking up the file from the ceph repo here.
monitoring: add "for" to CephOSDDownHigh alert (backport #12731)
As part of a normal restart, an OSD can go down just as prometheus tries to scrape it, causing this alert to fire. Alerts shouldn't fire as part of normal operations. This change requires the OSD to be down for 5 consecutive minutes before the alert will fire.