monitoring: add "for" to CephOSDDownHigh alert #12731

Merged: 1 commit merged into rook:master on Aug 16, 2023

Conversation

@cjyar (Contributor) commented Aug 16, 2023

As part of a normal restart, an OSD can go down just as Prometheus tries to scrape it, causing this alert to fire. Alerts shouldn't fire as part of normal operations. This change requires the OSD to be down for 5 consecutive minutes before the alert will fire.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

As part of a normal restart, an OSD can go down just as prometheus tries
to scrape it, causing this alert to fire. Alerts shouldn't fire as part
of normal operations. This change requires the OSD to be down for 5
consecutive minutes before the alert will fire.

Signed-off-by: Chris Jones <chris@cjones.org>
@@ -89,6 +89,7 @@ groups:
description: "{{ $value | humanize }}% or {{ with query \"count(ceph_osd_up == 0)\" }}{{ . | first | value }}{{ end }} of {{ with query \"count(ceph_osd_up)\" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%). The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
summary: "More than 10% of OSDs are down"
expr: "count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10"
for: "5m"
Member commented:

any specific reason for choosing 5m?

cjyar (Author) replied:

Not really. It's a constant that appears in other places in this file. It should be longer than the expected time for the old OSD pod to terminate plus the time for the new OSD pod to become ready, in a healthy Kubernetes cluster.

Member replied:

I see most of the other alerts have the same 5m time, so this seems reasonable.

@travisn (Member) left a comment:

Looks good, could you also open a PR in the ceph repo for this? We sometimes refresh the latest ceph alerts by picking up the file from the ceph repo here.

travisn merged commit f3764cc into rook:master on Aug 16, 2023
44 of 49 checks passed
cjyar deleted the alert-for branch on August 16, 2023 at 22:12
travisn added a commit that referenced this pull request on Aug 16, 2023: monitoring: add "for" to CephOSDDownHigh alert (backport #12731)