[Regression]: prometheus/prom_rules/prometheus.rules.yml: severities are inverted #2029

Closed
vladzcloudius opened this issue Jul 22, 2023 · 6 comments · Fixed by #2030 or #2032
Labels: bug (Something isn't working right)

@vladzcloudius
Contributor

Installation details
Scylla-Monitoring Version: 4.4.2

Description
Patch 2e3d0c7

commit 2e3d0c7280599f2d09b9eea102f522defdd05db6
Author: Amnon Heiman <amnon@scylladb.com>
Date:   Wed Mar 15 10:22:20 2023 +0200

    prometheus.rules.yml: Use severity as string instead of numbers

set wrong severity values.

For example, it assigned the "info" severity to the "DiskFull with less than 15% free" alert and "error" to the "DiskFull with less than 35% free" alert:

  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 35
    for: 30s
    labels:
      severity: "error"
    annotations:
      description: '{{ $labels.instance }} has less than 35% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 25
    for: 30s
    labels:
      severity: "warn"
    annotations:
      description: '{{ $labels.instance }} has less than 25% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 15
    for: 30s
    labels:
      severity: "info"
    annotations:
      description: '{{ $labels.instance }} has less than 15% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
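
For comparison, here is a minimal sketch of the ordering one would expect, where severity rises as free space shrinks. It is not the actual fix merged in #2030 or #2032. The level assigned to each threshold is an assumption, since the numeric severities used before commit 2e3d0c7 are not shown in this issue. Only the `severity` labels differ from the rules above.

```yaml
  # Assumed ordering: the less free space, the higher the severity.
  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 35
    for: 30s
    labels:
      severity: "info"   # assumption: mildest tier for the loosest threshold
    annotations:
      description: '{{ $labels.instance }} has less than 35% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 25
    for: 30s
    labels:
      severity: "warn"   # unchanged middle tier
    annotations:
      description: '{{ $labels.instance }} has less than 25% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
  - alert: DiskFull
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}
      * 100 < 15
    for: 30s
    labels:
      severity: "error"  # assumption: most severe tier for the tightest threshold
    annotations:
      description: '{{ $labels.instance }} has less than 15% free disk space.'
      summary: Instance {{ $labels.instance }} low disk space
```
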
@vladzcloudius vladzcloudius added the bug Something isn't working right label Jul 22, 2023
@vladzcloudius
Contributor Author

vladzcloudius added a commit to vladzcloudius/scylla-ansible-roles that referenced this issue Jul 22, 2023
Since "prometheus.rules.yml: Use severity as string instead of numbers"
Alert Manager is going to use string severities values, like "warn", "error", etc.

However the same patch introduced a regression by inverting the severities:
scylladb/scylla-monitoring#2029

While we are waiting for a fix let's "hack" it so that things continue working.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
@vladzcloudius
Contributor Author

@amnonh as trivial as the change is, it's very critical. Could you please try to fix this ASAP?

vladzcloudius added a commit to scylladb/scylla-ansible-roles that referenced this issue Jul 22, 2023
Since "prometheus.rules.yml: Use severity as string instead of numbers"
Alert Manager is going to use string severities values, like "warn", "error", etc.

However the same patch introduced a regression by inverting the severities:
scylladb/scylla-monitoring#2029

While we are waiting for a fix let's "hack" it so that things continue working.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
@amnonh amnonh added this to the Monitoring 4.5 milestone Jul 23, 2023
@vladzcloudius
Contributor Author

@amnonh #2030 doesn't fix this issue unfortunately.
The offending patch messed up many other alerts - not just DiskFull.

Please go over the whole https://github.com/scylladb/scylla-monitoring/commit/2e3d0c7280599f2d09b9eea102f522defdd05db6 patch.

The translation was supposed to be

1 - info
2 - warn
3 - error
4 - critical

Just as a sample - look at this screenshot of the part of the patch in question.
All 4 alerts are translated wrong there (and there are a lot more like these in this patch):

[image: screenshot of part of the patch]

You may want to ask for a review of the follow-up patches in this context.

@vladzcloudius vladzcloudius reopened this Jul 24, 2023
@amnonh
Collaborator

amnonh commented Jul 25, 2023

@vladzcloudius thanks for the update. I misunderstood the original issue and wanted to provide a patch release ASAP. I've removed 4.4.3 and will take the longer route.

@vladzcloudius
Contributor Author

@amnonh Could you please clarify which releases have the fix for this?
Is it only 4.5.0 or some 4.4.x too?

@amnonh
Collaborator

amnonh commented Oct 24, 2023

It was backported to 4.4.3 and later.
