
Add leniency to disk thresholds of riemann-health #282

Merged
merged 3 commits into main from disk-threshold-leniency on Jan 26, 2024

Conversation

smortex
Collaborator

@smortex smortex commented Jan 21, 2024

Disk thresholds expressed as a fraction of disk usage do not scale well with
modern disks: on the one hand, a 90% full partition that stores logs is
generally an issue and should be reported, but on the other hand, when a huge
volume is available for storing backups (e.g. 10 TB), a 90% usage limit
does not really make sense, as we do not want to waste 1 TB of disk space.

Introduce two new parameters to tune disk usage thresholds:

  • --disk-warning-leniency (default: 500G)
  • --disk-critical-leniency (default: 250G)

When the fraction of disk space used reaches a warning / critical
threshold, check the available space against these "leniency" values,
and only report the warning / critical status if the available space is
lower than this limit.

The default values have been chosen to be high enough to only have an effect
for disks larger than 5 TB.
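To make the interaction between the fractional thresholds and the leniency limits concrete, here is a minimal sketch of the decision logic. It is illustrative only, not the merged riemann-health code: the method and option names, the 0.90 / 0.95 thresholds, and the KiB-based units are assumptions made for this example.

```ruby
# Illustrative sketch of the leniency check; names, defaults and units are
# assumptions, not the actual riemann-health implementation.
def disk_state(used_fraction, avail_kb, opts)
  if used_fraction >= opts[:disk_critical] && avail_kb < opts[:disk_critical_leniency_kb]
    'critical'
  elsif used_fraction >= opts[:disk_warning] && avail_kb < opts[:disk_warning_leniency_kb]
    'warning'
  else
    'ok'
  end
end

opts = {
  disk_warning: 0.90, disk_critical: 0.95,          # fractional thresholds (assumed)
  disk_warning_leniency_kb:  500 * 1024 * 1024,     # 500 GiB expressed in KiB
  disk_critical_leniency_kb: 250 * 1024 * 1024,     # 250 GiB expressed in KiB
}

# A 10 TB volume that is 92% full still has roughly 800 GiB available,
# which is above the 500 GiB warning leniency, so no warning is raised.
puts disk_state(0.92, 800 * 1024 * 1024, opts) # => ok
```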

According to IEEE Std 1003.1-2017, a POSIX-compliant df(1) must
support the -k flag to report sizes in 1024-byte blocks instead of the
historical default of 512-byte blocks (still in effect by default on FreeBSD
but not on Linux). We use this flag on all systems to make sure the output is
in 1024-byte units regardless of the operating system. Existing unit tests
are updated accordingly.
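As an illustration of the portability concern, the following standalone snippet (not taken from the PR) shells out to df and derives the used fraction and available KiB; the use of -P alongside -k and the parsing details are assumptions made for this example.

```ruby
# Illustrative only: -P requests the portable single-line output format and
# -k forces 1024-byte blocks, so the parsing below behaves the same on Linux
# and FreeBSD despite their different default block sizes.
out = `df -Pk /var`
_header, *rows = out.lines
rows.each do |row|
  _fs, _blocks_kb, used_kb, avail_kb, _capacity, mount = row.split(/\s+/, 6)
  used_fraction = used_kb.to_f / (used_kb.to_f + avail_kb.to_f)
  printf("%-20s %5.1f%% used, %d KiB available\n",
         mount.strip, used_fraction * 100, avail_kb.to_i)
end
```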

@smortex smortex added the enhancement New feature or request label Jan 21, 2024
@smortex smortex changed the title Add leniency to disk thresholds Add leniency to disk thresholds of riemann-health Jan 21, 2024
The default warning / critical limits for disk occupation do not scale
well for large volumes: with the default configuration, a 10 TB disk should
not raise a warning when 90% of it is used and 1 TB is still available.

Add a unit test that shows the expected behavior.
Now that we take free space into account, adding it to the message makes
sense.
@smortex smortex marked this pull request as ready for review January 22, 2024 23:39
@smortex
Collaborator Author

smortex commented Jan 22, 2024

I think this is ready for review. As a non-native English speaker, it was quite hard for me to express this notion of "leniency" (tolerance). If you can think of a better name, I will be happy to update the PR accordingly.

Member

@jamtur01 jamtur01 left a comment


This makes sense to me.

@jamtur01 jamtur01 merged commit 0f583e1 into main Jan 26, 2024
9 checks passed
@jamtur01 jamtur01 deleted the disk-threshold-leniency branch January 26, 2024 03:16