
Add leniency to disk thresholds of riemann-health #282

Merged
merged 3 commits into main from disk-threshold-leniency on Jan 26, 2024

Conversation

smortex
Collaborator

@smortex smortex commented Jan 21, 2024

Disk thresholds expressed as a fraction of disk usage do not scale well with
modern disks: on the one hand, a 90% full partition that stores logs is
generally an issue and should be reported, but on the other hand, when a huge
volume is available for storing backups (e.g. 10 TB), a 90% usage limit
does not really make sense, as we do not want to waste 1 TB of disk space.

Introduce two new parameters to tune disk usage thresholds:

  • --disk-warning-leniency (default: 500G)
  • --disk-critical-leniency (default: 250G)

When the fraction of disk space used reaches a warning / critical
threshold, check the available space against these "leniency" values,
and only report the warning / critical status if the available space is
lower than this limit.

The default values have been chosen to be high enough to only have an effect
for disks larger than 5 TB.
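To make the interaction between the fractional thresholds and the leniency limits concrete, here is a minimal sketch of the decision logic. It is illustrative only, not the merged riemann-health code: the method and option names, the 0.90 / 0.95 thresholds, and the KiB-based units are assumptions made for this example.

```ruby
# Illustrative sketch of the leniency check; names, defaults and units are
# assumptions, not the actual riemann-health implementation.
def disk_state(used_fraction, avail_kb, opts)
  if used_fraction >= opts[:disk_critical] && avail_kb < opts[:disk_critical_leniency_kb]
    'critical'
  elsif used_fraction >= opts[:disk_warning] && avail_kb < opts[:disk_warning_leniency_kb]
    'warning'
  else
    'ok'
  end
end

opts = {
  disk_warning: 0.90, disk_critical: 0.95,          # fractional thresholds (assumed)
  disk_warning_leniency_kb:  500 * 1024 * 1024,     # 500 GiB expressed in KiB
  disk_critical_leniency_kb: 250 * 1024 * 1024,     # 250 GiB expressed in KiB
}

# A 10 TB volume that is 92% full still has roughly 800 GiB available,
# which is above the 500 GiB warning leniency, so no warning is raised.
puts disk_state(0.92, 800 * 1024 * 1024, opts) # => ok
```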

According to IEEE Std 1003.1-2017, a POSIX-compliant df(1) must
support the -k flag to report sizes in 1024-byte blocks instead of the
historical default of 512-byte blocks (still in effect by default on FreeBSD
but not on Linux). We use this flag on all systems to make sure the output is
in 1024-byte units regardless of the operating system. Existing unit tests
are updated accordingly.
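As an illustration of the portability concern, the following standalone snippet (not taken from the PR) shells out to df and derives the used fraction and available KiB; the use of -P alongside -k and the parsing details are assumptions made for this example.

```ruby
# Illustrative only: -P requests the portable single-line output format and
# -k forces 1024-byte blocks, so the parsing below behaves the same on Linux
# and FreeBSD despite their different default block sizes.
out = `df -Pk /var`
_header, *rows = out.lines
rows.each do |row|
  _fs, _blocks_kb, used_kb, avail_kb, _capacity, mount = row.split(/\s+/, 6)
  used_fraction = used_kb.to_f / (used_kb.to_f + avail_kb.to_f)
  printf("%-20s %5.1f%% used, %d KiB available\n",
         mount.strip, used_fraction * 100, avail_kb.to_i)
end
```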

@smortex smortex added the enhancement New feature or request label Jan 21, 2024
@smortex smortex changed the title Add leniency to disk thresholds Add leniency to disk thresholds of riemann-health Jan 21, 2024
The default warning / critical limits for disk occupation do not scale
well for large volumes: with the default configuration, a 10 TB disk should
not raise a warning when 90% of it is used and 1 TB is still available.

Add a unit test that shows the expected behavior.
Now that we take free space into account, adding it to the message makes
sense.
@smortex smortex marked this pull request as ready for review January 22, 2024 23:39
@smortex
Collaborator Author

smortex commented Jan 22, 2024

I think this is ready for review. As a non-native English speaker, it was quite hard for me to express this notion of "leniency" (tolerance). If you can think of a better name, I will be happy to update the PR accordingly.

Member

@jamtur01 jamtur01 left a comment


This makes sense to me.

@jamtur01 jamtur01 merged commit 0f583e1 into main Jan 26, 2024
9 checks passed
@jamtur01 jamtur01 deleted the disk-threshold-leniency branch January 26, 2024 03:16