Add uptime monitoring to riemann-health #218

smortex · 2022-07-07T22:54:12Z

Make a new metric with the node uptime available. The metric is the raw uptime in seconds and the description is a human-readable duration. Critical and Warning thresholds are reversed when compared to other metrics: a value lower than these thresholds indicates a problem. This is intended to get notifications when nodes restart unexpectedly.

The uptime check is not enabled by default and must be explicitly enabled using the --checks flag.
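To illustrate the reversed thresholds, here is a minimal sketch. It is not the code from this PR: the method names `uptime_state` and `human_duration`, the keyword arguments, and the threshold values are all hypothetical and chosen only for the example.

```ruby
# Hypothetical helpers, not the riemann-health implementation.

# Reversed comparison: the state degrades when the value is BELOW a threshold,
# because a small uptime means the node restarted recently.
def uptime_state(uptime_seconds, warning:, critical:)
  if uptime_seconds < critical
    'critical'
  elsif uptime_seconds < warning
    'warning'
  else
    'ok'
  end
end

# Human-readable description built from the raw number of seconds.
def human_duration(seconds)
  days, rest = seconds.divmod(86_400)
  hours, rest = rest.divmod(3_600)
  minutes, secs = rest.divmod(60)
  format('%dd %02d:%02d:%02d', days, hours, minutes, secs)
end

# Illustrative thresholds: warn below one hour, critical below ten minutes.
puts uptime_state(120, warning: 3_600, critical: 600)
puts human_duration(93_784)
```

With these assumed thresholds, a node that has been up for two minutes is reported as critical, while one that has been up for a day is ok.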

lib/riemann/tools/health.rb

jamtur01 · 2022-07-08T00:16:34Z

You thinking about replacing some of the more complex regexes etc. with racc grammars?

smortex · 2022-07-08T00:23:41Z

This is admittedly over-engineered, but the uptime(1) commands try hard to format the uptime for humans, and a parser is probably the most maintainable (but not straightforward) way to understand things intended to be read by humans…

That might be a direction we are not ready to take, but I would benefit from this metric and would like to avoid splitting it apart, hence the PR for discussion 😄

jamtur01 · 2022-07-08T00:26:26Z

I'm good with it. I am curious whether there are other things we could do more easily with it.

smortex · 2022-07-08T00:58:31Z

For now, this is the only place where I see something that is likely to break easily if we rely on regexps. The previous code that loaded the 1-minute load average did it in a way that worked almost everywhere: "first thing that looks like a float after the last :". Here are examples from the test suite that I collected in the real world today:

11:10 up 3:40, 1 user, load averages: 0,25 0,67 0,68
11:30AM up 4 hrs, 1 user, load averages: 0.45, 0.53, 0.54
11:46 up 38 days, 22:21, 2 users, load averages: 1,76 1,24 0,94
10:40:21 up 1 day, 18:51, 1 user, load average: 0,46, 1,45, 2,00
11:50:17 up 1 day, 20:01, 1 user, load average: 1.66, 1.69, 1.38

Only when the locale implied a , as the decimal separator did we lose the fractional part. That is not really an issue because the code would run as a service with the default locale, I guess, so this was okay-ish.
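As a rough sketch of that heuristic (not the actual previous code; the method name `load1_from_uptime` is hypothetical), the "first float after the last :" approach and its behavior with a comma decimal separator look like this:

```ruby
# Hypothetical reconstruction of the heuristic described above.
# Take everything after the last ':' and return the first thing that looks
# like a float with a '.' decimal separator.
def load1_from_uptime(line)
  tail = line.split(':').last
  tail[/\d+(?:\.\d+)?/]&.to_f
end

# '.' as decimal separator: the 1-minute load average is read in full.
puts load1_from_uptime('11:50:17 up 1 day, 20:01, 1 user, load average: 1.66, 1.69, 1.38')

# ',' as decimal separator: only the integer part is matched,
# so the fractional part is lost.
puts load1_from_uptime('11:46 up 38 days, 22:21, 2 users, load averages: 1,76 1,24 0,94')
```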

Extracting the uptime from this mess with regexp would really be another story.

All the other regexps I see seem to extract data from well-defined strings. The output of top on MacOS might be considered fragile, but the regexps are straightforward, so unless they update the output at some point I think this is fine as it is now.

Multiple metrics for MacOS are provided by a single utility and we used a cache to avoid running it once for each metric. In order to extend this logic to other utilities and avoid code duplication, introduce a "global" cache where any function can store data and which is discarded before each tick. No functional change.
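The idea of such a per-tick cache can be sketched as follows. This is only an illustration under assumptions: the class name `TickCache`, its methods, and the `:top` key are hypothetical and do not come from the riemann-tools code.

```ruby
# Hypothetical per-tick cache, not the riemann-health implementation.
class TickCache
  def initialize
    @store = {}
  end

  # Return the cached value for key, or compute it once with the given block.
  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end

  # Discard everything; intended to be called before each tick.
  def clear
    @store.clear
  end
end

cache = TickCache.new
runs = 0

# Two metrics read the same utility output during one tick:
# the expensive block runs only once.
2.times { cache.fetch(:top) { runs += 1; 'output of the utility' } }

# Next tick: the cache is discarded, so the utility would run again.
cache.clear
cache.fetch(:top) { runs += 1; 'output of the utility' }

puts runs
```

Keying the cache by utility rather than by metric is what lets several checks share one invocation within a tick.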

smortex · 2022-07-08T01:08:01Z

I could not test that the first commit does not change the behavior on MacOS, so if you can confirm that nothing breaks it would be awesome. I spotted a minor issue in the first commit and fixed it. I consider this ready and am willing to adjust the defaults (enabled, thresholds, etc.).

Parse the output of `uptime(1)` to gather load averages on BSD. This makes available all information provided by `uptime(1)` except the current time, which has no value in our monitoring context.

jamtur01

LGTM

smortex force-pushed the uptime branch from 6ef3661 to 0cde249 Compare July 8, 2022 00:06

smortex added the enhancement New feature or request label Jul 8, 2022

jamtur01 reviewed Jul 8, 2022

smortex marked this pull request as ready for review July 8, 2022 00:58

smortex force-pushed the uptime branch from a735356 to 2e376f7 Compare July 8, 2022 01:02

smortex added 2 commits July 11, 2022 19:52

Implement an uptime parser

a070c08

Parse the output of `uptime(1)` to gather load averages on BSD. This makes available all information provided by `uptime(1)` except the current time, which has no value in our monitoring context.

smortex force-pushed the uptime branch from d8bed23 to 99f266b Compare July 12, 2022 05:52

jamtur01 approved these changes Jul 27, 2022

jamtur01 merged commit cb0e36a into main Jul 27, 2022

smortex deleted the uptime branch July 27, 2022 02:42

smortex mentioned this pull request Aug 16, 2022

Disabled checks are hard to discover #227

Open
