Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add uptime monitoring to riemann-health #218

Merged
merged 3 commits into from
Jul 27, 2022
Merged

Add uptime monitoring to riemann-health #218

merged 3 commits into from
Jul 27, 2022

Conversation

smortex
Copy link
Member

@smortex smortex commented Jul 7, 2022

Make a new metric with the node uptime available. The metric is the raw uptime in seconds and the description is a human readable duration. Critical and Warning thresholds are reversed when compared to other metrics: a value lower that these thresholds indicates a problem. This is intended to get notifications when nodes restart unexpectedly.

The uptime check is not enabled by default and must be explicitly enabled using the --checks flag.

@smortex smortex added the enhancement New feature or request label Jul 8, 2022
@jamtur01
Copy link
Member

jamtur01 commented Jul 8, 2022

You thinking about replacing some of the more complex regex's etc with racc grammars?

@smortex
Copy link
Member Author

smortex commented Jul 8, 2022

This is admittedly oven-engineered but the uptime(1) commands try hard to format the uptime for humans and a parser is probably the most maintainable (but not straightforward) way to understand things intended to be read by humans…

That might be a direction we are not ready to take, but I would benefit from this metric and would like to avoid to spit it apart, hence the PR for a discussion 😄

@jamtur01
Copy link
Member

jamtur01 commented Jul 8, 2022

I'm good with it. I am curious if there are other things we could do easier with it.

@smortex
Copy link
Member Author

smortex commented Jul 8, 2022

For now, it's the only time I see something that is likely to break easily is we rely on regexps. The previous code that was loading the 1 minute load average did it in a way that worked almost everywhere: "first thing that looks-like a float after the last :", here are examples from the test suite I collected in the real world today:

  • 11:10 up 3:40, 1 user, load averages: 0,25 0,67 0,68
  • 11:30AM up 4 hrs, 1 user, load averages: 0.45, 0.53, 0.54
  • 11:46 up 38 days, 22:21, 2 users, load averages: 1,76 1,24 0,94
  • 10:40:21 up 1 day, 18:51, 1 user, load average: 0,46, 1,45, 2,00
  • 11:50:17 up 1 day, 20:01, 1 user, load average: 1.66, 1.69, 1.38

only when the locale implied a , as decimal separator we lost the fractional part, not really an issue because the code would run as a service with the default locale I guess, so this was okayish.

Extracting the uptime from this mess with regexp would really be another story.

All other regexp I see seem to extract data from well defined strings, the output of top on MacOS might be considered fragile but the regexps are straightforward so unless they update the output at some point I think this is fine as it is now.

@smortex smortex marked this pull request as ready for review July 8, 2022 00:58
Multiple metrics for MacOS are provided by a single utility and we used
a cache to avoid running it once for each metric.  In order to extend
this logic to other utilities and avoid code duplication, introduce a
"glabal" cache where any function can store data and which is discarded
before each tick.

No functional change.
@smortex
Copy link
Member Author

smortex commented Jul 8, 2022

I could not test that the first commit does not change the behavior on MacOS, so if you can confirm that noting breaks it would be awesome. I spotted a minor issue in the first commit and fixed it, I consider this ready and am ready to adjust the defaults (enabled, thresholds, etc).

Parse the output of `uptime(1)` to gather load averages on BSD.  This
makes available all information provided by `uptime(1)` excepted the
current time which has no value in our monitoring context.
Make a new metric with the node uptime available. The metric is the raw
uptime in seconds and the description is a human readable duration.
Critical and Warning thresholds are reversed when compared to other
metrics: a value lower that these thresholds indicates a problem. This
is intended to get notifications when nodes restart unexpectedly.

The uptime check is not enabled by default and must be explicitly
enabled using the --checks flag.
Copy link
Member

@jamtur01 jamtur01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jamtur01 jamtur01 merged commit cb0e36a into main Jul 27, 2022
@smortex smortex deleted the uptime branch July 27, 2022 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants