Add uptime monitoring to riemann-health #218
Are you thinking about replacing some of the more complex regexes, etc., with racc grammars?
This is admittedly over-engineered, but the … That might be a direction we are not ready to take, but I would benefit from this metric and would like to avoid splitting it apart, hence the PR for discussion 😄
I'm good with it. I am curious whether there are other things we could do more easily with it.
For now, this is the only place where I see something likely to break easily if we rely on regexps. The previous code that loaded the 1-minute load average did it in a way that worked almost everywhere: "first thing that looks like a float after the last …
only when the locale implied a … Extracting the uptime from this mess with regexps would really be another story. All the other regexps I see seem to extract data from well-defined strings. The output of top on MacOS might be considered fragile, but the regexps are straightforward, so unless the output changes at some point I think this is fine as it is now.
Multiple metrics for MacOS are provided by a single utility, and we used a cache to avoid running it once for each metric. In order to extend this logic to other utilities and avoid code duplication, introduce a "global" cache where any function can store data and which is discarded before each tick. No functional change.
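To make the idea of a cache that is discarded before each tick concrete, here is a minimal sketch. The `TickCache` class and its `fetch` and `invalidate` methods are hypothetical names chosen for illustration; they are not taken from the PR and may differ from what the commit actually implements.

```ruby
# Hypothetical tick-scoped cache (illustrative names, not the PR's code).
class TickCache
  def initialize
    @store = {}
  end

  # Return the cached value for +key+; on first use, compute it with the
  # given block and remember the result until the next invalidation.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end

  # Drop every cached value. Intended to be called once before each tick.
  def invalidate
    @store.clear
  end
end

cache = TickCache.new
runs = 0

cache.fetch(:utility) { runs += 1 }  # runs the block
cache.fetch(:utility) { runs += 1 }  # served from the cache, block not run
cache.invalidate                     # a new tick starts
cache.fetch(:utility) { runs += 1 }  # runs the block again
```

With this shape, several metrics derived from the same utility can share one invocation per tick, while stale data never survives into the next tick.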
I could not test that the first commit does not change the behavior on MacOS, so if you can confirm that nothing breaks, that would be awesome. I spotted a minor issue in the first commit and fixed it. I consider this ready and am ready to adjust the defaults (enabled, thresholds, etc).
Parse the output of `uptime(1)` to gather load averages on BSD. This makes available all the information provided by `uptime(1)` except the current time, which has no value in our monitoring context.
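As a rough illustration of what such parsing has to extract, here is a simplified sketch. It deliberately uses plain regexps rather than the grammar-based approach discussed in this conversation, and it only handles a few common shapes of `uptime(1)` output. The `parse_uptime` function, its return keys, and the sample line are assumptions made for this example, not the PR's code.

```ruby
# Hypothetical, simplified parser: covers only a few common output shapes.
def parse_uptime(line)
  result = {}

  # "up 12 days,  4:33" or "up 5 mins": the days part is optional.
  if (m = line.match(/\bup\s+(?:(\d+)\s+days?,\s+)?(?:(\d+):(\d+)|(\d+)\s+mins?)/))
    days = m[1].to_i                  # nil.to_i is 0 when the part is absent
    hours = m[2].to_i
    minutes = (m[3] || m[4]).to_i
    result[:uptime] = days * 86_400 + hours * 3600 + minutes * 60
  end

  if (m = line.match(/(\d+)\s+users?/))
    result[:users] = m[1].to_i
  end

  # Accept "0.52" as well as "0,52" (decimal comma in some locales).
  if (m = line.match(/load averages?:\s+(\d+[.,]\d+),?\s+(\d+[.,]\d+),?\s+(\d+[.,]\d+)/))
    result[:load_averages] = m.captures.map { |v| v.tr(',', '.').to_f }
  end

  result
end

# Sample line written for this example (assumed format).
sample = ' 3:02PM  up 12 days,  4:33, 2 users, load averages: 0.52, 0.58, 0.59'
info = parse_uptime(sample)
```

The number of special cases needed even for this reduced version hints at why a grammar was considered for the full range of formats.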
Make a new metric with the node uptime available. The metric is the raw uptime in seconds and the description is a human-readable duration. Critical and warning thresholds are reversed compared to other metrics: a value lower than these thresholds indicates a problem. This is intended to get notifications when nodes restart unexpectedly. The uptime check is not enabled by default and must be explicitly enabled using the --checks flag.
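The reversed thresholds and the human-readable description can be sketched as follows. The `uptime_state` and `human_duration` helpers, the state strings, and the default threshold values are placeholders invented for this illustration; the PR's actual names and defaults may differ.

```ruby
# Hypothetical helpers; threshold defaults are placeholders, not the PR's.
def uptime_state(seconds, warning: 86_400, critical: 3600)
  # Reversed comparison: a value *below* a threshold indicates a problem.
  if seconds < critical
    'critical'
  elsif seconds < warning
    'warning'
  else
    'ok'
  end
end

# Turn a number of seconds into a description such as "12 days 4 hours".
def human_duration(seconds)
  days, rest = seconds.divmod(86_400)
  hours, rest = rest.divmod(3600)
  minutes, secs = rest.divmod(60)

  parts = { 'day' => days, 'hour' => hours, 'minute' => minutes, 'second' => secs }
  text = parts.reject { |_unit, n| n.zero? }
              .map { |unit, n| "#{n} #{unit}#{'s' unless n == 1}" }
              .join(' ')
  text.empty? ? '0 seconds' : text
end

uptime = 1_053_180
state = uptime_state(uptime)          # "ok": well above both thresholds
description = human_duration(uptime)  # "12 days 4 hours 33 minutes"
```

A node that rebooted two minutes ago would yield `uptime_state(120) == 'critical'`, which is the notification this metric is meant to trigger.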
LGTM