DISCUSS: feature: expired event resolution #1228
Comments
Just to add my thoughts to this (as the co-author of the above handler, and the person within the company who needed this): I wrote this because I was trying to raise alerts on dynamic entities, i.e. entities that might be configured, managed and reconfigured automatically and outside of the monitoring system. In this particular case, it was regions (horizontal segments) of our HBase tables, and some detection of "hot" outlier regions. With the existing design, trying to do this monitoring, we suffer from 2 problems:
In addition, though, this approach turns out to be useful in two kinds of situation. The first is anything that can be dynamically configured outside of the monitoring system, where you trust your configuration system and the protections around it but want to look for other failures. The second is edge-triggered events, where timeliness is a factor in the value of the alert data. These include things like backups or other scripts failing, and could even extend to alerts raised within the monitoring core itself for something like a handler failing. Things I'm currently using this for (all in hadoop/hdfs):
In most of these cases, I should also note that if the underlying event is still a problem, the event is refreshed (with the appropriate warning) and the ttl counter is reset, so the current aggregation behaviour of Sensu ends up helping. The other significant case where this behaviour is useful, however, is "quick hack" style monitoring. This may sound bad, but in my experience, it's important. Suppose you have an incident caused by some kind of software bug, and you can't (for whatever reason) get it fixed straight away; you put in place a monitor to temporarily alert you about the issue until you can get the fix out. When the new software version goes out, you just want the monitor to go away, without loads of action on your part. With real expiry of events configured in the way Rob describes above, it's easier to get better monitoring coverage in place, because people don't have to be as concerned about the long-term management of their monitors: Sensu would automatically deal with the management for them. If you're concerned that this doesn't work in practice, this is basically the strategy that the FB internal monitoring system employed (and presumably still employs), for pretty much exactly the reasons given above. This feature has my vote ;-)
(to clarify: the "quick hack" above is the monitor, rather than the fix - allowing people to get monitoring in place more trivially and thus providing better visibility and coverage)
We would have to avoid the use of "ttl" in the definition attribute, as it could be confused with check TTLs and their behaviour. We could use something along the lines of "expiration" or "expires".
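To make the refresh-and-expire behaviour discussed above concrete, here is a minimal sketch in Python. It is not Sensu's data model: `EventStore`, `refresh`, `sweep` and the `expires` value are all hypothetical names used only for illustration. Each refresh of a still-failing event pushes its deadline out again, and a periodic sweep drops events whose deadline has passed.

```python
import time


class EventStore:
    """In-memory stand-in for event storage (hypothetical, not Sensu's model)."""

    def __init__(self):
        # Maps (client, check) -> (event, deadline as a Unix timestamp).
        self._events = {}

    def refresh(self, client, check, event, expires, now=None):
        """Store or refresh an event and reset its expiry deadline."""
        now = time.time() if now is None else now
        self._events[(client, check)] = (event, now + expires)

    def sweep(self, now=None):
        """Remove events whose deadline has passed; return the removed keys."""
        now = time.time() if now is None else now
        expired = [
            key for key, (_, deadline) in self._events.items() if deadline <= now
        ]
        for key in expired:
            del self._events[key]
        return expired

    def __contains__(self, key):
        return key in self._events
```

An event that keeps being reported never reaches its deadline, while one that stops being reported disappears after `expires` seconds without anyone having to resolve it by hand.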
@roobert this is an interesting idea. Thank you for your submission. I'm tagging this for consideration as part of the 0.25 release. We'll follow up with some proposed solutions prior to implementing any changes.
Please let me know if check
I think that would be fine. 👍
@portertech We need to make sure that we remove both the event and the result, but other than that - yes, I think it would be fine. If you're going through the same code paths as the APIs, you want to make sure you don't generate a resolved message too, as this tends to confuse things (results without events or vice versa). Otherwise - while there's obviously a bit of confusion about the difference between "ttl" and "expires" that would need careful documentation - I certainly like this idea. Thanks for taking a look!
(oh, and being British, I'm not a huge fan of "expiration" - but that's just me ;-) )
The 0.25 release is going to hit today with fewer features than originally anticipated & primarily only some internal improvements. Moving this issue to 0.26.
This issue and feature request is a bit odd. The check TTL feature is intended to identify checks that haven't executed in an expected amount of time in order to warn users, whereas the use case explained in this issue is for cleanup. There are many moving parts to check result and event storage; it's unreasonable to try to leverage Redis expire with many keys that continue to be updated, and this would only cause data to disappear. @roobert A Sensu filter can be used to improve your current method, e.g. Going to close this issue, as we are not convinced that it's the correct approach to solving this problem.
Hi,
There was recently a thread on sensu-users where I brought up a solution we came up with to handle resolving events from sources that don't store state: https://groups.google.com/forum/#!topic/sensu-users/g1EBXggDQDE
Our solution is a hack: we created a handler that reads the `:output` from a check event and, if the message is classed as an expired message due to the TTL for the check being hit, removes the check via the API. Here's the handler: https://gist.github.com/roobert/2cd85ce2bbbeaad1748c7149ba1fd2a1
I'd like to propose that we add a feature to core that allows us to do this in a less kludgey way. The simplest implementation that immediately springs to mind is either adding a `ttl_handlers` parameter, or adjusting `ttl` to accept a hash that includes a `handlers` key.
Alternatively, a more fully-featured implementation could include removal of the check from Redis.
If this feature is considered acceptable then I'd be happy to submit a patch once we've decided on the implementation.
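For readers who don't want to open the gist, the handler approach described above can be sketched roughly as follows. This is an illustrative Python sketch, not the handler from the gist. It assumes that the TTL-expiry output has the form "Last check execution was N seconds ago", that the Sensu API is reachable without authentication at the hypothetical `API_URL`, and that it exposes `DELETE /results/:client/:check` and `DELETE /events/:client/:check`. The function names are made up for the example.

```python
import re
import urllib.request

# Assumption: a check-TTL result carries output of this form.
TTL_OUTPUT = re.compile(r"^Last check execution was \d+ seconds ago")

# Assumption: the Sensu API listens here with no authentication.
API_URL = "http://127.0.0.1:4567"


def is_ttl_expiry(event):
    """Return True if the check output looks like a TTL-expiry message."""
    output = event.get("check", {}).get("output", "")
    return bool(TTL_OUTPUT.match(output.strip()))


def deletion_paths(event):
    """API paths for the stored result and the event of this client/check."""
    client = event["client"]["name"]
    check = event["check"]["name"]
    return [f"/results/{client}/{check}", f"/events/{client}/{check}"]


def handle(event, api_url=API_URL):
    """Delete the stale result and event; return the paths requested."""
    if not is_ttl_expiry(event):
        return []  # Not a TTL expiry, so leave the event alone.
    paths = deletion_paths(event)
    for path in paths:
        request = urllib.request.Request(api_url + path, method="DELETE")
        urllib.request.urlopen(request, timeout=10).close()
    return paths
```

A real handler would read the event JSON from stdin and pass it in, e.g. `handle(json.load(sys.stdin))`. Note that deleting the event through the API may itself produce a resolve notification, which is one reason doing this in core would be less kludgey.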