This repository has been archived by the owner on Jan 1, 2020. It is now read-only.

DISCUSS: feature: expired event resolution #1228

Closed
roobert opened this issue Apr 12, 2016 · 10 comments
@roobert
Contributor

roobert commented Apr 12, 2016

Hi,

There was recently a thread on sensu-users where I brought up a solution we came up with to handle resolving events from sources that don't store state: https://groups.google.com/forum/#!topic/sensu-users/g1EBXggDQDE

Our solution is a hack: we created a handler that reads the :output from a check event and, if the message is classed as expired because the check's TTL has been hit, removes the check result via the API. Here's the handler: https://gist.github.com/roobert/2cd85ce2bbbeaad1748c7149ba1fd2a1
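As a rough illustration of the handler's core logic (the real implementation is the Ruby gist above; this Python sketch is hypothetical, and the exact TTL-expiry message text is an assumption based on Sensu 0.x behaviour):

```python
import re

# Assumed shape of Sensu's check-TTL output message; the real handler
# matches on the actual :output text produced by the server.
TTL_PATTERN = re.compile(r"Last check execution was \d+ seconds ago")

def is_ttl_expired(event):
    """Classify an event as TTL-expired based on its check output."""
    return bool(TTL_PATTERN.search(event["check"]["output"]))

def handle(event, api_delete):
    """api_delete(client, check) stands in for a DELETE against the
    Sensu results API; it is injected here to keep the sketch testable."""
    if is_ttl_expired(event):
        api_delete(event["client"]["name"], event["check"]["name"])
        return "deleted"
    return "ignored"
```

The kludge is that the handler has to pattern-match on human-readable output rather than react to a first-class "this event expired" signal, which is exactly what the proposal below would fix.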

I'd like to propose a feature in core that lets us do this in a less kludgey way. The simplest implementation that springs to mind is either adding a ttl_handlers parameter, or adjusting ttl to accept a hash that includes a handlers key.
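For concreteness, the two shapes being proposed might look something like this in a check definition (the `ttl_handlers` attribute and the hash form of `ttl` are hypothetical here, not existing Sensu attributes):

```json
{
  "checks": {
    "hbase_hot_regions": {
      "command": "check-hbase-hot-regions.rb",
      "subscribers": ["hbase"],
      "interval": 60,
      "ttl": 300,
      "ttl_handlers": ["delete_stale_result"]
    }
  }
}
```

The alternative hash form would fold both into one attribute, e.g. `"ttl": { "seconds": 300, "handlers": ["delete_stale_result"] }`.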

Alternatively, a more fully-featured implementation could include removal of the check result from Redis.

If this feature is considered acceptable then I'd be happy to submit a patch once we've decided on the implementation.

@mb-m

mb-m commented Apr 12, 2016

Just to add my thoughts to this (as the co-author of the above handler, and the person within the company who needed this):

I wrote this because I was trying to raise alerts on dynamic entities, i.e. entities that might be configured, managed, and reconfigured automatically, outside the monitoring system. In this particular case it was regions (horizontal segments) of our HBase tables, and detection of "hot" outlier regions.

With the existing design, trying to do this monitoring runs into two problems:

  1. we're checking ~17k regions, only a few of which will be "hot" at once, so sending clears for all the rest becomes impractical
  2. after enough data is stored in a region, the region may "split" - i.e. the horizontal segment may itself be divided into two child segments, with the names of the resultant children changing. This means that even if we sent the clears, we could find ourselves having to clean up the now-stale monitoring.

In addition, this approach turns out to be useful not only for anything that can be dynamically configured outside the monitoring system - where you trust your configuration system and the protections around it, but want to watch for other failures - but also for edge-triggered events whose timeliness is a factor in the value of the alert data. These include things like backup or other scripts failing, and as such could even extend to alerts raised within the monitoring core itself, for something like a failing handler.

Things I'm currently using this for (all in hadoop/hdfs):

  • "hot" (outlier of read requests or write requests) regions in hbase
  • outlier uncompacted regions in hbase
  • regions in transition (not being served by any current regionserver, but scheduled) (one alert per region)
  • hbase replication error state of some sort from a specific peer
  • high hbase replication lag from a specific peer
  • hadoop/hbase services running that are not expected on a node (one alert for each service)
  • hdfs snapshot is older than 72 hours (one alert for each snapshot)
  • hdfs quorum-journal node (5 of these normally) not in sync with namenode (central server) - raised from namenode

In most of these cases, I should also note that if the underlying condition is still a problem, the event is refreshed (with the appropriate warning) and the TTL counter is reset, so Sensu's current aggregation behaviour ends up helping.
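The refresh-resets-the-countdown behaviour described above can be sketched like this (plain Python, not Sensu's actual implementation):

```python
import time

# Minimal sketch of TTL semantics: each new result for a check resets
# its countdown; a source that goes silent eventually expires.
class TtlTracker:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}

    def record_result(self, check_name, now=None):
        """A fresh result (including a refreshed warning) resets the clock."""
        self.last_seen[check_name] = now if now is not None else time.time()

    def expired(self, check_name, now=None):
        now = now if now is not None else time.time()
        return now - self.last_seen[check_name] > self.ttl
```

So as long as the check keeps re-reporting the problem, the event stays alive; only a source that stops reporting entirely crosses the TTL and becomes a candidate for expiry.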

The other significant case where this behaviour is useful is "quick hack" style monitoring. This may sound bad, but in my experience it's important. Suppose you have an incident caused by some kind of software bug, and (for whatever reason) you can't get it fixed straight away: you put a monitor in place to temporarily alert you about the issue until you can ship the fix. When the new software version goes out, you just want the monitor to go away, without lots of action on your part.

With real expiry of events configured in the way Rob describes above, it's easier to get better monitoring coverage in place, because people don't have to be as concerned about the long-term management of their monitors - Sensu would automatically deal with that management for them. If you're concerned that this doesn't work in practice: this is basically the strategy that the FB internal monitoring system employed (and presumably still employs), for pretty much exactly the reasons given above.

This feature has my vote ;-)

@mb-m

mb-m commented Apr 12, 2016

(to clarify: the "quick hack" above is the monitor, rather than the fix - allowing people to get monitoring in place more trivially and thus providing better visibility and coverage)

@portertech
Contributor

We would have to avoid the use of "ttl" in the definition attribute, as it could be confused with check TTLs and their behaviour. We could use something along the lines of "expiration" or "expires".

@calebhailey calebhailey added this to the 0.25 milestone Apr 26, 2016
@calebhailey

@roobert this is an interesting idea. Thank you for your submission. I'm tagging this for consideration as part of the 0.25 release. We'll follow up with some proposed solutions prior to implementing any changes.

@portertech
Contributor

Please let me know if check "expires": $seconds, causing any resulting events to be removed from Redis after $seconds, is not sufficient.
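A sketch of what that suggestion might look like in a check definition (hypothetical syntax - this attribute was only proposed in this thread, not implemented):

```json
{
  "checks": {
    "hbase_hot_regions": {
      "ttl": 300,
      "expires": 600
    }
  }
}
```

i.e. any event resulting from this check would be removed from Redis 600 seconds after the event was last updated.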

@roobert
Contributor Author

roobert commented Apr 26, 2016

I think that would be fine. 👍

@mb-m

mb-m commented Apr 26, 2016

@portertech We need to make sure that we remove both the event and the result, but other than that - yes, I think it would be fine. If you're going through the same code paths as the APIs, you also want to make sure you don't generate a resolved message, as that tends to confuse things (results without events, or vice versa).
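The orphan problem being described can be illustrated with a toy sketch (plain dicts standing in for the Redis-backed result and event stores; nothing here is Sensu code):

```python
# Expire a check's state by removing BOTH its stored result and its
# event. Deleting only one of the two leaves an orphan (a result
# without an event, or vice versa) that other code paths can trip over.
def expire(check_key, results, events):
    removed = []
    if check_key in results:
        del results[check_key]
        removed.append("result")
    if check_key in events:
        del events[check_key]
        removed.append("event")
    # Deliberately no "resolve" message is published here: per the
    # comment above, emitting a resolution for an expired event would
    # itself create exactly the kind of inconsistency we're avoiding.
    return removed
```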

Otherwise - while there's obviously a bit of confusion about the difference between "ttl" and "expires" that would need careful documentation - I certainly like this idea. Thanks for taking a look!

@mb-m

mb-m commented Apr 26, 2016

(oh, and being British, I'm not a huge fan of "expiration" - but that's just me ;-) )

@calebhailey

The 0.25 release is going to ship today with fewer features than originally anticipated, primarily some internal improvements. Moving this issue to 0.26.

@calebhailey calebhailey modified the milestones: 0.26, 0.25 Jun 13, 2016
@calebhailey calebhailey modified the milestones: 0.26, 0.27 Aug 16, 2016
@portertech portertech modified the milestones: 0.28, 0.27 Oct 12, 2016
@portertech portertech modified the milestones: 0.28, 0.29 Feb 23, 2017
@portertech
Contributor

This issue and feature request is a bit odd. The check TTL feature is intended to identify checks that haven't executed within an expected amount of time, in order to warn users; the use case explained in this issue is for cleanup. There are many moving parts to check result and event storage; it's unreasonable to try to leverage Redis key expiry on keys that continue to be updated, and doing so would only cause data to disappear.

@roobert A Sensu filter can be used to improve your current method, e.g. "output": "eval: value.include?('Last check was execute')", applied to the handler, which can then be added to your default handler set.
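Wired into a Sensu 0.x filter definition, that suggestion might look roughly like this (the filter name is made up, and the `output` substring is copied from the comment above - in practice it would need to match the actual TTL-expiry message text):

```json
{
  "filters": {
    "ttl_expired": {
      "attributes": {
        "check": {
          "output": "eval: value.include?('Last check was execute')"
        }
      }
    }
  }
}
```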

Going to close this issue, as we are not convinced that it's the correct approach to solving this problem.

@portertech portertech removed this from the 0.29 milestone Mar 1, 2017