This repository has been archived by the owner on Jan 1, 2020. It is now read-only.

DISCUSS: feature: expired event resolution #1228

Closed
roobert opened this issue Apr 12, 2016 · 10 comments
@roobert
Contributor

roobert commented Apr 12, 2016

Hi,

There was recently a thread on sensu-users where I brought up a solution we came up with to handle resolving events from sources that don't store state: https://groups.google.com/forum/#!topic/sensu-users/g1EBXggDQDE

Our solution is a hack: we created a handler that reads the :output from a check event and, if the message is classed as expired because the check's TTL has been hit, removes the check result via the API. Here's the handler: https://gist.github.com/roobert/2cd85ce2bbbeaad1748c7149ba1fd2a1
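As a rough illustration of the handler's core logic (the real implementation is the Ruby gist above; this Python sketch is hypothetical, and the exact TTL-expiry message text is an assumption based on Sensu 0.x behaviour):

```python
import re

# Assumed shape of Sensu's check-TTL output message; the real handler
# matches on the actual :output text produced by the server.
TTL_PATTERN = re.compile(r"Last check execution was \d+ seconds ago")

def is_ttl_expired(event):
    """Classify an event as TTL-expired based on its check output."""
    return bool(TTL_PATTERN.search(event["check"]["output"]))

def handle(event, api_delete):
    """api_delete(client, check) stands in for a DELETE against the
    Sensu results API; it is injected here to keep the sketch testable."""
    if is_ttl_expired(event):
        api_delete(event["client"]["name"], event["check"]["name"])
        return "deleted"
    return "ignored"
```

The kludge is that the handler has to pattern-match on human-readable output rather than react to a first-class "this event expired" signal, which is exactly what the proposal below would fix.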

I'd like to propose a feature in core that lets us do this in a less kludgey way. The simplest implementation that springs to mind is either adding a ttl_handlers parameter, or adjusting ttl to accept a hash that includes a handlers key.
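For concreteness, the two shapes being proposed might look something like this in a check definition (the `ttl_handlers` attribute and the hash form of `ttl` are hypothetical here, not existing Sensu attributes):

```json
{
  "checks": {
    "hbase_hot_regions": {
      "command": "check-hbase-hot-regions.rb",
      "subscribers": ["hbase"],
      "interval": 60,
      "ttl": 300,
      "ttl_handlers": ["delete_stale_result"]
    }
  }
}
```

The alternative hash form would fold both into one attribute, e.g. `"ttl": { "seconds": 300, "handlers": ["delete_stale_result"] }`.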

Alternatively, a more fully-featured implementation could include removal of the check result from Redis.

If this feature is considered acceptable then I'd be happy to submit a patch once we've decided on the implementation.

@mb-m

mb-m commented Apr 12, 2016

Just to add my thoughts to this (as the co-author of the above handler, and the person within the company who needed this):

I wrote this because I was trying to raise alerts on dynamic entities, i.e. entities that might be configured, managed, and reconfigured automatically, outside the monitoring system. In this particular case it was regions (horizontal segments) of our HBase tables, and detection of "hot" outlier regions.

With the existing design, trying to do this monitoring runs into two problems:

  1. we're checking ~17k regions, only a few of which will be "hot" at once, so sending clears for all the rest becomes impractical
  2. after enough data is stored in a region, the region may "split" - i.e. the horizontal segment may itself be divided into two child segments, with the names of the resultant children changing. This means that even if we sent the clears, we could find ourselves having to clean up the now-stale monitoring.

In addition, this approach turns out to be useful not only for anything that can be dynamically configured outside the monitoring system - where you trust your configuration system and the protections around it, but want to watch for other failures - but also for edge-triggered events whose timeliness is a factor in the value of the alert data. These include things like backup or other scripts failing, and as such could even extend to alerts raised within the monitoring core itself, for something like a failing handler.

Things I'm currently using this for (all in hadoop/hdfs):

  • "hot" (outlier of read requests or write requests) regions in hbase
  • outlier uncompacted regions in hbase
  • regions in transition (not being served by any current regionserver, but scheduled) (one alert per region)
  • hbase replication error state of some sort from a specific peer
  • high hbase replication lag from a specific peer
  • hadoop/hbase services running that are not expected on a node (one alert for each service)
  • hdfs snapshot is older than 72 hours (one alert for each snapshot)
  • hdfs quorum-journal node (5 of these normally) not in sync with namenode (central server) - raised from namenode

In most of these cases, I should also note that if the underlying condition is still a problem, the event is refreshed (with the appropriate warning) and the TTL counter is reset, so Sensu's current aggregation behaviour ends up helping.
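The refresh-resets-the-countdown behaviour described above can be sketched like this (plain Python, not Sensu's actual implementation):

```python
import time

# Minimal sketch of TTL semantics: each new result for a check resets
# its countdown; a source that goes silent eventually expires.
class TtlTracker:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}

    def record_result(self, check_name, now=None):
        """A fresh result (including a refreshed warning) resets the clock."""
        self.last_seen[check_name] = now if now is not None else time.time()

    def expired(self, check_name, now=None):
        now = now if now is not None else time.time()
        return now - self.last_seen[check_name] > self.ttl
```

So as long as the check keeps re-reporting the problem, the event stays alive; only a source that stops reporting entirely crosses the TTL and becomes a candidate for expiry.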

The other significant case where this behaviour is useful is "quick hack" style monitoring. This may sound bad, but in my experience it's important. Suppose you have an incident caused by some kind of software bug, and (for whatever reason) you can't get it fixed straight away: you put a monitor in place to temporarily alert you about the issue until you can ship the fix. When the new software version goes out, you just want the monitor to go away, without lots of action on your part.

With real expiry of events configured in the way Rob describes above, it's easier to get better monitoring coverage in place, because people don't have to be as concerned about the long-term management of their monitors - Sensu would automatically deal with that management for them. If you're concerned that this doesn't work in practice: this is basically the strategy that the FB internal monitoring system employed (and presumably still employs), for pretty much exactly the reasons given above.

This feature has my vote ;-)

@mb-m

mb-m commented Apr 12, 2016

(to clarify: the "quick hack" above is the monitor, rather than the fix - allowing people to get monitoring in place more trivially and thus providing better visibility and coverage)

@portertech
Contributor

We would have to avoid the use of "ttl" in the definition attribute, as it could be confused with check TTLs and their behaviour. We could use something along the lines of "expiration" or "expires".

@calebhailey calebhailey added this to the 0.25 milestone Apr 26, 2016
@calebhailey

@roobert this is an interesting idea. Thank you for your submission. I'm tagging this for consideration as part of the 0.25 release. We'll follow up with some proposed solutions prior to implementing any changes.

@portertech
Contributor

Please let me know if check "expires": $seconds, causing any resulting events to be removed from Redis after $seconds, is not sufficient.
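A sketch of what that suggestion might look like in a check definition (hypothetical syntax - this attribute was only proposed in this thread, not implemented):

```json
{
  "checks": {
    "hbase_hot_regions": {
      "ttl": 300,
      "expires": 600
    }
  }
}
```

i.e. any event resulting from this check would be removed from Redis 600 seconds after the event was last updated.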

@roobert
Contributor Author

roobert commented Apr 26, 2016

I think that would be fine. 👍

@mb-m

mb-m commented Apr 26, 2016

@portertech We need to make sure that we remove both the event and the result, but other than that - yes, I think it would be fine. If you're going through the same code paths as the APIs, you also want to make sure you don't generate a resolved message, as that tends to confuse things (results without events, or vice versa).
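The orphan problem being described can be illustrated with a toy sketch (plain dicts standing in for the Redis-backed result and event stores; nothing here is Sensu code):

```python
# Expire a check's state by removing BOTH its stored result and its
# event. Deleting only one of the two leaves an orphan (a result
# without an event, or vice versa) that other code paths can trip over.
def expire(check_key, results, events):
    removed = []
    if check_key in results:
        del results[check_key]
        removed.append("result")
    if check_key in events:
        del events[check_key]
        removed.append("event")
    # Deliberately no "resolve" message is published here: per the
    # comment above, emitting a resolution for an expired event would
    # itself create exactly the kind of inconsistency we're avoiding.
    return removed
```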

Otherwise - while there's obviously a bit of confusion about the difference between "ttl" and "expires" that would need careful documentation - I certainly like this idea. Thanks for taking a look!

@mb-m

mb-m commented Apr 26, 2016

(oh, and being British, I'm not a huge fan of "expiration" - but that's just me ;-) )

@calebhailey

The 0.25 release is going to ship today with fewer features than originally anticipated, primarily some internal improvements. Moving this issue to 0.26.

@calebhailey calebhailey modified the milestones: 0.26, 0.25 Jun 13, 2016
@calebhailey calebhailey modified the milestones: 0.26, 0.27 Aug 16, 2016
@portertech portertech modified the milestones: 0.28, 0.27 Oct 12, 2016
@portertech portertech modified the milestones: 0.28, 0.29 Feb 23, 2017
@portertech
Contributor

This issue and feature request is a bit odd. The check TTL feature is intended to identify checks that haven't executed within an expected amount of time, in order to warn users; the use case explained in this issue is for cleanup. There are many moving parts to check result and event storage; it's unreasonable to try to leverage Redis key expiry on keys that continue to be updated, and doing so would only cause data to disappear.

@roobert A Sensu filter can be used to improve your current method, e.g. "output": "eval: value.include?('Last check was execute')", applied to the handler, which can then be added to your default handler set.
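Wired into a Sensu 0.x filter definition, that suggestion might look roughly like this (the filter name is made up, and the `output` substring is copied from the comment above - in practice it would need to match the actual TTL-expiry message text):

```json
{
  "filters": {
    "ttl_expired": {
      "attributes": {
        "check": {
          "output": "eval: value.include?('Last check was execute')"
        }
      }
    }
  }
}
```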

Going to close this issue, as we are not convinced that it's the correct approach to solving this problem.

@portertech portertech removed this from the 0.29 milestone Mar 1, 2017