DISCUSS: Aggregates 2.0 (including support for standalone check aggregates)! #1218
This is a meta issue, er, aggregating all outstanding issues & PRs related to Sensu aggregates into a single location for discussion and eventual implementation. If you are aware of a related issue or PR – open or closed – please comment on this issue with a link, so we can take every scenario into consideration before designing Aggregates 2.0.
We'll update this ticket as we begin planning aggregate improvements, so please stay tuned for updates – your feedback is important! #monitoringlove
changed the title
Aggregates 2.0 (including support for standalone check aggregates)
Apr 9, 2016
Just chiming in with a "me, too" - at least for the basic use case where:
We're in relatively early days, so needs might get more complicated as we get into the weeds.
Aggregates 2.0: Introducing Named Aggregates
In an attempt to make Aggregates more broadly useful, we are proposing a major
Introducing Named Aggregates
Sensu 0.24 will introduce support for a new Sensu primitive called "named
The proposed change to "named aggregates" will replace the legacy check-based
Current data only
In Sensu 0.24, named aggregates will drop support for aggregate result history,
New API endpoints
The proposed API endpoints for named aggregates are as follows:
Why did this take so long?
Named aggregates are not a new idea, but our previous resistance to
NOTE: in theory, because the Sensu client uses the same check scheduling
By decoupling the aggregation of check results from the storage of aggregate
@keymone hmm, that is an interesting point. What level of detail would be necessary for "stale result detection"? I would think you would want multiple counters for that too; e.g.:
What should stale counters represent? Should they be a subset of clients included in the aggregate
@calebhailey i didn't think about many-checks case, usually one aggregates over many clients for single check and staleness is determined for a result, not client. imo
in that case, the most logical outcome for me when we have clients A, B and checks X, Y to aggregate over, would be that
hope it makes sense?
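To make the "staleness is determined for a result, not client" idea concrete, here is a minimal sketch in Python. It is not taken from the proposal: the `count_stale` helper, the `(client, check)` keys, and the `max_age` cutoff are all hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical input: last execution time (epoch seconds) per (client, check) result.
ResultTimes = Dict[Tuple[str, str], float]


def count_stale(last_executed: ResultTimes, now: float, max_age: float) -> int:
    """Count results whose last execution is older than max_age seconds.

    Staleness is evaluated per result, so one client can contribute
    both a fresh and a stale result to the same aggregate.
    """
    return sum(1 for ts in last_executed.values() if now - ts > max_age)


# Clients A and B, checks X and Y (illustrative data only).
results = {
    ("A", "X"): 100.0,
    ("A", "Y"): 40.0,
    ("B", "X"): 95.0,
    ("B", "Y"): 10.0,
}
stale = count_stale(results, now=100.0, max_age=30.0)
```

With this reading, a stale counter would sit alongside the ok/warning/critical/unknown counters rather than replace any of them.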
Can you please clarify how these aggregates will be passed to handlers and when handlers will be called on these aggregates - on schedule (say every minute)? when state of aggregate changes? what determines when state of the aggregate has changed?
Will there be any notion of "aggregate event is triggered" and "aggregate event is resolved"? These two states are of course a function of the ok, warning, critical, unknown and stale counters, but there are many ways they could be implemented. Will Sensu supply some functions out of the box but allow users to implement their own too? For example, we found that some checks generate a better signal when they go critical based on the percentage of critical results out of the total, while others could be based on the absolute number of critical + stale results.
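The two trigger styles described above could be sketched roughly as follows. This is only an illustration in Python: the function names and the shape of the `counters` dict are assumptions, and counting stale results toward the total is one possible choice, not something the proposal specifies.

```python
from typing import Dict


def triggered_by_percentage(counters: Dict[str, int], threshold_pct: float) -> bool:
    """Trigger when critical results reach a percentage of all results."""
    total = sum(counters.values())  # assumption: stale results count toward the total
    if total == 0:
        return False  # nothing aggregated yet, so nothing to trigger on
    return 100.0 * counters.get("critical", 0) / total >= threshold_pct


def triggered_by_count(counters: Dict[str, int], max_bad: int) -> bool:
    """Trigger when critical + stale results reach an absolute number."""
    return counters.get("critical", 0) + counters.get("stale", 0) >= max_bad


# Illustrative counters only.
counters = {"ok": 6, "warning": 1, "critical": 2, "unknown": 0, "stale": 1}
```

Either function can be swapped in per check, which is the kind of user-supplied policy the question is asking about.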
@somic great questions. At this time our proposal is to make aggregates a lightweight primitive for aggregating check results. By keeping them as un-opinionated as possible, we hope they can be useful in more applications than the old aggregates, which were too opinionated and thus unusable in certain circumstances.
With named aggregates you can get the check/handler behavior you want by configuring a simple check against the API (querying the corresponding named aggregate) and handling the results accordingly (including notifications and/or sending them to a time series database for graphing, etc). In this case aggregates can be monitored and acted upon like any other event, with an unlimited number of "conditional behaviors".
In a way, the answer to your question is "yes" in that all of Sensu's "out of the box" features can be combined together to obtain your desired behavior; i.e.:
check => aggregate <= check => filter => mutator => handler
I hope this makes sense.
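As a rough sketch of "a simple check against the API", the Python below evaluates an aggregate payload and maps it to the usual check exit statuses (0 OK, 1 warning, 2 critical, 3 unknown). The JSON shape (`results.total`, `results.critical`), the thresholds, and the `evaluate_aggregate` name are assumptions for illustration; fetching the payload from the real endpoint is left out.

```python
import json
from typing import Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_aggregate(payload: str, warn_pct: float = 10.0,
                       crit_pct: float = 25.0) -> Tuple[int, str]:
    """Turn an aggregate JSON payload into (exit_status, message)."""
    try:
        results = json.loads(payload)["results"]  # hypothetical payload shape
        total = int(results["total"])
        critical = int(results["critical"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "could not parse aggregate payload"
    if total <= 0:
        return UNKNOWN, "aggregate has no results"
    pct = 100.0 * critical / total
    message = f"{critical}/{total} results critical ({pct:.1f}%)"
    if pct >= crit_pct:
        return CRITICAL, message
    if pct >= warn_pct:
        return WARNING, message
    return OK, message


# Illustrative payload; a real check would read this from the aggregates API.
status, message = evaluate_aggregate(
    json.dumps({"results": {"ok": 7, "critical": 3, "total": 10}})
)
```

A check script would print `message` and exit with `status`, after which the usual filter, mutator, and handler chain applies like for any other event.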
I would like to give +1s to all the comments in this thread, and thanks to the Sensu "org" for addressing this issue and having a design discussion in the open. I'm also happy to see other Yelpers chime in before I got around to reading this.
I second @keymone's concern about "stale" checks. I don't care too much about how it is exposed, but I think some sort of data about how many stale checks didn't make the cutoff is necessary to give us confidence that the aggregate is working over a sane subset.
I'm also happy to hear that this is going to be considered "a primitive". As long as it exposes enough detail about the internal state of things then it should be really useful to build upon. (Then I guess we can replace our thing)
I guess another nice thing about our thing is that it understands silenced boxes. Silences are not really a first-class citizen, so I don't know how we would work with that. Peeking directly into Redis does give us that advantage (the advantage of baking in all of our opinions about what should be considered healthy).
@solarkennedy first off, thank you for your comments and support! We really appreciate it. \o/
We're definitely going to incorporate some data to acknowledge "stale" results. Identifying this requirement through a discussion in an open forum is a big win (thanks @keymone!). We'll follow up with some ideas about how we might do that before we start implementing anything.
Lastly, please note that we have another design discussion happening around "Subdue 2.0" (#1219). Subdue 2.0 will incorporate all of the current "silencing" functionality (i.e. the ever-confusing use of stashes for silencing) and