Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISCUSS: Aggregates 2.0 (including support for standalone check aggregates)! #1218

Closed
calebhailey opened this issue Apr 9, 2016 · 14 comments

Comments

Projects
None yet
6 participants
@calebhailey
Copy link
Member

commented Apr 9, 2016

This is a meta issue, er, aggregating all outstanding issues & PRs related to Sensu aggregates into a single location for discussion and eventual implementation. If you are aware of a related issue or PR – open or closed – please comment on this issue with a link, so we can take every scenario into consideration before designing Aggregates 2.0.

Related issues:

We'll update this ticket as we begin planning aggregate improvements, so please stay tuned for updates – your feedback is important! #monitoringlove

@calebhailey calebhailey changed the title Aggregates 2.0 (including support for standalone check aggregates) Discuss: Aggregates 2.0 (including support for standalone check aggregates)! Apr 9, 2016

@calebhailey calebhailey changed the title Discuss: Aggregates 2.0 (including support for standalone check aggregates)! DISCUSS: Aggregates 2.0 (including support for standalone check aggregates)! Apr 9, 2016

@calebhailey calebhailey added the Feature label Apr 13, 2016

@calebhailey calebhailey added this to the 0.25 milestone Apr 13, 2016

@cjchand

This comment has been minimized.

Copy link

commented Apr 13, 2016

Just chiming in with a "me, too" - at least for the basic use case where:

  1. We want to have all (or at least most) checks to be standalone
  2. For some of those checks that are duplicated across hosts, we would like to have them be part of an Aggregate (e.g.: warn if one node in a 3-node cluster is down, crit if two are down)

We're in relatively early days, so needs might get more complicated as we get into the weeds.

@calebhailey

This comment has been minimized.

Copy link
Member Author

commented Apr 26, 2016

Aggregates 2.0: Introducing Named Aggregates

Proposal

In an attempt to make Aggregates more broadly useful, we are proposing a major
change to Sensu check aggregate functionality, as follows:

Introducing Named Aggregates

Sensu 0.24 will introduce support for a new Sensu primitive called "named
aggregates". With "named aggregates", multiple check results from disparate
sources (including standalone checks) can be collected into an "aggregate" and
made accessible via the API.

The proposed change to "named aggregates" will replace the legacy check-based
aggregation which limited the scope of aggregates to simple collection of
results from multiple/disparate executions of a single check. With the new named
aggregates, users will be free to aggregate check results from an arbitrary
collection of checks (though we suspect the primary and recommended use case
will continue to be aggregation of multiple check results from a single check).

Current data only

In Sensu 0.24, named aggregates will drop support for aggregate result history,
and only provide "current" aggregate result data. This will help make check
aggregates more broadly adaptable as a primitive, and make them as useful with
standalone checks as they are with subscription checks. For users who care about
check aggregate history, simple checks against the new Aggregates API will
facilitate collection and storage of that data (e.g. write a simple check
against a named aggregate, on a 10-second interval, and store the results in
a time series database), and/or "aggregate service health monitoring"
(i.e. don't alert me about service X unless it's failing on 20% of the instances).

New API endpoints

The proposed API endpoints for named aggregates are as follows:

  • /aggregates (GET): returns a JSON array of named aggregates; e.g.:

    [
      {"name": "check_http"},
      {"name": "check_web_app"},
      {"name": "check_lb"}
    ]
  • /aggregates/:aggregate (GET): returns a JSON hash of counters, with a
    new format:

    {
      "clients": 15,
      "checks": 2,
      "status": {
        "ok": 18,
        "warning": 0,
        "critical": 1,
        "unknown": 0,
        "total": 19
      }
    }

    NOTE: all of the top-level attributes in the aggregate result will have
    corresponding API endpoints for obtaining additional information about the
    aggregate result (i.e. /aggregates/:aggregate/clients,
    /aggregates/:aggregate/checks, and /aggregates/:aggregate/status).

  • /aggregates/:aggregate/clients (GET): returns a JSON array of
    client/check "sources" (but no counters); e.g.:

    [
      {"name": "1-424242", "checks": ["elasticsearch","mysql"]},
      {...},
      {...}
    ]
  • /aggregates/:aggregate/checks (GET): return a JSON array of check/client
    "sources" (but no counters):

    [
      {"name": "elasticsearch", "clients": ["1-424242","i-424243"]},
      {"name": "mysql", "clients": ["1-424242","i-424243"]},
      {...},
      {...}
    ]
  • /aggregates/:aggregate/status/:status (GET): returns a JSON array of
    JSON hashes containing summary reports - one hash/report per check; e.g.:

    $ curl -s http://localhost:4567/aggregates/status/critical | jq .
    [
      {
        "check": "elasticsearch", 
        "summary": [
          {
            "output": "Everything is Broken", 
            "total": 1, 
            "clients": ["i-424242"]
          }
        ]
      },
      {...},
      {...}
    ]
  • The new API will also support a single query parameter for ?max_age (in
    seconds), to support exclusion of "old" check results from aggregate
    counters; e.g.:

    $ curl http://localhost:4567/aggregates/check_http?max_age=3600

    NOTE: this query parameter will be available for the /aggregates/:aggregate
    and /aggregates/status/:status endpoints only.

Why did this take so long?

Named aggregates are not a new idea, but our previous resistance to
implementing them was based on the fact that switching to named aggregates alone
didn't address the larger limitation surrounding aggregates: that they relied
upon timestamps to aggregate results. A legacy check aggregate is a collection
of check results from the same check (by name) and with the same issued
timestamp. This model works well for subscription checks (i.e. checks
scheduled and requested by the Sensu server), because a single check
request – with a single issued timestamp – may produce several
check results (depending on how many client are members of the defined
subscription(s)); however, this dependency on issued timestamps made
aggregates too fragile to depend on for standalone checks.

NOTE: in theory, because the Sensu client uses the same check scheduling
algorithm as the Sensu server
, standalone checks could produce
check results with identical issued timestamps, but minor variances in system
clocks make this unreliable enough that legacy check result aggregation never
really worked for standalone checks.

By decoupling the aggregation of check results from the storage of aggregate
check result history, we feel that named aggregates provide a simple yet elegant
primitive that will enable new monitoring solutions for Sensu users.

@calebhailey calebhailey modified the milestones: 0.24, 0.25 Apr 26, 2016

@keymone

This comment has been minimized.

Copy link

commented Apr 26, 2016

rather than exclude old check results according to max_age param, it would be useful to include it as a separate attribute, e.g. "stale": 5. otherwise sounds good. 👍

@portertech

This comment has been minimized.

Copy link
Member

commented Apr 26, 2016

@keymone interesting, the stale counter could be useful for validating exclusion, but not provide their individual statuses (but I don't think we should go that far).

@keymone

This comment has been minimized.

Copy link

commented Apr 26, 2016

@portertech sure, i'm not interested in stale results, but number of stale clients is useful for alerting logic. if 2 out of 2 hosts are OK - there's nothing to alert on, if 2 out of 200 are OK and 198 stale - you might be in trouble.

@calebhailey

This comment has been minimized.

Copy link
Member Author

commented Apr 26, 2016

@keymone hmm, that is an interesting point. What level of detail would be necessary for "stale result detection"? I would think you would want multiple counters for that too; e.g.:

$ curl -s http://hostname:4567/aggregates/my_aggregate?max_age=xxx | jq .
{
  "clients": 15,
  "checks": 2,
  "status": {
    "ok": 18,
    "warning": 0,
    "critical": 1,
    "unknown": 0,
    "total": 19
  },
  "stale": {
    "clients": 2,
    "checks": 4
  }
}

What should stale counters represent? Should they be a subset of clients included in the aggregate clients counter, or a separate count (e.g. should the above example mean 2/15 clients are stale, or 2/17 clients are stale)?

@keymone

This comment has been minimized.

Copy link

commented Apr 26, 2016

@calebhailey i didn't think about many-checks case, usually one aggregates over many clients for single check and staleness is determined for a result, not client. imo total = ok + warn + crit + unk + stale.

in that case, the most logical outcome for me when we have clients A, B and checks X, Y to aggregate over, would be that total = 4: AX, AY, BX, BY, and staleness can count twice towards each client. in other words staleness is always part of "status" attribute.

hope it makes sense?

@somic

This comment has been minimized.

Copy link

commented Apr 26, 2016

Can you please clarify how these aggregates will be passed to handlers and when handlers will be called on these aggregates - on schedule (say every minute)? when state of aggregate changes? what determines when state of the aggregate has changed?

Will there be any notion of "aggregate event is triggered" and "aggregate event is resolved"? These 2 states are of course a function of ok, warning, critical, unknown and stale counters but there are many ways how they could be implemented. Will sensu supply some functions out of the box but allow user to implement their own too? For example, we found that some checks generate better signal when they are critical based on percentage of critical to total, while others could be based on absolute number of critical + stale.

@calebhailey

This comment has been minimized.

Copy link
Member Author

commented Apr 26, 2016

@somic great questions. At this time our proposal is to make aggregates a lightweight primitive for aggregating check results. By keeping them as un-opinionated as possible it is our hope that they can be useful in more applications than the old aggregates which were too opinionated and thus unusable in certain circumstances.

With named aggregates you can get the desired check/handler behavior you want by configuring a simple check against the API (querying the corresponding named aggregate) and handling the results accordingly (including notifications and/or sending them to a time series database for graphing, etc). In this case aggregates can be monitored and acted upon like any other event, with an unlimited amount of "conditional behaviors".

In a way, the answer to your question is "yes" in that all of Sensu's "out of the box" features can be combined together to obtain your desired behavior; i.e.:

check => aggregate <= check => filter => mutator => handler

I hope this makes sense.

@somic

This comment has been minimized.

Copy link

commented Apr 26, 2016

Yes, makes sense. Thanks!

@solarkennedy

This comment has been minimized.

Copy link
Contributor

commented Apr 27, 2016

I would like to give +1s too all the comments in this thread and thanks to the Sensu "org" to addressing this issue and having a design discussion in the open. I'm also happy to see other Yelpers chime in before I got around to reading this.

I second @keymone's concern that the "stale" checks is a concern. I don't care too much about how it is exposed, but some sort of data about how many stale checks that didn't make the cutoff I think is necessary to give us confidence that the aggregate is working over a sane subset?

I'm also happy to hear that this is going to be considered "a primitive". As long as it exposes enough detail about the internal state of things then it should be really useful to build upon. (Then I guess we can replace our thing)

I guess another nice thing about our thing is that it understands silenced boxes. Silences are not really a first-class citizen so I don't know how we would really work with that? Peaking directly into redis does give us that advantage. (the advantage of baking in all of our opinions about what should be considered healthy)

@calebhailey

This comment has been minimized.

Copy link
Member Author

commented Apr 27, 2016

@solarkennedy first off, thank you for your comments and support! We really appreciate it. \o/

We're definitely going to incorporate some data to acknowledge "stale" results. Identifying this as a requirement as a result of having this discussion in an open forum is a big win (thanks @keymone!). We'll follow-up with some ideas about how we might do that before we start implementing anything.

Lastly, please note that we have another design discussion happening around "Subdue 2.0" (#1219). Subdue 2.0 will incorporate all of the current "silencing" functionality (i.e. the ever-confusing use of stashes for silencing) and subdue functionality into a new first-class primitive (via a new subdue specification and API). We are close to publishing a proposal for Subdue 2.0 (hopefully by the end of this week), and we are targeting the 0.25 release for it. We'll definitely want your input with regards to how Aggregates 2.0 and Subdue 2.0 are complementary. Ideally, these changes should allow you to move away from another workaround "thing". :)

#monitoringlove

portertech added a commit that referenced this issue May 19, 2016

@portertech

This comment has been minimized.

Copy link
Member

commented May 20, 2016

I have completed a implementation of this spec with a few minor changes: #1273

@portertech

This comment has been minimized.

Copy link
Member

commented May 25, 2016

Aggregates 2.0 has been merged into the release/0.24 branch #1273 Sensu 0.24 will be available sometime next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.