This repository has been archived by the owner on Nov 20, 2019. It is now read-only.
I'm in favor of this. When we get a non-transient "Execution timed out", we almost always get keepalive alerts at the same time (or shortly afterwards).
I would like to open this up for discussion.
If a check takes longer to run than expected, it often exits 2 (critical) with the output "Execution timed out".
This comes from the sensu-spawn gem: https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163
What if we filter these out of the pagerduty and jira handlers? When we get this output, we can't be certain that the check itself failed; in fact, it is almost always not the check but a bad or hung EC2 instance or something similar.
A positive outcome is that we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation, we would be more confident that it is attempted only when it's needed, since running auto-remediation in response to "Execution timed out" (essentially an unknown exit code from a check) may not always be desirable.
A negative outcome is that we would lose the implicit information about some problems that oncall often derives from seeing "Execution timed out".
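To make the proposal concrete, here is a rough sketch of the kind of predicate the pagerduty and jira handlers could consult before acting. It is only an illustration: the method names `execution_timed_out?` and `should_page?` are hypothetical, and the event shape (a hash with a "check" key holding "status" and "output") is an assumption about how the handler receives the event, not something taken from our handler code.

```ruby
# Hypothetical helpers; not part of Sensu or the sensu-spawn gem.
TIMEOUT_OUTPUT = "Execution timed out".freeze

# True when the event looks like a check that was killed for exceeding
# its timeout (exit 2 with the timeout message as its whole output).
# Assumes the event is a Hash with string keys "check" => {"status", "output"}.
def execution_timed_out?(event)
  check = event.fetch("check", {})
  output = check["output"].to_s.strip
  check["status"] == 2 && output == TIMEOUT_OUTPUT
end

# Hypothetical gate for the pagerduty and jira handlers: skip events
# that carry no real check result.
def should_page?(event)
  !execution_timed_out?(event)
end
```

Matching on the exact output keeps real critical results (which have their own output) flowing to the handlers, at the cost of depending on the literal message string.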
Discuss.
@solarkennedy @bobtfish