codeintel: alert when all executor jobs are failing #38767
Conversation
monitoring/monitoring/monitoring.go
Outdated
// when using <aggregation>_over_time queries to augment the alert query, lookbackWindow
// determines the lookback range for the subquery. Location in AlertQuery must be specified
// with %%[1]s.
This seems a bit brittle, but I see why it's done (you want to supply the lookback window at a different level of abstraction than the alert query). Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value (or at least using a default value here)?
Agreed that this sounds super fragile, but I think it would be best to not do this for now - i.e. if you supply a raw query, that is simply used with no formatting
Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value
This would be an alternative as well, e.g. with a new observableLookbackAlertDefinition that you can only get via WithLookback - but that still has the issue of requiring a perfectly crafted raw query.
@bobheadxi Hmm, I have a sense of where you're going with this, but I'm having a hard time figuring out how to arrange things so that they make sense from this perspective.
@Strum355 Might be good to just hard-code the lookback into the error rate panel alert definition for now. I think the other comment thread is a higher impact thing to solve up-front (and we can always re-parameterize the lookback later). I think this will be a bit specific to error rate alerts, though.
@bobheadxi Noah and I prototyped this in a hack session earlier this week and can confirm that the approach of using …
monitoring/monitoring/monitoring.go
Outdated
// AlertQuery is a Prometheus query (without aggregate and threshold) that should be used as the
// alert query instead of the query used for the panel visualization. Useful if you need to perform
// additional …
AlertQuery string
We should move this to ObservableAlertDefinition with rawQuery, then the API would be:
monitoring.RawAlert(query).For(time.Hour).GreaterOrEqual(100)
I'm having a bit of difficulty seeing how this is the right place for this (too many abstraction layers, my head 🤯). I'd like to have this defined in the same place as the panel query, since this is where the panel query is built:
https://github.com/sourcegraph/sourcegraph/pull/38767/files#diff-38dd11ae9cda5810bba1baabf511ec68b1ce7f90622093c797dc3e2fc676c889R138
@bobheadxi In this particular example the alert query will be the same for anything using this panel - the only thing that changes is the threshold.
I do like the idea of defining the alert at the point above, as Rob illustrated, but not supplying the query at that high a level, since the panel query lives lower down.
Yeah, I agree with @Strum355's take here, but I'm not 100% clear on how to define it this way.
too many abstraction layers, my head 🤯
Agreed on that 😆
From the other thread:
Might be good to just hard-code the lookback into the error rate panel alert definition for now
+1, this is the gist of the other thread as well - and in that interest, it's why I suggest placing it on the alert type itself rather than the panel type. I would favour explicitness, i.e. "for this alert I am doing something very different" - plus, at the end of the day, that is what we are overriding here.
(On that note, maybe overrideQuery is the better name internally, with monitoring.AlertQuery being the constructor.)
Updated the PR to (hopefully) address this. Spent too much time trying various approaches, so I'm just gonna go with this :laff:
48bfc1f to 4358481
Some minor details about naming and the generated output, but otherwise LGTM :)
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Creates an alert for the executor error rate that fires when the rate of errors is 100%, indicating some global misconfiguration (as happened before with src-cli related issues).
The alert is a bit special in that it uses a different query from the panel's, one based on the last_over_time aggregation. We do this because we don't want the alert to mark itself as resolved if there happens to be a period within the defined window where there are no auto-indexing jobs (i.e. when the error rate is "technically" < 100%). The screenshot below illustrates how the alert query maintains the last value over a predefined window, so that if no executor jobs are processing but the error rate was 100% before, we continue alerting, since the absence of running jobs does not imply the issue is resolved.
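To make the last_over_time approach concrete, here is a hedged PromQL sketch (the metric names are invented; the real dashboard's query differs). The subquery syntax [1h:5m] lets last_over_time hold the most recent sample of the error-rate expression over the lookback window:

```promql
# Panel query (illustrative): instantaneous executor error rate over 5m
sum(increase(executor_errors_total[5m]))
  / sum(increase(executor_total[5m])) * 100

# Alert query: keep firing on the last observed value while no jobs run
last_over_time((
  sum(increase(executor_errors_total[5m]))
    / sum(increase(executor_total[5m])) * 100
)[1h:5m])
```

When the error-rate expression has no samples (no jobs running), last_over_time keeps returning the last value seen within the hour, so the alert does not self-resolve prematurely.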
Closes #30494
Test plan
Only modifies dashboards/alerts, n/a