codeintel: alert when all executor jobs are failing #38767

Merged: 5 commits into main from nsc-ef/executor-errors-over-time on Jul 19, 2022

Conversation

Strum355 (Member) commented:

Creates an alert for the executors' error rate that fires when the error rate is 100%, indicating some global misconfiguration (as happened before with src-cli-related issues).

The alert is a bit special in that it uses a different query than the panel, one based on the last_over_time aggregation. We do this because we don't want the alert to mark itself as resolved if there happens to be a period within the defined window where there are no auto-indexing jobs (i.e. when the error rate is "technically" < 100%).

The screenshot below illustrates how the alert query maintains the last value over a predefined window, so that if no executor jobs are being processed but the error rate was 100% beforehand, we continue alerting: the absence of running jobs does not imply the issue is resolved.

(Screenshot: the alert query holding its last value over the lookback window while no jobs are running.)
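
For illustration, a minimal sketch of the pattern described above, written the way it might appear in the Go monitoring generator. The metric names and the 24h window are hypothetical placeholders, not the exact queries from this PR:

```go
package main

import "fmt"

func main() {
	// Panel query (hypothetical): instantaneous error percentage over the last 5 minutes.
	panelQuery := `sum(increase(executor_errors_total[5m])) / sum(increase(executor_attempts_total[5m])) * 100`

	// Alert query: wrap the same expression in last_over_time over a lookback window, so the
	// alert keeps its last value (and keeps firing) even while no new jobs are being processed.
	alertQuery := fmt.Sprintf(`last_over_time((%s)[24h:1m])`, panelQuery)

	fmt.Println(alertQuery)
}
```

The key point is that the panel and the alert share the underlying expression, but only the alert holds its last value across idle periods.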

Closes #30494

Test plan

Only modifies dashboards/alerts, n/a

Strum355 added the team/graph Graph Team (previously Code Intel/Language Tools/Language Platform), monitoring, and team/language-platform-and-navigation labels on Jul 13, 2022
Strum355 self-assigned this on Jul 13, 2022
cla-bot added the cla-signed label on Jul 13, 2022

sourcegraph-bot (Contributor) commented Jul 13, 2022:

Codenotify: Notifying subscribers in CODENOTIFY files for diff 3accc47...192edb5.

@bobheadxi:
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

@slimsag:
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

@sourcegraph/delivery:
- doc/admin/observability/alerts.md
- doc/admin/observability/dashboards.md
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

Comment on lines 916 to 918
// when using <aggregation>_over_time queries to augment the alert query, lookbackWindow
// determines the lookback range for the subquery. Location in AlertQuery must be specified
// with %%[1]s.

Contributor commented:

This seems a bit brittle, but I see why it's done (you want to supply the lookback window at a different level of abstraction than the alert query). Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value (or at least using a default value here)?
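
For context, a rough sketch of the substitution mechanism this comment refers to. The metric name and window value are hypothetical, and in the real source the placeholder is written as %%[1]s, presumably because the query string passes through an earlier formatting step:

```go
package main

import "fmt"

func main() {
	// Hypothetical alert query template: %[1]s marks where the lookback window goes.
	alertQueryTemplate := `last_over_time(src_executor_error_rate[%[1]s])`

	lookbackWindow := "24h"
	alertQuery := fmt.Sprintf(alertQueryTemplate, lookbackWindow)

	fmt.Println(alertQuery) // last_over_time(src_executor_error_rate[24h])
}
```

The brittleness concern is that nothing forces the template and the lookback value to be supplied together.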

bobheadxi (Member) commented Jul 13, 2022:

Agreed that this sounds super fragile, but I think it would be best to not do this for now - i.e. if you supply a raw query, that is simply used with no formatting

Member commented:

> Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value

This would be an alternative as well, e.g. with a new observableLookbackAlertDefinition that you can only get via WithLookback - but that still has the issue of requiring a perfectly crafted raw query
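
A loose sketch of what that alternative could look like, assuming a hypothetical observableLookbackAlertDefinition that can only be obtained through WithLookback. The names follow the suggestion above, but none of this is the actual monitoring package API:

```go
package monitoring

import "fmt"

// observableLookbackAlertDefinition can only be constructed via WithLookback, so a
// lookback-based alert can never be declared without a window.
type observableLookbackAlertDefinition struct {
	query string
}

// WithLookback substitutes the lookback window into the raw query template, which is
// expected to contain exactly one %s placeholder (e.g. "last_over_time(expr[%s])").
func WithLookback(rawQueryTemplate, lookbackWindow string) observableLookbackAlertDefinition {
	return observableLookbackAlertDefinition{
		query: fmt.Sprintf(rawQueryTemplate, lookbackWindow),
	}
}

// Query returns the fully-formed alert query.
func (d observableLookbackAlertDefinition) Query() string { return d.query }
```

As noted, this still leans on the caller hand-crafting a correct raw query; the type only guarantees that a window is supplied.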

Strum355 (Member, Author) commented:

@bobheadxi Hmm, I have a sense of where you're going with this, but I'm having a hard time figuring out how to arrange things in a way that makes sense from that perspective

Contributor commented:

@Strum355 Might be good to just hard-code the lookback into the error rate panel alert definition for now. I think the other comment thread is a higher impact thing to solve up-front (and we can always re-parameterize the lookback later). I think this will be a bit specific to error rate alerts, though.

efritz (Contributor) commented Jul 13, 2022:

@bobheadxi Noah and I prototyped this in a hack session earlier this week and can confirm that the approach of using last_over_time seems ideal for this use case. I'd like to leave the merge approval to you as it touches some of the monitoring generator code, and I'm sure you'll have ideas on how to clean this up (the biggest change is the addition of an AlertQuery so that we can separate the alert query and the display query for the error rate panels).

Comment on lines 709 to 712
// AlertQuery is Prometheus query (without aggregate and threshold) that should be used as the
// alert query instead of the query used for the panel visualization. Useful if you need to perform
// additional
AlertQuery string
Member commented:

We should move this to ObservableAlertDefinition with rawQuery, then the API would be:

monitoring.RawAlert(query).For(time.Hour).GreaterOrEqual(100),

Strum355 (Member, Author) commented:

I'm having a bit of difficulty seeing how this is the right place for this (too many abstraction layers, my head 🤯). I'd like to have this defined in the same place as the panel query, since this is where the panel query is built:
https://github.com/sourcegraph/sourcegraph/pull/38767/files#diff-38dd11ae9cda5810bba1baabf511ec68b1ce7f90622093c797dc3e2fc676c889R138

Contributor commented:

@bobheadxi In this particular example the alert query will be the same for anything using this panel - the only thing that changes is the threshold.

Strum355 (Member, Author) commented:

I do like the idea of having the alert defined at the point above as Rob illustrated, but not in the sense of supplying the query at that high a level, since the panel query is defined lower down.

Contributor commented:

Yeah, I agree with @Strum355's take here, but I'm not 100% clear on how to define it this way.

bobheadxi (Member) commented Jul 13, 2022:

> too many abstraction layers, my head 🤯

Agreed on that 😆

From the other thread:

> Might be good to just hard-code the lookback into the error rate panel alert definition for now

+1, this is the gist of the other thread as well - and in that spirit, it's why I suggest it be placed on the alert type itself rather than the panel type. I would favour explicitness, i.e. "for this alert I am doing something very different"; at the end of the day, that is what we are overriding here.

(on that note, maybe overrideQuery is the better name internally, with monitoring.AlertQuery being the constructor)
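
A rough sketch of that shape, with overrideQuery kept internal and monitoring.AlertQuery as the constructor. The types here are trimmed-down stand-ins for the real ones, so treat the details as hypothetical:

```go
package monitoring

// ObservableAlertDefinition is a trimmed-down stand-in for the real alert definition type.
type ObservableAlertDefinition struct {
	// overrideQuery, when set, is used as the basis of the alert rule instead of the
	// panel query. It signals "for this alert I am doing something very different".
	overrideQuery string

	greaterOrEqual *float64
}

// AlertQuery declares an alert whose query is supplied explicitly rather than derived
// from the panel query, e.g. a last_over_time variant of the panel expression.
func AlertQuery(query string) *ObservableAlertDefinition {
	return &ObservableAlertDefinition{overrideQuery: query}
}

// GreaterOrEqual sets the threshold at or above which the alert fires.
func (a *ObservableAlertDefinition) GreaterOrEqual(threshold float64) *ObservableAlertDefinition {
	a.greaterOrEqual = &threshold
	return a
}
```

Usage would then look something like monitoring.AlertQuery(rawQuery).GreaterOrEqual(100).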

Strum355 (Member, Author) commented:

Updated the PR to (hopefully) address this. Spent too much time trying various approaches so I'm just gonna go with this :laff:

Strum355 force-pushed the nsc-ef/executor-errors-over-time branch from 48bfc1f to 4358481 on July 18, 2022 16:53
Strum355 (Member, Author) commented:

Added alert queries to the documentation too
(Screenshot: generated alert documentation now listing the custom alert query.)

bobheadxi (Member) left a review comment:

Some minor details about naming and the generated output, but otherwise LGTM :)

monitoring/monitoring/monitoring.go (review thread resolved, outdated)
monitoring/monitoring/documentation.go (review thread resolved, outdated)
Strum355 and others added 2 commits July 19, 2022 11:50
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Strum355 enabled auto-merge (squash) on July 19, 2022 10:50
Strum355 merged commit 08cc75d into main on Jul 19, 2022
Strum355 deleted the nsc-ef/executor-errors-over-time branch on July 19, 2022 15:10
vovakulikov pushed a commit that referenced this pull request Jul 20, 2022
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Labels
cla-signed, monitoring, team/graph Graph Team (previously Code Intel/Language Tools/Language Platform)

Successfully merging this pull request may close these issues:
Create alert that fires when auto-indexing on Cloud has stopped working