codeintel: alert when all executor jobs are failing #38767

Merged: 5 commits into main from nsc-ef/executor-errors-over-time on Jul 19, 2022

Conversation

Strum355 (Member) commented:

Creates an alert for the executors' error rate that fires when the error rate is 100%, indicating some global misconfiguration (as happened before with src-cli-related issues).

The alert is a bit special in that it uses a different query than the panel, one based on the last_over_time aggregation. We do this because we don't want the alert to mark itself as resolved if there happens to be a period within the defined window where there are no auto-indexing jobs (i.e. when the error rate is "technically" < 100%).

The screenshot below illustrates how the alert query maintains the last value over a predefined window, so that if no executor jobs are being processed but the error rate was 100% beforehand, we continue alerting: the absence of running jobs does not imply the issue is resolved.

(Screenshot: the alert query holding its last value over the lookback window while no jobs are running.)
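
For illustration, a minimal sketch of the pattern described above, written the way it might appear in the Go monitoring generator. The metric names and the 24h window are hypothetical placeholders, not the exact queries from this PR:

```go
package main

import "fmt"

func main() {
	// Panel query (hypothetical): instantaneous error percentage over the last 5 minutes.
	panelQuery := `sum(increase(executor_errors_total[5m])) / sum(increase(executor_attempts_total[5m])) * 100`

	// Alert query: wrap the same expression in last_over_time over a lookback window, so the
	// alert keeps its last value (and keeps firing) even while no new jobs are being processed.
	alertQuery := fmt.Sprintf(`last_over_time((%s)[24h:1m])`, panelQuery)

	fmt.Println(alertQuery)
}
```

The key point is that the panel and the alert share the underlying expression, but only the alert holds its last value across idle periods.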

Closes #30494

Test plan

Only modifies dashboards/alerts, n/a

Strum355 added the team/graph Graph Team (previously Code Intel/Language Tools/Language Platform), monitoring, and team/language-platform-and-navigation labels on Jul 13, 2022
Strum355 self-assigned this on Jul 13, 2022
cla-bot added the cla-signed label on Jul 13, 2022

sourcegraph-bot (Contributor) commented Jul 13, 2022:

Codenotify: Notifying subscribers in CODENOTIFY files for diff 3accc47...192edb5.

@bobheadxi:
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

@slimsag:
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

@sourcegraph/delivery:
- doc/admin/observability/alerts.md
- doc/admin/observability/dashboards.md
- monitoring/definitions/shared/codeintel.go
- monitoring/definitions/shared/shared.go
- monitoring/definitions/shared/standard.go
- monitoring/definitions/shared/workerutil.go
- monitoring/monitoring/documentation.go
- monitoring/monitoring/monitoring.go

Comment on lines 916 to 918
// when using <aggregation>_over_time queries to augment the alert query, lookbackWindow
// determines the lookback range for the subquery. Location in AlertQuery must be specified
// with %%[1]s.

Contributor commented:

This seems a bit brittle, but I see why it's done (you want to supply the lookback window at a different level of abstraction than the alert query). Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value (or at least using a default value here)?
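
For context, a rough sketch of the substitution mechanism this comment refers to. The metric name and window value are hypothetical, and in the real source the placeholder is written as %%[1]s, presumably because the query string passes through an earlier formatting step:

```go
package main

import "fmt"

func main() {
	// Hypothetical alert query template: %[1]s marks where the lookback window goes.
	alertQueryTemplate := `last_over_time(src_executor_error_rate[%[1]s])`

	lookbackWindow := "24h"
	alertQuery := fmt.Sprintf(alertQueryTemplate, lookbackWindow)

	fmt.Println(alertQuery) // last_over_time(src_executor_error_rate[24h])
}
```

The brittleness concern is that nothing forces the template and the lookback value to be supplied together.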

bobheadxi (Member) commented Jul 13, 2022:

Agreed that this sounds super fragile, but I think it would be best to not do this for now - i.e. if you supply a raw query, that is simply used with no formatting

Member commented:

> Is it possible to make it type-impossible to construct an alert that expects a lookback query without supplying a value

This would be an alternative as well, e.g. with a new observableLookbackAlertDefinition that you can only get via WithLookback - but that still has the issue of requiring a perfectly crafted raw query
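
A loose sketch of what that alternative could look like, assuming a hypothetical observableLookbackAlertDefinition that can only be obtained through WithLookback. The names follow the suggestion above, but none of this is the actual monitoring package API:

```go
package monitoring

import "fmt"

// observableLookbackAlertDefinition can only be constructed via WithLookback, so a
// lookback-based alert can never be declared without a window.
type observableLookbackAlertDefinition struct {
	query string
}

// WithLookback substitutes the lookback window into the raw query template, which is
// expected to contain exactly one %s placeholder (e.g. "last_over_time(expr[%s])").
func WithLookback(rawQueryTemplate, lookbackWindow string) observableLookbackAlertDefinition {
	return observableLookbackAlertDefinition{
		query: fmt.Sprintf(rawQueryTemplate, lookbackWindow),
	}
}

// Query returns the fully-formed alert query.
func (d observableLookbackAlertDefinition) Query() string { return d.query }
```

As noted, this still leans on the caller hand-crafting a correct raw query; the type only guarantees that a window is supplied.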

Strum355 (Member, Author) commented:

@bobheadxi Hmm, I have a sense of where you're going with this, but I'm having a hard time figuring out how to arrange things in a way that makes sense from that perspective

Contributor commented:

@Strum355 Might be good to just hard-code the lookback into the error rate panel alert definition for now. I think the other comment thread is a higher impact thing to solve up-front (and we can always re-parameterize the lookback later). I think this will be a bit specific to error rate alerts, though.

efritz (Contributor) commented Jul 13, 2022:

@bobheadxi Noah and I prototyped this in a hack session earlier this week and can confirm that the approach of using last_over_time seems ideal for this use case. I'd like to leave the merge approval to you as it touches some of the monitoring generator code, and I'm sure you'll have ideas on how to clean this up (the biggest change is the addition of an AlertQuery so that we can separate the alert query and the display query for the error rate panels).

Comment on lines 709 to 712
// AlertQuery is Prometheus query (without aggregate and threshold) that should be used as the
// alert query instead of the query used for the panel visualization. Useful if you need to perform
// additional
AlertQuery string
Member commented:

We should move this to ObservableAlertDefinition with rawQuery, then the API would be:

monitoring.RawAlert(query).For(time.Hour).GreaterOrEqual(100),

Strum355 (Member, Author) commented:

I'm having a bit of difficulty seeing how this is the right place for this (too many abstraction layers, my head 🤯). I'd like to have this defined in the same place as the panel query, since this is where the panel query is built:
https://github.com/sourcegraph/sourcegraph/pull/38767/files#diff-38dd11ae9cda5810bba1baabf511ec68b1ce7f90622093c797dc3e2fc676c889R138

Contributor commented:

@bobheadxi In this particular example the alert query will be the same for anything using this panel - the only thing that changes is the threshold.

Strum355 (Member, Author) commented:

I do like the idea of having the alert defined at the point above as Rob illustrated, but not in the sense of supplying the query at that high a level, since the panel query is defined lower down.

Contributor commented:

Yeah, I agree with @Strum355's take here, but I'm not 100% clear on how to define it this way.

bobheadxi (Member) commented Jul 13, 2022:

> too many abstraction layers, my head 🤯

Agreed on that 😆

From the other thread:

> Might be good to just hard-code the lookback into the error rate panel alert definition for now

+1, this is the gist of the other thread as well - and in that spirit, it's why I suggest it be placed on the alert type itself rather than the panel type. I would favour explicitness, i.e. "for this alert I am doing something very different"; at the end of the day, that is what we are overriding here.

(on that note, maybe overrideQuery is the better name internally, with monitoring.AlertQuery being the constructor)
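
A rough sketch of that shape, with overrideQuery kept internal and monitoring.AlertQuery as the constructor. The types here are trimmed-down stand-ins for the real ones, so treat the details as hypothetical:

```go
package monitoring

// ObservableAlertDefinition is a trimmed-down stand-in for the real alert definition type.
type ObservableAlertDefinition struct {
	// overrideQuery, when set, is used as the basis of the alert rule instead of the
	// panel query. It signals "for this alert I am doing something very different".
	overrideQuery string

	greaterOrEqual *float64
}

// AlertQuery declares an alert whose query is supplied explicitly rather than derived
// from the panel query, e.g. a last_over_time variant of the panel expression.
func AlertQuery(query string) *ObservableAlertDefinition {
	return &ObservableAlertDefinition{overrideQuery: query}
}

// GreaterOrEqual sets the threshold at or above which the alert fires.
func (a *ObservableAlertDefinition) GreaterOrEqual(threshold float64) *ObservableAlertDefinition {
	a.greaterOrEqual = &threshold
	return a
}
```

Usage would then look something like monitoring.AlertQuery(rawQuery).GreaterOrEqual(100).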

Strum355 (Member, Author) commented:

Updated the PR to (hopefully) address this. Spent too much time trying various approaches so I'm just gonna go with this :laff:

Strum355 force-pushed the nsc-ef/executor-errors-over-time branch from 48bfc1f to 4358481 on July 18, 2022 16:53
Strum355 (Member, Author) commented:

Added alert queries to the documentation too
(Screenshot: generated alert documentation now listing the custom alert query.)

bobheadxi (Member) left a review comment:

Some minor details about naming and the generated output, but otherwise LGTM :)

monitoring/monitoring/monitoring.go (review thread resolved, outdated)
monitoring/monitoring/documentation.go (review thread resolved, outdated)
Strum355 and others added 2 commits July 19, 2022 11:50
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Strum355 enabled auto-merge (squash) on July 19, 2022 10:50
Strum355 merged commit 08cc75d into main on Jul 19, 2022
Strum355 deleted the nsc-ef/executor-errors-over-time branch on July 19, 2022 15:10
vovakulikov pushed a commit that referenced this pull request Jul 20, 2022
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Labels
cla-signed, monitoring, team/graph Graph Team (previously Code Intel/Language Tools/Language Platform)

Successfully merging this pull request may close these issues:
Create alert that fires when auto-indexing on Cloud has stopped working