monitoring: alert owners (#12358)
Adds a new required Observable.Owner field, with predefined owners based on the 2021 engineering org, as proposed in #12010 for RFC-189. This field does not currently do anything.

Many of these teams don't exist yet, but each one that doesn't appears to be clearly covered by an existing team on the team page (for example, backend infrastructure => current Cloud team).

Owners were assigned based on guesses for this first pass.
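
The required-field behaviour described above can be sketched as follows. This is a minimal, self-contained illustration rather than the code in this commit: the `Observable` struct is trimmed to two fields, only two of the owner constants are reproduced, and the observable names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ObservableOwner names the team that owns an Observable.
type ObservableOwner string

const (
	ObservableOwnerSearch    ObservableOwner = "search"
	ObservableOwnerCodeIntel ObservableOwner = "code-intel"
)

// Observable is a trimmed-down stand-in for the real struct; only the
// fields needed for this illustration are included (assumption).
type Observable struct {
	Name  string
	Owner ObservableOwner
}

// validate rejects an Observable whose Owner is the empty string.
func (o Observable) validate() error {
	if o.Owner == "" {
		return errors.New("Observable.Owner must be defined")
	}
	return nil
}

func main() {
	// Both names below are hypothetical examples.
	owned := Observable{Name: "example_owned", Owner: ObservableOwnerSearch}
	unowned := Observable{Name: "example_unowned"}

	fmt.Println(owned.validate())   // prints "<nil>"
	fmt.Println(unowned.validate()) // prints "Observable.Owner must be defined"
}
```

Because `ObservableOwner` is a string type, the zero value of the field is the empty string, so any existing Observable that omits `Owner` fails validation.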

Co-authored-by: ᴜɴᴋɴᴡᴏɴ <joe@sourcegraph.com>
bobheadxi and unknwon committed Jul 24, 2020
1 parent 6ee7fef commit dd2b7b1
Showing 12 changed files with 116 additions and 0 deletions.
31 changes: 31 additions & 0 deletions monitoring/frontend.go
@@ -18,6 +18,7 @@ func Frontend() *Container {
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("duration").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Get details on the exact queries that are slow** by configuring '"observability.logSlowSearches": 20,' in the site configuration and looking for 'frontend' warning logs prefixed with 'slow search request' for additional details.
- **Check that most repositories are indexed** by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results.)
@@ -33,6 +34,7 @@ func Frontend() *Container {
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
Warning: Alert{GreaterOrEqual: 15},
PanelOptions: PanelOptions().LegendFormat("duration").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Get details on the exact queries that are slow** by configuring '"observability.logSlowSearches": 15,' in the site configuration and looking for 'frontend' warning logs prefixed with 'slow search request' for additional details.
- **Check that most repositories are indexed** by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results.)
@@ -50,6 +52,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard timeout"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -60,6 +63,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard error"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -69,6 +73,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("partial timeout"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -78,6 +83,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 50},
PanelOptions: PanelOptions().LegendFormat("{{alert_type}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- This indicates your user's are making syntax errors or similar user errors.
`,
@@ -93,6 +99,7 @@ func Frontend() *Container {
{
Name: "99th_percentile_search_codeintel_request_duration",
Description: "99th percentile code-intel successful search request duration over 5m",
Owner: ObservableOwnerCodeIntel,
Query: `histogram_quantile(0.99, sum by (le)(rate(src_graphql_field_seconds_bucket{type="Search",field="results",error="false",source="browser",request_name="CodeIntelSearch"}[5m])))`,
DataMayNotExist: true,
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
@@ -113,6 +120,7 @@ func Frontend() *Container {
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
Warning: Alert{GreaterOrEqual: 15},
PanelOptions: PanelOptions().LegendFormat("duration").Unit(Seconds),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: `
- **Get details on the exact queries that are slow** by configuring '"observability.logSlowSearches": 15,' in the site configuration and looking for 'frontend' warning logs prefixed with 'slow search request' for additional details.
- **Check that most repositories are indexed** by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results.)
@@ -130,6 +138,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard timeout"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -140,6 +149,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard error"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -149,6 +159,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("partial timeout"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -158,6 +169,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 50},
PanelOptions: PanelOptions().LegendFormat("{{alert_type}}"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: `
- This indicates a bug in Sourcegraph, please [open an issue](https://github.com/sourcegraph/sourcegraph/issues/new/choose).
`,
@@ -178,6 +190,7 @@ func Frontend() *Container {
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
Warning: Alert{GreaterOrEqual: 50},
PanelOptions: PanelOptions().LegendFormat("duration").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Get details on the exact queries that are slow** by configuring '"observability.logSlowSearches": 20,' in the site configuration and looking for 'frontend' warning logs prefixed with 'slow search request' for additional details.
- **If your users are requesting many results** with a large 'count:' parameter, consider using our [search pagination API](../../api/graphql/search.md).
@@ -194,6 +207,7 @@ func Frontend() *Container {
DataMayBeNaN: true, // See https://github.com/sourcegraph/sourcegraph/issues/9834
Warning: Alert{GreaterOrEqual: 40},
PanelOptions: PanelOptions().LegendFormat("duration").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Get details on the exact queries that are slow** by configuring '"observability.logSlowSearches": 15,' in the site configuration and looking for 'frontend' warning logs prefixed with 'slow search request' for additional details.
- **If your users are requesting many results** with a large 'count:' parameter, consider using our [search pagination API](../../api/graphql/search.md).
@@ -212,6 +226,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard timeout"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -222,6 +237,7 @@ func Frontend() *Container {
Warning: Alert{GreaterOrEqual: 5},
Critical: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("hard error"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -231,6 +247,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("partial timeout"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -240,6 +257,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 50},
PanelOptions: PanelOptions().LegendFormat("{{alert_type}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- This indicates your user's search API requests have syntax errors or a similar user error. Check the responses the API sends back for an explanation.
`,
@@ -261,6 +279,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("api operation").Unit(Seconds),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -270,6 +289,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("api operation"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
},
@@ -283,6 +303,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("store operation").Unit(Seconds),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -292,6 +313,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("store operation"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
},
@@ -309,6 +331,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("{{code}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- Check the Zoekt Web Server dashboard for indications it might be unhealthy.
`,
@@ -320,6 +343,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("{{code}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- Check the Searcher dashboard for indications it might be unhealthy.
`,
@@ -331,6 +355,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 25},
PanelOptions: PanelOptions().LegendFormat("{{category}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- May not be a substantial issue, check the 'frontend' logs for potential causes.
`,
@@ -345,6 +370,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("{{category}}").Unit(Seconds),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -355,6 +381,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 300},
PanelOptions: PanelOptions().LegendFormat("{{category}}").Unit(Seconds),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
{
@@ -364,6 +391,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("{{category}}"),
Owner: ObservableOwnerCodeIntel,
PossibleSolutions: "none",
},
},
@@ -376,6 +404,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 20},
PanelOptions: PanelOptions().LegendFormat("{{category}}").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -385,6 +414,7 @@ func Frontend() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 5},
PanelOptions: PanelOptions().LegendFormat("{{category}}"),
Owner: ObservableOwnerSearch,
PossibleSolutions: "none",
},
{
@@ -395,6 +425,7 @@ func Frontend() *Container {
DataMayBeNaN: true,
Warning: Alert{GreaterOrEqual: 0.10},
PanelOptions: PanelOptions().LegendFormat("{{method}}").Max(0.10).Unit(Seconds),
Owner: ObservableOwnerDistribution,
PossibleSolutions: "none",
},
},
26 changes: 26 additions & 0 deletions monitoring/generator.go
@@ -105,6 +105,25 @@ func (r Row) validate() error {
return nil
}

// ObservableOwner denotes a team that owns an Observable. The current teams are described in
// the handbook: https://about.sourcegraph.com/handbook/engineering/2021_org
type ObservableOwner string

const (
// Core products teams
ObservableOwnerSearch ObservableOwner = "search"
ObservableOwnerCampaigns ObservableOwner = "campaigns"
ObservableOwnerCodeIntel ObservableOwner = "code-intel"
ObservableOwnerExtensibility ObservableOwner = "extensibility"
ObservableOwnerCodeHostIntegrations ObservableOwner = "code-host-integrations"

// Core services teams
ObservableOwnerBackendInfrastructure ObservableOwner = "backend-infrastructure"
ObservableOwnerDistribution ObservableOwner = "distribution"
ObservableOwnerSecurity ObservableOwner = "security"
ObservableOwnerWebInfrastructure ObservableOwner = "web-infrastructure"
)

// Observable describes a metric about a container that can be observed. For example, memory usage.
type Observable struct {
// Name is a short and human-readable lower_snake_case name describing what is being observed.
@@ -139,6 +158,9 @@ type Observable struct {
//
Description string

// Owner indicates the team that owns any alerts associated with this Observable.
Owner ObservableOwner

// Query is the actual Prometheus query that should be observed.
Query string

@@ -221,6 +243,9 @@ func (o Observable) validate() error {
return fmt.Errorf("PossibleSolutions: %v", err)
}
}
if o.Owner == "" {
return errors.New("Observable.Owner must be defined")
}
return nil
}

@@ -655,6 +680,7 @@ func (c *Container) promAlertsFile() *promRulesFile {
"level": level,
"service_name": c.Name,
"description": description,
"owner": string(o.Owner),
}
}
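
The hunk above adds an `owner` entry to the label map built for each generated alert rule. A rough sketch of that idea in isolation, assuming a hypothetical `alertLabels` helper (the real code builds the map inline inside `promAlertsFile`) and example argument values:

```go
package main

import "fmt"

// ObservableOwner names the team that owns an Observable.
type ObservableOwner string

const ObservableOwnerDistribution ObservableOwner = "distribution"

// alertLabels is a hypothetical helper, not part of this commit. It builds
// the label set for one alert rule, converting the typed owner to a string.
func alertLabels(level, serviceName, description string, owner ObservableOwner) map[string]string {
	return map[string]string{
		"level":        level,
		"service_name": serviceName,
		"description":  description,
		"owner":        string(owner),
	}
}

func main() {
	// The argument values here are illustrative only.
	labels := alertLabels("warning", "frontend", "example description", ObservableOwnerDistribution)
	fmt.Println(labels["owner"]) // prints "distribution"
}
```

The explicit `string(o.Owner)` conversion is needed because the map's value type is `string`, and Go does not implicitly convert a named string type.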

5 changes: 5 additions & 0 deletions monitoring/git_server.go
@@ -18,6 +18,7 @@ func GitServer() *Container {
Warning: Alert{LessOrEqual: 25},
Critical: Alert{LessOrEqual: 15},
PanelOptions: PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
`,
@@ -30,6 +31,7 @@ func GitServer() *Container {
Warning: Alert{GreaterOrEqual: 50},
Critical: Alert{GreaterOrEqual: 100},
PanelOptions: PanelOptions().LegendFormat("running commands"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Check if the problem may be an intermittent and temporary peak** using the "Container monitoring" section at the bottom of the Git Server dashboard.
- **Single container deployments:** Consider upgrading to a [Docker Compose deployment](../install/docker-compose/migrate.md) which offers better scalability and resource isolation.
@@ -44,6 +46,7 @@ func GitServer() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 25},
PanelOptions: PanelOptions().LegendFormat("queue size"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **If you just added several repositories**, the warning may be expected.
- **Check which repositories need cloning**, by visiting e.g. https://sourcegraph.example.com/site-admin/repositories?filter=not-cloned
@@ -56,6 +59,7 @@ func GitServer() *Container {
DataMayNotExist: true,
Warning: Alert{GreaterOrEqual: 25},
PanelOptions: PanelOptions().LegendFormat("queue size"),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Check the code host status indicator for errors:** on the Sourcegraph app homepage, when signed in as an admin click the cloud icon in the top right corner of the page.
- **Check if the issue continues to happen after 30 minutes**, it may be temporary.
@@ -71,6 +75,7 @@ func GitServer() *Container {
Warning: Alert{GreaterOrEqual: 1.0},
Critical: Alert{GreaterOrEqual: 2.0},
PanelOptions: PanelOptions().LegendFormat("running commands").Unit(Seconds),
Owner: ObservableOwnerSearch,
PossibleSolutions: `
- **Check if the problem may be an intermittent and temporary peak** using the "Container monitoring" section at the bottom of the Git Server dashboard.
- **Single container deployments:** Consider upgrading to a [Docker Compose deployment](../install/docker-compose/migrate.md) which offers better scalability and resource isolation.
