Merge pull request #5 from sapcc/playbook
Playbook
talal committed Aug 18, 2020
2 parents 782a795 + ca7abac commit f6343da
Showing 18 changed files with 354 additions and 229 deletions.
8 changes: 5 additions & 3 deletions .github/workflows/build-test.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
test:
8 changes: 5 additions & 3 deletions .github/workflows/golangci-lint.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
lint:
8 changes: 5 additions & 3 deletions .github/workflows/license.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
check:
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Operator can be disabled for a specific alert rule.
- `playbook` label to absent metric alerts.
- `keep-labels` flag for specifying which labels to carry over from alert
rules.

## [0.1.0] - 2020-08-13

114 changes: 41 additions & 73 deletions README.md
@@ -8,18 +8,24 @@

> Project status: **alpha**. The API and user facing objects may change.
In this document:

- [Overview](#overview)
- [Motivation](#motivation)
- [Installation](#installation)
- [Pre\-compiled binaries and Docker images](#pre-compiled-binaries-and-docker-images)
- [Building from source](#building-from-source)
- [Usage](#usage)
- [Disable for specific alerts](#disable-for-specific-alerts)
- [Caveat](#caveat)
- [Metrics](#metrics)
- [Absent metric alert definition](#absent-metric-alert-definition)
- [Template](#template)
- [Labels](#labels)
- [Default tier and service](#default-tier-and-service)
- [Defaults](#defaults)
- [Carry over from original alert rule](#carry-over-from-original-alert-rule)
- [Tier and service](#tier-and-service)

In other documents:

- [Operator's Playbook](./doc/playbook.md)

## Overview

The absent metrics operator is a companion operator for the [Prometheus
Operator](https://github.com/prometheus-operator/prometheus-operator).
@@ -76,77 +82,24 @@ labels:
severity: info
annotations:
summary: missing foo_bar
description: The metric 'foo_bar' is missing. Alerts using it may not fire as intended.
```

## Installation

### Pre-compiled binaries and Docker images

See the latest [release](https://github.com/sapcc/absent-metrics-operator/releases/latest).

### Building from source

The only required build dependency is [Go](https://golang.org/).

```
$ git clone https://github.com/sapcc/absent-metrics-operator.git
$ cd absent-metrics-operator
$ make install
```

This will put the binary in `/usr/bin/`.

Alternatively, you can also build directly with the `go get` command:

```
$ go get -u github.com/sapcc/absent-metrics-operator
description: The metric 'foo_bar' is missing. 'ImportantAlert' alert using it may not fire as intended.
```

This will put the binary in `$GOPATH/bin/`.

## Usage

```
$ absent-metrics-operator --kubeconfig="$KUBECONFIG"
```
We provide pre-compiled binaries and container images. See the latest
[release](https://github.com/sapcc/absent-metrics-operator/releases/latest).

`kubeconfig` flag is only required if running outside a cluster.
Alternatively, you can build with `make`, install with `make install`, `go get`, or
`docker build`.

For detailed usage instructions:
For usage instructions:

```
$ absent-metrics-operator --help
```

### Disable for specific alerts

You can disable the operator for a specific `PrometheusRule` resource by adding
the following label to it:

```yaml
absent-metrics-operator/disable: true
```

If you want to disable the operator for only a specific alert rule instead of
all the alerts in a `PrometheusRule`, you can use the same label at the
rule-level:

```yaml
alert: ImportantAlert
expr: foo_bar > 0
for: 5m
labels:
absent-metrics-operator/disable: true
...
```

#### Caveat

If you disable the operator for a specific alert or a specific
`PrometheusRule`, however there are other alerts or `PrometheusRules` which
have alert definitions that use the same metric(s) then the absent metric
alerts for those metric(s) will be created regardless.
You can disable the operator for a specific `PrometheusRule` or a specific alert definition. Refer to the [operator's playbook](./doc/playbook.md) for more info.

### Metrics

@@ -186,7 +139,7 @@ labels:
severity: info
annotations:
summary: missing $metric
description: The metric '$metric' is missing. Alerts using it may not fire as intended.
description: The metric '$metric' is missing. '$alert-name' alert using it may not fire as intended.
```

Consider the metric `limes_successful_scrapes:rate5m` with tier `os` and
@@ -196,16 +149,31 @@ Then the alert name would be `AbsentOsLimesSuccessfulScrapesRate5m`.
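
As a rough illustration of the naming scheme, here is a minimal Go sketch. `absentAlertName` is a hypothetical helper, not the operator's actual function. It assumes the metric name is split on `_` and then on `:`, and that each resulting word is capitalized before joining. The input mirrors the example given in the code comment in `internal/controller/alert_rule.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// absentAlertName is a hypothetical helper that builds an alert name of the
// form Absent<Tier><Service><MetricWords...>. Empty words (for example an
// empty tier) are skipped. Capitalization assumes ASCII metric names.
func absentAlertName(tier, service, metric string) string {
	words := []string{"absent", tier, service}
	// Split on "_" first, then split every piece on ":".
	for _, part := range strings.Split(metric, "_") {
		words = append(words, strings.Split(part, ":")...)
	}

	var b strings.Builder
	for _, w := range words {
		if w == "" {
			continue
		}
		b.WriteString(strings.ToUpper(w[:1]) + w[1:])
	}
	return b.String()
}

func main() {
	fmt.Println(absentAlertName("tier", "service", "network:tis_a_metric:rate5m"))
}
```

If `tier` or `service` is empty, for instance because it is not listed in `keep-labels`, this sketch simply leaves it out of the name.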

### Labels

- `tier` and `service` labels are carried over from the original alert rule
unless those labels use templating (i.e. use `$labels`), in which case the
default `tier` and `service` values for that Prometheus server in that
namespace will be used.
- `severity` is always `info`.
#### Defaults

The following labels are always present on every absent metric alert rule:

#### Default tier and service
- `severity` is always `info`.
- `playbook` provides a [link](./doc/playbook.md) to documentation that can be
referenced on how to deal with an absent metric alert.

#### Carry over from original alert rule

You can specify which labels to carry over from the original alert rule by
specifying a comma-separated list of labels to the `--keep-labels` flag. The
default value for this flag is `service,tier`.
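
The carry-over behavior can be sketched roughly as follows. This is an illustrative sketch, not the operator's code: `parseKeepLabels` and `carryOver` are hypothetical names, and the sketch assumes the flag value is turned into a set of label names that is then consulted for each original alert rule. The fallback for `tier` and `service` follows the logic shown in the `alert_rule.go` diff.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeepLabels turns a comma-separated flag value such as "service,tier"
// into a set of label names. Surrounding whitespace and empty entries are
// ignored (an assumption made for this sketch).
func parseKeepLabels(s string) map[string]bool {
	keep := make(map[string]bool)
	for _, l := range strings.Split(s, ",") {
		if l = strings.TrimSpace(l); l != "" {
			keep[l] = true
		}
	}
	return keep
}

// carryOver builds the labels for an absent metric alert. Only labels listed
// in keep are copied from orig. For tier and service, an empty or templated
// value is replaced by the given default.
func carryOver(keep map[string]bool, orig map[string]string, defaultTier, defaultService string) map[string]string {
	lab := map[string]string{"severity": "info"}
	for k := range keep {
		v := orig[k]
		emptyOrTemplated := v == "" || strings.Contains(v, "$labels")
		if k == "tier" && emptyOrTemplated {
			v = defaultTier
		}
		if k == "service" && emptyOrTemplated {
			v = defaultService
		}
		if v != "" {
			lab[k] = v
		}
	}
	return lab
}

func main() {
	keep := parseKeepLabels("service,tier")
	orig := map[string]string{
		"tier":    "os",
		"service": "{{ $labels.service }}", // templated, so the default is used
		"team":    "storage",               // not in keep, so it is dropped
	}
	fmt.Println(carryOver(keep, orig, "defaultTier", "defaultService"))
}
```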

#### Tier and service

`tier` and `service` labels are a special case: they are carried over from the
original alert rule unless they use templating (i.e. use `$labels`), in
which case the default `tier` and `service` values are used.

The operator determines a default `tier` and `service` for a specific
Prometheus server in a namespace by traversing through all the alert rule
definitions for that Prometheus server in that namespace. It chooses the most
common `tier` and `service` label combination that is used across those alerts
as the default values.

The values of these labels are also used (if enabled with `keep-labels`) in
the name of the absent metric alert. See [template](#template).
35 changes: 35 additions & 0 deletions doc/playbook.md
@@ -0,0 +1,35 @@
# Operator's Playbook

This document assumes that you have already read and understood the [general
README](../README.md). If not, start reading there.

### Disable for specific alerts

You can disable the operator for a specific `PrometheusRule` resource by adding
the following label to it:

```yaml
absent-metrics-operator/disable: "true"
```

If you want to disable the operator for only a specific alert rule instead of
all the alerts in a `PrometheusRule`, you can add the `no_alert_on_absence`
label to the alert rule. For example:

```yaml
alert: ImportantAlert
expr: foo_bar > 0
for: 5m
labels:
no_alert_on_absence: "true"
...
```

**Note**: make sure that you use `"true"` and not `true`.
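
The quoting matters because label values are strings, whereas an unquoted `true` is read by YAML as a boolean. The sketch below shows how such a string label could be evaluated. It is only an assumption about the general approach: `isDisabled` and `parseBoolLabel` are hypothetical helpers, and the operator's own `mustParseBool` may behave differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseBoolLabel interprets a label value as a boolean. Anything that
// strconv.ParseBool cannot parse (including an empty value) counts as false.
func parseBoolLabel(v string) bool {
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

// isDisabled reports whether an alert rule opted out via the
// no_alert_on_absence label. Reading from a nil map is safe in Go.
func isDisabled(labels map[string]string) bool {
	return parseBoolLabel(labels["no_alert_on_absence"])
}

func main() {
	fmt.Println(isDisabled(map[string]string{"no_alert_on_absence": "true"}))
	fmt.Println(isDisabled(map[string]string{"severity": "critical"}))
}
```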

#### Caveat

If you disable the operator for a specific alert or a specific
`PrometheusRule`, but other alerts or `PrometheusRule` resources have alert
definitions that use the same metric(s), then the absent metric alerts for
those metric(s) will still be created.
43 changes: 27 additions & 16 deletions internal/controller/alert_rule.go
@@ -66,7 +66,10 @@ func (mex *metricNameExtractor) Visit(node parser.Node, path []parser.Node) (par
//
// The rule group names for the absent metric alerts have the format:
// promRuleName/originalGroupName.
func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []monitoringv1.RuleGroup) ([]monitoringv1.RuleGroup, error) {
func (c *Controller) parseRuleGroups(
promRuleName, defaultTier, defaultService string,
in []monitoringv1.RuleGroup) ([]monitoringv1.RuleGroup, error) {

out := make([]monitoringv1.RuleGroup, 0, len(in))
for _, g := range in {
var absentRules []monitoringv1.Rule
@@ -75,12 +78,12 @@ func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []moni
if r.Record != "" {
continue
}
// Do not parse alert rule if it has disable label.
if r.Labels != nil && mustParseBool(r.Labels[labelDisable]) {
// Do not parse alert rule if it has the no alert on absence label.
if r.Labels != nil && mustParseBool(r.Labels[labelNoAlertOnAbsence]) {
continue
}

rules, err := ParseAlertRule(defaultTier, defaultService, r)
rules, err := c.ParseAlertRule(defaultTier, defaultService, r)
if err != nil {
return nil, err
}
@@ -103,7 +106,7 @@ func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []moni
// Since an original alert expression can reference multiple time series,
// a slice of monitoringv1.Rule is returned: one absent metric alert rule
// per time series.
func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
func (c *Controller) ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
exprStr := in.Expr.String()
mex := &metricNameExtractor{expr: exprStr, found: map[string]struct{}{}}
exprNode, err := parser.ParseExpr(exprStr)
@@ -119,20 +122,28 @@ func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.
return nil, nil
}

// Default labels
lab := map[string]string{
"severity": "info",
"playbook": "https://github.com/sapcc/absent-metrics-operator/blob/master/doc/playbook.md",
}

// Carry over labels from the original alert
if origLab := in.Labels; origLab != nil {
if v, ok := origLab["tier"]; ok && !strings.Contains(v, "$labels") {
tier = v
}
if v, ok := origLab["service"]; ok && !strings.Contains(v, "$labels") {
service = v
for k := range c.keepLabel {
v := origLab[k]
emptyOrTmplVal := v == "" || strings.Contains(v, "$labels")
if k == labelTier && emptyOrTmplVal {
v = tier
}
if k == labelService && emptyOrTmplVal {
v = service
}
if v != "" {
lab[k] = v
}
}
}
lab := map[string]string{
"tier": tier,
"service": service,
"severity": "info",
}

// Sort metric names alphabetically for consistent test results.
metrics := make([]string, 0, len(mex.found))
@@ -145,7 +156,7 @@ func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.
for _, m := range metrics {
// Generate an alert name from metric name:
// network:tis_a_metric:rate5m -> AbsentTierServiceNetworkTisAMetricRate5m
words := []string{"absent", tier, service}
words := []string{"absent", lab[labelTier], lab[labelService]}
sL1 := strings.Split(m, "_")
for _, v := range sL1 {
sL2 := strings.Split(v, ":")