Playbook #5

Merged
merged 4 commits into from
Aug 18, 2020
Changes from 3 commits
8 changes: 5 additions & 3 deletions .github/workflows/build-test.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
test:
8 changes: 5 additions & 3 deletions .github/workflows/golangci-lint.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
lint:
8 changes: 5 additions & 3 deletions .github/workflows/license.yml
@@ -5,12 +5,14 @@ on:
branches:
- master
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"
pull_request:
branches:
- '*'
- "*"
paths-ignore:
- '**.md'
- "**.md"
- "doc/**"

jobs:
check:
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Operator can be disabled for a specific alert rule.
- `playbook` label to absent metric alerts.
- `keep-labels` flag for specifying which labels to carry over from alert
rules.

## [0.1.0] - 2020-08-13

114 changes: 41 additions & 73 deletions README.md
@@ -8,18 +8,24 @@

> Project status: **alpha**. The API and user facing objects may change.

In this document:

- [Overview](#overview)
- [Motivation](#motivation)
- [Installation](#installation)
- [Pre\-compiled binaries and Docker images](#pre-compiled-binaries-and-docker-images)
- [Building from source](#building-from-source)
- [Usage](#usage)
- [Disable for specific alerts](#disable-for-specific-alerts)
- [Caveat](#caveat)
- [Metrics](#metrics)
- [Absent metric alert definition](#absent-metric-alert-definition)
- [Template](#template)
- [Labels](#labels)
- [Default tier and service](#default-tier-and-service)
- [Defaults](#defaults)
- [Carry over from original alert rule](#carry-over-from-original-alert-rule)
- [Tier and service](#tier-and-service)

In other documents:

- [Operator's Playbook](./doc/playbook.md)

## Overview

The absent metrics operator is a companion operator for the [Prometheus
Operator](https://github.com/prometheus-operator/prometheus-operator).
@@ -76,77 +82,24 @@ labels:
severity: info
annotations:
summary: missing foo_bar
description: The metric 'foo_bar' is missing. Alerts using it may not fire as intended.
```

## Installation

### Pre-compiled binaries and Docker images

See the latest [release](https://github.com/sapcc/absent-metrics-operator/releases/latest).

### Building from source

The only required build dependency is [Go](https://golang.org/).

```
$ git clone https://github.com/sapcc/absent-metrics-operator.git
$ cd absent-metrics-operator
$ make install
```

This will put the binary in `/usr/bin/`.

Alternatively, you can also build directly with the `go get` command:

```
$ go get -u github.com/sapcc/absent-metrics-operator
description: The metric 'foo_bar' is missing. 'ImportantAlert' alert using it may not fire as intended.
```

This will put the binary in `$GOPATH/bin/`.

## Usage

```
$ absent-metrics-operator --kubeconfig="$KUBECONFIG"
```
We provide pre-compiled binaries and container images. See the latest
[release](https://github.com/sapcc/absent-metrics-operator/releases/latest).

`kubeconfig` flag is only required if running outside a cluster.
Alternatively, you can build with `make`, install with `make install`, `go get`, or
`docker build`.

For detailed usage instructions:
For usage instructions:

```
$ absent-metrics-operator --help
```

### Disable for specific alerts

You can disable the operator for a specific `PrometheusRule` resource by adding
the following label to it:

```yaml
absent-metrics-operator/disable: true
```

If you want to disable the operator for only a specific alert rule instead of
all the alerts in a `PrometheusRule`, you can use the same label at the
rule-level:

```yaml
alert: ImportantAlert
expr: foo_bar > 0
for: 5m
labels:
absent-metrics-operator/disable: true
...
```

#### Caveat

If you disable the operator for a specific alert or a specific
`PrometheusRule`, however there are other alerts or `PrometheusRules` which
have alert definitions that use the same metric(s) then the absent metric
alerts for those metric(s) will be created regardless.
You can disable the operator for a specific `PrometheusRule` or a specific alert definition. Refer to the [operator's Playbook](./doc/playbook.md) for more info.

### Metrics

@@ -186,7 +139,7 @@ labels:
severity: info
annotations:
summary: missing $metric
description: The metric '$metric' is missing. Alerts using it may not fire as intended.
description: The metric '$metric' is missing. '$alert-name' alert using it may not fire as intended.
```

Consider the metric `limes_successful_scrapes:rate5m` with tier `os` and
@@ -196,16 +149,31 @@ Then the alert name would be `AbsentOsLimesSuccessfulScrapesRate5m`.

### Labels

- `tier` and `service` labels are carried over from the original alert rule
unless those labels use templating (i.e. use `$labels`), in which case the
default `tier` and `service` values for that Prometheus server in that
namespace will be used.
- `severity` is always `info`.
#### Defaults

The following labels are always present on every absent metric alert rule:

#### Default tier and service
- `severity` is always `info`.
- `playbook` provides a [link](./doc/playbook.md) to documentation on how to
deal with an absent metric alert.

#### Carry over from original alert rule

You can specify which labels to carry over from the original alert rule by
specifying a comma-separated list of labels to the `--keep-labels` flag. The
default value for this flag is `service,tier`.
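
As a rough illustration of how such a flag value could be turned into a lookup set, here is a minimal sketch. Only the flag name `--keep-labels` and its default `service,tier` come from this README; the helper name `parseKeepLabels` and the whitespace handling are assumptions and are not taken from the operator's source.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeepLabels is a hypothetical helper (not from the operator's source).
// It turns a comma-separated flag value into a set of label names,
// trimming whitespace and ignoring empty entries.
func parseKeepLabels(flagValue string) map[string]bool {
	keep := make(map[string]bool)
	for _, name := range strings.Split(flagValue, ",") {
		name = strings.TrimSpace(name)
		if name != "" {
			keep[name] = true
		}
	}
	return keep
}

func main() {
	// "service,tier" is the documented default for --keep-labels.
	keep := parseKeepLabels("service,tier")
	fmt.Println(keep["service"], keep["tier"], keep["severity"])
}
```

A set makes the later "should this label be carried over?" check a single map lookup per label.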

#### Tier and service

`tier` and `service` labels are a special case: they are carried over from the
original alert rule unless those labels use templating (i.e. use `$labels`), in
which case the default `tier` and `service` values will be used.

The operator determines a default `tier` and `service` for a specific
Prometheus server in a namespace by traversing through all the alert rule
definitions for that Prometheus server in that namespace. It chooses the most
common `tier` and `service` label combination that is used across those alerts
as the default values.
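
The selection described above can be sketched as a simple frequency count. This is an illustration under stated assumptions, not the operator's implementation: the type `tierService` and the function `mostCommonTierService` are hypothetical, rules that lack either label or use templated values are skipped, and ties are broken lexicographically to keep the result deterministic.

```go
package main

import (
	"fmt"
	"strings"
)

// tierService is a hypothetical key type for counting label combinations.
type tierService struct {
	Tier, Service string
}

// mostCommonTierService returns the (tier, service) pair that occurs most
// often across the given label sets. Assumptions of this sketch: label sets
// missing either label or using templated values are skipped, and ties are
// broken lexicographically so that the result is deterministic.
func mostCommonTierService(ruleLabels []map[string]string) tierService {
	counts := make(map[tierService]int)
	for _, labels := range ruleLabels {
		tier, service := labels["tier"], labels["service"]
		if tier == "" || service == "" ||
			strings.Contains(tier, "$labels") || strings.Contains(service, "$labels") {
			continue
		}
		counts[tierService{Tier: tier, Service: service}]++
	}

	var best tierService
	bestCount := 0
	for k, n := range counts {
		less := k.Tier < best.Tier || (k.Tier == best.Tier && k.Service < best.Service)
		if n > bestCount || (n == bestCount && less) {
			best, bestCount = k, n
		}
	}
	return best
}

func main() {
	def := mostCommonTierService([]map[string]string{
		{"tier": "os", "service": "limes"},
		{"tier": "k8s", "service": "kube"},
		{"tier": "os", "service": "limes"},
	})
	fmt.Println(def.Tier, def.Service)
}
```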

The values of these labels are also used (if enabled with `--keep-labels`) in
the name of the absent metric alert. See [template](#template).
35 changes: 35 additions & 0 deletions doc/playbook.md
@@ -0,0 +1,35 @@
# Operator's Playbook

This document assumes that you have already read and understood the [general
README](../README.md). If not, start reading there.

### Disable for specific alerts

You can disable the operator for a specific `PrometheusRule` resource by adding
the following label to it:

```yaml
absent-metrics-operator/disable: "true"
```

If you want to disable the operator for only a specific alert rule instead of
all the alerts in a `PrometheusRule`, you can add the `no_alert_on_absence`
label to the alert rule. For example:

```yaml
alert: ImportantAlert
expr: foo_bar > 0
for: 5m
labels:
no_alert_on_absence: "true"
...
```

**Note**: make sure that you use `"true"` and not `true`.
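
The quoting matters because label values are strings, whereas YAML reads an unquoted `true` as a boolean. The following sketch shows one way a string label value could be interpreted as a flag. The helper `labelIsTrue` is hypothetical; the operator's own `mustParseBool` helper is referenced in this pull request but its body is not shown, so this is not a copy of it.

```go
package main

import (
	"fmt"
	"strconv"
)

// labelIsTrue is a hypothetical helper. It reports whether the label `key`
// is present and its string value parses as a true boolean. A missing label
// or an unparsable value counts as false.
func labelIsTrue(labels map[string]string, key string) bool {
	v, ok := labels[key]
	if !ok {
		return false
	}
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

func main() {
	labels := map[string]string{"no_alert_on_absence": "true"}
	fmt.Println(labelIsTrue(labels, "no_alert_on_absence"))
}
```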

#### Caveat

If you disable the operator for a specific alert or a specific
`PrometheusRule`, but other alerts or `PrometheusRule` resources have alert
definitions that use the same metric(s), then the absent metric alerts for
those metric(s) will be created regardless.
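
The caveat can be illustrated with a small sketch. The type `alertRule` and the function `metricsNeedingAbsentAlerts` are hypothetical stand-ins and do not mirror the operator's data structures. The point is only that a metric is covered as long as at least one rule that is not disabled uses it.

```go
package main

import (
	"fmt"
	"sort"
)

// alertRule is a hypothetical, simplified view of an alert rule:
// its name, the metrics its expression uses, and whether the operator
// has been disabled for it.
type alertRule struct {
	Name             string
	Metrics          []string
	NoAlertOnAbsence bool
}

// metricsNeedingAbsentAlerts collects the metrics used by all rules that
// are not disabled. A metric used by both a disabled and an enabled rule
// is still included.
func metricsNeedingAbsentAlerts(rules []alertRule) []string {
	seen := make(map[string]struct{})
	for _, r := range rules {
		if r.NoAlertOnAbsence {
			continue
		}
		for _, m := range r.Metrics {
			seen[m] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for m := range seen {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	rules := []alertRule{
		{Name: "ImportantAlert", Metrics: []string{"foo_bar"}, NoAlertOnAbsence: true},
		{Name: "OtherAlert", Metrics: []string{"foo_bar", "baz_total"}},
	}
	// foo_bar is still listed because OtherAlert uses it.
	fmt.Println(metricsNeedingAbsentAlerts(rules))
}
```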
43 changes: 27 additions & 16 deletions internal/controller/alert_rule.go
@@ -66,7 +66,10 @@ func (mex *metricNameExtractor) Visit(node parser.Node, path []parser.Node) (par
//
// The rule group names for the absent metric alerts have the format:
// promRuleName/originalGroupName.
func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []monitoringv1.RuleGroup) ([]monitoringv1.RuleGroup, error) {
func (c *Controller) parseRuleGroups(
promRuleName, defaultTier, defaultService string,
in []monitoringv1.RuleGroup) ([]monitoringv1.RuleGroup, error) {

out := make([]monitoringv1.RuleGroup, 0, len(in))
for _, g := range in {
var absentRules []monitoringv1.Rule
@@ -75,12 +78,12 @@ func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []moni
if r.Record != "" {
continue
}
// Do not parse alert rule if it has disable label.
if r.Labels != nil && mustParseBool(r.Labels[labelDisable]) {
// Do not parse alert rule if it has the no alert on absence label.
if r.Labels != nil && mustParseBool(r.Labels[labelNoAlertOnAbsence]) {
continue
}

rules, err := ParseAlertRule(defaultTier, defaultService, r)
rules, err := c.ParseAlertRule(defaultTier, defaultService, r)
if err != nil {
return nil, err
}
@@ -103,7 +106,7 @@ func parseRuleGroups(promRuleName, defaultTier, defaultService string, in []moni
// Since an original alert expression can reference multiple time series therefore
// a slice of []monitoringv1.Rule is returned as the result would be multiple
// absent metric alert rules (one for each time series).
func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
func (c *Controller) ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
exprStr := in.Expr.String()
mex := &metricNameExtractor{expr: exprStr, found: map[string]struct{}{}}
exprNode, err := parser.ParseExpr(exprStr)
@@ -119,20 +122,28 @@ func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.
return nil, nil
}

// Default labels
lab := map[string]string{
"severity": "info",
"playbook": "https://git.io/absent-metrics-operator-playbook",
}

// Carry over labels from the original alert
if origLab := in.Labels; origLab != nil {
if v, ok := origLab["tier"]; ok && !strings.Contains(v, "$labels") {
tier = v
}
if v, ok := origLab["service"]; ok && !strings.Contains(v, "$labels") {
service = v
for k := range c.keepLabel {
v := origLab[k]
emptyOrTmplVal := v == "" || strings.Contains(v, "$labels")
if k == labelTier && emptyOrTmplVal {
v = tier
}
if k == labelService && emptyOrTmplVal {
v = service
}
if v != "" {
lab[k] = v
}
}
}
lab := map[string]string{
"tier": tier,
"service": service,
"severity": "info",
}

// Sort metric names alphabetically for consistent test results.
metrics := make([]string, 0, len(mex.found))
@@ -145,7 +156,7 @@ func ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.
for _, m := range metrics {
// Generate an alert name from metric name:
// network:tis_a_metric:rate5m -> AbsentTierServiceNetworkTisAMetricRate5m
words := []string{"absent", tier, service}
words := []string{"absent", lab[labelTier], lab[labelService]}
sL1 := strings.Split(m, "_")
for _, v := range sL1 {
sL2 := strings.Split(v, ":")
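
The last hunk is cut off before the alert name is assembled. The sketch below shows one way the transformation stated in the code comment (`network:tis_a_metric:rate5m -> AbsentTierServiceNetworkTisAMetricRate5m`) could be completed. The function `absentAlertName` is hypothetical, and everything past the splitting on `_` and `:` is an assumption rather than the operator's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// absentAlertName is a hypothetical helper. It builds a CamelCase alert name
// from the prefix "absent", the tier, the service, and the words of the
// metric name (split on '_' and ':'). Empty words are skipped.
func absentAlertName(tier, service, metric string) string {
	words := []string{"absent", tier, service}
	words = append(words, strings.FieldsFunc(metric, func(r rune) bool {
		return r == '_' || r == ':'
	})...)

	var b strings.Builder
	for _, w := range words {
		if w == "" {
			continue
		}
		// Upper-case the first rune and keep the rest of the word unchanged.
		r := []rune(w)
		r[0] = unicode.ToUpper(r[0])
		b.WriteString(string(r))
	}
	return b.String()
}

func main() {
	fmt.Println(absentAlertName("tier", "service", "network:tis_a_metric:rate5m"))
}
```

Skipping empty words means that a missing `tier` or `service` (for example when those labels are not listed in `--keep-labels`) simply shortens the name in this sketch.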