controller: use resource tier and service
Use `PrometheusRule` level `tier` and `service` labels as defaults, if
specified.
talal committed Aug 20, 2020
1 parent 18daa89 commit 8b9d306
Showing 9 changed files with 121 additions and 77 deletions.
35 changes: 14 additions & 21 deletions README.md
@@ -19,7 +19,6 @@ In this document:
- [Labels](#labels)
- [Defaults](#defaults)
- [Carry over from original alert rule](#carry-over-from-original-alert-rule)
- [Tier and service](#tier-and-service)

In other documents:

@@ -77,9 +76,10 @@ alert: AbsentFooBar
expr: absent(foo_bar)
for: 10m
labels:
severity: info
playbook: https://github.com/sapcc/absent-metrics-operator/blob/master/doc/playbook.md
tier: network
service: foo
severity: info
annotations:
summary: missing foo_bar
description: The metric 'foo_bar' is missing. 'ImportantAlert' alert using it may not fire as intended.
@@ -99,7 +99,9 @@ For usage instructions:
$ absent-metrics-operator --help
```

You can disable the operator for a specific `PrometheusRule` or a specific alert definition; refer to the [operator's playbook](./doc/playbook.md) for more info.
The operator can be disabled for a specific `PrometheusRule` or a specific
alert definition. Refer to the [operator's playbook](./doc/playbook.md) for
more info.

### Metrics

@@ -134,18 +136,21 @@ alert: $name
expr: absent($metric)
for: 10m
labels:
severity: info
playbook: https://github.com/sapcc/absent-metrics-operator/blob/master/doc/playbook.md
tier: $tier
service: $service
severity: info
annotations:
summary: missing $metric
description: The metric '$metric' is missing. '$alert-name' alert using it may not fire as intended.
```

Consider the metric `limes_successful_scrapes:rate5m` with tier `os` and
service `limes`.
service `limes`. The corresponding absent metric alert name would be
`AbsentOsLimesSuccessfulScrapesRate5m`.

Then the alert name would be `AbsentOsLimesSuccessfulScrapesRate5m`.
The values of the `tier` and `service` labels are only included in the name if
those labels are specified in the `--keep-labels` flag. See below.
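
As a concrete illustration of this naming scheme, here is a minimal Go sketch. The helper name `buildAlertName` is hypothetical, and the skipping of metric words that merely repeat the tier or service is an assumption inferred from the example above, not a guarantee about the operator's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

// buildAlertName derives an absent metric alert name: "absent" + tier +
// service + the metric name split on '_' and ':', all joined in CamelCase.
// Words equal to the tier or service are skipped (an assumption), which is
// what makes the example below come out without a duplicated "Limes".
func buildAlertName(tier, service, metric string) string {
	words := []string{"absent", tier, service}
	for _, u := range strings.Split(metric, "_") {
		for _, w := range strings.Split(u, ":") {
			if w != "" && w != tier && w != service {
				words = append(words, w)
			}
		}
	}
	var b strings.Builder
	for _, w := range words {
		if w == "" {
			continue
		}
		b.WriteString(strings.ToUpper(w[:1]) + w[1:])
	}
	return b.String()
}

func main() {
	// Prints: AbsentOsLimesSuccessfulScrapesRate5m
	fmt.Println(buildAlertName("os", "limes", "limes_successful_scrapes:rate5m"))
}
```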

### Labels

@@ -155,25 +160,13 @@ The following labels are always present on every absent metric alert rule:

- `severity` is always `info`.
- `playbook` provides a [link](./doc/playbook.md) to documentation that can be
referenced on how to deal with an absent metric alert.
referenced on how to deal with absent metric alerts.

#### Carry over from original alert rule

You can choose which labels to carry over from the original alert rule by
passing a comma-separated list of labels to the `--keep-labels` flag. The
default value for this flag is `service,tier`.

#### Tier and service

`tier` and `service` labels are a special case: they are carried over from the
original alert rule unless those labels use templating (i.e. use `$labels`), in
which case the default `tier` and `service` values are used.

The operator determines a default `tier` and `service` for a specific
Prometheus server in a namespace by traversing through all the alert rule
definitions for that Prometheus server in that namespace. It chooses the most
common `tier` and `service` label combination that is used across those alerts
as the default values.

The values of these labels are also used (if enabled with `keep-labels`) in
the name of the absent metric alert. See [template](#Template).
The `tier` and `service` labels are a special case: they are co-dependent. See
the [playbook](./doc/playbook.md) for details.
38 changes: 31 additions & 7 deletions doc/playbook.md
@@ -1,9 +1,15 @@
# Operator's Playbook

In this document:

- [Disable for specific alerts](#disable-for-specific-alerts)
- [Caveat](#caveat)
- [Tier and service labels](#tier-and-service-labels)

This document assumes that you have already read and understood the [general
README](../README.md). If not, start reading there.

### Disable for specific alerts
## Disable for specific alerts

You can disable the operator for a specific `PrometheusRule` resource by adding
the following label to it:
@@ -13,8 +19,8 @@ absent-metrics-operator/disable: "true"
```

If you want to disable the operator for only a specific alert rule instead of
all the alerts in a `PrometheusRule`, you can add the `no_alert_on_absence`
label to the alert rule. For example:
all the alerts in a `PrometheusRule`, then you can add the `no_alert_on_absence`
label to a specific alert rule. For example:

```yaml
alert: ImportantAlert
@@ -27,9 +33,27 @@ labels:

**Note**: make sure that you use `"true"` and not `true`.
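
Taken together, the two switches above amount to the following check. This is a hedged sketch with illustrative names (`isDisabled` and its parameters are not the operator's actual identifiers):

```go
package main

import "fmt"

// isDisabled sketches the two disable switches described above.
func isDisabled(resourceLabels, ruleLabels map[string]string) bool {
	// Resource-level switch: disables the operator for the whole
	// PrometheusRule.
	if resourceLabels["absent-metrics-operator/disable"] == "true" {
		return true
	}
	// Per-rule switch: disables a single alert definition. The comparison
	// is against the string "true", hence the quoting requirement above.
	return ruleLabels["no_alert_on_absence"] == "true"
}

func main() {
	labels := map[string]string{"no_alert_on_absence": "true"}
	fmt.Println(isDisabled(map[string]string{}, labels)) // true
}
```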

#### Caveat
### Caveat

If you disable the operator for a specific alert or a specific
`PrometheusRule`, however there are other alerts or `PrometheusRules` which
have alert definitions that use the same metric(s) then the absent metric
alerts for those metric(s) will be created regardless.
`PrometheusRule` but there are other alerts or `PrometheusRules` which
have alert definitions that use the same metrics, then the absent metric
alerts for those metrics will be created regardless.

## Tier and service labels

`tier` and `service` labels are a special case. We (SAP CCloudEE) use them for
posting alert notifications to different Slack channels.

These labels are determined in the following order, from highest to lowest
priority (a sketch follows the list):

1. If the alert rule has the `tier` **OR** `service` label and the label
doesn't use templating (e.g. `$labels.some_label`), then carry over that
label as is.
2. If the `tier` **OR** `service` labels are defined at the resource (i.e.
`PrometheusRule`) level then use their values.
3. Try to find a default value for the `tier` and `service` labels by
traversing through all the alert rule definitions for a specific Prometheus
server in a specific namespace. The `tier` **AND** `service` label
combination that is the most common amongst all those alerts will be used as
the default.
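
A minimal Go sketch of this precedence for a single label; `resolveLabel` and its parameters are illustrative names, not the operator's identifiers:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveLabel sketches the precedence above for one of the two labels.
// ruleVal is the label on the alert rule itself, resourceVal the label on the
// PrometheusRule resource, and namespaceDefault the most common value across
// the namespace (step 3).
func resolveLabel(ruleVal, resourceVal, namespaceDefault string) string {
	// 1. The alert rule's own label wins, unless it uses templating.
	if ruleVal != "" && !strings.Contains(ruleVal, "$labels") {
		return ruleVal
	}
	// 2. Fall back to the resource-level label.
	if resourceVal != "" {
		return resourceVal
	}
	// 3. Fall back to the namespace-wide default.
	return namespaceDefault
}

func main() {
	// Rule label is templated, resource label is set: prints "network".
	fmt.Println(resolveLabel("$labels.tier", "network", "os"))
}
```
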
12 changes: 6 additions & 6 deletions internal/controller/alert_rule.go
@@ -106,7 +106,7 @@ func (c *Controller) parseRuleGroups(
// Since an original alert expression can reference multiple time series, a
// slice of []monitoringv1.Rule is returned: the result may be multiple
// absent metric alert rules (one for each time series).
func (c *Controller) ParseAlertRule(tier, service string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
func (c *Controller) ParseAlertRule(defaultTier, defaultService string, in monitoringv1.Rule) ([]monitoringv1.Rule, error) {
exprStr := in.Expr.String()
mex := &metricNameExtractor{expr: exprStr, found: map[string]struct{}{}}
exprNode, err := parser.ParseExpr(exprStr)
@@ -133,11 +133,11 @@ func (c *Controller) ParseAlertRule(tier, service string, in monitoringv1.Rule)
for k := range c.keepLabel {
v := origLab[k]
emptyOrTmplVal := v == "" || strings.Contains(v, "$labels")
if k == labelTier && emptyOrTmplVal {
v = tier
if k == LabelTier && emptyOrTmplVal {
v = defaultTier
}
if k == labelService && emptyOrTmplVal {
v = service
if k == LabelService && emptyOrTmplVal {
v = defaultService
}
if v != "" {
lab[k] = v
@@ -156,7 +156,7 @@ func (c *Controller) ParseAlertRule(tier, service string, in monitoringv1.Rule)
for _, m := range metrics {
// Generate an alert name from metric name:
// network:tis_a_metric:rate5m -> AbsentTierServiceNetworkTisAMetricRate5m
words := []string{"absent", lab[labelTier], lab[labelService]}
words := []string{"absent", lab[LabelTier], lab[LabelService]}
sL1 := strings.Split(m, "_")
for _, v := range sL1 {
sL2 := strings.Split(v, ":")
38 changes: 32 additions & 6 deletions internal/controller/controller.go
@@ -46,9 +46,12 @@ const (
labelOperatorDisable = "absent-metrics-operator/disable"

labelNoAlertOnAbsence = "no_alert_on_absence"
)

labelTier = "tier"
labelService = "service"
// Common constants for reusability.
const (
LabelTier = "tier"
LabelService = "service"
)

const (
@@ -69,9 +72,13 @@ const (
// Controller is the controller implementation for acting on PrometheusRule
// resources.
type Controller struct {
logger *log.Logger
metrics *Metrics
logger *log.Logger
metrics *Metrics

keepLabel map[string]bool
// keepTierServiceLabels is a shorthand for:
// c.keepLabel[LabelTier] && c.keepLabel[LabelService]
keepTierServiceLabels bool

kubeClientset kubernetes.Interface
promClientset monitoringclient.Interface
@@ -107,6 +114,7 @@ func New(
promClientset: pClient,
workqueue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "prometheusrules"),
}
c.keepTierServiceLabels = c.keepLabel[LabelTier] && c.keepLabel[LabelService]
ruleInf := informers.NewSharedInformerFactory(pClient, resyncPeriod).Monitoring().V1().PrometheusRules()
c.promRuleLister = ruleInf.Lister()
c.promRuleInformer = ruleInf.Informer()
@@ -297,7 +305,7 @@ func (c *Controller) syncHandler(key string) error {
prometheusServer, ok := promRule.Labels["prometheus"]
if !ok {
// This shouldn't happen but just in case it does.
c.logger.ErrorWithBackoff("msg", "no 'prometheus' label found on the PrometheusRule", "key", key)
c.logger.ErrorWithBackoff("msg", "no 'prometheus' label found", "key", key)
return nil
}

@@ -319,8 +327,26 @@ func (c *Controller) syncHandler(key string) error {
return errors.Wrap(err, "could not get AbsentPrometheusRule")
}

defaultTier := absentPromRule.Tier
defaultService := absentPromRule.Service
if c.keepTierServiceLabels {
// If the PrometheusRule has tier and service labels then use those as
// the defaults.
if t := promRule.Labels[LabelTier]; t != "" {
defaultTier = t
}
if s := promRule.Labels[LabelService]; s != "" {
defaultService = s
}
if defaultTier == "" {
c.logger.ErrorWithBackoff("msg", "could not find a value for 'tier' label", "key", key)
}
if defaultService == "" {
c.logger.ErrorWithBackoff("msg", "could not find a value for 'service' label", "key", key)
}
}
// Parse alert rules into absent metric alert rules.
rg, err := c.parseRuleGroups(name, absentPromRule.Tier, absentPromRule.Service, promRule.Spec.Groups)
rg, err := c.parseRuleGroups(name, defaultTier, defaultService, promRule.Spec.Groups)
if err != nil {
// We choose to absorb the error here as the worker would requeue the
// resource otherwise and we'll be stuck parsing broken alert rules.
47 changes: 18 additions & 29 deletions internal/controller/prometheusrule.go
@@ -58,25 +58,22 @@ func (c *Controller) getAbsentPrometheusRule(namespace, prometheusServer string)

// Find default tier and service values for this Prometheus server in this
// namespace.
if c.keepLabel[labelTier] || c.keepLabel[labelService] {
if c.keepTierServiceLabels {
// Fast path: get values from resource labels
t, s := aPR.Labels[labelTier], aPR.Labels[labelService]
if t == "" || s == "" {
aPR.Tier = aPR.Labels[LabelTier]
aPR.Service = aPR.Labels[LabelService]
if aPR.Tier == "" || aPR.Service == "" {
// If we can't get the values from the resource then we fall back to
// the slower method of getting them by checking the alert rules.
t, s = getTierAndService(aPR.Spec.Groups)
}
if t == "" || s == "" {
c.logger.Info("msg", fmt.Sprintf("could not find default tier and service for Prometheus server '%s' in namespace '%s'",
prometheusServer, namespace))
}
if c.keepLabel[labelTier] {
aPR.Tier = t
aPR.Labels[labelTier] = t
}
if c.keepLabel[labelService] {
aPR.Service = s
aPR.Labels[labelService] = s
t, s := getTierAndService(aPR.Spec.Groups)
if t != "" {
aPR.Tier = t
aPR.Labels[LabelTier] = t
}
if s != "" {
aPR.Service = s
aPR.Labels[LabelService] = s
}
}
}

@@ -103,7 +100,7 @@ func (c *Controller) newAbsentPrometheusRule(namespace, prometheusServer string)

// Find default tier and service values for this Prometheus server in this
// namespace.
if c.keepLabel[labelTier] || c.keepLabel[labelService] {
if c.keepTierServiceLabels {
prList, err := c.promRuleLister.List(labels.Everything())
if err != nil {
return nil, errors.Wrap(err, "could not list PrometheusRules")
@@ -116,21 +113,13 @@ func (c *Controller) newAbsentPrometheusRule(namespace, prometheusServer string)
}
}
t, s := getTierAndService(rg)
if t == "" || s == "" {
// Ideally, we shouldn't arrive at this point because this would
// mean that there was not a single alert rule for the prometheus
// server in this namespace that did not use templating for its
// tier and service labels.
c.logger.Info("msg", fmt.Sprintf("could not find default tier and service for Prometheus server '%s' in namespace '%s'",
prometheusServer, namespace))
}
if c.keepLabel[labelTier] {
if t != "" {
aPR.Tier = t
aPR.Labels[labelTier] = t
aPR.Labels[LabelTier] = t
}
if c.keepLabel[labelService] {
if s != "" {
aPR.Service = s
aPR.Labels[labelService] = s
aPR.Labels[LabelService] = s
}
}

4 changes: 2 additions & 2 deletions internal/controller/utils.go
@@ -33,11 +33,11 @@ func getTierAndService(rg []monitoringv1.RuleGroup) (tier, service string) {
if r.Record != "" {
continue
}
t, ok := r.Labels[labelTier]
t, ok := r.Labels[LabelTier]
if !ok || strings.Contains(t, "$labels") {
continue
}
s, ok := r.Labels[labelService]
s, ok := r.Labels[LabelService]
if !ok || strings.Contains(s, "$labels") {
continue
}
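
The tallying logic of `getTierAndService` is elided above. Here is a hedged sketch of the "most common tier and service combination" idea it implements; the type and function names are assumptions, not the repo's actual code:

```go
package main

import "fmt"

type tierService struct{ tier, service string }

// mostCommon counts each (tier, service) combination seen on non-templated
// alert rules and returns the combination with the highest count.
func mostCommon(seen []tierService) (tier, service string) {
	counts := make(map[tierService]int)
	var best tierService
	for _, p := range seen {
		counts[p]++
		if counts[p] > counts[best] {
			best = p
		}
	}
	return best.tier, best.service
}

func main() {
	// Prints: os limes
	fmt.Println(mostCommon([]tierService{
		{"os", "limes"}, {"os", "limes"}, {"network", "foo"},
	}))
}
```
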
15 changes: 13 additions & 2 deletions main.go
@@ -52,6 +52,10 @@ var (
log.FormatLogfmt,
log.FormatJSON,
}
defaultKeepLabels = []string{
controller.LabelService,
controller.LabelTier,
}
)

func main() {
@@ -62,7 +66,8 @@ func main() {
flagset.StringVar(&logFormat, "log-format", log.FormatLogfmt,
fmt.Sprintf("Log format to use. Possible values: %s", strings.Join(availableLogFormats, ", ")))
flagset.StringVar(&kubeconfig, "kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster")
flagset.StringVar(&keepLabels, "keep-labels", "service,tier", "A comma separated list of labels to keep from the original alert rule")
flagset.StringVar(&keepLabels, "keep-labels", strings.Join(defaultKeepLabels, ","),
"A comma separated list of labels to keep from the original alert rule")
if err := flagset.Parse(os.Args[1:]); err != nil {
logFatalf("could not parse flagset: %s", err.Error())
}
@@ -77,12 +82,18 @@

r := prometheus.NewRegistry()

// Create controller
keepLabelMap := make(map[string]bool)
kL := strings.Split(keepLabels, ",")
for _, v := range kL {
keepLabelMap[strings.TrimSpace(v)] = true
}
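// Labels 'tier' and 'service' are co-dependent: fail fast if only one of
// the two was given in --keep-labels.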
if keepLabelMap[controller.LabelTier] || keepLabelMap[controller.LabelService] {
if !(keepLabelMap[controller.LabelTier] && keepLabelMap[controller.LabelService]) {
logger.Fatal("msg", "labels 'tier' and 'service' are co-dependent, i.e. use both or neither")
}
}

// Create controller
cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
logger.Fatal("msg", "instantiating cluster config failed", "err", err)