
Alertmanager HA cluster rollout leads to huge number of refiring alerts #4405

Open

@verejoel

We have been struggling for a long time with alerts refiring into our backend systems whenever our Alertmanager HA cluster reboots (for example, due to a normal rollout, or due to Kubernetes node upgrades and the evictions they require).

I attached a diagram showing the active, suppressed, and unprocessed alerts during a typical reboot, together with the notification rate during the rollout.

What surprises us is that the cluster seems to take a long time (15+ minutes in some cases) to stabilize again after a single pod reboots. We can see from the notification rate that our backend - in this case a custom ticketing system - is getting absolutely hammered by refiring events. There seems to be a mix of notifications that should have been suppressed (and thus never turned into ticket events) and notifications that had already been sent but are erroneously sent again.

We are at a loss for how to resolve this and would really appreciate guidance on what we should tune in our cluster to reduce the impact on our backends.

Technical details below.

Version: v0.26 (with a custom patch that includes #3419)
Deployment mode: Prometheus Operator, config below
Alertmanager config: varies per route. Typical values for routes whose alerts refire:

```yaml
group_wait: 1m
group_interval: 1m15s
repeat_interval: 30d
```
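For reference, the `group_interval` above is well below Alertmanager's default of 5m, so a restarted peer re-attempts a flush roughly every 75 seconds while the notification log is still being gossiped back to it. A minimal sketch of more conservative timings (illustrative values, not an upstream recommendation):

```yaml
# Hypothetical, more conservative route timings. Alertmanager's
# defaults are group_wait: 30s, group_interval: 5m, repeat_interval: 4h.
# A longer group_interval gives a rebooted peer more time to receive
# the gossiped notification log before its next flush.
route:
  receiver: ticketing            # assumed receiver name, for illustration
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 30d
```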

Alertmanager: mostly default settings, but running in HA mode:

```yaml
  alertmanagerConfigNamespaceSelector: {}
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: obs
  alertmanagerConfiguration:
    name: obs-alerts-base
  automountServiceAccountToken: true
  clusterPushpullInterval: 5s
  containers:
  - name: alertmanager
    readinessProbe:
      initialDelaySeconds: 60
  externalUrl: https://alertmgr.osdp.open.ch
  image: alertmanager-linux-amd64:0.26.0-pr3419
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMetadata:
    annotations:
      traffic.sidecar.istio.io/excludeInboundPorts: "9094"
      traffic.sidecar.istio.io/excludeOutboundPorts: "9094"
  portName: http-web
  replicas: 3
  resources:
    limits:
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  retention: 2160h
  routePrefix: /
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  serviceAccountName: obs-monitoring-alertmanager
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: default
  version: 0.26.0-pr3419
```
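One setting that stands out above is `clusterPushpullInterval: 5s`, far below Alertmanager's upstream default of 1m for `--cluster.pushpull-interval`. Purely as a sketch, the cluster timing fields could be moved back toward the upstream defaults; `clusterGossipInterval` and `clusterPeerTimeout` are assumed to be exposed by the prometheus-operator version in use:

```yaml
# Sketch only: these fields map to Alertmanager's --cluster.* flags.
# Values shown are Alertmanager's own defaults (gossip-interval 200ms,
# pushpull-interval 1m, peer-timeout 15s).
  clusterGossipInterval: 200ms
  clusterPushpullInterval: 1m   # was 5s above
  clusterPeerTimeout: 15s
```

The peer timeout in particular governs how long each peer waits for the peers ahead of it in the cluster before sending a notification itself, so it interacts directly with the duplicate notifications described above.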

[Diagram: active, suppressed, and unprocessed alerts during a typical reboot, together with the notification rate during the rollout]
