Description
We have been struggling for a long time with alerts refiring into our backend systems whenever our Alertmanager HA cluster reboots (for example, during a normal rollout, or during Kubernetes node upgrades and the evictions they require).
I have attached a diagram showing the active, suppressed, and unprocessed alert counts during a typical reboot, together with the notification rate during the rollout.
What surprises us is how long the cluster takes (15+ minutes in some cases) to stabilize again after a single pod reboots. The notification rate shows that our backend (in this case a custom ticketing system) is getting hammered by refiring events. These are a mix of notifications that should be suppressed (and thus never become ticket events) and events that have already been notified but are erroneously notified again.
We are at a loss as to how to resolve these issues, and would really appreciate guidance on what to tune in our cluster to reduce the impact on our backends.
Technical details below.
Version: v0.26 (with a custom patch that includes #3419)
Deployment mode: Prometheus Operator, config below
Alertmanager config: varied. Typical values for the routes whose alerts are refiring:

```yaml
group_wait: 1m
group_interval: 1m15s
repeat_interval: 30d
```
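For context, here is a minimal sketch of where these values sit in our routing tree; the receiver name and `group_by` labels below are illustrative placeholders, not our real config:

```yaml
route:
  receiver: ticketing            # hypothetical receiver name
  group_by: ['alertname', 'cluster']
  group_wait: 1m                 # wait before sending the first notification for a new group
  group_interval: 1m15s          # wait before sending updates for an existing group
  repeat_interval: 30d           # re-send an unchanged, still-firing group only after this
```

With `repeat_interval: 30d`, the re-notifications we see after a reboot cannot be ordinary repeats; they only occur around pod restarts.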
Alertmanager spec: mostly default settings, but running in HA mode:

```yaml
alertmanagerConfigNamespaceSelector: {}
alertmanagerConfigSelector:
  matchLabels:
    alertmanagerConfig: obs
alertmanagerConfiguration:
  name: obs-alerts-base
automountServiceAccountToken: true
clusterPushpullInterval: 5s
containers:
- name: alertmanager
  readinessProbe:
    initialDelaySeconds: 60
externalUrl: https://alertmgr.osdp.open.ch
image: alertmanager-linux-amd64:0.26.0-pr3419
listenLocal: false
logFormat: logfmt
logLevel: info
paused: false
podMetadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "9094"
    traffic.sidecar.istio.io/excludeOutboundPorts: "9094"
portName: http-web
replicas: 3
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi
retention: 2160h
routePrefix: /
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
serviceAccountName: obs-monitoring-alertmanager
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: default
version: 0.26.0-pr3419
```