Description
We have been struggling for a long time with alerts refiring into our backend systems whenever our Alertmanager HA cluster reboots (for example, during a normal rollout, or during Kubernetes node upgrades and the evictions they require).
I have attached a diagram showing the active, suppressed, and unprocessed alert counts during a typical reboot, together with the notification rate during the rollout.
What surprises us is how long the cluster takes (15+ minutes in some cases) to stabilize again after a single pod reboots. The notification rate shows that our backend (in this case a custom ticketing system) is getting hammered by refiring events. These are a mix of notifications that should be suppressed (and thus never become ticket events) and events that have already been notified but are erroneously notified again.
We are at a loss as to how to resolve these issues, and would really appreciate guidance on what to tune in our cluster to reduce the impact on our backends.
Technical details below.
Version: v0.26 (with a custom patch that includes #3419)
Deployment mode: Prometheus Operator, config below
Alertmanager config: varied. Typical values for the routes whose alerts are refiring:

```yaml
group_wait: 1m
group_interval: 1m15s
repeat_interval: 30d
```
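For context, here is a minimal sketch of where these values sit in our routing tree; the receiver name and `group_by` labels below are illustrative placeholders, not our real config:

```yaml
route:
  receiver: ticketing            # hypothetical receiver name
  group_by: ['alertname', 'cluster']
  group_wait: 1m                 # wait before sending the first notification for a new group
  group_interval: 1m15s          # wait before sending updates for an existing group
  repeat_interval: 30d           # re-send an unchanged, still-firing group only after this
```

With `repeat_interval: 30d`, the re-notifications we see after a reboot cannot be ordinary repeats; they only occur around pod restarts.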
Alertmanager spec: mostly default settings, but running in HA mode:

```yaml
alertmanagerConfigNamespaceSelector: {}
alertmanagerConfigSelector:
  matchLabels:
    alertmanagerConfig: obs
alertmanagerConfiguration:
  name: obs-alerts-base
automountServiceAccountToken: true
clusterPushpullInterval: 5s
containers:
- name: alertmanager
  readinessProbe:
    initialDelaySeconds: 60
externalUrl: https://alertmgr.osdp.open.ch
image: alertmanager-linux-amd64:0.26.0-pr3419
listenLocal: false
logFormat: logfmt
logLevel: info
paused: false
podMetadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "9094"
    traffic.sidecar.istio.io/excludeOutboundPorts: "9094"
portName: http-web
replicas: 3
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi
retention: 2160h
routePrefix: /
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
serviceAccountName: obs-monitoring-alertmanager
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: default
version: 0.26.0-pr3419
```