Alerts: Improve HAProxyReloadFail alert

openshift/router#209 reworks the HAProxy reload failure metric so that the HAProxyReloadFail alert can be improved. It replaces the template_router_reload_fails metric with template_router_reload_failure, a simple boolean gauge, which lets the HAProxyReloadFail alert keep firing for the duration of the HAProxy reload outage. Previously, the alert would fire for ~5 minutes regardless of whether reloads were still failing on the router. This commit also drops the HAProxyReloadFail alert to warning severity.
sgreene570 · Nov 11, 2020 · e1bcb1b
1 parent 49d3e86
Showing 1 changed file with 27 additions and 0 deletions.
diff --git a/manifests/0000_90_ingress-operator_03_prometheusrules.yaml b/manifests/0000_90_ingress-operator_03_prometheusrules.yaml
@@ -0,0 +1,27 @@
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: ingress-operator
+  namespace: openshift-ingress-operator
+  labels:
+    role: alert-rules
+  annotations:
+    include.release.openshift.io/self-managed-high-availability: "true"
+spec:
+  groups:
+    - name: openshift-ingress.rules
+      rules:
+      - alert: HAProxyReloadFail
+        expr: template_router_reload_failure == 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          message: "HAProxy reloads are failing on {{ $labels.pod }}. Router is not respecting recently created or modified routes"
+      - alert: HAProxyDown
+        expr: haproxy_up == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          message: "HAProxy metrics are reporting that the router is down"
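
The behavior described in the commit message (the alert tracks the boolean gauge, so it fires only while reloads are still failing) could be checked with a promtool rule unit test. The following is a minimal sketch, not part of this commit. It assumes the `groups:` content under `spec` has been extracted into a plain Prometheus rule file, here given the hypothetical name `ingress-rules.yaml`, because promtool reads plain rule files rather than PrometheusRule objects. The pod names are also hypothetical.

```yaml
# Sketch of a promtool unit test; run with: promtool test rules <this file>
# Assumption: ingress-rules.yaml holds only the `groups:` list from the
# manifest's `spec` (hypothetical file name, not part of this commit).
rule_files:
  - ingress-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Reloads keep failing on this (hypothetical) pod: gauge stays at 1.
      - series: 'template_router_reload_failure{pod="router-default-a"}'
        values: '1+0x10'
      # Reloads fail briefly, then recover before the 5m `for` window elapses.
      - series: 'template_router_reload_failure{pod="router-default-b"}'
        values: '1+0x2 0+0x8'
    alert_rule_test:
      # After more than 5m of continuous failure, only pod "a" should alert.
      - eval_time: 6m
        alertname: HAProxyReloadFail
        exp_alerts:
          - exp_labels:
              severity: warning
              pod: router-default-a
            exp_annotations:
              message: "HAProxy reloads are failing on router-default-a. Router is not respecting recently created or modified routes"
      # Before the `for: 5m` window elapses, nothing should be firing yet.
      - eval_time: 4m
        alertname: HAProxyReloadFail
        exp_alerts: []
```

In this sketch the pod whose gauge returns to 0 never produces a firing alert, while the pod whose gauge stays at 1 fires once the `for: 5m` window has elapsed.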