Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules #2585

Closed
Ebaneck opened this issue Jun 2, 2020 · 0 comments
Assignees: Ebaneck
Labels: kind:bug (Something isn't working), priority:urgent (Any issue we should jump in as soon as possible), release:blocker (An issue that blocks a release until resolved)

Ebaneck commented Jun 2, 2020

Component:

'lifecycle', 'addons', 'Prometheus rules'

What happened:

Upgrade and downgrade are broken on the 2.4 and 2.5 branches, with the following error:

E       AssertionError: Expected default Prometheus rules to be equal to deployed rules.
E       assert [{'message': ...arning'}, ...] == [{'message': '...arning'}, ...]
E         At index 24 diff: {'message': 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.', 'name': 'KubePodCrashLooping', 'query': 'rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[15m]) * 60 * 5 > 0', 'severity': 'critical'} != {'name': 'AlertmanagerDown', 'severity': 'critical', 'message': 'Alertmanager has disappeared from Prometheus target discovery.', 'query': 'absent(up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"} == 1)'}
E         Righ...

What was expected:

  • The post-merge CI run should pass normally.

Steps to reproduce:

  • Run the post-merge CI on the 2.5 branch and observe the failure above.
  • In 2.5, we recently upgraded the prometheus-operator charts from 8.1.2 to 8.13.0. There are significant changes in the Prometheus alert rules between the two versions.
  • To inspect the alert rule changes, bootstrap a 2.4 cluster and use the rule_extractor Python script to dump the alert rules to a file. A sample diff of the alert rules is shown below, followed by a sketch of such a dump.
diff --git a/tools/rule_extractor/alerting_rules.json b/tools/rule_extractor/alerting_rules.json
index 18af323a..09f750e1 100644
--- a/tools/rule_extractor/alerting_rules.json
+++ b/tools/rule_extractor/alerting_rules.json
@@ -96,147 +96,165 @@
         "severity": "warning"
     },
     {
-        "message": "{{ printf \"%.4g\" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.",
+        "message": "{{ printf \"%.4g\" $value }}% of the {{ $labels.job }} targets in {{ $labels.namespace }} namespace are down.",
         "name": "TargetDown",
         "query": "100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10",
         "severity": "warning"
     },
     {
-        "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.",
+        "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n",
         "name": "Watchdog",
         "query": "vector(1)",
         "severity": "none"
     },
     {
-        "message": "The API server is burning too much error budget",
-        "name": "KubeAPIErrorBudgetBurn",
-        "query": "sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)",
+        "message": "Alertmanager has disappeared from Prometheus target discovery.",
+        "name": "AlertmanagerDown",
+        "query": "absent(up{job=\"prometheus-operator-alertmanager\",namespace=\"metalk8s-monitoring\"} == 1)",
         "severity": "critical"
     },
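
For reference, the idea behind such a dump can be sketched in a few lines of Python. This is only a sketch, not the actual tools/rule_extractor script; the Prometheus URL below is an assumption (e.g. reached via kubectl port-forward to the prometheus-operator-prometheus service), and the output format mirrors alerting_rules.json above.

#!/usr/bin/env python3
"""Dump the alerting rules exposed by Prometheus into a JSON file for diffing."""
import json

import requests

# Assumption: Prometheus is reachable locally, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"

def dump_alerting_rules(outfile="alerting_rules.json"):
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/rules")
    resp.raise_for_status()
    rules = []
    for group in resp.json()["data"]["groups"]:
        for rule in group["rules"]:
            if rule["type"] != "alerting":
                continue
            rules.append({
                "name": rule["name"],
                "severity": rule.get("labels", {}).get("severity", ""),
                "message": rule.get("annotations", {}).get("message", ""),
                "query": rule["query"],
            })
    # Sort so that dumps taken from different chart versions diff cleanly.
    rules.sort(key=lambda rule: (rule["name"], rule["severity"]))
    with open(outfile, "w") as f:
        json.dump(rules, f, indent=4, sort_keys=True)

if __name__ == "__main__":
    dump_alerting_rules()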

Resolution proposal (optional):

  • Some PrometheusRule CRs are left untouched after an upgrade.
    There is no strict requirement to run the Prometheus alert rule comparison test during upgrades/downgrades, mainly because alert rule changes are already caught by the bootstrap/install tests.
    So I suggest we add a pytest filter so that this particular scenario can be skipped during upgrade/downgrade (a sketch of such a filter follows the scenario below):
    @alertrules
    Scenario: Ensure deployed Prometheus rules match the default
        Given the Kubernetes API is available
        And we have 1 running pod labeled 'prometheus=prometheus-operator-prometheus' in namespace 'metalk8s-monitoring'
        Then the deployed Prometheus alert rules are the same as the default alert rules
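
A minimal sketch of such a filter, assuming the suite uses pytest-bdd (which exposes Gherkin tags like @alertrules as pytest markers); the --skip-alert-rules option name is hypothetical:

# conftest.py -- sketch only, assuming pytest-bdd maps @alertrules to a marker.
import pytest

def pytest_addoption(parser):
    # Hypothetical option that the upgrade/downgrade CI jobs would pass.
    parser.addoption(
        "--skip-alert-rules",
        action="store_true",
        default=False,
        help="Skip Prometheus alert rule comparison scenarios.",
    )

def pytest_collection_modifyitems(config, items):
    if not config.getoption("--skip-alert-rules"):
        return
    skip = pytest.mark.skip(
        reason="alert rule comparison is skipped during upgrade/downgrade"
    )
    for item in items:
        # Scenarios tagged @alertrules carry the "alertrules" marker.
        if "alertrules" in item.keywords:
            item.add_marker(skip)

The upgrade/downgrade CI jobs would then invoke pytest with --skip-alert-rules, while the bootstrap/install jobs keep the scenario enabled.
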
Ebaneck added the kind:bug, priority:urgent and release:blocker labels Jun 2, 2020
Ebaneck changed the title from "Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules test" to "Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules" Jun 2, 2020
Ebaneck added this to To do in Week 23/24/2020 via automation Jun 2, 2020
Ebaneck self-assigned this Jun 3, 2020
Ebaneck moved this from To do to In progress in Week 23/24/2020 Jun 3, 2020
Ebaneck added a commit that referenced this issue Jun 4, 2020
…Rule CRs

During upgrade/downgrade, PrometheusRule CRs from the previous
installation may be left behind.

This commit adds a salt-state to perform cleanup post upgrade/downgrade.

This also ensures that our CI tests for deployed Prometheus rules always
pass.

Closes: #2585
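
The fix itself is a Salt state; purely to illustrate the cleanup idea, here is a sketch in Python using the official kubernetes client (the namespace and the `keep` set of expected rule names are assumptions, not the actual implementation):

# Sketch of the cleanup idea: delete PrometheusRule CRs that are no
# longer part of the freshly deployed chart release. Not the actual
# Salt state shipped by the commit above.
from kubernetes import client, config

GROUP, VERSION, PLURAL = "monitoring.coreos.com", "v1", "prometheusrules"
NAMESPACE = "metalk8s-monitoring"  # assumption

def cleanup_stale_prometheus_rules(keep):
    """Delete PrometheusRule objects whose names are not in `keep`."""
    config.load_kube_config()  # or config.load_incluster_config()
    api = client.CustomObjectsApi()
    existing = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
    for item in existing.get("items", []):
        name = item["metadata"]["name"]
        if name not in keep:
            api.delete_namespaced_custom_object(
                GROUP, VERSION, NAMESPACE, PLURAL, name,
                body=client.V1DeleteOptions(),
            )
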
Ebaneck moved this from In progress to Review in progress in Week 23/24/2020 Jun 4, 2020
Ebaneck moved this from Review in progress to Reviewer approved in Week 23/24/2020 Jun 5, 2020
bert-e closed this as completed in d5271b9 Jun 5, 2020
Week 23/24/2020 automation moved this from Reviewer approved to Done Jun 5, 2020