Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules #2585

Closed
Ebaneck opened this issue Jun 2, 2020 · 0 comments
Assignees: Ebaneck
Labels: kind:bug (Something isn't working), priority:urgent (Any issue we should jump in as soon as possible), release:blocker (An issue that blocks a release until resolved)

Ebaneck commented Jun 2, 2020

Component:

'lifecycle', 'addons', 'Prometheus rules'

What happened:

Upgrade and downgrade are broken on the 2.4 and 2.5 branches, with the following error:

E       AssertionError: Expected default Prometheus rules to be equal to deployed rules.
E       assert [{'message': ...arning'}, ...] == [{'message': '...arning'}, ...]
E         At index 24 diff: {'message': 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.', 'name': 'KubePodCrashLooping', 'query': 'rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[15m]) * 60 * 5 > 0', 'severity': 'critical'} != {'name': 'AlertmanagerDown', 'severity': 'critical', 'message': 'Alertmanager has disappeared from Prometheus target discovery.', 'query': 'absent(up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"} == 1)'}
E         Righ...

What was expected:

  • The post-merge CI run should pass normally.

Steps to reproduce:

  • Run the post-merge CI on the 2.5 branch and observe the failure above.
  • In 2.5, we recently upgraded the prometheus-operator charts from 8.1.2 to 8.13.0. There are significant changes in the Prometheus alert rules between the two versions.
  • To inspect the alert rule changes, bootstrap a 2.4 cluster and use the rule_extractor Python script to dump the alert rules to a file. A sample diff of the alert rules is shown below, followed by a sketch of such a dump.
diff --git a/tools/rule_extractor/alerting_rules.json b/tools/rule_extractor/alerting_rules.json
index 18af323a..09f750e1 100644
--- a/tools/rule_extractor/alerting_rules.json
+++ b/tools/rule_extractor/alerting_rules.json
@@ -96,147 +96,165 @@
         "severity": "warning"
     },
     {
-        "message": "{{ printf \"%.4g\" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.",
+        "message": "{{ printf \"%.4g\" $value }}% of the {{ $labels.job }} targets in {{ $labels.namespace }} namespace are down.",
         "name": "TargetDown",
         "query": "100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10",
         "severity": "warning"
     },
     {
-        "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.",
+        "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n",
         "name": "Watchdog",
         "query": "vector(1)",
         "severity": "none"
     },
     {
-        "message": "The API server is burning too much error budget",
-        "name": "KubeAPIErrorBudgetBurn",
-        "query": "sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)",
+        "message": "Alertmanager has disappeared from Prometheus target discovery.",
+        "name": "AlertmanagerDown",
+        "query": "absent(up{job=\"prometheus-operator-alertmanager\",namespace=\"metalk8s-monitoring\"} == 1)",
         "severity": "critical"
     },
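
For reference, the idea behind such a dump can be sketched in a few lines of Python. This is only a sketch, not the actual tools/rule_extractor script; the Prometheus URL below is an assumption (e.g. reached via kubectl port-forward to the prometheus-operator-prometheus service), and the output format mirrors alerting_rules.json above.

#!/usr/bin/env python3
"""Dump the alerting rules exposed by Prometheus into a JSON file for diffing."""
import json

import requests

# Assumption: Prometheus is reachable locally, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"

def dump_alerting_rules(outfile="alerting_rules.json"):
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/rules")
    resp.raise_for_status()
    rules = []
    for group in resp.json()["data"]["groups"]:
        for rule in group["rules"]:
            if rule["type"] != "alerting":
                continue
            rules.append({
                "name": rule["name"],
                "severity": rule.get("labels", {}).get("severity", ""),
                "message": rule.get("annotations", {}).get("message", ""),
                "query": rule["query"],
            })
    # Sort so that dumps taken from different chart versions diff cleanly.
    rules.sort(key=lambda rule: (rule["name"], rule["severity"]))
    with open(outfile, "w") as f:
        json.dump(rules, f, indent=4, sort_keys=True)

if __name__ == "__main__":
    dump_alerting_rules()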

Resolution proposal (optional):

  • Some PrometheusRule CRs are left untouched after an upgrade.
    There is no strict requirement to run the Prometheus alert rule comparison test during upgrades/downgrades, mainly because alert rule changes are already caught by the bootstrap/install tests.
    So I suggest we add a pytest filter so that this particular scenario can be skipped during upgrade/downgrade (a sketch of such a filter follows the scenario below):
    @alertrules
    Scenario: Ensure deployed Prometheus rules match the default
        Given the Kubernetes API is available
        And we have 1 running pod labeled 'prometheus=prometheus-operator-prometheus' in namespace 'metalk8s-monitoring'
        Then the deployed Prometheus alert rules are the same as the default alert rules
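
A minimal sketch of such a filter, assuming the suite uses pytest-bdd (which exposes Gherkin tags like @alertrules as pytest markers); the --skip-alert-rules option name is hypothetical:

# conftest.py -- sketch only, assuming pytest-bdd maps @alertrules to a marker.
import pytest

def pytest_addoption(parser):
    # Hypothetical option that the upgrade/downgrade CI jobs would pass.
    parser.addoption(
        "--skip-alert-rules",
        action="store_true",
        default=False,
        help="Skip Prometheus alert rule comparison scenarios.",
    )

def pytest_collection_modifyitems(config, items):
    if not config.getoption("--skip-alert-rules"):
        return
    skip = pytest.mark.skip(
        reason="alert rule comparison is skipped during upgrade/downgrade"
    )
    for item in items:
        # Scenarios tagged @alertrules carry the "alertrules" marker.
        if "alertrules" in item.keywords:
            item.add_marker(skip)

The upgrade/downgrade CI jobs would then invoke pytest with --skip-alert-rules, while the bootstrap/install jobs keep the scenario enabled.
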
Ebaneck added the kind:bug, priority:urgent and release:blocker labels Jun 2, 2020
Ebaneck changed the title from "Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules test" to "Upgrade/downgrade in 2.5 is broken due to diff in Prometheus alert rules" Jun 2, 2020
Ebaneck added this to To do in Week 23/24/2020 via automation Jun 2, 2020
Ebaneck self-assigned this Jun 3, 2020
Ebaneck moved this from To do to In progress in Week 23/24/2020 Jun 3, 2020
Ebaneck added a commit that referenced this issue Jun 4, 2020
…Rule CRs

During upgrade/downgrade, PrometheusRule CRs from the previous
installation may be left behind.

This commit adds a salt-state to perform cleanup post upgrade/downgrade.

This also ensures that our CI tests for deployed Prometheus rules always
pass.

Closes: #2585
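
The fix itself is a Salt state; purely to illustrate the cleanup idea, here is a sketch in Python using the official kubernetes client (the namespace and the `keep` set of expected rule names are assumptions, not the actual implementation):

# Sketch of the cleanup idea: delete PrometheusRule CRs that are no
# longer part of the freshly deployed chart release. Not the actual
# Salt state shipped by the commit above.
from kubernetes import client, config

GROUP, VERSION, PLURAL = "monitoring.coreos.com", "v1", "prometheusrules"
NAMESPACE = "metalk8s-monitoring"  # assumption

def cleanup_stale_prometheus_rules(keep):
    """Delete PrometheusRule objects whose names are not in `keep`."""
    config.load_kube_config()  # or config.load_incluster_config()
    api = client.CustomObjectsApi()
    existing = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
    for item in existing.get("items", []):
        name = item["metadata"]["name"]
        if name not in keep:
            api.delete_namespaced_custom_object(
                GROUP, VERSION, NAMESPACE, PLURAL, name,
                body=client.V1DeleteOptions(),
            )
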
Ebaneck moved this from In progress to Review in progress in Week 23/24/2020 Jun 4, 2020
Ebaneck moved this from Review in progress to Reviewer approved in Week 23/24/2020 Jun 5, 2020
bert-e closed this as completed in d5271b9 Jun 5, 2020
Week 23/24/2020 automation moved this from Reviewer approved to Done Jun 5, 2020