charts,salt,build: Bump kube-prometheus-stack version to 16.9.1 #3422

Merged: 1 commit into development/2.10 from improvement/bump-kube-prom-stack on Jun 28, 2021

Conversation

TeddyAndrieux (Collaborator)

Bump kube-prometheus-stack charts version to 16.9.1
The following images have also been bumped accordingly:

  • grafana to 8.0.1
  • k8s-sidecar to 1.12.2
  • kube-state-metrics to v2.0.0
  • node-exporter to v1.1.2
  • prometheus to v2.27.1
  • prometheus-config-reloader to v0.48.1
  • prometheus-operator to v0.48.1
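
Once the bump is deployed, a quick sanity check of the images actually in use, assuming the stack runs in the metalk8s-monitoring namespace referenced later in this PR, could look like:

```
# List each monitoring pod together with the container images it runs,
# to confirm the bumped tags (grafana 8.0.1, prometheus v2.27.1, ...) are in use.
kubectl -n metalk8s-monitoring get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```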

@TeddyAndrieux TeddyAndrieux requested a review from a team June 18, 2021 14:12
@bert-e (Contributor)

bert-e commented Jun 18, 2021

Hello teddyandrieux,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e (Contributor)

bert-e commented Jun 18, 2021

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from 8fc73b9 to 4c8a063 on June 18, 2021 14:12
@alexandre-allard alexandre-allard self-assigned this Jun 24, 2021
@alexandre-allard alexandre-allard (Contributor) left a comment


LGTM, you just need to update the customizable Prometheus rules as follows:

```diff
diff --git a/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml b/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
index 7dcd6dca5..f356c4ede 100644
--- a/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
+++ b/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
@@ -39,10 +39,10 @@ spec:
           available: 3
       node_network_receive_errors:
         warning:
-          errors: 10  # Number of receive errors for the last 2m
+          errors: 0.01  # Rate of receive errors for the last 2m
       node_network_transmit_errors:
         warning:
-          errors: 10  # Number of transmit errors for the last 2m
+          errors: 0.01  # Rate of transmit errors for the last 2m
       node_high_number_conntrack_entries_used:
         warning:
           threshold: 0.75
diff --git a/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls b/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
index 6549e8336..a5576dd77 100644
--- a/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
+++ b/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
@@ -168,7 +168,8 @@ spec:
           {{ printf "%.0f" $value }} receive errors in the last two minutes.'
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworkreceiveerrs
         summary: Network interface is reporting many receive errors.
-      expr: increase(node_network_receive_errs_total[2m]) > {% endraw %}{{ rules.node_exporter.node_network_receive_errors.warning.errors }}{% raw %}
+      expr: increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m])
+        > {% endraw %}{{ rules.node_exporter.node_network_receive_errors.warning.errors }}{% raw %}
       for: 1h
       labels:
         severity: warning
@@ -178,7 +179,8 @@ spec:
           {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworktransmiterrs
         summary: Network interface is reporting many transmit errors.
-      expr: increase(node_network_transmit_errs_total[2m]) > {% endraw %}{{ rules.node_exporter.node_network_transmit_errors.warning.errors }}{% raw %}
+      expr: increase(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m])
+        > {% endraw %}{{ rules.node_exporter.node_network_transmit_errors.warning.errors }}{% raw %}
       for: 1h
       labels:
         severity: warning
@@ -217,7 +219,10 @@ spec:
           is configured on this host.
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclocknotsynchronising
         summary: Clock not synchronising.
-      expr: min_over_time(node_timex_sync_status[5m]) == {% endraw %}{{ rules.node_exporter.node_clock_not_synchronising.warning.threshold }}{% raw %}
+      expr: |-
+        min_over_time(node_timex_sync_status[5m]) == {% endraw %}{{ rules.node_exporter.node_clock_not_synchronising.warning.threshold }}{% raw %}
+        and
+        node_timex_maxerror_seconds >= 16
       for: 10m
       labels:
         severity: warning
@@ -247,7 +252,7 @@ spec:
           Array '{{ $labels.device }}' needs attention and possibly a disk swap.
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddiskfailure
         summary: Failed device in RAID array
-      expr: node_md_disks{state="fail"} >= {% endraw %}{{ rules.node_exporter.node_raid_disk_failure.warning.threshold }}{% raw %}
+      expr: node_md_disks{state="failed"} >= {% endraw %}{{ rules.node_exporter.node_raid_disk_failure.warning.threshold }}{% raw %}
       labels:
         severity: warning
 {%- endraw %}
```

I'm not 100% sure we want to change NodeNetworkTransmitErrs and NodeNetworkReceiveErrs, as it breaks compatibility with what we had before.
If a user has customized node_network_transmit_errors.warning.errors or node_network_receive_errors.warning.errors, it will not work as expected.
The thing is, we can't really convert the old value automatically, so either we keep it like that or we make a "breaking" change (anyway, I'm not sure anyone is using this).
Otherwise we could also rename errors to something else like error_rate, so at least the old customized value is not taken into account and we fall back on the default behavior.
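
For reference, a minimal sketch of what the new threshold measures, assuming promtool is available and `<prometheus-url>` is a placeholder for the Prometheus API endpoint: the updated NodeNetworkReceiveErrs expression compares receive errors against the packet rate on each interface, so the 0.01 default is no longer an absolute error count.

```
# Evaluate the same ratio the updated NodeNetworkReceiveErrs alert uses,
# to see which interfaces would currently cross the 0.01 threshold.
promtool query instant <prometheus-url> \
  'increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m])'
```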

@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from 4c8a063 to d8dad8b on June 25, 2021 15:25
Update kube-prometheus-stack chart to 16.9.1

```
rm -rf charts/kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm fetch -d charts --untar prometheus-community/kube-prometheus-stack
```
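
As a sanity check, assuming the chart was untarred into charts/kube-prometheus-stack/, the fetched chart version can be confirmed with:

```
# Should report name: kube-prometheus-stack and version: 16.9.1
helm show chart charts/kube-prometheus-stack | grep -E '^(name|version):'
```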

Re-render the chart to a Salt state using:

```
./charts/render.py prometheus-operator \
  charts/kube-prometheus-stack.yaml \
  charts/kube-prometheus-stack/ \
  --namespace metalk8s-monitoring \
  --service-config grafana \
  metalk8s-grafana-config \
  metalk8s/addons/prometheus-operator/config/grafana.yaml \
  metalk8s-monitoring \
  --service-config prometheus \
  metalk8s-prometheus-config \
  metalk8s/addons/prometheus-operator/config/prometheus.yaml \
  metalk8s-monitoring \
  --service-config alertmanager \
  metalk8s-alertmanager-config \
  metalk8s/addons/prometheus-operator/config/alertmanager.yaml \
  metalk8s-monitoring \
  --service-config dex \
  metalk8s-dex-config \
  metalk8s/addons/dex/config/dex.yaml.j2 metalk8s-auth \
  --drop-prometheus-rules charts/drop-prometheus-rules.yaml \
  > salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
```
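
To review what the re-render changed, assuming a git checkout, the regenerated state can be diffed:

```
# Summarize the changes introduced in the regenerated Salt state
git diff --stat salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
```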

Update vendored rules

Update the alert rules extract using:

```
./tools/rule_extractor/rule_extractor.py \
  -i <control-plane-ip> -p 8443 -t rules
```
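
A quick, hypothetical check of what the extraction run touched, assuming its output lands under tools/rule_extractor/:

```
# List files modified by the rule extraction run
git status --short tools/rule_extractor/
```
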
@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from d8dad8b to 83809dd on June 25, 2021 16:35
@TeddyAndrieux (Collaborator, Author)

/approve

@bert-e (Contributor)

bert-e commented Jun 28, 2021

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged into the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.10

The following branches will NOT be impacted:

  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e (Contributor)

bert-e commented Jun 28, 2021

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/2.10

The following branches have NOT changed:

  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue None.

Goodbye teddyandrieux.

@bert-e bert-e merged commit 210feb8 into development/2.10 Jun 28, 2021
@bert-e bert-e deleted the improvement/bump-kube-prom-stack branch June 28, 2021 09:22