charts,salt,build: Bump kube-prometheus-stack version to 16.9.1 #3422

Merged: 1 commit into development/2.10 from improvement/bump-kube-prom-stack on Jun 28, 2021

Conversation

TeddyAndrieux (Collaborator)

Bump kube-prometheus-stack charts version to 16.9.1
The following images have also been bumped accordingly:

  • grafana to 8.0.1
  • k8s-sidecar to 1.12.2
  • kube-state-metrics to v2.0.0
  • node-exporter to v1.1.2
  • prometheus to v2.27.1
  • prometheus-config-reloader to v0.48.1
  • prometheus-operator to v0.48.1
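
Once the bump is deployed, a quick sanity check of the images actually in use, assuming the stack runs in the metalk8s-monitoring namespace referenced later in this PR, could look like:

```
# List each monitoring pod together with the container images it runs,
# to confirm the bumped tags (grafana 8.0.1, prometheus v2.27.1, ...) are in use.
kubectl -n metalk8s-monitoring get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```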

@TeddyAndrieux TeddyAndrieux requested a review from a team June 18, 2021 14:12
@bert-e (Contributor)

bert-e commented Jun 18, 2021

Hello teddyandrieux,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e (Contributor)

bert-e commented Jun 18, 2021

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from 8fc73b9 to 4c8a063 on June 18, 2021 14:12
@alexandre-allard alexandre-allard self-assigned this Jun 24, 2021
@alexandre-allard alexandre-allard (Contributor) left a comment


LGTM, you just need to update the customizable Prometheus rules as follows:

```diff
diff --git a/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml b/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
index 7dcd6dca5..f356c4ede 100644
--- a/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
+++ b/salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
@@ -39,10 +39,10 @@ spec:
           available: 3
       node_network_receive_errors:
         warning:
-          errors: 10  # Number of receive errors for the last 2m
+          errors: 0.01  # Rate of receive errors for the last 2m
       node_network_transmit_errors:
         warning:
-          errors: 10  # Number of transmit errors for the last 2m
+          errors: 0.01  # Rate of transmit errors for the last 2m
       node_high_number_conntrack_entries_used:
         warning:
           threshold: 0.75
diff --git a/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls b/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
index 6549e8336..a5576dd77 100644
--- a/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
+++ b/salt/metalk8s/addons/prometheus-operator/deployed/prometheus-rules.sls
@@ -168,7 +168,8 @@ spec:
           {{ printf "%.0f" $value }} receive errors in the last two minutes.'
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworkreceiveerrs
         summary: Network interface is reporting many receive errors.
-      expr: increase(node_network_receive_errs_total[2m]) > {% endraw %}{{ rules.node_exporter.node_network_receive_errors.warning.errors }}{% raw %}
+      expr: increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m])
+        > {% endraw %}{{ rules.node_exporter.node_network_receive_errors.warning.errors }}{% raw %}
       for: 1h
       labels:
         severity: warning
@@ -178,7 +179,8 @@ spec:
           {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworktransmiterrs
         summary: Network interface is reporting many transmit errors.
-      expr: increase(node_network_transmit_errs_total[2m]) > {% endraw %}{{ rules.node_exporter.node_network_transmit_errors.warning.errors }}{% raw %}
+      expr: increase(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m])
+        > {% endraw %}{{ rules.node_exporter.node_network_transmit_errors.warning.errors }}{% raw %}
       for: 1h
       labels:
         severity: warning
@@ -217,7 +219,10 @@ spec:
           is configured on this host.
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclocknotsynchronising
         summary: Clock not synchronising.
-      expr: min_over_time(node_timex_sync_status[5m]) == {% endraw %}{{ rules.node_exporter.node_clock_not_synchronising.warning.threshold }}{% raw %}
+      expr: |-
+        min_over_time(node_timex_sync_status[5m]) == {% endraw %}{{ rules.node_exporter.node_clock_not_synchronising.warning.threshold }}{% raw %}
+        and
+        node_timex_maxerror_seconds >= 16
       for: 10m
       labels:
         severity: warning
@@ -247,7 +252,7 @@ spec:
           Array '{{ $labels.device }}' needs attention and possibly a disk swap.
         runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddiskfailure
         summary: Failed device in RAID array
-      expr: node_md_disks{state="fail"} >= {% endraw %}{{ rules.node_exporter.node_raid_disk_failure.warning.threshold }}{% raw %}
+      expr: node_md_disks{state="failed"} >= {% endraw %}{{ rules.node_exporter.node_raid_disk_failure.warning.threshold }}{% raw %}
       labels:
         severity: warning
 {%- endraw %}
```

I'm not 100% sure we want to change NodeNetworkTransmitErrs and NodeNetworkReceiveErrs, as it breaks compatibility with what we had before.
If a user has customized node_network_transmit_errors.warning.errors or node_network_receive_errors.warning.errors, it will not work as expected.
The thing is, we can't really convert the old value automatically, so either we keep it like that or we make a "breaking" change (anyway, I'm not sure anyone is using this).
Otherwise we could also rename errors to something else like error_rate, so at least the old customized value is not taken into account and we fall back on the default behavior.
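
For reference, a minimal sketch of what the new threshold measures, assuming promtool is available and `<prometheus-url>` is a placeholder for the Prometheus API endpoint: the updated NodeNetworkReceiveErrs expression compares receive errors against the packet rate on each interface, so the 0.01 default is no longer an absolute error count.

```
# Evaluate the same ratio the updated NodeNetworkReceiveErrs alert uses,
# to see which interfaces would currently cross the 0.01 threshold.
promtool query instant <prometheus-url> \
  'increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m])'
```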

@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from 4c8a063 to d8dad8b on June 25, 2021 15:25
Update kube-prometheus-stack chart to 16.9.1

```
rm -rf charts/kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm fetch -d charts --untar prometheus-community/kube-prometheus-stack
```
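
As a sanity check, assuming the chart was untarred into charts/kube-prometheus-stack/, the fetched chart version can be confirmed with:

```
# Should report name: kube-prometheus-stack and version: 16.9.1
helm show chart charts/kube-prometheus-stack | grep -E '^(name|version):'
```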

Re-render the chart to a Salt state using:

```
./charts/render.py prometheus-operator \
  charts/kube-prometheus-stack.yaml \
  charts/kube-prometheus-stack/ \
  --namespace metalk8s-monitoring \
  --service-config grafana \
  metalk8s-grafana-config \
  metalk8s/addons/prometheus-operator/config/grafana.yaml \
  metalk8s-monitoring \
  --service-config prometheus \
  metalk8s-prometheus-config \
  metalk8s/addons/prometheus-operator/config/prometheus.yaml \
  metalk8s-monitoring \
  --service-config alertmanager \
  metalk8s-alertmanager-config \
  metalk8s/addons/prometheus-operator/config/alertmanager.yaml \
  metalk8s-monitoring \
  --service-config dex \
  metalk8s-dex-config \
  metalk8s/addons/dex/config/dex.yaml.j2 metalk8s-auth \
  --drop-prometheus-rules charts/drop-prometheus-rules.yaml \
  > salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
```
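
To review what the re-render changed, assuming a git checkout, the regenerated state can be diffed:

```
# Summarize the changes introduced in the regenerated Salt state
git diff --stat salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
```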

Update vendored rules

Update the alert rules extract using:

```
./tools/rule_extractor/rule_extractor.py \
  -i <control-plane-ip> -p 8443 -t rules
```
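
A quick, hypothetical check of what the extraction run touched, assuming its output lands under tools/rule_extractor/:

```
# List files modified by the rule extraction run
git status --short tools/rule_extractor/
```
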
@TeddyAndrieux TeddyAndrieux force-pushed the improvement/bump-kube-prom-stack branch from d8dad8b to 83809dd on June 25, 2021 16:35
@TeddyAndrieux (Collaborator, Author)

/approve

@bert-e (Contributor)

bert-e commented Jun 28, 2021

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged into the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.10

The following branches will NOT be impacted:

  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e (Contributor)

bert-e commented Jun 28, 2021

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/2.10

The following branches have NOT changed:

  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue None.

Goodbye teddyandrieux.

@bert-e bert-e merged commit 210feb8 into development/2.10 Jun 28, 2021
@bert-e bert-e deleted the improvement/bump-kube-prom-stack branch June 28, 2021 09:22