MetalK8s is automatically deploying Prometheus, Alertmanager and a set of predefined alert rules. In order to leverage Prometheus and Alertmanager functionalities, we need to explain, in the documentation, how to use it. In a later stage, those functionalities will be exposed through various administration and alerting UIs, but for now, we want to provide our administrator with enough information in order to use very basic alerting functionalities.
As a MetalK8s administrator, I want to list or know the list of alert rules that are deployed on MetalK8s Prometheus cluster, In order to identify on what specific rule I want to be alerted.
As a MetalK8s administrator, I want to set notification routing and receiver for a specific alert, In order to get notified when such alert is fired The important routing to support are email, slack and pagerduty.
As a MetalK8s administrator, I want to update thresholds for a specific alert rule, In order to adapt the alert rule to the specificities and performances of my platform.
As a MetalK8s administrator, I want to add a new alert rule, In order to monitor a specific KPI which is not monitored out of the box by MetalK8s.
As a MetalK8s administrator, I want to inhibit an alert rule, In order to skip alerts in which I am not interested.
As a MetalK8s administrator, I want to silence an alert rule for a certain amount of time, In order to skip alert notifications during a planned maintenance operation.
Warning
In all cases, when MetalK8s administrator is upgrading the cluster, all listed customizations should remain.
Note
Alertmanager configuration documentation is available here
To be able to edit existing rules, add new ones, etc., and in order to keep these changes across restorations, upgrades and downgrades, we need to put in place some mechanisms to configure Prometheus and Alertmanager and persist these configurations.
For the persistence part, we will rely on what has been done for CSC<configurations>
(Cluster and Services Configurations), and use the already defined resources for Alertmanager
and Prometheus
.
We will use the already existing metalk8s-prometheus-config
ConfigMap
to store the Prometheus
configuration customizations.
Adding extra alert and record rules will be done editing this ConfigMap
under the spec.extraRules
key in config.yaml
as follows:
apiVersion: v1
kind: ConfigMap
metadata:
name: metalk8s-prometheus-config
namespace: metalk8s-monitoring
data:
config.yaml: |-
apiVersion: addons.metalk8s.scality.com
kind: PrometheusConfig
spec:
deployment:
replicas: 1
extraRules:
groups:
- name: <rulesGroupName>
rules:
- alert: <AlertName>
annotations:
description: description of what this alert is
expr: vector(1)
for: 10m
labels:
severity: critical
- alert: <AnotherAlertName>
[...]
- record: <recordName>
[...]
- name: <anotherRulesGroupName>
[...]
PromQL is to be used to define expr
field.
This spec.extraRules
entry will be used to generate through Salt a PrometheusRule
object named metalk8s-prometheus-extra-rules
in the metalk8s-monitoring
namespace, which will be automatically consumed by the Prometheus
Operator to generate the new rules.
A CLI and UI tooling will be provided to show and edit this configuration.
To edit existing Prometheus
rules, we can't only define new PrometheusRules
resources since Prometheus
Operator will not overwrite those already existing, but will rather append them to the list of rules, ending up with 2 rules with the same name but different parameters.
We also can't edit the PrometheusRules
deployed by MetalK8s, otherwise we would lose these changes in case of cluster restoration, upgrade or downgrade.
So, in order to allow the user to customize the alert rules, we will pick up some of them (the most relevant ones) and expose only few parts of their configurations (e.g. threshold) to be customized.
It also makes the customization of these alert rules easier for the user as, for example, he will not need to understand PromQL to adapt the threshold of an alert rule.
Since in Prometheus
rules, there are duplicated group name + alert rule name, we also need to take the severity into account to understand which specific alert we're editing.
These customization will be stored in the metalk8s-prometheus-config
ConfigMap
with something like:
apiVersion: v1
kind: ConfigMap
metadata:
name: metalk8s-prometheus-config
namespace: metalk8s-monitoring
data:
config.yaml: |-
apiVersion: addons.metalk8s.scality.com
kind: PrometheusConfig
spec:
deployment:
replicas: 1
rules:
<alertGroupName>:
<alertName>:
warning:
threshold: 30
critical:
threshold: 10
<anotherAlertGroupName>:
<anotherAlertName>:
critical:
threshold: 20
anotherThreshold: 10
The PrometheusRules
object manifests salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
need to be templatized to consume these customizations through CSC module.
Default values for customizable alert rules to fallback on, if not defined in the ConfigMap
, will be set in salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml
.
We will use the already existing metalk8s-alertmanager-config
ConfigMap
to store the term:Alertmanager configuration customizations.
A Salt module will be developed to manipulate this object, so the logic can be kept in only one place.
This module must provide necessary methods to show or edit the configuration in 2 different ways:
- simple
- advanced
The simple
mode will only display and allow to change some specific configuration, such as the receivers or the inhibit rules, and in an as simple as possible manner for the user.
The advanced
mode will allow to change all the configuration points, exposing the whole configuration as a plain YAML.
This module will then be exposed through a CLI and a UI.
To retrieve the list of alert rules, we must use the Prometheus API. This can be achieved using the following route:
http://<prometheus-ip>:9090/api/v1/rules
This API call should be done in a Salt module metalk8s_monitoring
which could then be wrapped in a CLI and UI.
To silence an alert, we need to send a query to the Alertmanager API. This can be done using the following route:
http://<alertmanager-ip>:9093/api/v1/silences
With a POST query content formatted as below:
{
"matchers": [
{
"name": "alert-name",
"value": "<alert-name>"
}
],
"startsAt": "2020-04-10T12:12:12",
"endsAt": "2020-04-10T13:12:12",
"createdBy": "<author>",
"comment": "Maintenance is planned",
"status": {
"state": "active"
}
}
We must also be able to retrieve silenced alerts and to remove a silence. This will be done using the API, with the same route using GET and DELETE word respectively:
# GET - to list all silences
http://<alertmanager-ip>:9093/api/v1/silences
# DELETE - to delete a specific silence
http://<alertmanager-ip>:9093/api/v1/silence/<silence-id>
We will need to provide these functionnalities through a Salt module metalk8s_monitoring
which could then be wrapped in a CLI and UI.
We need to build a tool to extract all alert rules from the Prometheus
Operator rendered chart salt/metalk8s/addons/prometheus-operator/deployed/chart.sls
.
Its purpose will be to generate a file (each time this chart is updated) which will then be used to check that what's deployed matches what was expected.
And so, we will be able to see what has been changed when updating Prometheus
Operator chart and see if there is any change on customizable alert rules.
Using amtool vs Alertmanager API
Managing alert silences can be done using amtool:
# Add
amtool --alertmanager.url=http://localhost:9093 silence add \
alertname="<alert-name>" --comment 'Maintenance is planned'
# List
amtool --alertmanager.url=http://localhost:9093 silence query
# Delete
amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>
This option has been rejected because, to do so, we need to install an extra dependency (amtool binary) or run the commands inside the Alertmanager
container, rather than simply send HTTP queries on the API.
- Add an internal tool to list all Prometheus alert rules from rendered chart
- Implement Salt formulas to handle configuration customization (
advanced
mode only) - Provide CLI and UI to wrap the Salt calls
- Customization of node-exporter alert group thresholds
Document how to:
- Retrieve the list of alert rules
- Add a new alert rule
- Edit an existing alert rule
- Configure notifications (email, slack and pagerduty)
- Silence an alert
- Deactivate an alert
- Implement the
simple
mode in Salt formulas - Add the
simple
mode to both CLI and UI - Update the documentation with the
simple
mode
In the Operational Guide:
- Document how to manage silence on alerts (list, create & delete)
- Document how to manage alert rules (list, create, edit)
- Document how to configure alertmanager notifications
- Document how to deactivate an alert
- Add a list of alert rules configured in
Prometheus
, with a brief explanation for each and what can be customized
Add a new test scenario using pytest-bdd framework to ensure the correct behavior of this feature. These tests must be put in the post-merge step in the CI and must include:
- Configuration of a receiver in
Alertmanager
- Configuration of inhibit rules in
Alertmanager
- Add a new alert rule in
Prometheus
- Customize an existing alert rule in
Prometheus
- Alert silences management (add, list and delete)
- Deployed Prometheus alert rules must match what's expected from a given list (generated by a tool Extract Rules Tooling)