MetalK8s Predefined Alert Rules and Alert Grouping
Context
As part of MetalK8s, we would like to provide the Administrator with built-in
alert rule expressions that can be used to fire alerts and send notifications
when one of the High Level entities of the system is degraded or impacted by
the degradation of a Low Level component.
As an example, we would like to notify the administrator when the MetalK8s log
service is degraded because of some specific observed symptoms:

- not all log service replicas are scheduled,
- one of the persistent volumes claimed by a log service replica is getting
  full,
- the log DB ingestion rate is near zero.
In this specific example, the goal is to invite the administrator to perform
manual tasks to avoid having a Log Service interruption in the near future.
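
As an illustration, here is a minimal sketch of an atomic alert covering the
"volume getting full" symptom, expressed as a Prometheus rule. The metric
names come from the standard kubelet exposition; the alert name, the PVC
selector and the threshold are illustrative assumptions, not a final design::

   groups:
     - name: logging.rules
       rules:
         - alert: LokiVolumeAlmostFull   # hypothetical alert name
           expr: |
             kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"storage-loki-.*"}
               / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"storage-loki-.*"}
               < 0.10
           for: 15m
           labels:
             severity: warning
           annotations:
             summary: "Log service volume has less than 10% of free space"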
Vocabulary
- Atomic Alert: an alert which is based on existing metrics in Prometheus and
  which is linked to a specific symptom.
- High Level Alert: an alert which is based on other atomic alerts or other
  High Level Alerts.
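
Prometheus exposes its pending and firing alerts through the built-in ALERTS
metric, so a High Level Alert can itself be written as a rule over other
alerts. A minimal sketch, assuming hypothetical atomic alert names such as
LokiVolumeAlmostFull::

   - alert: LogServiceDegraded   # hypothetical High Level Alert
     expr: |
       sum(ALERTS{alertstate="firing",
                  alertname=~"LokiVolumeAlmostFull|LokiReplicasUnscheduled"}) > 0
     labels:
       severity: warning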
Requirements
When receiving such High Level alerts, we would like the system to guide the
administrator to find and understand the root cause of the alert as well as
the path to resolve it. Accessing the list of observed low level symptoms will
help the administrator's investigation.
Having the High Level Alerts also gives the administrator a better
understanding of which part, layer or component of the system is currently
impacted (without having to build a mental model to guess the impact of any
existing atomic alert in the system).
A number of atomic alerts are already deployed, but we do not yet have the
High Level Alerts that we could use to build the MetalK8s dashboard described
above. Being able to define the impact of one atomic alert is a way to build
those High Level Alerts.
It is impossible to model all possible causes through this kind of impact
tree. However, when an alert is received, the system shall suggest other
alerts that may be linked to it, for example through matching labels.
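
Label matching is also how Alertmanager groups related alerts into a single
notification today. A minimal routing sketch, assuming node and volume labels
are set consistently on the alerts::

   route:
     group_by: ['severity', 'node', 'volume']   # assumed label names
     group_wait: 30s
     group_interval: 5m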
Also, when accessing the alert tab of the Node or the Volume page, the
administrator should be able to visualise all the fired alerts that are
attached to the Node or Volume entities.
In the end, the way to visualise the impact of an atomic alert in the alert
page is illustrated in the screenshot below:
The High Level alerts should be easily identifiable in order to filter them
out in the UI views. Indeed, in a first iteration we might want to display
only the atomic alerts until all High Level alerts are implemented and
deployed.
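
One possible convention (an assumption, not a settled design) would be to tag
every High Level alert with a dedicated label, so that UI views can filter
with a selector such as ALERTS{alert_level!="high"}::

   - alert: LogServiceDegraded          # hypothetical High Level Alert
     expr: sum(ALERTS{alertstate="firing", alertname="LokiVolumeAlmostFull"}) > 0
     labels:
       severity: warning
       alert_level: high                # atomic alerts would carry "atomic"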
Severity Classification
Critical Alert = Red = Service Offline or At Risk, requires immediate
intervention
The name of the bootstrap node depends on how MetalK8s is deployed. We would
need to automatically configure this alert during deployment. We may want to
use a more deterministic filter to find the repository and salt-master pods.
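
For instance, instead of hard-coding the bootstrap node name, the alert could
filter on the pod names exposed by kube-state-metrics. The namespace, the pod
name pattern and the threshold below are assumptions::

   - alert: SaltMasterUnavailable       # hypothetical alert name
     expr: |
       sum(kube_pod_status_ready{namespace="kube-system",
                                 condition="true",
                                 pod=~"salt-master-.*"}) < 1
     for: 10m
     labels:
       severity: critical
   # note: a production rule would also need absent() handling, since the
   # sum is empty (and the alert silent) when no pod matches at all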
The name of the interface used by the Workload Plane and/or the Control Plane
is not known in advance. As such, we should find a way to automatically
configure the Network alerts based on the Network configuration.
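
For example, the alert expression could be rendered at deployment time from
the actual interface name, here with a Jinja-style placeholder (the
templating mechanism and the alert name are assumptions)::

   - alert: ControlPlaneInterfaceDown   # hypothetical alert name
     expr: node_network_up{device="{{ control_plane_interface }}"} == 0
     for: 2m
     labels:
       severity: critical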
Note
Currently we don't have any alerts for the Virtual Plane, which is provided
by kube-proxy, calico-kube-controllers and calico-node. It is not even part
of the MetalK8s Dashboard page; we may want to introduce it there.