
MetalK8s predefined Alert rules and Alert Grouping

Context

As part of MetalK8s, we would like to provide the Administrator with built-in rule expressions that can be used to fire alerts and send notifications when one of the High Level entities of the system is degraded or impacted by the degradation of a Low Level component.

As an example, we would like to notify the administrator when the MetalK8s log service is degraded because of some specific observed symptoms:

  • not all log service replicas are scheduled
  • one of the persistent volumes claimed by a log service replica is getting full
  • the log DB ingestion rate is near zero

In this specific example, the goal is to invite the administrator to perform manual tasks to avoid having a Log Service interruption in the near future.

Vocabulary

Atomic Alert: An Alert which is based on existing metrics in Prometheus and which is linked to a specific symptom.

High Level Alert: An Alert which is based on other atomic alerts or High Level alerts.

Requirements

When receiving such High Level alerts, we would like the system to guide the administrator to find and understand the root cause of the alert as well as the path to resolve it. Accessing the list of observed low level symptoms will help the administrator's investigation.

Having the High Level Alerts also gives the administrator a better understanding of which part, layer, or component of the system is currently impacted, without having to build a mental model to guess the impact of each existing atomic alert in the system.

Figure: MetalK8s overview dashboard (img/metalk8s-overview.jpg)

A number of atomic alerts are already deployed, but we don't yet have the High Level Alerts that we could use to build the MetalK8s dashboard above. Being able to define the impact of one atomic alert is a way to build those High Level Alerts:

It is impossible to model all possible causes through this kind of impact tree. However, when an alert is received, the system shall suggest other alerts that may be linked to it (maybe using matching labels).
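
As a sketch of how such suggestions could work, the snippet below queries the Alertmanager v2 API and returns the alerts that share at least one identifying label with a received alert. The Alertmanager URL and the set of labels used for matching are assumptions for illustration, not an existing MetalK8s convention:

    import requests

    ALERTMANAGER_URL = "http://localhost:9093"           # assumed endpoint
    MATCHING_LABELS = ("node", "instance", "namespace")  # assumed matching keys


    def related_alerts(received, all_alerts):
        """Return alerts sharing at least one identifying label with `received`."""
        shared = {
            (key, received["labels"][key])
            for key in MATCHING_LABELS
            if key in received["labels"]
        }
        return [
            alert
            for alert in all_alerts
            if alert["labels"] != received["labels"]
            and shared & set(alert["labels"].items())
        ]


    alerts = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts").json()
    if alerts:
        print([a["labels"]["alertname"] for a in related_alerts(alerts[0], alerts)])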

Also, when accessing the Node or Volume page / alert tab, the administrator should be able to visualise all the fired alerts described under the Nodes or Volumes entities.

In the end, the way to visualise the impact of an atomic alert in the alert page is shown in the screenshot below:

Figure: impact of an atomic alert as shown in the alert page (img/alertes.jpg)

The High Level alerts should be easily identifiable in order to filter them out in the UI views. Indeed, in a first iteration we might want to display only the atomic alerts until all High Level alerts are implemented and deployed.
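
A minimal sketch of such filtering, assuming the High Level alert rules carry an identifying label (the `alert-level` label name and its `high` value are hypothetical, not an existing convention):

    def atomic_alerts(alerts):
        """Keep only atomic alerts; drop anything flagged as a High Level alert."""
        return [
            alert
            for alert in alerts
            if alert["labels"].get("alert-level") != "high"  # hypothetical label
        ]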

Severity Classification

  • Critical Alert = Red = Service Offline or At Risk, requires immediate intervention
  • Warning Alert = Yellow = Service Degraded, requires planned (within 1 week) intervention
  • No Active Alert = Green = Service Healthy

Notifications are either an email, a Slack message, or any other routing supported by Alertmanager, or a decorated icon in the UI.
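
This classification maps directly to a "worst active severity wins" rule for each entity; a minimal sketch (the function name is illustrative):

    def entity_health(active_severities):
        """Derive the health color of an entity from its active alert severities."""
        if "critical" in active_severities:
            return "Red"     # Service Offline or At Risk: immediate intervention
        if "warning" in active_severities:
            return "Yellow"  # Service Degraded: planned intervention within a week
        return "Green"       # No active alert: Service Healthy


    print(entity_health([]))                       # Green
    print(entity_health(["warning"]))              # Yellow
    print(entity_health(["warning", "critical"]))  # Red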

Data Model

We consider that Nodes and Volumes don't impact the Platform directly. As such, they do not belong to the Platform.

  • Volumes
  • Nodes
  • Platform
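
As a sketch of how the tables below could translate into rules, a High Level alert can be expressed as a query over Prometheus' built-in ALERTS metric, firing when at least one of its sub-alerts fires with the expected severity (and, where applicable, matching labels). The exact expression and the lowercase severity values are assumptions, not the final implementation:

    def grouping_expr(sub_alerts, severity):
        """Build a PromQL expression that fires when any listed sub-alert fires."""
        names = "|".join(sub_alerts)
        return (
            f'sum(ALERTS{{alertname=~"{names}", '
            f'severity="{severity}", alertstate="firing"}}) >= 1'
        )


    # PlatformDegraded, as specified in the Platform table below:
    print(grouping_expr(
        [
            "PlatformServicesDegraded",
            "ControlPlaneNetworkDegraded",
            "WorkloadPlaneNetworkDegraded",
        ],
        severity="warning",
    ))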

Platform

PlatformAtRisk
  Severity: Critical
  Summary: The Platform is at risk
  Parent: none
  Sub alerts:
    • PlatformServicesAtRisk (Critical)

PlatformDegraded
  Severity: Warning
  Summary: The Platform is degraded
  Parent: none
  Sub alerts:
    • PlatformServicesDegraded (Warning)
    • ControlPlaneNetworkDegraded (Warning)
    • WorkloadPlaneNetworkDegraded (Warning)

Nodes

NodeAtRisk
  Severity: Critical
  Summary: Node <nodename> is at risk
  Parent: none
  Sub alerts:
    • KubeletClientCertificateExpiration (Critical)
    • NodeRAIDDegraded (Critical)
    • SystemPartitionAtRisk (Critical)

NodeDegraded
  Severity: Warning
  Summary: Node <nodename> is degraded
  Parent: none
  Sub alerts:
    • KubeNodeNotReady (Warning)
    • KubeNodeReadinessFlapping (Warning)
    • KubeNodeUnreachable (Warning)
    • KubeletClientCertificateExpiration (Warning)
    • KubeletClientCertificateRenewalErrors (Warning)
    • KubeletPlegDurationHigh (Warning)
    • KubeletPodStartUpLatencyHigh (Warning)
    • KubeletServerCertificateExpiration (Warning)
    • KubeletServerCertificateRenewalErrors (Warning)
    • KubeletTooManyPods (Warning)
    • NodeClockNotSynchronising (Warning)
    • NodeClockSkewDetected (Warning)
    • NodeRAIDDiskFailure (Warning)
    • NodeTextFileCollectorScrapeError (Warning)
    • SystemPartitionDegraded (Warning)

No atomic alert is defined yet for the following:

  • system units (kubelet, containerd, salt-minion, ntp): this would require enriching the node exporter
  • RAM
  • CPU

System Partitions

SystemPartitionAtRisk
  Severity: Critical
  Summary: The partition <mountpoint> on node <nodename> is at risk
  Parent: NodeAtRisk
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Critical)
    • NodeFilesystemAlmostOutOfFiles (Critical)
    • NodeFilesystemFilesFillingUp (Critical)
    • NodeFilesystemSpaceFillingUp (Critical)

SystemPartitionDegraded
  Severity: Warning
  Summary: The partition <mountpoint> on node <nodename> is degraded
  Parent: NodeDegraded
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Warning)
    • NodeFilesystemAlmostOutOfFiles (Warning)
    • NodeFilesystemFilesFillingUp (Warning)
    • NodeFilesystemSpaceFillingUp (Warning)

Volumes

VolumeAtRisk
  Severity: Critical
  Summary: The volume <volumename> on node <nodename> is at risk
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeErrors (Warning)
    • KubePersistentVolumeFillingUp (Critical)

VolumeDegraded
  Severity: Warning
  Summary: The volume <volumename> on node <nodename> is degraded
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeFillingUp (Warning)

Platform Services

PlatformServicesAtRisk
  Severity: Critical
  Summary: The Platform services are at risk
  Parent: PlatformAtRisk
  Sub alerts:
    • CoreServicesAtRisk (Critical)
    • ObservabilityServicesAtRisk (Critical)

PlatformServicesDegraded
  Severity: Warning
  Summary: The Platform services are degraded
  Parent: PlatformDegraded
  Sub alerts:
    • CoreServicesDegraded (Warning)
    • ObservabilityServicesDegraded (Warning)
    • AccessServicesDegraded (Warning)

Core

CoreServicesAtRisk
  Severity: Critical
  Summary: The Core services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • K8sMasterServicesAtRisk (Critical)

CoreServicesDegraded
  Severity: Warning
  Summary: The Core services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • K8sMasterServicesDegraded (Warning)
    • BootstrapServicesDegraded (Warning)

K8sMasterServicesAtRisk
  Severity: Critical
  Summary: The Kubernetes master services are at risk
  Parent: CoreServicesAtRisk
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Critical)
    • etcdHighNumberOfFailedGRPCRequests (Critical)
    • etcdGRPCRequestsSlow (Critical)
    • etcdHighNumberOfFailedHTTPRequests (Critical)
    • etcdInsufficientMembers (Critical)
    • etcdMembersDown (Critical)
    • etcdNoLeader (Critical)
    • KubeStateMetricsListErrors (Critical)
    • KubeStateMetricsWatchErrors (Critical)
    • KubeAPIDown (Critical)
    • KubeClientCertificateExpiration (Critical)
    • KubeControllerManagerDown (Critical)
    • KubeletDown (Critical)
    • KubeSchedulerDown (Critical)

K8sMasterServicesDegraded
  Severity: Warning
  Summary: The Kubernetes master services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Warning)
    • etcdHighNumberOfFailedGRPCRequests (Warning)
    • etcdHTTPRequestsSlow (Warning)
    • etcdHighCommitDurations (Warning)
    • etcdHighFsyncDurations (Warning)
    • etcdHighNumberOfFailedHTTPRequests (Warning)
    • etcdHighNumberOfFailedProposals (Warning)
    • etcdHighNumberOfLeaderChanges (Warning)
    • etcdMemberCommunicationSlow (Warning)
    • KubeCPUOvercommit (Warning)
    • KubeCPUQuotaOvercommit (Warning)
    • KubeMemoryOvercommit (Warning)
    • KubeMemoryQuotaOvercommit (Warning)
    • KubeClientCertificateExpiration (Warning)
    • KubeClientErrors (Warning)
    • KubeVersionMismatch (Warning)
    • KubeDeploymentReplicasMismatch (Warning), filter: kube-system/coredns
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-adapter
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-kube-state-metrics

BootstrapServicesDegraded
  Severity: Warning
  Summary: The bootstrap services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubePodNotReady (Warning), filter: kube-system/repositories-<bootstrapname>
    • KubePodNotReady (Warning), filter: kube-system/salt-master-<bootstrapname>
    • KubeDeploymentReplicasMismatch (Warning), filter: kube-system/storage-operator
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-ui/metalk8s-ui

Note

The name of the bootstrap node depends on how MetalK8s is deployed, so we would need to automatically configure this alert during deployment. We may want to use a more deterministic filter to find the repository and salt-master pods.
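
A minimal sketch of what configuring these filters at deployment time could look like, assuming the bootstrap node name is available from the deployment configuration (for example the Salt pillar); the function name and the example node name are hypothetical:

    def bootstrap_pod_filters(bootstrap_name):
        """Render the pod filters of BootstrapServicesDegraded for one deployment."""
        return [
            f"kube-system/repositories-{bootstrap_name}",
            f"kube-system/salt-master-{bootstrap_name}",
        ]


    print(bootstrap_pod_filters("my-bootstrap-node"))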

Observability

ObservabilityServicesAtRisk
  Severity: Critical
  Summary: The observability services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • MonitoringServiceAtRisk (Critical)
    • AlertingServiceAtRisk (Critical)
    • LoggingServiceAtRisk (Critical)

ObservabilityServicesDegraded
  Severity: Warning
  Summary: The observability services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • MonitoringServiceDegraded (Warning)
    • AlertingServiceDegraded (Warning)
    • DashboardingServiceDegraded (Warning)
    • LoggingServiceDegraded (Warning)
MonitoringServiceAtRisk
  Severity: Critical
  Summary: The monitoring service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • PrometheusRuleFailures (Critical)
    • PrometheusRemoteWriteBehind (Critical)
    • PrometheusRemoteStorageFailures (Critical)
    • PrometheusErrorSendingAlertsToAnyAlertmanager (Critical)
    • PrometheusBadConfig (Critical)

MonitoringServiceDegraded
  Severity: Warning
  Summary: The monitoring service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=prometheus-operator-prometheus
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=prometheus-operator-prometheus
    • TargetDown (Warning), filter: to be defined
    • PrometheusTargetLimitHit (Warning)
    • PrometheusTSDBReloadsFailing (Warning)
    • PrometheusTSDBCompactionsFailing (Warning)
    • PrometheusRemoteWriteDesiredShards (Warning)
    • PrometheusOutOfOrderTimestamps (Warning)
    • PrometheusNotificationQueueRunningFull (Warning)
    • PrometheusNotIngestingSamples (Warning)
    • PrometheusNotConnectedToAlertmanagers (Warning)
    • PrometheusMissingRuleEvaluations (Warning)
    • PrometheusErrorSendingAlertsToSomeAlertmanagers (Warning)
    • PrometheusDuplicateTimestamps (Warning)
    • PrometheusOperatorWatchErrors (Warning)
    • PrometheusOperatorSyncFailed (Warning)
    • PrometheusOperatorRejectedResources (Warning)
    • PrometheusOperatorReconcileErrors (Warning)
    • PrometheusOperatorNotReady (Warning)
    • PrometheusOperatorNodeLookupErrors (Warning)
    • PrometheusOperatorListErrors (Warning)
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-prometheus-operator-prometheus
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-operator
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-monitoring/prometheus-operator-prometheus-node-exporter
LoggingServiceAtRisk
  Severity: Critical
  Summary: The logging service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

LoggingServiceDegraded
  Severity: Warning
  Summary: The logging service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=loki
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=loki
    • TargetDown (Warning), filter: to be defined
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-logging/loki
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-logging/fluentbit
AlertingServiceAtRisk
  Severity: Critical
  Summary: The alerting service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

AlertingServiceDegraded
  Severity: Warning
  Summary: The alerting service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=prometheus-operator-alertmanager
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=prometheus-operator-alertmanager
    • TargetDown (Warning), filter: to be defined
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/alertmanager-prometheus-operator-alertmanager
    • AlertmanagerFailedReload (Warning)

DashboardingServiceDegraded
  Severity: Warning
  Summary: The dashboarding service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-grafana
    • TargetDown (Warning), filter: to be defined

Network

ControlPlaneNetworkDegraded
  Severity: Warning
  Summary: The Control Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning), filter: need to filter on the proper cp interface
    • NodeHighNumberConntrackEntriesUsed (Warning), filter: need to filter on the proper cp interface
    • NodeNetworkTransmitErrs (Warning), filter: need to filter on the proper cp interface
    • NodeNetworkInterfaceFlapping (Warning), filter: need to filter on the proper cp interface

WorkloadPlaneNetworkDegraded
  Severity: Warning
  Summary: The Workload Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning), filter: need to filter on the proper wp interface
    • NodeHighNumberConntrackEntriesUsed (Warning), filter: need to filter on the proper wp interface
    • NodeNetworkTransmitErrs (Warning), filter: need to filter on the proper wp interface
    • NodeNetworkInterfaceFlapping (Warning), filter: need to filter on the proper wp interface

Note

The names of the interfaces used by the Workload Plane and/or the Control Plane are not known in advance. As such, we should find a way to automatically configure the Network alerts based on the network configuration.
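
A minimal sketch of how that configuration could be injected, assuming the interface names are read from the MetalK8s network configuration; node exporter network metrics expose the interface in the `device` label, and `eth1` below is only an example value:

    def plane_device_matcher(interfaces):
        """Build the PromQL label matcher restricting network alerts to one plane."""
        return 'device=~"{}"'.format("|".join(interfaces))


    # e.g. rendered from the Control Plane network configuration:
    print(plane_device_matcher(["eth1"]))  # device=~"eth1"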

Note

Currently we don't have any alerts for the Virtual Plane, which is provided by kube-proxy, calico-kube-controllers, and calico-node. It is not even part of the MetalK8s Dashboard page. We may want to introduce it.

Access

AccessServicesDegraded
  Severity: Warning
  Summary: The Access services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • IngressControllerDegraded (Warning)
    • AuthenticationDegraded (Warning)

IngressControllerDegraded
  Severity: Warning
  Summary: The Ingress Controllers for CP and WP are degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-ingress/ingress-nginx-defaultbackend
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-system/ingress-nginx-controller
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-system/ingress-nginx-control-plane-controller

AuthenticationDegraded
  Severity: Warning
  Summary: The Authentication service for the K8s API is degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-auth/dex