
MetalK8s predefined Alert rules and Alert Grouping

Context

As part of MetalK8s, we would like to provide the Administrator with built-in rule expressions that can be used to fire alerts and send notifications when one of the High Level entities of the system is degraded or impacted by the degradation of a Low Level component.

As an example, we would like to notify the administrator when the MetalK8s log service is degraded because of some specific observed symptoms:

  • not all log service replicas are scheduled
  • one of the persistent volumes claimed by a log service replica is getting full
  • the log DB ingestion rate is near zero

In this specific example, the goal is to invite the administrator to perform manual tasks to avoid having a Log Service interruption in the near future.

Vocabulary

Atomic Alert: An Alert which is based on existing metrics in Prometheus and which is linked to a specific symptom.

High Level Alert: An Alert which is based on other atomic alerts or High Level alerts.

Requirements

When receiving such High Level alerts, we would like the system to guide the administrator to find and understand the root cause of the alert as well as the path to resolve it. Accessing the list of observed low level symptoms will help the administrator's investigation.

Having the High Level Alerts also gives the administrator a better understanding of which part, layer, or component of the system is currently impacted, without having to build a mental model to guess the impact of each existing atomic alert in the system.

Figure: MetalK8s overview dashboard (img/metalk8s-overview.jpg)

A number of atomic alerts are already deployed, but we don't yet have the High Level Alerts that we could use to build the MetalK8s dashboard above. Being able to define the impact of one atomic alert is a way to build those High Level Alerts:

It is impossible to model all possible causes through this kind of impact tree. However, when an alert is received, the system shall suggest other alerts that may be linked to it (maybe using matching labels).
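
As a sketch of how such suggestions could work, the snippet below queries the Alertmanager v2 API and returns the alerts that share at least one identifying label with a received alert. The Alertmanager URL and the set of labels used for matching are assumptions for illustration, not an existing MetalK8s convention:

    import requests

    ALERTMANAGER_URL = "http://localhost:9093"           # assumed endpoint
    MATCHING_LABELS = ("node", "instance", "namespace")  # assumed matching keys


    def related_alerts(received, all_alerts):
        """Return alerts sharing at least one identifying label with `received`."""
        shared = {
            (key, received["labels"][key])
            for key in MATCHING_LABELS
            if key in received["labels"]
        }
        return [
            alert
            for alert in all_alerts
            if alert["labels"] != received["labels"]
            and shared & set(alert["labels"].items())
        ]


    alerts = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts").json()
    if alerts:
        print([a["labels"]["alertname"] for a in related_alerts(alerts[0], alerts)])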

Also, when accessing the Node or Volume page / alert tab, the administrator should be able to visualise all the fired alerts described under the Nodes or Volumes entities.

In the end, the way to visualise the impact of an atomic alert in the alert page is shown in the screenshot below:

Figure: impact of an atomic alert as shown in the alert page (img/alertes.jpg)

The High Level alerts should be easily identifiable in order to filter them out in the UI views. Indeed, in a first iteration we might want to display only the atomic alerts until all High Level alerts are implemented and deployed.
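
A minimal sketch of such filtering, assuming the High Level alert rules carry an identifying label (the `alert-level` label name and its `high` value are hypothetical, not an existing convention):

    def atomic_alerts(alerts):
        """Keep only atomic alerts; drop anything flagged as a High Level alert."""
        return [
            alert
            for alert in alerts
            if alert["labels"].get("alert-level") != "high"  # hypothetical label
        ]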

Severity Classification

  • Critical Alert = Red = Service Offline or At Risk, requires immediate intervention
  • Warning Alert = Yellow = Service Degraded, requires planned (within 1 week) intervention
  • No Active Alert = Green = Service Healthy

Notifications are either an email, a Slack message, or any other routing supported by Alertmanager, or a decorated icon in the UI.
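
This classification maps directly to a "worst active severity wins" rule for each entity; a minimal sketch (the function name is illustrative):

    def entity_health(active_severities):
        """Derive the health color of an entity from its active alert severities."""
        if "critical" in active_severities:
            return "Red"     # Service Offline or At Risk: immediate intervention
        if "warning" in active_severities:
            return "Yellow"  # Service Degraded: planned intervention within a week
        return "Green"       # No active alert: Service Healthy


    print(entity_health([]))                       # Green
    print(entity_health(["warning"]))              # Yellow
    print(entity_health(["warning", "critical"]))  # Red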

Data Model

We consider that Nodes and Volumes don't impact the Platform directly. As such, they do not belong to the Platform.

  • Volumes
  • Nodes
  • Platform
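
As a sketch of how the tables below could translate into rules, a High Level alert can be expressed as a query over Prometheus' built-in ALERTS metric, firing when at least one of its sub-alerts fires with the expected severity (and, where applicable, matching labels). The exact expression and the lowercase severity values are assumptions, not the final implementation:

    def grouping_expr(sub_alerts, severity):
        """Build a PromQL expression that fires when any listed sub-alert fires."""
        names = "|".join(sub_alerts)
        return (
            f'sum(ALERTS{{alertname=~"{names}", '
            f'severity="{severity}", alertstate="firing"}}) >= 1'
        )


    # PlatformDegraded, as specified in the Platform table below:
    print(grouping_expr(
        [
            "PlatformServicesDegraded",
            "ControlPlaneNetworkDegraded",
            "WorkloadPlaneNetworkDegraded",
        ],
        severity="warning",
    ))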

Platform

PlatformAtRisk
  Severity: Critical
  Summary: The Platform is at risk
  Parent: none
  Sub alerts:
    • PlatformServicesAtRisk (Critical)

PlatformDegraded
  Severity: Warning
  Summary: The Platform is degraded
  Parent: none
  Sub alerts:
    • PlatformServicesDegraded (Warning)
    • ControlPlaneNetworkDegraded (Warning)
    • WorkloadPlaneNetworkDegraded (Warning)

Nodes

NodeAtRisk
  Severity: Critical
  Summary: Node <nodename> is at risk
  Parent: none
  Sub alerts:
    • KubeletClientCertificateExpiration (Critical)
    • NodeRAIDDegraded (Critical)
    • SystemPartitionAtRisk (Critical)

NodeDegraded
  Severity: Warning
  Summary: Node <nodename> is degraded
  Parent: none
  Sub alerts:
    • KubeNodeNotReady (Warning)
    • KubeNodeReadinessFlapping (Warning)
    • KubeNodeUnreachable (Warning)
    • KubeletClientCertificateExpiration (Warning)
    • KubeletClientCertificateRenewalErrors (Warning)
    • KubeletPlegDurationHigh (Warning)
    • KubeletPodStartUpLatencyHigh (Warning)
    • KubeletServerCertificateExpiration (Warning)
    • KubeletServerCertificateRenewalErrors (Warning)
    • KubeletTooManyPods (Warning)
    • NodeClockNotSynchronising (Warning)
    • NodeClockSkewDetected (Warning)
    • NodeRAIDDiskFailure (Warning)
    • NodeTextFileCollectorScrapeError (Warning)
    • SystemPartitionDegraded (Warning)

No atomic alert is defined yet for the following:

  • system units (kubelet, containerd, salt-minion, ntp): this would require enriching the node exporter
  • RAM
  • CPU

System Partitions

SystemPartitionAtRisk
  Severity: Critical
  Summary: The partition <mountpoint> on node <nodename> is at risk
  Parent: NodeAtRisk
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Critical)
    • NodeFilesystemAlmostOutOfFiles (Critical)
    • NodeFilesystemFilesFillingUp (Critical)
    • NodeFilesystemSpaceFillingUp (Critical)

SystemPartitionDegraded
  Severity: Warning
  Summary: The partition <mountpoint> on node <nodename> is degraded
  Parent: NodeDegraded
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Warning)
    • NodeFilesystemAlmostOutOfFiles (Warning)
    • NodeFilesystemFilesFillingUp (Warning)
    • NodeFilesystemSpaceFillingUp (Warning)

Volumes

VolumeAtRisk
  Severity: Critical
  Summary: The volume <volumename> on node <nodename> is at risk
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeErrors (Warning)
    • KubePersistentVolumeFillingUp (Critical)

VolumeDegraded
  Severity: Warning
  Summary: The volume <volumename> on node <nodename> is degraded
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeFillingUp (Warning)

Platform Services

PlatformServicesAtRisk
  Severity: Critical
  Summary: The Platform services are at risk
  Parent: PlatformAtRisk
  Sub alerts:
    • CoreServicesAtRisk (Critical)
    • ObservabilityServicesAtRisk (Critical)

PlatformServicesDegraded
  Severity: Warning
  Summary: The Platform services are degraded
  Parent: PlatformDegraded
  Sub alerts:
    • CoreServicesDegraded (Warning)
    • ObservabilityServicesDegraded (Warning)
    • AccessServicesDegraded (Warning)

Core

CoreServicesAtRisk
  Severity: Critical
  Summary: The Core services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • K8sMasterServicesAtRisk (Critical)

CoreServicesDegraded
  Severity: Warning
  Summary: The Core services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • K8sMasterServicesDegraded (Warning)
    • BootstrapServicesDegraded (Warning)

K8sMasterServicesAtRisk
  Severity: Critical
  Summary: The Kubernetes master services are at risk
  Parent: CoreServicesAtRisk
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Critical)
    • etcdHighNumberOfFailedGRPCRequests (Critical)
    • etcdGRPCRequestsSlow (Critical)
    • etcdHighNumberOfFailedHTTPRequests (Critical)
    • etcdInsufficientMembers (Critical)
    • etcdMembersDown (Critical)
    • etcdNoLeader (Critical)
    • KubeStateMetricsListErrors (Critical)
    • KubeStateMetricsWatchErrors (Critical)
    • KubeAPIDown (Critical)
    • KubeClientCertificateExpiration (Critical)
    • KubeControllerManagerDown (Critical)
    • KubeletDown (Critical)
    • KubeSchedulerDown (Critical)

K8sMasterServicesDegraded
  Severity: Warning
  Summary: The Kubernetes master services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Warning)
    • etcdHighNumberOfFailedGRPCRequests (Warning)
    • etcdHTTPRequestsSlow (Warning)
    • etcdHighCommitDurations (Warning)
    • etcdHighFsyncDurations (Warning)
    • etcdHighNumberOfFailedHTTPRequests (Warning)
    • etcdHighNumberOfFailedProposals (Warning)
    • etcdHighNumberOfLeaderChanges (Warning)
    • etcdMemberCommunicationSlow (Warning)
    • KubeCPUOvercommit (Warning)
    • KubeCPUQuotaOvercommit (Warning)
    • KubeMemoryOvercommit (Warning)
    • KubeMemoryQuotaOvercommit (Warning)
    • KubeClientCertificateExpiration (Warning)
    • KubeClientErrors (Warning)
    • KubeVersionMismatch (Warning)
    • KubeDeploymentReplicasMismatch (Warning), filter: kube-system/coredns
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-adapter
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-kube-state-metrics

BootstrapServicesDegraded
  Severity: Warning
  Summary: The bootstrap services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubePodNotReady (Warning), filter: kube-system/repositories-<bootstrapname>
    • KubePodNotReady (Warning), filter: kube-system/salt-master-<bootstrapname>
    • KubeDeploymentReplicasMismatch (Warning), filter: kube-system/storage-operator
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-ui/metalk8s-ui

Note

The name of the bootstrap node depends on how MetalK8s is deployed, so we would need to automatically configure this alert during deployment. We may want to use a more deterministic filter to find the repository and salt-master pods.
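
A minimal sketch of what configuring these filters at deployment time could look like, assuming the bootstrap node name is available from the deployment configuration (for example the Salt pillar); the function name and the example node name are hypothetical:

    def bootstrap_pod_filters(bootstrap_name):
        """Render the pod filters of BootstrapServicesDegraded for one deployment."""
        return [
            f"kube-system/repositories-{bootstrap_name}",
            f"kube-system/salt-master-{bootstrap_name}",
        ]


    print(bootstrap_pod_filters("my-bootstrap-node"))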

Observability

ObservabilityServicesAtRisk
  Severity: Critical
  Summary: The observability services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • MonitoringServiceAtRisk (Critical)
    • AlertingServiceAtRisk (Critical)
    • LoggingServiceAtRisk (Critical)

ObservabilityServicesDegraded
  Severity: Warning
  Summary: The observability services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • MonitoringServiceDegraded (Warning)
    • AlertingServiceDegraded (Warning)
    • DashboardingServiceDegraded (Warning)
    • LoggingServiceDegraded (Warning)
MonitoringServiceAtRisk
  Severity: Critical
  Summary: The monitoring service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • PrometheusRuleFailures (Critical)
    • PrometheusRemoteWriteBehind (Critical)
    • PrometheusRemoteStorageFailures (Critical)
    • PrometheusErrorSendingAlertsToAnyAlertmanager (Critical)
    • PrometheusBadConfig (Critical)

MonitoringServiceDegraded
  Severity: Warning
  Summary: The monitoring service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=prometheus-operator-prometheus
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=prometheus-operator-prometheus
    • TargetDown (Warning), filter: to be defined
    • PrometheusTargetLimitHit (Warning)
    • PrometheusTSDBReloadsFailing (Warning)
    • PrometheusTSDBCompactionsFailing (Warning)
    • PrometheusRemoteWriteDesiredShards (Warning)
    • PrometheusOutOfOrderTimestamps (Warning)
    • PrometheusNotificationQueueRunningFull (Warning)
    • PrometheusNotIngestingSamples (Warning)
    • PrometheusNotConnectedToAlertmanagers (Warning)
    • PrometheusMissingRuleEvaluations (Warning)
    • PrometheusErrorSendingAlertsToSomeAlertmanagers (Warning)
    • PrometheusDuplicateTimestamps (Warning)
    • PrometheusOperatorWatchErrors (Warning)
    • PrometheusOperatorSyncFailed (Warning)
    • PrometheusOperatorRejectedResources (Warning)
    • PrometheusOperatorReconcileErrors (Warning)
    • PrometheusOperatorNotReady (Warning)
    • PrometheusOperatorNodeLookupErrors (Warning)
    • PrometheusOperatorListErrors (Warning)
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-prometheus-operator-prometheus
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-operator
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-monitoring/prometheus-operator-prometheus-node-exporter
LoggingServiceAtRisk
  Severity: Critical
  Summary: The logging service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

LoggingServiceDegraded
  Severity: Warning
  Summary: The logging service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=loki
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=loki
    • TargetDown (Warning), filter: to be defined
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-logging/loki
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-logging/fluentbit
AlertingServiceAtRisk
  Severity: Critical
  Summary: The alerting service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

AlertingServiceDegraded
  Severity: Warning
  Summary: The alerting service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning), filter: app.kubernetes.io/name=prometheus-operator-alertmanager
    • VolumeAtRisk (Critical), filter: app.kubernetes.io/name=prometheus-operator-alertmanager
    • TargetDown (Warning), filter: to be defined
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/alertmanager-prometheus-operator-alertmanager
    • AlertmanagerFailedReload (Warning)

DashboardingServiceDegraded
  Severity: Warning
  Summary: The dashboarding service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • KubeStatefulSetReplicasMismatch (Warning), filter: metalk8s-monitoring/prometheus-operator-grafana
    • TargetDown (Warning), filter: to be defined

Network

ControlPlaneNetworkDegraded
  Severity: Warning
  Summary: The Control Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning), filter: need to filter on the proper cp interface
    • NodeHighNumberConntrackEntriesUsed (Warning), filter: need to filter on the proper cp interface
    • NodeNetworkTransmitErrs (Warning), filter: need to filter on the proper cp interface
    • NodeNetworkInterfaceFlapping (Warning), filter: need to filter on the proper cp interface

WorkloadPlaneNetworkDegraded
  Severity: Warning
  Summary: The Workload Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning), filter: need to filter on the proper wp interface
    • NodeHighNumberConntrackEntriesUsed (Warning), filter: need to filter on the proper wp interface
    • NodeNetworkTransmitErrs (Warning), filter: need to filter on the proper wp interface
    • NodeNetworkInterfaceFlapping (Warning), filter: need to filter on the proper wp interface

Note

The names of the interfaces used by the Workload Plane and/or the Control Plane are not known in advance. As such, we should find a way to automatically configure the Network alerts based on the network configuration.
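
A minimal sketch of how that configuration could be injected, assuming the interface names are read from the MetalK8s network configuration; node exporter network metrics expose the interface in the `device` label, and `eth1` below is only an example value:

    def plane_device_matcher(interfaces):
        """Build the PromQL label matcher restricting network alerts to one plane."""
        return 'device=~"{}"'.format("|".join(interfaces))


    # e.g. rendered from the Control Plane network configuration:
    print(plane_device_matcher(["eth1"]))  # device=~"eth1"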

Note

Currently we don't have any alerts for the Virtual Plane, which is provided by kube-proxy, calico-kube-controllers, and calico-node. It is not even part of the MetalK8s Dashboard page. We may want to introduce it.

Access

AccessServicesDegraded
  Severity: Warning
  Summary: The Access services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • IngressControllerDegraded (Warning)
    • AuthenticationDegraded (Warning)

IngressControllerDegraded
  Severity: Warning
  Summary: The Ingress Controllers for CP and WP are degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-ingress/ingress-nginx-defaultbackend
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-system/ingress-nginx-controller
    • KubeDaemonSetNotScheduled (Warning), filter: metalk8s-system/ingress-nginx-control-plane-controller

AuthenticationDegraded
  Severity: Warning
  Summary: The Authentication service for the K8s API is degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning), filter: metalk8s-auth/dex