MetalK8s predefined Alert rules and Alert Grouping

Context

As part of MetalK8s, we would like to provide the Administrator with built-in rule expressions that can be used to fire alerts and send notifications when one of the High Level entities of the system is degraded or impacted by the degradation of a Low Level component.

As an example, we would like to notify the administrator when the MetalK8s log service is degraded because of some specific observed symptoms:

  • not all log service replicas are scheduled
  • one of the persistent volumes claimed by a log service replica is getting full
  • the Log DB ingestion rate is near zero

In this specific example, the goal is to invite the administrator to perform manual tasks to avoid a Log Service interruption in the near future.

Vocabulary

Atomic Alert: An Alert which is based on existing metrics in Prometheus and which is linked to a specific symptom.

High Level Alert: An Alert which is based on other atomic alerts or High Level alerts.
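To make the relationship concrete, a High Level Alert can be thought of as a rule that fires when at least one of its sub-alerts is firing. The snippet below is only a sketch of this idea, built on Prometheus' built-in ALERTS metric; the actual grouping mechanism used by MetalK8s is not decided by this document:

    groups:
      - name: high-level-alerts-sketch
        rules:
          - alert: SystemPartitionDegraded
            # Fires when any of the atomic filesystem alerts listed in the
            # "System Partitions" section below is firing with severity=warning.
            expr: |
              sum by (instance, mountpoint) (
                ALERTS{
                  alertname=~"NodeFilesystemAlmostOutOfSpace|NodeFilesystemAlmostOutOfFiles|NodeFilesystemFilesFillingUp|NodeFilesystemSpaceFillingUp",
                  severity="warning",
                  alertstate="firing"
                }
              ) >= 1
            labels:
              severity: warning
            annotations:
              summary: "The partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is degraded"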

Requirements

When receiving such High Level alerts, we would like the system to guide the administrator to find and understand the root cause of the alert as well as the path to resolve it. Accessing the list of observed low level symptoms will help the administrator's investigation.

High Level Alerts also help the administrator understand which part, layer or component of the system is currently impacted, without having to build a mental model to guess the impact of every existing atomic alert in the system.

image

A number of atomic alerts are already deployed, but we don't yet have the High Level Alerts needed to build the MetalK8s dashboard shown above. Defining the impact of each atomic alert is a way to build those High Level Alerts:

It is impossible to model all possible causes through this kind of impact tree. However, when an alert is received, the system shall suggest other alerts that may be linked to it (for example, using matching labels).
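As a sketch of what "matching labels" could mean in practice (assuming the atomic alerts carry an instance label identifying the node, which is the case for most node-level alerts), the UI or an operator could list the other alerts currently firing on the same node with a query such as:

    # Other alerts currently firing on the same node (sketch only):
    ALERTS{alertstate="firing", instance="<nodename>"}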

Also, when accessing the alert tab of the Node or Volume page, the administrator should be able to visualise all the fired alerts described under the Nodes or Volumes entities.

In the end, the way to visualise the impact of an atomic alert in the alert page is shown in the screenshot below:

image

The High Level Alerts should be easily identifiable in order to filter them out in the UI views. Indeed, in a first iteration we might want to display only the atomic alerts until all High Level Alerts are implemented and deployed.
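A possible (purely hypothetical) convention to make them identifiable is to attach a dedicated label to every High Level alerting rule; the label name below is an assumption, not an existing MetalK8s label:

    # Labels attached to every High Level alerting rule (hypothetical label name):
    labels:
      severity: warning
      alert_level: high

    # The UI could then keep only atomic alerts with a selector such as:
    #   ALERTS{alertstate="firing", alert_level!="high"}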

Severity Classification

  • Critical Alert = Red = Service Offline or At Risk, requires immediate intervention
  • Warning Alert = Yellow = Service Degraded, requires planned (within 1 week) intervention
  • No Active Alert = Green = Service Healthy

Notifications can be an email, a Slack message or any other routing supported by Alertmanager, as well as a decorated icon in the UI.
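As an illustration only (receiver names and destinations are placeholders, not the configuration shipped with MetalK8s), the severity classification above maps naturally onto an Alertmanager routing tree:

    route:
      receiver: default
      routes:
        - match:
            severity: critical      # requires immediate intervention
          receiver: mail-oncall
        - match:
            severity: warning       # requires planned intervention
          receiver: slack-platform

    receivers:
      - name: default
      - name: mail-oncall
        email_configs:
          - to: oncall@example.com  # placeholder; SMTP settings omitted
      - name: slack-platform
        slack_configs:
          - channel: '#platform-alerts'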

Data Model

We consider that Nodes and Volumes don't impact the Platform directly; as such, they do not belong to the Platform entity.

The data model is built around three top-level entities:

  • Volumes
  • Nodes
  • Platform

Platform

PlatformAtRisk
  Severity: Critical
  Summary: The Platform is at risk
  Parent: none
  Sub alerts:
    • PlatformServicesAtRisk (Critical)

PlatformDegraded
  Severity: Warning
  Summary: The Platform is degraded
  Parent: none
  Sub alerts:
    • PlatformServicesDegraded (Warning)
    • ControlPlaneNetworkDegraded (Warning)
    • WorkloadPlaneNetworkDegraded (Warning)

Nodes

NodeAtRisk
  Severity: Critical
  Summary: Node <nodename> is at risk
  Parent: none
  Sub alerts:
    • KubeletClientCertificateExpiration (Critical)
    • NodeRAIDDegraded (Critical)
    • SystemPartitionAtRisk (Critical)

NodeDegraded
  Severity: Warning
  Summary: Node <nodename> is degraded
  Parent: none
  Sub alerts:
    • KubeNodeNotReady (Warning)
    • KubeNodeReadinessFlapping (Warning)
    • KubeNodeUnreachable (Warning)
    • KubeletClientCertificateExpiration (Warning)
    • KubeletClientCertificateRenewalErrors (Warning)
    • KubeletPlegDurationHigh (Warning)
    • KubeletPodStartUpLatencyHigh (Warning)
    • KubeletServerCertificateExpiration (Warning)
    • KubeletServerCertificateRenewalErrors (Warning)
    • KubeletTooManyPods (Warning)
    • NodeClockNotSynchronising (Warning)
    • NodeClockSkewDetected (Warning)
    • NodeRAIDDiskFailure (Warning)
    • NodeTextFileCollectorScrapeError (Warning)
    • SystemPartitionDegraded (Warning)

Currently, no atomic alert is defined yet for the following:

  • System units (kubelet, containerd, salt-minion, ntp): would require enriching node_exporter
  • RAM
  • CPU

System Partitions

SystemPartitionAtRisk
  Severity: Warning
  Summary: The partition <mountpoint> on node <nodename> is at risk
  Parent: NodeAtRisk
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Critical)
    • NodeFilesystemAlmostOutOfFiles (Critical)
    • NodeFilesystemFilesFillingUp (Critical)
    • NodeFilesystemSpaceFillingUp (Critical)

SystemPartitionDegraded
  Severity: Warning
  Summary: The partition <mountpoint> on node <nodename> is degraded
  Parent: NodeDegraded
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Warning)
    • NodeFilesystemAlmostOutOfFiles (Warning)
    • NodeFilesystemFilesFillingUp (Warning)
    • NodeFilesystemSpaceFillingUp (Warning)

Volumes

VolumeAtRisk
  Severity: Critical
  Summary: The volume <volumename> on node <nodename> is at risk
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeErrors (Warning)
    • KubePersistentVolumeFillingUp (Critical)

VolumeDegraded
  Severity: Warning
  Summary: The volume <volumename> on node <nodename> is degraded
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeFillingUp (Warning)

Platform Services

PlatformServicesAtRisk
  Severity: Critical
  Summary: The Platform services are at risk
  Parent: PlatformAtRisk
  Sub alerts:
    • CoreServicesAtRisk (Critical)
    • ObservabilityServicesAtRisk (Critical)

PlatformServicesDegraded
  Severity: Warning
  Summary: The Platform services are degraded
  Parent: PlatformDegraded
  Sub alerts:
    • CoreServicesDegraded (Warning)
    • ObservabilityServicesDegraded (Warning)
    • AccessServicesDegraded (Warning)

Core

CoreServicesAtRisk
  Severity: Critical
  Summary: The Core services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • K8sMasterServicesAtRisk (Critical)

CoreServicesDegraded
  Severity: Warning
  Summary: The Core services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • K8sMasterServicesDegraded (Critical)
    • BootstrapServicesDegraded (Critical)

K8sMasterServicesAtRisk
  Severity: Warning
  Summary: The Kubernetes master services are at risk
  Parent: CoreServicesAtRisk
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Critical)
    • etcdHighNumberOfFailedGRPCRequests (Critical)
    • etcdGRPCRequestsSlow (Critical)
    • etcdHighNumberOfFailedHTTPRequests (Critical)
    • etcdInsufficientMembers (Critical)
    • etcdMembersDown (Critical)
    • etcdNoLeader (Critical)
    • KubeStateMetricsListErrors (Critical)
    • KubeStateMetricsWatchErrors (Critical)
    • KubeAPIDown (Critical)
    • KubeClientCertificateExpiration (Critical)
    • KubeControllerManagerDown (Critical)
    • KubeletDown (Critical)
    • KubeSchedulerDown (Critical)

K8sMasterServicesDegraded
  Severity: Warning
  Summary: The Kubernetes master services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Warning)
    • etcdHighNumberOfFailedGRPCRequests (Warning)
    • etcdHTTPRequestsSlow (Warning)
    • etcdHighCommitDurations (Warning)
    • etcdHighFsyncDurations (Warning)
    • etcdHighNumberOfFailedHTTPRequests (Warning)
    • etcdHighNumberOfFailedProposals (Warning)
    • etcdHighNumberOfLeaderChanges (Warning)
    • etcdMemberCommunicationSlow (Warning)
    • KubeCPUOvercommit (Warning)
    • KubeCPUQuotaOvercommit (Warning)
    • KubeMemoryOvercommit (Warning)
    • KubeMemoryQuotaOvercommit (Warning)
    • KubeClientCertificateExpiration (Warning)
    • KubeClientErrors (Warning)
    • KubeVersionMismatch (Warning)
    • KubeDeploymentReplicasMismatch (Warning, filter: kube-system/coredns)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-adapter)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-kube-state-metrics)

BootstrapServicesDegraded
  Severity: Warning
  Summary: The bootstrap services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubePodNotReady (Warning, filter: kube-system/repositories-<bootstrapname>)
    • KubePodNotReady (Warning, filter: kube-system/salt-master-<bootstrapname>)
    • KubeDeploymentReplicasMismatch (Warning, filter: kube-system/storage-operator)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-ui/metalk8s-ui)

Note

The name of the bootstrap node depends on how MetalK8s is deployed, so this alert would need to be configured automatically during deployment. We may also want to use a more deterministic filter to find the repository and salt-master pods.
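For instance, a more deterministic expression could match on pod labels through kube-state-metrics instead of on the pod name; the sketch below assumes the salt-master pod carries an app=salt-master label (to be confirmed):

    # Sketch: select the pods of interest by label rather than by name.
    sum by (namespace, pod) (
      kube_pod_status_phase{namespace="kube-system", phase=~"Pending|Unknown"}
    )
    * on (namespace, pod) group_left()
      max by (namespace, pod) (kube_pod_labels{label_app="salt-master"})
    > 0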

Observability

ObservabilityServicesAtRisk
  Severity: Critical
  Summary: The observability services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • MonitoringServiceAtRisk (Critical)
    • AlertingServiceAtRisk (Critical)
    • LoggingServiceAtRisk (Critical)

ObservabilityServicesDegraded
  Severity: Warning
  Summary: The observability services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • MonitoringServiceDegraded (Warning)
    • AlertingServiceDegraded (Warning)
    • DashboardingServiceDegraded (Warning)
    • LoggingServiceDegraded (Warning)

MonitoringServiceAtRisk
  Severity: Warning
  Summary: The monitoring service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • PrometheusRuleFailures (Critical)
    • PrometheusRemoteWriteBehind (Critical)
    • PrometheusRemoteStorageFailures (Critical)
    • PrometheusErrorSendingAlertsToAnyAlertmanager (Critical)
    • PrometheusBadConfig (Critical)

MonitoringServiceDegraded
  Severity: Warning
  Summary: The monitoring service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=prometheus-operator-prometheus)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=prometheus-operator-prometheus)
    • TargetDown (Warning, filter: to be defined)
    • PrometheusTargetLimitHit (Warning)
    • PrometheusTSDBReloadsFailing (Warning)
    • PrometheusTSDBCompactionsFailing (Warning)
    • PrometheusRemoteWriteDesiredShards (Warning)
    • PrometheusOutOfOrderTimestamps (Warning)
    • PrometheusNotificationQueueRunningFull (Warning)
    • PrometheusNotIngestingSamples (Warning)
    • PrometheusNotConnectedToAlertmanagers (Warning)
    • PrometheusMissingRuleEvaluations (Warning)
    • PrometheusErrorSendingAlertsToSomeAlertmanagers (Warning)
    • PrometheusDuplicateTimestamps (Warning)
    • PrometheusOperatorWatchErrors (Warning)
    • PrometheusOperatorSyncFailed (Warning)
    • PrometheusOperatorRejectedResources (Warning)
    • PrometheusOperatorReconcileErrors (Warning)
    • PrometheusOperatorNotReady (Warning)
    • PrometheusOperatorNodeLookupErrors (Warning)
    • PrometheusOperatorListErrors (Warning)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-prometheus-operator-prometheus)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-operator)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-monitoring/prometheus-operator-prometheus-node-exporter)

LoggingServiceAtRisk
  Severity: Critical
  Summary: The logging service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

LoggingServiceDegraded
  Severity: Warning
  Summary: The logging service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=loki)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=loki)
    • TargetDown (Warning, filter: to be defined)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-logging/loki)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-logging/fluentbit)

AlertingServiceAtRisk
  Severity: Critical
  Summary: The alerting service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

AlertingServiceDegraded
  Severity: Warning
  Summary: The alerting service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=prometheus-operator-alertmanager)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=prometheus-operator-alertmanager)
    • TargetDown (Warning, filter: to be defined)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/alertmanager-prometheus-operator-alertmanager)
    • AlertmanagerFailedReload (Warning)

DashboardingServiceDegraded
  Severity: Warning
  Summary: The dashboarding service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-grafana)
    • TargetDown (Warning, filter: to be defined)

Network

ControlPlaneNetworkDegraded
  Severity: Warning
  Summary: The Control Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning, filter: need to filter on the proper cp interface)
    • NodeHighNumberConntrackEntriesUsed (Warning, filter: need to filter on the proper cp interface)
    • NodeNetworkTransmitErrs (Warning, filter: need to filter on the proper cp interface)
    • NodeNetworkInterfaceFlapping (Warning, filter: need to filter on the proper cp interface)

WorkloadPlaneNetworkDegraded
  Severity: Warning
  Summary: The Workload Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning, filter: need to filter on the proper wp interface)
    • NodeHighNumberConntrackEntriesUsed (Warning, filter: need to filter on the proper wp interface)
    • NodeNetworkTransmitErrs (Warning, filter: need to filter on the proper wp interface)
    • NodeNetworkInterfaceFlapping (Warning, filter: need to filter on the proper wp interface)

Note

The name of the interface used by Workload Plane and/or Control Plane is not known in advance. As such, we should find a way to automatically configure the Network alerts based on Network configuration.
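For example (a sketch only: <cp_interface> would have to be substituted at deployment time, e.g. by Salt, since the interface name is not known in advance), the Control Plane grouping rule could be rendered with the proper device filter:

    - alert: ControlPlaneNetworkDegraded
      expr: |
        sum(
          ALERTS{
            alertname=~"NodeNetworkReceiveErrs|NodeNetworkTransmitErrs|NodeNetworkInterfaceFlapping",
            alertstate="firing",
            device="<cp_interface>"
          }
        ) >= 1
      labels:
        severity: warning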

Note

Currently we don't have any alerts for the Virtual Plane, which is provided by kube-proxy, calico-kube-controllers and calico-node. It is not even part of the MetalK8s dashboard page. We may want to introduce it.

Access

AccessServicesDegraded
  Severity: Warning
  Summary: The Access services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • IngressControllerDegraded (Warning)
    • AuthenticationDegraded (Warning)

IngressControllerDegraded
  Severity: Warning
  Summary: The Ingress Controllers for CP and WP are degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-ingress/ingress-nginx-defaultbackend)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-system/ingress-nginx-controller)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-system/ingress-nginx-control-plane-controller)

AuthenticationDegraded
  Severity: Warning
  Summary: The Authentication service for the K8S API is degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-auth/dex)