add alert for machine with long deletion phase
This change adds the MachineNotYetDeleted alert for a machine
with phase "Deleting" for longer than 360 minutes, at "warning" severity.
Also updates the alerts documentation with new information.

The MachineWithNoRunningPhase alert has also been modified to exclude
machines with the "Deleting" phase to ensure that we do not double fire
an alert for this condition.

ref: https://issues.redhat.com/browse/OCPCLOUD-921
elmiko committed Feb 23, 2021
1 parent 682004f commit b13dbe9
Showing 2 changed files with 32 additions and 2 deletions.
22 changes: 21 additions & 1 deletion docs/user/Alerts.md
@@ -22,7 +22,7 @@ Machine did not reach the “Running” Phase. Running phase is when the machin
### Query
```
# for: 10m
-(mapi_machine_created_timestamp_seconds{phase!="Running"}) > 0
+(mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
```
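Note that excluding two phases with an alternation like `Running|Deleting` requires Prometheus's regex matcher `!~`; the exact matcher `!=` compares the label against the literal string `Running|Deleting`, which no machine phase ever equals, so nothing would be excluded. A minimal Python sketch of the two matcher semantics (the `matches` helper is illustrative, not Prometheus code):

```python
import re

def matches(label_value: str, op: str, pattern: str) -> bool:
    """Approximate Prometheus label-matcher semantics for != and !~."""
    if op == "!=":   # exact string inequality
        return label_value != pattern
    if op == "!~":   # negated, fully-anchored regex match
        return re.fullmatch(pattern, label_value) is None
    raise ValueError(f"unknown matcher: {op}")

# A machine in the "Running" phase should be excluded from the alert:
print(matches("Running", "!=", "Running|Deleting"))  # True  -> series kept, alert would still fire
print(matches("Running", "!~", "Running|Deleting"))  # False -> series excluded, as intended
```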

### Possible Causes
@@ -33,6 +33,26 @@ Machine did not reach the “Running” Phase. Running phase is when the machin
### Resolution
If the machine never became a node, consult the machine troubleshooting guide.

## MachineNotYetDeleted
Machine has been in the "Deleting" phase for a long time. A machine enters the "Deleting" phase once it has been marked for deletion and given a deletion timestamp in etcd.

### Query
```
# for: 360m
(mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
```
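For an ad-hoc check (assuming access to the cluster's Prometheus console), the same metric can show how long each deleting machine has existed. Note that the metric records the machine's creation timestamp, so the six-hour window comes from the rule's `for: 360m` clause holding while the phase is "Deleting", not from the query itself:

```
# Seconds since creation for every machine currently in the Deleting phase
time() - mapi_machine_created_timestamp_seconds{phase="Deleting"}
```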

### Possible Causes
* Invalid cloud credentials are preventing deletion.
* A [Pod disruption budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) is
preventing Node removal.
* A Pod with a very long [graceful termination period](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#graceful-termination-of-preemption-victims) is preventing Node removal.

### Resolution
Consult the `machine-controller`'s logs for root causes (see the [Troubleshooting Guide](TroubleShooting.md)). In some
cases the machine may need to be removed manually, starting with the instance in the cloud provider's console and
then the machine in OpenShift.

## MachineAPIOperatorMetricsCollectionFailing
Machine-api metrics are not being collected successfully. This would be a very unusual error to see.

12 changes: 11 additions & 1 deletion install/0000_90_machine-api-operator_04_alertrules.yaml
@@ -26,12 +26,22 @@ spec:
rules:
- alert: MachineWithNoRunningPhase
expr: |
-(mapi_machine_created_timestamp_seconds{phase!="Running"}) > 0
+(mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
for: 60m
labels:
severity: warning
annotations:
message: "machine {{ $labels.name }} is in phase: {{ $labels.phase }}"
- name: machine-not-yet-deleted
rules:
- alert: MachineNotYetDeleted
expr: |
(mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
for: 360m
labels:
severity: warning
annotations:
message: "machine {{ $labels.name }} has been in Deleting phase for more than 6 hours"
- name: machine-api-operator-metrics-collector-up
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
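A rule like the one added above can be exercised with `promtool test rules` before shipping. A minimal sketch of a unit-test file (the filenames, machine name, and series values here are illustrative, not part of the commit):

```yaml
# machine_alerts_test.yaml -- hypothetical promtool unit test
rule_files:
  - machine_alerts.yaml   # assumed extract of the rule groups from the PrometheusRule manifest
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # machine stuck in the Deleting phase for the whole test window
      - series: 'mapi_machine_created_timestamp_seconds{name="worker-a",phase="Deleting"}'
        values: '1x400'
    alert_rule_test:
      - eval_time: 361m
        alertname: MachineNotYetDeleted
        exp_alerts:
          - exp_labels:
              severity: warning
              name: worker-a
              phase: Deleting
            exp_annotations:
              message: "machine worker-a has been in Deleting phase for more than 6 hours"
```

Run with `promtool test rules machine_alerts_test.yaml`; promtool expects plain Prometheus rule files, so the `groups:` section would first need to be extracted from the operator's PrometheusRule manifest.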
