add alert for machine with long deletion phase
This change adds the MachineNotYetDeleted alert for a machine
with phase "Deleting" for longer than 360 minutes, at "warning" severity.
Also updates the alerts documentation with new information.

The MachineWithNoRunningPhase alert has also been modified to exclude
machines with the "Deleting" phase to ensure that we do not double fire
an alert for this condition.

ref: https://issues.redhat.com/browse/OCPCLOUD-921
elmiko committed Feb 23, 2021
1 parent 682004f commit b13dbe9
Showing 2 changed files with 32 additions and 2 deletions.
22 changes: 21 additions & 1 deletion docs/user/Alerts.md
@@ -22,7 +22,7 @@ Machine did not reach the “Running” Phase. Running phase is when the machin
### Query
```
# for: 10m
-(mapi_machine_created_timestamp_seconds{phase!="Running"}) > 0
+(mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
```
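Note that excluding two phases with an alternation like `Running|Deleting` requires Prometheus's regex matcher `!~`; the exact matcher `!=` compares the label against the literal string `Running|Deleting`, which no machine phase ever equals, so nothing would be excluded. A minimal Python sketch of the two matcher semantics (the `matches` helper is illustrative, not Prometheus code):

```python
import re

def matches(label_value: str, op: str, pattern: str) -> bool:
    """Approximate Prometheus label-matcher semantics for != and !~."""
    if op == "!=":   # exact string inequality
        return label_value != pattern
    if op == "!~":   # negated, fully-anchored regex match
        return re.fullmatch(pattern, label_value) is None
    raise ValueError(f"unknown matcher: {op}")

# A machine in the "Running" phase should be excluded from the alert:
print(matches("Running", "!=", "Running|Deleting"))  # True  -> series kept, alert would still fire
print(matches("Running", "!~", "Running|Deleting"))  # False -> series excluded, as intended
```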

### Possible Causes
@@ -33,6 +33,26 @@ Machine did not reach the “Running” Phase. Running phase is when the machin
### Resolution
If the machine never became a node, consult the machine troubleshooting guide.

## MachineNotYetDeleted
Machine has been in the "Deleting" phase for a long time. A machine enters the "Deleting" phase once it has been marked for deletion and given a deletion timestamp in etcd.

### Query
```
# for: 360m
(mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
```
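For an ad-hoc check (assuming access to the cluster's Prometheus console), the same metric can show how long each deleting machine has existed. Note that the metric records the machine's creation timestamp, so the six-hour window comes from the rule's `for: 360m` clause holding while the phase is "Deleting", not from the query itself:

```
# Seconds since creation for every machine currently in the Deleting phase
time() - mapi_machine_created_timestamp_seconds{phase="Deleting"}
```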

### Possible Causes
* Invalid cloud credentials are preventing deletion.
* A [Pod disruption budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) is
preventing Node removal.
* A Pod with a very long [graceful termination period](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#graceful-termination-of-preemption-victims) is preventing Node removal.

### Resolution
Consult the `machine-controller`'s logs for root causes (see the [Troubleshooting Guide](TroubleShooting.md)). In some
cases the machine may need to be removed manually, starting with the instance in the cloud provider's console and
then the machine in OpenShift.

## MachineAPIOperatorMetricsCollectionFailing
Machine-api metrics are not being collected successfully. This would be a very unusual error to see.

12 changes: 11 additions & 1 deletion install/0000_90_machine-api-operator_04_alertrules.yaml
@@ -26,12 +26,22 @@ spec:
rules:
- alert: MachineWithNoRunningPhase
expr: |
-(mapi_machine_created_timestamp_seconds{phase!="Running"}) > 0
+(mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
for: 60m
labels:
severity: warning
annotations:
message: "machine {{ $labels.name }} is in phase: {{ $labels.phase }}"
- name: machine-not-yet-deleted
rules:
- alert: MachineNotYetDeleted
expr: |
(mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
for: 360m
labels:
severity: warning
annotations:
message: "machine {{ $labels.name }} has been in Deleting phase for more than 6 hours"
- name: machine-api-operator-metrics-collector-up
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
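A rule like the one added above can be exercised with `promtool test rules` before shipping. A minimal sketch of a unit-test file (the filenames, machine name, and series values here are illustrative, not part of the commit):

```yaml
# machine_alerts_test.yaml -- hypothetical promtool unit test
rule_files:
  - machine_alerts.yaml   # assumed extract of the rule groups from the PrometheusRule manifest
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # machine stuck in the Deleting phase for the whole test window
      - series: 'mapi_machine_created_timestamp_seconds{name="worker-a",phase="Deleting"}'
        values: '1x400'
    alert_rule_test:
      - eval_time: 361m
        alertname: MachineNotYetDeleted
        exp_alerts:
          - exp_labels:
              severity: warning
              name: worker-a
              phase: Deleting
            exp_annotations:
              message: "machine worker-a has been in Deleting phase for more than 6 hours"
```

Run with `promtool test rules machine_alerts_test.yaml`; promtool expects plain Prometheus rule files, so the `groups:` section would first need to be extracted from the operator's PrometheusRule manifest.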
