Add additional monitoring rules to the PrometheusRule #791
What type of PR is this?
/kind enhancement
What does this PR do / why we need it:
This PR provides additional alerting rules. Specifically, it captures the following situations:
This helps users better monitor Argo CD Applications using the built-in OpenShift monitoring stack.
A couple of additional comments:
The alert for Applications progressing for more than 10 minutes might ruffle some feathers, since the health check for Subscriptions leaves them in a Progressing state rather than Suspended. I'm working on adjusting the health check upstream, but it's not there yet. Note that the alert can be silenced if customers find it annoying; we could also lower the severity to info.
I chose to make Unknown for Sync State critical since it means the Application is not syncing properly. However, if folks feel this is too high, it can be dropped down to warning. If we do this, it can be combined with the ArgoCDSyncAlert since they would share the same severity.
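To illustrate what combining the two sync alerts could look like if they end up sharing a severity, here is a minimal sketch, not the operator's actual code. It assumes the rule expression selects on the `argocd_app_info` metric with `namespace` and `sync_status` labels; `syncStatusExpr` is a hypothetical helper introduced only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// syncStatusExpr is a hypothetical helper that builds a PromQL expression
// matching Applications in the given namespace that report any of the given
// sync statuses.
// Assumption: the alert is based on the argocd_app_info metric and its
// namespace and sync_status labels.
func syncStatusExpr(namespace string, statuses ...string) string {
	matcher := "="
	value := ""
	if len(statuses) > 0 {
		value = statuses[0]
	}
	if len(statuses) > 1 {
		// PromQL regex matchers are fully anchored, so "A|B" matches
		// exactly A or exactly B.
		matcher = "=~"
		value = strings.Join(statuses, "|")
	}
	return fmt.Sprintf(`argocd_app_info{namespace=%q,sync_status%s%q} > 0`,
		namespace, matcher, value)
}
```

With a single status the helper emits an equality matcher; with `"OutOfSync", "Unknown"` it emits one regex matcher, so one rule with one severity would cover both states.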
I wanted to rename ArgoCDSyncAlert to ArgoCDOutOfSyncAlert, but realized that customers may have monitoring and configuration that depend on this name, so I have left it unchanged.
Have you updated the necessary documentation?
The documentation does not mention specific alerts AFAIK, so I do not think it needs to be covered. However, this should be included in the release notes.
Which issue(s) this PR fixes:
https://issues.redhat.com/browse/GITOPS-4873
Test acceptance criteria:
Updated the unit tests. However, I wonder whether the approach could be improved by parameterizing the MonitoringRules so that the code and the unit tests share the same definitions.
How to test changes / Special notes to the reviewer:
Deploy applications with bad sync and health statuses and verify that OpenShift alerts trigger after the alert duration expires.