test/e2e/upgrade/alert: Allow high-CPU alerts
We've allowed these for non-update jobs since 12b022c (allow high
CPU alerts to be firing and pending, 2021-04-26, openshift#26102).  But they
show up in update jobs too.  For example [1] included:

  alert ExtremelyHighIndividualControlPlaneCPU fired for 60 seconds with labels: {instance="ci-op-vjm670pq-1ff06-pn8bq-master-1", severity="critical"}
  alert HighOverallControlPlaneCPU fired for 240 seconds with labels: {severity="warning"}

Searching for recent frequency:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+.*High.*ControlPlaneCPU+fired+for' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 49 runs, 65% failed, 3% of failures match = 2% impact
  periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
  release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1417199789052792832
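The "impact" column in the search output above appears to be the share of all runs that both failed and matched the query, i.e. the failure rate multiplied by the match rate. A minimal sketch of that arithmetic, assuming the tool rounds to the nearest whole percent (`impactPercent` is a hypothetical helper, not part of search.ci, which may well compute the figure from raw run counts instead):

```go
package main

import (
	"fmt"
	"math"
)

// impactPercent estimates the percentage of all runs that both failed and
// matched the search. Inputs are percentages as printed in the search output.
// Hypothetical helper: the rounding rule is an assumption.
func impactPercent(failedPct, matchPct float64) int {
	return int(math.Round(failedPct * matchPct / 100))
}

func main() {
	// Rows from the search output in the commit message.
	fmt.Println(impactPercent(65, 3))    // 4.9-e2e-gcp-upgrade
	fmt.Println(impactPercent(100, 100)) // 4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade
	fmt.Println(impactPercent(100, 17))  // ovn-kubernetes e2e-gcp-ovn-upgrade
	fmt.Println(impactPercent(100, 50))  // libvirt-ppc64le-4.7-to-4.8
}
```

Under that assumption the four rows reproduce the reported 2%, 100%, 17%, and 50% impact figures.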
wking committed Jul 20, 2021
1 parent a8864ca commit bbb7b48
Showing 1 changed file with 14 additions and 0 deletions.
14 changes: 14 additions & 0 deletions test/e2e/upgrade/alert/alert.go
@@ -107,6 +107,16 @@ func (t *UpgradeTest) Test(f *framework.Framework, done <-chan struct{}, upgrade
 			Text:     "https://bugzilla.redhat.com/show_bug.cgi?id=1955489",
 		},
 	}
+	allowedFiringAlerts := helper.MetricConditions{
+		{
+			Selector: map[string]string{"alertname": "HighOverallControlPlaneCPU"},
+			Text:     "high CPU utilization during e2e runs is normal",
+		},
+		{
+			Selector: map[string]string{"alertname": "ExtremelyHighIndividualControlPlaneCPU"},
+			Text:     "high CPU utilization during e2e runs is normal",
+		},
+	}

 	pendingAlertsWithBugs := helper.MetricConditions{
 		{
@@ -176,6 +186,10 @@ count_over_time(ALERTS{alertstate="firing",severity!="info",alertname!~"Watchdog
 	for _, series := range result.Data.Result {
 		labels := helper.StripLabels(series.Metric, "alertname", "alertstate", "prometheus")
 		violation := fmt.Sprintf("alert %s fired for %s seconds with labels: %s", series.Metric["alertname"], series.Value, helper.LabelsAsSelector(labels))
+		if cause := allowedFiringAlerts.Matches(series); cause != nil {
+			debug.Insert(fmt.Sprintf("%s (allowed: %s)", violation, cause.Text))
+			continue
+		}
 		if cause := firingAlertsWithBugs.Matches(series); cause != nil {
 			knownViolations.Insert(fmt.Sprintf("%s (open bug: %s)", violation, cause.Text))
 		} else {
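The second hunk checks the allow-list before the known-bug list, so an allowed alert is logged as debug output and never reaches the violation sets. The ordering can be illustrated with a self-contained sketch. The `condition` and `conditions` types and the `matches` and `classify` functions below are hypothetical stand-ins, not the real `helper.MetricConditions` API; the sketch assumes a selector matches when every one of its label pairs is present on the series with an equal value.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for one helper.MetricConditions entry.
type condition struct {
	Selector map[string]string
	Text     string
}

type conditions []condition

// matches returns the first condition whose selector labels are all present
// in labels with equal values, or nil if none match. Assumed semantics only.
func (cs conditions) matches(labels map[string]string) *condition {
	for i := range cs {
		ok := true
		for k, want := range cs[i].Selector {
			if got, present := labels[k]; !present || got != want {
				ok = false
				break
			}
		}
		if ok {
			return &cs[i]
		}
	}
	return nil
}

// classify mirrors the order of checks in the diff: allowed alerts first,
// then alerts with open bugs, and everything else is unexpected.
func classify(allowed, withBugs conditions, labels map[string]string) string {
	if c := allowed.matches(labels); c != nil {
		return "allowed: " + c.Text
	}
	if c := withBugs.matches(labels); c != nil {
		return "open bug: " + c.Text
	}
	return "unexpected"
}

func main() {
	allowed := conditions{
		{
			Selector: map[string]string{"alertname": "HighOverallControlPlaneCPU"},
			Text:     "high CPU utilization during e2e runs is normal",
		},
	}
	series := map[string]string{
		"alertname": "HighOverallControlPlaneCPU",
		"severity":  "warning",
	}
	fmt.Println(classify(allowed, nil, series))
}
```

Note that under these assumed semantics a condition with an empty selector would match every series, so allow-list entries should always name at least the `alertname`.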