opni-alerting infinite hang / crashback loop #542

alexandreLamarre · 2022-09-01T18:52:48Z

The way we manipulate the config files / config maps can leave the AlertManager config in an invalid state, causing an infinite hang / crashback loop.

This was caused by a valid email receiver being configured while the SMTP smarthost was missing or invalid.

So the fix for this should be in two parts:

1. Dynamically detect email configs, and deploy an opni SMTP smarthost if they exist; destroy the SMTP smarthost when they go from existing -> not existing, in order to keep deployments lean.
2. [More importantly] Include functionality similar to the amtool check-config command when applying patches to the underlying config file / config map.
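A minimal sketch of the two checks described above, written against hypothetical, trimmed-down types rather than the real Alertmanager config structs. `Config`, `Receiver`, `EmailConfig`, `ValidateSMTP` and `NeedsSmarthost` are names invented for illustration; the real fix would presumably run against whatever types the opni alerting code already uses.

```go
package main

import (
	"errors"
	"fmt"
)

// EmailConfig, Receiver and Config are hypothetical stand-ins for the
// corresponding Alertmanager config types, reduced to the fields needed here.
type EmailConfig struct {
	To        string
	Smarthost string // per-receiver override; empty means "use the global one"
}

type Receiver struct {
	Name         string
	EmailConfigs []EmailConfig
}

type Config struct {
	GlobalSMTPSmarthost string
	Receivers           []Receiver
}

// ErrNoSmarthost marks an email receiver that has no smarthost to send through.
var ErrNoSmarthost = errors.New("email receiver has no SMTP smarthost")

// NeedsSmarthost reports whether any receiver defines an email config,
// i.e. whether a smarthost deployment is required at all.
func NeedsSmarthost(cfg Config) bool {
	for _, r := range cfg.Receivers {
		if len(r.EmailConfigs) > 0 {
			return true
		}
	}
	return false
}

// ValidateSMTP rejects a config in which an email receiver has neither its
// own smarthost nor a global one to fall back on.
func ValidateSMTP(cfg Config) error {
	for _, r := range cfg.Receivers {
		for _, e := range r.EmailConfigs {
			if e.Smarthost == "" && cfg.GlobalSMTPSmarthost == "" {
				return fmt.Errorf("receiver %q: %w", r.Name, ErrNoSmarthost)
			}
		}
	}
	return nil
}

func main() {
	cfg := Config{Receivers: []Receiver{
		{Name: "ops", EmailConfigs: []EmailConfig{{To: "ops@example.com"}}},
	}}
	fmt.Println("needs smarthost:", NeedsSmarthost(cfg))
	fmt.Println("before fix:", ValidateSMTP(cfg))

	// Placeholder address, not a real opni endpoint.
	cfg.GlobalSMTPSmarthost = "smtp.example.com:587"
	fmt.Println("after fix:", ValidateSMTP(cfg))
}
```

Running the validation before writing the patched config, and refusing the patch on error, would keep the invalid state from ever reaching the config map.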


alexandreLamarre · 2022-09-01T19:33:07Z

Turns out they introduced a new http_enabledv2 YAML flag in the HTTP configs in prometheus/common, which is incompatible with AM.

AM currently uses:

	github.com/prometheus/common v0.32.1 // indirect

while we use v0.3.4
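One plausible mechanism for this kind of version skew, shown as a sketch: the config is serialized by code that knows the new field, then decoded strictly by code built against the older struct, which rejects unknown fields. This is an assumption about the failure mode, not something the issue states. The example swaps in `encoding/json` with `DisallowUnknownFields` as a standard-library analogue of strict YAML decoding; `OldHTTPConfig` and `DecodeStrict` are hypothetical names, and the field name is copied verbatim from the comment above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// OldHTTPConfig is a hypothetical stand-in for an HTTP client config struct
// compiled against an older library version that predates the new flag.
type OldHTTPConfig struct {
	FollowRedirects bool `json:"follow_redirects"`
}

// DecodeStrict decodes raw into an OldHTTPConfig and fails on any field the
// struct does not declare.
func DecodeStrict(raw []byte) (OldHTTPConfig, error) {
	var cfg OldHTTPConfig
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&cfg)
	return cfg, err
}

func main() {
	// Payload as a newer writer might emit it, including the extra flag.
	raw := []byte(`{"follow_redirects": true, "http_enabledv2": true}`)
	if _, err := DecodeStrict(raw); err != nil {
		fmt.Println("rejected by the older reader:", err)
	}
}
```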

alexandreLamarre · 2022-09-08T22:36:39Z

As a follow-up to this issue, I implemented a "pre" reconciler loop that analyzes the errors returned by AlertManager's LoadConfig function, which catches a wide range of invalid configs before they are applied.

The crashback restart errors aren't actually much of a concern in production, since the opni alerting operator handles them: when the rollout restart fails, it reverts it. But...
In some cases AM will start without exiting and run normally, yet prune nodes in the routing tree OR delete receiver configurations under some conditions (like missing defaults). This "softlocks" users out of certain configurations once AlertManager reaches that state, and it fails silently.
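Because this failure is silent, one way to surface it is to compare the routing tree that was requested with the one that was actually loaded and report any receivers that disappeared. The sketch below illustrates that idea only; `Route` and `DroppedReceivers` are hypothetical names, not part of Alertmanager or opni, and the actual "pre" reconciler may work differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a hypothetical, minimal routing-tree node.
type Route struct {
	Receiver string
	Routes   []Route
}

// collect records every receiver name referenced anywhere in the tree.
func collect(r Route, seen map[string]bool) {
	if r.Receiver != "" {
		seen[r.Receiver] = true
	}
	for _, child := range r.Routes {
		collect(child, seen)
	}
}

// DroppedReceivers returns, in sorted order, the receiver names referenced by
// the desired tree that no longer appear in the loaded tree.
func DroppedReceivers(desired, loaded Route) []string {
	want, got := map[string]bool{}, map[string]bool{}
	collect(desired, want)
	collect(loaded, got)

	var dropped []string
	for name := range want {
		if !got[name] {
			dropped = append(dropped, name)
		}
	}
	sort.Strings(dropped)
	return dropped
}

func main() {
	desired := Route{Receiver: "default", Routes: []Route{
		{Receiver: "email-ops"},
		{Receiver: "slack-dev"},
	}}
	// Simulates a tree in which one node was pruned.
	loaded := Route{Receiver: "default", Routes: []Route{
		{Receiver: "slack-dev"},
	}}
	fmt.Println("dropped receivers:", DroppedReceivers(desired, loaded))
}
```

A non-empty result could be treated as a validation failure, so the reconciler rejects or reverts the change instead of leaving users locked out of the pruned configuration.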

alexandreLamarre added the alerting, bug, and enhancement labels Sep 1, 2022

alexandreLamarre self-assigned this Sep 1, 2022

alexandreLamarre closed this as completed Oct 5, 2022
