opni-alerting infinite hang / crashback loop #542

Closed
alexandreLamarre opened this issue Sep 1, 2022 · 2 comments
Labels: alerting, bug, enhancement

Comments

@alexandreLamarre

The way we manipulate the config files / config maps can leave the AlertManager config in an invalid state, causing an infinite hang / crashback loop.

In this case the cause was an invalid (missing) SMTP smarthost alongside an otherwise valid email receiver setup.

So the fix for this should be in two parts:

  • Dynamically detect email configs and deploy an opni SMTP smarthost if they exist, and destroy the SMTP smarthost if they go from existing -> not existing, in order to keep deployments lean.

  • [More importantly] include functionality similar to the amtool check-config command when applying patches to the underlying config file / config map.
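
A minimal sketch of the second point, assuming a simplified config model: the candidate config is validated before it replaces the current one, and a failed check leaves the known-good config in place. The types and function names here (Config, Receiver, EmailConfig, Validate, ApplyPatch, NeedsSmarthost) are hypothetical stand-ins, not opni's or AlertManager's real types; a real implementation would more likely delegate to AlertManager's own config loading, as amtool check-config does.

```go
package main

import (
	"errors"
	"fmt"
)

// EmailConfig, Receiver and Config are simplified, hypothetical stand-ins
// for the corresponding AlertManager config types.
type EmailConfig struct {
	To        string
	Smarthost string // optional per-receiver override of the global smarthost
}

type Receiver struct {
	Name         string
	EmailConfigs []EmailConfig
}

type Config struct {
	GlobalSmarthost string
	Receivers       []Receiver
}

// ErrMissingSmarthost marks the invalid state described in this issue.
var ErrMissingSmarthost = errors.New("email receiver has no SMTP smarthost")

// NeedsSmarthost reports whether any receiver has an email config, i.e.
// whether an SMTP smarthost has to exist at all.
func NeedsSmarthost(c Config) bool {
	for _, r := range c.Receivers {
		if len(r.EmailConfigs) > 0 {
			return true
		}
	}
	return false
}

// Validate rejects an email config that has neither a per-receiver nor a
// global SMTP smarthost.
func Validate(c Config) error {
	for _, r := range c.Receivers {
		for _, e := range r.EmailConfigs {
			if e.Smarthost == "" && c.GlobalSmarthost == "" {
				return fmt.Errorf("receiver %q: %w", r.Name, ErrMissingSmarthost)
			}
		}
	}
	return nil
}

// ApplyPatch returns the candidate only if it validates; otherwise it
// returns the current config unchanged together with the error.
func ApplyPatch(current, candidate Config) (Config, error) {
	if err := Validate(candidate); err != nil {
		return current, err
	}
	return candidate, nil
}

func main() {
	current := Config{}
	candidate := Config{Receivers: []Receiver{
		{Name: "ops", EmailConfigs: []EmailConfig{{To: "ops@example.com"}}},
	}}
	next, err := ApplyPatch(current, candidate)
	fmt.Println("needs smarthost:", NeedsSmarthost(candidate))
	fmt.Println("patch rejected:", err != nil, "receivers kept:", len(next.Receivers))
}
```

Checking before writing the config map means the invalid state never reaches the running AlertManager, so there is nothing to roll back.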

@alexandreLamarre added the alerting, bug, and enhancement labels Sep 1, 2022
@alexandreLamarre self-assigned this Sep 1, 2022
@alexandreLamarre

Turns out a new http_enabledv2 YAML flag was introduced in the HTTP configs in prometheus/common, and it is incompatible with AM.

AM currently uses:

	github.com/prometheus/common v0.32.1 // indirect

while we use v0.3.4
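
One plausible way such a version mismatch surfaces is strict decoding: the newer library serializes a field that the older consumer does not know about and therefore rejects. The sketch below only illustrates that mechanism, using JSON from the Go standard library as a stand-in for YAML. The struct names and StrictDecodeOld are hypothetical, the flag name is copied from the comment above, and the assumption that AM rejects unknown fields is not confirmed in this thread.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// oldHTTPConfig is a hypothetical stand-in for the HTTP client config as
// the older library version knows it.
type oldHTTPConfig struct {
	FollowRedirects bool `json:"follow_redirects"`
}

// newHTTPConfig is the same config in the newer version, with the extra
// flag named in the comment above.
type newHTTPConfig struct {
	FollowRedirects bool `json:"follow_redirects"`
	HTTPEnabledV2   bool `json:"http_enabledv2"`
}

// StrictDecodeOld decodes data into the old struct and returns an error
// if the input contains any field the old struct does not declare.
func StrictDecodeOld(data []byte) (oldHTTPConfig, error) {
	var c oldHTTPConfig
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&c)
	return c, err
}

func main() {
	// The producer (newer library) writes the config, new flag included.
	data, err := json.Marshal(newHTTPConfig{FollowRedirects: true, HTTPEnabledV2: true})
	if err != nil {
		panic(err)
	}
	// The consumer (older library) refuses the unknown field.
	_, err = StrictDecodeOld(data)
	fmt.Println("older consumer error:", err)
}
```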

@alexandreLamarre

As a follow-up to this issue, I implemented a "pre" reconciler loop that analyzes the errors returned by AlertManager's LoadConfig function, which prevents a wide range of config errors.

  • The crashback restart errors aren't actually much of a concern in production, since the opni alerting operator handles those when the rollout restart fails and reverts it, but...
  • In some cases AM will start without exiting and run normally, but will prune nodes in the routing tree OR delete receiver configurations under some conditions (like missing defaults). This "softlocks" users out of certain configurations once AlertManager reaches that state, because it fails silently.
