In theory, the process responsible for restarting the node during healing can crash silently before executing the stop part (inner function in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_autoheal.erl#L364). Any exception raised before the process is registered would not be logged, and would leave the node monitor and autoheal in a state where they ignore further healing commands from the winner node.
We have seen logs supporting this theory: nodes kept ignoring the `winner_is` messages for over an hour after logging `Autoheal: we were selected to restart; winner is ...`, and never stopped.
As such, we should monitor the spawned process and abort the healing process if an unexpected crash is detected. More logging will provide useful information if the issue recurs.
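A minimal sketch of the monitoring idea, not the actual `rabbit_autoheal` code: the module name, `run_restart/1`, and the `restart_completed` / `abort_healing` return values are all hypothetical and chosen only for illustration. It spawns the restart function with `spawn_monitor/1` and treats any exit reason other than `normal` as a signal to log the crash and abort healing.

```erlang
-module(autoheal_monitor_sketch).
-export([run_restart/1, demo/0]).

%% Hypothetical helper (not part of rabbit_autoheal): runs RestartFun in a
%% monitored process and reports how that process ended.
run_restart(RestartFun) when is_function(RestartFun, 0) ->
    {Pid, MRef} = spawn_monitor(RestartFun),
    receive
        {'DOWN', MRef, process, Pid, normal} ->
            %% The restart process finished without crashing.
            restart_completed;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Unexpected crash: log it so it is no longer silent, and
            %% tell the caller to leave the "restarting" state.
            error_logger:warning_msg(
              "Autoheal: restart process ~p crashed: ~p~n", [Pid, Reason]),
            {abort_healing, Reason}
    end.

%% Exercises both outcomes with stand-in functions.
demo() ->
    restart_completed = run_restart(fun() -> ok end),
    {abort_healing, simulated_crash} =
        run_restart(fun() -> exit(simulated_crash) end),
    ok.
```

The blocking `receive` keeps the sketch short. In a long-running server process the `'DOWN'` message would presumably be handled through that process's normal message handling instead, so that it can keep serving other requests while the restart is in progress.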