Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on autoheal if the process responsible for restarting the node crashes #1346

Closed
dcorbacho opened this issue Aug 31, 2017 · 0 comments
Assignees
Milestone

Comments

@dcorbacho
Copy link
Contributor

In theory, it is possible that the process responsible of restarting the node during the healing crashes silently before executing the stop part (inner function in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_autoheal.erl#L364). Any exception before registering the process, would not log any error and leave the node monitor and autoheal in such a state that it will ignore other healing commands from the winner node.

We have seen logs supporting this theory and nodes ignoring the winner_is messages from over an hour after logging Autoheal: we were selected to restart; winner is ... and not stopping.

As such, we should monitor the spawned process and abort the healing process if an unexpected crash is detected. More logging will provide useful information if the issue re-occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants