Deadlock on autoheal if the process responsible for restarting the node crashes #1346

dcorbacho · 2017-08-31T11:05:43Z

In theory, the process responsible for restarting the node during healing can crash silently before executing the stop part (inner function in https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_autoheal.erl#L364). Any exception raised before the process is registered would not be logged, and would leave the node monitor and autoheal in a state where they ignore further healing commands from the winner node.

We have seen logs supporting this theory: nodes kept ignoring winner_is messages for over an hour after logging "Autoheal: we were selected to restart; winner is ..." and never stopped.

As such, we should monitor the spawned process and abort the healing process if an unexpected crash is detected. Additional logging will provide useful information if the issue recurs.
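
To illustrate the monitor-and-abort idea, here is a minimal, language-neutral sketch in Python. It is not the actual fix (that lives in the Erlang code of rabbit_autoheal and the linked PR); every name here (`Autoheal`, `start_restart`, `handle_next_event`, the state strings) is hypothetical, and a thread plus a queue merely stand in for a spawned process and its owner's mailbox.

```python
import queue
import threading

# Hypothetical state names, chosen for this sketch only.
NOT_HEALING = "not_healing"
RESTARTING = "restarting"


class Autoheal:
    """Toy stand-in for the autoheal state kept by the node monitor."""

    def __init__(self):
        self.state = NOT_HEALING
        self.events = queue.Queue()  # plays the role of the owner's mailbox

    def start_restart(self, restart_fun):
        """Spawn the restart worker and watch it instead of fire-and-forget."""
        self.state = RESTARTING

        def watched():
            try:
                restart_fun()
                self.events.put(("done", None))
            except Exception as exc:
                # Report the crash to the owner rather than dying silently.
                self.events.put(("crashed", exc))

        threading.Thread(target=watched, daemon=True).start()

    def handle_next_event(self, timeout=5.0):
        """Process one worker notification; abort healing on a crash."""
        kind, reason = self.events.get(timeout=timeout)
        if kind == "crashed":
            print(f"Autoheal: restart worker crashed ({reason!r}); aborting healing")
        # Either way, leave the restarting state so later commands are handled.
        self.state = NOT_HEALING
        return kind


def failing_restart():
    # Simulates a crash before the "stop" part is ever reached.
    raise RuntimeError("crashed before stopping the node")


autoheal = Autoheal()
autoheal.start_restart(failing_restart)
outcome = autoheal.handle_next_event()
```

The point of the sketch is only that the owner learns about the crash and resets its state, so later healing commands from the winner are no longer ignored.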

dcorbacho self-assigned this Aug 31, 2017

dcorbacho mentioned this issue Aug 31, 2017

Link process responsible of restart during autoheal, and abort if needed #1347

Merged

michaelklishin added bug effort-medium labels Aug 31, 2017

michaelklishin added this to the 3.6.12 milestone Aug 31, 2017

michaelklishin closed this as completed Sep 5, 2017
