Mirrored queue crash with out of sync ACKs #749
Comments
The problem arises when the following situation occurs:
I'm still investigating whether this is a race condition in the database synchronization. Maybe we could do something with the unexpected elements in the queue? I don't know yet.
I created a PR that solves the crash, but not the root cause. I couldn't find a way to repair the broken ring: the nodes in the partial partition can't see the update through the 'live' node. It may be a timing issue, or mnesia may already be inconsistent and no longer applying updates, as the inconsistent-database event is triggered just afterwards.
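For illustration only, here is a minimal sketch of that kind of defensive fix. This is not the actual PR; the module name and the {MsgSeqNo, Msg} queue shape are assumptions. The idea is that, instead of a strict match on the head of the pending-publish queue crashing the GM process, out-of-sync acks are simply dropped:

```erlang
%% Hypothetical sketch, not the actual PR. PubQ is assumed to be a
%% queue of {MsgSeqNo, Msg} pairs awaiting acknowledgement.
-module(gm_ack_sketch).
-export([apply_acks/2]).

apply_acks([], PubQ) ->
    PubQ;
apply_acks([AckNo | Rest], PubQ) ->
    case queue:out(PubQ) of
        {{value, {AckNo, _Msg}}, PubQ1} ->
            %% Ack matches the oldest pending publish: consume it.
            apply_acks(Rest, PubQ1);
        {{value, _OutOfSync}, _} ->
            %% Out-of-sync ack: a strict match here would crash the
            %% process. Drop the ack and keep the queue intact.
            apply_acks(Rest, PubQ);
        {empty, _} ->
            %% Acks for publishes this mirror never saw: drop them.
            PubQ
    end.
```

This keeps the mirror alive at the cost of silently discarding acks it cannot account for, which is why it addresses the crash but not the root cause.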
When some ring members are unreachable, ignoring log operations for them is probably about as well as we can do. I will take a look at the specifics in a bit.
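As a rough illustration of that idea (a hypothetical sketch; the {Sender, Pubs, Acks} entry shape and the alive-set representation are assumptions, not GM's actual internals), activity entries from unreachable members could be filtered out before being applied:

```erlang
%% Hypothetical sketch: drop activity entries whose sender is not a
%% currently reachable ring member, instead of applying them.
-module(gm_activity_sketch).
-export([filter_activity/2]).

%% Activity: list of {SenderPid, Pubs, Acks}; Alive: a sets:set() of
%% ring members we can still reach.
filter_activity(Activity, Alive) ->
    [Entry || {Sender, _Pubs, _Acks} = Entry <- Activity,
              sets:is_element(Sender, Alive)].
```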
This issue is a bit too "inside baseball", so it is not being included in the release notes.
Using the patch for #714, in a 3-node cluster configured to test #545, the GM might eventually crash when processing an 'activity' message, which I believe leads to crashes on the other nodes as well.
This is not suspected to have been introduced by #714, but rather to be a consequence of the deadlock being resolved: with pause_minority, the system keeps running during partial partitions and eventually reaches an inconsistent state.
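For reference, pause_minority is the cluster_partition_handling strategy, set in the classic rabbitmq.config format:

```erlang
%% rabbitmq.config: pause nodes on the minority side of a partition
%% instead of letting both sides keep running.
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
```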