
locks_leader:call/2 timeouts after joining new node #30

Open · x0id opened this issue Dec 14, 2016 · 17 comments

@x0id (Contributor) commented Dec 14, 2016

While trying to use the locks app (master branch, c9b585a), I ran into an interesting failure in the scenario described below.

There were 4 nodes alive and connected to each other - A, B, C and D. Node D was the leader.
At some point a new node, E, was started; it discovered the other running nodes and connected to them.
Before node E had even connected to the other nodes, it decided that it was a leader.

Once node E connected to the other nodes, it sent its leadership info to them. For each of the 3 non-leaders A, B and C, node E's locks_leader callback elected(State, Election, Pid) was called with the "Pid" of the "joined" node's (A, B or C) process. In turn, the locks_leader callback surrendered(State, Synch, Election) was called on nodes A, B and C.

When the new leader E connected to the old leader D, a netsplit was detected. Node D won: its locks_leader callback elected(State, Election, undefined) was called, and all other nodes (A, B, C and E) were notified via the surrendered(State, Synch, Election) callback, so node E was informed that it was no longer a leader.

Since then, all locks_leader:call/2 calls made on nodes A, B and C have ended in a timeout. The same call made on D and E worked as usual, with no errors. So it seems that the internal locks_leader state on the "passive" nodes A, B and C was corrupted by the fighting leaders D and E...
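
For reference, here is a minimal fragment of a locks_leader callback module showing the two callbacks involved in the scenario above. The module name, the logging and the return shapes are my assumptions for illustration; this is not the code that produced the failure.

```erlang
%% Illustration only: a stripped-down candidate module logging the two
%% callbacks discussed above. The return shapes ({ok, Broadcast, State} and
%% {ok, State}) are assumed from typical locks_leader callback modules.
-module(my_candidate).   %% hypothetical module name
-export([elected/3, surrendered/3]).

%% Called on the winning node. The third argument is undefined for a full
%% (re)election, or the pid of a single joining candidate.
elected(State, _Election, Pid) ->
    error_logger:info_msg("elected; joined candidate: ~p~n", [Pid]),
    {ok, State, State}.

%% Called on nodes that lost (or gave up) leadership; Synch is whatever the
%% leader returned as the broadcast term from elected/3.
surrendered(State, _Synch, _Election) ->
    error_logger:info_msg("surrendered to the current leader~n", []),
    {ok, State}.
```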

@x0id (Contributor, Author) commented Dec 14, 2016

A race condition seems likely, since this does not happen every time.

@uwiger (Owner) commented Jan 26, 2017

Sorry for not digging into this yet, but I've been a bit swamped.

One thing to check when this happens is whether the locks themselves are consistent (e.g. by checking ets:tab2list(locks_server_locks) on the relevant nodes). Previous problems with locks_leader have tended to be caused by bugs in the notification logic, i.e. the locks agent will sometimes fail to detect that it needs to notify the leader instance of a relevant change in lock status. This could e.g. result in some leader not trying to claim a lock on a newly connected node.
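
For anyone else debugging this, a quick way to collect that table from every connected node in one go (a sketch; it assumes the nodes are already connected and the locks application is running on each of them):

```erlang
%% Dump the locks_server_locks ETS table from this node and all connected
%% nodes, so the per-node views of the locks can be compared side by side.
[{N, rpc:call(N, ets, tab2list, [locks_server_locks])}
 || N <- [node() | nodes()]].
```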

@x0id (Contributor, Author) commented Feb 3, 2017

This happened recently in a group of 3 nodes; the result of ets:tab2list(locks_server_locks) on each node is below:

node@A (leader):
[{lock, [locks_leader, {asg_manager, asg_manager}], 8, <node@A.217.0>, [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 3, direct}]}, {w, [{entry, <node@B.266.0>, <node@B.265.0>, 8, direct}]}], []}]

node@B (newly connected node, hanging on subsequent locks_leader:call/2):
[{lock, [locks_leader, {asg_manager, asg_manager}], 13, <node@B.219.0>, [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 11, direct}]}, {w, [{entry, <node@B.266.0>, <node@B.265.0>, 11, direct}]}, {w, [{entry, <node@C.263.0>, <node@C.261.0>, 13, direct}]}], []}]

node@C:
[{lock, [locks_leader, {asg_manager, asg_manager}], 2, <node@C.217.0>, [{w, [{entry, <node@B.266.0>, <node@B.265.0>, 1, direct}]}, {w, [{entry, <node@C.263.0>, <node@C.261.0>, 2, direct}]}], []}]

@uwiger (Owner) commented Jun 6, 2017

Thanks, that's interesting (and sorry for missing your reply for so long).
So A and C are unaware of each other. I will see if I can figure out what causes it, but it appears to be a variation of the problem I described above.

@uwiger (Owner) commented Jun 6, 2017

Not sure if the above fix solves anything, but it closes a possible loophole and adds test cases. I have not yet been able to reproduce the problem.

@ten0s commented Nov 3, 2017

This project https://github.com/ten0s/locks-test reproduces the issue.

All nodes are started with '-connect_all false'. A new node is about to join a connected cluster that already has a leader. It starts locks_leader without connecting to the others and becomes the leader of its own separate cluster. A bit later the nodes ping each other, the netsplit happens, and sometimes some nodes hang at https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L510.

If the new node connects to the others before starting locks_leader (https://github.com/ten0s/locks-test/blob/master/src/test.erl#L73), then everything works; see the sketch below.

The above fix doesn't change anything.
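
A minimal sketch of the connect-before-start ordering mentioned above; the function name and node list are placeholders, not the actual locks-test code:

```erlang
%% Establish connections to the known cluster nodes first, and only start
%% the locks_leader candidate once the node is connected.
connect_then_start(KnownNodes) ->
    [net_adm:ping(N) || N <- KnownNodes],  %% returns pong per reachable node
    %% ... start the locks_leader candidate here ...
    ok.
```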

@uwiger (Owner) commented Dec 5, 2017

FYI, I've been running your test program. The thing that seems to happen, at least some of the time, is that a node detects that it's no longer the leader, broadcasts a leader_uncertain event, and enters the safe loop. In the cases I've observed where it gets stuck there, the actual leader (holding all the locks) still has that node in its synced list, i.e. it assumes that it has already told that node who the leader is.

One problem I found from visual inspection is that the synced list doesn't always seem to be cleared. Another potential problem is that the candidate needs to be sure that the agent will issue a new notification (once the candidate enters the safe loop), which may mean that an asynchronous await_all_locks request should be issued (there is no such thing yet). An alternative would be to call locks_agent:change_flag(Agent, notify, true), even though it's already set from the start. The way change_flag is implemented, the agent will always respond with a notification of the current status.
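
A sketch of the change_flag alternative, assuming the candidate knows the pid of its locks_agent (the wrapper function name is mine):

```erlang
%% Re-assert the notify flag on the candidate's locks_agent. The flag is
%% already true, but (per the comment above) the agent always responds to
%% change_flag with a notification of the current lock status, giving the
%% candidate a fresh message to act on once it is in the safe loop.
refresh_lock_info(Agent) ->
    locks_agent:change_flag(Agent, notify, true).
```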

The good news is that the core locking algorithm doesn't seem to be at fault.

I will have to attend to other things for a while, but will get back to this. Thanks for the interesting test case, and apologies for the delay in responding.

@uwiger (Owner) commented Dec 11, 2017

Before pushing, I let the locks-test program run continuously for 10 minutes and observed no hiccups or pauses.

@ten0s commented Jul 12, 2018

I upgraded the locks version in https://github.com/ten0s/locks-test.

As far as I can see, the previous version c9b585a hangs within 30 secs; the latest version 8e9b2e3 is much more stable, but it also hangs within 30 mins, in the same place: https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L517.

Check nodes: [{'slave10@127.0.0.1',false},
              {'slave1@127.0.0.1',false},
              {'slave5@127.0.0.1',
                  {'EXIT',
                      {timeout,{gen_server,call,[asg_manager,is_leader]}}}},
              {'slave6@127.0.0.1',
                  {'EXIT',
                      {timeout,{gen_server,call,[asg_manager,is_leader]}}}},
              {'slave4@127.0.0.1',false},
              {'slave3@127.0.0.1',false},
              {'slave9@127.0.0.1',false},
              {'slave2@127.0.0.1',false},
              {'slave7@127.0.0.1',false}]

Press Enter to continue
(slave5@127.0.0.1)6> process_info(whereis(asg_manager), [current_function]).
[{current_function,{locks_leader,safe_loop,1}}]

@uwiger (Owner) commented Jul 12, 2018

Thanks for being so tenacious! :)
I will try to find time to take a look at that again. At least the window seems to have been significantly reduced, albeit not closed entirely.

@x0id (Contributor, Author) commented Jan 13, 2020

I've traced the issue in more detail and found one stable pattern:

node_A: set node_A a leader
...
node_B: set node_B a leader
...
node_C: set node_B an announced leader
node_C: set node_A an announced leader
node_C: unset leader node_A as reported uncertain - now node_C is stuck in the safe_loop, since no further leader announcements are issued.

The key point is that the am_leader message from node_A is sent before the am_leader message from node_B, but the messages are received by node_C in the "reverse" order. This is entirely possible (Erlang only guarantees message ordering between a given pair of processes, not across different senders), and it completely breaks the logic implemented in locks_leader.

I couldn't find any easy way to fix locks_leader.

The locks app itself handles all this asynchronous-communication imbalance and does the job pretty well, but locks_leader rolls some of that back... Would it be possible to use the internal locks info to deduce who the leader is, instead of passing leadership around in announcement messages, which builds an extra layer of async communication?

@uwiger (Owner) commented Jan 13, 2020

> The locks app itself handles all this asynchronous-communication imbalance and does the job pretty well, but locks_leader rolls some of that back... Would it be possible to use the internal locks info to deduce who the leader is, instead of passing leadership around in announcement messages, which builds an extra layer of async communication?

Yes, locks_leader should rely on the internal locks info to decide who the leader is, but it seems to me that the key problem here is deciding when another announcement is needed from the locks_agent. So the leader gets stuck waiting for a new announcement that never arrives, since the locks agents believe they have already shared the latest info.

@uwiger (Owner) commented Jan 13, 2020

A possibility, albeit a clumsy one, might be to have a timeout on the safe_loop, where locks_leader polls the agent for new info. Perhaps this could be limited to e.g. a period after nodeups arrive.
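
A rough sketch of what such polling could look like, reusing the change_flag trick mentioned earlier in this thread; the function, the timeout and the surrounding state handling are assumptions, not the actual locks_leader loop:

```erlang
%% Hypothetical safe-loop wrapper: if no message arrives within PollMs,
%% re-assert the notify flag so the agent re-sends the current lock info,
%% then keep waiting. Returns the first message received.
safe_loop_poll(Agent, PollMs) ->
    receive
        Msg ->
            Msg
    after PollMs ->
        locks_agent:change_flag(Agent, notify, true),
        safe_loop_poll(Agent, PollMs)
    end.
```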

@x0id (Contributor, Author) commented Jan 13, 2020

Currently, locks_leader gets information about its own leadership via lock notifications, but it does not use the internal lock structures to check who the new leader is - it merely relies on receiving am_leader message(s) from other nodes. This is what makes the bad scenario possible.

@x0id (Contributor, Author) commented Jan 13, 2020

Regarding timeouts - this is what we do as a workaround in the app that uses locks_leader: when a timeout happens (due to being stuck in the safe_loop), the app stops the locks_leader process, which then restarts and tries again.
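
A sketch of that workaround, assuming the candidate is registered as asg_manager (as in the earlier comments) and runs under a supervisor that will restart it; the wrapper function is mine, not the actual application code:

```erlang
%% Wrap the leader call; if it times out (the timeout surfaces as an exit
%% with reason {timeout, _}), kill the stuck candidate so its supervisor
%% restarts it, and report the timeout to the caller.
call_or_restart(Request) ->
    try
        locks_leader:call(asg_manager, Request)
    catch
        exit:{timeout, _} ->
            case whereis(asg_manager) of
                undefined -> ok;
                Pid -> exit(Pid, kill)
            end,
            {error, timeout}
    end.
```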

@uwiger (Owner) commented Jan 13, 2020

It might be possible to detect accidental reordering by saving a limited number of am_leader messages and checking lock versions. Another approach might be to double-check am_leader messages against the most recent lock_info and issue leader_uncertain messages if any discrepancy is found.
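
A rough sketch of the version check, with the message shape and the stored version entirely assumed (the real locks_leader internals differ):

```erlang
%% Accept an am_leader announcement only if its lock version is newer than
%% the last one we acted on; stale (reordered) announcements are ignored.
maybe_accept_leader({am_leader, Leader, Vsn}, LastVsn) when Vsn > LastVsn ->
    {accept, Leader, Vsn};
maybe_accept_leader({am_leader, _StaleLeader, _Vsn}, LastVsn) ->
    {ignore, LastVsn}.
```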

I don't have time to experiment with it right now, but will try to get around to it soon.

@uwiger (Owner) commented Sep 21, 2020

Please note PR #42.
It still doesn't solve all the issues, but I think it goes further in allowing the problems to be identified.
