
locks_leader:call/2 timeouts after joining new node #30

Open · x0id opened this issue Dec 14, 2016 · 17 comments

@x0id (Contributor) commented Dec 14, 2016

While trying to use the locks app (master branch, c9b585a), I ran into an interesting failure in the scenario described below.

There were 4 nodes alive and connected to each other - A, B, C and D. Node D was the leader.
At some point a new node, E, was started; it discovered the other running nodes and connected to them.
Before node E had even connected to the other nodes, it decided that it was a leader.

Once node E connected to the other nodes, it sent its leadership info to them. For each of the 3 non-leaders A, B and C, node E's locks_leader callback elected(State, Election, Pid) was called with the "Pid" of the "joined" node's (A, B or C) process. In turn, the locks_leader callback surrendered(State, Synch, Election) was called on nodes A, B and C.

When the new leader E connected to the old leader D, a netsplit was detected. Node D won: its locks_leader callback elected(State, Election, undefined) was called, and all other nodes (A, B, C and E) were notified via the surrendered(State, Synch, Election) callback, so node E was informed that it was no longer a leader.

Since then, all locks_leader:call/2 calls made on nodes A, B and C have ended in a timeout. The same call made on D and E worked as usual, with no errors. So it seems that the internal locks_leader state on the "passive" nodes A, B and C was corrupted by the fighting leaders D and E...
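
For reference, here is a minimal fragment of a locks_leader callback module showing the two callbacks involved in the scenario above. The module name, the logging and the return shapes are my assumptions for illustration; this is not the code that produced the failure.

```erlang
%% Illustration only: a stripped-down candidate module logging the two
%% callbacks discussed above. The return shapes ({ok, Broadcast, State} and
%% {ok, State}) are assumed from typical locks_leader callback modules.
-module(my_candidate).   %% hypothetical module name
-export([elected/3, surrendered/3]).

%% Called on the winning node. The third argument is undefined for a full
%% (re)election, or the pid of a single joining candidate.
elected(State, _Election, Pid) ->
    error_logger:info_msg("elected; joined candidate: ~p~n", [Pid]),
    {ok, State, State}.

%% Called on nodes that lost (or gave up) leadership; Synch is whatever the
%% leader returned as the broadcast term from elected/3.
surrendered(State, _Synch, _Election) ->
    error_logger:info_msg("surrendered to the current leader~n", []),
    {ok, State}.
```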

@x0id (Contributor, Author) commented Dec 14, 2016

A race condition seems likely, since this does not happen every time.

@uwiger (Owner) commented Jan 26, 2017

Sorry for not digging into this yet, but I've been a bit swamped.

One thing to check when this happens is whether the locks themselves are consistent (e.g. by checking ets:tab2list(locks_server_locks) on the relevant nodes). Previous problems with locks_leader have tended to be caused by bugs in the notification logic, i.e. the locks agent will sometimes fail to detect that it needs to notify the leader instance of a relevant change in lock status. This could e.g. result in some leader not trying to claim a lock on a newly connected node.
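
For anyone else debugging this, a quick way to collect that table from every connected node in one go (a sketch; it assumes the nodes are already connected and the locks application is running on each of them):

```erlang
%% Dump the locks_server_locks ETS table from this node and all connected
%% nodes, so the per-node views of the locks can be compared side by side.
[{N, rpc:call(N, ets, tab2list, [locks_server_locks])}
 || N <- [node() | nodes()]].
```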

@x0id (Contributor, Author) commented Feb 3, 2017

This happened recently in a group of 3 nodes; the result of ets:tab2list(locks_server_locks) on each node is below:

node@A (leader):
[{lock, [locks_leader, {asg_manager, asg_manager}], 8, <node@A.217.0>, [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 3, direct}]}, {w, [{entry, <node@B.266.0>, <node@B.265.0>, 8, direct}]}], []}]

node@B (newly connected node, hanging on subsequent locks_leader:call/2):
[{lock, [locks_leader, {asg_manager, asg_manager}], 13, <node@B.219.0>, [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 11, direct}]}, {w, [{entry, <node@B.266.0>, <node@B.265.0>, 11, direct}]}, {w, [{entry, <node@C.263.0>, <node@C.261.0>, 13, direct}]}], []}]

node@C:
[{lock, [locks_leader, {asg_manager, asg_manager}], 2, <node@C.217.0>, [{w, [{entry, <node@B.266.0>, <node@B.265.0>, 1, direct}]}, {w, [{entry, <node@C.263.0>, <node@C.261.0>, 2, direct}]}], []}]

@uwiger (Owner) commented Jun 6, 2017

Thanks, that's interesting (and sorry for missing your reply for so long).
So A and C are unaware of each other. I will see if I can figure out what causes it, but it appears to be a variation of the problem I described above.

@uwiger (Owner) commented Jun 6, 2017

Not sure if the above fix solves anything, but it closes a possible loophole and adds test cases. I have not yet been able to reproduce the problem.

@ten0s commented Nov 3, 2017

This project https://github.com/ten0s/locks-test reproduces the issue.

All nodes are started with '-connect_all false'. A new node is about to join a connected cluster that already has a leader. It starts locks_leader without connecting to the others and becomes the leader of its own separate cluster. A bit later the nodes ping each other, the netsplit happens, and sometimes some nodes hang at https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L510.

If the new node connects to the others before starting locks_leader (https://github.com/ten0s/locks-test/blob/master/src/test.erl#L73), then everything works; see the sketch below.

The above fix doesn't change anything.
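
A minimal sketch of the connect-before-start ordering mentioned above; the function name and node list are placeholders, not the actual locks-test code:

```erlang
%% Establish connections to the known cluster nodes first, and only start
%% the locks_leader candidate once the node is connected.
connect_then_start(KnownNodes) ->
    [net_adm:ping(N) || N <- KnownNodes],  %% returns pong per reachable node
    %% ... start the locks_leader candidate here ...
    ok.
```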

@uwiger (Owner) commented Dec 5, 2017

FYI, I've been running your test program. The thing that seems to happen, at least some of the time, is that a node detects that it's no longer the leader, broadcasts a leader_uncertain event, and enters the safe loop. In the cases I've observed where it gets stuck there, the actual leader (holding all the locks) still has that node in its synced list, i.e. it assumes that it has already told that node who the leader is.

One problem I found from visual inspection is that the synced list doesn't always seem to be cleared. Another potential problem is that the candidate needs to be sure that the agent will issue a new notification (once the candidate enters the safe loop), which may mean that an asynchronous await_all_locks request should be issued (there is no such thing yet). An alternative would be to call locks_agent:change_flag(Agent, notify, true), even though it's already set from the start. The way change_flag is implemented, the agent will always respond with a notification of the current status.
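
A sketch of the change_flag alternative, assuming the candidate knows the pid of its locks_agent (the wrapper function name is mine):

```erlang
%% Re-assert the notify flag on the candidate's locks_agent. The flag is
%% already true, but (per the comment above) the agent always responds to
%% change_flag with a notification of the current lock status, giving the
%% candidate a fresh message to act on once it is in the safe loop.
refresh_lock_info(Agent) ->
    locks_agent:change_flag(Agent, notify, true).
```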

The good news is that the core locking algorithm doesn't seem to be at fault.

I will have to attend to other things for a while, but will get back to this. Thanks for the interesting test case, and apologies for the delay in responding.

@uwiger (Owner) commented Dec 11, 2017

Before pushing, I let the locks-test program run continuously for 10 minutes and observed no hiccups or pauses.

@ten0s commented Jul 12, 2018

I upgraded the locks version in https://github.com/ten0s/locks-test.

As far as I can see, the previous version c9b585a hangs within 30 secs; the latest version 8e9b2e3 is much more stable, but it also hangs within 30 mins, in the same place: https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L517.

Check nodes: [{'slave10@127.0.0.1',false},
              {'slave1@127.0.0.1',false},
              {'slave5@127.0.0.1',
                  {'EXIT',
                      {timeout,{gen_server,call,[asg_manager,is_leader]}}}},
              {'slave6@127.0.0.1',
                  {'EXIT',
                      {timeout,{gen_server,call,[asg_manager,is_leader]}}}},
              {'slave4@127.0.0.1',false},
              {'slave3@127.0.0.1',false},
              {'slave9@127.0.0.1',false},
              {'slave2@127.0.0.1',false},
              {'slave7@127.0.0.1',false}]

Press Enter to continue
(slave5@127.0.0.1)6> process_info(whereis(asg_manager), [current_function]).
[{current_function,{locks_leader,safe_loop,1}}]

@uwiger (Owner) commented Jul 12, 2018

Thanks for being so tenacious! :)
I will try to find time to take a look at that again. At least the window seems to have been significantly reduced, albeit not closed entirely.

@x0id (Contributor, Author) commented Jan 13, 2020

I've traced the issue in more detail and found one stable pattern:

node_A: set node_A a leader
...
node_B: set node_B a leader
...
node_C: set node_B an announced leader
node_C: set node_A an announced leader
node_C: unset leader node_A as reported uncertain - now node_C is stuck in the safe_loop, since no further leader announcements are issued.

The key point is that the am_leader message from node_A is sent before the am_leader message from node_B, but the messages are received by node_C in the "reverse" order. This is entirely possible (Erlang only guarantees message ordering between a given pair of processes, not across different senders), and it completely breaks the logic implemented in locks_leader.

I couldn't find any easy way to fix locks_leader.

The locks app itself handles all this asynchronous-communication imbalance and does the job pretty well, but locks_leader rolls some of that back... Would it be possible to use the internal locks info to deduce who the leader is, instead of passing leadership around in announcement messages, which builds an extra layer of async communication?

@uwiger (Owner) commented Jan 13, 2020

> The locks app itself handles all this asynchronous-communication imbalance and does the job pretty well, but locks_leader rolls some of that back... Would it be possible to use the internal locks info to deduce who the leader is, instead of passing leadership around in announcement messages, which builds an extra layer of async communication?

Yes, locks_leader should rely on the internal locks info to decide who the leader is, but it seems to me that the key problem here is deciding when another announcement is needed from the locks_agent. So the leader gets stuck waiting for a new announcement that never arrives, since the locks agents believe they have already shared the latest info.

@uwiger (Owner) commented Jan 13, 2020

A possibility, albeit a clumsy one, might be to have a timeout on the safe_loop, where locks_leader polls the agent for new info. Perhaps this could be limited to e.g. a period after nodeups arrive.
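
A rough sketch of what such polling could look like, reusing the change_flag trick mentioned earlier in this thread; the function, the timeout and the surrounding state handling are assumptions, not the actual locks_leader loop:

```erlang
%% Hypothetical safe-loop wrapper: if no message arrives within PollMs,
%% re-assert the notify flag so the agent re-sends the current lock info,
%% then keep waiting. Returns the first message received.
safe_loop_poll(Agent, PollMs) ->
    receive
        Msg ->
            Msg
    after PollMs ->
        locks_agent:change_flag(Agent, notify, true),
        safe_loop_poll(Agent, PollMs)
    end.
```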

@x0id (Contributor, Author) commented Jan 13, 2020

Currently, locks_leader gets information about its own leadership via lock notifications, but it does not use the internal lock structures to check who the new leader is - it merely relies on receiving am_leader message(s) from other nodes. This is what makes the bad scenario possible.

@x0id (Contributor, Author) commented Jan 13, 2020

Regarding timeouts - this is what we do as a workaround in the app that uses locks_leader: when a timeout happens (due to being stuck in the safe_loop), the app stops the locks_leader process, which then restarts and tries again.
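
A sketch of that workaround, assuming the candidate is registered as asg_manager (as in the earlier comments) and runs under a supervisor that will restart it; the wrapper function is mine, not the actual application code:

```erlang
%% Wrap the leader call; if it times out (the timeout surfaces as an exit
%% with reason {timeout, _}), kill the stuck candidate so its supervisor
%% restarts it, and report the timeout to the caller.
call_or_restart(Request) ->
    try
        locks_leader:call(asg_manager, Request)
    catch
        exit:{timeout, _} ->
            case whereis(asg_manager) of
                undefined -> ok;
                Pid -> exit(Pid, kill)
            end,
            {error, timeout}
    end.
```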

@uwiger (Owner) commented Jan 13, 2020

It might be possible to detect accidental reordering by saving a limited number of am_leader messages and checking lock versions. Another approach might be to double-check am_leader messages against the most recent lock_info and issue leader_uncertain messages if any discrepancy is found.
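
A rough sketch of the version check, with the message shape and the stored version entirely assumed (the real locks_leader internals differ):

```erlang
%% Accept an am_leader announcement only if its lock version is newer than
%% the last one we acted on; stale (reordered) announcements are ignored.
maybe_accept_leader({am_leader, Leader, Vsn}, LastVsn) when Vsn > LastVsn ->
    {accept, Leader, Vsn};
maybe_accept_leader({am_leader, _StaleLeader, _Vsn}, LastVsn) ->
    {ignore, LastVsn}.
```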

I don't have time to experiment with it right now, but will try to get around to it soon.

@uwiger (Owner) commented Sep 21, 2020

Please note PR #42.
It still doesn't solve all the issues, but I think it goes further in allowing the problems to be identified.
