
When using more than 2 nodes, the first 2 nodes hang and do not respond to gen_server calls #7

Closed
dskliarov opened this issue Dec 4, 2014 · 12 comments

Comments

@dskliarov

After I started 2 locks on 2 nodes, everything worked as expected: one node became the leader, and the locks_leader process on both nodes was in gen_server mode (current function is gen_server/loop). When I started a 3rd node, the first 2 nodes became unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes is locks_leader/safe_loop). locks_leader is waiting for the have_all_locks message but never receives it.
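
For reference, the loop a process is sitting in can be checked with erlang:process_info/2. A minimal diagnostic sketch, assuming Pid is the relevant locks_leader pid (for instance the pid returned by gdict:new/1 in the examples); the exact arities in the output may differ:

%% Inspect the current function of a (possibly hung) locks_leader process.
erlang:process_info(Pid, current_function).
%% hung candidate:   {current_function,{locks_leader,safe_loop,...}}
%% healthy process:  {current_function,{gen_server,loop,...}}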

I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:

check_if_done(#state{pending = Pending} = State, Msgs) ->
    case ets:info(Pending, size) of
        0 ->
            %% Nothing pending: report have_all_locks right away.
            Msg = {have_all_locks, []},
            notify_msgs([Msg|Msgs], have_all(State));
        _ ->
            check_if_done_(State, Msgs)
    end.

After the change, the leader node is in gen_server/loop, but the other 2 nodes are still in gen_leader/safe_loop.

I also modified the get_locks function (line 1194) to handle the case where the ets lookup comes back empty:
get_locks([H|T], Ls) ->
    case ets_lookup(Ls, H) of
        [L] -> [L | get_locks(T, Ls)];
        []  -> get_locks(T, Ls)     %% skip locks that are no longer in the table
    end;
get_locks([], _) ->
    [].

After all that, I am still having issues and continue to debug. Could you please check and tell me if I am on the right path? Thank you.

@uwiger
Owner

uwiger commented Dec 5, 2014

Could you describe the exact steps you take to reproduce this?
I tried with gdict on two nodes, then adding a third, and saw no issues.

Also, in which situations do you see the get_locks/2 function failing because there are no locks? The assumption was that it will never be called unless there are locks. This assumption could of course be broken, but I'd like to understand when that happens.

@dskliarov
Author

Thank you for looking into the problem, Ulf. In my application, I have logic to form a cluster (reading a config from a binary file, initiating node monitoring and process replication). It looks like, when locks is started before the cluster is formed, this creates a net-split scenario (2 nodes have a leader and the new node is also a leader). In this case, in the locks_info function, leader will be set to undefined in the #st record. After that, from_leader messages will be ignored and the node will never switch to gen_server/loop. I modified the from_leader function and it is working now:

from_leader(L, Msg, #st{leader = undefined} = S) ->
    S1 = S#st{leader = L},
    from_leader(L, Msg, S1);
from_leader(L, Msg, #st{leader = L, mod = M, mod_state = MSt} = S) ->
    ?event({processing_from_leader, L, Msg}, S),
    callback(M:from_leader(Msg, MSt, opaque(S)), S);
from_leader(_OtherL, _Msg, S) ->
    ?event({ignoring_from_leader, _OtherL, _Msg}, S),
    S.

I can open a pull request.

About the get_locks function: I got a badmatch error when I shut down one of the nodes.
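
For context, the {badmatch,[]} crash reported further down is consistent with the original get_locks/2 clause matching the ets lookup result directly against a one-element list. A rough reconstruction (an assumption, not the verbatim source):

%% Assumed shape of the original clause (reconstruction, not verbatim):
get_locks([H|T], Ls) ->
    [L] = ets_lookup(Ls, H),      %% crashes with {badmatch,[]} when the lock
    [L | get_locks(T, Ls)];       %% entry has vanished, e.g. after a node went down
get_locks([], _) ->
    [].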

@uwiger
Owner

uwiger commented Dec 5, 2014

Just accepting the leader from 'from_leader' is problematic. The elected() and surrendered() callbacks need to be called appropriately. So before the handshake between leader and candidate, it's appropriate to ignore the 'from_leader' messages.

@uwiger
Owner

uwiger commented Dec 6, 2014

Could you try the changes in PR #8?

@uwiger
Owner

uwiger commented Dec 7, 2014

Never mind. Wrong thinking on my part. Have to fix.

@dskliarov
Author

Thank you

@ctbarbour

FWIW, I was able to recreate the issue with get_locks failing with a badmatch error when a node shuts down, using the gdict module in the examples. The uw-leader-hanging branch seems to have resolved the issue. To recreate the issue on master, I started three nodes, dev1, dev2, dev3, all with the examples application started:

$> erl -pa ./ebin -pa deps/*/ebin -pa ./examples/ebin -sname dev${i} -setcookie locks -eval "application:ensure_all_started(examples)."

Connect the nodes and start a new gdict on each node starting from dev1 using:

{ok, Gdict} = gdict:new(dev).
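
Connecting the other nodes and exercising the dict from dev1 might look like the sketch below; the host names are placeholders, and the store/fetch calls assume gdict follows the standard dict-style API used by the example module:

%% Sketch only: node/host names are placeholders; gdict:store/3 and
%% gdict:fetch/2 signatures are assumptions based on the dict API.
pong = net_adm:ping('dev2@myhost'),
pong = net_adm:ping('dev3@myhost'),
gdict:store(some_key, some_value, Gdict),
gdict:fetch(some_key, Gdict).   %% expected to return some_value on every connected node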

I'm able to store and fetch values from all nodes. When I kill any node, I get the following stacktrace on each node that's not the leader, in this case dev1:

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
    locks_agent: aborted
    reason: {badmatch,[]}
    trace: [{locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1195}]},
            {locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1196}]},
            {locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
            {locks_agent,handle_locks,1,
                         [{file,"src/locks_agent.erl"},{line,843}]},
            {locks_agent,handle_info,2,
                         [{file,"src/locks_agent.erl"},{line,608}]},
            {locks_agent,handle_msg,2,
                         [{file,"src/locks_agent.erl"},{line,256}]},
            {locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
            {locks_agent,agent_init,3,
                         [{file,"src/locks_agent.erl"},{line,198}]}]

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
Error in process <0.57.0> on node 'dev2' with exit value: {{badmatch,[]},[
{locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,205}]}]}

Let me know if providing any more information would be helpful.

@uwiger
Owner

uwiger commented Jan 14, 2015

Ok, thanks. I think I'll merge the uw-leader-hanging branch.

@uwiger
Owner

uwiger commented Oct 21, 2015

Is this still a problem?

@dskliarov
Author

I switched to gen_server for now. It is working without any issues. I am going to come back to gen_lock later. Thank you for both of these projects.


@uwiger
Owner

uwiger commented Oct 21, 2015

Ok. Closing this issue.

@uwiger uwiger closed this as completed Oct 21, 2015
@ctbarbour

Not able to reproduce the issue on master, so it's no longer a problem as far as I can tell.
