
When using more than 2 nodes, the first 2 nodes hang and do not respond to gen_server calls #7

Closed
dskliarov opened this issue Dec 4, 2014 · 12 comments

Comments

@dskliarov

After I started 2 locks on 2 nodes, everything worked as expected: one node became the leader, and the locks_leader process on both nodes was in gen_server mode (current function is gen_server/loop). When I started a 3rd node, the first 2 nodes became unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes is locks_leader/safe_loop). locks_leader is waiting for the have_all_locks message but never receives it.
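
For reference, the loop a process is sitting in can be checked with erlang:process_info/2. A minimal diagnostic sketch, assuming Pid is the relevant locks_leader pid (for instance the pid returned by gdict:new/1 in the examples); the exact arities in the output may differ:

%% Inspect the current function of a (possibly hung) locks_leader process.
erlang:process_info(Pid, current_function).
%% hung candidate:   {current_function,{locks_leader,safe_loop,...}}
%% healthy process:  {current_function,{gen_server,loop,...}}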

I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:

check_if_done(#state{pending = Pending} = State, Msgs) ->
    case ets:info(Pending, size) of
        0 ->
            %% Nothing pending: report have_all_locks right away.
            Msg = {have_all_locks, []},
            notify_msgs([Msg|Msgs], have_all(State));
        _ ->
            check_if_done_(State, Msgs)
    end.

After the change, the leader node is in gen_server/loop, but the other 2 nodes are still in gen_leader/safe_loop.

I also modified the get_locks function (line 1194) to handle the case where the ets lookup comes back empty:
get_locks([H|T], Ls) ->
    case ets_lookup(Ls, H) of
        [L] -> [L | get_locks(T, Ls)];
        []  -> get_locks(T, Ls)     %% skip locks that are no longer in the table
    end;
get_locks([], _) ->
    [].

After all that, I am still having issues and continue to debug. Could you please check and tell me if I am on the right path? Thank you.

@uwiger
Owner

uwiger commented Dec 5, 2014

Could you describe the exact steps you take to reproduce this?
I tried with gdict on two nodes, then adding a third, and saw no issues.

Also, in which situations do you see the get_locks/2 function failing because there are no locks? The assumption was that it will never be called unless there are locks. This assumption could of course be broken, but I'd like to understand when that happens.

@dskliarov
Author

Thank you for looking into the problem, Ulf. In my application, I have logic to form a cluster (reading a config from a binary file, initiating node monitoring and process replication). It looks like, when locks is started before the cluster is formed, this creates a net-split scenario (2 nodes have a leader and the new node is also a leader). In this case, in the locks_info function, leader will be set to undefined in the #st record. After that, from_leader messages will be ignored and the node will never switch to gen_server/loop. I modified the from_leader function and it is working now:

from_leader(L, Msg, #st{leader = undefined} = S) ->
    S1 = S#st{leader = L},
    from_leader(L, Msg, S1);
from_leader(L, Msg, #st{leader = L, mod = M, mod_state = MSt} = S) ->
    ?event({processing_from_leader, L, Msg}, S),
    callback(M:from_leader(Msg, MSt, opaque(S)), S);
from_leader(_OtherL, _Msg, S) ->
    ?event({ignoring_from_leader, _OtherL, _Msg}, S),
    S.

I can open a pull request.

About the get_locks function: I got a badmatch error when I shut down one of the nodes.
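
For context, the {badmatch,[]} crash reported further down is consistent with the original get_locks/2 clause matching the ets lookup result directly against a one-element list. A rough reconstruction (an assumption, not the verbatim source):

%% Assumed shape of the original clause (reconstruction, not verbatim):
get_locks([H|T], Ls) ->
    [L] = ets_lookup(Ls, H),      %% crashes with {badmatch,[]} when the lock
    [L | get_locks(T, Ls)];       %% entry has vanished, e.g. after a node went down
get_locks([], _) ->
    [].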

@uwiger
Owner

uwiger commented Dec 5, 2014

Just accepting the leader from 'from_leader' is problematic. The elected() and surrendered() callbacks need to be called appropriately. So before the handshake between leader and candidate, it's appropriate to ignore the 'from_leader' messages.

@uwiger
Owner

uwiger commented Dec 6, 2014

Could you try the changes in PR #8?

@uwiger
Owner

uwiger commented Dec 7, 2014

Never mind. Wrong thinking on my part. Have to fix.

@dskliarov
Author

Thank you

@ctbarbour

FWIW, I was able to recreate the issue with get_locks failing with a badmatch error when a node shuts down, using the gdict module in the examples. The uw-leader-hanging branch seems to have resolved the issue. To recreate the issue on master, I started three nodes, dev1, dev2, dev3, all with the examples application started:

$> erl -pa ./ebin -pa deps/*/ebin -pa ./examples/ebin -sname dev${i} -setcookie locks -eval "application:ensure_all_started(examples)."

Connect the nodes and start a new gdict on each node starting from dev1 using:

{ok, Gdict} = gdict:new(dev).
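
Connecting the other nodes and exercising the dict from dev1 might look like the sketch below; the host names are placeholders, and the store/fetch calls assume gdict follows the standard dict-style API used by the example module:

%% Sketch only: node/host names are placeholders; gdict:store/3 and
%% gdict:fetch/2 signatures are assumptions based on the dict API.
pong = net_adm:ping('dev2@myhost'),
pong = net_adm:ping('dev3@myhost'),
gdict:store(some_key, some_value, Gdict),
gdict:fetch(some_key, Gdict).   %% expected to return some_value on every connected node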

I'm able to store and fetch values from all nodes. When I kill any node, I get the following stacktrace on each node that's not the leader, in this case dev1:

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
    locks_agent: aborted
    reason: {badmatch,[]}
    trace: [{locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1195}]},
            {locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1196}]},
            {locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
            {locks_agent,handle_locks,1,
                         [{file,"src/locks_agent.erl"},{line,843}]},
            {locks_agent,handle_info,2,
                         [{file,"src/locks_agent.erl"},{line,608}]},
            {locks_agent,handle_msg,2,
                         [{file,"src/locks_agent.erl"},{line,256}]},
            {locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
            {locks_agent,agent_init,3,
                         [{file,"src/locks_agent.erl"},{line,198}]}]

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
Error in process <0.57.0> on node 'dev2' with exit value: {{badmatch,[]},[
{locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,205}]}]}

Let me know if providing any more information would be helpful.

@uwiger
Owner

uwiger commented Jan 14, 2015

Ok, thanks. I think I'll merge the uw-leader-hanging branch.

@uwiger
Owner

uwiger commented Oct 21, 2015

Is this still a problem?

@dskliarov
Author

I switched to gen_server for now. It is working without any issues. I am going to come back to gen_lock later. Thank you for both of these projects.


@uwiger
Owner

uwiger commented Oct 21, 2015

Ok. Closing this issue.

@uwiger uwiger closed this as completed Oct 21, 2015
@ctbarbour

Not able to reproduce the issue on master, so it's no longer a problem as far as I can tell.
