When using more than 2 nodes, the first 2 nodes hang and do not respond to gen_server calls #7
Comments
Could you describe the exact steps you take to reproduce this? Also, in which situations do you see the get_locks/2 function failing because there are no locks? The assumption was that it will never be called unless there are locks. This assumption could of course be broken, but I'd like to understand when that happens. |
Thank you for looking into the problem, Ulf. In my application, I have logic to form a cluster (reading config from a binary file, initiating node monitoring and process replication). It looks like, when locks starts before the cluster is formed, it creates a net-split scenario (2 nodes share one leader, and the new node is its own leader). In this case, in the locks_info function, leader will be set to undefined in the #st record. After that, from_leader messages are ignored and the node never switches to gen_server/loop. I modified the from_leader function and it is working now: from_leader(L, Msg, #st{leader = undefined} = S) -> I can issue a pull request. About the get_locks function: I got a badmatch error when I shut down one of the nodes. |
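The one-clause change described in that comment might look like the sketch below. This is only a guess at the shape of the fix: from_leader/3 and the #st record are internal to locks_leader.erl, and the clause body here is an assumption, not the merged change.

```erlang
%% Sketch (assumed, not the merged fix): accept the sender L as leader
%% when none is recorded yet, so from_leader messages arriving before
%% the election settles are no longer dropped while leader == undefined.
from_leader(L, Msg, #st{leader = undefined} = S) ->
    from_leader(L, Msg, S#st{leader = L});
```

This clause would be placed before the existing from_leader/3 clauses so it only matches while no leader has been recorded.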
Just accepting the leader from 'from_leader' is problematic. The |
Could you try the changes in PR #8 ? |
Never mind. Wrong thinking on my part. Have to fix. |
Thank you |
FWIW, I was able to recreate the issue with
Connect the nodes and start a new dict: {ok, Gdict} = gdict:new(dev). I'm able to store and fetch values from all nodes. When I kill any node, I get the following stacktrace on each node that's not the leader, in this case:
=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
locks_agent: aborted
reason: {badmatch,[]}
trace: [{locks_agent,get_locks,2,
[{file,"src/locks_agent.erl"},{line,1195}]},
{locks_agent,get_locks,2,
[{file,"src/locks_agent.erl"},{line,1196}]},
{locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
{locks_agent,handle_locks,1,
[{file,"src/locks_agent.erl"},{line,843}]},
{locks_agent,handle_info,2,
[{file,"src/locks_agent.erl"},{line,608}]},
{locks_agent,handle_msg,2,
[{file,"src/locks_agent.erl"},{line,256}]},
{locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
{locks_agent,agent_init,3,
[{file,"src/locks_agent.erl"},{line,198}]}]
=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
Error in process <0.57.0> on node 'dev2' with exit value: {{badmatch,[]},[
{locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,205}]}]}
Let me know if providing any more information would be helpful. |
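Written out as an Erlang shell session, the reproduction steps from that comment might look as follows. The node names, paths, and the gdict:store/3 call are assumptions based on the comment, not verified against the gdict example module:

```erlang
%% On three nodes started e.g. with:
%%   erl -sname dev1 -pa ebin examples/gdict/ebin
%% 1. Connect the nodes (names here are assumed):
net_adm:ping('dev2@localhost'),
net_adm:ping('dev3@localhost'),
%% 2. Create a shared dict, per the comment above:
{ok, Gdict} = gdict:new(dev),
%% 3. Storing and fetching works from every node:
gdict:store(key, value, Gdict),
%% 4. Kill any node; the remaining non-leader nodes then crash with
%%    {badmatch,[]} in locks_agent:get_locks/2 as in the report above.
```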
Ok, thanks. I think I'll merge the uw-leader-hanging branch. |
Is this still a problem? |
I switched for now to gen_server. It is working without any issues.
Dmitri Skliarov |
Ok. Closing this issue. |
Not able to reproduce the issue on master, so it is no longer a problem as far as I can tell. |
After I started locks on 2 nodes, everything worked as expected: one node became the leader, and the locks_leader process on both nodes was in gen_server mode (current function gen_server/loop). When I start a 3rd node, the first 2 nodes become unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes is locks_leader/safe_loop). locks_leader is waiting for a have_all_locks message but never gets it.
I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:
check_if_done(#state{pending = Pending} = State, Msgs) ->
case ets:info(Pending, size) of
0 ->
Msg = {have_all_locks, []},
notify_msgs([Msg|Msgs], have_all(State));
_ ->
check_if_done_(State, Msgs)
end.
After the change, the leader node is in gen_server/loop, but the other 2 nodes are still in locks_leader/safe_loop.
I also modified the get_locks function (line 1194) to handle the case when the ETS table has no entry for a lock:
get_locks([H|T], Ls) ->
case ets_lookup(Ls, H) of
[L] -> [L | get_locks(T, Ls)];
[] -> get_locks(T, Ls)
end;
get_locks([], _) ->
[].
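The defensive pattern in the patched get_locks/2 can be checked in isolation against a plain ETS table. The module below is purely illustrative and not part of locks_agent.erl; note it calls ets:lookup/2 directly where the patch above uses the library's ets_lookup helper:

```erlang
-module(get_locks_demo).
-export([run/0]).

%% Same shape as the patched get_locks/2: skip ids that have no entry
%% instead of crashing on [L] = [] with {badmatch,[]}.
get_locks([H | T], Tab) ->
    case ets:lookup(Tab, H) of
        [L] -> [L | get_locks(T, Tab)];
        []  -> get_locks(T, Tab)
    end;
get_locks([], _Tab) ->
    [].

run() ->
    Tab = ets:new(demo, [set]),
    true = ets:insert(Tab, [{a, 1}, {c, 3}]),
    %% b has no entry; the unpatched version would exit with {badmatch,[]}
    [{a, 1}, {c, 3}] = get_locks([a, b, c], Tab),
    ok.
```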
After all that, I am still having issues and continue to debug. Could you please check and tell me if I am on the right path? Thank you.