Use CT for leader tests, let agent monitor nodes, incr leader test #13
Conversation
Tries to address a failing case found by Garret Smith.

This way works as expected (no deadlock):

* start a, start locks, start locks_leader process
* start b, start locks, start locks_leader process, connect a to b
* start c, connect to a

This way deadlocks:

* start a, start locks, start locks_leader process
* start b, connect a to b, start locks, start locks_leader process (at this point, everything is fine)
* start c, connect to a; the locks_leader process on 'a' is deadlocked

The problem seemed to be a race condition between monitor_nodes() detection in the leader and the agent, causing node a to lose track of the lock on b. This was provisionally fixed by letting the agent monitor the nodes for the leader, notifying it only when the locks server is running or terminated on a given node (see the sketch below).

It seems probable that a race condition still exists in the agent, which can be triggered if a lock request arrives just before the agent is told that a new node is running. However, having the leader monitor the nodes via the agent reduces the risk of falling into this hole. It is also possible that it can only happen with continuous locking patterns such as the one the leader uses.
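A minimal sketch of that arrangement, assuming a gen_server-style agent. The {locks_running, Node} and {locks_terminated, Node} message names follow the description above; the probe via rpc:call and the state shape are illustrative, not the actual locks_agent internals:

```erlang
%% Sketch (not the actual locks_agent code): the agent owns the node
%% monitoring and only notifies the leader once the locks server is
%% confirmed up on the new node.
init(Leader) ->
    ok = net_kernel:monitor_nodes(true),
    {ok, #{leader => Leader}}.

handle_info({nodeup, Node}, #{leader := Leader} = State) ->
    %% Do not forward the raw nodeup; first check that the locks
    %% server is actually running on Node.
    case rpc:call(Node, erlang, whereis, [locks_server]) of
        Pid when is_pid(Pid) ->
            Leader ! {locks_running, Node};
        _ ->
            %% not up yet (or rpc failed); a real agent would keep watching
            ok
    end,
    {noreply, State};
handle_info({nodedown, Node}, #{leader := Leader} = State) ->
    Leader ! {locks_terminated, Node},
    {noreply, State}.
```

The point of the indirection is that the leader never sees a node until the locks server on it can answer, which closes the window where leader and agent disagree about which nodes hold locks.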
Manual testing looks good. Running the CT suite, I get inconsistent results :/ In 2 out of 5 runs, the gdict_netsplit test fails on the gdict:find() call on line 163 with a {badmatch, error}. This is an improvement over the previous behavior, but it seems to confirm your suspicion that a race condition still exists.
I re-ran the tests with an unmodified locks_agent.erl, and found another issue. The code for the locks_running event only checked the pending requests set for requests to (re-)issue to the newly appeared node, but we may also have held locks that need to be 'refreshed', if they use the

I still noticed, when running the tests several times, that the netsplit test can fail with a {badmatch,error}. As far as I can tell, the failure lies in the gdict synch, although I haven't started debugging it yet. Looking at the trace log, the leader synch worked as expected and leader consensus was achieved.
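The gist of that second fix, as a hedged sketch: on a locks_running event, walk both the pending requests and the already-held locks. The record fields and the involving/2 and reissue/2 helpers here are illustrative, not the real locks_agent bookkeeping:

```erlang
-record(state, {pending = [],   %% lock requests not yet granted
                held    = []}). %% locks we currently hold

%% On locks_running for Node, (re-)issue *both* pending requests and
%% held locks that involve Node, instead of only the pending set.
handle_info({locks_running, Node}, #state{pending = Pending,
                                          held    = Held} = State) ->
    %% before the fix: only pending requests were (re-)issued to Node
    [reissue(R, Node) || R <- involving(Pending, Node)],
    %% the fix: held locks that span Node must be refreshed there too
    [reissue(L, Node) || L <- involving(Held, Node)],
    {noreply, State}.

%% Illustrative helpers; the real agent keeps richer per-lock state.
involving(Entries, Node) ->
    [E || {_Id, Nodes} = E <- Entries, lists:member(Node, Nodes)].

reissue({Id, _Nodes}, Node) ->
    %% hypothetical message shape: re-request lock Id on Node
    {locks_server, Node} ! {lock, self(), Id}.
```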
The problems in the test suite seem to have been fixed. There was a bug in test_cb, the locks_leader callback used by gdict. Also, some read checks in gdict_all_nodes() didn't account for replication delays. I have now run the suite a large number of times without errors.
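For the replication-delay issue, the usual remedy is to retry the read for a bounded time instead of asserting on the first attempt. A sketch of that pattern; the retry helper is hypothetical, and only gdict:find/2 with its {ok, Value} | error return is taken from the tests above:

```erlang
%% Poll gdict:find/2 until the replicated write becomes visible,
%% rather than pattern-matching on the first read.
await_value(Key, D) ->
    await_value(Key, D, 50).            % up to 50 attempts, 100 ms apart

await_value(_Key, _D, 0) ->
    error(replication_timeout);
await_value(Key, D, Retries) ->
    case gdict:find(Key, D) of
        {ok, Value} -> {ok, Value};
        error       -> timer:sleep(100),
                       await_value(Key, D, Retries - 1)
    end.
```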
@garret-smith have you tried the latest commits on your end?
I am completely swamped today. I will make time on Friday and let you know.
I've run the automated tests multiple times with no failures. Can't get it to break.
Well, then. It's certainly better than it was before, so ... merged. Thanks for your help.