Use CT for leader tests, let agent monitor nodes, incr leader test #13

Merged
merged 4 commits into master on Aug 8, 2015

Conversation

@uwiger (Owner) commented Aug 3, 2015

Tries to address a failing case found by Garret Smith:

This way works as expected (no deadlock):

  • start a, start locks, start locks_leader process
  • start b, start locks, start locks_leader process, connect a to b
  • start c, connect to a

This way deadlocks:

  • start a, start locks, start locks_leader process
  • start b, connect a to b, start locks, start locks_leader process
    -- at this point, everything is fine --
  • start c, connect to a, locks_leader process on 'a' is deadlocked

The problem seemed to be a race condition between monitor_nodes()
detection in the leader and agent, causing node a to lose track of
the lock on b. This was provisionally fixed by letting the agent
monitor the nodes for the leader, notifying it only when the locks
server is running/terminated on a given node.
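
A minimal sketch of that arrangement, written as gen_server clauses in the agent (not the actual locks code -- the registered name locks_server, the #state{} record and the message tags are assumptions here):

```erlang
-record(state, {leader :: pid()}).

init(Leader) ->
    %% the agent, not the leader, subscribes to node status changes
    net_kernel:monitor_nodes(true),
    {ok, #state{leader = Leader}}.

handle_info({nodeup, Node}, #state{leader = L} = S) ->
    %% only notify the leader once the locks server is actually running on Node
    case rpc:call(Node, erlang, whereis, [locks_server]) of
        Pid when is_pid(Pid) -> L ! {locks_running, Node};
        undefined            -> erlang:send_after(500, self(), {nodeup, Node});
        {badrpc, _}          -> ok    %% node unreachable again; a nodedown will follow
    end,
    {noreply, S};
handle_info({nodedown, Node}, #state{leader = L} = S) ->
    L ! {locks_terminated, Node},
    {noreply, S}.
```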

It seems probable that a race condition still exists in the agent,
which can be triggered if a lock request arrives just before the
agent is told that a new node is running. However, having the leader
monitor the nodes via the agent reduces the risk of falling into this
hole. It is also possible that the race can only be triggered by
continuous locking patterns like the one the leader uses.

@garret-smith

Manual testing looks good. Running the CT suite, I get inconsistent results :/ Sometimes (2 out of 5 runs) the gdict_netsplit test fails on the gdict:find() on line 163 with a {badmatch, error}. This is an improvement over previous behavior, but seems to confirm your suspicion that a race condition still exists.

@uwiger (Owner, Author) commented Aug 5, 2015

I re-ran the tests with an unmodified locks_agent.erl and found another issue. The code for the locks_running event only checked the set of pending requests for requests to (re-)issue to the newly appeared node, but we may also have held locks that need to be 'refreshed' if they use the all_alive or majority condition. Making that change alone made the tests pass (i.e. no deadlock/timeout).
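
A hedged sketch of that change (the #req{} record, the require field values and the wants_node/2 helper are assumptions, not the real locks_agent internals):

```erlang
-record(req, {object, mode, require}).   %% require :: any | all_alive | majority

%% On a locks_running event for Node, collect both the pending requests that
%% involve Node and the already held locks whose condition depends on the set
%% of live nodes, so they all get (re-)issued towards the new node.
requests_for_new_node(Node, Pending, Held) ->
    Reissue = [R || R <- Pending, wants_node(R, Node)],
    Refresh = [R || R <- Held,
                    R#req.require =:= all_alive orelse
                    R#req.require =:= majority],
    Reissue ++ Refresh.
```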

I still noticed, when running the tests several times, that the netsplit test can fail with a {badmatch,error}. As far as I can tell, the failure lies in the gdict synch, although I haven't started debugging it yet. Looking at the trace log, the leader synch worked as expected and leader consensus was achieved.

@uwiger (Owner, Author) commented Aug 5, 2015

The problems in the test suite seem to have been fixed. There was a bug in test_cb, the locks_leader callback used by gdict. The test_cb:elected/3 callback function needs to check each time for new candidates; it would sometimes forego synching and simply pass along its own state, which could lead to data loss.
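
Roughly, the intended behaviour is the one sketched below; the exact locks_leader callback contract (argument order and return tuples) is an assumption here, not a copy of the fixed test_cb code:

```erlang
%% elected/3 is invoked both when this instance wins the election and when a
%% new candidate joins later. In both cases it must hand over sync data.
elected(Dict, _LeaderInfo, undefined) ->
    %% freshly elected: broadcast our dict as sync data to all candidates
    {ok, Dict, Dict};
elected(Dict, _LeaderInfo, _NewCandidate) ->
    %% a candidate appeared after the election: still send the sync data
    %% instead of just passing along our own state without synching
    {reply, Dict, Dict}.
```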

Also, some read checks in gdict_all_nodes() didn't account for replication delays.
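
Something along these lines for the read checks; the helper is an assumption, not the actual gdict_all_nodes() code:

```erlang
%% Poll each node's dict until the expected value is visible, instead of
%% pattern-matching on the first gdict:find/2 result (which can still return
%% `error` while replication is in flight).
await_value(Key, Expected, Dicts) ->
    await_value(Key, Expected, Dicts, 50).

await_value(_Key, _Expected, _Dicts, 0) ->
    error(timeout);
await_value(Key, Expected, Dicts, Retries) ->
    case lists:all(fun(D) -> gdict:find(Key, D) =:= {ok, Expected} end, Dicts) of
        true  -> ok;
        false -> timer:sleep(100),
                 await_value(Key, Expected, Dicts, Retries - 1)
    end.
```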

I have now run the suite a large number of times without errors.

@uwiger (Owner, Author) commented Aug 6, 2015

@garret-smith have you tried the latest commits on your end?

@garret-smith

I am completely swamped today. I will make time on Friday and let you know.

@garret-smith

I've run the automated tests multiple times with no failures. Can't get it to break.
Manual testing still looks good.

uwiger added a commit that referenced this pull request Aug 8, 2015
Use CT for leader tests, let agent monitor nodes, incr leader test
@uwiger merged commit 105657c into master Aug 8, 2015
@uwiger (Owner, Author) commented Aug 8, 2015

Well, then. It's certainly better than it was before, so ... merged.

Thanks for your help.
