Use CT for leader tests, let agent monitor nodes, incr leader test #13

Merged
merged 4 commits into master on Aug 8, 2015

Conversation

@uwiger (Owner) commented Aug 3, 2015

Tries to address a failing case found by Garret Smith:

This way works as expected (no deadlock):

  • start a, start locks, start locks_leader process
  • start b, start locks, start locks_leader process, connect a to b
  • start c, connect to a

This way deadlocks:

  • start a, start locks, start locks_leader process
  • start b, connect a to b, start locks, start locks_leader process
    -- at this point, everything is fine --
  • start c, connect to a, locks_leader process on 'a' is deadlocked

The problem seemed to be a race condition between monitor_nodes()
detection in the leader and agent, causing node a to lose track of
the lock on b. This was provisionally fixed by letting the agent
monitor the nodes for the leader, notifying it only when the locks
server is running/terminated on a given node.
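
A minimal sketch of that arrangement, written as gen_server clauses in the agent (not the actual locks code -- the registered name locks_server, the #state{} record and the message tags are assumptions here):

```erlang
-record(state, {leader :: pid()}).

init(Leader) ->
    %% the agent, not the leader, subscribes to node status changes
    net_kernel:monitor_nodes(true),
    {ok, #state{leader = Leader}}.

handle_info({nodeup, Node}, #state{leader = L} = S) ->
    %% only notify the leader once the locks server is actually running on Node
    case rpc:call(Node, erlang, whereis, [locks_server]) of
        Pid when is_pid(Pid) -> L ! {locks_running, Node};
        undefined            -> erlang:send_after(500, self(), {nodeup, Node});
        {badrpc, _}          -> ok    %% node unreachable again; a nodedown will follow
    end,
    {noreply, S};
handle_info({nodedown, Node}, #state{leader = L} = S) ->
    L ! {locks_terminated, Node},
    {noreply, S}.
```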

It seems probable that a race condition still exists in the agent,
which can be triggered if a lock request arrives just before the
agent is told that a new node is running. However, having the leader
monitor the nodes via the agent reduces the risk of falling into this
hole. It is also possible that the race can only be triggered by
continuous locking patterns like the one the leader uses.

@garret-smith

Manual testing looks good. Running the CT suite, I get inconsistent results :/ Sometimes (2 out of 5 runs) the gdict_netsplit test fails on the gdict:find() on line 163 with a {badmatch, error}. This is an improvement over previous behavior, but seems to confirm your suspicion that a race condition still exists.

@uwiger (Owner, Author) commented Aug 5, 2015

I re-ran the tests with an unmodified locks_agent.erl and found another issue. The code for the locks_running event only checked the set of pending requests for requests to (re-)issue to the newly appeared node, but we may also have held locks that need to be 'refreshed' if they use the all_alive or majority condition. Making that change alone made the tests pass (i.e. no deadlock/timeout).
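
A hedged sketch of that change (the #req{} record, the require field values and the wants_node/2 helper are assumptions, not the real locks_agent internals):

```erlang
-record(req, {object, mode, require}).   %% require :: any | all_alive | majority

%% On a locks_running event for Node, collect both the pending requests that
%% involve Node and the already held locks whose condition depends on the set
%% of live nodes, so they all get (re-)issued towards the new node.
requests_for_new_node(Node, Pending, Held) ->
    Reissue = [R || R <- Pending, wants_node(R, Node)],
    Refresh = [R || R <- Held,
                    R#req.require =:= all_alive orelse
                    R#req.require =:= majority],
    Reissue ++ Refresh.
```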

I still noticed, when running the tests several times, that the netsplit test can fail with a {badmatch,error}. As far as I can tell, the failure lies in the gdict synch, although I haven't started debugging it yet. Looking at the trace log, the leader synch worked as expected and leader consensus was achieved.

@uwiger (Owner, Author) commented Aug 5, 2015

The problems in the test suite seem to have been fixed. There was a bug in test_cb, the locks_leader callback used by gdict. The test_cb:elected/3 callback function needs to check each time for new candidates; it would sometimes forego synching and simply pass along its own state, which could lead to data loss.
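
Roughly, the intended behaviour is the one sketched below; the exact locks_leader callback contract (argument order and return tuples) is an assumption here, not a copy of the fixed test_cb code:

```erlang
%% elected/3 is invoked both when this instance wins the election and when a
%% new candidate joins later. In both cases it must hand over sync data.
elected(Dict, _LeaderInfo, undefined) ->
    %% freshly elected: broadcast our dict as sync data to all candidates
    {ok, Dict, Dict};
elected(Dict, _LeaderInfo, _NewCandidate) ->
    %% a candidate appeared after the election: still send the sync data
    %% instead of just passing along our own state without synching
    {reply, Dict, Dict}.
```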

Also, some read checks in gdict_all_nodes() didn't account for replication delays.
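
Something along these lines for the read checks; the helper is an assumption, not the actual gdict_all_nodes() code:

```erlang
%% Poll each node's dict until the expected value is visible, instead of
%% pattern-matching on the first gdict:find/2 result (which can still return
%% `error` while replication is in flight).
await_value(Key, Expected, Dicts) ->
    await_value(Key, Expected, Dicts, 50).

await_value(_Key, _Expected, _Dicts, 0) ->
    error(timeout);
await_value(Key, Expected, Dicts, Retries) ->
    case lists:all(fun(D) -> gdict:find(Key, D) =:= {ok, Expected} end, Dicts) of
        true  -> ok;
        false -> timer:sleep(100),
                 await_value(Key, Expected, Dicts, Retries - 1)
    end.
```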

I have now run the suite a large number of times without errors.

@uwiger (Owner, Author) commented Aug 6, 2015

@garret-smith have you tried the latest commits on your end?

@garret-smith

I am completely swamped today. I will make time on Friday and let you know.

@garret-smith

I've run the automated tests multiple times with no failures. Can't get it to break.
Manual testing still looks good.

uwiger added a commit that referenced this pull request Aug 8, 2015
Use CT for leader tests, let agent monitor nodes, incr leader test
@uwiger merged commit 105657c into master Aug 8, 2015
@uwiger (Owner, Author) commented Aug 8, 2015

Well, then. It's certainly better than it was before, so ... merged.

Thanks for your help.
