
Prevent race condition when JobBuffer#push is run in a tight loop and workers are waiting on jobs #318

Merged: 1 commit merged into que-rb:master from fix-push-race-condition on Dec 22, 2021

Conversation

Contributor

@oeoeaio oeoeaio commented Dec 8, 2021

This PR resolves an issue identified by @ebeigarts while crafting a test for #285 (see comment).

NOTE: this PR depends on #285, but we thought we would merge that PR separately rather than attempting to mash these changes in too.

Problem

The crux of the issue is that when JobBuffer#push is run in a tight loop (as it is in the test provided by @ebeigarts, which we have committed here, with credit) while there are workers waiting on jobs, there is a race condition between:

  1. a worker being released in PriorityQueue#pop and reducing the @waiting counter
  2. the thread running JobBuffer#push re-entering that method and evaluating pq.waiting_count.times

If 2 happens before 1 (which seems unlikely but not impossible), then too many jobs will be pushed into the priority queue, leading to the Que.assert(waiting_count > 0) assertion in PriorityQueue#push failing.
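To make the window concrete, here is a minimal sketch of the interaction. The SketchPriorityQueue and SketchJobBuffer classes are hypothetical stand-ins, not Que's actual implementation; the method names mirror the ones discussed above, but the bodies are simplified for illustration.

```ruby
class SketchPriorityQueue
  def initialize
    @mutex   = Mutex.new
    @cv      = ConditionVariable.new
    @jobs    = []
    @waiting = 0
  end

  # Read outside the mutex, like the pq.waiting_count.times call above.
  def waiting_count
    @waiting
  end

  # Mirrors the Que.assert(waiting_count > 0) assertion mentioned in the PR.
  def push(job)
    @mutex.synchronize do
      raise "assertion failed: waiting_count > 0" unless @waiting > 0
      @jobs << job
      @cv.signal
    end
  end

  def pop
    @mutex.synchronize do
      @waiting += 1
      @cv.wait(@mutex) while @jobs.empty?
      @jobs.shift
    end
  ensure
    # By the time the counter comes back down, the worker has already been
    # released and the mutex dropped, so a tight JobBuffer#push loop can
    # briefly observe a stale, too-large waiting_count.
    @waiting -= 1
  end
end

class SketchJobBuffer
  def initialize(pq)
    @pq = pq
  end

  def push(*jobs)
    # The race: waiting_count is read without holding the queue's mutex, so
    # re-entering this method before a released worker has decremented
    # @waiting hands over more jobs than there are waiters, tripping the
    # assertion above.
    @pq.waiting_count.times { @pq.push(jobs.shift) unless jobs.empty? }
  end
end
```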

It's unclear whether this is all that likely to happen in reality, but given there is proof that the race condition is possible we thought it better to try and resolve it.

Resolution

By shifting the push loop into the priority queue itself (#populate), we are able to use the priority queue's own mutex to ensure the @waiting count does not change while the push loop is running.

We have made PriorityQueue#_push a private unsynchronized method since it is now only used via #populate.
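Continuing the hypothetical sketch above (again, not Que's actual source), the shape of the change looks roughly like this: the hand-off loop moves into the queue and runs entirely under its mutex, and the unsynchronized push becomes a private helper.

```ruby
class SketchPriorityQueue
  # Hands off at most as many jobs as there are waiting workers, all while
  # holding the mutex, so @waiting cannot change mid-loop; returns the jobs
  # that were accepted.
  def populate(jobs)
    @mutex.synchronize do
      jobs.shift([@waiting, jobs.length].min).each { |job| _push(job) }
    end
  end

  private

  # Unsynchronized on purpose: only ever called from #populate, which
  # already holds the mutex (mirroring PriorityQueue#_push in the PR).
  def _push(job)
    raise "assertion failed: waiting_count > 0" unless @waiting > 0
    @jobs << job
    @cv.signal
  end
end

class SketchJobBuffer
  def push(*jobs)
    # The queue decides how many jobs it can take under its own lock; the
    # buffer no longer reads waiting_count outside that lock.
    @pq.populate(jobs)
  end
end
```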

It's unclear whether there are performance impacts from making this change (workers are blocked from picking up jobs while we are pushing new jobs into each queue), but on my machine there is no measurable difference in the time taken for the test (about 2.3s with the original version, which moves 40k jobs through the buffer) with or without this change.

@oeoeaio oeoeaio self-assigned this Dec 8, 2021
raise "Deadlock"
end
end
end
Contributor Author

@oeoeaio oeoeaio Dec 8, 2021


This test only takes 0.1 seconds to run and fails on master about 50% of the time. We tried to develop a more targeted test for this issue, but everything we tried ended up reaching into the internal implementation to such an extent that the test became very brittle.

I would be interested to know if anyone has a different hit rate for the Deadlock exception on master on their machine.

Member

ZimbiX commented Dec 22, 2021

Given we've fixed the test subsequently, I'm going to remove it from this PR and merge the test in another

Prevent race condition when JobBuffer#push is run in a tight loop and workers are waiting on jobs

Co-authored-by: Maddy Markovitz <maddy.markovitz@greensync.com.au>
Co-authored-by: Brendan Weibrecht <brendan@weibrecht.net.au>
@ZimbiX ZimbiX merged commit abb3621 into que-rb:master Dec 22, 2021
@ZimbiX ZimbiX deleted the fix-push-race-condition branch December 22, 2021 06:53