Prevent race condition when JobBuffer#push is run in a tight loop and workers are waiting on jobs #318
This PR resolves an issue identified by @ebeigarts while crafting a test for #285 (see comment).
NOTE: this PR depends on #285, but we thought we would merge that PR separately rather than attempting to mash these changes in too.
Problem
The crux of the issue is that when JobBuffer#push is run in a tight loop (as it is in the test provided by @ebeigarts, which we have committed here, with credit) while there are workers waiting on jobs, there is a race condition between:
1. PriorityQueue#pop reducing the @waiting counter
2. JobBuffer#push re-entering that method and evaluating pq.waiting_count.times
If 2 happens before 1 (which seems unlikely but not impossible), then too many jobs will be pushed into the priority queue, leading to the Que.assert(waiting_count > 0) assertion in PriorityQueue#push failing.
It's unclear whether this is all that likely to happen in reality, but given there is proof that the race condition is possible we thought it better to try and resolve it.
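
The interleaving described above can be replayed deterministically in a single thread. This is only a toy model under stated assumptions: ToyPriorityQueue, begin_wait and finish_wait are hypothetical names invented for illustration, and the real Que classes use a mutex and condition variable rather than explicit steps.

```ruby
# Toy model, NOT Que's real classes: all names and internals are assumptions.
class ToyPriorityQueue
  def initialize
    @items = []
    @waiting = 0
  end

  # A worker starts waiting for a job.
  def begin_wait
    @waiting += 1
  end

  # The worker wakes up, stops waiting and takes a job (step 1).
  def finish_wait
    @waiting -= 1
    @items.shift
  end

  def waiting_count
    @waiting
  end

  # Mirrors the assertion described above: only push while a worker waits.
  def push(job)
    raise "assertion failed: waiting_count > 0" unless waiting_count > 0
    @items << job
  end
end

pq = ToyPriorityQueue.new
pq.begin_wait                    # one worker is waiting

pq.push(:job_a)                  # first buffer push hands over a job
stale_count = pq.waiting_count   # second buffer push reads the count (step 2)...
taken = pq.finish_wait           # ...before the worker decrements it (step 1)

race_error =
  begin
    stale_count.times { pq.push(:job_b) } # acts on the stale count
    nil
  rescue RuntimeError => e
    e.message
  end
```

The count read by the second push is already out of date by the time it pushes, so the assertion fires even though each individual operation is correct on its own.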
Resolution
By shifting the push loop into the priority queue itself (#populate), we are able to utilize the priority queue's own mutex to ensure the @waiting count does not change while the push loop is running.
We have made PriorityQueue#_push a private unsynchronized method since it is now only used via #populate.
It's unclear whether there are performance impacts from making this change (workers are blocked from picking up jobs while we are pushing new jobs into each queue), but there is no measurable difference in the time taken for the test, which moves 40k jobs through the buffer (about 2.3s with the original version), on my machine with or without this change.
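
A minimal sketch of the idea, assuming a queue guarded by a single Mutex and ConditionVariable. SketchPriorityQueue and its internals are hypothetical and are not the code in this PR; only the names populate and _push are taken from the description above.

```ruby
# Hypothetical sketch; SketchPriorityQueue is NOT Que's actual class.
class SketchPriorityQueue
  def initialize
    @items = []
    @waiting = 0
    @mutex = Mutex.new
    @cv = ConditionVariable.new
  end

  # Blocks until a job is available, counting itself as waiting meanwhile.
  def pop
    @mutex.synchronize do
      loop do
        item = @items.shift
        return item if item
        @waiting += 1
        @cv.wait(@mutex)
        @waiting -= 1
      end
    end
  end

  def waiting_count
    @mutex.synchronize { @waiting }
  end

  # Reads @waiting and pushes under one lock, so no worker can change the
  # count between the read and the pushes. Yields once per waiting worker
  # and stops early if the block returns nil.
  def populate
    @mutex.synchronize do
      @waiting.times do
        job = yield
        break if job.nil?
        _push(job)
      end
    end
  end

  private

  # Unsynchronized: callers must already hold @mutex.
  def _push(job)
    raise "assertion failed: waiting > 0" unless @waiting > 0
    @items << job
    @cv.signal
  end
end

pq = SketchPriorityQueue.new
workers = 2.times.map { Thread.new { pq.pop } }
sleep 0.01 until pq.waiting_count == 2 # wait for both workers to block

buffer = [:a, :b, :c, :d, :e]
pq.populate { buffer.shift }           # hands out one job per waiting worker

handed_out = workers.map(&:value)      # joins each worker, collects its job
```

Because a waking worker has to reacquire the mutex before it can decrement @waiting, the count seen by the loop stays fixed for the whole of populate, at the cost of workers being briefly blocked while jobs are pushed.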