
Prevent race condition when JobBuffer#push is run in a tight loop and workers are waiting on jobs #318

Merged: 1 commit merged into que-rb:master from fix-push-race-condition on Dec 22, 2021

Conversation

Contributor

@oeoeaio oeoeaio commented Dec 8, 2021

This PR resolves an issue identified by @ebeigarts while crafting a test for #285 (see comment).

NOTE: this PR depends on #285, but we thought we would merge that PR separately rather than attempting to mash these changes in too.

Problem

The crux of the issue is that when JobBuffer#push is run in a tight loop (as it is in the test provided by @ebeigarts, which we have committed here, with credit) while there are workers waiting on jobs, there is a race condition between:

  1. a worker being released in PriorityQueue#pop and reducing the @waiting counter
  2. the thread running JobBuffer#push re-entering that method and evaluating pq.waiting_count.times

If 2 happens before 1 (which seems unlikely but not impossible), then too many jobs will be pushed into the priority queue, leading to the Que.assert(waiting_count > 0) assertion in PriorityQueue#push failing.
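To make the window concrete, here is a minimal sketch of the interaction. The SketchPriorityQueue and SketchJobBuffer classes are hypothetical stand-ins, not Que's actual implementation; the method names mirror the ones discussed above, but the bodies are simplified for illustration.

```ruby
class SketchPriorityQueue
  def initialize
    @mutex   = Mutex.new
    @cv      = ConditionVariable.new
    @jobs    = []
    @waiting = 0
  end

  # Read outside the mutex, like the pq.waiting_count.times call above.
  def waiting_count
    @waiting
  end

  # Mirrors the Que.assert(waiting_count > 0) assertion mentioned in the PR.
  def push(job)
    @mutex.synchronize do
      raise "assertion failed: waiting_count > 0" unless @waiting > 0
      @jobs << job
      @cv.signal
    end
  end

  def pop
    @mutex.synchronize do
      @waiting += 1
      @cv.wait(@mutex) while @jobs.empty?
      @jobs.shift
    end
  ensure
    # By the time the counter comes back down, the worker has already been
    # released and the mutex dropped, so a tight JobBuffer#push loop can
    # briefly observe a stale, too-large waiting_count.
    @waiting -= 1
  end
end

class SketchJobBuffer
  def initialize(pq)
    @pq = pq
  end

  def push(*jobs)
    # The race: waiting_count is read without holding the queue's mutex, so
    # re-entering this method before a released worker has decremented
    # @waiting hands over more jobs than there are waiters, tripping the
    # assertion above.
    @pq.waiting_count.times { @pq.push(jobs.shift) unless jobs.empty? }
  end
end
```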

It's unclear whether this is all that likely to happen in reality, but given there is proof that the race condition is possible we thought it better to try and resolve it.

Resolution

By shifting the push loop into the priority queue itself (#populate), we are able to use the priority queue's own mutex to ensure the @waiting count does not change while the push loop is running.

We have made PriorityQueue#_push a private unsynchronized method since it is now only used via #populate.
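Continuing the hypothetical sketch above (again, not Que's actual source), the shape of the change looks roughly like this: the hand-off loop moves into the queue and runs entirely under its mutex, and the unsynchronized push becomes a private helper.

```ruby
class SketchPriorityQueue
  # Hands off at most as many jobs as there are waiting workers, all while
  # holding the mutex, so @waiting cannot change mid-loop; returns the jobs
  # that were accepted.
  def populate(jobs)
    @mutex.synchronize do
      jobs.shift([@waiting, jobs.length].min).each { |job| _push(job) }
    end
  end

  private

  # Unsynchronized on purpose: only ever called from #populate, which
  # already holds the mutex (mirroring PriorityQueue#_push in the PR).
  def _push(job)
    raise "assertion failed: waiting_count > 0" unless @waiting > 0
    @jobs << job
    @cv.signal
  end
end

class SketchJobBuffer
  def push(*jobs)
    # The queue decides how many jobs it can take under its own lock; the
    # buffer no longer reads waiting_count outside that lock.
    @pq.populate(jobs)
  end
end
```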

It's unclear whether there are performance impacts from making this change (workers are blocked from picking up jobs while we are pushing new jobs into each queue), but on my machine there is no measurable difference in the time taken for the test (about 2.3s with the original version, which moves 40k jobs through the buffer) with or without this change.

@oeoeaio oeoeaio self-assigned this Dec 8, 2021
raise "Deadlock"
end
end
end
Contributor Author

@oeoeaio oeoeaio Dec 8, 2021


This test only takes 0.1 seconds to run and fails on master about 50% of the time. We tried to develop a more targeted test for this issue, but everything we tried ended up reaching into the internal implementation to such an extent that the test became very brittle.

I would be interested to know if anyone has a different hit rate for the Deadlock exception on master on their machine.

Member

ZimbiX commented Dec 22, 2021

Given we've fixed the test subsequently, I'm going to remove it from this PR and merge the test in another

Prevent race condition when JobBuffer#push is run in a tight loop and workers are waiting on jobs

Co-authored-by: Maddy Markovitz <maddy.markovitz@greensync.com.au>
Co-authored-by: Brendan Weibrecht <brendan@weibrecht.net.au>
@ZimbiX ZimbiX merged commit abb3621 into que-rb:master Dec 22, 2021
@ZimbiX ZimbiX deleted the fix-push-race-condition branch December 22, 2021 06:53