Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition in AbstractAsyncWriter causing "java.lang.RuntimeException: Queue should be empty but is size:" #564

Closed
1 task done
akiezun opened this issue Apr 10, 2016 · 4 comments

Comments

@akiezun
Copy link
Contributor

akiezun commented Apr 10, 2016

java.lang.RuntimeException: Queue should be empty but is size: 276
        at htsjdk.samtools.util.AbstractAsyncWriter.close(AbstractAsyncWriter.java:75)
        at org.broadinstitute.hellbender.utils.read.ArtificialBAMBuilder.makeBAMFile(ArtificialBAMBuilder.java:143)
        at org.broadinstitute.hellbender.utils.read.ArtificialBAMBuilder.makeTemporaryBAMFile(ArtificialBAMBuilder.java:130)
        at org.broadinstitute.hellbender.utils.read.ArtificialBAMBuilderUnitTest.testBamProvider(ArtificialBAMBuilderUnitTest.java:69)

Subject of the issue

I think there's a race condition in AbstractAsyncWriter that is causing broadinstitute/gatk#1699

The race is exposed by a context switch in the middle of evaluating the || condition of the writer thread

The scenario is like this.

  1. let's say at a specific moment the queue is empty (but the writer is not closed)

  2. writer thread starts executing this the loop condition

while (!queue.isEmpty() || !isClosed.get()) {

it evaluates the first condition and queue is empty so it moves on to evaluating the second condition.
But before it gets to it, ...

  1. main thread calls write a few times - the queue becomes non empty

  2. main thread calls close()

main runs this line and thus flips isClosed to true !this.isClosed.getAndSet(true)

then main checks that this.queue.isEmpty() which it is not so it moves on
then main blocks on writer.join() and waits

  1. writer thread wakes up and evaluates the second condition of the while loop: isClosed.get() and that returns true now so the loop is not entered - because the first part of the condition is still remembered as false)

Writer thread ends the run() method and dies

  1. main thread wakes up, and blows up on
if (!this.queue.isEmpty()) { throw new RuntimeException("Queue should be empty but is size: " + this.queue.size()); }
``
@akiezun
Copy link
Contributor Author

akiezun commented Apr 10, 2016

i think reversing the order of operations in the while loop will fix it, ie

while (!isClosed.get() || !queue.isEmpty()) 

because, if isClosed() returns true, then the main thread must have at least stared to execute the close() call and no new elements will be added to the queue. So, the queue emptiness status will not change while the two parts of the || are being evaluated.

Can someone confirm that there is indeed a race condition and that the proposed fix makes it go away?

@droazen
Copy link
Contributor

droazen commented Apr 11, 2016

Discussed in person with @akiezun, and we convinced ourselves that there is indeed a potential race here due to the writer thread checking for emptiness while it's still possible to add additional items via the main thread. Checking for closure first, and only checking emptiness if the writer is already in a state where no more items can be added (ie., it's already closed) seems like it should fix the race, but empirical testing and additional sets of eyes are required.

The way the class is currently constructed seems like an invitation to races of this sort, since you have an isClosed that is decoupled from the queue itself and shared across threads. Comments in the BlockingQueue docs recommend inserting a poison item into the queue to signal closure, but this would be tricky in this case since the enclosing class is generic.

@droazen
Copy link
Contributor

droazen commented Apr 11, 2016

@akiezun Should we disable async I/O in GATK4 until this is resolved?

@akiezun
Copy link
Contributor Author

akiezun commented Apr 11, 2016

@droazen if possible i'd rather fix this. but yes, if we have no fix for this by the time we close the milestone, we'll turn off asyncIO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants