Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Make BAMFileReader and some related classes public #786
Conversation
|
@droazen, can you take a look? |
codecov-io
commented
Jan 19, 2017
•
Codecov Report
@@ Coverage Diff @@
## master #786 +/- ##
==============================================
- Coverage 64.75% 64.649% -0.101%
==============================================
Files 525 523 -2
Lines 31679 31634 -45
Branches 5414 6776 +1362
==============================================
- Hits 20512 20451 -61
- Misses 9026 9030 +4
- Partials 2141 2153 +12
Continue to review full report at Codecov.
|
droazen
self-assigned this
Jan 24, 2017
| + if (mCurrentIterator != null) { | ||
| + throw new IllegalStateException("Iteration in progress"); | ||
| + } | ||
| + if (mIsSeekable) { |
lbergelson
Jan 24, 2017
Contributor
this should should throw if it's not seekable, not silently give you back the wrong iterator
| @@ -329,6 +329,29 @@ public ValidationStringency getValidationStringency() { | ||
| return mCurrentIterator; | ||
| } | ||
| + /** | ||
| + * Prepare to iterate through the SAMRecords in file order, starting at the given | ||
| + * file pointer. The restrictions described for {@link #getIterator()} apply here. |
droazen
Jan 24, 2017
Contributor
It seems like this is unused in the GATK/Hadoop-BAM client code in favor of getIterator(final SAMFileSpan chunks) -- can we remove?
tomwhite
Jan 25, 2017
Contributor
Yes - not needed in the latest version of the Hadoop-BAM client code. Well spotted!
| + return createIndexIterator(intervals, contained, span == null ? null : span.toCoordinateArray()); | ||
| + } | ||
| + | ||
| + public CloseableIterator<SAMRecord> createIndexIterator(final QueryInterval[] intervals, |
| @@ -763,7 +801,7 @@ private void assertIntervalsOptimized(final QueryInterval[] intervals) { | ||
| /** | ||
| * Iterate over the SAMRecords defined by the sections of the file described in the ctor argument. | ||
| */ | ||
| - private class BAMFileIndexIterator extends BAMFileIterator { | ||
| + public class BAMFileIndexIterator extends BAMFileIterator { |
droazen
Jan 24, 2017
Contributor
Why does this need to be made public? It seems like there are no direct usages in any downstream PRs
nh13
Jan 24, 2017
Contributor
@droazen can you use the review feature so I don't get a notification per comment?
droazen
Jan 24, 2017
Contributor
We were doing a collaborative review on this branch, so we needed our comments to be visible to everyone right away rather than at the end of the review. Sorry for the inconvenience.
| @@ -115,15 +115,55 @@ public SAMFileSpan removeContentsBefore(final SAMFileSpan fileSpan) { | ||
| validateSorted(); | ||
| final BAMFileSpan trimmedChunkList = new BAMFileSpan(); | ||
| + long chunkStart = bamFileSpan.chunks.get(0).getChunkStart(); |
| for(final Chunk chunkToTrim: chunks) { | ||
| - if(chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart()) { | ||
| - if(chunkToTrim.getChunkStart() >= bamFileSpan.chunks.get(0).getChunkStart()) { | ||
| + if(chunkToTrim.getChunkEnd() > chunkStart) { |
droazen
Jan 24, 2017
•
Contributor
This does not seem equivalent to the previous implementation -- it used to be comparing against the start of the current chunk (chunkToTrim), but now it's comparing against the start of the first chunk (chunkStart). Is this intentional? Was the previous behavior incorrect?
tomwhite
Jan 25, 2017
Contributor
Yes, this was intentional as the previous behaviour was incorrect. If a chunk entirely preceded the cutoff, it would never be removed.
There were no tests for this before, and the method is not used internally by htsjdk, which is possibly how this was missed. I've now added unit tests.
| + return new Object[][] { | ||
| + { span(chunk(6,10), chunk(11,15)), null, span(chunk(6,10), chunk(11,15)) }, | ||
| + { span(chunk(6,10), chunk(11,15)), span(), span(chunk(6,10), chunk(11,15)) }, | ||
| + { span(chunk(6,10), chunk(11,15)), span(chunk(6, 0)), span(chunk(6,10), chunk(11,15)) }, |
| @@ -381,7 +381,7 @@ public PrimitiveSamReaderToSamReaderAdapter(final PrimitiveSamReader p, final Sa | ||
| this.resource = resource; | ||
| } | ||
| - PrimitiveSamReader underlyingReader() { | ||
| + public PrimitiveSamReader underlyingReader() { |
droazen
Jan 24, 2017
•
Contributor
Can you explain why you can't just use the existing iterator(final SAMFileSpan chunks) on the PrimitiveSamReaderToSamReaderAdapter, instead of making underlyingReader() public, calling it on the adapter, and then calling getIterator(fileSpan) on it?
tomwhite
Jan 25, 2017
Contributor
PrimitiveSamReaderToSamReaderAdapter is not public - but even if it was, the underlying reader is still needed to get the BAMFileReader so its newly-exposed getFileSpan() and createIndexIterator() methods can be called.
|
First-pass review complete, back to @tomwhite to answer some of our questions. |
droazen
assigned tomwhite and unassigned droazen
Jan 24, 2017
|
Thanks for the detailed reviews @droazen and @lbergelson. I've addressed all the comments. |
droazen
requested changes
Feb 14, 2017
Second-pass review complete, back to @tomwhite for more changes.
| for(final Chunk chunkToTrim: chunks) { | ||
| - if(chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart()) { | ||
| - if(chunkToTrim.getChunkStart() >= bamFileSpan.chunks.get(0).getChunkStart()) { | ||
| + if(chunkToTrim.getChunkEnd() > chunkStart) { |
droazen
Feb 14, 2017
Contributor
Just want to confirm that you now have at least one unit test that fails before this bug fix (chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart() -> chunkToTrim.getChunkEnd() > chunkStart) and passes after it.
| + validateSorted(); | ||
| + | ||
| + final BAMFileSpan trimmedChunkList = new BAMFileSpan(); | ||
| + final long chunkEnd = bamFileSpan.chunks.get(0).getChunkEnd(); |
droazen
Feb 14, 2017
Contributor
This doesn't seem right -- the logical "end" of the BAMFileSpan is the end of the last chunk, not the end of the first chunk, isn't it, since the chunks list is sorted?
Can you fix this, and add a unit test that fails before the change and passes after?
tomwhite
Feb 15, 2017
Contributor
I was assuming there was only one chunk, but you're right - there could be more. Fixed, and added more tests to cover these cases.
| + } | ||
| + else { | ||
| + // This chunk from the list partially overlaps the filtering chunk and must be trimmed. | ||
| + trimmedChunkList.add(new Chunk(chunkToTrim.getChunkStart(),chunkEnd)); |
droazen
Feb 14, 2017
•
Contributor
Do we have a good test to confirm that the chunks in the returned filespan are still sorted after trimming? (applies to removeContentsBefore() as well)
tomwhite
Feb 15, 2017
Contributor
Yes, the test checks that the chunks are in the order specified (which is sorted).
| + final boolean contained, | ||
| + final long[] filePointers) { | ||
| + | ||
| + assertIntervalsOptimized(intervals); |
droazen
Feb 14, 2017
•
Contributor
This sanity check (that the intervals are sorted, and that overlapping/adjacent intervals are properly merged) now happens much later than previously when calling the two-argument overload of this method (which is what the query() methods all call). Specifically, the check now happens after the index is actually queried in getFileSpan(). Are we certain that the process of querying the index is guaranteed not to blow up if the intervals were not optimized?
droazen
Feb 14, 2017
Contributor
Probably we should ensure that assertIntervalsOptimized() is called as the very first step in both overloads of createIndexIterator() to be safe, even if that means that the two-argument version can't call directly into the three-argument version.
| @@ -381,7 +381,7 @@ public PrimitiveSamReaderToSamReaderAdapter(final PrimitiveSamReader p, final Sa | ||
| this.resource = resource; | ||
| } | ||
| - PrimitiveSamReader underlyingReader() { | ||
| + public PrimitiveSamReader underlyingReader() { |
| + { span(chunk(6,10), chunk(11,15)), span(chunk(0,11)), span(chunk(6,10)) }, | ||
| + { span(chunk(6,10), chunk(11,15)), span(chunk(0,12)), span(chunk(6,10), chunk(11,12)) }, | ||
| + { span(chunk(6,10), chunk(11,15)), span(chunk(0,15)), span(chunk(6,10), chunk(11,15)) }, | ||
| + { span(chunk(6,10), chunk(11,15)), span(chunk(0,16)), span(chunk(6,10), chunk(11,15)) }, |
droazen
Feb 14, 2017
Contributor
What's missing here (and in the test cases for removeContentsBefore() as well) are test cases in which the cutoff BAMFileSpan consists of multiple chunks. As mentioned above, I think you have a bug in removeContentsAfter() in that case.
|
Thanks for the review @droazen. I've addressed all your comments. |
|
|
tomwhite commentedJan 19, 2017
... and expose methods for iterating over a part of a BAM file (needed for Hadoop-BAM,
which processes BAM files in parallel, see HadoopGenomics/Hadoop-BAM#91).
Also, add BAMFileSpan#removeContentsAfter method to mirror
removeContentsBefore.
Description
Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?
Checklist