Make BAMFileReader and some related classes public #786

Merged
merged 1 commit into from Feb 17, 2017

5 participants
Contributor

tomwhite commented Jan 19, 2017

... and expose methods for iterating over a part of a BAM file (needed for Hadoop-BAM,
which processes BAM files in parallel, see HadoopGenomics/Hadoop-BAM#91).

Also, add BAMFileSpan#removeContentsAfter method to mirror
removeContentsBefore.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)
Contributor

tomwhite commented Jan 19, 2017

@droazen, can you take a look?

codecov-io commented Jan 19, 2017 edited

Codecov Report

Merging #786 into master will decrease coverage by 0.101%.
The diff coverage is 85.185%.

@@              Coverage Diff               @@
##             master      #786       +/-   ##
==============================================
- Coverage     64.75%   64.649%   -0.101%     
==============================================
  Files           525       523        -2     
  Lines         31679     31634       -45     
  Branches       5414      6776     +1362     
==============================================
- Hits          20512     20451       -61     
- Misses         9026      9030        +4     
- Partials       2141      2153       +12
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| src/main/java/htsjdk/samtools/SamReader.java | 80.208% <ø> (ø) | 0 <ø> (ø) |
| src/main/java/htsjdk/samtools/BAMFileReader.java | 65.179% <66.667%> (+1.595%) | 0 <ø> (-37) |
| src/main/java/htsjdk/samtools/BAMFileSpan.java | 71.875% <90.476%> (+38.542%) | 0 <ø> (-11) |
| src/main/java/htsjdk/samtools/BinaryTagCodec.java | 59.116% <ø> (-17.127%) | 0% <ø> (-68%) |
| src/main/java/htsjdk/tribble/TribbleException.java | 38.889% <ø> (-11.111%) | 0% <ø> (-4%) |
| ...samtools/seekablestream/SeekableStreamFactory.java | 73.913% <ø> (-9.42%) | 0% <ø> (-6%) |
| ...sjdk/samtools/util/Md5CalculatingOutputStream.java | 70.27% <ø> (-8.108%) | 0% <ø> (-8%) |
| ...ain/java/htsjdk/tribble/AbstractFeatureReader.java | 64.286% <ø> (-6.548%) | 0% <ø> (-18%) |
| ...c/main/java/htsjdk/tribble/TabixFeatureReader.java | 63.235% <ø> (-5.421%) | 0% <ø> (-8%) |
| src/main/java/htsjdk/variant/vcf/VCFCodec.java | 61.364% <ø> (-4.545%) | 0% <ø> (-13%) |
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8b9d5d5...12eb90f.

droazen self-assigned this Jan 24, 2017

+ if (mCurrentIterator != null) {
+ throw new IllegalStateException("Iteration in progress");
+ }
+ if (mIsSeekable) {
@lbergelson

lbergelson Jan 24, 2017

Contributor

This should throw if it's not seekable, not silently give you back the wrong iterator.

@tomwhite

tomwhite Jan 25, 2017

Contributor

Removed this method as it was not needed.

@@ -329,6 +329,29 @@ public ValidationStringency getValidationStringency() {
return mCurrentIterator;
}
+ /**
+ * Prepare to iterate through the SAMRecords in file order, starting at the given
+ * file pointer. The restrictions described for {@link #getIterator()} apply here.
@droazen

droazen Jan 24, 2017

Contributor

It seems like this is unused in the GATK/Hadoop-BAM client code in favor of getIterator(final SAMFileSpan chunks) -- can we remove?

@tomwhite

tomwhite Jan 25, 2017

Contributor

Yes - not needed in the latest version of the Hadoop-BAM client code. Well spotted!

+ return createIndexIterator(intervals, contained, span == null ? null : span.toCoordinateArray());
+ }
+
+ public CloseableIterator<SAMRecord> createIndexIterator(final QueryInterval[] intervals,
@droazen

droazen Jan 24, 2017

Contributor

Add javadoc for this method

@@ -763,7 +801,7 @@ private void assertIntervalsOptimized(final QueryInterval[] intervals) {
/**
* Iterate over the SAMRecords defined by the sections of the file described in the ctor argument.
*/
- private class BAMFileIndexIterator extends BAMFileIterator {
+ public class BAMFileIndexIterator extends BAMFileIterator {
@droazen

droazen Jan 24, 2017

Contributor

Why does this need to be made public? It seems like there are no direct usages in any downstream PRs

@nh13

nh13 Jan 24, 2017

Contributor

@droazen can you use the review feature so I don't get a notification per comment?

@droazen

droazen Jan 24, 2017

Contributor

We were doing a collaborative review on this branch, so we needed our comments to be visible to everyone right away rather than at the end of the review. Sorry for the inconvenience.

@@ -115,15 +115,55 @@ public SAMFileSpan removeContentsBefore(final SAMFileSpan fileSpan) {
validateSorted();
final BAMFileSpan trimmedChunkList = new BAMFileSpan();
+ long chunkStart = bamFileSpan.chunks.get(0).getChunkStart();
@lbergelson

lbergelson Jan 24, 2017

Contributor

Make this final, I think.

for(final Chunk chunkToTrim: chunks) {
- if(chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart()) {
- if(chunkToTrim.getChunkStart() >= bamFileSpan.chunks.get(0).getChunkStart()) {
+ if(chunkToTrim.getChunkEnd() > chunkStart) {
@droazen

droazen Jan 24, 2017 edited

Contributor

This does not seem equivalent to the previous implementation -- it used to be comparing against the start of the current chunk (chunkToTrim), but now it's comparing against the start of the first chunk (chunkStart). Is this intentional? Was the previous behavior incorrect?

@tomwhite

tomwhite Jan 25, 2017

Contributor

Yes, this was intentional as the previous behaviour was incorrect. If a chunk entirely preceded the cutoff, it would never be removed.

There were no tests for this before, and the method is not used internally by htsjdk, which is possibly how this was missed. I've now added unit tests.
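
To make the difference between the two conditions concrete, here is a minimal, self-contained sketch. It does not use htsjdk: `Chunk`, `trimBeforeOld`, and `trimBeforeNew` are hypothetical stand-ins, and the branch bodies (keep a chunk at or after the cutoff, otherwise trim its start) are an assumption, since the hunk above only shows the conditions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimBeforeSketch {
    /** Hypothetical stand-in for a BAM chunk: a range of file offsets. */
    static final class Chunk {
        final long start;
        final long end;
        Chunk(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "(" + start + "," + end + ")"; }
    }

    /** Old condition: a chunk's end is compared with its own start, so every non-empty chunk passes. */
    static List<Chunk> trimBeforeOld(List<Chunk> chunks, long cutoff) {
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.end > c.start) {
                // Assumed branch bodies: keep chunks at or after the cutoff, trim the rest.
                out.add(c.start >= cutoff ? c : new Chunk(cutoff, c.end));
            }
        }
        return out;
    }

    /** Fixed condition: a chunk survives only if it extends past the cutoff. */
    static List<Chunk> trimBeforeNew(List<Chunk> chunks, long cutoff) {
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.end > cutoff) {
                out.add(c.start >= cutoff ? c : new Chunk(cutoff, c.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk(1, 5), new Chunk(6, 10), new Chunk(11, 15));
        long cutoff = 8; // plays the role of the first cutoff chunk's start
        System.out.println("old: " + trimBeforeOld(chunks, cutoff)); // old: [(8,5), (8,10), (11,15)]
        System.out.println("new: " + trimBeforeNew(chunks, cutoff)); // new: [(8,10), (11,15)]
    }
}
```

In this sketch the chunk (1,5) lies entirely before the cutoff. The old condition lets it through (here as an inverted chunk), while the new condition drops it.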

+ return new Object[][] {
+ { span(chunk(6,10), chunk(11,15)), null, span(chunk(6,10), chunk(11,15)) },
+ { span(chunk(6,10), chunk(11,15)), span(), span(chunk(6,10), chunk(11,15)) },
+ { span(chunk(6,10), chunk(11,15)), span(chunk(6, 0)), span(chunk(6,10), chunk(11,15)) },
@droazen

droazen Jan 24, 2017

Contributor

You are creating chunks with end < start -- is this allowed?

@tomwhite

tomwhite Jan 25, 2017

Contributor

Yes. End is not used, so it can be any value.

@@ -381,7 +381,7 @@ public PrimitiveSamReaderToSamReaderAdapter(final PrimitiveSamReader p, final Sa
this.resource = resource;
}
- PrimitiveSamReader underlyingReader() {
+ public PrimitiveSamReader underlyingReader() {
@droazen

droazen Jan 24, 2017 edited

Contributor

Can you explain why you can't just use the existing iterator(final SAMFileSpan chunks) on the PrimitiveSamReaderToSamReaderAdapter, instead of making underlyingReader() public, calling it on the adapter, and then calling getIterator(fileSpan) on it?

@tomwhite

tomwhite Jan 25, 2017

Contributor

PrimitiveSamReaderToSamReaderAdapter is not public. Even if it were, the underlying reader is still needed to get at the BAMFileReader, so that its newly exposed getFileSpan() and createIndexIterator() methods can be called.

Contributor

droazen commented Jan 24, 2017

First-pass review complete, back to @tomwhite to answer some of our questions.

@droazen droazen assigned tomwhite and unassigned droazen Jan 24, 2017

Contributor

tomwhite commented Jan 25, 2017

Thanks for the detailed reviews @droazen and @lbergelson. I've addressed all the comments.

@droazen

Second-pass review complete, back to @tomwhite for more changes.

for(final Chunk chunkToTrim: chunks) {
- if(chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart()) {
- if(chunkToTrim.getChunkStart() >= bamFileSpan.chunks.get(0).getChunkStart()) {
+ if(chunkToTrim.getChunkEnd() > chunkStart) {
@droazen

droazen Feb 14, 2017

Contributor

Just want to confirm that you now have at least one unit test that fails before this bug fix (chunkToTrim.getChunkEnd() > chunkToTrim.getChunkStart() -> chunkToTrim.getChunkEnd() > chunkStart) and passes after it.

@tomwhite

tomwhite Feb 15, 2017

Contributor

Yes, five new unit tests fail without this change.

+ validateSorted();
+
+ final BAMFileSpan trimmedChunkList = new BAMFileSpan();
+ final long chunkEnd = bamFileSpan.chunks.get(0).getChunkEnd();
@droazen

droazen Feb 14, 2017

Contributor

This doesn't seem right -- the logical "end" of the BAMFileSpan is the end of the last chunk, not the end of the first chunk, isn't it, since the chunks list is sorted?

Can you fix this, and add a unit test that fails before the change and passes after?

@tomwhite

tomwhite Feb 15, 2017

Contributor

I was assuming there was only one chunk, but you're right - there could be more. Fixed, and added more tests to cover these cases.
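
The point about the last chunk can be sketched as follows. This is an illustration only, not the merged htsjdk code: `Chunk` and `trimAfter` are hypothetical names, and the trimming rule is inferred from the test rows quoted in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimAfterSketch {
    /** Hypothetical stand-in for a BAM chunk: a range of file offsets. */
    static final class Chunk {
        final long start;
        final long end;
        Chunk(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "(" + start + "," + end + ")"; }
    }

    /** Keeps only the parts of the chunks that lie before the end of the cutoff span. */
    static List<Chunk> trimAfter(List<Chunk> chunks, List<Chunk> cutoffSpan) {
        if (cutoffSpan.isEmpty()) {
            return new ArrayList<>(chunks); // nothing to trim against
        }
        // The cutoff span is sorted, so its logical end is the end of its LAST chunk.
        long cutoff = cutoffSpan.get(cutoffSpan.size() - 1).end;
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.start < cutoff) {
                // Trim a chunk that straddles the cutoff; keep earlier chunks whole.
                out.add(c.end > cutoff ? new Chunk(c.start, cutoff) : c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk(6, 10), new Chunk(11, 15));
        List<Chunk> single = Arrays.asList(new Chunk(0, 12));
        List<Chunk> multi = Arrays.asList(new Chunk(0, 5), new Chunk(6, 12));
        System.out.println(trimAfter(chunks, single)); // [(6,10), (11,12)]
        System.out.println(trimAfter(chunks, multi));  // [(6,10), (11,12)]
    }
}
```

With the multi-chunk cutoff above, taking the end of the first chunk (5) instead of the last (12) would leave nothing, because both input chunks start after offset 5.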

+ }
+ else {
+ // This chunk from the list partially overlaps the filtering chunk and must be trimmed.
+ trimmedChunkList.add(new Chunk(chunkToTrim.getChunkStart(),chunkEnd));
@droazen

droazen Feb 14, 2017 edited

Contributor

Do we have a good test to confirm that the chunks in the returned filespan are still sorted after trimming? (applies to removeContentsBefore() as well)

@tomwhite

tomwhite Feb 15, 2017

Contributor

Yes, the test checks that the chunks are in the order specified (which is sorted).

+ final boolean contained,
+ final long[] filePointers) {
+
+ assertIntervalsOptimized(intervals);
@droazen

droazen Feb 14, 2017 edited

Contributor

This sanity check (that the intervals are sorted, and that overlapping/adjacent intervals are properly merged) now happens much later than previously when calling the two-argument overload of this method (which is what the query() methods all call). Specifically, the check now happens after the index is actually queried in getFileSpan(). Are we certain that the process of querying the index is guaranteed not to blow up if the intervals were not optimized?

@droazen

droazen Feb 14, 2017

Contributor

Probably we should ensure that assertIntervalsOptimized() is called as the very first step in both overloads of createIndexIterator() to be safe, even if that means that the two-argument version can't call directly into the three-argument version.

@tomwhite

tomwhite Feb 15, 2017

Contributor

Done.
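
The validate-first shape suggested above might look roughly like this. Every name in the sketch is hypothetical, the intervals are simplified to inclusive `{start, end}` pairs on a single reference, and the index lookup is a stub. The only point is that each overload validates before doing anything else.

```java
import java.util.ArrayList;
import java.util.List;

class OverloadValidationSketch {
    /** Records the order of steps so the sequencing can be inspected. */
    final List<String> log = new ArrayList<>();

    /** Hypothetical stand-in for assertIntervalsOptimized: sorted, non-overlapping, non-adjacent. */
    void assertOptimized(int[][] intervals) {
        log.add("validate");
        for (int i = 1; i < intervals.length; i++) {
            if (intervals[i][0] <= intervals[i - 1][1] + 1) {
                throw new IllegalArgumentException("Intervals not optimized at index " + i);
            }
        }
    }

    /** Stub for the index query that runs only in the two-argument overload. */
    long[] lookupFilePointers(int[][] intervals) {
        log.add("index lookup");
        return new long[] {0L, 100L};
    }

    /** Two-argument overload: validate first, then query the index, then build. */
    String createIterator(int[][] intervals, boolean contained) {
        assertOptimized(intervals);
        long[] filePointers = lookupFilePointers(intervals);
        return build(intervals, contained, filePointers);
    }

    /** Three-argument overload: the caller supplies file pointers, but validation still comes first. */
    String createIterator(int[][] intervals, boolean contained, long[] filePointers) {
        assertOptimized(intervals);
        return build(intervals, contained, filePointers);
    }

    /** Shared construction step, so neither overload has to call the other. */
    private String build(int[][] intervals, boolean contained, long[] filePointers) {
        log.add("build");
        return "iterator over " + intervals.length + " interval(s), contained=" + contained;
    }

    public static void main(String[] args) {
        OverloadValidationSketch reader = new OverloadValidationSketch();
        try {
            reader.createIterator(new int[][] {{10, 20}, {1, 5}}, true); // unsorted input
        } catch (IllegalArgumentException e) {
            System.out.println("rejected before index lookup: " + reader.log);
        }
    }
}
```

Routing both overloads through a private `build` step is one way to keep the check first without validating twice. Having the two-argument overload delegate to the three-argument one would either repeat the check or move it after the index query.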

@@ -381,7 +381,7 @@ public PrimitiveSamReaderToSamReaderAdapter(final PrimitiveSamReader p, final Sa
this.resource = resource;
}
- PrimitiveSamReader underlyingReader() {
+ public PrimitiveSamReader underlyingReader() {
@droazen

droazen Feb 14, 2017

Contributor

Add docs for this method, since you've made it public

@tomwhite

tomwhite Feb 15, 2017

Contributor

Done

+ { span(chunk(6,10), chunk(11,15)), span(chunk(0,11)), span(chunk(6,10)) },
+ { span(chunk(6,10), chunk(11,15)), span(chunk(0,12)), span(chunk(6,10), chunk(11,12)) },
+ { span(chunk(6,10), chunk(11,15)), span(chunk(0,15)), span(chunk(6,10), chunk(11,15)) },
+ { span(chunk(6,10), chunk(11,15)), span(chunk(0,16)), span(chunk(6,10), chunk(11,15)) },
@droazen

droazen Feb 14, 2017

Contributor

What's missing here (and in the test cases for removeContentsBefore() as well) are test cases in which the cutoff BAMFileSpan consists of multiple chunks. As mentioned above, I think you have a bug in removeContentsAfter() in that case.

@tomwhite

tomwhite Feb 15, 2017

Contributor

Done.

@tomwhite tomwhite Make BAMFileReader and some related classes public, and expose
methods for iterating over a part of a BAM file (needed for Hadoop-BAM,
which processes BAM files in parallel).

Also, add BAMFileSpan#removeContentsAfter method to mirror
removeContentsBefore.
12eb90f
Contributor

tomwhite commented Feb 15, 2017

Thanks for the review @droazen. I've addressed all your comments.

Contributor

droazen commented Feb 17, 2017

👍 latest version looks good -- merging

@droazen droazen merged commit 55bf01b into samtools:master Feb 17, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed