Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes and tests for CRAM index queries. #552

Merged
merged 1 commit into from Jun 12, 2016

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Apr 4, 2016

Includes fixes for #481, #550, #551, and #562;

@droazen
Copy link
Contributor

droazen commented Apr 4, 2016

@vadimzalunin Can you review? We were somewhat surprised to learn that cram queries are broken in so many cases..

if (alignEnd == 0) {
// handle placed but "unmapped" since they can pass the overlap test
return (alignStart >= interval.start) &&
(interval.end <= 0 || alignStart <= interval.end);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle case where alignStart < interval.start && alignEnd == 0 in a manner consistent with the BAM-querying code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(BAMQueryMultipleIntervalsIteratorFilter.compareIntervalToRecord() suggests that unmapped reads with a start position are assigned an end position equal to start for BAM queries)

@cmnbroad cmnbroad changed the title Fixes and tests for CRAM single-interval index queries. Fixes and tests for CRAM index queries. Apr 9, 2016
@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Apr 9, 2016

I've added a second commit to this PR with fixes for the comments here, as well as a proposed fix for #551. The second commit reduces the two CRAMFileReader interval iterator implementations (IntervalIterator and MultipleIntervalIterator) into a single implementation that uses (a temporarily cloned copy of the) interval matching code used by the BAM iterators iterators to do interval matching. If we get agreement that this is the right strategy, I'll refactor the cloned code so it can be shared between the implementations.

@vadimzalunin
Copy link
Contributor

@cmnbroad the approach looks good. A shared iterator impl would be ideal.

@cmnbroad
Copy link
Collaborator Author

This now includes fixes for #481, #550, #551, and #562; uses a multiple interval filter iterator implementation that is shared with BAMFileReader; and adds a bunch of new CRAM tests.

@cmnbroad
Copy link
Collaborator Author

@vadimzalunin Please review. thx.

}
}
throw new SAMException("Failed to query alignment start: " + sequence + " at " + start);
return new CRAMIntervalIterator(new QueryInterval[]{new QueryInterval(referenceIndex, start, -1)}, true);
}

CloseableIterator<SAMRecord> query(final int referenceIndex,
final int start, final int end, final boolean overlap) throws IOException {
Copy link
Collaborator Author

@cmnbroad cmnbroad Apr 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Delete this method. I meant to remove it as its no longer used internally and should just be deleted (its not part of the public query interface, and is confusing since it reverses the polarity of contained/overlap compared to the other methods).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on deletion if possible

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@droazen
Copy link
Contributor

droazen commented May 6, 2016

@vadimzalunin Do you have time to do a final review on this? This is blocking us from safely using CRAM as input to the GATK, since our queries on CRAMs are currently not returning the correct results. Thanks!

/**
* Filters out records that do not match any of the given intervals and query type.
*/
public class BAMQueryMultipleIntervalsIteratorFilter implements BAMIteratorFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've made this formerly private class public, can you add direct unit tests for it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

new String[]{"a", "b", "c", "d", "e"}},
{cramQueryWithBAI, cramQueryReference,
new QueryInterval[]{new QueryInterval(0, 100010, 100010), new QueryInterval(0, 100011, 100011)},
new String[]{"a", "b", "c", "d", "e"}},
Copy link
Contributor

@droazen droazen May 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend having more than two test cases here. Should have tests involving multiple contigs, as well as more than two intervals, and also cover cases like no reads overlap any intervals, first read is before the first interval, first interval is before the first read, etc. If your existing test file doesn't have enough reads/contigs, can use something like src/test/resources/org/broadinstitute/hellbender/engine/reads_data_source_test1.bam from the GATK, which uses a tiny reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also try porting the test cases from ReadsDataSourceUnitTest here, which use the reads_data_source_test1.bam file mentioned above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added quite a few more tests for both overlapped and contained.

@droazen
Copy link
Contributor

droazen commented May 19, 2016

First-pass review complete, back to @cmnbroad

@droazen droazen assigned cmnbroad and unassigned vadimzalunin May 19, 2016
@cmnbroad cmnbroad force-pushed the cn_cram_query_fix branch 3 times, most recently from 5b14052 to 9db6fdc Compare May 27, 2016 18:53
@cmnbroad cmnbroad assigned cmnbroad and droazen and unassigned cmnbroad May 27, 2016
@cmnbroad
Copy link
Collaborator Author

Didn't squash or rebase yet but otherwise I think this covers it.

} catch (final IOException e) {
throw new RuntimeEOFException(e);
}
next(); // advance to the first record that matches the filter criteria
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you confirmed that this iterator works with an empty cram file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of tests to validate this.

@droazen
Copy link
Contributor

droazen commented Jun 9, 2016

Review complete, back to @cmnbroad. Merge after addressing remaining comments and rebasing.

@coveralls
Copy link

coveralls commented Jun 10, 2016

Coverage Status

Coverage increased (+0.07%) to 68.141% when pulling 12b0345 on cmnbroad:cn_cram_query_fix into 6237f7d on samtools:master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants