-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSI index support for BAM files #1040
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1040 +/- ##
===============================================
+ Coverage 68.336% 68.939% +0.603%
- Complexity 8001 8208 +207
===============================================
Files 541 546 +5
Lines 32681 32971 +290
Branches 5536 5558 +22
===============================================
+ Hits 22333 22730 +397
+ Misses 8122 7987 -135
- Partials 2226 2254 +28
|
@valeriuo I'm sorry we haven't gotten to this yet. I've been very busy and this is obviously a complicated PR. It's a really valuable PR though, lots of people want CSI support, so we will definitely get around to reviewing it when someone has bandwidth. |
@lbergelson I realize there is a lot of code in here. Some of it is duplicated from |
@lbergelson could this be prioritized, there's a picard PR waiting on it. broadinstitute/picard#998 |
@yfarjoun Yes, I was planning on prioritizing this as the next thing after going through a bunch of small stuff that has been in flight for a while. |
@valeriuo I've begun to take a look at this. Sorry for the extremely long delay. Are you still around and interested in helping get this in? |
@lbergelson, definitely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo I'm still looking through this, I just wanted to post what I'd gotten through so you can see what I'm thinking. Almost all of my comments are structural / code style things rather than anything broken about the implementation. It looks pretty good to me, but I'm not super familiar with the indexes, so I may have to come back to you with questions in a bit.
I'm increasingly annoyed by the BAMIndex class hierarchy. It seems like we should be able to implement this as a subclass of AbstractBamIndex, and the fact that it isn't obvious how to do that is problematic. I think we'll need to consider moving some things around.
I haven't taken a serious look at the tests yet, but I suspect I'm going to want to add additional test cases. It looks like you have good unit test coverage, and BamIndexValidator might be checking everything I want, but as a heads up I might want more tests that open bam readers on various combinations of indexes, particularly ones accessible only through nio.Path APIs.
* @param endPos 1-based end of target region, inclusive. | ||
* @return bit set for each bin that may contain SAMRecords in the target region. | ||
*/ | ||
protected BitSet regionToBins(final int startPos, final int endPos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of deleting this, please deprecate it and have it call through to the preferred one.
import java.nio.channels.FileChannel; | ||
import java.util.*; | ||
|
||
public class BAMCSIFileIndex implements BrowseableBAMIndex { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this to CSIIndex. Although I guess that's a bit like ATMMachine...
@Override | ||
public int getLevelForBin(Bin bin) { | ||
if(bin == null || bin.getBinNumber() > getMaxBins()) { | ||
throw new SAMException("Tried to get level for invalid bin."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include the invalid value in the error message.
@Override | ||
public int getFirstLocusInBin(Bin bin) { | ||
if(bin == null || bin.getBinNumber() > getMaxBins()) { | ||
throw new SAMException("Tried to get first locus for invalid bin."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@Override | ||
public void close() {} | ||
|
||
private void verifyBAMMagicNumber(final String sourceName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be renamed to verifyCSIMagicNumer
* Constructors | ||
*/ | ||
|
||
private BAMCSIFileIndex(final SeekableStream stream, final SAMSequenceDictionary dictionary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the constructors should be refactored. I would unify them by a private constructor
private BAMCSIFileIndex(IndexFileBuffer buffer, String Source, SAMSequenceDictionary dictionary)
and then have the other constructers call though to that. In the end we want r 3 public constructors.
One that takes SeekableStream
as the input, one for Path
, and one for File
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, Path
would be memory mapped and File
would be randomly accessed. Is this what you had in mind?
* Statistics include count of aligned and unaligned reads for each reference sequence | ||
* and a count of all records with no start coordinate | ||
*/ | ||
static public BAMIndexMetaData[] getCSIIndexStats(final BAMFileReader bam) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index hierarchy is such a mess. I think we should consider restructuring it to some degree to make this method unnecessary.
metaDataSeen = true; | ||
continue; // don't create a Bin | ||
} else { | ||
skipBytes(16 * nChunks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets extract a named constant here
return mIndexBuffer.position(); | ||
} | ||
|
||
private abstract static class IndexFileBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should not duplicate this code. Lets pull out this and the implementers as package protected classes instead. While we're at it we can make this an interface and have it extend Closeable.
@@ -34,6 +34,7 @@ | |||
public interface BAMIndex extends Closeable { | |||
|
|||
public static final String BAMIndexSuffix = ".bai"; | |||
public static final String BAMIndexSuffix2 = ".csi"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It bothers me that these don't follow constant naming convetion. I would rename this to CSI_INDEX_SUFFIX
, deprecate the existing BAMIndexSuffix
and add a new BAM_INDEX_SUFFIX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BAI_INDEX_SUFFIX
would probably be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, that's better.
…elNumber is too big. 2. Fixes a bug: the level size for the last level was miscalculated. It should be 32768 (2^15). 3. Removes the unused method regionToBins, whose code was already implemented in the class GenomicIndexUtil.
Change `BAMIndexSuffix2` to `CSI_INDEX_SUFFIX`. Deprecate `BAMIndexSuffix` and use `BAI_INDEX_SUFFIX` instead. Make setters of `CSIIndex` private.
…vate interface `IndexFileBuffer` and classes `IndexFileBuffer`, `MemoryMappedFileBuffer` and `RandomAccessFileBuffer`. Refactor the code to use the new classes.
… `AbstractBAMFileIndex`. Fix a few index arithmetic bugs.
b16a421
to
0e9a1d3
Compare
@lbergelson I've addressed all the topics mentioned in your review. I didn't add any unit tests, but I wanted to push the latest code version, because I am going on a short holiday until next week and I thought you could have another look over it in the meantime. I've managed to refactor the BAM index class hierarchy by opening slightly the I've tested the CSI index functionality by running Picard Also, I've rebased the PR onto master. |
@valeriuo Oh, that's awesome! I will take a look. That's interesting about the compression. It doesn't seem to mention it anywhere in the spec. I assume they're just gzip? Or are they bgzip? We should be able to add detection of the compression and automatically decompress. There was just recently some code added to make that sort of detection easier as well. |
Htslib writes CSI using bgzf (but I doubt it matters whether it is this or pure gzip as we have no random access capability to our indices anyway). You are correct in that it is not documented. I note that Tabix does document this, as bgzf, so I assume this is just an omission in the CSI documentation. Note that what CSI does to BAI - allowing for longer chromosomes - needs to be applied to tabix TBI also. There is no point having long chromosome support in BAM if we then cannot analyse it and handle bgzipped VCF. Bcftools does this via an intermediate CSI/TBI hybrid, which is sadly also undocumented! It uses a tabix header as the meta-data for a CSI file, which means it doesn't have to have every #contig line present in the index and reference numbering is based on presence in the file rather than presence in the VCF header. I believe this is an attempt to handle the case where the VCF header does not exist (#contig lines are optional), but I can't really be sure. Maybe it's simply an accident of sharing code paths! Note, I've just added CSI and BAI support for bgzipped SAM in Htslib, mainly for completeness but also as a safe workaround for when BAM just doesn't cut it (eg a chromosome more than 2Gb long). Does htsjdk support this? I haven't got TBI support for it, but arguably it's not helpful anyway given it also has the same size limit. |
Ping, @lbergelson et al I wonder what the status of CSI support is? Just asking so I know how to respond to IGV tickets, not requesting anything. |
I don't have anything else to add to the PR, at this point. I did some basic tests with Picard and they looked OK. |
It's been stalled forever waiting on me. I need to do a review but I never find the time. |
@lbergelson I understand believe me. If I can help let me know, again no pressure I just need to respond to a ticket on my side. |
I would like to get this into the next release, but I've been saying that to myself for every release... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo I started doing what should hopefully be the final review pass on this. It's looking really good. It's great that you figured out how to merge it with the existing index class hierarchy. I still haven't looked over quite everything yet, I just wanted to checkpoint what I've done so far though. Thank you for your overtaxed patience.
@@ -39,7 +39,18 @@ | |||
|
|||
static final byte[] BAM_MAGIC = "BAM\1".getBytes(); | |||
/** | |||
* BAM index file magic number. | |||
* BAM index file magic numbers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* BAM index file magic numbers. | |
* BAM index file magic numbers. | |
* @deprecated prefer {@link BAI_INDEX_MAGIC} |
@@ -366,27 +366,15 @@ protected int getMaxAddressibleGenomicLocation() { | |||
} | |||
|
|||
/** | |||
* @deprecated Use {@link GenomicIndexUtil#regionToBins(int, int)} instead. | |||
* | |||
* Get candidate bins for the specified region | |||
* @param startPos 1-based start of target region, inclusive. | |||
* @param endPos 1-based end of target region, inclusive. | |||
* @return bit set for each bin that may contain SAMRecords in the target region. | |||
*/ | |||
protected BitSet regionToBins(final int startPos, final int endPos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected BitSet regionToBins(final int startPos, final int endPos) { | |
@Deprecated | |
protected BitSet regionToBins(final int startPos, final int endPos) { |
Don't forget the actual deprecation tag.
|
||
verifyBAMMagicNumber(stream.getSource()); | ||
int [] sequenceIndexes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int [] sequenceIndexes; | |
int[] sequenceIndexes; |
|
||
verifyBAMMagicNumber(file.getName()); | ||
private AbstractBAMFileIndex(final IndexFileBuffer indexFileBuffer, final String source, final SAMSequenceDictionary dictionary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: revisit this comment before commiting
There's something funny about the constructor heirachy. It feels like verifyBAMMagicNumber should probably be made abstract and called always.
else | ||
return GenomicIndexUtil.LEVEL_STARTS[levelNumber+1]-GenomicIndexUtil.LEVEL_STARTS[levelNumber]; | ||
if (levelNumber >= getNumIndexLevels()) { | ||
throw new SAMException("Level number is too big (" + levelNumber + ")."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is duplicated it might be worth extracting a method assertLevelIsValid
method.
|
||
@Override | ||
public int position() { | ||
return (int)mCompressedStream.getPosition(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably check for > maxInt and throw in that case. Unlikely with an index file... but you never know...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on this. BGZF compressed streams use virtual offsets, which are 64 bit by construction and usually exceed Integer.MAX_VALUE
. I will change position
return type and seek
argument type to long
.
@Override | ||
public void readBytes(final byte[] bytes) { | ||
try { | ||
mCompressedStream.read(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a subtle bug here. mCompressedStream.read(bytes)
makes no guarantee that it will read as many bytes as you ask for. You have to check the return value of read and continue reading until you either have read as many bytes as you require, or it returns -1 in which case you should probably throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need something like readFully or readBytes from BinaryCodec here I think.
import java.io.IOException; | ||
|
||
|
||
class CompressedIndexFileBuffer implements IndexFileBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs class level javadoc
|
||
class CompressedIndexFileBuffer implements IndexFileBuffer { | ||
|
||
private BlockCompressedInputStream mCompressedStream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this final
|
||
import java.io.Closeable; | ||
|
||
interface IndexFileBuffer extends Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably get some javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of comments. Sorry there are so many. It's a lot of complex code.
BAMIndexMetaData[] data = getIndexStats(bam); | ||
|
||
BAMIndexMetaData[] data = null; | ||
if (bam.getIndexType().equals(SamIndexes.BAI) || bam.getIndexType().equals(SamIndexes.CSI)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check necessary or is it effectively handled by the null -> throw check below?
@@ -43,49 +43,73 @@ public static int exhaustivelyTestIndex(final SamReader reader) { // throws Exce | |||
// look at all chunk offsets in a linear index to make sure they are valid | |||
|
|||
if (reader.indexing().hasBrowseableIndex()) { | |||
if (SamIndexes.BAI.fileNameSuffix.endsWith(reader.type().indexExtension())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole file could be simplified a lot by extracting methods and rearranging things. If you don't want to do that as part of this pr we can file an issue and add this as is.
* @param levelNumber Level for which to compute the size. | ||
* @return | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra new line
* Extends the functionality of {@link AbstractBAMFileIndex#getFirstBinInLevel(int)} , | ||
* which cannot be overridden due to its static nature. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
* which cannot be overridden due to its static nature. | ||
*/ | ||
|
||
public int getFirstBinOnLevel (final int levelNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might make sense to just pull this implementation up to the base class. You would have to pull up getBinDepth as well. I guess there might be some performance concerns but I doubt this is any slower than the nested ifs and array lookups that it uses.
If you think that's not a good solution, this needs to be renamed to something more clearly different. Maybe getFirstBinInLevelForCSI
seek(BAMFileConstants.CSI_MINSHIFT_OFFSET); | ||
} | ||
setMinShift(readInteger()); // min_shift | ||
setBinDepth(readInteger() + 1); // depth - HTSlib doesn't count the first level (bin 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this comment.
} else if (indexBin == getMaxBins() + 1) { | ||
// meta data - build the bin so that the count of bins is correct; | ||
// but don't attach meta chunks to the bin, or normal queries will be off | ||
for (int ci = 0; ci < nChunks; ci++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets pull out a readChunks
in the superclass and use it in all 4 places this block of code is repeated.
|
||
/** Continually reads from the provided {@link SeekableStream} into the buffer until the specified number of bytes are read, or | ||
* until the stream is exhausted, throwing a {@link RuntimeIOException}. */ | ||
private static void readFully(final SeekableStream in, final byte[] buffer, final int offset, final int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to use BinaryCodec instead of reimplementing readFully.
@Override | ||
public void readBytes(final byte[] bytes) { | ||
try { | ||
mCompressedStream.read(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need something like readFully or readBytes from BinaryCodec here I think.
|
||
@Override | ||
public int readInteger() { | ||
final byte[] intbuff = new byte[4]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocating a new byte[] for each integer read is going to be super inefficient. It would make sense to reuse the buffer. This method suffers from the same problem with read not guaranteeing a definite number of bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test that actually opens a bam reader and exercises the various query methods with a file that only has a CSI index.
Fixed `CSIIndex.getStartOfLastLinearBin` method. Fixed read methods in `CompressedIndexFileBuffer` class. Changed `seek` and `position` methods to work with `long` offsets. Simplified the constructor hierarchy in `AbstractBAMFileIndex`. Added more unit tests in `BAMFileReaderTest`. Added javadoc and annotations.
I've addressed most of the change requests and also added a new test class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo I think this is good to go. Thank you for your extreme patience and perseverance. 👍
Description
Checklist