Auto indexing during writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718
Summary of the changes:
FYI there's a related feature request in pysam: pysam-developers/pysam#684
Of the 4 options discussed, I too prefer to avoid the last 2. "Explicit is better than implicit.": https://www.python.org/dev/peps/pep-0020/
Examples: `test_view -x foo.bam.bai` or `test_view -x foo.bam.csi -m 14`. Also fixed the prototype of hts_idx_finish so it can return error codes.
This was always implied by the usage, but not implemented.
This required some changes to the APIs for *_idx_init and *_idx_save: with BAI/CSI we build the index in memory and write it at the end, while for CRAI the index is written as we go. The index filename is now passed in during init and cached. Also amended bcf_hdr_write to set hfp->format correctly so we can query this later to identify the stream as BCF instead of generic BGZF.
This is messier than it appears due to subtle differences between CSI on BCF and CSI on VCF. CSI operates the same on BCF and BAM in that each ##contig or @SQ header is counted and all are indexable, irrespective of whether they have data. CSI on VCF, however, operates like tabix, so the n_ref field is only incremented for contigs with data. In order to work out which one is which, it then embeds a tabix header inside the CSI meta data field. Thus the foo.vcf.gz.csi file is neither pure CSI nor pure TBX. To handle this case when indexing on the fly, the writer has to spot newly covered contigs and dynamically append them to the meta data. (It does this only if the input format is VCF.)
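The "spot newly covered contigs and append to the meta data" step can be illustrated with a small standalone sketch. It is not the htslib code: `demo_meta_t`, `demo_meta_find` and `demo_meta_touch` are hypothetical names, and the only thing assumed from tabix is that contig names are stored back to back, each terminated by a NUL byte. A linear search is used for brevity; a real implementation would more likely use a hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the name list held in the index meta data:
 * contig names stored back to back, each terminated by a NUL byte. */
typedef struct {
    char  *names;    /* "chr2\0chr5\0..." */
    size_t l_names;  /* bytes used, including the NUL terminators */
    int    n_ref;    /* number of contigs seen with data so far */
} demo_meta_t;

/* Return the index of `name` in the list, or -1 if it is not present. */
static int demo_meta_find(const demo_meta_t *m, const char *name)
{
    size_t off = 0;
    for (int i = 0; i < m->n_ref; i++) {
        if (strcmp(m->names + off, name) == 0)
            return i;
        off += strlen(m->names + off) + 1;
    }
    return -1;
}

/* Call once per record written.  The first time a contig is seen its
 * name is appended and n_ref is incremented.  Returns the contig's
 * index, or -1 if memory could not be allocated. */
static int demo_meta_touch(demo_meta_t *m, const char *name)
{
    assert(m != NULL && name != NULL);
    int tid = demo_meta_find(m, name);
    if (tid >= 0)
        return tid;                       /* already covered */

    size_t len = strlen(name) + 1;        /* include the NUL */
    char *tmp = realloc(m->names, m->l_names + len);
    if (tmp == NULL)
        return -1;                        /* old buffer still valid */
    memcpy(tmp + m->l_names, name, len);
    m->names = tmp;
    m->l_names += len;
    return m->n_ref++;
}
```

With this model, contigs declared in the header but never written do not count towards `n_ref`, which is the tabix-like behaviour described above.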
For now this uses the new functions sam_itr_querys2 and sam_itr_next2, because reading a SAM file needs the bam_hdr_t struct (mapping rname to tid). BAM does not need this, and the iterator's bam_readrec callback function was not designed around having the same things available as the general purpose sam_read1 call. Note VCF does this differently. SAM.gz is only readable via the synced reader and cannot be read using a normal hts iterator. This is perhaps an oversight.
It's not ideal as -o is in use (sorry, I did that because -I was in use at the time so picked -i/-o; a bad decision). However this is just a test / debug tool anyway.
Handle possible realloc failure in compress_binning(). Some values read from index files are used for memory allocations. Ensure that they won't cause problems due to integer overflow.
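As a rough illustration of both points (this is not the code from the commit; `alloc_checked` and `grow_checked` are hypothetical helpers), a count read from an untrusted index file can be checked against `SIZE_MAX` before it is multiplied, and a `realloc` result can be kept in a temporary so the original pointer is not lost on failure:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not an htslib function: allocate room for `n`
 * items of `item_size` bytes, where `n` came from an index file.
 * Returns NULL if n * item_size would overflow size_t or if the
 * allocation itself fails. */
static void *alloc_checked(uint64_t n, size_t item_size)
{
    assert(item_size > 0);
    if (n > SIZE_MAX / item_size)
        return NULL;                      /* product would wrap around */
    return malloc((size_t) n * item_size);
}

/* Grow an array without losing the old pointer on failure.  On success
 * *arr is updated and 0 is returned; on failure *arr is left untouched
 * (so the caller can still free it) and -1 is returned. */
static int grow_checked(void **arr, uint64_t new_n, size_t item_size)
{
    assert(arr != NULL && item_size > 0);
    if (new_n == 0 || new_n > SIZE_MAX / item_size)
        return -1;
    void *tmp = realloc(*arr, (size_t) new_n * item_size);
    if (tmp == NULL)
        return -1;
    *arr = tmp;
    return 0;
}
```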
Fixes bug introduced in 7317e4f where it incorrectly appended a NUL byte to the input string instead of the new copy.
Fix call to unlink(NULL) when running `bgzip -d < file.gz > file`.
Add M5: tags to test/index.sam so that the cram tests can find the right references. It's done this way rather than using the reference fasta file so that no UR: tags are added, possibly changing the length of the header. Add a function to test.pl to make an MD5 based cache from test/ce.fa. It's done this way as checking in the reference files would make the git repository quite a bit bigger.
On reading the SAM header, store a pointer to it in the htsFile struct for use by the iterators. Allows the existing iterator API to work with SAM files, which means programs that use BAM indexes don't need to be modified for SAM. The one disadvantage is that the header may be kept in memory for longer than is expected (although reading SAM needs it anyway, so it shouldn't be a huge issue). The header is currently only kept for SAM files. CRAM does its own thing, and some programs (notably samtools sort) expect to be able to drop the original header after reading it. As sort can currently have thousands of BAM files open at once, keeping all their headers could lead to excessive memory consumption. To allow a NULL header in bam, the checks on core.tid and core.mtid in sam_read1 have to be relaxed a little (although it's no worse than before when the iterator used bam_read1). Reference counting is used to prevent the header from being deleted prematurely.
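The reference-counting idea can be sketched in isolation as follows. The types and functions here (`demo_hdr_t`, `demo_file_t`, `demo_hdr_incref` and so on) are hypothetical stand-ins rather than the htslib structs; the sketch only shows why a caller that drops its header after reading it does not pull it out from under a file handle that still needs it for iteration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real header and file structs. */
typedef struct {
    int ref_count;   /* number of owners; freed when it reaches zero */
    int n_targets;
} demo_hdr_t;

typedef struct {
    demo_hdr_t *hdr; /* header kept for the iterators, may be NULL */
} demo_file_t;

static demo_hdr_t *demo_hdr_new(int n_targets)
{
    demo_hdr_t *h = calloc(1, sizeof(*h));
    if (h != NULL) {
        h->ref_count = 1;                 /* the creator owns one reference */
        h->n_targets = n_targets;
    }
    return h;
}

static void demo_hdr_incref(demo_hdr_t *h)
{
    if (h != NULL)
        h->ref_count++;
}

/* Drop one reference.  Returns 1 if the header was freed, else 0. */
static int demo_hdr_destroy(demo_hdr_t *h)
{
    if (h == NULL)
        return 0;
    assert(h->ref_count > 0);
    if (--h->ref_count > 0)
        return 0;                         /* someone else still holds it */
    free(h);
    return 1;
}

/* The file handle takes its own reference when it stores the header. */
static void demo_file_set_hdr(demo_file_t *fp, demo_hdr_t *h)
{
    demo_hdr_incref(h);
    demo_hdr_destroy(fp->hdr);            /* release any previous header */
    fp->hdr = h;
}

/* Closing the file releases the handle's reference. */
static int demo_file_close(demo_file_t *fp)
{
    int freed = demo_hdr_destroy(fp->hdr);
    fp->hdr = NULL;
    return freed;
}
```

In this model the caller's destroy call only frees the header if the file handle never stored it, or has already been closed.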
Ensure that the input is a BGZF file before using htsfp->fp.bgzf in bcf_itr_next() and sam_itr_next(). Make indexing refuse to work on non-bgzf SAM. It might be possible to get indexes to work on plain SAM, but it would involve lots of changes. We also want to encourage use of compressed files.
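The guard described here amounts to checking which member of a union is live before touching it. A minimal sketch, using hypothetical `ex_*` types rather than the real htsFile layout, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a file handle whose stream pointer is a union;
 * only one member is valid, depending on the compression type. */
typedef enum { EX_NONE, EX_GZIP, EX_BGZF } ex_compression_t;
typedef struct { long block_offset; } ex_bgzf_t;
typedef struct { int fd; } ex_plain_t;

typedef struct {
    ex_compression_t compression;
    union {
        ex_bgzf_t  *bgzf;
        ex_plain_t *plain;
    } fp;
} ex_file_t;

/* Return the BGZF stream only if the handle really is BGZF. */
static ex_bgzf_t *ex_get_bgzf(ex_file_t *f)
{
    if (f == NULL || f->compression != EX_BGZF)
        return NULL;
    return f->fp.bgzf;
}

/* Indexing needs block offsets, so refuse anything that is not BGZF.
 * Returns 0 on success and -1 if the input cannot be indexed. */
static int ex_index_build(ex_file_t *f)
{
    ex_bgzf_t *b = ex_get_bgzf(f);
    if (b == NULL)
        return -1;
    /* ... an index would be built from b->block_offset here ... */
    return 0;
}
```

Routing every access through one accessor keeps the check in a single place, so iterator and indexing code cannot read the wrong union member by accident.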
103dc64 to 04d029a
Rebased with some additions:
A rather more major change is that the iterator API has reverted to the original one, so no
NB: This needs a .so number bump on merging.
…ABI]
* Enable indexing of and iteration over BGZF-compressed SAM.
* Enable building indexes while writing BGZF-compressed SAM, BAM, CRAM, BGZF-compressed VCF and BCF files
* Add extra tests and fix some bugs
* Add / improve doxygen documentation
* Increment TWO_TO_THREE_TRANSITION_COUNT due to ABI break
I don't have a samtools interface for this yet, but there are new test mechanisms in htslib for validating this. In samtools I'm unsure how we'd specify it. Off the top of my head I could think of:
I also added the ability to use BAI and CSI indices for bgzipped SAM while I was at it. (And indeed the ability to even write it natively.)
Note there are a few questions over the best API for SAM.gz. Our generic `sam_read1` function takes fp and hdr. The generic `hts_itr_next` function only passes on fp though (and a bgzf fp rather than an htsFile fp at that). The multi-region iterator improves on the type of fp, but it still lacks the header.
We have two choices here. We could use the data argument to pass over a new struct containing both htsFile and bam_hdr_t, but if we do this in the hts layer then we're removing the ability for callers to use this for anything else. That's perhaps not such a big issue. Note that to do this we'd need to store the header structure somewhere when we initialise the iterator; either the fp or the itr structure itself would work (I'd favour the latter).
Alternatively we can create a new sam_itr_next2 function which has an additional argument passed in to it (hdr) and we get this to do the struct construction before running hts_itr_next, which doesn't alter the abilities of the underlying function. However it does make an entirely new API function to call. This is what I did, but I think perhaps the former may be preferable.
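To make the wrapper approach concrete, here is a toy model of it. None of these are htslib types or functions: `toy_itr_next` stands in for a generic iterator step that only forwards a low-level stream and an opaque `data` pointer, and `toy_sam_itr_next2` is a hypothetical wrapper that bundles the file handle and header into a struct before calling it. The record-reading logic is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the low-level stream, file handle, header, record. */
typedef struct { int pos; } toy_bgzf_t;
typedef struct { toy_bgzf_t *bgzf; } toy_file_t;
typedef struct { int n_targets; } toy_hdr_t;
typedef struct { int tid; } toy_rec_t;

/* Bundle carried through the opaque data pointer. */
typedef struct {
    toy_file_t *fp;
    toy_hdr_t  *hdr;
} toy_itr_data_t;

/* Generic callback shape: it sees the low-level stream and `data` only. */
typedef int (*toy_readrec_f)(toy_bgzf_t *bgzf, void *data, toy_rec_t *rec);

/* Generic iterator step; it knows nothing about headers. */
static int toy_itr_next(toy_bgzf_t *bgzf, toy_readrec_f readrec,
                        void *data, toy_rec_t *rec)
{
    return readrec(bgzf, data, rec);
}

/* SAM-like reader: needs the header, recovered from `data`.
 * Returns 0 on success and -1 at end of data. */
static int toy_sam_readrec(toy_bgzf_t *bgzf, void *data, toy_rec_t *rec)
{
    toy_itr_data_t *d = data;
    assert(d != NULL && d->fp != NULL && d->hdr != NULL);
    if (bgzf->pos >= d->hdr->n_targets)
        return -1;
    rec->tid = bgzf->pos++;               /* placeholder for real parsing */
    return 0;
}

/* Wrapper with the extra hdr argument: builds the bundle, then defers
 * to the unchanged generic function. */
static int toy_sam_itr_next2(toy_file_t *fp, toy_hdr_t *hdr, toy_rec_t *rec)
{
    toy_itr_data_t d = { fp, hdr };
    return toy_itr_next(fp->bgzf, toy_sam_readrec, &d, rec);
}
```

The first option discussed above would build the same bundle once at iterator initialisation instead of on every call, at the cost of claiming the `data` pointer for the library's own use.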
Finally, test_view has been upgraded to read/write vcf/bcf. It fails on vcf.gz reading, but that's because vcf.gz is only supported by the synced reader.