Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

jkbonfield · 2018-06-22T08:54:23Z

I don't have a samtools interface for this yet, but there are new test mechanisms in htslib for validating this. In samtools I'm unsure how we'd specify it. Off the top of my head I could think of:

A new samtools view option to ask for index (plus what type, and size of min_shift).
output-fmt-option keywords to request indexing, type, options, filename.
Some automatic defaults? (Probably bad)
Filename trickery: eg -o foo.bam/csi (also bad!)

I also added the ability to use BAI and CSI indices for bgzipped SAM while I was at it. (And indeed the ability to even write it natively.)

Note there are a few questions over the best API for SAM.gz. Our generic sam_read1 function takes fp and hdr. The generic hts_itr_next function only passes on fp though (and bgzf fp rather than htsFile fp at that). the multi-region iterator improves on the type of fp, but it still lacks the header.

We have two choices here. We could use the data argument to pass over a new struct containing both htsFile and bam_hdr_t, but if we do this in hts layer then we're removing the ability for callers to use this for anything else. That's perhaps not such a big issue. Note to do this we'd need to store the header structure somewhere when we initialise the iterator, either in the fp or the itr structure itself both work (I'd favour the latter).

Alternatively we can create a new sam_itr_next2 function which has an additional argument passed in to it (hdr) and we get this to do the struct construction before running hts_itr_next, which doesn't alter the abilities of the underlying function. However it does make an entirely new API function to call. This is what I did, but I think perhaps the former may be preferable.

Finally test_view has been upgraded to read/write vcf/bcf. It fails on vcf.gz reading, but that's due to it only being supported by synced reader.

jkbonfield · 2018-06-22T11:10:31Z

Summary of the changes:

bgzf_peek to permit SAM header parsing stopping before reading the next line. This is helpful when indexing as otherwise the first sam_idx_init has to have a cached bgzf_tell value copied from the header parsing. This may mean we can simplify other bits of code, eg the fp->line.l check in sam_read1. (It's not harmful to keep it though.)
Addition of idxfp to cram_fd structure, for purposes of writing the index on the fly. This is because the old cram index functions wrote the index as it read the file, rather than building the index in memory and writing it out at the end. We could change that, but this was the smaller tweak.
Various new elements of hts_idx_t. Specifically tbi_n and last_tbi_tid to track the number of references and the last reference number so we can do the mapping of CSI tids to TBI tids. (CSI is all, TBI is all covered.) I don't particularly like this, but couldn't think of anything better.
New hts_idx_tbi_name function for the purposes of TBI index generation. Again, we can't do this at the start from the header as we don't know what is covered. We do it on the fly, storing it in the already existant idx->meta pointer. We call hts_idx_tbi_name for each and every VCF record (if we're indexing), but it spots the same rname as last time quickly so hopefully this isn't a performance hit. (I didn't measure it though.)
Added an hts_idx_amend_last function to tweak the cached idx->z.last_off value. Normally we keep calling hts_idx_push repeatedly and it uses the last offset with the offset in this call to get the file extents covered by the record being pushed. However when we start a new bgzf block, the cached last value is incorrect. This provides a way to correct this whenever we start a new block.
hts_idx_fmt function, to return idx->fmt. This is used in sam_idx_save, but after various tweaks I think it's no longer necessary. I'll cull it and maybe add a generic format to direct hts_idx_save to use the idx->fmt value.
htsFile now has an idx pointer (and fnidx). The fnidx is needed because we only open the file at the time of calling hts_idx_save. The idx pointer is there so we can accumulate the index while writing each record. Indeed the presence of it is the determining factor for auto-index or not. However the fact this now exists may help unify BAM with CRAM and remove the need for the hts_cram_idx_t *cidx code.
sam_itr_queryi2, sam_itr_querys2, sam_itr_next2 functions - see above. It may be possible to avoid these and hack hts_itr_next instead.
New sam_idx_init and sam_idx_save functions to go along side the existing bam ones. These only support CSI and BAI. Should I add TBI too? Right now I'm using min_shift as the indicator between CSI/BAI, much as bcftools uses it for CSI/TBI. If we want all 3 we need a better switch.
Additional test harness function: test_compare (run command and compare two files). I can't believe we had no way to compare two files (rather than stdout to file), but I really can't see it. If this is a duplicate, please say.
Opening a vcf file now sets fp->format.format, instead of leaving it as some generic text format. This is needed for detecting the format while creating indices.
Added some error checking to hts_idx_finish.

~~Things still to do:~~
~~- A few white space / trivial code tweaks.~~

~~- Look at why the samtools test harness now fails.~~

kyleabeauchamp · 2018-06-22T17:14:04Z

FYI there's a related feature request in pysam: pysam-developers/pysam#684

kyleabeauchamp · 2018-07-06T17:52:22Z

Of the 4 options discussed, I too prefer to avoid the last 2.

"Explicit is better than implicit.": https://www.python.org/dev/peps/pep-0020/

Examples: test_view -x foo.bam.bai or test_view -x foo.bam.csi -m 14 Also fixed prototype of hts_idx_finish so it can return error codes.

This was always implied by the usage, but not implemented.

This required some changes to the APIs for *_idx_init and *_idx_save as with BAI/CSI we build the index in memory and write it at the end, while CRAI it writes the index as it goes. The index filename is now passed in during init and cached. Also amended bcf_hdr_write to set hfp->format correctly so we can query this later to identify the stream as BCF instead of generic BGZF.

This is messier than it appears due to subtle differences between CSI on BCF and CSI on VCF. CSI operates the same on BCF and BAM in that each ##contig or @sq header is counted and all are indexable, irrespective of whether they have data. CSI on VCF however operates like tabix, so the n_ref field is only incremented for contigs with data. In order to work out which one is which, it then embeds a tabix header inside the CSI meta data field. Thus the foo.vcf.gz.csi file is neither pure CSI nor pure TBX. In order to handle this case when indexing on the fly, it has to spot newly covered contigs and dynamically append to the meta data. (It does this only if the input format is VCF.)

For now this is using sam_itr_querys2 and sam_itr_next2 new functions as to read a SAM file we need the bam_hdr_t struct (mapping rname to tid). This isn't needed by BAM and the iterator bam_readrec callback function was not designed around having the same things available as the general purpose sam_read1 call. Note VCF does this differently. SAM.gz is only readable via the synced reader and cannot be read using a normal hts iterator. This is perhaps an oversight.

It's not ideal as -o is in use (sorry, I did that because -I was in use at the time so picked -i/-o; a bad decision). However this is just a test / debug tool anyway.

…ces.

Handle possible realloc failure in compress_binning() Some values read from index files are used for memory allocations. Ensure that they won't cause problems due to integer overflow.

Fixes bug introduced in 7317e4f where it incorrectly appended a NUL byte to the input string instead of the new copy.

Fix call to unlink(NULL) when running `bgzip -d < file.gz > file`.

Add M5: tags to test/index.sam so that the cram tests can find the right references. It's done this way rather then using the reference fasta file so that no UR: tags are added, possibly changing the length of the header. Add a function to test.pl to make an MD5 based cache from test/ce.fa. It's done this way as checking in the reference files would make the git repository quite a bit bigger.

On reading the SAM header, store a pointer to it in the htsFile struct for use by the iterators. Allows the existing iterator API to work with SAM files, which means programs that use BAM indexes don't need to be modified for SAM. The one disadvantage is that the header may be kept in memory for longer than is expected (although reading SAM needs it anyway, so it shouldn't be a huge issue). The header is currently only kept for SAM files. CRAM does its own thing, and some programs (notably samtools sort) expect to be able to drop the original header after reading it. As sort can currently have thousands of BAM files open at once, keeping all their headers could lead to excessive memory consumption. To allow a NULL header in bam, the checks on core.tid and core.mtid in sam_read1 have to be relaxed a little (although it's no worse than before when the iterator used bam_read1). Reference counting is used to prevent the header from being deleted prematurely.

Ensure that the input is a BGZF file before using htsfp->fp.bgzf in bcf_itr_next() and sam_itr_next(). Make indexing refuse to work on non-bgzf SAM It might be possible to get indexes to work on plain SAM, but it would involve lots of changes. We also want to encourage use of compressed files.

daviesrob · 2019-02-13T17:29:08Z

Rebased with some additions:

Extra error checking
Bug-fixes (a couple of which were quite old bugs)
Improved documentation
More tests
Multi-region iterator on SAM

A rather more major change is that the iterator API has reverted to the original one, so no sam_itr_next2 etc. This allows existing code to work on SAM with little or no modification, but means a pointer to the header has to be stored in the htsFile struct. This currently only happens for SAM, although CRAM does it implicitly (and always has) by storing header-related data in the cram_fd. BAM escapes for now because a few tools (notably samtools sort) expect to be able to discard the header on some files straight after reading and could use rather more memory if unable to. The header struct gets a reference count so that it isn't freed prematurely.

NB: This needs a .so number bump on merging.

…ABI] * Enable indexing of and iteration over BGZF-compressed SAM. * Enable building indexes while writing BGZF-compressed SAM, BAM, CRAM, BGZF-compressed VCF and BCF files * Add extra tests and fix some bugs * Add / improve doxygen documentation * Increment TWO_TO_THREE_TRANSITION_COUNT due to ABI break

jkbonfield added the breaks-ABI label Jun 22, 2018

bioinformed mentioned this pull request Jun 22, 2018

How to index on the fly when writing pysam-developers/pysam#684

Open

This was referenced Jun 25, 2018

Indexing on the fly samtools/samtools#848

Closed

tools that write BAM/CRAM should have option to index samtools/samtools#774

Closed

jkbonfield mentioned this pull request Oct 26, 2018

Sam multi-threading #786

Closed

jkbonfield added 4 commits February 7, 2019 13:40

Added auto-indexing capabilities to bam writing.

9417875

Examples: test_view -x foo.bam.bai or test_view -x foo.bam.csi -m 14 Also fixed prototype of hts_idx_finish so it can return error codes.

Upgraded test_view to read/write VCF & BCF.

46c9b6c

This was always implied by the usage, but not implemented.

Added support for BCF indexing on-the-fly

2965261

daviesrob force-pushed the auto_index branch from 28f88d0 to 103dc64 Compare February 8, 2019 15:03

jkbonfield and others added 14 commits February 13, 2019 14:21

Added test/test_index tool to call {sam,bam}_index_build.

0bdd67d

Added a test_view "-p out" option to avoid stdout.

31bdad9

It's not ideal as -o is in use (sorry, I did that because -I was in use at the time so picked -i/-o; a bad decision). However this is just a test / debug tool anyway.

We can now index SAM.gz files not on-the-fly too.

0007d39

Added test_index support to check the old (not on the fly) built indi…

0e49413

…ces.

Improve error checking for building / loading indexes

9efe5b8

Handle possible realloc failure in compress_binning() Some values read from index files are used for memory allocations. Ensure that they won't cause problems due to integer overflow.

Append NUL to correct array in hts_idx_set_meta()

5edacc4

Fixes bug introduced in 7317e4f where it incorrectly appended a NUL byte to the input string instead of the new copy.

Stop bgzip from trying to unlink NULL when decompressing stdin

f7e32f1

Fix call to unlink(NULL) when running `bgzip -d < file.gz > file`.

Add doxygen for indexing and iterator functions

8dc6468

Make the multi-region iterator work on SAM files.

7bf8de1

daviesrob force-pushed the auto_index branch from 103dc64 to 04d029a Compare February 13, 2019 14:44

daviesrob mentioned this pull request Feb 15, 2019

hts_itr_multi_t versus hts_itr_t #785

Closed

daviesrob merged commit 04d029a into samtools:develop Feb 19, 2019

jkbonfield deleted the auto_index branch June 5, 2019 17:55

This was referenced Mar 6, 2023

Does bcftools support --write-index like samtools? samtools/bcftools#1853

Closed

Feature request: add support for on the fly creation of VCF/BCF indexes #1555

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

jkbonfield commented Jun 22, 2018 •

edited

Loading

jkbonfield commented Jun 22, 2018 •

edited

Loading

kyleabeauchamp commented Jun 22, 2018

kyleabeauchamp commented Jul 6, 2018

daviesrob commented Feb 13, 2019

Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

Conversation

jkbonfield commented Jun 22, 2018 • edited Loading

jkbonfield commented Jun 22, 2018 • edited Loading

kyleabeauchamp commented Jun 22, 2018

kyleabeauchamp commented Jul 6, 2018

daviesrob commented Feb 13, 2019

jkbonfield commented Jun 22, 2018 •

edited

Loading

jkbonfield commented Jun 22, 2018 •

edited

Loading