Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto indexing durng writing of BAM, SAM.gz, CRAM, BCF, VCF.gz #718

Merged
merged 18 commits into from
Feb 19, 2019

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Jun 22, 2018

I don't have a samtools interface for this yet, but there are new test mechanisms in htslib for validating this. In samtools I'm unsure how we'd specify it. Off the top of my head I could think of:

  • A new samtools view option to ask for index (plus what type, and size of min_shift).
  • output-fmt-option keywords to request indexing, type, options, filename.
  • Some automatic defaults? (Probably bad)
  • Filename trickery: eg -o foo.bam/csi (also bad!)

I also added the ability to use BAI and CSI indices for bgzipped SAM while I was at it. (And indeed the ability to even write it natively.)

Note there are a few questions over the best API for SAM.gz. Our generic sam_read1 function takes fp and hdr. The generic hts_itr_next function only passes on fp though (and bgzf fp rather than htsFile fp at that). the multi-region iterator improves on the type of fp, but it still lacks the header.

We have two choices here. We could use the data argument to pass over a new struct containing both htsFile and bam_hdr_t, but if we do this in hts layer then we're removing the ability for callers to use this for anything else. That's perhaps not such a big issue. Note to do this we'd need to store the header structure somewhere when we initialise the iterator, either in the fp or the itr structure itself both work (I'd favour the latter).

Alternatively we can create a new sam_itr_next2 function which has an additional argument passed in to it (hdr) and we get this to do the struct construction before running hts_itr_next, which doesn't alter the abilities of the underlying function. However it does make an entirely new API function to call. This is what I did, but I think perhaps the former may be preferable.

Finally test_view has been upgraded to read/write vcf/bcf. It fails on vcf.gz reading, but that's due to it only being supported by synced reader.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 22, 2018

Summary of the changes:

  • bgzf_peek to permit SAM header parsing stopping before reading the next line. This is helpful when indexing as otherwise the first sam_idx_init has to have a cached bgzf_tell value copied from the header parsing. This may mean we can simplify other bits of code, eg the fp->line.l check in sam_read1. (It's not harmful to keep it though.)

  • Addition of idxfp to cram_fd structure, for purposes of writing the index on the fly. This is because the old cram index functions wrote the index as it read the file, rather than building the index in memory and writing it out at the end. We could change that, but this was the smaller tweak.

  • Various new elements of hts_idx_t. Specifically tbi_n and last_tbi_tid to track the number of references and the last reference number so we can do the mapping of CSI tids to TBI tids. (CSI is all, TBI is all covered.) I don't particularly like this, but couldn't think of anything better.

  • New hts_idx_tbi_name function for the purposes of TBI index generation. Again, we can't do this at the start from the header as we don't know what is covered. We do it on the fly, storing it in the already existant idx->meta pointer. We call hts_idx_tbi_name for each and every VCF record (if we're indexing), but it spots the same rname as last time quickly so hopefully this isn't a performance hit. (I didn't measure it though.)

  • Added an hts_idx_amend_last function to tweak the cached idx->z.last_off value. Normally we keep calling hts_idx_push repeatedly and it uses the last offset with the offset in this call to get the file extents covered by the record being pushed. However when we start a new bgzf block, the cached last value is incorrect. This provides a way to correct this whenever we start a new block.

  • hts_idx_fmt function, to return idx->fmt. This is used in sam_idx_save, but after various tweaks I think it's no longer necessary. I'll cull it and maybe add a generic format to direct hts_idx_save to use the idx->fmt value.

  • htsFile now has an idx pointer (and fnidx). The fnidx is needed because we only open the file at the time of calling hts_idx_save. The idx pointer is there so we can accumulate the index while writing each record. Indeed the presence of it is the determining factor for auto-index or not. However the fact this now exists may help unify BAM with CRAM and remove the need for the hts_cram_idx_t *cidx code.

  • sam_itr_queryi2, sam_itr_querys2, sam_itr_next2 functions - see above. It may be possible to avoid these and hack hts_itr_next instead.

  • New sam_idx_init and sam_idx_save functions to go along side the existing bam ones. These only support CSI and BAI. Should I add TBI too? Right now I'm using min_shift as the indicator between CSI/BAI, much as bcftools uses it for CSI/TBI. If we want all 3 we need a better switch.

  • Additional test harness function: test_compare (run command and compare two files). I can't believe we had no way to compare two files (rather than stdout to file), but I really can't see it. If this is a duplicate, please say.

  • Opening a vcf file now sets fp->format.format, instead of leaving it as some generic text format. This is needed for detecting the format while creating indices.

  • Added some error checking to hts_idx_finish.

Things still to do:
- A few white space / trivial code tweaks.

- Look at why the samtools test harness now fails.

@kyleabeauchamp
Copy link

FYI there's a related feature request in pysam: pysam-developers/pysam#684

@kyleabeauchamp
Copy link

Of the 4 options discussed, I too prefer to avoid the last 2.

"Explicit is better than implicit.": https://www.python.org/dev/peps/pep-0020/

@jkbonfield jkbonfield mentioned this pull request Oct 26, 2018
Examples: test_view -x foo.bam.bai or test_view -x foo.bam.csi -m 14

Also fixed prototype of hts_idx_finish so it can return error codes.
This was always implied by the usage, but not implemented.
This required some changes to the APIs for *_idx_init and *_idx_save
as with BAI/CSI we build the index in memory and write it at the end,
while CRAI it writes the index as it goes.  The index filename is now
passed in during init and cached.

Also amended bcf_hdr_write to set hfp->format correctly so we can
query this later to identify the stream as BCF instead of generic BGZF.
jkbonfield and others added 14 commits February 13, 2019 14:21
This is messier than it appears due to subtle differences between CSI
on BCF and CSI on VCF.

CSI operates the same on BCF and BAM in that each ##contig or @sq header
is counted and all are indexable, irrespective of whether they have
data.  CSI on VCF however operates like tabix, so the n_ref field is
only incremented for contigs with data.  In order to work out which
one is which, it then embeds a tabix header inside the CSI meta data
field.  Thus the foo.vcf.gz.csi file is neither pure CSI nor pure TBX.

In order to handle this case when indexing on the fly, it has to spot
newly covered contigs and dynamically append to the meta data.  (It
does this only if the input format is VCF.)
For now this is using sam_itr_querys2 and sam_itr_next2 new functions
as to read a SAM file we need the bam_hdr_t struct (mapping rname to
tid).  This isn't needed by BAM and the iterator bam_readrec callback
function was not designed around having the same things available as
the general purpose sam_read1 call.

Note VCF does this differently.  SAM.gz is only readable via the
synced reader and cannot be read using a normal hts iterator.  This is
perhaps an oversight.
It's not ideal as -o is in use (sorry, I did that because -I was in
use at the time so picked -i/-o; a bad decision).  However this is
just a test / debug tool anyway.
Handle possible realloc failure in compress_binning()

Some values read from index files are used for memory allocations.
Ensure that they won't cause problems due to integer overflow.
Fixes bug introduced in 7317e4f where it incorrectly appended a NUL
byte to the input string instead of the new copy.
Fix call to unlink(NULL) when running `bgzip -d < file.gz > file`.
Add M5: tags to test/index.sam so that the cram tests can find the
right references.  It's done this way rather then using the
reference fasta file so that no UR: tags are added, possibly
changing the length of the header.

Add a function to test.pl to make an MD5 based cache from
test/ce.fa.  It's done this way as checking in the reference
files would make the git repository quite a bit bigger.
On reading the SAM header, store a pointer to it in the htsFile
struct for use by the iterators.  Allows the existing iterator API
to work with SAM files, which means programs that use BAM indexes
don't need to be modified for SAM.  The one disadvantage is that
the header may be kept in memory for longer than is expected
(although reading SAM needs it anyway, so it shouldn't be a
huge issue).

The header is currently only kept for SAM files.  CRAM does its
own thing, and some programs (notably samtools sort) expect to
be able to drop the original header after reading it.  As sort
can currently have thousands of BAM files open at once, keeping
all their headers could lead to excessive memory consumption.

To allow a NULL header in bam, the checks on core.tid and
core.mtid in sam_read1 have to be relaxed a little (although
it's no worse than before when the iterator used bam_read1).

Reference counting is used to prevent the header from being
deleted prematurely.
Ensure that the input is a BGZF file before using htsfp->fp.bgzf
in bcf_itr_next() and sam_itr_next().

Make indexing refuse to work on non-bgzf SAM

It might be possible to get indexes to work on plain SAM, but it
would involve lots of changes.  We also want to encourage use of
compressed files.
@daviesrob
Copy link
Member

Rebased with some additions:

  • Extra error checking
  • Bug-fixes (a couple of which were quite old bugs)
  • Improved documentation
  • More tests
  • Multi-region iterator on SAM

A rather more major change is that the iterator API has reverted to the original one, so no sam_itr_next2 etc. This allows existing code to work on SAM with little or no modification, but means a pointer to the header has to be stored in the htsFile struct. This currently only happens for SAM, although CRAM does it implicitly (and always has) by storing header-related data in the cram_fd. BAM escapes for now because a few tools (notably samtools sort) expect to be able to discard the header on some files straight after reading and could use rather more memory if unable to. The header struct gets a reference count so that it isn't freed prematurely.

NB: This needs a .so number bump on merging.

daviesrob added a commit to daviesrob/htslib that referenced this pull request Feb 19, 2019
…ABI]

* Enable indexing of and iteration over BGZF-compressed SAM.

* Enable building indexes while writing BGZF-compressed	SAM, BAM,
  CRAM, BGZF-compressed VCF and BCF files

* Add extra tests and fix some bugs

* Add / improve doxygen documentation

* Increment TWO_TO_THREE_TRANSITION_COUNT due to ABI break
@daviesrob daviesrob merged commit 04d029a into samtools:develop Feb 19, 2019
@jkbonfield jkbonfield deleted the auto_index branch June 5, 2019 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants