Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

dpryan79 · 2018-01-05T11:49:23Z

For some reason, whether a given alignment is returned by sam_itr_queryi() in a CRAM file can be affected by whether and which required_fields have been set when opening the CRAM file. Let's take as an example this test file from deepTools (you can find the index and the fasta file for decoding it in the same directory). I'll use the following (generally poorly written) program as a test example:

#include <stdio.h>
#include <stdlib.h>
#include "htslib/sam.h"
#include "htslib/hts.h"

int main(int argc, char *argv[]) {
    samFile *cram = sam_open("test1.cram", "rc");
    hts_idx_t *idx = sam_index_load(cram, "test1.cram");
    char foo[1024];
    hts_opt *opts = NULL;
    hts_itr_t *iter = NULL;
    int cnt = 0, ret;
    bam1_t *b = bam_init1();

    //Set some options
    sprintf(foo, "required_fields=%i", argv[1]);
    hts_opt_add(&opts, foo);
    hts_opt_apply(cram, opts);
    hts_opt_free(opts);

    //Iterate over a range, counting the number of alignments
    iter = sam_itr_queryi(idx, 0, 99, 1100);
    while((ret=sam_itr_next(cram, iter, b)) > 0) cnt++;
    hts_itr_destroy(iter);

    sam_close(cram);
    bam_destroy1(b);
    free(idx);
    printf("There were %i entries found\n", cnt);
    return 0;
}

If I compile that to ./foo then I can easily play with the results:

$ ./foo 0xF
There were 0 entries found
$ ./foo 0xFF
There were 0 entries found
$ ./foo 0xFFF
There were 143 entries found
$ ./foo 0x8FF
There were 140 entries found

The same sort of thing happens with samtools view -c --input-fmt-option required_fields=0x8FF test1.cram 3R:99-1100 in the 1.6 release and going back to at least 1.3, so it's not purely a coding error on my part. I assume that this is not intended behavior.

The text was updated successfully, but these errors were encountered:

jkbonfield · 2018-01-05T12:28:00Z

This is probably a bug as I wouldn't expect that to happen.

The cram file in question has almost all fields shoe-horned into the CORE block, which means most fields have dependencies on each other which makes the required_fields parameter boil down to fetching and decoding everything anyway - in theory!

It's rather hairy code requiring knowledge of both code dependencies (to decode field X we have to go through code path Y which needs access to field Z, thus X depends on Z) and data dependencies (field P and Q colocate in the same block, thus P depends on Q and Q depends on P). All done in a recursive fashion until no other dependencies are found. Ugh! I thought I'd tested this with randomly allocated file blocks, but perhaps not.

I'll investigate and thanks for the report.

dpryan79 · 2018-01-05T12:35:21Z

@jkbonfield You're making me glad I've never dived into the CRAM code :)

jkbonfield · 2018-01-05T12:44:08Z

It's not necessary for CRAM, but it was an optimisation I realised could be made given the columnar storage. It turned out to be trickier than I anticipated though. :-)

jkbonfield · 2018-01-05T17:34:43Z

Brain dump so I don't forget this over the weekend.
Ok two problems.

I did find a bug with the required fields code, but it's not the one you saw! Specifically if you don't ask for sequences then it doesn't load the reference up, and if it doesn't have the reference loaded it doesn't generate the sequence string (obviously) but it also doesn't compute the alignment end. Thus alignments starting outside the region but extending into the region are lost when no sequence is requested. Trivial to fix and verified that works.
The strange case... Filtering out aux data and also having no read-group means the auxiliary data size is zero. The cram_to_bam code bizarrely seems to be returning the number of bytes added to the end of the bam struct after the rest of the read has been created (ie the length of aux data). This bubbles its way all the way up to the iterator code, which ends up getting a zero return value if you have absolutely no auxiliary data present and gets an early EOF. This is extremely unlikely as you nearly always get MD:Z and NM:i tags added too, but it is indeed possible to turn off absolutely every aux field using the filter options.

I'm trying to figure out quite what the return values should be for this code and for the iterator, but again I suspect the fix, once I've worked it out, is trivial.

Phew.

dpryan79 · 2018-01-06T11:05:09Z

Glad I'm not the one going down this particular rabbit hole :)

…elds option. 1. bam_construct_seq returned 0 instead of number of bytes written to the bam_seq_t struct. This in turn bubbled up to cram_get_seq returning only the number of auxiliary bytes written to instead of the total record size. Under specific circumstances (no RG tag, no NM/MD requested, and no other aux fields), this meant cram_get_bam_seq returned 0 on a successful read. This in turn was interpreted as EOF by the iterator. Given bam_read1 returns the size of the bam structure it feels appropriate for the cram version to mirror it, as originally intended. 2. If we use the required_fields option to exclude sequence data, then we also exclude loading of the reference (which is a useful optimisation). However this then also meant the cram_decode_seq function wasn't correctly modifying the alignment end, leaving it set to alignment start. The consequence is range queries combined with required_fields usage could miss reads which start before the range and end within it. Fixes samtools#640

jkbonfield · 2018-01-08T12:33:38Z

I think I've navigated the mine field and worked it out, with PR for review by the rest of the team. Sorry for the bug and thanks for reporting it.

dpryan79 · 2018-01-08T12:36:55Z

Thanks for the quick fix!

dpryan79 changed the title ~~Alignments returned by hts_itr_query() are affected by required_fields in CRAM files~~ Alignments returned by sam_itr_query are affected by required_fields in CRAM files Jan 5, 2018

valeriuo self-assigned this Jan 5, 2018

dpryan79 mentioned this issue Jan 5, 2018

CRAM support deeptools/deepTools#619

Closed

jkbonfield mentioned this issue Jan 8, 2018

Fixed return value of cram_get_bam_seq and fixed a bug in required_fields option. #641

Merged

valeriuo closed this as completed in #641 Jan 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

dpryan79 commented Jan 5, 2018 •

edited

jkbonfield commented Jan 5, 2018

dpryan79 commented Jan 5, 2018

jkbonfield commented Jan 5, 2018 •

edited

jkbonfield commented Jan 5, 2018 •

edited

dpryan79 commented Jan 6, 2018

jkbonfield commented Jan 8, 2018

dpryan79 commented Jan 8, 2018

Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

Comments

dpryan79 commented Jan 5, 2018 • edited

jkbonfield commented Jan 5, 2018

dpryan79 commented Jan 5, 2018

jkbonfield commented Jan 5, 2018 • edited

jkbonfield commented Jan 5, 2018 • edited

dpryan79 commented Jan 6, 2018

jkbonfield commented Jan 8, 2018

dpryan79 commented Jan 8, 2018

dpryan79 commented Jan 5, 2018 •

edited

jkbonfield commented Jan 5, 2018 •

edited

jkbonfield commented Jan 5, 2018 •

edited