Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640

Closed
dpryan79 opened this issue Jan 5, 2018 · 7 comments
Assignees

Comments

@dpryan79
Copy link
Contributor

dpryan79 commented Jan 5, 2018

For some reason, whether a given alignment is returned by sam_itr_queryi() in a CRAM file can be affected by whether and which required_fields have been set when opening the CRAM file. Let's take as an example this test file from deepTools (you can find the index and the fasta file for decoding it in the same directory). I'll use the following (generally poorly written) program as a test example:

#include <stdio.h>
#include <stdlib.h>
#include "htslib/sam.h"
#include "htslib/hts.h"

int main(int argc, char *argv[]) {
    samFile *cram = sam_open("test1.cram", "rc");
    hts_idx_t *idx = sam_index_load(cram, "test1.cram");
    char foo[1024];
    hts_opt *opts = NULL;
    hts_itr_t *iter = NULL;
    int cnt = 0, ret;
    bam1_t *b = bam_init1();

    //Set some options
    sprintf(foo, "required_fields=%i", argv[1]);
    hts_opt_add(&opts, foo);
    hts_opt_apply(cram, opts);
    hts_opt_free(opts);

    //Iterate over a range, counting the number of alignments
    iter = sam_itr_queryi(idx, 0, 99, 1100);
    while((ret=sam_itr_next(cram, iter, b)) > 0) cnt++;
    hts_itr_destroy(iter);

    sam_close(cram);
    bam_destroy1(b);
    free(idx);
    printf("There were %i entries found\n", cnt);
    return 0;
}

If I compile that to ./foo then I can easily play with the results:

$ ./foo 0xF
There were 0 entries found
$ ./foo 0xFF
There were 0 entries found
$ ./foo 0xFFF
There were 143 entries found
$ ./foo 0x8FF
There were 140 entries found

The same sort of thing happens with samtools view -c --input-fmt-option required_fields=0x8FF test1.cram 3R:99-1100 in the 1.6 release and going back to at least 1.3, so it's not purely a coding error on my part. I assume that this is not intended behavior.

@dpryan79 dpryan79 changed the title Alignments returned by hts_itr_query() are affected by required_fields in CRAM files Alignments returned by sam_itr_query are affected by required_fields in CRAM files Jan 5, 2018
@valeriuo valeriuo self-assigned this Jan 5, 2018
@jkbonfield
Copy link
Contributor

This is probably a bug as I wouldn't expect that to happen.

The cram file in question has almost all fields shoe-horned into the CORE block, which means most fields have dependencies on each other which makes the required_fields parameter boil down to fetching and decoding everything anyway - in theory!

It's rather hairy code requiring knowledge of both code dependencies (to decode field X we have to go through code path Y which needs access to field Z, thus X depends on Z) and data dependencies (field P and Q colocate in the same block, thus P depends on Q and Q depends on P). All done in a recursive fashion until no other dependencies are found. Ugh! I thought I'd tested this with randomly allocated file blocks, but perhaps not.

I'll investigate and thanks for the report.

@dpryan79
Copy link
Contributor Author

dpryan79 commented Jan 5, 2018

@jkbonfield You're making me glad I've never dived into the CRAM code :)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 5, 2018

It's not necessary for CRAM, but it was an optimisation I realised could be made given the columnar storage. It turned out to be trickier than I anticipated though. :-)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 5, 2018

Brain dump so I don't forget this over the weekend.
Ok two problems.

  1. I did find a bug with the required fields code, but it's not the one you saw! Specifically if you don't ask for sequences then it doesn't load the reference up, and if it doesn't have the reference loaded it doesn't generate the sequence string (obviously) but it also doesn't compute the alignment end. Thus alignments starting outside the region but extending into the region are lost when no sequence is requested. Trivial to fix and verified that works.

  2. The strange case... Filtering out aux data and also having no read-group means the auxiliary data size is zero. The cram_to_bam code bizarrely seems to be returning the number of bytes added to the end of the bam struct after the rest of the read has been created (ie the length of aux data). This bubbles its way all the way up to the iterator code, which ends up getting a zero return value if you have absolutely no auxiliary data present and gets an early EOF. This is extremely unlikely as you nearly always get MD:Z and NM:i tags added too, but it is indeed possible to turn off absolutely every aux field using the filter options.

I'm trying to figure out quite what the return values should be for this code and for the iterator, but again I suspect the fix, once I've worked it out, is trivial.

Phew.

@dpryan79
Copy link
Contributor Author

dpryan79 commented Jan 6, 2018

Glad I'm not the one going down this particular rabbit hole :)

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 8, 2018
…elds option.

1. bam_construct_seq returned 0 instead of number of bytes written to
the bam_seq_t struct.  This in turn bubbled up to cram_get_seq
returning only the number of auxiliary bytes written to instead of the
total record size.

Under specific circumstances (no RG tag, no NM/MD requested, and no
other aux fields), this meant cram_get_bam_seq returned 0 on a
successful read.  This in turn was interpreted as EOF by the iterator.
Given bam_read1 returns the size of the bam structure it feels
appropriate for the cram version to mirror it, as originally intended.

2. If we use the required_fields option to exclude sequence data, then
we also exclude loading of the reference (which is a useful
optimisation).  However this then also meant the cram_decode_seq
function wasn't correctly modifying the alignment end, leaving it set
to alignment start.

The consequence is range queries combined with required_fields usage
could miss reads which start before the range and end within it.

Fixes samtools#640
@jkbonfield
Copy link
Contributor

I think I've navigated the mine field and worked it out, with PR for review by the rest of the team. Sorry for the bug and thanks for reporting it.

@dpryan79
Copy link
Contributor Author

dpryan79 commented Jan 8, 2018

Thanks for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants