Alignments returned by sam_itr_query are affected by required_fields in CRAM files #640
Comments
This is probably a bug, as I wouldn't expect that to happen. The CRAM file in question has almost all fields shoe-horned into the CORE block, which means most fields depend on each other, so the required_fields parameter boils down to fetching and decoding everything anyway - in theory! It's rather hairy code requiring knowledge of both code dependencies (to decode field X we have to go through code path Y, which needs access to field Z, thus X depends on Z) and data dependencies (fields P and Q are co-located in the same block, thus P depends on Q and Q depends on P). This is all resolved recursively until no further dependencies are found. Ugh! I thought I'd tested this with randomly allocated file blocks, but perhaps not. I'll investigate, and thanks for the report.
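To make the dependency resolution described above concrete, here is a minimal sketch of expanding a requested field set until nothing new is added. It is not the htslib implementation: the function name `toy_expand_fields`, the five-field layout, and the dependency table are all hypothetical, and it uses a fixed-point loop rather than literal recursion.

```c
#include <stdint.h>

#define TOY_NFIELDS 5  /* hypothetical number of fields, for illustration only */

/*
 * Expand `wanted` (bit i set = field i requested) with every field it
 * depends on. deps[i] is the bitmask of fields that field i needs, whether
 * through a code dependency or because they share a block.
 * The loop repeats until a full pass adds no new field.
 */
uint32_t toy_expand_fields(uint32_t wanted, const uint32_t deps[TOY_NFIELDS])
{
    uint32_t prev;
    do {
        prev = wanted;
        for (int i = 0; i < TOY_NFIELDS; i++) {
            if (wanted & (1u << i))
                wanted |= deps[i];
        }
    } while (wanted != prev);
    return wanted;
}
```

With a table in which every field depends on every other (roughly the "everything in the CORE block" case), requesting any single field expands to the full set, which is why required_fields should then degrade to decoding everything.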
@jkbonfield You're making me glad I've never dived into the CRAM code :)
It's not necessary for CRAM, but it was an optimisation I realised could be made given the columnar storage. It turned out to be trickier than I anticipated though. :-)
Brain dump so I don't forget this over the weekend.
I'm trying to figure out quite what the return values should be for this code and for the iterator, but again I suspect the fix, once I've worked it out, is trivial. Phew.
Glad I'm not the one going down this particular rabbit hole :)
…elds option.
1. bam_construct_seq returned 0 instead of the number of bytes written to the bam_seq_t struct. This in turn bubbled up to cram_get_seq returning only the number of auxiliary bytes written instead of the total record size. Under specific circumstances (no RG tag, no NM/MD requested, and no other aux fields), this meant cram_get_bam_seq returned 0 on a successful read, which the iterator then interpreted as EOF. Given that bam_read1 returns the size of the bam structure, it feels appropriate for the CRAM version to mirror it, as originally intended.
2. If we use the required_fields option to exclude sequence data, then we also exclude loading of the reference (which is a useful optimisation). However, this also meant the cram_decode_seq function wasn't correctly updating the alignment end, leaving it set to the alignment start. The consequence is that range queries combined with required_fields could miss reads which start before the range and end within it.
Fixes samtools#640
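Both failure modes in the commit message can be reproduced in miniature. The sketch below is a toy model, not htslib code: `toy_rec`, the `toy_*` functions, and the half-open overlap test are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified record; this is not htslib's bam1_t. */
typedef struct {
    int64_t pos;       /* 0-based alignment start */
    int64_t ref_len;   /* reference bases covered by the alignment */
    int     aux_len;   /* bytes of auxiliary data */
    int     core_len;  /* bytes of everything else in the record */
} toy_rec;

/* Point 1: reporting only the aux bytes yields 0 for a valid record
 * that happens to carry no aux data. */
int toy_size_buggy(const toy_rec *r) { return r->aux_len; }
int toy_size_fixed(const toy_rec *r) { return r->core_len + r->aux_len; }

/* Iterator-style loop that treats a size of 0 (or less) as end of input. */
int toy_count(const toy_rec *recs, int n, int (*size_fn)(const toy_rec *))
{
    int seen = 0;
    for (int i = 0; i < n; i++) {
        if (size_fn(&recs[i]) <= 0)
            break;              /* interpreted as EOF */
        seen++;
    }
    return seen;
}

/* Point 2: if the end is left equal to the start, the read looks empty. */
int64_t toy_end_buggy(const toy_rec *r) { return r->pos; }
int64_t toy_end_fixed(const toy_rec *r) { return r->pos + r->ref_len; }

/* Half-open overlap test between [start, end) and the query [beg, fin). */
bool toy_overlaps(int64_t start, int64_t end, int64_t beg, int64_t fin)
{
    return start < fin && end > beg;
}
```

For a read starting at 50 and covering 100 reference bases, a query for [99, 1100) matches with `toy_end_fixed` but not with `toy_end_buggy`, which is the "starts before the range and ends within it" case from point 2.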
I think I've navigated the minefield and worked it out, with a PR up for review by the rest of the team. Sorry for the bug and thanks for reporting it.
Thanks for the quick fix!
For some reason, whether a given alignment is returned by sam_itr_queryi() in a CRAM file can be affected by whether and which required_fields have been set when opening the CRAM file. Let's take as an example this test file from deepTools (you can find the index and the fasta file for decoding it in the same directory). I'll use the following (generally poorly written) program as a test example:
If I compile that to ./foo then I can easily play with the results:
The same sort of thing happens with
samtools view -c --input-fmt-option required_fields=0x8FF test1.cram 3R:99-1100
in the 1.6 release and going back to at least 1.3, so it's not purely a coding error on my part. I assume that this is not intended behavior.
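For readers unfamiliar with the required_fields bitmask, the sketch below decodes a value such as 0x8FF into field names. The bit values are local copies of what I believe htslib's `sam_fields` enum in `htslib/sam.h` defines; treat them as an assumption and check the header. The table and the helper `toy_list_fields` are hypothetical and not part of htslib.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror htslib's sam_fields enum; verify against htslib/sam.h. */
static const struct { uint32_t bit; const char *name; } toy_fields[] = {
    { 0x0001, "QNAME" }, { 0x0002, "FLAG"  }, { 0x0004, "RNAME" },
    { 0x0008, "POS"   }, { 0x0010, "MAPQ"  }, { 0x0020, "CIGAR" },
    { 0x0040, "RNEXT" }, { 0x0080, "PNEXT" }, { 0x0100, "TLEN"  },
    { 0x0200, "SEQ"   }, { 0x0400, "QUAL"  }, { 0x0800, "AUX"   },
    { 0x1000, "RGAUX" },
};

/*
 * Write a comma-separated list of the fields set in `mask` into `buf`.
 * Returns the number of names written in full; stops early (leaving a
 * truncated string) if `buf` is too small.
 */
int toy_list_fields(uint32_t mask, char *buf, size_t buflen)
{
    int n = 0;
    size_t used = 0;
    if (buflen)
        buf[0] = '\0';
    for (size_t i = 0; i < sizeof toy_fields / sizeof toy_fields[0]; i++) {
        if (!(mask & toy_fields[i].bit))
            continue;
        int w = snprintf(buf + used, buflen - used, "%s%s",
                         n ? "," : "", toy_fields[i].name);
        if (w < 0 || (size_t)w >= buflen - used)
            break;
        used += (size_t)w;
        n++;
    }
    return n;
}
```

Under these assumed values, 0x8FF selects QNAME through PNEXT plus AUX and leaves out TLEN, SEQ and QUAL, so it is a mask that excludes sequence data, the situation the second point of the fix above addresses.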