CRAM: slice header record_counter too small #56
More than 2 billion records in a slice sounds impractical to me, but it's

Vadim

On 27/11/2014 10:01, James Bonfield wrote:
On Thu, Nov 27, 2014 at 02:43:51AM -0800, Vadim Zalunin wrote:
"Number of records" is the number within the slice. "Record counter"

Container pos 243765 size 38017

Incidentally, this is also a bug in cramtools-2.1.jar. It's not adding

jkb@seq3a[samtools.../htslib] java -jar ~/work/cram/cramtools-2.1.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I /tmp/b.cram 1:4000000-4000100

But that's a different issue that I hadn't yet got around to filing

jkb@seq3a[samtools.../htslib] java -jar ~/work/cram/cramtools-2.1.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I /tmp/b.cram |grep TGGGAGGACTGCAGGACTAGCTGCACAAGACAACTGGCATTTCTGGGGTGCACGCTTGGTGCTGCCTCAGCACGATGTGAAGGCTGCGTGTGCACCGTCC

James
Am I correct in thinking that 0 to 2^31-1 in ITF8 encodes to the exact same binary as LTF8? If so it's essentially an invisible change and could be done at any time to either format.

It needs thinking about, and I'm not 100% convinced I have LTF8 correctly coded, but the spec is a bit weak. It defines LTF8 in terms of ITF8, and ITF8 is defined to be unary encoding - basically N 1 bits followed by a zero. For positive signed numbers the top bit is always zero, so in practice the first bit of the number is always 0 anyway. For negative numbers I'm not convinced this holds true though, as we seem to use 0xff ff ff ff 0f (or is it f0?), which implies there is no zero between the run of number-of-bytes bits and the actual value. Ie 0-3 1s plus a zero, or 4 1s and no zero needed? If so the spec is incorrect, but it also means LTF8 is incorrect as it's not just ITF8 extended to be longer.
Scratch that, ITF8 and LTF8 are indeed different.

ITF8/LTF8 on 0x3f12 is 0xbf12 (1011 1111 0001 0010). 0x7f12 needs 3 bytes so becomes 0xc07f12. In this situation we don't need all the bits, so it fills up from the right side; ie the padding bits are in the top byte (1100 0000). We could equally have said the padding bits are in the last byte, ie 0xc7f120, but it's more logical the first way around. EXCEPT for when 4 additional bytes are required: in that scenario ITF8 explicitly requires us to change behaviour and put the padding in the last bits rather than the first bits. This sudden change of behaviour means that LTF8 isn't just ITF8 continued, as a 5-byte ITF8 string is interpreted as a different number by LTF8.

Is there a flaw in my reasoning? I need to write some test code I guess. Unfortunately we're stuck with it now, but with hindsight ITF8/LTF8 aren't sensibly defined :(
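The worked examples above can be checked with a small sketch of the two encoders. This is my own reading of the spec, not htslib's or cramtools' actual code, and the helper names `encode_itf8`/`encode_ltf8` are hypothetical. It reproduces the point being made: the two encodings agree up to 4 bytes (values below 2^28), then diverge once the 5-byte form kicks in, because 5-byte ITF8 keeps only the low nibble of the final byte.

```python
def encode_itf8(v: int) -> bytes:
    """Sketch of ITF8 (at most 5 bytes, 32 payload bits), unsigned view."""
    if v < 0x80:                           # 0xxxxxxx
        return bytes([v])
    if v < 0x4000:                         # 10xxxxxx + 1 byte
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:                       # 110xxxxx + 2 bytes
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:                     # 1110xxxx + 3 bytes
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # 1111xxxx + 4 bytes: only the LOW nibble of the LAST byte is payload.
    # This is the behaviour change discussed above.
    return bytes([0xF0 | ((v >> 28) & 0x0F), (v >> 20) & 0xFF,
                  (v >> 12) & 0xFF, (v >> 4) & 0xFF, v & 0x0F])

def encode_ltf8(v: int) -> bytes:
    """Sketch of LTF8 (at most 9 bytes, 64 payload bits), unsigned view."""
    for extra in range(8):                 # 0..7 extra bytes after the first
        if v < 1 << (7 * (extra + 1)):     # 7, 14, 21, ... payload bits
            prefix = (0xFF << (8 - extra)) & 0xFF  # 'extra' leading 1 bits
            out = [prefix | (v >> (8 * extra))]
            for i in range(extra - 1, -1, -1):
                out.append((v >> (8 * i)) & 0xFF)
            return bytes(out)
    # 9 bytes: 0xFF marker then the full 64-bit big-endian value.
    return bytes([0xFF]) + v.to_bytes(8, "big")

# The examples from the discussion:
assert encode_itf8(0x3F12) == bytes.fromhex("bf12")
assert encode_itf8(0x7F12) == bytes.fromhex("c07f12")

# Identical binary for anything that fits in 4 bytes (v < 2^28)...
assert all(encode_itf8(v) == encode_ltf8(v)
           for v in (0, 0x7F, 0x3F12, 0x7F12, 0x0FFFFFFF))

# ...but a 5-byte ITF8 string is not a valid LTF8 spelling of the same number.
assert encode_itf8(0x10000000) != encode_ltf8(0x10000000)
```

So the "invisible change" reasoning holds only for values below 2^28, not 2^31-1: any counter that has already needed a 5-byte ITF8 encoding would be re-read differently as LTF8.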
Ah, ok, but the record_counter already is LTF8 in CRAM3. I've reverted

Vadim

On 27/11/2014 11:04, James Bonfield wrote:
Ah OK, I wasn't aware we'd already changed it for CRAM 3; that wasn't in the list of proposed changes. I took a quick look and I see you've changed it for the slice header, but it also needs changing for the container header. Thanks,
Done.

Vadim

On 27/11/2014 13:28, James Bonfield wrote:
Thanks, closing the issue then, although ideally the LTF8 explanation should be clearer if we can come up with a more precise wording.
Section 7.5 defines the slice header record_counter as itf8. This is implicitly a 32-bit signed integer (ltf8 is the 64-bit one), which in practice means the number overflows after about 2 billion sequences.
That's a large file, but not impossibly large. I broke the 1 billion fragment BAM file in production in 2009 and I'd be surprised if people haven't gone an order of magnitude higher than that since then.
We should try to squeeze this fix in before CRAM v3.0 goes live.