CRAM: slice header record_counter too small #56
More than 2 billion records in a slice sounds impractical to me, but it's

Vadim

On 27/11/2014 10:01, James Bonfield wrote:
On Thu, Nov 27, 2014 at 02:43:51AM -0800, Vadim Zalunin wrote:
"Number of records" is the number within the slice. "Record counter"

Container pos 243765 size 38017

Incidentally, this is also a bug in cramtools-2.1.jar. It's not adding

jkb@seq3a[samtools.../htslib] java -jar ~/work/cram/cramtools-2.1.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I /tmp/b.cram 1:4000000-4000100

But that's a different issue that I hadn't yet got around to filing

jkb@seq3a[samtools.../htslib] java -jar ~/work/cram/cramtools-2.1.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I /tmp/b.cram |grep TGGGAGGACTGCAGGACTAGCTGCACAAGACAACTGGCATTTCTGGGGTGCACGCTTGGTGCTGCCTCAGCACGATGTGAAGGCTGCGTGTGCACCGTCC

James
Am I correct in thinking that 0 to 2^31-1 in ITF8 encodes to the exact same binary as LTF8? If so it's essentially an invisible change and could be done at any time to either format.

It needs thinking about, and I'm not 100% convinced I have LTF8 correctly coded, but the spec is a bit weak. It defines LTF8 in terms of ITF8, and ITF8 is defined to be unary encoding - basically N 1 bits followed by a zero. For positive signed numbers the top bit is always zero, so in practice the first bit of the number is always 0 anyway. For negative numbers I'm not convinced this holds true though, as we seem to use 0xff ff ff ff 0f (or is it f0?), which implies there is no zero between the run of number-of-bytes bits and the actual value. Ie 0-3 1s plus a zero, or 4 1s and no zero needed? If so the spec is incorrect, but it also means LTF8 is incorrect as it's not just ITF8 extended to be longer.
Scratch that, ITF8 and LTF8 are indeed different.

ITF8/LTF8 on 0x3f12 is 0xbf12 (1011 1111 0001 0010). 0x7f12 needs 3 bytes so becomes 0xc07f12. In this situation we don't need all the bits, so it fills up from the right side; ie the padding bits are in the top byte (1100 0000). We could equally have said the padding bits are in the last byte, ie 0xc7f120, but it's more logical the first way around. EXCEPT for when 4 additional bytes are required: in that scenario ITF8 explicitly requires us to change behaviour and put the padding in the last bits rather than the first bits. This sudden change of behaviour means that LTF8 isn't just ITF8 continued, as a 5-byte ITF8 string is interpreted as a different number by LTF8.

Is there a flaw in my reasoning? I need to write some test code I guess. Unfortunately we're stuck with it now, but with hindsight ITF8/LTF8 aren't sensibly defined :(
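The worked examples above can be checked with a small sketch of the two encoders. This is my own reading of the spec, not htslib's or cramtools' actual code, and the helper names `encode_itf8`/`encode_ltf8` are hypothetical. It reproduces the point being made: the two encodings agree up to 4 bytes (values below 2^28), then diverge once the 5-byte form kicks in, because 5-byte ITF8 keeps only the low nibble of the final byte.

```python
def encode_itf8(v: int) -> bytes:
    """Sketch of ITF8 (at most 5 bytes, 32 payload bits), unsigned view."""
    if v < 0x80:                           # 0xxxxxxx
        return bytes([v])
    if v < 0x4000:                         # 10xxxxxx + 1 byte
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:                       # 110xxxxx + 2 bytes
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:                     # 1110xxxx + 3 bytes
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # 1111xxxx + 4 bytes: only the LOW nibble of the LAST byte is payload.
    # This is the behaviour change discussed above.
    return bytes([0xF0 | ((v >> 28) & 0x0F), (v >> 20) & 0xFF,
                  (v >> 12) & 0xFF, (v >> 4) & 0xFF, v & 0x0F])

def encode_ltf8(v: int) -> bytes:
    """Sketch of LTF8 (at most 9 bytes, 64 payload bits), unsigned view."""
    for extra in range(8):                 # 0..7 extra bytes after the first
        if v < 1 << (7 * (extra + 1)):     # 7, 14, 21, ... payload bits
            prefix = (0xFF << (8 - extra)) & 0xFF  # 'extra' leading 1 bits
            out = [prefix | (v >> (8 * extra))]
            for i in range(extra - 1, -1, -1):
                out.append((v >> (8 * i)) & 0xFF)
            return bytes(out)
    # 9 bytes: 0xFF marker then the full 64-bit big-endian value.
    return bytes([0xFF]) + v.to_bytes(8, "big")

# The examples from the discussion:
assert encode_itf8(0x3F12) == bytes.fromhex("bf12")
assert encode_itf8(0x7F12) == bytes.fromhex("c07f12")

# Identical binary for anything that fits in 4 bytes (v < 2^28)...
assert all(encode_itf8(v) == encode_ltf8(v)
           for v in (0, 0x7F, 0x3F12, 0x7F12, 0x0FFFFFFF))

# ...but a 5-byte ITF8 string is not a valid LTF8 spelling of the same number.
assert encode_itf8(0x10000000) != encode_ltf8(0x10000000)
```

So the "invisible change" reasoning holds only for values below 2^28, not 2^31-1: any counter that has already needed a 5-byte ITF8 encoding would be re-read differently as LTF8.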
Ah, ok, but the record_counter already is LTF8 in CRAM3. I've reverted

Vadim

On 27/11/2014 11:04, James Bonfield wrote:
Ah OK, I wasn't aware we'd already changed it for CRAM 3; that wasn't in the list of proposed changes. I took a quick look and I see you've changed it for the slice header, but it also needs changing for the container header. Thanks,
Done.

Vadim

On 27/11/2014 13:28, James Bonfield wrote:
Thanks, closing the issue then, although ideally the LTF8 explanation should be clearer if we can come up with a more precise wording.
Section 7.5 defines the slice header record_counter as itf8. This is implicitly a 32-bit signed integer (ltf8 is the 64-bit one), which in practice means the number overflows after about 2 billion sequences.
That's a large file, but not impossibly large. I broke the 1 billion fragment BAM file in production in 2009 and I'd be surprised if people haven't gone an order of magnitude higher than that since then.
We should try to squeeze this fix in before CRAM v3.0 goes live.