Bgzip text mode (try 2) #1493

jkbonfield · 2022-08-11T14:49:36Z

An alternative to @mlin's #1369 for consideration.

The two implementations differ by:

Using kstring's kgetline3 (a new variant of kgetline2), vs manual newline scanning.
Efficiency (my version has considerably less overhead)
Code complexity (my version adds around twice as many new lines of code)
Text mode explicit (Mike's) vs text mode implicit (--binary to disable).
IO block sizes are freer (Mike's) vs required to exactly match the BGZF block size (mine).

At bgzf -l1 Mike's text mode added 6.2% CPU cycles, vs +0.4% for mine. However at level 6 these are considerably reduced as the Deflate overhead starts to dominate and these drop to 2.7% vs 0.0%. So maybe this isn't such an issue anyway. The main difference in CPU comes from searching backwards for the first newline vs searching forwards for the last newline. On long read technologies these two implementations start to converge a bit, but kgetline still has additional overhead with extra memcpys.

I also found the original implementation wasn't entirely foolproof when dealing with headers that start with lines over 64KB long. This is admittedly an extreme corner case, but it does mean the extra complexity of the new code may be doing something more anyway.

Co-authored-by: Mike Lin <dna@mlin.net> Compressing text now promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size). This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers. To disable this feature and provide a way to produce identical output with 1.15 and earlier, the --binary option forces text data to be processed as if it were binary. The idea and initial implementation was Mike Lin's, with the current revised implementation by James Bonfield.

jkbonfield added 2 commits August 11, 2022 14:56

Make the bgzip -g option less opaque.

eace1dd

daviesrob assigned whitwham Aug 30, 2022

whitwham merged commit 191ddc0 into samtools:develop Sep 1, 2022

fplazaonate mentioned this pull request Apr 17, 2023

Old ruby (2.0.0) cannot read gzip files created by samtools samtools/samtools#1831

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bgzip text mode (try 2) #1493

Bgzip text mode (try 2) #1493

jkbonfield commented Aug 11, 2022

Bgzip text mode (try 2) #1493

Bgzip text mode (try 2) #1493

Conversation

jkbonfield commented Aug 11, 2022