Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bgzip text mode (try 2) #1493

Merged
merged 2 commits into from
Sep 1, 2022
Merged

Conversation

jkbonfield
Copy link
Contributor

An alternative to @mlin's #1369 for consideration.

The two implementations differ by:

  • Using kstring's kgetline3 (a new variant of kgetline2), vs manual newline scanning.
  • Efficiency (my version has considerably less overhead)
  • Code complexity (my version adds around twice as many new lines of code)
  • Text mode explicit (Mike's) vs text mode implicit (--binary to disable).
  • IO block sizes are freer (Mike's) vs required to exactly match the BGZF block size (mine).

At bgzf -l1 Mike's text mode added 6.2% CPU cycles, vs +0.4% for mine. However at level 6 these are considerably reduced as the Deflate overhead starts to dominate and these drop to 2.7% vs 0.0%. So maybe this isn't such an issue anyway. The main difference in CPU comes from searching backwards for the first newline vs searching forwards for the last newline. On long read technologies these two implementations start to converge a bit, but kgetline still has additional overhead with extra memcpys.

I also found the original implementation wasn't entirely foolproof when dealing with headers that start with lines over 64KB long. This is admittedly an extreme corner case, but it does mean the extra complexity of the new code may be doing something more anyway.

Co-authored-by: Mike Lin <dna@mlin.net>

Compressing text now promotes alignment of BGZF blocks with the
uncompressed text lines. BGZF blocks start at the beginning of an
input line and end after some subsequent newline (except when the
block's first line overflows the BGZF block size).

This ensures it's possible to specify byte ranges of a BGZF file that
decompress into complete text records -- useful for parallel
processing and "slicing" from remote servers.

To disable this feature and provide a way to produce identical output
with 1.15 and earlier, the --binary option forces text data to be
processed as if it were binary.

The idea and initial implementation was Mike Lin's, with the current
revised implementation by James Bonfield.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants