
bgzf gzip-read feature does not support multi-member (concatenated) gzip files #742

Closed
adamnovak opened this issue Jul 23, 2018 · 0 comments

The bgzf library that is part of htslib is supposed to support reading blocked-gzipped, gzipped, and uncompressed data, with the format of the input data being detected at runtime.

This mostly works. However, the decompression of gzip-format files fails when those files have multiple "members", each with its own zlib header and constituting its own zlib stream.

The gzip RFC specifies that gzip files may have multiple members:

2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or after them.

A side effect of this is that concatenating two gzip files results in a valid gzip file (and is probably the easiest way to get a multi-member file).

When the bgzf gzip decompressor reaches the end of a gzip member, it uses this code:

htslib/bgzf.c

Lines 614 to 619 in a4910bf

    ret = inflate(fp->gz_stream, Z_NO_FLUSH);
    if (ret < 0 && ret != Z_BUF_ERROR) {
        hts_log_error("Inflate operation failed: %s", bgzf_zerr(ret, ret == Z_DATA_ERROR ? fp->gz_stream : NULL));
        fp->errcode |= BGZF_ERR_ZLIB;
        return -1;
    }

At the end of a member, the inflate() call returns Z_STREAM_END. That value is positive, so the `ret < 0` test does not catch it, and nothing else checks for it. If another member follows, inflate() leaves the start of that next stream unconsumed ("available") in the input buffer, and it does not necessarily fill the output buffer. The current code takes the ending data, but ignores the return code.

The next time it tries to read, it discards the unused data in the input buffer, throws a new input buffer at inflate(), notices that this time it got no more output data, and signals an EOF. So the next gzip member isn't decompressed, and in fact some of its data is discarded.

It looks like, borrowing from https://stackoverflow.com/a/17822217, the right thing to do here is to call inflateReset() whenever a zlib stream ends, to clear the Z_STREAM_END condition, so we can decompress the remainder of the input buffer as a new stream if possible.

We might also want to change the input code, which assumes that if there is any room left in the output buffer, the input buffer was totally consumed and needs to be refilled. That's not true when Z_STREAM_END happens: inflate() can stop with both unread input and free output space. There's also another positive return code, Z_NEED_DICT, that zlib can emit and that isn't checked for.

I'm going to see if I can put together a PR to resolve this.
