The bgzf library that is part of htslib is supposed to support reading blocked-gzipped, gzipped, and uncompressed data, with the format of the input data being detected at runtime.
This mostly works. However, decompression of gzip-format files fails when those files have multiple "members", each with its own gzip header and each forming its own stream as far as zlib is concerned.
The gzip RFC specifies that gzip files may have multiple members:
2.2. File format
A gzip file consists of a series of "members" (compressed data
sets). The format of each member is specified in the following
section. The members simply appear one after another in the file,
with no additional information before, between, or after them.
A side effect of this is that concatenating two gzip files results in a valid gzip file (and is probably the easiest way to get a multi-member file).
When the bgzf gzip decompressor reaches the end of a gzip member, it is running the code at htslib/bgzf.c, lines 614 to 619 in a4910bf.
The inflate() call is going to return Z_STREAM_END (which is >0, and which is not checked for), and will leave the start of the next stream "available" in the input buffer, if present, while not necessarily filling the output buffer. The current code takes the ending data, but ignores the return code.
The next time it tries to read, it discards the unused data in the input buffer, throws a new input buffer at inflate(), notices that this time it got no more output data, and signals an EOF. So the next gzip member isn't decompressed, and in fact some of its data is discarded.
It looks like, borrowing from https://stackoverflow.com/a/17822217, the right thing to do here is to call inflateReset() whenever a zlib stream ends, to clear the Z_STREAM_END condition, so we can decompress the remainder of the input buffer as a new stream if possible.
We might also want to change the input code, which assumes that if there is any room left in the output buffer, the input buffer was totally consumed and needs to be refilled. That's not true when Z_STREAM_END happens. There's also another positive return code, Z_NEED_DICT, that zlib can emit and that isn't checked for.
I'm going to see if I can put together a PR to resolve this.