gzip.compress(..., mtime=0) in cpython 3.11+ unexpectedly sets OS byte in gzip header #112346

Open
dennisvang opened this issue Nov 23, 2023 · 8 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dennisvang

dennisvang commented Nov 23, 2023

Bug report

description

Using gzip.compress() with mtime=0 in 3.8<=cpython<=3.10, the OS byte, i.e. the 10th byte in the GZIP header, is set to 255 "unknown" (also see e.g. #83302):

return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)

However, in cpython 3.11 and 3.12, the OS byte is suddenly set to a "known" value, e.g. 3 ("Unix") on Ubuntu.

This is not mentioned in the changelog for Python 3.11.

This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

how to reproduce

Here's an example, where byte 10 is \xff in python 3.10 and \x03 in python 3.11:

~ $ python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

~ $ pyenv shell 3.11
~ $ python
Python 3.11.6 (main, Nov 23 2023, 17:30:16) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'
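To make the difference between the two outputs above easier to see, the fixed header fields can be unpacked by name. This is only a sketch: `parse_gzip_header` is a hypothetical helper, not part of the gzip module, and it assumes the stream starts with the plain 10-byte header laid out as in the struct.pack call quoted earlier.

```python
import gzip
import struct
import sys


def parse_gzip_header(blob: bytes) -> dict:
    """Unpack the fixed 10-byte gzip header into named fields.

    Hypothetical helper for illustration; the field layout mirrors the
    "<BBBBLBB" format used in Lib/gzip.py.
    """
    if len(blob) < 10:
        raise ValueError("gzip header needs at least 10 bytes")
    id1, id2, method, flags, mtime, xfl, os_byte = struct.unpack(
        "<BBBBLBB", blob[:10]
    )
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return {
        "method": method,
        "flags": flags,
        "mtime": mtime,
        "xfl": xfl,
        "os": os_byte,  # 255 means "unknown", 3 means "Unix"
    }


# The OS value printed here depends on the interpreter version and platform.
header = parse_gzip_header(gzip.compress(b"", mtime=0))
print(sys.version_info[:2], header)
```

Running the same script under each interpreter should show whether only the `os` field differs.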

cause

I guess this is caused by python 3.11 delegating the gzip.compress() call to zlib if mtime=0, as mentioned in the docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion. Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

and source:

cpython/Lib/gzip.py

Lines 609 to 612 in 89ddea4

    if mtime == 0:
        # Use zlib as it creates the header with 0 mtime by default.
        # This is faster and with less overhead.
        return zlib.compress(data, level=compresslevel, wbits=31)

Apparently zlib does set the OS byte.

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, macOS, Windows

@dennisvang dennisvang added the type-bug An unexpected behavior, bug, or error label Nov 23, 2023
@Eclips4 Eclips4 added the stdlib Python modules in the Lib dir label Nov 23, 2023
@dennisvang
Author

dennisvang commented Nov 23, 2023

In itself I think it may be a good thing that the OS byte is properly set.

The problem is just that the change is not explicitly documented, as far as I know.

dennisvang added a commit to dennisvang/tufup that referenced this issue Nov 23, 2023
@rhpvorderman
Contributor

Hi @dennisvang. This is my fault: I delegated gzip.compress(mtime=0) to zlib.compress, incorrectly assuming the output would be the same. The reason for the change was that zlib.compress is faster, but if it leads to behavioral changes, that is not acceptable.

I believe this can easily be remedied by removing the codepath.

@rhpvorderman
Contributor

I have just made a PR and put "Bugfix" in the name. I hope it will now get attention.

@dennisvang
Author

@rhpvorderman Thanks for picking this up.

I wonder, if this is the only side-effect, and if the performance gain from using zlib.compress is worth it, perhaps you could just keep delegating to zlib and change byte 10 back to \xff afterwards?
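As a rough illustration of that idea, the OS byte can be overwritten in an already-compressed stream. This is a sketch only: `reset_os_byte` is a hypothetical name, and it assumes a single-member gzip stream with the plain 10-byte header and no header CRC (FHCRC flag unset), so changing the header does not invalidate any checksum.

```python
import gzip

OS_BYTE_INDEX = 9   # the 10th header byte, counting from 1
OS_UNKNOWN = 0xFF   # 255, the value written by Python 3.10 and earlier


def reset_os_byte(gz_bytes: bytes) -> bytes:
    """Return a copy of gz_bytes with the header OS byte set to 255.

    Hypothetical helper: assumes gz_bytes starts with a gzip header.
    """
    if gz_bytes[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    patched = bytearray(gz_bytes)
    patched[OS_BYTE_INDEX] = OS_UNKNOWN
    return bytes(patched)


# Input taken from the Python 3.11 output shown in the bug report.
py311_output = (
    b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03"
    b"\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)
patched = reset_os_byte(py311_output)
print(patched[OS_BYTE_INDEX])    # OS byte after patching
print(gzip.decompress(patched))  # payload is unaffected by the header change
```

The gzip trailer holds a CRC-32 of the uncompressed data only, which is why the patched stream still decompresses.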

@rhpvorderman
Contributor

Well, as mentioned in the PR, keeping two separate code paths caused issues before. It is best to keep one codepath. There is a mention in the documentation about zlib.compress so users who need the performance can use it themselves.

@dennisvang
Author

@rhpvorderman You're right, that makes sense.

@rhpvorderman
Contributor

ping
This bug and fix have been lingering for a while.

@serhiy-storchaka
Member

For reference, this feature was added in bpo-43613 (gh-87779). It included more optimizations; the only issue is with delegating the whole compression to zlib when mtime is 0.

The fix looks correct and it still preserves some of the speedup. An alternative solution could be to call zlib.compress() (even if mtime is not 0) and then patch the result for mtime and the OS byte, but I do not know how reliable it is or whether that method is faster.

rhpvorderman added a commit to rhpvorderman/cpython that referenced this issue Jun 12, 2024
4 participants