Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip.compress and gzip.decompress are sub-optimally implemented. #87779

Closed
rhpvorderman mannequin opened this issue Mar 24, 2021 · 4 comments
Closed

gzip.compress and gzip.decompress are sub-optimally implemented. #87779

rhpvorderman mannequin opened this issue Mar 24, 2021 · 4 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@rhpvorderman
Copy link
Mannequin

rhpvorderman mannequin commented Mar 24, 2021

BPO 43613
Nosy @rhettinger, @ambv, @rhpvorderman
PRs
  • bpo-43613: Faster implementation of gzip.compress and gzip.decompress #27941
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-09-03.07:28:45.545>
    created_at = <Date 2021-03-24.06:27:19.024>
    labels = ['library', 'performance']
    title = 'gzip.compress and gzip.decompress are sub-optimally implemented.'
    updated_at = <Date 2021-09-03.07:37:25.085>
    user = 'https://github.com/rhpvorderman'

    bugs.python.org fields:

    activity = <Date 2021-09-03.07:37:25.085>
    actor = 'rhpvorderman'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-09-03.07:28:45.545>
    closer = 'rhpvorderman'
    components = ['Library (Lib)']
    creation = <Date 2021-03-24.06:27:19.024>
    creator = 'rhpvorderman'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43613
    keywords = ['patch']
    message_count = 4.0
    messages = ['389438', '389495', '400921', '400985']
    nosy_count = 3.0
    nosy_names = ['rhettinger', 'lukasz.langa', 'rhpvorderman']
    pr_nums = ['27941']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43613'
    versions = []

    @rhpvorderman
    Copy link
    Mannequin Author

    rhpvorderman mannequin commented Mar 24, 2021

    When working on python-isal which aims to provide faster drop-in replacements for the zlib and gzip modules I found that the gzip.compress and gzip.decompress are suboptimally implemented which hurts performance.

    gzip.compress and gzip.decompress both do the following things:

    • Instantiate a BytesIO object to mimick a file
    • Instantiate a GzipFile object to compress or read the file.

    That means there is way more Python code involved than strictly necessary. Also the 'data' is already fully in memory, but the data is streamed anyway. That is quite a waste.

    I propose the following:

    • The documentation should make it clear that zlib.decompress(... ,wbits=31) and zlib.compress(..., wbits=31) (after 43612 has been addressed), are both quicker but come with caveats. zlib.compress can not set mtime. zlib.decompress does not take multimember gzip into account.
    • For gzip.compress -> The GzipFile._write_gzip_header function should be moved to a module wide _gzip_header function that returns a bytes object. GzipFile._write_gzip_header can call this function. gzip.compress can also call this function to create a header. gzip.compress than calls zlib.compress(data, wbits=-15) (after 43612 has been fixed) to create a raw deflate block. A gzip trailer can be easily created by calling zlib.crc32(data) and len(data) & 0xffffffff and packing those into a struct. See for an example implementation here: https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L242
      -> For gzip.decompress it becomes quite more involved. A read_gzip_header function can be created, but the current implementation returns EOFErrors if the header is incomplete due to a truncated file instead of BadGzipFile errors. This makes it harder to implement something that is not a major break from current gzip.decompress. Apart from the header, the implementation is straightforward. Do a while true loop. All operations are performed in the loop. Validate the header and report the end of the header. Create a zlib.decompressobj(wbits=-15). Decompress all the data from the end of header. Flush. Extract the crc and length from the first 8 bytes of the unused data. data = decompobj.unused_data[8:]. if not data: break. For a reference implementation check here: https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L300. Note that the decompress function is quite straightforward. Checking the header however while maintaining backwards compatibility with gzip.decompress is not so simple.

    And that brings to another point. Should non-descriptive EOFErrors be raised when reading the gzip header? Or throw informative BadGzipFile errors when the header is parsed. I tend towards the latter. For example BadGzipFile("Truncated header") instead of EOFError. Or at least EOFError("Truncated gzip header"). I am aware that confounds this issue with another issue, but these things are coupled in the implementation so both need to be solved at the same time.

    Given the headaches that gzip.decompress gives it might be easier to solve gzip.compress first in a first PR and do gzip.decompress later.

    @rhpvorderman rhpvorderman mannequin added performance Performance or resource usage labels Mar 24, 2021
    @iritkatriel iritkatriel added stdlib Python modules in the Lib dir labels Mar 24, 2021
    @rhpvorderman
    Copy link
    Mannequin Author

    rhpvorderman mannequin commented Mar 25, 2021

    I created bpo-43621 for the error issue. There should only be BadGzipFile. Once that is fixed, having only one error type will make it easier to implement some functions that are shared across the gzip.py codebase.

    @ambv
    Copy link
    Contributor

    ambv commented Sep 2, 2021

    New changeset ea23e78 by Ruben Vorderman in branch 'main':
    bpo-43613: Faster implementation of gzip.compress and gzip.decompress (GH-27941)
    ea23e78

    @rhpvorderman rhpvorderman mannequin closed this as completed Sep 3, 2021
    @rhpvorderman rhpvorderman mannequin closed this as completed Sep 3, 2021
    @rhpvorderman
    Copy link
    Mannequin Author

    rhpvorderman mannequin commented Sep 3, 2021

    Issue was solved by moving code from _GzipReader to separate functions and maintaining the same error structure.
    This solved the problem with maximum code reuse and full backwards compatibility.

    This issue was closed.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants