
ZipFileExt.read() can be incredibly slow; patch included #48228

Closed
lightstruk mannequin opened this issue Sep 26, 2008 · 8 comments
Labels
extension-modules (C modules in the Modules dir), performance (Performance or resource usage)

Comments

lightstruk mannequin commented Sep 26, 2008

BPO 3978
Nosy @pitrou
Files
  • zipfile_read_perf.patch: zipfile.py extraction performance improvement
  • zeroes.zip: demonstration ZIP, explodes from 100 KiB to 100 MiB
  • zipperf.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-05.12:48:59.444>
    created_at = <Date 2008-09-26.19:23:37.979>
    labels = ['extension-modules', 'performance']
    title = 'ZipFileExt.read() can be incredibly slow; patch included'
    updated_at = <Date 2011-01-14.16:39:32.218>
    user = 'https://bugs.python.org/lightstruk'

    bugs.python.org fields:

    activity = <Date 2011-01-14.16:39:32.218>
    actor = 'andreb'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-05.12:48:59.444>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2008-09-26.19:23:37.979>
    creator = 'lightstruk'
    dependencies = []
    files = ['11621', '11622', '12346']
    hgrepos = []
    issue_num = 3978
    keywords = ['patch']
    message_count = 8.0
    messages = ['73880', '73921', '74135', '77761', '115643', '126260', '126261', '126275']
    nosy_count = 3.0
    nosy_names = ['pitrou', 'lightstruk', 'andreb']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue3978'
    versions = ['Python 3.2']


    lightstruk mannequin commented Sep 26, 2008

    I've created a patch that improves the decompression performance of
    zipfile.py by up to two orders of magnitude.

    In ZipFileExt.read(), decompressed bytes waiting to be read() sit in a
    string buffer, self.readbuffer. When a piece of that string is read,
    the string is split in two, with the first piece being returned, and the
    second piece becoming the new self.readbuffer. Each of these two pieces
    must be allocated space and have their contents copied into them. When
    the length of the readbuffer far exceeds the number of bytes requested,
    allocating space for the two substrings and copying in their contents
    becomes very expensive.
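The cost of that slice-and-copy pattern is easy to demonstrate in isolation. The sketch below is a standalone illustration, not the zipfile code itself; the buffer and chunk sizes are arbitrary, and io.BytesIO stands in as the modern counterpart of the StringIO buffer the patch introduces:

```python
import io
import time

# 8 MiB of decompressed bytes sitting in the read buffer, handed out
# in 16 KiB pieces (sizes are illustrative, not the zipfile defaults).
data = b"\0" * (8 * 2**20)
CHUNK = 16 * 1024

# Old approach: every read slices the buffer, copying BOTH halves.
buf = data
t0 = time.perf_counter()
while buf:
    piece, buf = buf[:CHUNK], buf[CHUNK:]
slice_time = time.perf_counter() - t0

# Buffer-object approach: a read just advances an internal offset,
# no copy of the unread remainder is ever made.
bio = io.BytesIO(data)
t0 = time.perf_counter()
while bio.read(CHUNK):
    pass
buffered_time = time.perf_counter() - t0

print(slice_time > buffered_time)
```

The slicing loop copies roughly 2 GiB in total for an 8 MiB buffer; the buffered loop copies each byte once.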

    The attached zeroes.zip demonstrates a worst-case scenario for this
    problem. It contains one 100 MiB file filled with zeroes. This file
    compresses to just 100 KiB, however, because it is so repetitive. This
    repetitive data means that the zlib decompressor returns many MiBs of
    uncompressed data when fed just 64 KiB of compressed data. Each call to
    read() requests only 16 KiB, so each call must reallocate and copy many
    MiBs.

    The attached patch makes the read buffer a StringIO instead of a string.
    Each call to the decompressor creates a new StringIO buffer. Reading
    from the StringIO does not create a new string for the unread data.
    When the buffer has been exhausted, a new StringIO is created with the
    next batch of uncompressed bytes.
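The shape of the patched read loop can be sketched as follows. This is an illustrative reconstruction, not the actual patch: the class name is hypothetical, io.BytesIO replaces the era's StringIO, and raw-deflate parameters are assumed to match how ZIP members are stored:

```python
import io
import zlib

class BufferedDecompressReader:
    """Sketch of the patched approach: decompressed bytes live in an
    io.BytesIO, so partial reads never copy the unread remainder."""

    def __init__(self, raw, chunk=64 * 1024):
        self._raw = raw                      # file-like object with compressed data
        self._chunk = chunk                  # how much compressed data to feed at once
        self._dec = zlib.decompressobj(-15)  # raw deflate, as used for ZIP members
        self._buf = io.BytesIO()             # current batch of decompressed bytes
        self._eof = False

    def read(self, n):
        data = self._buf.read(n)
        while len(data) < n and not self._eof:
            comp = self._raw.read(self._chunk)
            if not comp:
                self._eof = True
                # drain whatever the decompressor still holds
                self._buf = io.BytesIO(self._dec.flush())
            else:
                # a fresh BytesIO replaces the exhausted buffer
                self._buf = io.BytesIO(self._dec.decompress(comp))
            data += self._buf.read(n - len(data))
        return data
```

Reading from the BytesIO only moves its internal position; the expensive re-allocation happens once per decompressed batch instead of once per read() call.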

    The patch also fixes the behavior of zipfile.py when called as a script
    with the -e flag. Before, to extract a file, it decompressed the entire
    file to memory, and then wrote the entire file to disk. This behavior
    is undesirable if the decompressed file is even remotely large. Now, it
    uses ZipFile.extractall(), which correctly streams the decompressed data
    to disk.
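The extraction path boils down to the standard extractall() call. A minimal self-contained demonstration (the archive is built on the fly; names and sizes are arbitrary):

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "demo.zip")

# Build a small, highly compressible archive.
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("zeroes", b"\0" * 100_000)

# extractall() streams each member to disk in chunks rather than
# materializing the whole decompressed file in memory first.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(os.path.join(workdir, "out"))

size = os.path.getsize(os.path.join(workdir, "out", "zeroes"))
print(size)
```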

    unzip vs. Python's zipfile.py vs. patched zipfile.py:

    $ time unzip -e zeroes.zip
    Archive:  zeroes.zip
      inflating: zeroes_unzip/zeroes

    real 0m0.707s
    user 0m0.463s
    sys 0m0.244s

    $ time python zipfileold.py -e zeroes.zip zeroes_old

    real 3m42.012s
    user 0m57.670s
    sys 2m43.510s

    $ time python zipfile.py -e zeroes.zip zeroes_patched

    real 0m0.986s
    user 0m0.409s
    sys 0m0.490s

    In this test, the patched version is 246x faster than the unpatched
    version, and is not far off the pace of the C version.

    Incidentally, this patch also improves performance when the data is not
    repetitive. I tested a ZIP containing a single compressed file filled
    with random data, created by running
    $ dd if=/dev/urandom of=random bs=1024 count=1024
    $ zip random.zip random
    This archive demonstrates the opposite scenario - where compression has
    next to no impact on file size, and the read buffer will never be
    dramatically larger than the amount of data fed to the zlib decompressor.

    $ time python zipfileold.py -e random.zip random_old

    real 0m0.063s
    user 0m0.053s
    sys 0m0.010s

    $ time python zipfile.py -e random.zip random_patched

    real 0m0.059s
    user 0m0.047s
    sys 0m0.012s

    @lightstruk lightstruk mannequin added extension-modules C modules in the Modules dir performance Performance or resource usage labels Sep 26, 2008

    pitrou commented Sep 27, 2008

    Very interesting, but it will have to wait for 2.7/3.1. 2.6 and 3.0 are
    in the final phases of the release process.


    lightstruk mannequin commented Oct 1, 2008

    Why not include this in 2.6.1 or 3.0.1? The patch fixes several bugs;
    it does not provide any new functionality.

    @lightstruk lightstruk mannequin changed the title ZipFileExt.read() can be incredibly slow ZipFileExt.read() can be incredibly slow; patch included Dec 5, 2008

    pitrou commented Dec 13, 2008

    Attaching a cleanup of the proposed patch. The funny thing is that for
    me, both the unpatched and patched versions are as fast as the unzip binary.


    pitrou commented Sep 5, 2010

    The patch has been outdated by other independent performance work on the zipfile module. In Python 3.2, the zipfile module is actually slightly faster than the "unzip" program:

    • first with the supplied "zeroes.zip" file:
    $ rm -f zeroes && time -p unzip -e zeroes.zip
    Archive:  zeroes.zip
      inflating: zeroes                  
    real 0.56
    user 0.50
    sys 0.06
    
    $ time -p ./python -m zipfile -e zeroes.zip .
    real 0.45
    user 0.34
    sys 0.10
    • Then with a 100MB random file:
    $ rm -f random && time -p unzip -e random.zip
    Archive:  random.zip
      inflating: random                  
    real 0.69
    user 0.61
    sys 0.07
    
    $ rm -f random && time -p ./python -m zipfile -e random.zip .
    real 0.33
    user 0.18
    sys 0.14

    @pitrou pitrou closed this as completed Sep 5, 2010

    andreb mannequin commented Jan 14, 2011

    If I may chime in, as I don't know where else to put this.

    I am still seeing the same performance as the OP when I use extractall() with a password protected ZIP of size 287 MB (containing one compressed movie file of size 297 MB).

    The total running time for extractall.py was
    real 35m24.448s
    user 34m52.423s
    sys 0m1.448s

    For a bash script using unzip -P the running time on the same file was

    real 0m19.026s
    user 0m8.359s
    sys 0m0.414s

extractall.py loops over the contents of a directory using os.walk, identifies zip files by file extension, and extracts a certain portion of the filename as the password using a regex. If I leave the ZipFile.extractall part out and run it, it takes 0.15 s.

    This is with Python 2.7.1 and Python 3.1.2 on Mac OS X 10.6.4 on an 8-core MacPro with 16 GB of RAM. The file is read from an attached USB drive. Maybe that makes a difference. I wish I could tell you more.

    This is just for the record. I don't expect this to be fixed.


    pitrou commented Jan 14, 2011

    I am still seeing the same performance as the OP when I use
    extractall() with a password protected ZIP of size 287 MB (containing
    one compressed movie file of size 297 MB).

    Please try with a non-password protected file.


    andreb mannequin commented Jan 14, 2011

    "Decryption is extremely slow as it is implemented in native Python rather than C"

    Right, of course, I missed this when reading the docs.
    I have a habit of jumping straight to the point.

As I was asked to try it with a non-password-protected zip file, here are the numbers for comparison.

    Same file, re-zipped without encryption, extractall.py now finishes in 16 s.
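For reference, the pwd API in question looks like the sketch below. The archive here is written unencrypted, since zipfile cannot create encrypted archives; pwd is simply ignored for unencrypted members, but for legacy ZipCrypto-protected members the decryption runs in pure Python, which is what makes it so much slower than unzip's C implementation:

```python
import io
import zipfile

buf = io.BytesIO()
# zipfile can only WRITE unencrypted members; encrypted archives must
# come from another tool, so this demo member is stored in the clear.
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("secret.txt", b"hello")

with zipfile.ZipFile(buf) as zf:
    # pwd is only consulted when the member's encryption flag is set;
    # for this unencrypted member it is ignored.
    data = zf.read("secret.txt", pwd=b"password")

print(data)
```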

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022