ZipFileExt.read() can be incredibly slow; patch included #48228
Comments
I've created a patch that improves the decompression performance of
In ZipFileExt.read(), decompressed bytes waiting to be read() sit in a
The attached zeroes.zip demonstrates a worst-case scenario for this
The attached patch makes the read buffer a StringIO instead of a string.
The patch also fixes the behavior of zipfile.py when called as a script

unzip vs. Python's zipfile.py vs. patched zipfile.py:

$ time unzip -e zeroes.zip
Archive: zeroes.zip
inflating: zeroes_unzip/zeroes
real 0m0.707s
$ time python zipfileold.py -e zeroes.zip zeroes_old
real 3m42.012s
$ time python zipfile.py -e zeroes.zip zeroes_patched
real 0m0.986s

In this test, the patched version is 246x faster than the unpatched
Incidentally, this patch also improves performance when the data is not

$ time python zipfileold.py -e random.zip random_old
real 0m0.063s
$ time python zipfile.py -e random.zip random_patched
real 0m0.059s
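The cost difference between a plain string buffer and a stream buffer can be illustrated with a small model. This is only a sketch, not the code from zipfile.py or from the attached patch: the function names are made up for illustration, and io.BytesIO is used as the Python 3 counterpart of the StringIO mentioned above. The first function re-slices the remaining buffer after every read, which copies the whole remainder each time; the second keeps a read position instead.

```python
import io
import time


def drain_with_bytes_buffer(data: bytes, chunk: int) -> int:
    """Consume data in chunk-sized reads, re-slicing the buffer each time."""
    buf = data
    total = 0
    while buf:
        piece = buf[:chunk]
        buf = buf[chunk:]  # copies the entire remainder on every read
        total += len(piece)
    return total


def drain_with_bytesio(data: bytes, chunk: int) -> int:
    """Consume data in chunk-sized reads from a stream with a read position."""
    stream = io.BytesIO(data)
    total = 0
    while True:
        piece = stream.read(chunk)  # copies only the bytes being returned
        if not piece:
            break
        total += len(piece)
    return total


if __name__ == "__main__":
    # Sizes are arbitrary; a large buffer drained in small reads is the bad case.
    payload = b"\0" * (1024 * 1024)
    for func in (drain_with_bytes_buffer, drain_with_bytesio):
        start = time.perf_counter()
        func(payload, 1024)
        print(func.__name__, f"{time.perf_counter() - start:.4f}s")
```

With a large buffer and small reads, the slicing version does work proportional to the square of the buffer size, which is consistent with the highly compressible zeroes.zip being the worst case.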
Very interesting, but it will have to wait for 2.7/3.1. 2.6 and 3.0 are
Why not include this in 2.6.1 or 3.0.1? The patch fixes several bugs;
Attaching a cleanup of the proposed patch. The funny thing is that for
The patch has been outdated by other independent performance work on the zipfile module. In Python 3.2, the zipfile module is actually slightly faster than the "unzip" program:
$ rm -f zeroes && time -p unzip -e zeroes.zip
Archive: zeroes.zip
inflating: zeroes
real 0.56
user 0.50
sys 0.06
$ time -p ./python -m zipfile -e zeroes.zip .
real 0.45
user 0.34
sys 0.10
$ rm -f random && time -p unzip -e random.zip
Archive: random.zip
inflating: random
real 0.69
user 0.61
sys 0.07
$ rm -f random && time -p ./python -m zipfile -e random.zip .
real 0.33
user 0.18
sys 0.14
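For anyone who wants to repeat this kind of measurement without the attached archives, the following is a rough sketch. It builds its own highly compressible archive, since the original zeroes.zip is not reproduced here; the helper names and the 16 MiB size are assumptions chosen for illustration. It uses ZipFile.extractall(), which is the library-level equivalent of running python -m zipfile -e.

```python
import os
import tempfile
import time
import zipfile


def make_zeroes_zip(path: str, size: int) -> None:
    """Write a deflated archive holding one member of `size` zero bytes."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("zeroes", b"\0" * size)


def timed_extract(zip_path: str, dest: str) -> float:
    """Extract every member into `dest` and return the elapsed seconds."""
    start = time.perf_counter()
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "zeroes.zip")
        make_zeroes_zip(archive, 16 * 1024 * 1024)  # arbitrary test size
        elapsed = timed_extract(archive, os.path.join(tmp, "out"))
        print(f"extracted in {elapsed:.3f}s")
```

Timings from this sketch are not comparable to the numbers above, because the archive contents and the machine differ.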
If I may chime in, as I don't know where else to put this. I am still seeing the same performance as the OP when I use extractall() with a password-protected ZIP of size 287 MB (containing one compressed movie file of size 297 MB).

The total running time for extractall.py was

For a bash script using unzip -P the running time on the same file was

real 0m19.026s

extractall.py loops over the contents of a directory using os.walk, identifies zip files by file extension, and extracts a certain portion of the filename as the password using a regex. If I leave the ZipFile.extractall part out and run it, it takes 0.15 s.

This is with Python 2.7.1 and Python 3.1.2 on Mac OS X 10.6.4 on an 8-core Mac Pro with 16 GB of RAM. The file is read from an attached USB drive. Maybe that makes a difference. I wish I could tell you more.

This is just for the record. I don't expect this to be fixed.
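The extractall.py script itself is not attached, so the following is only a guess at its general shape based on the description above. The filename pattern (password taken from the part after the last underscore) and the function names are hypothetical; the real regex is not given in the comment.

```python
import os
import re
import zipfile

# Hypothetical naming scheme: "<anything>_<password>.zip".
PASSWORD_RE = re.compile(r"_([^_]+)\.zip$", re.IGNORECASE)


def password_from_name(filename: str):
    """Return the password encoded in the filename as bytes, or None."""
    match = PASSWORD_RE.search(filename)
    return match.group(1).encode("utf-8") if match else None


def extract_all_zips(root: str) -> list:
    """Walk `root`, extracting each .zip next to itself; return the paths handled."""
    extracted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            archive = os.path.join(dirpath, name)
            with zipfile.ZipFile(archive) as zf:
                # pwd is only consulted for encrypted members.
                zf.extractall(path=dirpath, pwd=password_from_name(name))
            extracted.append(archive)
    return extracted
```

Calling extract_all_zips("/path/to/archives") would process every archive under that directory. For encrypted members, the time is dominated by the decryption step inside zipfile rather than by the directory walk.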
Please try with a non-password-protected file.
"Decryption is extremely slow as it is implemented in native Python rather than C" Right, of course, I missed this when reading the docs. As I was asked to try it with a non-password-protected zip file, here are the numbers for comparison. Same file, re-zipped without encryption: extractall.py now finishes in 16 s.