Add _BlocksOutputBuffer for bz2/lzma/zlib module #85658
🔵 bz2/lzma module's current growth algorithm

The bz2/lzma modules' initial output buffer size is 8 KB [1][2], and they use this output buffer growth algorithm [3][4]:

    newsize = size + (size >> 3) + 6

[1] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L109

In many cases, the output buffer is resized too many times:

    size = 8*1024
    for i in range(1, 120):
        print('Step %d ' % i, format(size, ','), 'bytes')
        size = size + (size >> 3) + 6

    Step 1  8,192 bytes
    ...

🔵 zlib module's current growth algorithm

The zlib module's initial output buffer size is 16 KB [5]; on each growth, the buffer size doubles [6].

[5] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L32

This algorithm has a higher risk of running out of memory: ...

🔵 Add _BlocksOutputBuffer for bz2/lzma/zlib module

The proposed PR uses a list of bytes objects to represent the output buffer. I only tested decompression, because the result is more obvious than for compression. For a special-data benchmark (all data consists of b'a'), see the attached pictures: _BlocksOutputBuffer has linear performance.
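To make the idea concrete, here is a minimal Python sketch of a block-based output buffer; the real implementation is in C, and all names below are illustrative, not the PR's API:

    class BlocksOutputBuffer:
        # Collect output in a list of blocks and join once at the end, so
        # data already written is never copied when the buffer grows.
        def __init__(self, block_size=8 * 1024):
            self.blocks = []
            self.block_size = block_size

        def get_block(self):
            # Hand out a fresh block; previously filled blocks stay untouched.
            block = bytearray(self.block_size)
            self.blocks.append(block)
            return block

        def finish(self, last_block_used):
            # Trim the unused tail of the final block, then join everything once.
            self.blocks[-1] = self.blocks[-1][:last_block_used]
            return b"".join(self.blocks)

Since no existing block is ever reallocated, growth never copies data that has already been written; the only full copy happens once, in the final join.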
After switching to _BlocksOutputBuffer, the code of bz2/lzma is more concise, and the code of zlib is basically translated statement by statement; IMO it's safe and easy to review.

🔵 Real data benchmark

For real data, resizing the output buffer accounts for a smaller share of the total time, so the performance improvement is not as big as in the pictures above:

----------------- bz2 -----------------
linux-2.6.39.4.tar.bz2
firefox-79.0.linux-i686.tar.bz2
ffmpeg-4.3.1.tar.bz2
gimp-2.10.20.tar.bz2
sde-external-8.56.0-2020-07-05-lin.tar.bz2

----------------- lzma -----------------
linux-5.7.10.tar.xz
linux-2.6.39.4.tar.xz
gcc-9.3.0.tar.xz
Python-3.8.5.tar.xz
firefox-79.0.source.tar.xz

----------------- zlib -----------------
linux-5.7.10.tar.gz
linux-2.6.39.4.tar.gz
gcc-9.3.0.tar.gz
Python-3.8.5.tgz
openjdk-14.0.2_linux-x64_bin.tar.gz
postgresql-10.12-1-linux-x64-binaries.tar.gz

-------------- benchmark end --------------

🔵 Add .readall() function to _compression.DecompressReader class

The last commit adds a .readall() function to _compression.DecompressReader; it optimizes .read(-1) for BZ2File/LZMAFile/GzipFile objects. The following description is a bit complicated. Internally, these file objects wrap the decompressor like this:
    raw = _compression.DecompressReader(fp, BZ2Decompressor/LZMADecompressor/zlib.decompressobj)
    self._buffer = io.BufferedReader(raw)
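For reference, a minimal sketch of what such a .readall() can do (a sketch of the idea, not the exact patch): drain the decompressor in a loop and join once, bypassing BufferedReader's generic chunked read path:

    import io

    def readall(self):
        # Hypothetical method on _compression.DecompressReader.
        chunks = []
        while True:
            data = self.read(io.DEFAULT_BUFFER_SIZE)  # one decompressed chunk
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)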
Benchmark with real data, calling .read(-1) on a BZ2File/LZMAFile/GzipFile object:

----------------- BZ2File -----------------
linux-2.6.39.4.tar.bz2
firefox-79.0.linux-i686.tar.bz2
ffmpeg-4.3.1.tar.bz2
gimp-2.10.20.tar.bz2
sde-external-8.56.0-2020-07-05-lin.tar.bz2

----------------- LZMAFile -----------------
linux-5.7.10.tar.xz
linux-2.6.39.4.tar.xz
gcc-9.3.0.tar.xz
Python-3.8.5.tar.xz
firefox-79.0.source.tar.xz

----------------- GzipFile -----------------
linux-5.7.10.tar.gz
linux-2.6.39.4.tar.gz
gcc-9.3.0.tar.gz
Python-3.8.5.tgz
openjdk-14.0.2_linux-x64_bin.tar.gz
postgresql-10.12-1-linux-x64-binaries.tar.gz

-------------- benchmark end --------------
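A hypothetical harness for this kind of benchmark (the file name and timing style are illustrative, not from the original report):

    import bz2
    import time

    with bz2.BZ2File("linux-2.6.39.4.tar.bz2") as f:
        t0 = time.perf_counter()
        data = f.read(-1)  # decompress the whole archive in one call
    print(len(data), "bytes in", time.perf_counter() - t0, "seconds")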
Ma Lin proposed this approach (PR 21740) for _PyBytesWriter/_PyUnicodeWriter on python-dev:
It would be interesting to see if using _PyBytesWriter in bz2, lzma, zlib and _io.FileIO.readall() would speed up the code. I would prefer to centralize the buffer allocation logic in _PyBytesWriter rather than having custom code in each file. _PyBytesWriter is designed to reduce the number of realloc() calls by using overallocation; it uses a different overallocation ratio on Windows, and it uses a small buffer allocated on the stack for strings up to 512 bytes.
I modified the lzma module to use different growth factors; see the attached picture different_factors.png. 1.5x should be the growth factor of _PyBytesWriter under Windows. So if _PyBytesWriter were changed to use memory blocks, maybe there would be no performance improvement. Over-allocate factor of _PyBytesWriter:

(I'm using Windows 10.)
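As a rough, hypothetical way to compare factors, this snippet counts how many resizes each growth factor needs to grow an 8 KiB buffer to 1 GiB (the sizes and factors are illustrative):

    def resize_count(factor, start=8 * 1024, target=1024 ** 3):
        # Count how many multiplicative grows are needed to reach target.
        size, count = start, 0
        while size < target:
            size = int(size * factor)
            count += 1
        return count

    for factor in (1.125, 1.25, 1.5, 2.0):
        print('%gx: %d resizes' % (factor, resize_count(factor)))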
ping
I left some review comments on the PR. I like the algorithm being used.

I don't really _like_ that this is a .h file acting as a C template to inject effectively the same static code into each module that wants to use it... which I think is the concern Victor is expressing in a comment above. I could live with this PR as is, because it is at least easy to maintain. But I anticipate we'll want to refactor it in the future to be shared code instead of a copy compiled per module.

This is the kind of thing that also makes sense to be usable outside of just these modules. Compression modules for the other popular compression algorithms currently not in the stdlib but available on PyPI shouldn't need to reinvent this wheel on their own without reason (lzo, brotli, zstandard, no doubt others...). It is also worth looking at those to see if they've already implemented something like this and how it differs.
Looking around, it appears you've proposed an independent implementation of this for the third-party brotli module: google/brotli#856. That is what I mean about making this reusable. :)
I think so too. The defines of BOB_BUFFER_TYPE/BOB_SIZE_TYPE/BOB_SIZE_MAX are ugly. If the core code is put together, these defines can be put in a thin wrapper in the _bz2module.c/_lzmamodule.c/zlibmodule.c files. This can be done now, but it would be ideal to improve it more thoroughly in 3.11.

_PyBytesWriter has different behavior: the user may access the existing data as plain data, which is impossible for _BlocksOutputBuffer. If an API/code can be carefully designed to be efficient/flexible/elegant, then the code may be used at some sites in CPython.
I tried it; it looks good.
Very sorry for updating at the last moment. Please review the last commit in PR 21740; the previous commits have not been changed. The changes:

1. Move
2. Ask the user to initialize the struct instance like this, and use assertions to check it:
   Then there is no longer any worry about whether buffer.list is uninitialized in error handling.
3. Change the type of BUFFER_BLOCK_SIZE from
4. These functions return the allocated size on success, and return -1 on failure:
5. All functions are decorated with
The above changes were made in this commit:
After that commit, there is a new commit that resolves the code conflicts introduced by PR 22126 an hour ago.

Sorry to complicate the review again. For the change from 55705f6 to 45d7526, see the uploaded file (45d7526.diff); it can also be easily seen with a Git client.
Thanks, this is great work! Especially when living within the constraints of C and the existing code.
Thanks for reviewing this big patch. |
Found a backward-incompatible behavior. Before the patch, in a 64-bit build, the zlib module allowed an initial size > UINT32_MAX. After the patch, when init_size > UINT32_MAX, it raises a ValueError. PR 25738 fixes this backward incompatibility.

Moreover, if you don't mind, I would like to take this opportunity to rename the wrapper functions from Buffer_* to OutputBuffer_*, so that readers can easily distinguish between the input buffer and the output buffer.
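A hypothetical reproduction of that regression on an affected 64-bit build (the exact behavior is as described in the report above):

    import zlib

    data = zlib.compress(b'x')
    # Before the patch this worked on 64-bit builds; after the patch,
    # per the report, an initial size > UINT32_MAX raised ValueError.
    zlib.decompress(data, bufsize=2**32)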
Renaming to OutputBuffer sounds like a good idea.
Sorry, for the (init_size > UINT32_MAX) problem, I have a better solution. Please imagine this scenario:
The decompressed size is known in advance, so the user sets bufsize to it:

    zlib.decompress(data, bufsize=10*1024*1024*1024)
But in the current code, the initial size is clamped to UINT32_MAX, so there are two regressions:
PR 26143 uses a UINT32_MAX sliding window for the first block, so now the initial buffer size can be greater than UINT32_MAX. _BlocksOutputBuffer_Finish() already has a fast path for a single block. Benchmark this code:

    zlib.decompress(data, bufsize=10*1024*1024*1024)
Maybe some user code relies on this corner case.
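As a small, hypothetical illustration of the scenario (sizes scaled down so it runs quickly): when the decompressed size is known, passing it as bufsize lets zlib.decompress() allocate the output buffer once:

    import zlib

    payload = b'a' * (16 * 1024 * 1024)  # pretend the size is known in advance
    compressed = zlib.compress(payload)
    # With bufsize equal to the final size, no resizing is needed.
    data = zlib.decompress(compressed, bufsize=len(payload))
    assert data == payload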