BlocksOutputBuffer causes a performance regression in bz2, lzma and zlib modules #101260

Open
rhpvorderman opened this issue Jan 23, 2023 · 6 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments


rhpvorderman commented Jan 23, 2023

Bug report

The _BlocksOutputBuffer was introduced for all compression modules (bz2, lzma and zlib) in Python 3.10 (#21740).

It delivers on its advertised strength: fast and memory-efficient creation of very large in-memory buffers.

However, it came at the cost of:

  • Lots of extra code
  • A performance regression in the common use case

The common use case is small things in memory. Having large in-memory buffers (multiple gigabytes) is an anti-pattern; streaming interfaces should be used instead, which in turn work with small buffers.

For a small buffer the BlocksOutputBuffer needs to create a list and a bytes object, while the 3.9 method only creates the bytes object. When the initial bytes object is too small, the BlocksOutputBuffer creates another bytes object and resizes the list; the 3.9 method simply resizes the bytes object.
The 3.9 method does not scale well to a large number of resizes, while the BlocksOutputBuffer does (due to cost amortization in the list resize and the fact that it never resizes filled bytes objects, saving on memcpy), but this only applies to very large buffers, and these should be rare. A conceptual sketch of both strategies follows below.
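
To make the difference concrete, here is a minimal Python sketch of the two strategies. This is not the CPython code (the real implementations are C, in Modules/zlibmodule.c and the BlocksOutputBuffer header); the function names, the uniform 16 KiB block size and the growth logic are simplifying assumptions for illustration only.

```python
def grow_single_buffer(total_size, start=16 * 1024):
    """3.9 arrange_output_buffer style: one buffer, doubled until the output
    fits, then shrunk once to the final size. A small output touches a single
    object and never needs a list."""
    buf = bytearray(start)
    while len(buf) < total_size:
        buf.extend(bytes(len(buf)))      # double in place; may imply a memcpy
    return bytes(buf[:total_size])       # final shrink to the real size

def grow_block_list(total_size, block_size=16 * 1024):
    """BlocksOutputBuffer style: collect fixed-size blocks in a list and join
    them at the end. Even a tiny output pays for the list, a block, and the
    final join."""
    blocks, written = [], 0
    while written < total_size or not blocks:
        n = min(block_size, total_size - written)
        blocks.append(bytes(n))          # a fresh block instead of a resize
        written += n
    return b"".join(blocks)              # one join over all blocks
```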

This can be shown by removing the BlocksOutputBuffer and reverting to the 3.9 method of arranging the output buffer. I have made a branch here: https://github.com/rhpvorderman/cpython/tree/noblocks.

Microbenchmarks, compressing the current README.rst (10044 bytes) at compression level 1.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 180 us +- 1 us

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 174 us +- 1 us

Also, with a larger file, Lib/_pydecimal.py (229220 bytes uncompressed, 60974 bytes compressed), which certainly requires resizing the initial 16K buffer.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.37 ms +- 0.01 ms

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.28 ms +- 0.01 ms

Q.E.D. BlocksOutputBuffer always loses against simple bytes resizing for smaller buffers, and _pydecimal.py, at more than 200K, is already quite a big input.

Additionally:

 Modules/zlibmodule.c | 417 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------------------------
 1 file changed, 117 insertions(+), 300 deletions(-)

And this is not counting the removal of the extra header file that provides the BlocksOutputBuffer.

The BlocksOutputBuffer is a nice piece of work when taken in isolation, but it optimizes for the pathological case rather than the common case. It is detrimental to the common case and it requires a lot of extra code that needs to be maintained.

If there are some issues with slow performance on larger buffers with arrange_output_buffer, I think several optimizations can still be done which are not as invasive as the _BlocksOutputBuffer. For instance, when zlib.decompress is called on a 100MB object, it makes sense to start the output buffer at 100MB rather than at 16K (the default in 3.9). This severely limits the number of resizes required. The same goes for zlib.compress: if the input is 100MB, the output is going to be at most 100MB plus header and trailer size. Reserving 100MB first and then downscaling to the compressed size (say 20MB) is much quicker than resizing a 16K buffer to 20MB using doublings only. A sketch of such a size hint follows below.
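
As an illustration of that idea, here is a hypothetical Python sketch of such a size hint. The helper name and the worst-case formula (a rough stand-in for zlib's compressBound()) are assumptions for illustration, not the CPython implementation.

```python
def initial_output_size(input_size, default=16 * 1024):
    """Pick a starting output-buffer size for a zlib.compress()-like call.

    Instead of starting at 16K and doubling up, reserve roughly the worst-case
    compressed size (input plus a small deflate overhead) and shrink once to
    the actual compressed length at the end."""
    worst_case = input_size + (input_size >> 12) + (input_size >> 14) + 64
    return max(default, worst_case)

# For a 100 MB input this reserves ~100 MB up front, so the buffer is shrunk
# once instead of being doubled up from 16K roughly a dozen times.
print(initial_output_size(100 * 1024 * 1024))
```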


rhpvorderman added the type-bug (An unexpected behavior, bug, or error) label Jan 23, 2023
AlexWaygood added the performance (Performance or resource usage) label Jan 23, 2023
rhpvorderman (Contributor, Author) commented:

To complement the rambling above, here is a systematic overview of why BlocksOutputBuffer is inefficient.

The arrange_output_buffer function works by creating a bytes object of 16K, then scaling it up by doubling to 32K, 64K, etc. At the end of the function it resizes the bytes object down to the size of the contents.

The BlocksOutputBuffer works by creating a list of size 1 and adding a bytes object. When the bytes object is full, a new bytes object is added. The last bytes object added is resized at the end, and a PyBytes_Join is called on the list.

| size of output | BlocksOutputBuffer | arrange_output_buffer |
| --- | --- | --- |
| < 16K | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 1x _PyBytes_Resize |
| < 32K | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 2x _PyBytes_Resize |
| < 64K | 2x PyBytes_New; 1x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 3x _PyBytes_Resize |
| < 128K | 3x PyBytes_New; 2x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 4x _PyBytes_Resize |

BlocksOutputBuffer does much more work initially. However, at much larger buffer sizes the cost of PyList_Resize decreases due to cost amortization. Conversely, arrange_output_buffer's calls to _PyBytes_Resize are initially very cheap, because the resize returns the same pointer many times. When the buffer sizes get larger, resizing almost always means creating a new bytes object plus an additional memcpy to transfer the data. So it is clear why BlocksOutputBuffer performs better when the output gets larger, and also why arrange_output_buffer will always perform better at smaller output sizes. A rough model of the call counts in the table above is sketched below.
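
A back-of-the-envelope Python model of those call counts, purely for illustration: the block growth sequence is taken from the BUFFER_BLOCK_SIZE comment quoted below, everything else is a simplification of the C code, and the function names are made up.

```python
BLOCK_SIZES_KIB = [32, 64, 256, 1024, 4096]    # per-block sizes: +32K, +64K, +256K, ...

def arrange_output_buffer_calls(output_kib, start_kib=16):
    """One PyBytes_New, then one _PyBytes_Resize per doubling plus the final shrink."""
    resizes, size = 1, start_kib                # the final shrink is counted up front
    while size < output_kib:
        size *= 2
        resizes += 1
    return {"PyBytes_New": 1, "_PyBytes_Resize": resizes}

def blocks_output_buffer_calls(output_kib):
    """One list up front, a new block whenever the previous one fills up, a resize
    of the last block, and a join when more than one block was needed."""
    n_blocks, filled = 0, 0
    for block in BLOCK_SIZES_KIB:
        n_blocks += 1
        filled += block
        if filled >= output_kib:
            break
    return {
        "PyList_New": 1,
        "PyBytes_New": n_blocks,
        "PyList_Resize": n_blocks - 1,
        "_PyBytes_Resize": 1,                   # only the last block gets resized
        "PyBytes_Join": 1 if n_blocks > 1 else 0,
    }

for kib in (10, 30, 60, 120):                   # roughly the rows of the table above
    print(kib, blocks_output_buffer_calls(kib), arrange_output_buffer_calls(kib))
```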

So how about memory usage? It turns out that for small data the BlocksOutputBuffer allocates more aggressively:

/* According to the block sizes defined by BUFFER_BLOCK_SIZE, the whole
   allocated size growth step is:
    1   32 KB       +32 KB
    2   96 KB       +64 KB
    3   352 KB      +256 KB
    4   1.34 MB     +1 MB
    5   5.34 MB     +4 MB

So there is less overallocation when using arrange_output_buffer as well: an output that fits in 16K, for example, gets a 32K first block from BlocksOutputBuffer but only a 16K buffer from arrange_output_buffer.

rhpvorderman (Contributor, Author) commented:

ping

arhadthedev (Member) commented:

@Yhg1s, @gpshead (as zlib experts)

cc @animalize (as the author of gh-21740)

rhpvorderman (Contributor, Author) commented:

ping

rhpvorderman added commits to rhpvorderman/cpython that referenced this issue on May 11, 2023

gpshead commented May 11, 2023

This issue is not urgent (and, being based on microbenchmarks, it isn't entirely clear that there is an actual issue).

I recognize the added code complexity of the additional list-of-bytes buffer code now in place almost everywhere. But if you want to make an argument against it, a word of warning about how to communicate this: tossing around words like "pathological" and "common case", with only a microbenchmark as data to show for the usage pattern you are biased towards, comes off as aggressively antagonistic rather than persuasive.

So let's jump back to actual real-world, non-microbenchmark data: what does running the pyperformance suite show for main at your PR #101279 branch point vs the PR? That represents practical applications.

Accepting PR #101279 would be flip-flopping back to another way of doing things, with users who perceive a greater need for one implementation just pointing fingers at the others. So I've marked the PR as draft and do-not-merge for the time being, just so it is clear to other committers who may see it that we shouldn't assume that PR is the direction we want to go without further discussion and non-arbitrary data on the issue.

When it comes to optimizations around buffering, one implementation may not fit all needs. Implementations seeking to eke out the last few percent of performance sometimes contain multiple strategies and dynamically pick which to use based on heuristics. That would also be a possible viable outcome that could satisfy both performance desires here. But in the absence of relevant, non-contrived microbenchmarks, making that judgement call isn't something we should do on a whim. Different applications have different needs and API call patterns.

rhpvorderman (Contributor, Author) commented:

> But if you want to make an argument against it, a word of warning about how to communicate this: tossing around words like "pathological" and "common case", with only a microbenchmark as data to show for the usage pattern you are biased towards, comes off as aggressively antagonistic rather than persuasive.

I am so sorry. I did not intend to be so abrasive. @gpshead thank you for calling me out on this. @animalize, I hope you will accept my apology for my behavior. This was certainly not an acceptable way to communicate and I do respect the work you have done on this feature.

> When it comes to optimizations around buffering, one implementation may not fit all needs. [...] Different applications have different needs and API call patterns.

Yes, the current implementation has the advantage of not overallocating when very large compressed files are decompressed entirely in memory. The fact that I don't know of such use cases will not make them magically disappear.

Given that the gzip application already works with an optimized buffer-allocation strategy for that particular use case, it is indeed not urgent. I will benchmark it better with real-world applications and come back with my findings.

iritkatriel added the stdlib (Python modules in the Lib dir) label Nov 26, 2023