BlocksOutputBuffer causes a performance regression in bz2, lzma and zlib modules #101260

Open
rhpvorderman opened this issue Jan 23, 2023 · 6 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments


rhpvorderman commented Jan 23, 2023

Bug report

The _BlocksOutputBuffer was introduced for all compression modules (bz2, lzma and zlib) in Python 3.10 (#21740).

It delivers on its advertised strength: fast and memory-efficient creation of very large in-memory buffers.

However, it came at the cost of:

  • Lots of extra code
  • A performance regression in the common use case

The common use case is small things in memory. Having large in-memory buffers (multiple gigabytes) is an anti-pattern; streaming interfaces should be used instead, which in turn work with small buffers.

For a small buffer the BlocksOutputBuffer needs to create a list and a bytes object, while the 3.9 method only creates the bytes object. When the initial bytes object is too small, the BlocksOutputBuffer creates another bytes object and resizes the list; the 3.9 method simply resizes the bytes object.
The 3.9 method does not scale well to a large number of resizes, while the BlocksOutputBuffer does (due to cost amortization in the list resize and the fact that it never resizes filled bytes objects, saving on memcpy), but this only applies to very large buffers, and these should be rare. A conceptual sketch of both strategies follows below.
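
To make the difference concrete, here is a minimal Python sketch of the two strategies. This is not the CPython code (the real implementations are C, in Modules/zlibmodule.c and the BlocksOutputBuffer header); the function names, the uniform 16 KiB block size and the growth logic are simplifying assumptions for illustration only.

```python
def grow_single_buffer(total_size, start=16 * 1024):
    """3.9 arrange_output_buffer style: one buffer, doubled until the output
    fits, then shrunk once to the final size. A small output touches a single
    object and never needs a list."""
    buf = bytearray(start)
    while len(buf) < total_size:
        buf.extend(bytes(len(buf)))      # double in place; may imply a memcpy
    return bytes(buf[:total_size])       # final shrink to the real size

def grow_block_list(total_size, block_size=16 * 1024):
    """BlocksOutputBuffer style: collect fixed-size blocks in a list and join
    them at the end. Even a tiny output pays for the list, a block, and the
    final join."""
    blocks, written = [], 0
    while written < total_size or not blocks:
        n = min(block_size, total_size - written)
        blocks.append(bytes(n))          # a fresh block instead of a resize
        written += n
    return b"".join(blocks)              # one join over all blocks
```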

This can be shown by removing the BlocksOutputBuffer and reverting to the 3.9 method of arranging the output buffer. I have made a branch here: https://github.com/rhpvorderman/cpython/tree/noblocks.

Microbenchmarks, compressing the current README.rst (10044 bytes) at compression level 1.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 180 us +- 1 us

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 174 us +- 1 us

Also, with a larger file, Lib/_pydecimal.py (229220 bytes uncompressed, 60974 bytes compressed), which certainly requires resizing the initial 16K buffer.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.37 ms +- 0.01 ms

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.28 ms +- 0.01 ms

Q.E.D. BlocksOutputBuffer always loses against simple bytes resizing for smaller buffers, and _pydecimal.py, at more than 200K, is already quite a big input.

Additionally:

 Modules/zlibmodule.c | 417 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------------------------
 1 file changed, 117 insertions(+), 300 deletions(-)

And this is not counting the removal of the extra header file that provides the BlocksOutputBuffer.

The BlocksOutputBuffer is a nice piece of work when taken in isolation, but it optimizes for the pathological case rather than the common case. It is detrimental to the common case and it requires a lot of extra code that needs to be maintained.

If there are some issues with slow performance on larger buffers with arrange_output_buffer, I think several optimizations can still be done which are not as invasive as the _BlocksOutputBuffer. For instance, when zlib.decompress is called on a 100MB object, it makes sense to start the output buffer at 100MB rather than at 16K (the default in 3.9). This severely limits the number of resizes required. The same goes for zlib.compress: if the input is 100MB, the output is going to be at most 100MB plus header and trailer size. Reserving 100MB first and then downscaling to the compressed size (say 20MB) is much quicker than resizing a 16K buffer to 20MB using doublings only. A sketch of such a size hint follows below.
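
As an illustration of that idea, here is a hypothetical Python sketch of such a size hint. The helper name and the worst-case formula (a rough stand-in for zlib's compressBound()) are assumptions for illustration, not the CPython implementation.

```python
def initial_output_size(input_size, default=16 * 1024):
    """Pick a starting output-buffer size for a zlib.compress()-like call.

    Instead of starting at 16K and doubling up, reserve roughly the worst-case
    compressed size (input plus a small deflate overhead) and shrink once to
    the actual compressed length at the end."""
    worst_case = input_size + (input_size >> 12) + (input_size >> 14) + 64
    return max(default, worst_case)

# For a 100 MB input this reserves ~100 MB up front, so the buffer is shrunk
# once instead of being doubled up from 16K roughly a dozen times.
print(initial_output_size(100 * 1024 * 1024))
```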


rhpvorderman added the type-bug (An unexpected behavior, bug, or error) label Jan 23, 2023
AlexWaygood added the performance (Performance or resource usage) label Jan 23, 2023
rhpvorderman (Contributor, Author) commented:

To complement the rambling above, here is a systematic overview of why BlocksOutputBuffer is inefficient.

The arrange_output_buffer function works by creating a bytes object of 16K, then scaling it up by doubling to 32K, 64K, etc. At the end of the function it resizes the bytes object down to the size of the contents.

The BlocksOutputBuffer works by creating a list of size 1 and adding a bytes object. When the bytes object is full, a new bytes object is added. The last bytes object added is resized at the end, and a PyBytes_Join is called on the list.

| size of output | BlocksOutputBuffer | arrange_output_buffer |
| --- | --- | --- |
| < 16K | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 1x _PyBytes_Resize |
| < 32K | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 2x _PyBytes_Resize |
| < 64K | 2x PyBytes_New; 1x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 3x _PyBytes_Resize |
| < 128K | 3x PyBytes_New; 2x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 4x _PyBytes_Resize |

BlocksOutputBuffer does much more work initially. However, at much larger buffer sizes the cost of PyList_Resize decreases due to cost amortization. Conversely, arrange_output_buffer's calls to _PyBytes_Resize are initially very cheap, because the resize returns the same pointer many times. When the buffer sizes get larger, resizing almost always means creating a new bytes object plus an additional memcpy to transfer the data. So it is clear why BlocksOutputBuffer performs better when the output gets larger, and also why arrange_output_buffer will always perform better at smaller output sizes. A rough model of the call counts in the table above is sketched below.
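
A back-of-the-envelope Python model of those call counts, purely for illustration: the block growth sequence is taken from the BUFFER_BLOCK_SIZE comment quoted below, everything else is a simplification of the C code, and the function names are made up.

```python
BLOCK_SIZES_KIB = [32, 64, 256, 1024, 4096]    # per-block sizes: +32K, +64K, +256K, ...

def arrange_output_buffer_calls(output_kib, start_kib=16):
    """One PyBytes_New, then one _PyBytes_Resize per doubling plus the final shrink."""
    resizes, size = 1, start_kib                # the final shrink is counted up front
    while size < output_kib:
        size *= 2
        resizes += 1
    return {"PyBytes_New": 1, "_PyBytes_Resize": resizes}

def blocks_output_buffer_calls(output_kib):
    """One list up front, a new block whenever the previous one fills up, a resize
    of the last block, and a join when more than one block was needed."""
    n_blocks, filled = 0, 0
    for block in BLOCK_SIZES_KIB:
        n_blocks += 1
        filled += block
        if filled >= output_kib:
            break
    return {
        "PyList_New": 1,
        "PyBytes_New": n_blocks,
        "PyList_Resize": n_blocks - 1,
        "_PyBytes_Resize": 1,                   # only the last block gets resized
        "PyBytes_Join": 1 if n_blocks > 1 else 0,
    }

for kib in (10, 30, 60, 120):                   # roughly the rows of the table above
    print(kib, blocks_output_buffer_calls(kib), arrange_output_buffer_calls(kib))
```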

So how about memory usage? It turns out that for small data the BlocksOutputBuffer allocates more aggressively:

/* According to the block sizes defined by BUFFER_BLOCK_SIZE, the whole
   allocated size growth step is:
    1   32 KB       +32 KB
    2   96 KB       +64 KB
    3   352 KB      +256 KB
    4   1.34 MB     +1 MB
    5   5.34 MB     +4 MB

So there is less overallocation when using arrange_output_buffer as well: an output that fits in 16K, for example, gets a 32K first block from BlocksOutputBuffer but only a 16K buffer from arrange_output_buffer.

rhpvorderman (Contributor, Author) commented:

ping

arhadthedev (Member) commented:

@Yhg1s, @gpshead (as zlib experts)

cc @animalize (as the author of gh-21740)

rhpvorderman (Contributor, Author) commented:

ping

rhpvorderman added commits to rhpvorderman/cpython that referenced this issue on May 11, 2023

gpshead commented May 11, 2023

This issue is not urgent (and, being based on microbenchmarks, it isn't entirely clear that there is an actual issue).

I recognize the added code complexity of the additional list-of-bytes buffer code now in place almost everywhere. But if you want to make an argument against it, a word of warning about how to communicate this: tossing around words like "pathological" and "common case", with only a microbenchmark as data to show for the usage pattern you are biased towards, comes off as aggressively antagonistic rather than persuasive.

So let's jump back to actual real-world, non-microbenchmark data: what does running the pyperformance suite show for main at your PR #101279 branch point vs the PR? That represents practical applications.

Accepting PR #101279 would be flip-flopping back to another way of doing things, with users who perceive a greater need for one implementation just pointing fingers at the others. So I've marked the PR as draft and do-not-merge for the time being, just so it is clear to other committers who may see it that we shouldn't assume that PR is the direction we want to go without further discussion and non-arbitrary data on the issue.

When it comes to optimizations around buffering, one implementation may not fit all needs. Implementations seeking to eke out the last few percent of performance sometimes contain multiple strategies and dynamically pick which to use based on heuristics. That would also be a possible viable outcome that could satisfy both performance desires here. But in the absence of relevant, non-contrived microbenchmarks, making that judgement call isn't something we should do on a whim. Different applications have different needs and API call patterns.

rhpvorderman (Contributor, Author) commented:

> But if you want to make an argument against it, a word of warning about how to communicate this: tossing around words like "pathological" and "common case", with only a microbenchmark as data to show for the usage pattern you are biased towards, comes off as aggressively antagonistic rather than persuasive.

I am so sorry. I did not intend to be so abrasive. @gpshead thank you for calling me out on this. @animalize, I hope you will accept my apology for my behavior. This was certainly not an acceptable way to communicate and I do respect the work you have done on this feature.

> When it comes to optimizations around buffering, one implementation may not fit all needs. [...] Different applications have different needs and API call patterns.

Yes, the current implementation has the advantage of not overallocating when very large compressed files are decompressed entirely in memory. The fact that I don't know of such use cases will not make them magically disappear.

Given that the gzip application already works with an optimized buffer-allocation strategy for that particular use case, it is indeed not urgent. I will benchmark it better with real-world applications and come back with my findings.

iritkatriel added the stdlib (Python modules in the Lib dir) label Nov 26, 2023