[WiP] Rewrite zstd decoder to use an API that supports multiple frames (fix issue #3008) #3021

grossag · 2023-05-08T19:02:58Z

Compressed zstd data is composed of one or more frames. python-zstandard has several APIs that only support decoding one frame. This is not useful for urllib3 because it will result in decompressed data that is truncated to 1048576 bytes.

This patch changes the zstd decoder to use an API that is known to be able to decompress multiple frames at a time. It uses ZstdDecompressionReader, which takes any object that fulfills the interface of io.RawBase. The approach here is to have ZstdDecoder create an io.BytesIo object, feed data to it, and have python-zstandard consume data when it can.

With this change, two zstd tests are broken:

test_decode_zstd_incomplete seems to be an incorrect test because it assumes that slicing the last element will cause zstd decompression to fail, but the current compressed package doesn't behave that way. TODO: I need to change the test to create data with checksums and slice in a way that breaks the checksum validation.
test_chunked_decoding_zstd is expecting an exception to be thrown but it isn't. This seems to be because the HTTPResponse.read() is misbehaving. The issue is that there is a loop while len(self._decoded_buffer) < amt and data: which drains enough data to grab headers. But this test is draining the full response buffer and thus causing _fp to close, so ZstdDecoder.flush() is never called. TODO: figure out how to fix HTTPResponse.read() so ZstdDecoder.flush() is called in this case.

This change also adds a new test test_decode_zstd_multiple_frames which is a work-in-progress. I'd like to find a way to generate this data programmatically instead of adding the file text.txt.zstd, but I haven't been able to do it so far.

Compressed zstd data is composed of one or more frames. python-zstandard has several APIs that only support decoding one frame. This is not useful for urllib3 because it will result in decompressed data that is truncated to 1048576 bytes. This patch changes the zstd decoder to use an API that is known to be able to decompress multiple frames at a time. It uses ZstdDecompressionReader, which takes any object that fulfills the interface of io.RawBase. The approach here is to have ZstdDecoder create an io.BytesIo object, feed data to it, and have python-zstandard consume data when it can.

grossag · 2023-05-08T19:29:54Z

I'm going to close this because #3022 is a better solution than mine.

grossag mentioned this pull request May 8, 2023

Fix multi-frame Zstandard response decoding #3022

Merged

grossag closed this May 8, 2023

grossag mentioned this pull request May 15, 2023

python-zstandard produces incomplete data after decompress indygreg/python-zstandard#196

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WiP] Rewrite zstd decoder to use an API that supports multiple frames (fix issue #3008) #3021

[WiP] Rewrite zstd decoder to use an API that supports multiple frames (fix issue #3008) #3021

grossag commented May 8, 2023

grossag commented May 8, 2023

[WiP] Rewrite zstd decoder to use an API that supports multiple frames (fix issue #3008) #3021

[WiP] Rewrite zstd decoder to use an API that supports multiple frames (fix issue #3008) #3021

Conversation

grossag commented May 8, 2023

grossag commented May 8, 2023