support optional crc32 for uncompressed streaming zip32 and zip64 #134

Open: ikreymer wants to merge 3 commits into main


ikreymer commented Jul 25, 2024

This PR implements the idea originally discussed in #17 and #58, producing ZIP files with actual length in local header and 0 crc32, and including a data descriptor with the length and actual crc32.

This allows specifying ZIP file members with NO_COMPRESSION_64(file_size, 0) and NO_COMPRESSION_32(file_size, 0): instead of raising the invalid crc32 exception, the code computes the crc32 and stores it in the data descriptor.
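
For reference, the data descriptor is a small fixed-size record that follows the member data. A minimal sketch of packing one with Python's `struct`, assuming the optional `PK\x07\x08` signature is written and that the member is stored uncompressed (so both size fields are equal). `pack_data_descriptor` is a hypothetical helper for illustration, not a function from this PR or from stream-zip:

```python
import struct
import zlib

# Optional signature that conventionally precedes a data descriptor
DATA_DESCRIPTOR_SIGNATURE = b'PK\x07\x08'

def pack_data_descriptor(crc_32, size, zip64=False):
    """Hypothetical helper: pack a data descriptor for a stored member.

    The size fields are 4 bytes each for ZIP32 and 8 bytes each for ZIP64.
    With no compression, compressed size == uncompressed size.
    """
    fmt = '<IQQ' if zip64 else '<III'
    return DATA_DESCRIPTOR_SIGNATURE + struct.pack(fmt, crc_32, size, size)

data = b'hello world'
descriptor_32 = pack_data_descriptor(zlib.crc32(data), len(data))
descriptor_64 = pack_data_descriptor(zlib.crc32(data), len(data), zip64=True)
```

With the signature included, the ZIP32 form is 16 bytes and the ZIP64 form is 24 bytes.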

The ZIP files produced with this implementation should:

Being able to support this without maintaining a custom fork would really help our use case.

The main use case is stream-zipping files from S3-like buckets, where the size is available but the crc-32 usually is not (also mentioned in #17).

- if crc32 value passed in is 0, then include data descriptor record
with actual length and crc32

michalc (Member) commented Jul 27, 2024

Interesting... will have more of a look around and ponder. Some initial thoughts/questions/requests on this:

  • Does it produce ZIP files that are not valid according to the spec? From https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT:

    If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data.

    I'm not sure right now if we should go against the spec...

  • I think zero is a valid CRC_32, so if some sort of sentinel value were to be used to change behaviour/structure, it should be something else. None?

  • The pattern of using the CRCActual object to get data out from _no_compression_streamed_data: this isn't consistent with the very similar case of getting data out from _zip_data. I think _no_compression_streamed_data should return the actual crc_32, just like _zip_data does. So then the value could be retrieved with something along the lines of actual_crc_32 = yield from encryption_func(_zip_data(....

  • And can we test far more: not just stream-unzip, but Python's ZipFile, unzip, bsdcpio/libarchive, 7zip, and AES encrypted versions as well where possible. If this is against the spec, I would say that's extra important.
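
The sentinel and return-value points above can be sketched together. This is a minimal illustration under assumptions, not the PR's code or stream-zip's internals: `stream_with_crc_32` and `checked_member_data` are hypothetical names, `None` is used as the "unknown CRC" sentinel, and the generator hands back the computed CRC-32 through its return value so the caller picks it up with `yield from`.

```python
import zlib

def stream_with_crc_32(chunks):
    """Hypothetical: pass chunks through unchanged, return their CRC-32."""
    crc_32 = zlib.crc32(b'')  # 0, which is why 0 cannot double as "unknown"
    for chunk in chunks:
        crc_32 = zlib.crc32(chunk, crc_32)
        yield chunk
    return crc_32  # becomes the value of the caller's `yield from`

def checked_member_data(chunks, expected_crc_32=None):
    """Hypothetical: None means the CRC-32 is unknown, so skip the check."""
    actual_crc_32 = yield from stream_with_crc_32(chunks)
    if expected_crc_32 is not None and expected_crc_32 != actual_crc_32:
        raise ValueError('CRC-32 mismatch')
    return actual_crc_32

# An empty member legitimately has a CRC-32 of 0, and passes the check
empty = list(checked_member_data([], expected_crc_32=0))
# With the None sentinel, no check is made and the data streams through
streamed = list(checked_member_data([b'ab', b'c']))
```

Because an empty member really does have a CRC-32 of 0, treating 0 as "not supplied" would make that case ambiguous, which `None` avoids.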

My biggest concern is the spec thing...
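
As one possible starting point for the compatibility testing, here is a stdlib-only sketch that hand-builds a single-member archive in the layout the PR describes (bit 3 set, real size but zero crc-32 in the local header, actual values in the data descriptor and central directory) and opens it with Python's `zipfile`. `build_archive` is a hypothetical helper written for this illustration and does not use stream-zip. Note that `zipfile` takes sizes and CRC from the central directory, so passing here says little about streaming readers that rely on the local header.

```python
import io
import struct
import zipfile
import zlib

def build_archive(name: bytes, data: bytes) -> bytes:
    """Hypothetical: one stored ZIP32 member in the layout under discussion."""
    crc_32 = zlib.crc32(data)
    size = len(data)
    flags = 0x0008                # bit 3: a data descriptor follows the data
    dos_time, dos_date = 0, 0x21  # 1980-01-01 00:00:00

    # Local header: real sizes, but crc-32 left as 0
    local_header = struct.pack(
        '<4sHHHHHIIIHH',
        b'PK\x03\x04', 20, flags, 0, dos_time, dos_date,
        0, size, size, len(name), 0,
    ) + name
    # Data descriptor: actual crc-32 and sizes
    data_descriptor = struct.pack('<4sIII', b'PK\x07\x08', crc_32, size, size)

    central_offset = len(local_header) + size + len(data_descriptor)
    # Central directory header: always carries the actual values
    central_header = struct.pack(
        '<4sHHHHHHIIIHHHHHII',
        b'PK\x01\x02', 20, 20, flags, 0, dos_time, dos_date,
        crc_32, size, size, len(name), 0, 0, 0, 0, 0, 0,
    ) + name
    end_of_central = struct.pack(
        '<4sHHHHIIH',
        b'PK\x05\x06', 0, 0, 1, 1, len(central_header), central_offset, 0,
    )
    return (local_header + data + data_descriptor
            + central_header + end_of_central)

payload = b'some uncompressed content'
archive = build_archive(b'a.txt', payload)

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    first_bad_member = zf.testzip()  # None when every CRC check passes
    read_back = zf.read('a.txt')
```

The same bytes could be written to disk and fed to unzip, bsdcpio/libarchive and 7zip to cover the other readers listed above.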

michalc (Member) commented Aug 11, 2024

To communicate: I am more and more against this, because it results in ZIP files that do not adhere to the spec. Even if we test a load of existing unzippers, it's not very friendly to future unzippers that expect files to adhere to the spec, or even to new versions of existing unzippers that make changes expecting them to be fine because of the spec.
