
Add an option to upgrade to zip64 for long zips #42

Closed
wants to merge 2 commits

Conversation

@craigds commented May 25, 2023

When writing a zip with a lot of files, or one with a few really big files, it's possible to exceed an offset of 4GiB even though all files written are smaller than 4GiB.

Currently, writing a ZIP_32 file header when the zip length has exceeded 4GiB causes an overflow error.

However, since there's no way for the caller to know that the offset has exceeded 4GiB, there's no way for them to opt in to choosing ZIP_64 instead.

This change adds an allow_upgrade_to_64 kwarg, so that the caller can opt in to writing zip64 files when the offset exceeds 4GiB.

Added later:

The second commit also fixes the case where only ZIP_32 files are written but the total zip size exceeds 4GiB - in which case a zip64 central directory and/or end-of-central-directory record must be written.
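
A sketch of intended use (the kwarg name is the one this PR adds; filenames and sizes are illustrative):

from datetime import datetime
from stream_zip import ZIP_32, stream_zip

# ~4.9GiB of members, each individually well under 4GiB
member_files = (
    ('file-1.bin', datetime.now(), 0o600, ZIP_32, (b'\0' * 65536 for _ in range(40000))),
    ('file-2.bin', datetime.now(), 0o600, ZIP_32, (b'\0' * 65536 for _ in range(40000))),
)

# Opt in to zip64 records for anything written past the 4GiB offset
for chunk in stream_zip(member_files, allow_upgrade_to_64=True):
    pass  # write chunk somewhere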

@craigds marked this pull request as draft May 25, 2023
@craigds (Author) commented May 25, 2023

From scrutinizing the code, I suspect there is another case related to this:

When all the file header offsets are <4GiB and are written as ZIP_32, but the last file spans the 4GiB boundary, the central directory starts at an offset >4GiB, yet the End of Central Directory Record is not written in the zip64 format. Instead, an overflow error occurs.

The same flag I used here can be used to write the EOCD record in zip64 format in that case - but I'll add that to this PR and add a test for it tomorrow.
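
(To make the failure concrete - a sketch based on the ZIP appnote's record layout, not stream-zip's actual code. The zip32 end-of-central-directory record only has a 4-byte field for the central directory offset:)

import struct

def eocd_32(num_entries, cd_size, cd_offset):
    # Classic (zip32) end-of-central-directory record: the central directory
    # offset is a 4-byte field, so offsets over 0xffffffff cannot be stored
    return struct.pack(
        '<IHHHHIIH',
        0x06054b50,                # EOCD signature
        0, 0,                      # disk number, disk with central directory
        num_entries, num_entries,  # entries on this disk, entries total
        cd_size,
        cd_offset,                 # struct.error if cd_offset > 0xffffffff
        0,                         # comment length
    )

eocd_32(3, 190, 5 * 2**30)  # raises struct.error: needs the zip64 EOCD instead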

@michalc (Member) commented May 25, 2023

Hi @craigds,

Thanks for the PR! Will have a look and more of a ponder on it.

My main question so far is - why not set ZIP_64 on all the member files? If this is done, then it's known that if a client can open one generated ZIP file, even a little one, it will always be able to open any ZIP generated with the same config. I fear gotchas down the road where ZIPs created by stream-zip can sometimes be opened by a particular client and sometimes can't. (A bit like the issue I reported in libarchive/libarchive#1834 - it's a long thread, but the crux of it is that Java seems to have some dynamic ZIP/ZIP_64 behaviour when creating ZIP files. Not quite the same behaviour as the one in this PR, but it causes gotchas when things cross the zip64 boundary.)

Michal

@craigds (Author) commented May 25, 2023

Yes, we tried that. Unfortunately we hit a bug in Safari, which I believe you can repro if you try: Safari has an auto-extract feature for downloaded archives, but it has a bug - it only extracts a single file from zip64 zip files.

It doesn't seem to happen for large (4GB+) zipfiles though - Safari doesn't attempt to auto-extract those (perhaps because they knew this was a problem? There's no public bug tracker, so there's nowhere to see/report this bug)

So our workaround is to generate zip32 wherever possible, and fallback to zip64 where absolutely necessary.

@craigds (Author) commented May 25, 2023

Plus, it seems a good idea to use zip32 where possible anyway, to maximise compatibility.

@michalc (Member) commented May 25, 2023

It doesn't seem to happen for large (4GB+) zipfiles though

So I can reproduce the bug, but so far it seems to happen if the first file in the archive is ZIP_64, and it seems independent of size - or at least, I can't get size to affect it.

In more detail, making a file like this:

from datetime import datetime
from stream_zip import ZIP_64, ZIP_32, stream_zip

member_files = (
    (
        'my-file-1.txt',
        datetime.now(),
        0o600,
        ZIP_64,
        (b'\0' * 1024 for _ in range(0, 1024)),
    ),
    (
        'my-file-2.txt',
        datetime.now(),
        0o600,
        ZIP_32,
        (b'\0' * 1024 for _ in range(0, 1024)),
    ),
)

with open('out.zip', 'wb') as f:
    for zipped_chunk in stream_zip(member_files):
        f.write(zipped_chunk)

and then downloading it via https://www.npmjs.com/package/http-server in Safari, it auto-extracts only the first file. However, if I swap the _32 and _64, then it auto-extracts both files (in a folder).

So I am wondering - do zip64 files made by other programs suffer the same fate in Safari? I wonder if it's to do with the fact that files created here use data descriptors...

(Also - it's crossed my mind that there is a bit of a workaround in client code? Pass ZIP_32 on the first file, but ZIP_64 on the others?)

@michalc (Member) commented May 25, 2023

Also one other thing is crossing my mind - if something like this is to go in, is there a simpler API for it, even a backwards-incompatible one? Having an option per member file and a global option for the whole zip that interplays with the per-member options... my instinct is there is a simpler way. Maybe no option per member, but just a global option? Say:

mode=ZIP_32
mode=ZIP_32_AUTO_64
mode=ZIP_64

@craigds (Author) commented May 25, 2023

Maybe no option per member, but just a global option

I'd like to do that, but if I understand correctly, that's not possible when streaming the input files unless the total filesize is known in advance. This is because the file header block differs between zip32/64 and needs to go before the file contents. So stream-zip would need to know the filesize to determine whether the file should be written as zip64 or zip32.
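
(A sketch of the format detail behind this, from the ZIP appnote rather than stream-zip's code. Streaming writes use data descriptors, so the real sizes trail the data, but the zip32/zip64 choice still changes the header bytes written up front:)

import struct

uncompressed_size, compressed_size = 456, 123  # example values

# zip32 local header: sizes are 4-byte fields written before the file data
zip32_size_fields = struct.pack('<II', compressed_size, uncompressed_size)

# zip64 local header: the 4-byte fields hold 0xffffffff, and the real 8-byte
# sizes live in a zip64 extra field (header ID 0x0001), also before the data
zip64_size_fields = struct.pack('<II', 0xffffffff, 0xffffffff)
zip64_extra_field = struct.pack('<HHQQ', 0x0001, 16, uncompressed_size, compressed_size)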

To be clear, in our case we do have the filesizes in advance (we're writing a bunch of pre-existing files from S3) so we could use that API just fine. But other users of stream-zip might not be able to use it if they're generating the input files on demand while streaming them.

I considered an API where you pass each member's filesize in for each member, instead of the zip mode, and stream-zip uses it to choose the appropriate zip mode for each member. That would be backward compatible and allow a simpler API for callers who already know the filesize.

However, there is also the problem of the second case I mentioned, where every file is ZIP_32 but the total size of the zip file is >4GiB. In that case stream-zip needs to write the central directory as zip64 even though all members are zip32. So I think a per-zipfile option is better, because it allows us to solve that problem.

The other point I'd like to suggest is that I think a better default for the option would be True. I submitted this PR with the default set to False because I'm aware that PRs tend to get merged more easily when they don't change things for existing users :) But I don't see the point in streaming gigabytes of zipfile to a user and then erroring out with an overflow error. Surely it's usually better to give them a zipfile, even if some obscure systems might not work with it. (Experience suggests those systems are rare - we've been generating large zip64s since 2009 and have had few issues reported.)

@craigds marked this pull request as ready for review May 26, 2023
Previously, if all files were zip32 but the total size of the zip
exceeded 4GiB, stream-zip would throw an overflow error, because it
would try to write a zip32 central directory and/or
end-of-central-directory record. This would fail because those records
contain offsets over 4GiB, which cannot be represented in the zip32
versions of those records.

This change upgrades to zip64 central directory and/or
end-of-central-directory records if the caller
provides the `allow_upgrade_to_64=True` kwarg.
@craigds (Author) commented May 26, 2023

I considered an API where you pass each member's filesize in for each member, instead of the zip mode, and stream-zip uses it to choose the appropriate zip mode for each member. That would be backward compatible and allow a simpler API for callers who already know the filesize.

I'd be happy to submit a separate PR for this if you think it's a good idea. It's essentially how we're using stream-zip anyway, and it'd make life a bit nicer to have that filesize detection live inside stream-zip.

@michalc (Member) commented May 26, 2023

(I know I'm only responding to some of your points, still pondering)

Surely it's usually better to give them a zipfile, even if some obscure systems might not work with it.

I think I mostly disagree - if we're aiming for high compatibility (and I think for this project we want that), then a component of that is to fail early if for whatever reason we don't get it. Another related component is that if a zip file constructed with a certain configuration works on a system, then another zip file generated with the same config should also continue to work. I'm specifically thinking of non-prod -> prod setups. Quite often non-prod has smaller datasets, and that's where most testing is done.

The other point I'd like to suggest is that I think a better default for the option would be True

Also a consideration - zip64 support still isn't great. For example, the system discussed here, Safari, seems to have semi-broken zip64 support. LibreOffice as well: when it opens ODS files, which are zip files, it doesn't support zip64 either, last I checked - they have to be zip32. I would say these facts lend weight to not defaulting to a configuration that sometimes has zip64 components and sometimes not.

But one other thing has crossed my mind. Say this PR were to go in. To make a zip file that works on Safari without losing files, then there would still be a requirement on client code from what I can tell - that the first member of the zip must be ZIP_32. If it's ZIP_64, then only that first file is auto-extracted and the others lost. In which case - for Safari support "first file ZIP_32, all others ZIP_64" could be the recommendation, via documentation. This would be the same recommendation if the PR were not to go in, and so the option feels a bit unnecessary for this particular case?
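
(A sketch of that recommendation in client code, with names and chunks supplied by the caller:)

from datetime import datetime
from stream_zip import ZIP_32, ZIP_64, stream_zip

def safari_friendly_members(named_chunks):
    # First member ZIP_32, all others ZIP_64, per the recommendation above
    for i, (name, chunks) in enumerate(named_chunks):
        mode = ZIP_32 if i == 0 else ZIP_64
        yield name, datetime.now(), 0o600, mode, chunks

zipped = stream_zip(safari_friendly_members([('a.txt', (b'a',)), ('b.txt', (b'b',))]))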

@craigds (Author) commented May 26, 2023

We're discussing two separate topics, so I'll try to keep my answers separate:

On what the right API might be:

But one other thing has crossed my mind. Say this PR were to go in. To make a zip file that works on Safari without losing files, then there would still be a requirement on client code from what I can tell - that the first member of the zip must be ZIP_32. If it's ZIP_64, then only that first file is auto-extracted and the others lost. In which case - for Safari support "first file ZIP_32, all others ZIP_64" could be the recommendation, via documentation. This would be the same recommendation if the PR were not to go in, and so the option feels a bit unnecessary for this particular case?

Yes, that's a good point. In that case the API that's required here would be something like a ZIP_32_AUTO_64 mode per file, like you said. Then the caller can pass ZIP_32 for the first file if they're targeting Safari, and ZIP_32_AUTO_64 for all other files. I suppose then the contract could be that if any file was passed a ZIP_32_AUTO_64 mode, stream-zip could assume that writing a central-directory as zip64 is acceptable if required - and no separate flag would be needed to opt in.

If you'd prefer that API, it's slightly more flexible than the one I've implemented here 👍 If you like I will update my PR to match.

On whether throwing overflow errors is a good idea

However, as a user of stream-zip, I would still not want to use that above workaround - Safari is a very small portion of our user base, and most of our zips are mostly zip32. I would prefer to serve a zip64 rather than throw an overflow error.

Here's my reasoning: Let's say (numbers are made up):

  • 2% of users use Safari with auto-expanding zip files enabled
  • 10% of zips need to be zip64
    • Of those, 50% of the time, the first file happens to be big enough to need to be written as zip64

If we throw an overflow error every time the first file in a streamed archive is big enough to be zip64, then we will simply not be able to serve 5% of our zips - with no possible user workaround. Clearly, this isn't a great option.

If, however, we just auto-upgrade zips to zip64 when required, then:

  • 98% of users don't use Safari with auto-expanding zips, so don't have a problem. We're already getting a lower error rate than when we religiously throw overflow errors
  • 1.9% of users use Safari but don't happen to hit the bug because their particular zip file happens to not hit it.
  • (2% * 10% * 50%) == 0.1% of users will have some weird unpacking confusion in Safari. They get confused, try another browser or ask a question on our support chat. We or a chatbot point them to a small help article on what the problem is, and they can figure out a workaround.

The comparison is 5% of user zips disrupted with no workaround, or 0.1% of user zips disrupted with a documentable workaround. It gets even better if we can pre-sort the files involved and find a zip32-able file as the first member - but if that's not possible, it still makes clear sense to not throw an overflow error in this case.
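
(The arithmetic, spelled out with those made-up numbers:)

safari_auto = 0.02       # users on Safari with auto-expanding archives
needs_zip64 = 0.10       # zips that need zip64 somewhere
first_file_big = 0.50    # of those, the first file itself needs zip64

# Always throwing overflow errors: these zips cannot be served at all
print(needs_zip64 * first_file_big)                # 0.05 -> 5%, no workaround

# Auto-upgrading to zip64: only Safari auto-extract users of those zips hit the bug
print(safari_auto * needs_zip64 * first_file_big)  # 0.001 -> 0.1%, documentable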

@craigds (Author) commented May 26, 2023

Thanks for taking the time to give thoughtful replies to this stuff by the way 😄 💯

@michalc (Member) commented May 26, 2023

If, however, we just auto-upgrade zips to zip64 when required, then:

0.1% of user zips disrupted with a documentable workaround

Ah - so if we could have a way of doing a “full” auto upgrade to zip64, I would agree with the reasoning and be all for it (Maybe via a non-default option? But that’s a detail)

The issue is we can’t do that in a streaming way in general. It’s not just the directory at the end, it’s about the header that goes before each file that specifies if that file is zip32 or zip64. It has to be decided before the beginning of the file, before the compressed and uncompressed sizes are known. This means that member files can't be auto-upgraded - you have to choose zip32 and risk overflow due to the member file ending up too big, or choose zip64 and risk lack of support. I don't think there is really another choice in general...

But... (at the risk of going round in circles/repeating myself a bit) since you know the uncompressed file sizes (and hopefully can make a judgement call that for files smaller than the zip32 limit the chance of their compressed size being bigger than the zip32 limit is vanishingly small in your case), to address the Safari issue you can do...

ZIP_32 on the first file if at all possible, ZIP_64 otherwise
ZIP_64 on all the others

This you can do in your case, since you know the first file's uncompressed file size up front, and can (probably) judge that its compressed size won't hit the zip32 limit. This would give you the same 0.1% residual issues, I think? And I'm not sure we can really construct a reasonable API that takes on some of the responsibility of the "ZIP_32 on the first file if possible, ZIP_64 otherwise" rule. And I'm not sure that auto-upgrading the directory at the end beyond what stream-zip does now, and/or introducing other constants, has any benefit over this approach?

(Also - I think there technically is a way to do auto upgrade individual member files from zip32 to zip64 - Java does it. But it has a high price - libarchive doesn't fully support it from what I can tell. Which actually might also mean that Safari doesn't support it... so it's a bit of a no-go for various reasons)

(Note - as mentioned above, it is possible that the compressed size is bigger than uncompressed size in a few cases. A very small file for example, e.g. 0 bytes. But also a large file that is entirely random has a good chance of this. Or, a large file that uses level=0 in construction of the zlib object I suspect too would have a compressed size bigger than its uncompressed size)

Edit: few edits as I've re-read/tweaked the above
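
(A quick check of that note about compressed sizes - raw deflate emits incompressible input as stored blocks with a few bytes of overhead each, so random data grows slightly:)

import os
import zlib

data = os.urandom(1024 * 1024)  # incompressible random bytes
compressobj = zlib.compressobj(wbits=-zlib.MAX_WBITS, level=9)
compressed = compressobj.compress(data) + compressobj.flush()
print(len(compressed) > len(data))  # typically True: block overhead makes it bigger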

@michalc (Member) commented May 26, 2023

(Shameless plug: https://github.com/uktrade/stream-unzip does unzip those Java zip64 files in a streaming way)

@michalc (Member) commented May 26, 2023

Ah also - I posted on Stack Overflow and Mark Adler (co-author of the zlib library that stream-zip uses under the hood) replied https://stackoverflow.com/a/76342264/1319998. Maybe there is something on the HTTP level that would stop Safari from unzipping the files, you could throw ZIP_64 on them all, and problem solved?

@michalc (Member) commented May 26, 2023

Maybe there is something on the HTTP level that would stop Safari from unzipping the files, you could throw ZIP_64 on them all, and problem solved?

Although I'm trying a few things like different content-type headers, and so far, no cigar

@michalc (Member) commented May 27, 2023

Am re-reading some of the above (some of which I don't think I really read properly before... apologies), and I am getting more convinced on a few points...

I considered an API where you pass each member's filesize in for each member, instead of the zip mode, and stream-zip uses it to choose the appropriate zip mode for each member. That would be backward compatible and allow a simpler API for callers who already know the filesize.

This is ringing in my head a bit now. But there is a bit of a problem:

(Note - as mentioned above, it is possible that the compressed size is bigger than uncompressed size in a few cases. A very small file for example, e.g. 0 bytes. But also a large file that is entirely random has a good chance of this. Or, a large file that uses level=0 in construction of the zlib object I suspect too would have a compressed size bigger than its uncompressed size)

and as you said:

However, there is also the problem of the second case I mentioned, where every file is ZIP_32 but the total size of the zip file is >4GiB. In that case stream-zip needs to write the central directory as zip64 even though all members are zip32. So I think a per-zipfile option is better, because it allows us to solve that problem.

So I am now leaning towards accepting an uncompressed size for each member, which puts the whole ZIP into auto mode - so "it might need ZIP_64 support". But we have to be very particular - we would have to find the maximum overhead of compression, so the threshold for bumping a particular file from zip32 to zip64 would have to take that overhead into consideration? And also, if any file's offset requires a zip64 directory, then one will be added (as you do in this PR, I think?)

But - if this API is made by passing a size instead of a mode, then we have a problem (/opportunity?) - NO_COMPRESSION_64 and NO_COMPRESSION_32. What to do about files with known sizes that we don't want compressed? On the opportunity side - it would be great to be able to use a known size for such files - because right now the NO_COMPRESSION modes result in buffering in memory to find the sizes, because they have to go before the data in the zip files. I guess I'm thinking a backwards-incompatible API change there...

(But, for your particular Safari case, compared to ZIP_32 first if possible, ZIP_64 on all other files, it wouldn't give a meaningful difference, just a nicer API?)

@michalc (Member) commented May 27, 2023

I guess I'm thinking a backwards-incompatible API change there...

Actually, doesn't have to be... the "mode" option could take all of these:

ZIP_32
ZIP_32(uncompressed_size)
ZIP_64
ZIP_64(uncompressed_size)
ZIP_AUTO(uncompressed_size)
NO_COMPRESSION_32
NO_COMPRESSION_32(uncompressed_size)
NO_COMPRESSION_64
NO_COMPRESSION_64(uncompressed_size)
NO_COMPRESSION_AUTO(uncompressed_size)

(I'm a bit inspired by SQLAlchemy here - data types in column definitions can be a "type" e.g. VARCHAR or something that looks like an instance of a "type", e.g. VARCHAR(15))

@michalc (Member) commented May 27, 2023

because right now the NO_COMPRESSION modes result in buffering in memory to find the sizes

Gah - not just the sizes, but also the CRC32. So I guess it would have to be:

ZIP_32
ZIP_32(uncompressed_size)
ZIP_64
ZIP_64(uncompressed_size)
ZIP_AUTO(uncompressed_size)
NO_COMPRESSION_32
NO_COMPRESSION_32(uncompressed_size, crc32)
NO_COMPRESSION_64
NO_COMPRESSION_64(uncompressed_size, crc32)
NO_COMPRESSION_AUTO(uncompressed_size, crc32)

@michalc (Member) commented May 27, 2023

Ah - so some of those modes I don't think make sense (non-auto + compression + known size). So the sets of modes would be:

ZIP_32
ZIP_64
ZIP_AUTO(uncompressed_size)
NO_COMPRESSION_32
NO_COMPRESSION_32(uncompressed_size, crc32)
NO_COMPRESSION_64
NO_COMPRESSION_64(uncompressed_size, crc32)
NO_COMPRESSION_AUTO(uncompressed_size, crc32)

@michalc (Member) commented May 27, 2023

Aha - found https://stackoverflow.com/a/23578269/1319998 that states* that for $n$ input bytes, the maximum compressed size of a deflate-compressed file** is

$$n + 5\left(\left\lfloor \frac{n}{16383} \right\rfloor + 1\right)$$

So for the ZIP_AUTO mode, if the above expression is bigger than the zip32 limit, it will bump that file to zip64 (and ensure the directory is zip64). This should never result in an overflow (well, other than due to hitting zip64 limits), and is a reasonable best effort in terms of only requiring zip64 for files that need it. The only way to do better would be to pre-compute the compressed sizes.

*Interestingly Mark Adler's answer

**I think this is if doing "usual" deflate - there could be ways of making deflate "worse", but I suspect we don't do that here
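
(As a function - a direct transcription of the bound above, with the 16383 block size from that answer as the assumption:)

def max_deflate_size(n, block_size=16383):
    # Worst case for deflate: incompressible input is emitted as stored
    # blocks, each adding 5 bytes of header
    return n + 5 * (n // block_size + 1)

print(max_deflate_size(0xffffffff))  # 4296278100 - over the zip32 limit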

@michalc (Member) commented May 27, 2023

Actually, just realised we can probably just do this ahead of time to find the zip32 limit in terms of uncompressed size.

And have done this - the max uncompressed "auto" limit for zip32 I think would be 4,293,656,890 bytes, about a megabyte less than the zip32 limit of 4,294,967,295 bytes.

The no-compression auto mode, however, would use the "full" zip32 limit, since there is no compression overhead.

ZIP_32
ZIP_64
ZIP_AUTO(uncompressed_size): if uncompressed_size <= 4,293,656,890 then ZIP_32, ZIP_64 otherwise
NO_COMPRESSION_32
NO_COMPRESSION_32(uncompressed_size, crc32)
NO_COMPRESSION_64
NO_COMPRESSION_64(uncompressed_size, crc32)
NO_COMPRESSION_AUTO(uncompressed_size, crc32): if uncompressed_size <= 4,294,967,295 then NO_COMPRESSION_32, NO_COMPRESSION_64 otherwise

@michalc (Member) commented May 27, 2023

Ah - which might also be helpful in your case, right now with the current version of stream-zip? If you have any file with an uncompressed_size <= 4,293,656,890, then you can safely put it as ZIP_32 up front as the first file?

@michalc (Member) commented May 27, 2023

(Er... this is now my stream of consciousness... not expecting replies!)

Hmmm... I am starting to suspect it's more complex actually (of course, this is about zip files). So this formula

$$n + 5\left(\left\lfloor \frac{n}{16383} \right\rfloor + 1\right)$$

is based on a minimum block size of 16k... but I wonder if that's not strictly true. I don't fully understand, but I think that the default zlib.compressobj(wbits=-zlib.MAX_WBITS, level=9) in stream-zip gives a block size of 32k (from looking at https://stackoverflow.com/a/37213079/1319998, another Mark Adler answer). So maybe the limit does need to be dynamic.

So I now suspect that the default safe limit for zip32 is 4,294,312,030 bytes...

@michalc (Member) commented May 27, 2023

So... I think the block size, and so the safe zip32 limit, is related to the level argument in the zlib object. Slightly unfortunately, I don't think you can get it back once it's set. Maybe the auto mode should ignore the "global" zlib object, and set its own.

ZIP_32
ZIP_64
ZIP_AUTO(uncompressed_size, wbits=-zlib.MAX_WBITS, level=9): if uncompressed_size <= safe for this level then ZIP_32, ZIP_64 otherwise
NO_COMPRESSION_32
NO_COMPRESSION_32(uncompressed_size, crc32)
NO_COMPRESSION_64
NO_COMPRESSION_64(uncompressed_size, crc32)
NO_COMPRESSION_AUTO(uncompressed_size, crc32): if uncompressed_size <= 4,294,967,295 then NO_COMPRESSION_32, NO_COMPRESSION_64 otherwise

@craigds (Author) commented May 28, 2023

So... I think the block size, and so the safe zip32 limit, is related to the level argument in the zlib object

https://stackoverflow.com/questions/37176508/how-does-deflatezlib-determine-block-size suggests it is related to the memLevel and wbits parameters instead. I haven't quite grokked the maths involved, to be honest. But there are only 9 memlevels - probably we can compute a single constant for the worst-case situation and just use that as the zip32 limit when in auto mode? I'm assuming it won't be that far off 4GiB.

NO_COMPRESSION_32(uncompressed_size, crc32)

It probably wouldn't hurt to break things into a few different parameters rather than one flag thing containing a bunch of different parameters. e.g. in this PR I split NO_COMPRESSION_32 into bitfield flags NO_COMPRESSION and ZIP_32, which makes some of the codepaths cleaner and feels generally more elegant. If you're planning to change the API then the tuple could even turn into a simple class instance (ZipMember?), which would make future extensibility easier
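
(Illustrating the bitfield split mentioned above - hypothetical flag values, not the actual ones from the PR:)

# Hypothetical flag values, for illustration only
NO_COMPRESSION = 1 << 0
ZIP_32 = 1 << 1

NO_COMPRESSION_32 = NO_COMPRESSION | ZIP_32  # the old combined mode

def describe(mode):
    # Codepaths can now test each property independently
    return bool(mode & NO_COMPRESSION), bool(mode & ZIP_32)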

No one really uses crc32 for anything much these days, so it seems unlikely a caller has the crc32 for their files lying around. So if stream-zip needs it, maybe it's fine to buffer the file and compute it.

(We switched to a compressobj with level=0 for our already-compressed files to avoid that problem)

@michalc (Member) commented May 28, 2023

So far, this is what I think the various limits are on the uncompressed size for ZIP_AUTO to allow the file to be zip32:

mem_level: 1 , limit: 4132279136
mem_level: 2 , limit: 4212371770
mem_level: 3 , limit: 4253349390
mem_level: 4 , limit: 4274077375
mem_level: 5 , limit: 4284501975
mem_level: 6 , limit: 4289729530
mem_level: 7 , limit: 4292347130
mem_level: 8 , limit: 4293656890
mem_level: 9 , limit: 4294312010

calculated by

zip_32_limit = 0xffffffff

def zip32_limit_for(mem_level): 
    block_size = (1 << (mem_level + 6)) - 1

    for uncompressed_size in range(zip_32_limit, 0, -1):
        max_compressed_size = uncompressed_size + 5*(uncompressed_size//block_size + 1)
        if max_compressed_size <= zip_32_limit:
            return uncompressed_size

for l in range(1, 10):
    uncompressed_limit = zip32_limit_for(l)
    print('mem_level:', l, ', limit:', uncompressed_limit)

@michalc (Member) commented May 28, 2023

I'm tinkering with what such a ZIP_AUTO mode would look like, and wondering if something like this would work:

def ZIP_AUTO(uncompressed_size, **compressobj_args):
    mem_level_limits = {
        1: 4132279136,
        2: 4212371770,
        3: 4253349390,
        4: 4274077375,
        5: 4284501975,
        6: 4289729530,
        7: 4292347130,
        8: 4293656890,
        9: 4294312010,
    }

    def method_and_compressobj():
        mem_level = compressobj_args.get('mem_level', zlib.DEF_MEM_LEVEL)
        limit = mem_level_limits[mem_level]
        method = ZIP_32 if uncompressed_size <= limit else ZIP_64
        return (method, lambda: zlib.compressobj(**compressobj_args))
    return method_and_compressobj

Then at the top of the loop on member files at

for name, modified_at, perms, method, chunks in files:

have

file_method, file_get_compress_obj = \
    method() if callable(method) else \
    (method, get_compressobj)

Then pass through file_method and file_get_compress_obj to everything that needs it.

@michalc (Member) commented May 28, 2023

Ah so it's probably a bit more complicated - I keep forgetting about the offset overflow. So something like:

def ZIP_AUTO(uncompressed_size, **compressobj_args):
    mem_level_limits = {
        1: 4132279136,
        2: 4212371770,
        3: 4253349390,
        4: 4274077375,
        5: 4284501975,
        6: 4289729530,
        7: 4292347130,
        8: 4293656890,
        9: 4294312010,
    }

    def method_compressobj_dynamic_zip64_central_directory():
        mem_level = compressobj_args.get('mem_level', zlib.DEF_MEM_LEVEL)
        limit = mem_level_limits[mem_level]
        method = ZIP_32 if uncompressed_size <= limit else ZIP_64
        return (method, lambda: zlib.compressobj(**compressobj_args), True)
    return method_compressobj_dynamic_zip64_central_directory

which would be used like:

file_method, file_get_compress_obj, dynamic_zip64_central_directory = \
    method() if callable(method) else \
    (method, get_compressobj, False)

and I guess if any of the dynamic_zip64_central_directory is True, it would allow subsequent zip32 files at zip64 offsets and output a zip64 central directory only if necessary?

@michalc (Member) commented May 28, 2023

and I guess if any of the dynamic_zip64_central_directory is True, it would allow subsequent zip32 files at zip64 offsets and output a zip64 central directory only if necessary?

Or maybe better for dynamic_zip64_central_directory to be as "local" as possible - it only affects the current file in terms of offset overflow, and the central directory in terms of output it.

Or... I'm now wondering if it's even fine for zip32 files to be at zip64 offsets? Would zip64 clients even support that? Maybe it's safer for ZIP_AUTO to treat all files that are beyond a zip32 offset as zip64, even if it's a small file.

@michalc (Member) commented May 28, 2023

(I think I might have been getting confused on how the central directory works for zip64... I was vaguely remembering rather than looking at the code)

Or... I'm now wondering if it's even fine for zip32 files to be at zip64 offsets? Would zip64 clients even support that? Maybe it's safer for ZIP_AUTO to treat all files that are beyond a zip32 offset as zip64, even if it's a small file.

Actually even if it's fine, it would require more of a code change I suspect - the central directory entry for each file is constructed along with the file itself.

So... I now suspect this. The "real" mode of a file depends not just on the size of the file, but also its offset. Then no need for any extra dynamic behaviour on the directory.

def ZIP_AUTO(uncompressed_size, **compressobj_args):
    mem_level_limits = {
        1: 4132279136,
        2: 4212371770,
        3: 4253349390,
        4: 4274077375,
        5: 4284501975,
        6: 4289729530,
        7: 4292347130,
        8: 4293656890,
        9: 4294312010,
    }

    def method_compressobj(offset):
        mem_level = compressobj_args.get('mem_level', zlib.DEF_MEM_LEVEL)
        limit = mem_level_limits[mem_level]
        method = ZIP_64 if uncompressed_size > limit or offset > 0xffffffff else ZIP_32
        return (method, lambda: zlib.compressobj(**compressobj_args))
    return method_compressobj

Used as:

file_method, file_get_compress_obj = \
    method(offset) if callable(method) else \
    (method, get_compressobj)

and other than maybe some variable renaming or passing through variables to the functions that need them, nothing much to change - it's mostly adding to the code/structure that exists.

@michalc (Member) commented May 28, 2023

Or... I'm now wondering if it's even fine for zip32 files to be at zip64 offsets?

Actually this wouldn't even make sense. There is no way for a zip32 file in the central directory to have a zip64 offset since in the central directory there is nowhere in the struct to store such a large number unless it's a zip64 file (I really was getting confused on how the central directory is structured...)

@michalc (Member) commented May 28, 2023

Also I was a bit in two minds about the "callable" thing being a mode - but I guess if the real mode of a file depends on something that can't be determined earlier, like its offset, then a callable actually fits fairly well?

@michalc (Member) commented May 28, 2023

It does feel slightly odd that modes are different types. Maybe they could always be callable:

# The "private" modes
_ZIP_32 = object()
_ZIP_64 = object()
_NO_COMPRESSION_32 = object()
_NO_COMPRESSION_64 = object()

# The user-facing modes
def NO_COMPRESSION_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_32, default_get_compressobj

def NO_COMPRESSION_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_64, default_get_compressobj

def ZIP_32(offset, default_get_compressobj):
    return _ZIP_32, default_get_compressobj

def ZIP_64(offset, default_get_compressobj):
    return _ZIP_64, default_get_compressobj

def ZIP_AUTO(uncompressed_size, **compressobj_args):
    mem_level_limits = {
        1: 4132279136,
        2: 4212371770,
        3: 4253349390,
        4: 4274077375,
        5: 4284501975,
        6: 4289729530,
        7: 4292347130,
        8: 4293656890,
        9: 4294312010,
    }

    def method_compressobj(offset, default_get_compressobj):
        mem_level = compressobj_args.get('mem_level', zlib.DEF_MEM_LEVEL)
        limit = mem_level_limits[mem_level]
        method = _ZIP_64 if uncompressed_size > limit or offset > 0xffffffff else _ZIP_32
        return (method, lambda: zlib.compressobj(**compressobj_args))
    return method_compressobj

Used when iterating over the member files as:

for name, modified_at, perms, method, chunks in files:
    file_method, file_get_compress_obj = method(offset, get_compressobj)

and then in the code when doing things dynamically based on mode, it uses the private/underscored mode.

@michalc (Member) commented May 28, 2023

On testing - to test if a file really is zip32, I've just added the ability to disable zip64 support in stream-unzip uktrade/stream-unzip#46

@michalc (Member) commented May 29, 2023

Interestingly libarchive has a max uncompressed size for zip32, but it's 0xff000000 == 4278190080, which is none of the numbers I posted earlier...

https://github.com/libarchive/libarchive/blob/1f3c62ebf4d492ac21d3099b3b064993100dd997/libarchive/archive_write_set_format_zip.c#LL71C53-L71C63

@michalc (Member) commented May 29, 2023

Looks like they got to this value from testing libarchive/libarchive@4dbc86d

@michalc (Member) commented May 29, 2023

Looks like they use deflateInit2 passing a memLevel of 8: https://github.com/libarchive/libarchive/blob/1f3c62ebf4d492ac21d3099b3b064993100dd997/libarchive/archive_write_set_format_zip.c#LL995C7-L995C19, so the limit should be able to be 4293656890 rather than 4278190080.
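
(Checking with the zip32_limit_for helper from a few comments up:)

print(zip32_limit_for(8))  # 4293656890
print(0xff000000)          # 4278190080 - libarchive's more conservative constant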

@michalc (Member) commented May 29, 2023

Ah - just realised the wbits parameter has to be passed the right thing by default:

# The "private" modes
_ZIP_32 = object()
_ZIP_64 = object()
_NO_COMPRESSION_32 = object()
_NO_COMPRESSION_64 = object()

# The user-facing modes
def NO_COMPRESSION_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_32, default_get_compressobj

def NO_COMPRESSION_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_64, default_get_compressobj

def ZIP_32(offset, default_get_compressobj):
    return _ZIP_32, default_get_compressobj

def ZIP_64(offset, default_get_compressobj):
    return _ZIP_64, default_get_compressobj

def ZIP_AUTO(uncompressed_size, level=9, mem_level=zlib.DEF_MEM_LEVEL, wbits=-zlib.MAX_WBITS, **other_compressobj_kwargs):
    mem_level_limits = {
        1: 4132279136,
        2: 4212371770,
        3: 4253349390,
        4: 4274077375,
        5: 4284501975,
        6: 4289729530,
        7: 4292347130,
        8: 4293656890,
        9: 4294312010,
    }

    def method_compressobj(offset, default_get_compressobj):
        limit = mem_level_limits[mem_level]
        method = _ZIP_64 if uncompressed_size > limit or offset > 0xffffffff else _ZIP_32
        return (method, lambda: zlib.compressobj(level=level, mem_level=mem_level, wbits=wbits, **other_compressobj_kwargs))
    return method_compressobj

@michalc (Member) commented May 29, 2023

Or... maybe we could just set mem_level to 9, and not even let it be configurable.

# The "private" modes
_ZIP_32 = object()
_ZIP_64 = object()
_NO_COMPRESSION_32 = object()
_NO_COMPRESSION_64 = object()

# The user-facing modes
def NO_COMPRESSION_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_32, default_get_compressobj

def NO_COMPRESSION_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_64, default_get_compressobj

def ZIP_32(offset, default_get_compressobj):
    return _ZIP_32, default_get_compressobj

def ZIP_64(offset, default_get_compressobj):
    return _ZIP_64, default_get_compressobj

def ZIP_AUTO(uncompressed_size, level=9):
    def method_compressobj(offset, default_get_compressobj):
        method = _ZIP_64 if uncompressed_size > 4294312010 or offset > 0xffffffff else _ZIP_32
        return (method, lambda: zlib.compressobj(level=level, mem_level=9, wbits=-zlib.MAX_WBITS))
    return method_compressobj

If a use case comes up for having it, or other settings, configurable, can always add them then.

@craigds (Author) commented May 30, 2023

I got a bit lost in all that - a fair bit of detail there

Perhaps you could open a different PR? I'd be happy to review it.

@michalc (Member) commented May 30, 2023

Sorry yes I got a bit carried away!

Done in #43

@michalc (Member) commented May 31, 2023

Interestingly the tests are failing. I think the 4294312010 limit is fine on Python 3.10 and 3.11, but not 3.9...

@michalc mentioned this pull request May 31, 2023
@michalc (Member) commented May 31, 2023

Strangely, I see a zlib change in 3.6 (https://docs.python.org/3/whatsnew/3.6.html#zlib):

The compress() and decompress() functions now accept keyword arguments.

But nothing leaps out in Python 3.10

@michalc (Member) commented May 31, 2023

Ah... so although I'm still not sure why the Python version difference exists, checking the current offset of the file and the size of the file isn't enough - the central directory needs to be zip64 if it starts at a big enough offset. So maybe some logic like in this PR is needed...

@michalc (Member) commented May 31, 2023

The extra strange thing is that it's level=0 that shows differences between Python versions - I would have thought that would be fairly deterministic, since improvements to compression shouldn't affect it...

@michalc (Member) commented May 31, 2023

I've posted at https://stackoverflow.com/questions/76371334/zlib-difference-in-size-for-level-0-between-python-3-9-and-3-10 to try to help figure out the Python version differences. Something's happening that I'm not understanding...

michalc added a commit that referenced this pull request Jun 3, 2023
This allows clients to not worry about which ZIP_32 or ZIP_64 mode they need,
as long as they have the uncompressed size of the file.

For now, see the included tests for usage.

Most discussion is in #42
michalc added a commit that referenced this pull request Jun 3, 2023
This allows clients to not worry about which ZIP_32 or ZIP_64 mode they need,
as long as they have the uncompressed size of the file.

Most discussion is in #42
@michalc (Member) commented Jun 4, 2023

Inspired by the discussions and code in this PR, a ZIP_AUTO mode for each file has now been released in https://github.com/uktrade/stream-zip/releases/tag/v0.0.62

The ZIP_AUTO mode requires the uncompressed size of the file, and it

  • Makes each member Zip64 if either its size or its offset from the beginning of the file requires it
  • Makes the central directory Zip64 if there is at least one Zip64 member, or the offset of the central directory requires it, or the number of member files requires it

Since it requires the uncompressed size of the file, it can't be used in all cases, but I believe it can be used in the case discussed here, since the uncompressed sizes of the files are available.
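
(A sketch of usage, assuming the signature discussed above - ZIP_AUTO taking the uncompressed size:)

from datetime import datetime
from stream_zip import ZIP_AUTO, stream_zip

def chunks():
    yield b'\0' * 65536

member_files = (
    # Known uncompressed size (e.g. from S3 metadata); ZIP_AUTO picks zip32/zip64
    ('my-file.bin', datetime.now(), 0o600, ZIP_AUTO(65536), chunks()),
)

with open('out.zip', 'wb') as f:
    for zipped_chunk in stream_zip(member_files):
        f.write(zipped_chunk)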

I am now wondering if a version that doesn't require the uncompressed size of the file could work though... a bit like in this PR. It might be as simple as allowing None as the uncompressed size in ZIP_AUTO, marking the file as ZIP_32 unless its offset from the start of the zip requires ZIP_64... It would result in an overflow error if the file is too big for ZIP_32 and starts earlier than 4GiB into the zip, but not otherwise.

@michalc (Member) commented Jun 6, 2023

It might be as simple as allowing None as the uncompressed size in ZIP_AUTO, marking the file as ZIP_32 unless its offset from the start of the zip requires ZIP_64

@craigds What do you think - something you would like to do?

@craigds (Author) commented Jun 6, 2023

I wouldn't need it, because we have the filesizes in advance - but I think it makes sense for this library, which mostly tries to avoid needing filesizes in advance.

It can't work perfectly, obviously, so probably documenting the caveat is the important bit.

@michalc (Member) commented Jun 6, 2023

Ah, understood! In that case, will close this PR - another one can be raised if/when there's a need for an auto mode without a known uncompressed size.

@michalc closed this Jun 6, 2023
@craigds deleted the upgrade-to-zip64 branch June 6, 2023