
[bug] compression-level: 0 results in ZIP file that can't be streaming-unzipped #675

Open
jcflack opened this issue Feb 28, 2025 · 1 comment
Labels
bug Something isn't working

Comments


jcflack commented Feb 28, 2025

What happened?

#398 added a valuable feature to set the compression level. One important use is to be able to specify compression-level: 0 to avoid wasting computation for no benefit when the files being added to the zip are themselves already compressed.

However, upload-artifact with compression-level: 0 produces a zip archive that cannot be unzipped with a streaming implementation such as java.util.zip.ZipInputStream.

That means the entire archive must first be downloaded and then unzipped with a non-streaming, seekable-file implementation such as java.util.zip.ZipFile.

The streaming failure reported by Java's ZipInputStream is:

Exception java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor

What did you expect to happen?

I expected the zip archive created by upload-artifact to be unzippable equally well with a seekable-file implementation like ZipFile and with a streaming implementation like ZipInputStream.

How can we reproduce it?

Create a zip archive foo.zip with upload-artifact and compression-level: 0. Then attempt to list its contents as in this jshell script:

import java.io.FileInputStream;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

try ( var zf = new ZipFile("foo.zip") ) {
    zf.stream().forEach(System.out::println);
}

try ( var zs = new ZipInputStream(new FileInputStream("foo.zip")) ) {
    for ( var ze = zs.getNextEntry(); null != ze; ze = zs.getNextEntry() ) {
        System.out.println(ze);
        zs.closeEntry();
    }
}

Observe that the first try block, using ZipFile, successfully lists the archive, and the second try block, using ZipInputStream, throws the only DEFLATED entries can have EXT descriptor exception.

Retry using a zip file created locally with zip -c -0 on Linux and observe that both methods successfully list the archive.

Anything else we need to know?

There is a related report at bugs.openjdk.org that helps explain what is happening.

A zip archive has a "local header" preceding each member, and a "central directory" at the end, repeating some of the same information, such as the member's name, its original length, its length in the archive, and its CRC. It is also possible to have a "data descriptor" (or "EXT descriptor") record that follows a member.

The data descriptor is used when the utility creating the archive doesn't know those values at the time it starts writing the member. For example, when compressing, the size in the archive won't be known until the member has been written. For that case, bit 3 (0x08) in the "local header" will be set, telling any consumer that the values in the local header aren't set, and to go ahead and read the member to the end and get the values from the data descriptor that follows it.

That works for any compression level other than 0, because the DEFLATE compressed-data format has enough structure that the consumer can know where the end is, and find the data descriptor after that. But at compression level 0, there is no structure, just arbitrary bytes of original data, which could include bytes that look like a data descriptor record. A streaming consumer has no way to know where a compression-0 member ends, unless its length was in the "local header" at the front.

A seekable-file consumer can still succeed, because the member lengths are all repeated in the "central directory" at the end of the archive, which the consumer knows how to find.

But to make the archive usable by a streaming consumer, any producer writing compression-0 members must put the correct values in the local header preceding each member, and must not defer them to a data descriptor after the member, because a streaming consumer has no dependable way to recover that descriptor. (With no compression, the producer already knows the archived length in advance: it is simply the original size of the file.)
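
Java's own ZipOutputStream illustrates that discipline: it requires the size, compressed size, and CRC to be supplied before a STORED entry is written, so the values land in the local header and the resulting archive can be read by a streaming consumer. A minimal jshell sketch (the file names member.bin and stored.zip are only illustrative):

import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Write one stored (compression-0) member with the size, compressed size
// and CRC computed up front, so no data descriptor is needed.
byte[] data = Files.readAllBytes(Path.of("member.bin"));
var crc = new CRC32();
crc.update(data);

try ( var zos = new ZipOutputStream(new FileOutputStream("stored.zip")) ) {
    var entry = new ZipEntry("member.bin");
    entry.setMethod(ZipEntry.STORED);
    entry.setSize(data.length);
    entry.setCompressedSize(data.length);   // stored: archived size equals original size
    entry.setCrc(crc.getValue());
    zos.putNextEntry(entry);                // throws if size/CRC were not set for a STORED entry
    zos.write(data);
    zos.closeEntry();
}

An archive written this way lists cleanly with both ZipFile and ZipInputStream.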

The zip-writing implementation being used by upload-artifact is setting bit 3 in the local header and putting data descriptors after the members, even for compression-level 0.
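
That can be confirmed directly from the raw bytes of the downloaded artifact: in a local file header, the general-purpose flag word is at offset 6 and the compression method at offset 8, both little-endian. A minimal jshell sketch (assuming the archive is saved as foo.zip, as in the reproduction above):

import java.io.DataInputStream;
import java.io.FileInputStream;

// Dump the general-purpose flags (offset 6) and compression method
// (offset 8) from the first local file header of the archive.
try ( var in = new DataInputStream(new FileInputStream("foo.zip")) ) {
    byte[] header = new byte[30];   // fixed-size part of a local file header
    in.readFully(header);
    int flags  = (header[6] & 0xff) | ((header[7] & 0xff) << 8);
    int method = (header[8] & 0xff) | ((header[9] & 0xff) << 8);
    System.out.printf("method=%d flags=0x%04x bit3(data descriptor)=%b%n",
                      method, flags, (flags & 0x8) != 0);
}

A streaming-friendly compression-0 archive should print method=0 with bit 3 clear.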

As a workaround where compression-level: 0 would be ideal (such as archiving already-compressed files), compression-level: 1 can be used instead. It wastes as little compression effort as possible while still producing a zip file with the structure needed by both seekable-file and streaming consumers.

If it is not easy to fix the behavior, then the documentation should warn about it and suggest the workaround.

What version of the action are you using?

v4

What are your runner environments?

linux

Are you on GitHub Enterprise Server? If so, what version?

No response

@jcflack jcflack added the bug Something isn't working label Feb 28, 2025

jcflack commented Feb 28, 2025

"bit 3 (0x08) in the local header" => "bit 3 (0x08) in the local header general-flags field"
