javalib Zip methods need to be able to read & write UTF-8 names and comments. #3796

LeeTibbert · 2024-02-29T02:32:55Z

This is a duplicate of PR #937 opened August 15, 2017.

The code is clearly marked as TODO: add correct UTF-8 support.

From tracing so far:

* It looks like one must be careful to check both a .zip file written on Windows and one written on
  a Unix-like system. The name & comment encodings may differ.
* It looks like the SN java.util.zip code makes some attempt to handle two- and three-byte UTF-8
  but not four-byte. The code was probably written well before Java handled four-byte UTF-8.
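To make the four-byte case concrete, here is a minimal sketch of decoding a four-byte UTF-8 sequence with the standard library. The class and method names (`FourByteUtf8Demo`, `decodeName`) are made up for illustration and are not taken from the SN sources.

```java
import java.nio.charset.StandardCharsets;

public class FourByteUtf8Demo {
    // U+1F600 encoded as four UTF-8 bytes (illustrative input only).
    static final byte[] GRINNING_FACE = {
        (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80
    };

    // Hypothetical helper: decode raw name bytes using the standard UTF-8 Charset.
    static String decodeName(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = decodeName(GRINNING_FACE);
        // One code point, stored in Java as a surrogate pair (two UTF-16 code units).
        System.out.println(name.length());                              // 2
        System.out.println(name.codePointCount(0, name.length()));      // 1
        System.out.println(Integer.toHexString(name.codePointAt(0)));   // 1f600
    }
}
```

A hand-rolled decoder that stops at three-byte sequences cannot produce that surrogate pair, which is the gap described above.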

I suggest replacing that code with the current practice of using StandardCharsets.UTF_8 when
the zip general purpose flag indicates UTF-8 output for names & comments.
Current practice in the SN code seems to be to always write UTF-8, even if the general purpose flag
is clear, indicating MS-DOS CP 437 should be used. Since CP 437 is also single byte, the practice, in some places
in the code, of using Latin-1 (ISO-8859-1) should give byte-compatible output, without the possibility of
tripping over Unicode surrogates, if pairs of CP 437 bytes can even express such.
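A minimal sketch of the suggested selection logic follows. It assumes the UTF-8 indicator is bit 11 (mask 0x0800) of the general purpose flag, as the ZIP format specification describes; the names `ZipNameCharset`, `charsetFor`, `decodeName`, and `encodeName` are hypothetical and do not come from the SN code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ZipNameCharset {
    // Assumption: bit 11 of the general purpose flag marks UTF-8 names & comments.
    static final int UTF8_FLAG = 1 << 11; // 0x0800

    // Pick UTF-8 when the flag is set; otherwise fall back to the
    // single-byte Latin-1 passthrough described in this issue.
    static Charset charsetFor(int generalPurposeFlags) {
        return (generalPurposeFlags & UTF8_FLAG) != 0
            ? StandardCharsets.UTF_8
            : StandardCharsets.ISO_8859_1;
    }

    static String decodeName(byte[] raw, int generalPurposeFlags) {
        return new String(raw, charsetFor(generalPurposeFlags));
    }

    static byte[] encodeName(String name, int generalPurposeFlags) {
        return name.getBytes(charsetFor(generalPurposeFlags));
    }
}
```

With the flag set, the bytes `C3 A9` decode to the single character U+00E9; with the flag clear, the same bytes decode to two Latin-1 characters and encode back to the same two bytes.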

The use of Latin-1 is a bit of sleight-of-hand which should be commented in the file. It is exceedingly hard
to trace and convince oneself that it is correct. A good exam or job-interview question for those
who thrive on Charsets. CP 437 is not one of the standard Java Charsets. The sleight-of-hand
binary passthrough avoids some poor, benighted SN devo having to implement a custom
Charset provider for it.
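One way to convince oneself about the passthrough is to check that every possible byte value survives a decode and re-encode through ISO-8859-1. This is only a sketch of that check; `Latin1Passthrough` and its methods are illustrative names, not SN code.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Passthrough {
    // All 256 byte values, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Decode as Latin-1, then encode as Latin-1 again.
    // Each byte maps to the char with the same numeric value, so nothing is lost.
    static byte[] roundTrip(byte[] raw) {
        String asChars = new String(raw, StandardCharsets.ISO_8859_1);
        return asChars.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True when every decoded char is below U+0100, so no surrogates can appear.
    static boolean staysSingleByte(byte[] raw) {
        String asChars = new String(raw, StandardCharsets.ISO_8859_1);
        return asChars.chars().allMatch(c -> c < 0x100);
    }
}
```

Note that only the bytes are preserved. The intermediate String does not hold the real CP 437 text: byte 0x82 is "é" in CP 437 but becomes the control character U+0082 under Latin-1.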

Sometime in 2006 or so, the info-zip specification was updated to allow UTF-8 file names and comments.

Currently the java.util.zip classes are, unconditionally, reading & writing something akin to Latin-1. Java uses UTF-16
for characters, which is a superset of Latin-1. That means that there are names and comments which
can be expressed in Java but not in java.util.zip.
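As a small illustration of that gap, the sketch below asks the standard ISO-8859-1 encoder whether a name fits, and shows what a lossy encode does. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Limits {
    // True when every character of the name has a Latin-1 encoding.
    static boolean fitsLatin1(String name) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(name);
    }

    // String.getBytes(Charset) substitutes the charset's replacement
    // byte ('?' for ISO-8859-1) for anything it cannot map.
    static byte[] lossyEncode(String name) {
        return name.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00E9.txt"));             // true
        System.out.println(fitsLatin1("\u65E5\u672C\u8A9E.txt"));    // false
        System.out.println((char) lossyEncode("\u65E5.txt")[0]);     // ?
    }
}
```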

Where this may become more than a theoretical gee-whiz concern is with efforts to implement the jdk.zipfs
file system.


…3814)

* Fix #3798, #3786: Implement UTF-8 support in java.util.zip classes
* Supply the missing reference .zip
* javalib `java.util.zip` classes now support writing and reading UTF-8 ("Unicode Transformation Format – 8-bit") entry names and archive and entry comments.
* `java.util.zip.ZipOutputStream` now follows the JVM practice of not throwing an Exception if zero entries are written. Former behavior was sensible, but not the JVM way.
* Both now use standard `java.lang.String` methods to do Charset conversions. In particular, this should now handle 4-byte UTF-8 code points.
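A round trip along the following lines is one way to exercise UTF-8 entry names, including a four-byte code point. This is a sketch against the standard `java.util.zip` API, not the test from the PR; `ZipUtf8RoundTrip` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipUtf8RoundTrip {
    // Write a one-entry archive into memory, encoding the name as UTF-8.
    static byte[] writeArchive(String entryName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read the archive back and return the first entry's name, or null if empty.
    static String readFirstEntryName(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(archive), StandardCharsets.UTF_8)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+2603 is a three-byte sequence; U+1F600 (a surrogate pair here) needs four bytes.
        String name = "snowman-\u2603-\uD83D\uDE00.txt";
        System.out.println(name.equals(readFirstEntryName(writeArchive(name))));
    }
}
```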

LeeTibbert mentioned this issue Mar 6, 2024

Fix #3796, #3786: Implement UTF-8 support in java.util.zip classes #3814

Merged

LeeTibbert added the component:javalib label Mar 7, 2024

LeeTibbert self-assigned this Mar 7, 2024

WojciechMazur closed this as completed in #3814 Mar 7, 2024
