javalib Zip methods need to be able to read & write UTF-8 names and comments. #3796

Closed
LeeTibbert opened this issue Feb 29, 2024 · 0 comments · Fixed by #3814

LeeTibbert commented Feb 29, 2024

This is a duplicate of PR #937 opened August 15, 2017.

The code is clearly marked as TODO: add correct UTF-8 support.

From tracing so far:

  • it looks like one must be careful to check both a .zip file written on Windows and one written on a
    Unix-like system. The name & comment encodings may differ.

  • it looks like the SN java.util.zip code makes some attempt to handle two- and three-byte UTF-8
    but not four-byte. The code was probably written well before Java handled four-byte UTF-8.

    I suggest replacing that code with the current practice of using StandardCharsets.UTF_8 when
    the zip general purpose flag indicates UTF-8 output for names & comments.

  • Current practice in the SN code seems to be to always write UTF-8, even if the general purpose flag
    is clear, indicating MS-DOS CP 437 should be used. Since CP 437 is also single byte, the practice, in some
    places in the code, of using Latin-1 (ISO-8859-1) should give byte-compatible output, without the possibility
    of tripping over Unicode surrogate pairs, if CP 437 can even express such.

    The use of Latin-1 is a bit of sleight-of-hand which should be commented in the file. It is exceedingly hard
    to trace and convince oneself that it is correct. A good exam or job-interview question for those
    who thrive on Charsets. CP 437 is not one of the standard Java Charsets. The sleight-of-hand
    binary passthrough avoids some poor, benighted SN dev having to implement a custom
    Charset provider for it.
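
A minimal sketch of the approach suggested in the bullets above, written against the JVM API and not taken from the SN sources. The class `ZipNameCodec` and its method names are hypothetical. It assumes the UTF-8 indicator is bit 11 of the general purpose bit flag (the "language encoding flag" in the zip format specification); when that bit is clear it falls back to the Latin-1 byte passthrough described above, which preserves the bytes but does not render CP 437 glyphs correctly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not the SN implementation.
final class ZipNameCodec {
    // Bit 11 of the general purpose bit flag: names & comments are UTF-8.
    static final int UTF8_FLAG = 1 << 11; // 0x0800

    // Pick UTF-8 when the flag is set, otherwise the Latin-1 passthrough.
    static Charset charsetFor(int generalPurposeFlags) {
        return (generalPurposeFlags & UTF8_FLAG) != 0
            ? StandardCharsets.UTF_8
            : StandardCharsets.ISO_8859_1;
    }

    // Raw name or comment bytes from the archive to a Java String.
    static String decode(byte[] raw, int generalPurposeFlags) {
        return new String(raw, charsetFor(generalPurposeFlags));
    }

    // Java String back to the bytes to be written into the archive.
    static byte[] encode(String text, int generalPurposeFlags) {
        return text.getBytes(charsetFor(generalPurposeFlags));
    }
}
```

With the flag clear, every byte value 0x00 to 0xFF maps to exactly one `char` and back, so `encode(decode(raw, 0), 0)` returns the original bytes. With the flag set, the standard decoder handles four-byte sequences, producing a surrogate pair in the resulting String.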


Sometime in 2006 or so, the info-zip specification was updated to allow UTF-8 file names and comments.

Currently the java.util.zip classes are, unconditionally, reading & writing something akin to Latin-1. Java uses UTF-16
for characters, which can represent every Latin-1 character and many more. That means that there are names and comments which
can be expressed in Java but not in java.util.zip.
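
To make the gap concrete, here is a small JVM-side sketch of what happens when a name outside Latin-1 is pushed through a Latin-1 encoder. The class `Latin1Limits` and its methods are hypothetical and only illustrate standard `java.nio.charset` behavior; they are not the SN code path.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Latin1Limits {
    // True when every character of the name has a Latin-1 encoding.
    static boolean fitsLatin1(String name) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(name);
    }

    // Encode to Latin-1 bytes and decode again, as a Latin-1-only
    // zip writer and reader pair would effectively do.
    static String roundTripLatin1(String name) {
        byte[] raw = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

A name such as `"caf\u00e9.txt"` survives the round trip unchanged, while `"\u65e5\u672c.txt"` comes back as `"??.txt"`, because `String.getBytes` substitutes the charset's replacement byte for each unmappable character.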

Where this may become more than a theoretical gee-whiz concern is with efforts to implement the jdk.zipfs
file system.

@LeeTibbert LeeTibbert self-assigned this Mar 7, 2024
WojciechMazur pushed a commit that referenced this issue Mar 7, 2024
…3814)

* Fix #3798, #3786: Implement UTF-8 support in java.util.zip classes
* Supply the missing reference .zip
* javalib `java.util.zip` classes now support writing and reading UTF-8 ("Unicode Transformation Format – 8-bit")
   entry names and archive and entry comments.
* `java.util.zip.ZipOutputStream` now follows the JVM practice of not throwing an Exception if zero entries
    are written. The former behavior was sensible, but not the JVM way.
* both now use standard `java.lang.String` methods to do Charset conversions.  In particular, this
   should now handle 4-byte UTF-8 codepoints.
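
The behavior listed above can be exercised with a round trip like the following sketch. It uses only the standard `java.util.zip` API as it behaves on the JVM; whether SN matches in every detail is an assumption to verify, not something this example proves. The class `Utf8ZipRoundTrip` and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class Utf8ZipRoundTrip {
    // Write an in-memory archive with one entry and an archive comment.
    static byte[] writeOneEntry(String name, String comment) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos =
                 new ZipOutputStream(bytes, StandardCharsets.UTF_8)) {
            zos.setComment(comment);
            zos.putNextEntry(new ZipEntry(name));
            zos.write(1); // one byte of entry data
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read back the name of the first entry, or null if there is none.
    static String firstEntryName(byte[] zip) {
        try (ZipInputStream zis = new ZipInputStream(
                 new ByteArrayInputStream(zip), StandardCharsets.UTF_8)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Close a stream without writing any entry; should not throw.
    static byte[] writeEmpty() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos =
                 new ZipOutputStream(bytes, StandardCharsets.UTF_8)) {
            // intentionally no entries
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A name containing a code point outside the Basic Multilingual Plane, for example `"\uD83D\uDE00-\u65e5\u672c.txt"`, is a useful test input because its first character needs four bytes in UTF-8. `ZipInputStream` does not expose archive or entry comments, so checking those would need `java.util.zip.ZipFile` and a file on disk.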