This is a duplicate of PR #937 opened August 15, 2017.
The code is clearly marked as `TODO: add correct UTF-8 support.`
From tracing so far:
It looks like one must be careful to check both a .zip file written on Windows and one written on
a Unix-like system. The name & comment encodings may differ.
It looks like the SN `java.util.zip` code makes some attempt to handle two- and three-byte UTF-8
but not four-byte. The code was probably written well before Java handled four-byte UTF-8.
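To make the four-byte case concrete, here is a minimal sketch using only standard `java.lang.String` conversions. The class name `FourByteUtf8Demo` and the sample file name are made up for illustration; they are not taken from the SN sources.

```java
import java.nio.charset.StandardCharsets;

final class FourByteUtf8Demo {
    // U+1F600 lies outside the Basic Multilingual Plane, so Java stores it
    // as a surrogate pair (two chars) and UTF-8 encodes it in four bytes.
    // The name itself is a hypothetical example.
    static final String NAME = new String(Character.toChars(0x1F600)) + ".txt";

    // Encode a name the way a zip writer would when the UTF-8 flag is set.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode raw name bytes back into a String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode(NAME);
        System.out.println("UTF-16 code units: " + NAME.length());
        System.out.println("UTF-8 bytes:       " + bytes.length);
        System.out.println("round trip ok:     " + decode(bytes).equals(NAME));
    }
}
```

A hand-rolled decoder that stops at three-byte sequences would mangle the first four bytes of this name, while the library conversion round-trips it.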
I suggest replacing that code with the current practice of using `StandardCharsets.UTF_8` when
the zip general purpose flag indicates UTF-8 output for names & comments.
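A rough sketch of that suggestion, not the actual SN change: pick the charset from bit 11 (`0x0800`) of the general purpose flag, which is the bit the zip format uses to mark UTF-8 names and comments. The names `ZipNameCharset`, `charsetFor` and `decodeName` are hypothetical, and the ISO-8859-1 fallback is an assumption standing in for CP 437.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ZipNameCharset {
    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    static final int UTF8_FLAG = 1 << 11;

    // Choose the charset for an entry from its general purpose flag.
    static Charset charsetFor(int generalPurposeFlag) {
        return (generalPurposeFlag & UTF8_FLAG) != 0
            ? StandardCharsets.UTF_8
            // Assumption: byte-preserving stand-in for CP 437, which is
            // not one of the charsets in StandardCharsets.
            : StandardCharsets.ISO_8859_1;
    }

    // Decode the raw name bytes of one entry.
    static String decodeName(byte[] rawName, int generalPurposeFlag) {
        return new String(rawName, charsetFor(generalPurposeFlag));
    }
}
```

With the flag set, `decodeName` yields the original Unicode name; with it clear, every byte becomes exactly one `char`, so nothing is lost even though non-ASCII characters may display differently than under CP 437.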
Current practice in the SN code seems to be to always write UTF-8, even if the general purpose flag
is clear, indicating that MS-DOS CP 437 should be used. Since CP 437 is also a single-byte encoding, the practice,
in some places in the code, of using Latin-1 (`iso-8859-1`) should give byte-compatible output, without the
possibility of tripping over Unicode surrogates, if pairs of CP 437 bytes can even express such.
The use of Latin-1 is a bit of sleight of hand which should be commented in the file. It is exceedingly hard
to trace and convince oneself that it is correct. A good exam or job-interview question for those
who thrive on `Charset`s. CP 437 is not one of the standard Java charsets. The sleight-of-hand
binary passthrough avoids some poor, benighted SN devo having to implement a custom
`Charset` provider for it.
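One way to convince oneself that the passthrough is byte-safe is to check that every possible byte value survives a decode/encode cycle through ISO-8859-1. This is only a sketch of that check; `Latin1Passthrough` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1Passthrough {
    // All 256 possible byte values, i.e. anything a CP 437 name could contain.
    static byte[] allByteValues() {
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            b[i] = (byte) i;
        }
        return b;
    }

    // True if decoding and re-encoding as ISO-8859-1 returns the same bytes
    // and maps each byte to exactly one char.
    static boolean roundTrips(byte[] raw) {
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        return s.length() == raw.length && Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(allByteValues()));
    }
}
```

The bytes are preserved, but the characters are not: a byte above `0x7F` shows up in the `String` as its Latin-1 character rather than its CP 437 one, which is why the trick deserves a comment in the source.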
Sometime in 2006 or so, the info-zip specification was updated to allow UTF-8 file names and comments.
Currently the `java.util.zip` classes are, unconditionally, reading & writing something akin to Latin-1. Java uses
UTF-16 for characters, which covers a superset of Latin-1. That means that there are names and comments which
can be expressed in Java but not in `java.util.zip`.
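As an illustration of that gap, the sketch below asks a Latin-1 encoder whether a name fits, and shows what a Latin-1 round trip does to one that does not. The class and method names are hypothetical, and the sample names are invented.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Limits {
    // True if every character of the name has a Latin-1 encoding.
    static boolean fitsLatin1(String name) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(name);
    }

    // Encode and decode as Latin-1. String.getBytes substitutes the
    // charset's replacement byte ('?') for characters it cannot map.
    static String lossyRoundTrip(String name) {
        byte[] bytes = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ok = "caf\u00e9.txt";       // e-acute is in Latin-1
        String bad = "\u65e5\u672c.txt";   // two CJK characters, not in Latin-1
        System.out.println(fitsLatin1(ok) + " " + fitsLatin1(bad));
        System.out.println(lossyRoundTrip(bad));
    }
}
```

Two distinct names outside Latin-1 can collapse to the same run of `?` characters, which matters once entry names are used as keys.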
Where this may become more than a theoretical gee-whiz concern is with efforts to implement the `jdk.zipfs`
file system.
…3814)
* Fix #3798, #3786: Implement UTF-8 support in java.util.zip classes
* Supply the missing reference .zip
* javalib `java.util.zip` classes now support writing and reading UTF-8 ("Unicode Transformation Format – 8-bit")
entry names and archive and entry comments.
* `java.util.zip.ZipOutputStream` now follows the JVM practice of not throwing an Exception if zero entries
are written. Former behavior was sensible, but not the JVM way.
* both now use standard `java.lang.String` methods to do Charset conversions. In particular, this
should now handle 4-byte UTF-8 codepoints.
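The behaviour described in these notes can be exercised with a small round trip. This sketch targets the JVM API (the `Charset`-taking constructors exist there since Java 7); whether the SN javalib exposes exactly the same constructors is an assumption. `ZipUtf8RoundTrip` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipUtf8RoundTrip {
    // Write an in-memory archive with one entry per name.
    // Passing no names produces an archive with zero entries.
    static byte[] writeArchive(String... names) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, StandardCharsets.UTF_8)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read back the name of the first entry, or null if there is none.
    static String firstEntryName(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(archive), StandardCharsets.UTF_8)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical name containing a four-byte UTF-8 code point (U+1F600).
        String name = new String(Character.toChars(0x1F600)) + ".txt";
        System.out.println(name.equals(firstEntryName(writeArchive(name))));
        System.out.println(writeArchive().length + " bytes for an empty archive");
    }
}
```

Calling `writeArchive()` with no names covers the zero-entry case: on a current JVM it completes without an exception and still emits the end-of-archive record.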