UTF8 updates #334
Fixed toString cutoff
The |
JohnReppy
left a comment
I think these constants are useful for documentation purposes (and max1Byte is used).
JohnReppy
left a comment
Perhaps there should be a "skip" strategy that resynchronizes on the next valid UTF-8 sequence?
That could be a good option, though I don’t know how useful such a decoding option would be. I know Unicode doesn’t recommend ignoring errors because you can get unexpected string collisions. Another useful decoding strategy is Python's surrogateescape, which turns an invalid byte b into the lone surrogate 0xDC00 + b (the range U+DC80 to U+DCFF), but I think that would require reworking the iterator in an incompatible way. It could be added later, since it makes the most sense in combination with a corresponding encoding strategy.
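For reference, the surrogateescape mapping mentioned above can be sketched in a few lines. This is only an illustration using the SML Basis Library: the names escapeByte and unescapeByte are hypothetical and are not part of the UTF8 structure or of this PR. It assumes the PEP 383 convention that each undecodable byte in 0x80..0xFF maps to U+DC80..U+DCFF, that is, U+DC00 plus the byte value.

```sml
(* Hypothetical helpers illustrating the surrogateescape mapping (PEP 383).
 * Not part of the UTF8 structure discussed in this PR. *)

(* Map an undecodable byte to a lone low surrogate.  Bytes below 0x80 are
 * plain ASCII, are always decodable, and are therefore never escaped. *)
fun escapeByte (b : Word8.word) : word option =
      if b >= 0wx80
        then SOME (0wxDC00 + Word.fromInt (Word8.toInt b))
        else NONE

(* Inverse mapping for the encoding side: recover the original byte from
 * a code point in U+DC80..U+DCFF. *)
fun unescapeByte (cp : word) : Word8.word option =
      if 0wxDC80 <= cp andalso cp <= 0wxDCFF
        then SOME (Word8.fromInt (Word.toInt (cp - 0wxDC00)))
        else NONE
```

The pairing of the two functions is why the strategy makes the most sense together with a matching encoder: escapeByte alone produces code points that a strict UTF-8 encoder would have to reject.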
Unfortunately the lexical specification of SML string escapes makes So the result of |
9fe217b to 9f0cfbd
I believe that the |
Description
Fixed cutoff in Unicode character printing
The current implementation uses max2Byte (0wxdf), when it should use 0wxFFFF. This function handles code points, not their UTF-8 encodings.
Changed 8-character Unicode character printing
The current implementation uses \uXXXX for 4 hex digits and \uXXXXXXXX for 8 hex digits. Although these aren't SML language features, we should follow C's convention of using a capital U for 8-digit escapes to avoid confusion.
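The two printing changes above can be illustrated with a minimal sketch. The function name escape is hypothetical and this is not the library's actual toString (which presumably also prints ordinary characters directly); it only shows the choice of escape form by code-point value, using Basis Library calls.

```sml
(* Hypothetical helper, not the library's toString: format a code point
 * as an escape, choosing the form by the code point's value rather than
 * by the length of its UTF-8 encoding. *)
fun escape (cp : word) : string =
      let
        (* upper-case hex digits, left-padded with zeros to `width` *)
        fun hex width =
              StringCvt.padLeft #"0" width (Word.fmt StringCvt.HEX cp)
      in
        if cp <= 0wxFFFF
          then "\\u" ^ hex 4   (* fits in four hex digits *)
          else "\\U" ^ hex 8   (* capital U marks the eight-digit form *)
      end
```

With this scheme, String.concat (List.map escape [0wx10000, 0wxDF]) yields the text \U00010000\u00DF, so a reader can tell where each escape ends even when it is followed by digit characters.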
Decoding strategies
UTF8.getu is great for walking over valid Unicode strings, but sometimes it's useful to recover from encoding errors. The DECODE_REPLACE decoding strategy can be provided to getuWith to replace these invalid bytes with REPLACEMENT CHARACTER (U+FFFD).
Motivation and Context
Printing a UTF-8 string using String.concat (List.map UTF8.toString str) is confusing, as "\\u00010000\\u00DF0000" represents 6 code points (U+10000, U+00DF, U+0030 x 4). The change also aligns with other programming languages.
UTF8.getu should be enhanced to allow for graceful handling of invalid inputs. The current implementation does not allow for this, as simply skipping a byte after an exception does not follow best practices for generating replacement characters.
How Has This Been Tested?
The DECODE_REPLACE implementation has been tested on the test cases in the Unicode recommendations page.
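To make the replacement behavior concrete, here is a minimal standalone sketch of a decoder that substitutes U+FFFD for ill-formed input. It is not the PR's DECODE_REPLACE implementation and does not use getuWith, whose signature is not shown here; the names decode and fromBytes are hypothetical. It assumes the approach of tracking the allowed range of the next continuation byte, which is intended to follow the Unicode Standard's "substitution of maximal subparts" practice.

```sml
val replacement : word = 0wxFFFD

(* Hypothetical helper: build a string from raw byte values. *)
fun fromBytes (bs : int list) : string = String.implode (List.map Char.chr bs)

(* Decode a byte string into code points, emitting U+FFFD for each
 * maximal ill-formed subsequence. *)
fun decode (s : string) : word list =
      let
        val bytes = List.map (Word.fromInt o Char.ord) (String.explode s)
        (* start: no multi-byte sequence is in progress *)
        fun start ([], acc : word list) = List.rev acc
          | start ((b : word) :: rest, acc) =
              if b <= 0wx7F then start (rest, b :: acc)
              else if 0wxC2 <= b andalso b <= 0wxDF
                then cont (rest, acc, Word.andb (b, 0wx1F), 1, 0wx80, 0wxBF)
              else if 0wxE0 <= b andalso b <= 0wxEF
                then cont (rest, acc, Word.andb (b, 0wx0F), 2,
                           if b = 0wxE0 then 0wxA0 else 0wx80,   (* no overlongs *)
                           if b = 0wxED then 0wx9F else 0wxBF)   (* no surrogates *)
              else if 0wxF0 <= b andalso b <= 0wxF4
                then cont (rest, acc, Word.andb (b, 0wx07), 3,
                           if b = 0wxF0 then 0wx90 else 0wx80,   (* no overlongs *)
                           if b = 0wxF4 then 0wx8F else 0wxBF)   (* max U+10FFFF *)
              else start (rest, replacement :: acc)   (* invalid lead byte *)
        (* cont: `needed` continuation bytes remain; the next one must lie
         * in [lo, hi].  Input ending mid-sequence yields one U+FFFD. *)
        and cont ([], acc, _, _, _, _) = List.rev (replacement :: acc)
          | cont (bs as (b :: rest), acc, cp : word, needed : int,
                  lo : word, hi : word) =
              if b < lo orelse hi < b
                then start (bs, replacement :: acc)   (* b is examined again *)
                else let
                  val cp' = Word.orb (Word.<< (cp, 0w6), Word.andb (b, 0wx3F))
                  in
                    if needed = 1
                      then start (rest, cp' :: acc)
                      else cont (rest, acc, cp', needed - 1, 0wx80, 0wxBF)
                  end
      in
        start (bytes, [])
      end
```

As a check against the example given in the Unicode Standard's discussion of U+FFFD substitution, decoding the bytes 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 with this sketch gives U+0061, U+FFFD, U+FFFD, U+FFFD, U+0062, U+FFFD, U+0063, U+FFFD, U+FFFD, U+0064. Note that the truncated sequences F1 80 80 and E1 80 each produce a single replacement character, which is what simply skipping one byte after an exception would not reproduce.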