UTF8 updates #334

Merged: JohnReppy merged 3 commits into smlnj:main from Skyb0rg007:utf8-tostring-change on Dec 21, 2024

@Skyb0rg007

Contributor
(* Current *)
> UTF8.toString 0wx3BB;
"\\u000003BB"
> UTF8.toString 0wx1D11E;
"\\u0001D11E"

(* New *)
> UTF8.toString 0wx3BB;
"\\u03BB"
> UTF8.toString 0wx1D11E;
"\\U0001D11E"

(* Current *)
> StringCvt.scanString UTF8.getu "\128";
uncaught exception Invalid
(* New *)
> StringCvt.scanString (UTF8.getuWith UTF8.DECODE_REPLACE) "\128";
SOME 0wxFFFD : UTF8.wchar option

Description

Fixed cutoff in Unicode character printing

The current implementation uses max2Byte (0wxdf) as the cutoff for the short escape form, when it should use 0wxFFFF.
This function handles code points, not their UTF-8 encodings.

Changed 8-character Unicode character printing

The current implementation uses \uXXXX for 4-digit escapes and \uXXXXXXXX for 8-digit escapes.
Although these aren't SML language features, we should follow the convention of C by using a capital U for 8-digit escapes to avoid confusion.
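The two changes above can be sketched together as follows. This is a minimal illustration, not the library source: `escape` is a hypothetical helper that covers only the escaped (non-ASCII) case and picks the form by code point value.

```sml
(* Hypothetical helper, not the library source: renders a code point as an
   escape, choosing the form by code point value rather than by the length
   of its UTF-8 encoding. *)
fun escape (w : word) : string =
  let
    (* Word.fmt with StringCvt.HEX yields uppercase hex digits, no prefix *)
    val hex = Word.fmt StringCvt.HEX w
    fun pad n = StringCvt.padLeft #"0" n hex
  in
    if w <= 0wxFFFF
      then "\\u" ^ pad 4   (* fits in four hex digits *)
      else "\\U" ^ pad 8   (* needs the eight-digit form *)
  end

val lambda = escape 0wx3BB      (* backslash, u, then 03BB *)
val gClef  = escape 0wx1D11E    (* backslash, U, then 0001D11E *)
```

Because every `\u` is followed by exactly four digits and every `\U` by exactly eight, concatenated output can be split back into code points without guessing.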

Decoding strategies

UTF8.getu is great for walking over valid Unicode strings, but sometimes it's useful to recover from encoding errors. The DECODE_REPLACE decoding strategy can be passed to getuWith to replace invalid bytes with U+FFFD REPLACEMENT CHARACTER.
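As a rough usage sketch, the new reader can be driven over a whole string to collect code points. This assumes an SML/NJ build that includes this PR and that `getuWith` has the reader-transformer shape implied by the `StringCvt.scanString` example above; `decodeLossy` is a hypothetical name.

```sml
(* Hypothetical helper: decodes a string to code points, substituting
   U+FFFD for invalid input instead of raising an exception.
   Assumes UTF8.getuWith and UTF8.DECODE_REPLACE as introduced in this PR. *)
fun decodeLossy (s : string) : UTF8.wchar list =
  let
    (* specialize the reader to substrings *)
    val getu = UTF8.getuWith UTF8.DECODE_REPLACE Substring.getc
    fun loop (ss, acc) =
      case getu ss of
          SOME (w, rest) => loop (rest, w :: acc)
        | NONE => List.rev acc
  in
    loop (Substring.full s, [])
  end
```

Well-formed input should decode exactly as it does with UTF8.getu; only the error path differs.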

Motivation and Context

Why is this change required? What problem does it solve?

Printing a UTF-8 string using String.concat (List.map UTF8.toString str) is confusing, as "\\u00010000\\u00DF0000" represents 6 code points (U+10000, U+00DF, U+0030 x 4).
The change also aligns with other programming languages.

UTF8.getu should be enhanced to allow for graceful handling of invalid inputs. The current implementation does not allow for this, as simply skipping a byte after an exception does not mimic best practices for replacement character creation.

How Has This Been Tested?

The DECODE_REPLACE implementation has been tested on the test cases in the Unicode recommendations page.

@JohnReppy

Contributor

The \uxxxx syntax is valid SML'97 syntax. The \Uxxxxxxxx syntax has been proposed for Successor ML.

@JohnReppy left a comment

Contributor

I think these constants are useful for documentation purposes (and max1Byte is used).

@JohnReppy left a comment

Contributor

Perhaps there should be a "skip" strategy that resynchronizes on the next valid UTF-8 sequence?

@Skyb0rg007

Contributor Author

That could be a good option, though I don’t know how useful such a decoding option would be. I know Unicode doesn’t recommend ignoring errors because you can get unexpected string collisions.

Another useful decoding strategy is Python's surrogateescape, which turns an invalid byte b into the lone surrogate 0xDC00 + b (so the results fall in U+DC80 to U+DCFF), but I think that would require reworking the iterator in an incompatible way. It could be added later, since it makes the most sense in combination with a corresponding encoding strategy.
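For reference, Python's surrogateescape handler maps each undecodable byte b in 0x80..0xFF to the lone surrogate U+DC00 + b, and the matching encoder turns that surrogate back into the original byte. A minimal sketch of the mapping, with hypothetical helper names that are not part of the UTF8 structure:

```sml
(* Hypothetical helpers illustrating the surrogateescape mapping. *)

(* Decoding side: an undecodable byte becomes a lone low surrogate.
   Bytes below 0x80 are plain ASCII and are never escaped. *)
fun escapeByte (c : char) : word option =
  let val b = Char.ord c
  in
    if b >= 0x80 then SOME (0wxDC00 + Word.fromInt b) else NONE
  end

(* Encoding side: a surrogate in U+DC80..U+DCFF is turned back into
   the byte it stands for, so the round trip is lossless. *)
fun unescapeByte (w : word) : char option =
  if 0wxDC80 <= w andalso w <= 0wxDCFF
    then SOME (Char.chr (Word.toInt (w - 0wxDC00)))
    else NONE
```

The pairing is the point: the decoder alone produces code points that are not valid scalar values, which is why the strategy only makes sense alongside an encoder that understands them.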

@Skyb0rg007

Contributor Author

> The \uxxxx syntax is valid SML'97 syntax. The \Uxxxxxxxx syntax has been proposed for Successor ML.

Unfortunately, the lexical specification of SML string escapes makes \Uxxxxxxxx pretty useless. Namely, the \u and \U escapes are code unit escapes rather than code point escapes: e.g. "\u033B" : string wouldn’t typecheck, and \U would only be possible in UTF-32 encoded strings.
"\u00FF" : string should be the same as "\195\191" (the UTF-8 encoding of U+00FF), but it is currently interpreted as "\255" (invalid UTF-8).

So the result of UTF8.toString will not be able to be fed back into SML, and any attempted fix would have to be backwards incompatible. Maybe I’ll submit a Successor ML proposal.
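The code unit versus code point distinction in the comment above can be checked directly. The sketch below uses a hypothetical `encode2` helper that handles only the two-byte range U+0080..U+07FF, and assumes the compiler accepts `\u00FF` in an ordinary string literal, as described above for SML/NJ.

```sml
(* Hypothetical helper: UTF-8 encodes a code point in the two-byte
   range U+0080..U+07FF as 110xxxxx 10xxxxxx. *)
fun encode2 (cp : int) : string =
  if cp < 0x80 orelse cp > 0x7FF
    then raise Domain
    else String.implode
           [ Char.chr (0xC0 + cp div 64),   (* lead byte: top 5 bits *)
             Char.chr (0x80 + cp mod 64) ]  (* continuation: low 6 bits *)

val asUtf8     = encode2 0xFF        (* two bytes: 195, 191 *)
val asCodeUnit = "\u00FF"            (* a single char with ordinal 255 *)
val sameBytes  = (asUtf8 = asCodeUnit)
```

Since the literal is a single code unit while the encoder emits two bytes, `sameBytes` comes out false, which is the mismatch that keeps UTF8.toString output from round-tripping through the SML lexer.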

@JohnReppy

Contributor

I believe that the \uxxxx string escapes are meant to be used in wide-text string types, where specifying a code point makes sense. They were not meant to be a way to write UTF-8 sequences (I don't think that we were even thinking about UTF-8 when designing this stuff, although the inventors of it were just down the hall.)

@JohnReppy left a comment

Contributor

Looks good

@JohnReppy merged commit 2d94fc0 into smlnj:main on Dec 21, 2024