Correctly handle UTF-8 in DEHEX #773
Labels
status.built
A change in codebase has been done to address the ticket.
status.tested
The change in code has been manually tested and verified to fix the issue.
type.review
Ticket describes a possible improvement.
Currently, DEHEX seems to treat some percent-encoded values as Latin1 (or as direct Unicode codepoint values up to 0xFF, which is the same): one example is 0xB2, which when decoded using Latin1 is "²", U+00B2, "Superscript two". On the other hand, e.g. 0xCE is not decoded by DEHEX at all (by the same codepoint/Latin1 logic, it would be "Î", U+00CE, "Latin capital letter I with circumflex"). Even ignoring the inconsistency as an implementation artifact, mapping percent-escaped values 1:1 to Unicode codepoints is undesirable.
As percent-encoding is originally and primarily an escaping mechanism used in URIs, I think DEHEX should adhere to the URI percent encoding rules. In modern URI schemes, characters are UTF-8 encoded first, then percent-escaped. So decoding in DEHEX should do the reverse: percent-decode first, and then UTF-8 decode.
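To illustrate the proposed order of operations (percent-decode to bytes first, then UTF-8 decode), here is a minimal sketch in Python rather than Red. The name `dehex_utf8` is hypothetical and this is not DEHEX's actual implementation; it only assumes that each run of consecutive `%XX` escapes forms complete UTF-8 sequences and that malformed escapes such as `%zz` are left untouched.

```python
import re

# One or more consecutive %XX escapes; a multi-byte UTF-8 character
# always appears as such a run (e.g. "%C2%B2").
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def dehex_utf8(text: str) -> str:
    """Percent-decode each escape run to bytes, then decode those bytes as UTF-8."""

    def decode_run(match: re.Match) -> str:
        # Step 1: percent-decode the run into raw bytes.
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        # Step 2: interpret the bytes as UTF-8.
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        return raw.decode("utf-8")

    return _PCT_RUN.sub(decode_run, text)
```

Under this sketch, `%C2%B2` decodes to "²" (the UTF-8 encoding of U+00B2), while a lone `%B2` or `%CE` is rejected as invalid UTF-8 instead of being mapped to a Latin1 character.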
If that's deemed too complex to be done right away, a stop-gap measure towards full UTF-8 support could be to make DEHEX simply raise an error for any non-ASCII (>0x7F) percent-encoded value.
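The stop-gap behaviour could look roughly like the following Python sketch. The name `dehex_ascii_only` and the choice of `ValueError` are assumptions made for illustration, not a description of how DEHEX would report the error.

```python
import re

# A single %XX escape with two hex digits.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def dehex_ascii_only(text: str) -> str:
    """Decode %XX escapes, refusing any value above 0x7F."""

    def decode_one(match: re.Match) -> str:
        value = int(match.group(1), 16)
        if value > 0x7F:
            # Non-ASCII bytes need UTF-8 decoding, which this stop-gap omits.
            raise ValueError(f"non-ASCII percent-encoding: {match.group(0)}")
        return chr(value)

    return _PCT.sub(decode_one, text)
```

With this variant, `a%20b` still decodes to "a b", while `%B2` and `%CE` both fail loudly rather than being decoded inconsistently.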
See also #772 for earlier discussion, and CureCode issue #1986 for the related discussion in Rebol 3.