Correctly handle UTF-8 in DEHEX #773

earl · 2014-04-13T15:49:10Z

Currently, DEHEX seems to treat some percent-encoded values as Latin1 (or as direct Unicode codepoint values < 256, which amounts to the same thing): one example is 0xB2, which when decoded using Latin1 is "²", U+00B2, "Superscript two". On the other hand, 0xCE, for example, is not decoded by DEHEX at all (by the same codepoint/Latin1 logic, it would be "Î", U+00CE, "Latin capital letter I with circumflex"). Even ignoring the inconsistency as an implementation artifact, mapping percent-escaped values 1:1 to Unicode codepoints is undesirable.

As percent-encoding is originally and primarily an escaping mechanism used in URIs, I think DEHEX should adhere to the URI percent encoding rules. In modern URI schemes, characters are UTF-8 encoded first, then percent-escaped. So decoding in DEHEX should do the reverse: percent-decode first, and then UTF-8 decode.

red>> dehex "a%ce%b2c"
== "a%ce²c"  ;; Expected: "aβc"
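To make the expected behaviour concrete, here is a minimal sketch of the two-step decoding in Python. It is only an illustration of the idea, not Red's implementation; the function name `dehex_utf8` and the choice to pass malformed escapes through unchanged are assumptions made for this example.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def dehex_utf8(text: str) -> str:
    """Percent-decode `text` into bytes, then decode those bytes as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            # Step 1: a valid %XX escape contributes exactly one raw byte.
            raw.append(int(pair, 16))
            i += 3
        else:
            # Anything else (including a malformed escape) is copied as-is.
            raw.extend(text[i].encode("utf-8"))
            i += 1
    # Step 2: interpret the byte sequence as UTF-8.
    # Invalid sequences raise UnicodeDecodeError.
    return raw.decode("utf-8")

print(dehex_utf8("a%ce%b2c"))
```

With this approach, the bytes 0xCE 0xB2 are combined into the single codepoint U+03B2 ("β") instead of being mapped one by one.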

If that's deemed too complex to be done right away, a stop-gap measure towards full UTF-8 support could be to make DEHEX simply error out on any non-ASCII (> 0x7F) percent-encoded value.
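A sketch of what such a stop-gap could look like, again in Python and purely illustrative: the name `dehex_ascii_only` and the use of `ValueError` are assumptions for this example, not anything DEHEX currently does.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def dehex_ascii_only(text: str) -> str:
    """Decode %XX escapes, rejecting any escaped value above 0x7F."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            value = int(pair, 16)
            if value > 0x7F:
                # Non-ASCII escapes would need UTF-8 decoding; refuse them for now.
                raise ValueError(f"non-ASCII percent-encoding %{pair} at index {i}")
            out.append(chr(value))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Here `dehex_ascii_only("a%20b")` yields `"a b"`, while `dehex_ascii_only("a%ce%b2c")` raises instead of silently producing a wrong string.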

See also #772 for earlier discussion, and CureCode issue #1986 for the related discussion in Rebol 3.

dockimbel · 2014-04-13T15:53:13Z

These are valid concerns. Sorry for merging/closing #772 too early, before deciding how to handle the decoding in a Unicode context.

I agree about the double decoding you are proposing.

We need to come up with the most memory-efficient strategy for handling it (for example, avoiding, if possible, an intermediate buffer that would need to be discarded later). @qtxie do you have a proposal for how best to implement that?
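One possible shape for a buffer-free approach is a single pass that assembles each codepoint as its percent-escaped bytes arrive. The Python sketch below illustrates the idea only; it is not the strategy adopted in Red, the name `iter_dehex` is hypothetical, and for brevity it does not reject overlong forms or surrogate codepoints.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def iter_dehex(text):
    """Yield decoded characters one by one, without an intermediate byte buffer."""
    i, n = 0, len(text)
    pending = 0      # continuation bytes still expected
    codepoint = 0    # codepoint assembled so far
    while i < n:
        pair = text[i + 1:i + 3]
        is_escape = (text[i] == "%" and len(pair) == 2
                     and all(c in HEX_DIGITS for c in pair))
        if not is_escape:
            if pending:
                raise ValueError("truncated UTF-8 sequence")
            yield text[i]
            i += 1
            continue
        byte = int(pair, 16)
        i += 3
        if pending:
            if (byte & 0xC0) != 0x80:
                raise ValueError("expected UTF-8 continuation byte")
            codepoint = (codepoint << 6) | (byte & 0x3F)
            pending -= 1
            if pending == 0:
                yield chr(codepoint)
        elif byte < 0x80:
            yield chr(byte)                      # plain ASCII
        elif (byte & 0xE0) == 0xC0:
            codepoint, pending = byte & 0x1F, 1  # 2-byte sequence
        elif (byte & 0xF0) == 0xE0:
            codepoint, pending = byte & 0x0F, 2  # 3-byte sequence
        elif (byte & 0xF8) == 0xF0:
            codepoint, pending = byte & 0x07, 3  # 4-byte sequence
        else:
            raise ValueError("invalid UTF-8 lead byte")
    if pending:
        raise ValueError("truncated UTF-8 sequence")

print("".join(iter_dehex("a%ce%b2c")))
```

Only two integers of state are carried between escapes, so the decoded characters can be written straight into the destination string.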

earl · 2014-04-13T16:12:12Z

No worries, I think merging #772 right away was fine. It's a living code base, we can start with things even before they are perfect, and work out the kinks as we move along.

FIX: issue #773 (Correctly handle UTF-8 in DEHEX)

dockimbel added the Red label Apr 13, 2014

dockimbel added status.reviewed labels Apr 13, 2014

qtxie added a commit to qtxie/red that referenced this issue Apr 14, 2014

FIX: issue red#773 (Correctly handle UTF-8 in DEHEX)

cb74aaa

dockimbel added a commit that referenced this issue Apr 15, 2014

Merge pull request #774 from qtxie/issue-773

7931e17

FIX: issue #773 (Correctly handle UTF-8 in DEHEX)

dockimbel added status.built and removed status.reviewed labels Apr 15, 2014

dockimbel closed this as completed Apr 15, 2014
