Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle UTF-8 in DEHEX #773

Closed
earl opened this issue Apr 13, 2014 · 2 comments
Closed

Correctly handle UTF-8 in DEHEX #773

earl opened this issue Apr 13, 2014 · 2 comments
Labels
status.built A change in codebase has been done to address the ticket. status.tested The change in code has been manually tested and verified to fix the issue. type.review Ticket describes a possible improvement.

Comments

@earl
Copy link
Contributor

earl commented Apr 13, 2014

Currently, DEHEX seems to treat some percent-encoded values as Latin1 (or direct Unicode codepoint values < 127, which is the same): one example is 0xB2, which when decoded using Latin1 is "²", U+00B2, "Superscript two". On the other hand, e.g. 0xCE is not decoded by dehex at all (by the same codepoint/Latin1 logic, it would be "Î", U+00CE, "Latin capital letter i with circumflex). Even Ignoring the inconsistency as an implementation artifact, mapping percent-escaped values 1:1 to Unicode codepoints is undesirable.

As percent-encoding is originally and primarily an escaping mechanism used in URIs, I think DEHEX should adhere to the URI percent encoding rules. In modern URI schemes, characters are UTF-8 encoded first, then percent-escaped. So decoding in DEHEX should do the reverse: percent-decode first, and then UTF-8 decode.

red>> dehex "a%ce%b2c"
== "a%ce²c"  ;; Expected: "aβc"

If that's deemed too complex to be done right away, a stop-gap measure towards full UTF-8 support could make DEHEX simply error out for any non ASCII (>0x7F) percent-encodings.

See also #772 for earlier discussion, and CureCode issue #1986 for the related discussion in Rebol 3.

@dockimbel dockimbel added the Red label Apr 13, 2014
@dockimbel
Copy link
Member

These are valid concerns, sorry for merging/closing #772 too early, before deciding on how to handle the decoding in a Unicode context.

I agree about the double decoding you are proposing.

We need to come up with the best memory-efficient strategy for handling it (like if possible avoiding an intermediary buffer that would need to be discarded later). @qtxie do you have a proposition for how to best implement that?

@earl
Copy link
Contributor Author

earl commented Apr 13, 2014

No worries, I think merging #772 right away was fine. It's a living code base, we can start with things even before they are perfect, and work out the kinks as we move along.

qtxie added a commit to qtxie/red that referenced this issue Apr 14, 2014
dockimbel added a commit that referenced this issue Apr 15, 2014
FIX: issue #773 (Correctly handle UTF-8 in DEHEX)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status.built A change in codebase has been done to address the ticket. status.tested The change in code has been manually tested and verified to fix the issue. type.review Ticket describes a possible improvement.
Projects
None yet
Development

No branches or pull requests

2 participants