
to string! incorrectly converts invalid UTF-8 sequences #638

Closed · rebolbot opened this issue Feb 15, 2009 · 14 comments

Comments


Submitted by: PeterWood

When converting from binary!, to string! silently converts an invalid UTF-8 sequence into a different, valid character.

I would have expected the invalid sequence to be replaced by the Unicode replacement character, codepoint U+FFFD.

>> f: "à"
== "à"
>> g: to string! #{C2E0}   ;; invalid UTF-8 sequence
== "à"
>> f = g
== true
>> enbase/base to binary! f 16
== "C3A0"
>> enbase/base to binary! g 16
== "C3A0"

CC – Data [ Version: alpha 33 Type: Bug Platform: All Category: Unspecified Reproduce: Always Fixed-in: none ]


Submitted by: BrianH

TO STRING! should only accept valid, minimal UTF-8, and convert invalid sequences to ^(FFFD) as Peter suggests. To do otherwise would be bad for security.

If we are going to allow more lax decoding it should only be done by DECODE, and then only when an option is specified, not by default.

See http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations


Submitted by: Carl

I think that's a serious performance hit. We can do it, but it will cost us conversion time.


Submitted by: PeterWood

An alternative approach would be to raise a script error if an invalid sequence is encountered. Such behaviour would be consistent with to date!:

>> to date! "32-Jan-2009" 
** Script error: cannot MAKE/TO date! from: "32-Jan-2009"
** Where: to
** Near: to date! "32-Jan-2009"


Submitted by: BrianH

You can see an entire date in one glance - a string error might be effectively hidden. Just erroring out would make it difficult to load strings with minor errors.

I think the performance hit comes from the detection, rather than from swapping in another character. How much of a hit are we talking about? This seems like a problem with such wide applicability that it is likely that efficient solutions have already been made. Perhaps an internet code search is in order.

After giving the security problem some thought, it occurs to me that all of our string processing is done on decoded strings, not the encoded binaries. The only exceptions so far are TRANSCODE - which doesn't execute code - and FIND-SCRIPT - which might be a problem. All of our security processing is done post-decode. The restriction on overlong forms is unnecessary in this case, since the overlong form exploit attacks binary screening. We do need to be careful of FIND-SCRIPT's REBOL header search though.

Even the Wikipedia page has changed since this ticket was added. Now they just refer to an RFC http://tools.ietf.org/html/rfc3629 that expresses the idea better, and the restricted values are red in the chart. If we follow the RFC we should be able to do this efficiently.

Perhaps one of the strategies here http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences can be used.

Another alternative would be to have the REBOL code decoders like FIND-SCRIPT, TO-BLOCK and TRANSCODE do the secure decoding, have TO-STRING do the fast decoding, and have DECODE do either depending on options.
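
A minimal sketch in C (not Rebol's actual s-unicode.c code; the name utf8_decode_one is illustrative) of the replacement-character strategy being discussed, assuming the RFC 3629 rules: a sequence is rejected if the lead byte is out of range, a continuation byte does not match 10xxxxxx, or the decoded value is overlong, a surrogate, or above U+10FFFF, and U+FFFD is substituted instead.

#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from s (n >= 1 bytes available), writing the
   codepoint to *cp and returning the number of bytes consumed.  Invalid,
   truncated, or overlong sequences yield U+FFFD and consume one byte so
   the decoder can resynchronize on the next byte. */
static size_t utf8_decode_one(const uint8_t *s, size_t n, uint32_t *cp)
{
    uint8_t b = s[0];
    if (b < 0x80) { *cp = b; return 1; }                  /* ASCII */

    size_t len;
    uint32_t min, c;
    if      ((b & 0xE0) == 0xC0) { len = 2; min = 0x80;    c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = b & 0x07; }
    else { *cp = 0xFFFD; return 1; }                      /* 80-BF or F8-FF lead */

    if (n < len) { *cp = 0xFFFD; return 1; }              /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }   /* bad continuation */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Overlong forms, surrogates, and values past U+10FFFF are also invalid. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        c = 0xFFFD;

    *cp = c;
    return len;
}

Fed #{C2E0}, such a decoder would emit U+FFFD for the C2 byte (its continuation byte is not 10xxxxxx) and then treat E0 as the start of a new, truncated sequence, producing the two replacement characters that Brian's byte dump later in this thread says the input should yield.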


Submitted by: Carl

I looked into it early on in the Unicode implementation process, but if I recall, the hit was about 50%. I was actually shocked that they allowed this "hole" in the encoding definition, because it penalizes all correctly encoded strings for the extremely rare possible invalid case.

We could possibly define a "valid-UTF8?" native to perform the validation scan separately. I think that would be fairly easy to add. The reason this approach makes sense is that you have to ask what it means to your program if an invalid encoding is found. If you scan it with valid-UTF8?, that function could tell you the location of the error and let you correct it (maybe).

Regarding FIND-SCRIPT, since we only accept UTF-8 source, and the REBOL signature is always ASCII (currently), then it's not a problem.

IMO, the high level of attention this issue gets in the Unicode community is mainly because they don't want to see such errors propagate without detection.
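
A sketch of the separate-scan idea in C (the name utf8_first_invalid is hypothetical, not the actual native): validation stays out of the default decode path and is run only on demand, reporting where the first bad sequence starts, which is roughly the shape of the INVALID-UTF? check mentioned at the end of this thread.

#include <stdint.h>
#include <stddef.h>

/* Return the offset of the first byte that starts an invalid, truncated,
   or overlong UTF-8 sequence, or -1 if the whole buffer is well-formed.
   Run as a separate pass, the hot decode path pays nothing for it. */
static long utf8_first_invalid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        size_t len;
        uint32_t min, c;

        if (b < 0x80) { i++; continue; }                           /* ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; min = 0x80;    c = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = b & 0x07; }
        else return (long)i;                                       /* bad lead byte */

        if (i + len > n) return (long)i;                           /* truncated */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;         /* bad continuation */
            c = (c << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (long)i;                                        /* overlong / out of range */
        i += len;
    }
    return -1;
}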


Submitted by: BrianH

Well we certainly shouldn't ever generate non-minimized UTF-8 :)

As long as we document that non-minimized UTF-8 will be accepted by TO-STRING, and tell people that this means you should do your security validation after converting to string!, not before on the binary!, then we should be fine. It might even be considered a feature that your strings become less evil by passing through REBOL.


Submitted by: PeterWood

From the experience of looking at character encoding in REBOL.org, invalid UTF-8 characters are not as exceptional as you would expect. I think this is due to the fact that people have been very cavalier about character encoding in the past. Once we get to a homogenised UTF world, I would expect the exceptions to be few and far between but that's still some way off.

I personally believe that converting invalid sequences to the "invalid character" would be the better solution if it is not too "expensive".


Submitted by: BrianH

Carl: "the hit was about 50%"

Peter, it is definitely too expensive to convert invalid sequences to the "invalid character", since detecting them would add that ~50% overhead to the decoding process. It would be better to just convert invalid sequences to valid characters, since that would not have any overhead.

The security issues are irrelevant for us because we work in Unicode internally, not UTF-8. If you need to validate a UTF-8 binary then a decoder could be added for DECODE, and a mezzanine VALID-UTF-8? wrapper to call it.


Submitted by: PeterWood

I don't understand how you can convert an invalid utf-8 sequence to a valid character. Invalid sequences don't map to a character, that's what makes them invalid sequences.

To what valid character would 0xC0 be converted? Or the sequence 0xC200?


Submitted by: BrianH

By invalid sequence we mean overlong encoding - that is what would add the 50% overhead. Actual invalid sequences like #{C0} and #{C200} are already converted to the Unicode replacement character U+FFFD (#{EFBFBD} in UTF-8).

The current behavior is the best we can reasonably do: Overlong sequences are not flagged due to the overhead, but invalid sequences of the type that PeterWood suggests in his last message are caught. Any more would add too much overhead.

rebolbot commented Jul 14, 2009

Submitted by: PeterWood

Actually invalid sequences like #{C200} are not converted to the replacement character. See the original example:

>> g: to string! #{C2E0}    ;; "Actual" invalid sequence
== "à"
>> enbase/base to binary! g 16
== "C3A0"


Submitted by: BrianH

Based on this code:

>> enbase/base #{C0} 2
== "11000000"  ; missing second byte
>> enbase/base to-binary to-string #{C0} 2
== "111011111011111110111101"  ; good: replacement character
>> enbase/base #{C000} 2
== "1100000000000000"  ; first bit of additional byte is 0 (required to be 1)
>> enbase/base to-binary to-string #{C000} 2
== "11101111101111111011110100000000"  ; good: replacement character, then next byte as additional character
>> enbase/base #{C2E0} 2
== "1100001011100000"  ; second bit of additional byte is 1 (required to be 0)
>> enbase/base to-binary to-string #{C2E0} 2
== "1100001110100000"  ; bad: should be two replacement characters: "111011111011111110111101111011111011111110111101"
>> enbase/base #{80} 2
== "10000000"  ; stray additional byte with no lead byte
>> enbase/base to-binary to-string #{80} 2
== "1100001010000000"  ; iffy: #{80} to #{bf} treated as latin1
>> enbase/base #{f8} 2
== "11111000"  ; value too large for a lead byte
>> enbase/base to-binary to-string #{f8} 2
== "111011111011111110111101"  ; good: replacement character
>> enbase/base #{f880808080} 2
== "1111100010000000100000001000000010000000"  ; value too large for a lead byte, with the right number of additionals
>> enbase/base to-binary to-string #{f880808080} 2
== "111011111011111110111101"  ; good: replacement character

It is clear that these encoding errors are caught and converted to the replacement char:

- First byte > #{F7}
- Missing additional bytes

And that these encoding errors are not caught:

- First byte from #{80} to #{bf} (missing first byte, high bits = 10)
- Additional bytes from #{c0} to #{ff} (second bit set, high bits = 11)
- Overlong forms

The overlong forms we can afford to ignore, but the other two uncaught errors should be caught if it is feasible.
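
Both uncaught cases reduce to the same test on the 10xxxxxx continuation pattern; a small C illustration (names are hypothetical) of the predicate involved:

#include <stdint.h>
#include <stdbool.h>

/* A UTF-8 continuation byte must match the bit pattern 10xxxxxx. */
static bool utf8_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* The two uncaught errors above are then:
   - utf8_is_continuation() is true for a byte in lead position
     (first byte in #{80}-#{BF}), or
   - utf8_is_continuation() is false for a byte in trailing position
     (trailing byte in #{C0}-#{FF}, as in #{C2E0}). */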


Submitted by: Carl

I'm going to dismiss this bug now that this UTF check is available in A79:

>> if invalid-utf? #{C2E0} [print "invalid"]
invalid

If a side issue or problem came up in these discussions, please repost it as a new bug.

hostilefork added a commit to metaeducation/ren-c that referenced this issue Aug 28, 2019
While the UTF-8 loading code for R3-Alpha gives an error vs. crashing
on certain invalid UTF-8 patterns, it does not *fully* validate UTF-8.
This means some inputs that are actually invalid UTF-8 will just be
loaded as a string of "some other characters":

metaeducation/rebol-issues#638

The reason for not doing greater validation was based on the belief that
any of the string states you could get to from invalid UTF-8 data would
also be producible from valid UTF-8 data.  Hence, paying a cost (said
to be about 50% overhead) wasn't worth it.

This cost doesn't seem to make sense, as the checks are only on
non-ASCII sequences...which are uncommon in typical Rebol code.  The
checks seem quite worth it, and not overly expensive.

So this commit simply reinstates the checks.

hostilefork commented Aug 28, 2019

The commenting out of the check is sneaky! I'd actually gone over that code, but assumed it was unnecessary or being checked elsewhere, since it had no remarks!!

https://github.com/rebol/rebol/blob/25033f897b2bd466068d7663563cd3ff64740b94/src/core/s-unicode.c#L795

I can understand being a skeptic of Rebol's need to "police" UTF-8 validity, especially if it comes at a cost. But...I'm also skeptical of the 50% cost claim on the real line-of-work cases...because the check isn't needed on ASCII bytes, only when you're dealing with higher codepoints. And most all Rebol exchanged today is nearly entirely ASCII (if not more often, entirely ASCII).
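
A rough sketch of that fast-path argument, under the assumption that the decoder branches on the high bit (the function name is illustrative, not Ren-C's actual code): only bytes at or above 0x80 ever reach the UTF-8 validity checks, so mostly-ASCII source pays next to nothing for them.

#include <stdint.h>
#include <stddef.h>

/* Count how many bytes of the input would actually enter the non-ASCII
   validation branch; for typical Rebol source this is zero or close to it,
   which is why a blanket 50% figure looks doubtful for real workloads. */
static size_t bytes_needing_utf8_check(const uint8_t *s, size_t n)
{
    size_t slow = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] >= 0x80)        /* ASCII bytes skip the checks entirely */
            slow++;
    return slow;
}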

Well we certainly shouldn't ever generate non-minimized UTF-8 :)

The best way to be certain of this is to be systemically ensuring that everything is valid when coming in and going out. :-/

Even if you're not concerned about what Rebol-the-EXE makes when it's spitting out encodings of an ANY-STRING!, you have to consider programmers who are working with UTF-8 at a lower level...maybe with byte surgery on content from other sources. Then that code might be taking for granted that the conversions have their back. I always thought it did, since there were some errors...so I assumed it was catching all of them (!) People might well not know about the INVALID-UTF? routine.

I'm sure there are security issues as well...just in terms of doing a conversion, throwing away some patterns of data on read...validating, and then passing on the original data believing it's what you just read. Someone else might read it differently, and somewhere in this process someone could get burned.

So I'm going to have to agree with Peter here, for once. But the issue can remain closed, as I'm satisfied with Ren-C's resolution...of just bringing back the check:

metaeducation/ren-c@210f3cd
