
to string! incorrectly converts invalid UTF-8 sequences #638

Closed · rebolbot opened this issue Feb 15, 2009 · 14 comments

Comments


Submitted by: PeterWood

When converting from binary!, to string! silently converts an invalid UTF-8 sequence into a different, valid character.

I would have expected the invalid sequence to be replaced by the Unicode replacement character, codepoint U+FFFD.

>> f: "à"
== "à"
>> g: to string! #{C2E0}   ;; invalid UTF-8 sequence
== "à"
>> f = g
== true
>> enbase/base to binary! f 16
== "C3A0"
>> enbase/base to binary! g 16
== "C3A0"

CC – Data [ Version: alpha 33 Type: Bug Platform: All Category: Unspecified Reproduce: Always Fixed-in: none ]


Submitted by: BrianH

TO STRING! should only accept valid, minimal UTF-8, and convert invalid sequences to ^(FFFD) as Peter suggests. To do otherwise would be bad for security.

If we are going to allow more lax decoding it should only be done by DECODE, and then only when an option is specified, not by default.

See http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations


Submitted by: Carl

I think that's a serious performance hit. We can do it, but it will cost us conversion time.


Submitted by: PeterWood

An alternative approach would be to raise a script error if an invalid sequence is encountered. Such behaviour would be consistent with to date!:

>> to date! "32-Jan-2009" 
** Script error: cannot MAKE/TO date! from: "32-Jan-2009"
** Where: to
** Near: to date! "32-Jan-2009"


Submitted by: BrianH

You can see an entire date in one glance - a string error might be effectively hidden. Just erroring out would make it difficult to load strings with minor errors.

I think the performance hit comes from the detection, rather than from swapping in another character. How much of a hit are we talking about? This seems like a problem with such wide applicability that it is likely that efficient solutions have already been made. Perhaps an internet code search is in order.

After giving the security problem some thought, it occurs to me that all of our string processing is done on decoded strings, not the encoded binaries. The only exceptions so far are TRANSCODE - which doesn't execute code - and FIND-SCRIPT - which might be a problem. All of our security processing is done post-decode. The restriction on overlong forms is unnecessary in this case, since the overlong form exploit attacks binary screening. We do need to be careful of FIND-SCRIPT's REBOL header search though.

Even the Wikipedia page has changed since this ticket was added. Now they just refer to an RFC http://tools.ietf.org/html/rfc3629 that expresses the idea better, and the restricted values are red in the chart. If we follow the RFC we should be able to do this efficiently.

Perhaps one of the strategies here http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences can be used.

Another alternative would be to have the REBOL code decoders like FIND-SCRIPT, TO-BLOCK and TRANSCODE do the secure decoding, have TO-STRING do the fast decoding, and have DECODE do either depending on options.
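
A minimal sketch in C (not Rebol's actual s-unicode.c code; the name utf8_decode_one is illustrative) of the replacement-character strategy being discussed, assuming the RFC 3629 rules: a sequence is rejected if the lead byte is out of range, a continuation byte does not match 10xxxxxx, or the decoded value is overlong, a surrogate, or above U+10FFFF, and U+FFFD is substituted instead.

#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from s (n >= 1 bytes available), writing the
   codepoint to *cp and returning the number of bytes consumed.  Invalid,
   truncated, or overlong sequences yield U+FFFD and consume one byte so
   the decoder can resynchronize on the next byte. */
static size_t utf8_decode_one(const uint8_t *s, size_t n, uint32_t *cp)
{
    uint8_t b = s[0];
    if (b < 0x80) { *cp = b; return 1; }                  /* ASCII */

    size_t len;
    uint32_t min, c;
    if      ((b & 0xE0) == 0xC0) { len = 2; min = 0x80;    c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = b & 0x07; }
    else { *cp = 0xFFFD; return 1; }                      /* 80-BF or F8-FF lead */

    if (n < len) { *cp = 0xFFFD; return 1; }              /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }   /* bad continuation */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Overlong forms, surrogates, and values past U+10FFFF are also invalid. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        c = 0xFFFD;

    *cp = c;
    return len;
}

Fed #{C2E0}, such a decoder would emit U+FFFD for the C2 byte (its continuation byte is not 10xxxxxx) and then treat E0 as the start of a new, truncated sequence, producing the two replacement characters that Brian's byte dump later in this thread says the input should yield.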


Submitted by: Carl

I looked into it early on in the Unicode implementation process, but if I recall, the hit was about 50%. I was actually shocked that they allowed this "hole" in the encoding definition, because it penalizes all correctly encoded strings for the extremely rare possible invalid case.

We could possibly define a "valid-UTF8?" native to perform the validation scan separately. I think that would be fairly easy to add. The reason this approach makes sense is that you have to ask what it means to your program if an invalid encoding is found. If you scan it with valid-UTF8?, that function could tell you the location of the error and let you correct it (maybe).

Regarding FIND-SCRIPT, since we only accept UTF-8 source, and the REBOL signature is always ASCII (currently), then it's not a problem.

IMO, the high level of attention this issue gets in the Unicode community is mainly because they don't want to see such errors propagate without detection.
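
A sketch of the separate-scan idea in C (the name utf8_first_invalid is hypothetical, not the actual native): validation stays out of the default decode path and is run only on demand, reporting where the first bad sequence starts, which is roughly the shape of the INVALID-UTF? check mentioned at the end of this thread.

#include <stdint.h>
#include <stddef.h>

/* Return the offset of the first byte that starts an invalid, truncated,
   or overlong UTF-8 sequence, or -1 if the whole buffer is well-formed.
   Run as a separate pass, the hot decode path pays nothing for it. */
static long utf8_first_invalid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        size_t len;
        uint32_t min, c;

        if (b < 0x80) { i++; continue; }                           /* ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; min = 0x80;    c = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = b & 0x07; }
        else return (long)i;                                       /* bad lead byte */

        if (i + len > n) return (long)i;                           /* truncated */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;         /* bad continuation */
            c = (c << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (long)i;                                        /* overlong / out of range */
        i += len;
    }
    return -1;
}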


Submitted by: BrianH

Well we certainly shouldn't ever generate non-minimized UTF-8 :)

As long as we document that non-minimized UTF-8 will be accepted by TO-STRING, and tell people that this means you should do your security validation after converting to string!, not before on the binary!, then we should be fine. It might even be considered a feature that your strings become less evil by passing through REBOL.


Submitted by: PeterWood

From the experience of looking at character encoding in REBOL.org, invalid UTF-8 characters are not as exceptional as you would expect. I think this is due to the fact that people have been very cavalier about character encoding in the past. Once we get to a homogenised UTF world, I would expect the exceptions to be few and far between but that's still some way off.

I personally believe that converting invalid sequences to the "invalid character" would be the better solution if it is not too "expensive".


Submitted by: BrianH

Carl: "the hit was about 50%"

Peter, it is definitely too expensive to convert invalid sequences to the "invalid character", since detecting them would add that ~50% overhead to the decoding process. It would be better to just convert invalid sequences to valid characters, since that would not have any overhead.

The security issues are irrelevant for us because we work in Unicode internally, not UTF-8. If you need to validate a UTF-8 binary then a decoder could be added for DECODE, and a mezzanine VALID-UTF-8? wrapper to call it.


Submitted by: PeterWood

I don't understand how you can convert an invalid utf-8 sequence to a valid character. Invalid sequences don't map to a character, that's what makes them invalid sequences.

To what valid character would 0xC0 be converted? Or the sequence 0xC200?


Submitted by: BrianH

By invalid sequence we mean overlong encoding - that is what would add the 50% overhead. Actual invalid sequences like #{C0} and #{C200} are already converted to the Unicode replacement character U+FFFD (#{EFBFBD} in UTF-8).

The current behavior is the best we can reasonably do: Overlong sequences are not flagged due to the overhead, but invalid sequences of the type that PeterWood suggests in his last message are caught. Any more would add too much overhead.

rebolbot commented Jul 14, 2009

Submitted by: PeterWood

Actually invalid sequences like #{C200} are not converted to the replacement character. See the original example:

>> g: to string! #{C2E0}    ;; "Actual" invalid sequence
== "à"
>> enbase/base to binary! g 16
== "C3A0"


Submitted by: BrianH

Based on this code:

>> enbase/base #{C0} 2
== "11000000"  ; missing second byte
>> enbase/base to-binary to-string #{C0} 2
== "111011111011111110111101"  ; good: replacement character
>> enbase/base #{C000} 2
== "1100000000000000"  ; first bit of additional byte is 0 (required to be 1)
>> enbase/base to-binary to-string #{C000} 2
== "11101111101111111011110100000000"  ; good: replacement character, then next byte as additional character
>> enbase/base #{C2E0} 2
== "1100001011100000"  ; second bit of additional byte is 1 (required to be 0)
>> enbase/base to-binary to-string #{C2E0} 2
== "1100001110100000"  ; bad: should be two replacement characters: "111011111011111110111101111011111011111110111101"
>> enbase/base #{80} 2
== "10000000"  ; stray additional byte with no lead byte
>> enbase/base to-binary to-string #{80} 2
== "1100001010000000"  ; iffy: #{80} to #{bf} treated as latin1
>> enbase/base #{f8} 2
== "11111000"  ; value too large for a lead byte
>> enbase/base to-binary to-string #{f8} 2
== "111011111011111110111101"  ; good: replacement character
>> enbase/base #{f880808080} 2
== "1111100010000000100000001000000010000000"  ; value too large for a lead byte, with the right number of additionals
>> enbase/base to-binary to-string #{f880808080} 2
== "111011111011111110111101"  ; good: replacement character

It is clear that these encoding errors are caught and converted to the replacement char:

- First byte > #{F7}
- Missing additional bytes

And that these encoding errors are not caught:

- First byte from #{80} to #{bf} (missing first byte, high bits = 10)
- Additional bytes from #{c0} to #{ff} (second bit set, high bits = 11)
- Overlong forms

The overlong forms we can afford to ignore, but the other two uncaught errors should be caught if it is feasible.
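
Both uncaught cases reduce to the same test on the 10xxxxxx continuation pattern; a small C illustration (names are hypothetical) of the predicate involved:

#include <stdint.h>
#include <stdbool.h>

/* A UTF-8 continuation byte must match the bit pattern 10xxxxxx. */
static bool utf8_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* The two uncaught errors above are then:
   - utf8_is_continuation() is true for a byte in lead position
     (first byte in #{80}-#{BF}), or
   - utf8_is_continuation() is false for a byte in trailing position
     (trailing byte in #{C0}-#{FF}, as in #{C2E0}). */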


Submitted by: Carl

I'm going to dismiss this bug now that this UTF check is available in A79:

>> if invalid-utf? #{C2E0} [print "invalid"]
invalid

If a side issue or problem came up in these discussions, please repost it as a new bug.

hostilefork added a commit to metaeducation/ren-c that referenced this issue Aug 28, 2019
While the UTF-8 loading code for R3-Alpha gives an error vs. crashing
on certain invalid UTF-8 patterns, it does not *fully* validate UTF-8.
This means some inputs that are actually invalid UTF-8 will just be
loaded as a string of "some other characters":

metaeducation/rebol-issues#638

The reason for not doing greater validation was based on the belief that
any of the string states you could get to from invalid UTF-8 data would
also be producible from valid UTF-8 data.  Hence, paying a cost (said
to be about 50% overhead) wasn't worth it.

This cost doesn't seem to make sense, as the checks are only on
non-ASCII sequences...which are uncommon in typical Rebol code.  The
checks seem quite worth it, and not overly expensive.

So this commit simply reinstates the checks.

hostilefork commented Aug 28, 2019

The commenting out of the check is sneaky! I'd actually gone over that code, but assumed it was unnecessary or being checked elsewhere, since it had no remarks!!

https://github.com/rebol/rebol/blob/25033f897b2bd466068d7663563cd3ff64740b94/src/core/s-unicode.c#L795

I can understand being a skeptic of Rebol's need to "police" UTF-8 validity, especially if it comes at a cost. But...I'm also skeptical of the 50% cost claim on the real line-of-work cases...because the check isn't needed on ASCII bytes, only when you're dealing with higher codepoints. And most all Rebol exchanged today is nearly entirely ASCII (if not more often, entirely ASCII).
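
A rough sketch of that fast-path argument, under the assumption that the decoder branches on the high bit (the function name is illustrative, not Ren-C's actual code): only bytes at or above 0x80 ever reach the UTF-8 validity checks, so mostly-ASCII source pays next to nothing for them.

#include <stdint.h>
#include <stddef.h>

/* Count how many bytes of the input would actually enter the non-ASCII
   validation branch; for typical Rebol source this is zero or close to it,
   which is why a blanket 50% figure looks doubtful for real workloads. */
static size_t bytes_needing_utf8_check(const uint8_t *s, size_t n)
{
    size_t slow = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] >= 0x80)        /* ASCII bytes skip the checks entirely */
            slow++;
    return slow;
}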

Well we certainly shouldn't ever generate non-minimized UTF-8 :)

The best way to be certain of this is to be systemically ensuring that everything is valid when coming in and going out. :-/

Even if you're not concerned about what Rebol-the-EXE makes when it's spitting out encodings of an ANY-STRING!, you have to consider programmers who are working with UTF-8 at a lower level...maybe with byte surgery on content from other sources. Then that code might be taking for granted that the conversions have their back. I always thought it did, since there were some errors...so I assumed it was catching all of them (!) People might well not know about the INVALID-UTF? routine.

I'm sure there are security issues as well...just in terms of doing a conversion, throwing away some patterns of data on read...validating, and then passing on the original data believing it's what you just read. Someone else might read it differently, and somewhere in this process someone could get burned.

So I'm going to have to agree with Peter here, for once. But the issue can remain closed, as I'm satisfied with Ren-C's resolution...of just bringing back the check:

metaeducation/ren-c@210f3cd
