Incorrect handling of unpaired surrogates in JS strings #1348

Closed
Pauan opened this issue Mar 14, 2019 · 36 comments

@Pauan
Contributor

Pauan commented Mar 14, 2019

Describe the Bug

It was brought to my attention in Pauan/rust-dominator#10 that JavaScript strings (and DOMString) allow for unpaired surrogates.

When using TextEncoder, it will convert those unpaired surrogates into U+FFFD (the replacement character). According to the Unicode spec, this is correct behavior.

The issue is that because the unpaired surrogates are replaced, this is lossy, and that lossiness can cause serious issues.

You can read the above dominator bug report for the nitty gritty details, but the summary is that with <input> fields (and probably other things), it will send two input events, one for each surrogate.

When the first event arrives, the surrogate is unpaired, so because the string is immediately sent to Rust, the unpaired surrogate is converted into the replacement character.

Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.

This has a lot of very deep implications, including for international languages (e.g. Chinese).

I did quite a bit of reading, and unfortunately I think the only real solution here is to always use JsString, and not convert into Rust String, because that is inherently lossy. Or if a conversion is done, it needs to do some checks to make sure that there aren't any unpaired surrogates.
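
For illustration, here is a minimal Rust sketch (plain std, outside of wasm-bindgen) of that lossiness: a lone high surrogate is a perfectly representable UTF-16 code unit, but any conversion to a Rust String either replaces it with U+FFFD or reports an error.

fn main() {
    // "😀" is U+1F600, which UTF-16 encodes as the surrogate pair 0xD83D 0xDE00.
    // Simulate the first input event: only the high surrogate has arrived.
    let partial: Vec<u16> = vec![0xD83D];

    // A lossy conversion (what the TextEncoder path effectively does) yields U+FFFD.
    assert_eq!(String::from_utf16_lossy(&partial), "\u{FFFD}");

    // A strict conversion reports the problem instead of silently replacing it.
    assert!(String::from_utf16(&partial).is_err());

    // Once the second event delivers the low surrogate, the pair decodes fine.
    let complete: Vec<u16> = vec![0xD83D, 0xDE00];
    assert_eq!(String::from_utf16(&complete).unwrap(), "😀");
}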

@Pauan Pauan added the bug label Mar 14, 2019
@Pauan
Contributor Author

Pauan commented Mar 14, 2019

(Another possible solution is to create a new Rust type which is internally just a Vec<u16>, and can losslessly represent a JS string)

@alexcrichton
Contributor

I think this means that JS strings aren't guaranteed to be valid utf-16, right?

It seems reasonable to me to add JsString and JsStr types to wasm-bindgen which basically behave the same way as String and str except that they're lossless with respect to the string on the JS side of things. That way APIs can opt in to using the lossless, but slightly less convenient, representation (sort of like how env::var and env::var_os work).
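
As a rough illustration of that std analogy only (nothing wasm-bindgen-specific here), the convenient/lossless split looks like this:

use std::env;

fn main() {
    // Convenient, but errors if the variable is missing or its value is not valid Unicode.
    let path: Result<String, env::VarError> = env::var("PATH");

    // Lossless but less convenient: an OsString can hold any platform string.
    let path_os: Option<std::ffi::OsString> = env::var_os("PATH");

    println!("{} {}", path.is_ok(), path_os.is_some());
}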

Naming here may be a bit tricky though since js_sys::JsString exists and is an "anyref wrapper" around the actual JS string object (doesn't copy it into Rust yet), but other than that this may not be too too hard to implement!

@Pauan
Contributor Author

Pauan commented Mar 14, 2019

I think this means that JS strings aren't guaranteed to be valid utf-16, right?

Indeed, JS Strings and DOMStrings are not always valid UTF-16. This is actually a part of the spec.

On the other hand, I believe USVString is valid UTF-16, since it's guaranteed not to contain unpaired surrogates.

I think this has some implications for web-sys as well: anything which returns a JS String or DOMString should return JsString.

I'm not sure what should be returned for USVString. It could probably return JsString for consistency, or maybe it could return a Rust String.

@Pauan
Contributor Author

Pauan commented Mar 14, 2019

This also has huge implications for function arguments: right now they tend to take &str, which is reasonable, convenient, and works correctly.

However, it is common to want to call a JS function (which will return JsString) and then pass the string as an argument to another JS function.

In that case the function cannot accept &str, because a JsStr is not guaranteed to be valid UTF-8 (so it cannot be coerced into a &str). So the function should instead accept &JsStr.

That's a pretty big breaking change for many methods in web-sys. And it also has a significant ergonomic cost, and it means that passing a &str is now less efficient (since it has to be copied into a JsString).

I'm not really sure what the right option is here, they all seem bad.

@chris-morgan

Servo’s experience is likely to be valuable here, with @SimonSapin’s WTF-8 as a large part of its solution to the problem.

I haven’t thought very much about the practicalities or APIs of it, but there could be a place for integrating carefully with Tendril, or if code size will be a problem there, some subset of it. Again, probably worth asking Simon for his opinion on this.

Something a little related to this matter that occurred to me this morning: one may want to perform serialisation/deserialisation of DOM strings with Serde (with serde-json or any other form), and one wants that to be as efficient as possible, so that working as close to the source representation as possible is desirable.

@Pauan
Contributor Author

Pauan commented Mar 15, 2019

@chris-morgan Yes, being able to reuse WTF-8 (or in this case actually WTF-16) would be really useful.

I'm not sure how Serde integration will work, I imagine the WTF-16 type will impl Serde, so I think it should Just Work(tm)?

@chris-morgan

chris-morgan commented Mar 15, 2019

(I think you misunderstood my Serde remark. What I mean is the string type of the likes of serde_json::{to_string, from_str}, rather than as something implementing Serialize or Deserialize. If we were to use a type that was [u16] underneath, using serde_json without further modifications would require conversion between string types in addition to the serialisation/deserialisation step, which would be sad.)

Well then, if we decide the Rust side needs to cope with unpaired surrogates, there are two possibilities:

  1. Replace TextEncoder and TextDecoder on the JS side with variants that won’t turn up their nose at lone surrogates, and then the string type on the Rust side is WTF-8 and thus easily (though not painlessly) converted to or from UTF-8.

  2. Just store the UTF-16 code units without any fanciness on the JS side, leaving any conversion code on the Rust side. (Hmm… what actually is OsStr on Windows? I never have quite got a clear answer from the code or docs on this.)

But I don’t think these are the only two options that we have. Although I pointed out the problem and want it to be fixed in some way that won’t be playing whack-a-mole, I also declare that it would be a shame to damage APIs, performance and code size for everyone and everything, in order to satisfy a few corner cases, if we can instead determine satisfactory ways of working around those corner cases.

Although you can encounter unmatched surrogates at any point in JS, I believe that the only situations in which you ever actually will are from key events, because characters in supplementary planes may come through in two parts. Nothing else actually gives you unmatched surrogates.

More specifically still, then:

  • keydown and keypress events on any elements may contain as key a lone surrogate.
  • keydown, keyup and keypress events on <input>/<textarea>/contenteditable elements will mangle the input’s value if it is retrieved on the low surrogate event. (And on keyup you can’t tell if it’s a surrogate event without looking at the last character of the input value, because your key code will be zero.)
  • input events on <input>/<textarea>/contenteditable elements will mangle the input’s value if it is retrieved on a high surrogate event.

For the keydown, keyup and keypress cases, I’m content to leave it be, because handling those properly for any sort of user input management is already black magic that has far worse problems to overcome (we’re pretty much only talking about rich text editors at this point), and I can’t think of anything feasible that would actually care about taking action when such characters were entered; so if anyone does care (and I expect it to literally be no one in the history of the universe), I say let them bind a function to retrieve the value safely. It won’t be that hard.

This then leaves the input event as genuinely the only place of any concern whatsoever. And shall we make everyone miserable for the sake of one? I’d be content with us adding a function to web_sys::InputEvent binding to this JavaScript function:

function should_ignore() {
    // `this` is the InputEvent; `data` is the text that was just inserted.
    var tmp = this.data;
    // Skip the event if the inserted text is a single lone high surrogate
    // (0xD800–0xDBFF); the matching low surrogate arrives with the next event.
    return tmp && tmp.length == 1 && (tmp = tmp.charCodeAt(0)) >= 0xd800 && tmp < 0xdc00;
}

Then, simply issue a recommendation that anything that touches input events should start by calling event.should_ignore() and skip the event if it says to. Expect that mostly this will be arranged by libraries, and that in some situations apps will need to think about it, but that by and large people aren’t touching this stuff anyway. (Two way binding of inputs is already known to be a trap for the unwary—far too easy to trigger infinite loops—so mostly people will do it through libraries that can take care of this for them.)

This way, once the dust settles, we’ll get the very occasional user, who is normally going to have been making a new framework, that needs to be told “please fix your input event handler” because they wrote code that gets tripped up by this (and don’t expect that people will find this on their own). But we won’t have damaged ergonomics everywhere.

Exposing additional functionality in wasm-bindgen or js-sys or wherever to convert between a JavaScript String and Rust Vec<u16> may make sense; but I don’t expect anything in practice to want or need to use it (unless the should_ignore function were implemented via that).
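
For the Rust half of such a conversion, std already covers the well-formed cases; it is only lone surrogates that need anything extra:

fn main() {
    // A &str can always be re-encoded as UTF-16 code units...
    let units: Vec<u16> = "😀 ok".encode_utf16().collect();

    // ...and code units convert back to String as long as every surrogate is paired.
    assert_eq!(String::from_utf16(&units).unwrap(), "😀 ok");
}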

Please let me keep using String on the Rust side! I’m almost sorry I brought this matter up! 😛

@Pauan
Contributor Author

Pauan commented Mar 15, 2019

If we were to use a type that was [u16] underneath, using serde_json without further modifications would require conversion between string types in addition to the serialisation/deserialisation step, which would be sad.)

It might be possible to use the Unicode Private Use Areas to losslessly encode the surrogate pairs. That would allow for WTF-8 to be represented internally as a Rust String (and thus it can be serialized directly by Serde).

That would also allow for it to Deref to &str, fixing a lot of the ergonomic issues. I wonder if @SimonSapin has any ideas about that?

Just store the UTF-16 code units without any fanciness on the JS side, leaving any conversion code on the Rust side.

Yes, that can be accomplished by using JsString right now.

Nothing else actually gives you unmatched surrogates.

I'm not convinced that's true, but maybe you are right and that the "bad" APIs are few and far between.

Please let me keep using String on the Rust side!

Of course that will continue to be possible regardless, I'm not sure how ergonomic it will be, though.

I’m almost sorry I brought this matter up!

Please don't be. This is an important issue which would have eventually been found regardless.

@SimonSapin

It might be possible to use the Unicode Private Use Areas to losslessly encode the surrogate pairs.

I think this wouldn’t be lossless, since two different input strings (one containing a lone surrogate, and one containing a private use area code point) could potentially be converted to the same string.

Also note this is about lone aka unpaired surrogates. UTF-16 / WTF-16 surrogate code unit pairs should absolutely be converted to a single non-BMP code point (which is 4 bytes in UTF-8).

what actually is OsStr on Windows? I never have quite got a clear answer from the code or docs on this.

It’s arguable whether that should be part of the docs or if it’s an internal implementation detail, but the memory representation of OsStr on Windows is WTF-8 bytes. Conversion occurs when calling system APIs. OsStr::to_str(&self) -> Option<&str> does not allocate or copy.
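
For reference, a minimal example of that std API:

use std::ffi::OsString;

fn main() {
    let os = OsString::from("hello");

    // Borrowed view: no allocation or copy; None if the contents aren't valid Unicode.
    assert_eq!(os.to_str(), Some("hello"));

    // The lossy variant always succeeds, substituting U+FFFD for bad sequences.
    assert_eq!(os.to_string_lossy(), "hello");
}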

You can read the above dominator bug report for the nitty gritty details, but the summary is that with <input> fields (and probably other things), it will send two input events, one for each surrogate.

When the first event arrives, the surrogate is unpaired, so because the string is immediately sent to Rust, the unpaired surrogate is converted into the replacement character.

Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.

(Emphasis added.)

In the linked issue:

Then, because the binding is two-way, it writes back to the text input.

I know ~nothing about what this code does, but why is the string that was just read from the text input immediately written back to it?

@lambda-fairy

I know ~nothing about what this code does, but why is the string that was just read from the text input immediately written back to it?

I'm not Pauan, but I believe that code is implementing a React-style controlled component, where the DOM always reflects the current state as understood by the application.

@Pauan
Contributor Author

Pauan commented Mar 15, 2019

I think this wouldn’t be lossless, since two different input strings (one containing a lone surrogate, and one containing a private use area code point) could potentially be converted to the same string.

Of course that is true, but my understanding is that "all bets are off" with the Private Use Area, so maybe it's reasonable to assume that it won't be used?

Also note this is about lone aka unpaired surrogates. UTF-16 / WTF-16 surrogate code unit pairs should absolutely be converted to a single non-BMP code point (which is 4 bytes in UTF-8).

Yes, of course. This bug report is about unpaired surrogates.

Conversion occurs when calling system APIs. OsStr::to_str(&self) -> Option<&str> does not allocate or copy.

That's really interesting. I assume it returns None if there are invalid characters in the OsStr?

I know ~nothing about what this code does, but why is the string that was just read from the text input immediately written back to it?

Dominator is a DOM library. In that example it is doing two-way data binding (which is very common with DOM libraries).

Two-way data binding means that when the <input> changes, it will update the Rust app's state. And conversely, when the Rust app's state changes, it updates the <input>. This ensures the <input> is always synchronized with the app's state.

So when the input event happens, it sends the unpaired surrogate to Rust (which mangles the string), and then it updates the app's state with the mangled string. Then, because the app's state has changed it writes that mangled string back into the <input>.

Then when the second input event happens, the <input>'s value has already been replaced with the mangled string, so the second half of the surrogate is unpaired.

Even if it didn't write back the value, this issue would still happen, because the first input event contains an unpaired surrogate. So that still has to be handled regardless.

@SimonSapin

I assume it returns None if there are invalid characters in the OsStr?

Correct, where invalid characters mean unpaired surrogate byte sequences.

Then, because the app's state has changed it writes that mangled string back into the <input>.

If not for this bug, this “write back” would be a no-op, wouldn’t it? If so, why do it at all?

Even if it didn't write back the value, this issue would still happen, because the first input event contains an unpaired surrogate.

Wouldn’t the second event overwrite the effects of the first?

@Pauan
Contributor Author

Pauan commented Mar 15, 2019

Correct, where invalid characters mean unpaired surrogate byte sequences.

Okay, great. Perhaps we could use a similar API to handle this bug.

If not for this bug, this “write back” would be a no-op, wouldn’t it? If so, why do it at all?

Yes, I'm pretty sure it is a no-op (though I'm not 100% sure on that).

The reason is mostly because the APIs are currently designed to never miss any changes, since stale state is really not good.

I have some plans to add in an optimization for this case to avoid the write-back. But in any case it's irrelevant to this issue, and this bug will still happen even without the write-back.

Wouldn’t the second event overwrite the effects of the first?

Not if the app has side effects during the first event.

For example, if the app took that string and sent it to a server, or put it into a file, or put it into a data structure (like a Vec or HashMap), or logged it to the console, or wrote it into a variable... any sort of side effect will run afoul of this bug.

It just so happens that in the dominator example the side effect was writing back to the DOM. But this issue is really not specific to dominator.

@SimonSapin

Yes, such side effects of the first event could observe the intermediate string with a replacement character. But is this fundamentally different from intermediate events observing a string with letters that do not yet form a meaningful word because the user has not finished typing?

Also, even with the conversion made lossless at the DOM ↔ wasm boundary (possibly by having a JsString type that contains WTF-8 bytes), if intermediate events log to a console but the logging API takes a &str, some conversion will still be needed. If making that conversion lossless is important, the only option is using some kind of escaping or serialization to a format that is not “just plain text”, for example JSON-like backslash-u escape sequences.
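
A sketch of that kind of escaping, expressed here over raw UTF-16 code units in Rust (purely illustrative, not a proposed API):

// Convert possibly ill-formed UTF-16 into plain text, escaping any lone
// surrogate as a JSON-style \uXXXX sequence so no information is lost.
fn escape_lone_surrogates(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| match r {
            Ok(c) => c.to_string(),
            Err(e) => format!("\\u{:04X}", e.unpaired_surrogate()),
        })
        .collect()
}

fn main() {
    // A lone high surrogate followed by "hi".
    assert_eq!(escape_lone_surrogates(&[0xD83D, 0x0068, 0x0069]), "\\uD83Dhi");
}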

@Pauan
Contributor Author

Pauan commented Mar 16, 2019

But is this fundamentally different from intermediate events observing a string with letters that do not yet form a meaningful word because the user has not finished typing?

I don't understand this question. This is what is happening:

  1. The user enters a single character into the <input>.

  2. The browser sends two input events, one for each surrogate of the pair.

  3. When the first input event happens, the value of the <input> is not valid UTF-16 (because of the unpaired surrogate), so when the value is sent to Rust it is replaced with U+FFFD (the replacement character).

This is undesirable, because it means the string is wrong: it contains characters that the user never typed.

You could argue that this is a bug in the browser, and it should send a single event. I agree with that, but we cannot change the browsers, so we must workaround this issue.

This is completely unrelated to processing strings for words or anything like that, because in that case the string always contains the characters the user typed, and it is always valid Unicode (and so any mistakes would be logic errors in the word-splitting program).

In addition, the solutions for them are completely different: word splitting would likely require some form of debouncing, whereas we must solve this encoding issue without debouncing.

@alexcrichton
Contributor

So modulo some naming, I think given all this I'd propose a solution that looks like:

  • Add a DomString and DomStr types to the wasm-bindgen crate (again, modulo naming, that's hopefully separate)
  • A DomString internally stores Vec<u16>
  • DomString derefs to DomStr
  • DomStr has APIs such as:
    • iter() - iterator of the u16 values
    • from_iter() - creation from u16 values
    • to_string{,_lossy}() - attempt to interpret as utf-16, returning None for to_string if it fails.
    • Maybe some extend-like APIs, or just exposing the raw underlying Vec<u16> somehow
  • Both DomString and DomStr can be used in boundary APIs the exact same way Vec<u16> is used.
    • Entering WebAssembly we use charCodeAt to create a list
    • Exiting WebAssembly we use fromCharCode and create a concatenated string

The intention here is that there's optionally a type that we can use in wasm-bindgen which is a lossless representation of a string in JS. It isn't necessarily the default, and we'll still support str and String everywhere (although we'll update the documentation to note this gotcha). Additionally I'm thinking that Vec<u16> may be a natural internal representation for now because the main purpose of DomString would be to build something up to later interpret as utf-16. The main alternative would be WTF-8 with Vec<u8> internally, but the only real point of doing that would be to support an allocation-less to_str() method (like OsStr on Windows) and if DomString is only an intermediate type before we interpret UTF-16, it seems like to_str isn't that well motivated for our use case.
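
As a rough, purely hypothetical skeleton of that shape (none of this exists in wasm-bindgen; DomStr and the boundary glue are omitted):

// Hypothetical sketch of the proposed type; not an existing wasm-bindgen API.
pub struct DomString {
    units: Vec<u16>,
}

impl DomString {
    // Creation from raw UTF-16 code units, paired or not.
    pub fn from_units(units: impl IntoIterator<Item = u16>) -> Self {
        DomString { units: units.into_iter().collect() }
    }

    // Iterator over the raw u16 values.
    pub fn iter(&self) -> impl Iterator<Item = u16> + '_ {
        self.units.iter().copied()
    }

    // Attempt a strict interpretation as UTF-16.
    pub fn to_string(&self) -> Option<String> {
        String::from_utf16(&self.units).ok()
    }

    // Always succeeds, replacing lone surrogates with U+FFFD.
    pub fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.units)
    }
}

fn main() {
    let s = DomString::from_units([0xD83D]); // a lone high surrogate
    assert_eq!(s.to_string(), None);
    assert_eq!(s.to_string_lossy(), "\u{FFFD}");
}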

I think I'd probably prefer to also say we shouldn't change js-sys or web-sys at this time. We'll still use String and str (and Text{Decoder,Encoder}) everywhere for those. Over time I think we can possibly consider whitelisting APIs to instead use DomString (or similar), but if @chris-morgan is right that unpaired surrogates are quite rare, we may be able to get by mostly with this. In the meantime, libraries like yours, @Pauan, would probably bind the API internally (e.g. not use web-sys for these APIs where you want DomString)


Ok and for naming, one obvious other name would be JsString and JsStr, which, while I think they are more appropriate, clash with js_sys::JsString, so we'd need to figure out what to do with that.

@fitzgen
Member

fitzgen commented Mar 22, 2019

  • +1 for providing the escape hatch first, not changing js-sys and web-sys immediately, and then re-evaluating based on our experience with the escape hatch.

  • I don't have a strong opinion on ill-formed u16 code points in a Vec<u16> vs using some wtf-8 Vec<u8>. Defer to others here.

  • I'm happy-ish with the name DomString, but don't care much.

    • It is a little misleading when using js-sys/ECMAScript stuff outside of the Web (where there is no DOM)
    • I think we should avoid JsString and JsStr for the reasons you mentioned.
    • More bike shedding:
      • IllFormedUtf16String / IllFormedUtf16Str if we do Vec<u16> and UTF-16 code points
      • Wtf8String / Wtf8Str if we do Vec<u8> and WTF-8

@Pauan
Contributor Author

Pauan commented Mar 23, 2019

The main alternative would be WTF-8 with Vec<u8> internally, but the only real point of doing that would be to support an allocation-less to_str() method (like OsStr on Windows) and if DomString is only an intermediate type before we interpret UTF-16, it seems like to_str isn't that well motivated for our use case.

I'm not sure I agree with that: since DomStr won't be the default, many APIs will continue to accept &str, and it's common to call an API which returns DomString and then pass it to another API (which accepts &str). Requiring an O(n) re-encoding + allocation + copying isn't great.

So if WTF-8 can faithfully represent everything (including private use areas), I think we should use it instead.

In the meantime, libraries like yours, @Pauan, would probably bind the API internally (e.g. not use web-sys for these APIs where you want DomString)

That's fine, though I think it has implications for Gloo (which I'll discuss on that repo).

Ok and for naming, one obvious other name would be JsString and JsStr, which, while I think they are more appropriate, clash with js_sys::JsString, so we'd need to figure out what to do with that.

I think DomString is perfectly fine. It's what the official W3C spec uses, and it doesn't cause confusion with JsString.

I guess you could argue that using DomString for strings which aren't web APIs is a bit weird, but I think in practice most uses will be web APIs.


As another alternative, we could simply add in wasm-bindgen marshalling for the already-existing wtf8 crate (which is internally a Vec<u8>).

That sidesteps the entire naming choice, and also enables no-copying conversion to String and &str.

@alexcrichton
Contributor

It's true, yeah, that DomString to &str requires an allocation and re-encoding, but if we use wtf-8 (or something like it) then we'll always be doing the encoding in JS whenever we enter/exit wasm. I think that's the thing I'm mainly worried about: we can't actually "just use" the wtf-8 crate, but rather we have to write the encoding in JS and maintain it as part of the default bindings.

This feels very similar to the decision of how to represent OsStr on Windows to me. We sort of just made an educated guess about how to represent it. I'm not really sure what patterns of usage we'd have with DomStr vs str.

@Pauan do you feel you've got a good handle on when we'd use DomStr vs str? I think we'll use those patterns as guidance for how to represent the type internally (to minimize encoding costs and such). Can you expand a bit on how you see this being used in gloo? (I'm mostly interested in developing a gut feeling for usage properties of DomStr and frequency)

@chris-morgan

I consider @alexcrichton’s proposed plan to be reasonable. I don’t think WTF-8 is the way forwards with this; it doesn’t solve problems, merely shifts which side you have to write encoding on, Rust or JavaScript. A [u16]-based type seems to me the way to go, especially given the projected rarity of its use.

One correction: from_iter() would be on DomString rather than DomStr.

@Pauan
Contributor Author

Pauan commented Mar 26, 2019

@alexcrichton I think right now we just need a way of detecting if a JS string is valid UTF-16 or not.

I see a couple options:

  1. The as_string method already says that it returns None if it's invalid UTF-8, but that's currently not true.

    So it could be changed so it actually does a validity check and returns None if the JS string isn't valid UTF-16.

  2. Add a new is_valid_utf16_string method to JsValue which does the validity checking and returns bool.

Either solution works fine. With that basic building block, it's now possible for the input event to detect whether the string is valid UTF-16 or not (and skip the event if it's not).

This completely sidesteps the issue: we no longer need to define DomString, and we don't need to use WTF-8 either. And that means the marshalling costs are the same as they currently are, since we're just using String.

@alexcrichton
Contributor

@Pauan you're thinking that in these cases folks would take JsString, learn it's not valid UTF-16, and then manually extract the u16 values? I like the idea of sidestepping the issue! If that can idiomatically solve the problem then that definitely sounds like our best bet.

As to which of those options, I think it makes sense to leave as_string as is and add a new method. We'd want to update the documentation of as_string to say what the implementation does (typeof + TextEncoder) along with these caveats for sure

@Pauan
Contributor Author

Pauan commented Mar 27, 2019

@alexcrichton Well, I see two general use cases:

  1. If an input event contains unpaired surrogates, it should just completely ignore the entire event. That's solved by using JsValue::is_valid_utf16_string.

  2. Somebody actually wants to preserve the unpaired surrogates (for some reason). That case would be handled by DomString (or other similar strategies like Vec<u16>).

I don't actually know of any use cases for 2, so that's why I think solving 1 is good enough.

So we only need JsValue::is_valid_utf16_string, so that the input event can ignore unpaired surrogates.

And if later we actually do need to preserve unpaired surrogates, we can reinvestigate DomString.

Basically, after carefully thinking about this issue, my perspective has softened a lot, so I no longer see the benefit of trying to preserve unpaired surrogates.

@alexcrichton
Contributor

Sounds like a solid plan to me! Would the addition in that case be:

impl JsValue {
    pub fn is_valid_utf16_string(&self) -> bool {
         // some wasm_bindgen intrinsic
    }
}

@Pauan
Contributor Author

Pauan commented Mar 27, 2019

@alexcrichton Yes, exactly.

And the intrinsic would be implemented like this on the JS side (it has to be on the JS side, sadly):

function isValidUtf16String(str) {
    var i = 0;
    var len = str.length;

    while (i < len) {
        var char = str.charCodeAt(i);

        ++i;

        // Might be surrogate pair...
        if (char >= 0xD800) {
            // First half of surrogate pair
            if (char <= 0xDBFF) {
                // End of string
                if (i === len) {
                    return false;

                } else {
                    var next = str.charCodeAt(i);

                    // No second half
                    if (next < 0xDC00 || next > 0xDFFF) {
                        return false;
                    }

                    // Valid pair: skip past the low surrogate so it is not
                    // misread as a lone surrogate on the next iteration.
                    ++i;
                }

            // Second half of surrogate pair
            } else if (char <= 0xDFFF) {
                return false;
            }
        }
    }

    return true;
}

(I loosely translated it from the DOMString spec by using some tips from the TextEncoder polyfill.)

It looks big, but it minifies quite well, and it should be extremely fast.
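
For comparison, the same check expressed on the Rust side over raw UTF-16 code units (a sketch for testing the logic only; the real check has to run in JS, before the string is copied and re-encoded):

// Returns true if the UTF-16 code units contain no unpaired surrogates.
fn is_valid_utf16(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

fn main() {
    assert!(is_valid_utf16(&[0xD83D, 0xDE00])); // a properly paired surrogate
    assert!(!is_valid_utf16(&[0xD83D]));        // lone high surrogate
    assert!(!is_valid_utf16(&[0xDE00, 0x0041])); // lone low surrogate
}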

@lambda-fairy

If we're leaving as_string as-is, can we at least rename it to as_string_lossy?

That would be more consistent with the standard library, where String::from_utf16_lossy does the same thing (and String::from_utf16 returns an error).

@Pauan
Contributor Author

Pauan commented Mar 28, 2019

@lfairy I agree, but that will require a deprecation/breaking change cycle, so it's best to batch that together with a bunch of other breaking changes.

@alexcrichton
Contributor

@Pauan I think whether we do it now depends on whether we want to rename at all, because if we do want to rename, adding a deprecation is actually pretty easy to handle at any time.

I'm less certain though that we'd want to rename. It seems to me that it's far more common to use String and str, which don't carry a "lossy" warning, than as_string. In that sense renaming as_string may not actually buy us much because it may not be used all that much.

I think we probably want to just update documentation to indicate the hazard?

@Pauan
Contributor Author

Pauan commented Mar 29, 2019

@alexcrichton I don't have strong opinions about it: I really like the idea of being consistent with the Rust stdlib, but breaking changes aren't fun.

And as you say, any bindings that return String are lossy, and without any warnings! So I guess some doc changes are good enough.

alexcrichton added a commit to alexcrichton/wasm-bindgen that referenced this issue Apr 1, 2019
This commit aims to address rustwasm#1348 via a number of strategies:

* Documentation is updated to warn about UTF-16 vs UTF-8 problems
  between JS and Rust. Notably documenting that `as_string` and handling
  of arguments is lossy when there are lone surrogates.

* A `JsString::is_valid_utf16` method was added to test whether
  `as_string` is lossless or not.

The intention is that most default behavior of `wasm-bindgen` will
remain, but where necessary bindings will use `JsString` instead of
`str`/`String` and will manually check for `is_valid_utf16` as
necessary. It's also hypothesized that this is relatively rare and not
too performance critical, so an optimized intrinsic for `is_valid_utf16`
is not yet provided.

Closes rustwasm#1348
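
Assuming the `JsString::is_valid_utf16` method lands as described in that commit message, a minimal usage sketch (only converting when the conversion is known to be lossless) could look like:

use js_sys::JsString;

// Sketch only: convert to a Rust String when no lone surrogates are present,
// otherwise keep working with the JsString handle.
fn to_rust_string(s: &JsString) -> Option<String> {
    if s.is_valid_utf16() {
        s.as_string()
    } else {
        None
    }
}
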
@hsivonen

hsivonen commented Apr 2, 2019

This has a lot of very deep implications, including for international languages (e.g. Chinese).

The original issue cites emoji input, not Chinese. Are you actually able to reproduce this with Chinese? (Perhaps by enabling HKSCS characters in Microsoft Bopomofo (in pre-Windows 10 "Advanced" settings); I have no idea how to type Hong Kong characters with a Mandarin pronunciation-based IME. Or more obviously, by enabling HKSCS characters in Microsoft Changjie (directly in the Windows 10-style settings).)

Which browsers does the original emoji issue reproduce in? Is there a minimal non-wasm JS-only test case that demonstrates the issue?

For this IME-triggered event issue, checking the whole string for lone surrogates, using WTF-8, or making UTF-16 to UTF-8 conversion report unpaired surrogates are all overkill: it's enough to check if the last code unit of the JS string is an unpaired surrogate and discard the event if so. Please don't make all DOM APIs use OsStrs unless it's truly, truly, backed up by use cases.

@Pauan
Contributor Author

Pauan commented Apr 2, 2019

The original issue cites emoji input, not Chinese. Are you actually able to reproduce this with Chinese?

I haven't tested it, though all Unicode characters outside the Basic Multilingual Plane (that is, above U+FFFF) are encoded with surrogate pairs, and this issue is about surrogate pairs, so I would expect them to behave the same. Maybe they don't, though.

Is there a minimal non-wasm JS-only test case that demonstrates the issue?

I'll work on creating a reduced test case!

It's enough to check if the last code unit of the JS string is an unpaired surrogate and discard the event if so.

I don't think that's true: the user can input text anywhere in the <input> (not just as the last character!)

Please don't make all DOM APIs use OsStrs unless it's truly, truly, backed up by use cases.

We have no plans to use OsStr (or WTF-8, or anything like that).

The plan is to make the value method on gloo_events::InputEvent return Option<String> (where it returns None if there are unpaired surrogates). That's it. It's basically the simplest and least-invasive change possible.

alexcrichton added a commit to alexcrichton/wasm-bindgen that referenced this issue Apr 2, 2019
@Pauan
Contributor Author

Pauan commented Apr 2, 2019

Okay, I created a reduced test case: http://paaru.pbworks.com/w/file/fetch/132853233/input-bug.html

I tried to reproduce it with Chinese, but I wasn't able to.

It turns out that Chinese characters are not encoded with surrogate pairs, since they're in the BMP. So it's only very rare historical Chinese characters which might be affected.

And since (as far as I can tell) Windows doesn't have any way to input those characters, I wasn't able to reproduce that.

However, I was able to reproduce it with emoji characters with the following browsers:

  • Mozilla Firefox Developer Edition 67.0b7 (64-bit)
  • Google Chrome 73.0.3683.86 (Official Build) (64-bit)

Interestingly, I could not reproduce it with these browsers (they do not have this bug):

  • Microsoft Edge 42.17134.1.0
  • Microsoft Internet Explorer 11.648.17134.0

P.S. I also verified that this bug happens if the user enters the emoji anywhere in the <input>, so checking the final character is not sufficient.

@hsivonen

hsivonen commented Apr 3, 2019

Okay, I created a reduced test case: http://paaru.pbworks.com/w/file/fetch/132853233/input-bug.html

Thank you. Confirmed with the Windows 10 touch keyboard and Firefox.

However, the emoji picker integrated into the Microsoft Pinyin IME does not reproduce this problem, which suggests that the touch keyboard is trying to generate the text using a non-IME interface to the app.

It turns out that Chinese characters are not encoded with surrogate pairs, since they're in the BMP. So it's only very rare historical Chinese characters which might be affected.

I mentioned HKSCS in my previous comment, because some Hong Kong characters that may be rare but not necessarily historical are assigned to CJK Unified Ideographs Extension B. For example, Apple Cangjie, IBus (Ubuntu, etc.) Cangjie, and Android GBoard Cangjie allow 𥄫 to be entered by typing bunhe (assuming QWERTY keycaps). Even after enabling HKSCS for Microsoft ChangJie (sic), I'm unable to enter the character using Microsoft's implementation.

This is on Windows 10 1803. Another thing to test on a newer Windows 10 would be enabling the Adlam keyboard (Adlam is assigned to Plane 1) and seeing how it behaves.

P.S. I also verified that this bug happens if the user enters the emoji anywhere in the <input>, so checking the final character is not sufficient.

OK.

@hsivonen

hsivonen commented Apr 3, 2019

For example, Apple Cangjie, IBus (Ubuntu, etc.) Cangie, and Android GBoard Cangjie allow 𥄫 to be entered by typing bunhe (assuming QWERTY keycaps).

I should mention that none of these appear to expose an unpaired surrogate in Firefox.

@hsivonen

hsivonen commented Apr 3, 2019

Entering astral characters with Microsoft Bopomofo doesn't appear to expose unpaired surrogates. (To test: In IME properties under Output Settings in Character Set, set the radio button to Unicode and check all boxes. Then type e.g. 1i6, press down arrow, press right arrow and then use the arrow keys and enter to choose any of the red characters in the palette.)

@hsivonen

hsivonen commented Apr 3, 2019

Gecko bug, Chrome bug

alexcrichton added a commit to alexcrichton/wasm-bindgen that referenced this issue Apr 3, 2019
alexcrichton added a commit to alexcrichton/wasm-bindgen that referenced this issue Apr 4, 2019
alexcrichton added a commit to alexcrichton/wasm-bindgen that referenced this issue Apr 5, 2019