Add warnings about UTF-16 vs UTF-8 strings
This commit aims to address #1348 via a number of strategies:

* Documentation is updated to warn about UTF-16 vs UTF-8 problems
  between JS and Rust. Notably documenting that `as_string` and handling
  of arguments is lossy when there are lone surrogates.

* A `JsString::is_valid_utf16` method was added to test whether
  `as_string` is lossless or not.

The intention is that most default behavior of `wasm-bindgen` will
remain, but where necessary bindings will use `JsString` instead of
`str`/`String` and will manually check for `is_valid_utf16` as
necessary. It's also hypothesized that this is relatively rare and not
too performance critical, so an optimized intrinsic for `is_valid_utf16`
is not yet provided.

Closes #1348
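
The check-then-convert pattern described in this message can be sketched in plain Rust, outside of Wasm. This is only an illustration under stated assumptions: `Decoded` and `decode_checked` are hypothetical names, and a `&[u16]` slice stands in for the code units that `JsString::iter` would yield. Only `std::char::decode_utf16` and `String::from_utf16` are real standard-library APIs here.

```rust
use std::char::decode_utf16;

/// Hypothetical result type: a lossless Rust string, or the raw UTF-16
/// code units when the input contains a lone surrogate.
#[derive(Debug, PartialEq)]
enum Decoded {
    Lossless(String),
    Raw(Vec<u16>),
}

/// Same logic as the `is_valid_utf16` method in this commit, but applied
/// to a slice of code units instead of a `JsString` (an assumption made so
/// the sketch runs without a JS host).
fn is_valid_utf16(units: &[u16]) -> bool {
    decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

/// Converts to `String` only when the conversion is lossless; otherwise
/// keeps the code units exactly as they were.
fn decode_checked(units: &[u16]) -> Decoded {
    if is_valid_utf16(units) {
        // `from_utf16` only fails on unpaired surrogates, ruled out above.
        Decoded::Lossless(String::from_utf16(units).expect("validated above"))
    } else {
        Decoded::Raw(units.to_vec())
    }
}
```

With this shape a caller pays for the validity scan once and never silently receives replacement characters; whether that trade-off suits a given binding is a judgment call.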
alexcrichton committed Apr 2, 2019
1 parent c5f18b6 commit c8fdfac
Showing 5 changed files with 88 additions and 0 deletions.
29 changes: 29 additions & 0 deletions crates/js-sys/src/lib.rs
@@ -3522,6 +3522,35 @@ impl JsString {
None
}
}

/// Returns whether this string is a valid UTF-16 string.
///
/// This is useful for learning whether `as_string()` will return a lossless
/// representation of the JS string. If this string contains unpaired
/// surrogates then `as_string()` will succeed but it will be a lossy
/// representation of the JS string because lone surrogates will become
/// replacement characters.
///
/// If this function returns `false` then to get a lossless representation
/// of the string you'll need to manually use the `char_code_at` accessor to
/// access the raw code units.
///
/// For more information, see the documentation on [JS strings vs Rust
/// strings][docs]
///
/// [docs]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
pub fn is_valid_utf16(&self) -> bool {
std::char::decode_utf16(self.iter()).all(|i| i.is_ok())
}

/// Returns an iterator over the u16 character codes that make up this JS
/// string.
///
/// This method will call `char_code_at` for each code in this JS string,
/// returning an iterator of the codes in sequence.
pub fn iter<'a>(&'a self) -> impl ExactSizeIterator<Item = u16> + 'a {
(0..self.length()).map(move |i| self.char_code_at(i) as u16)
}
}

impl PartialEq<str> for JsString {
12 changes: 12 additions & 0 deletions crates/js-sys/tests/wasm/JsString.rs
@@ -541,3 +541,15 @@ fn raw() {
);
assert!(JsString::raw_0(&JsValue::null().unchecked_into()).is_err());
}

#[wasm_bindgen_test]
fn is_valid_utf16() {
assert!(JsString::from("a").is_valid_utf16());
assert!(JsString::from("").is_valid_utf16());
assert!(JsString::from("🥑").is_valid_utf16());
assert!(JsString::from("Why hello there this, 🥑, is 🥑 and is 🥑").is_valid_utf16());

assert!(JsString::from_char_code1(0x00).is_valid_utf16());
assert!(!JsString::from_char_code1(0xd800).is_valid_utf16());
assert!(!JsString::from_char_code1(0xdc00).is_valid_utf16());
}
22 changes: 22 additions & 0 deletions guide/src/reference/types/str.md
@@ -20,3 +20,25 @@ with handles to JavaScript string values, use the `js_sys::JsString` type.
```js
{{#include ../../../../examples/guide-supported-types-examples/str.js}}
```

## UTF-16 vs UTF-8

Strings in JavaScript are encoded as UTF-16, but with one major exception: they
can contain unpaired surrogates. For some Unicode characters UTF-16 uses two
16-bit values. These are called "surrogate pairs" because they always come in
pairs. In JavaScript, it is possible for these surrogate pairs to be missing the
other half, creating an "unpaired surrogate".

When passing a string from JavaScript to Rust, `wasm-bindgen` uses the `TextEncoder` API to
convert from UTF-16 to UTF-8. This is normally perfectly fine... unless there
are unpaired surrogates. In that case it will replace the unpaired surrogates
with U+FFFD (�, the replacement character). That means the string in Rust is
now different from the string in JavaScript!

If you want to guarantee that the Rust string is the same as the JavaScript
string, you should instead use `js_sys::JsString` (which keeps the string in
JavaScript and doesn't copy it into Rust).

If you simply want to ignore strings which contain unpaired surrogates, you can
use `JsString::is_valid_utf16` to test whether the string contains unpaired
surrogates or not.
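
As an illustration that is not part of the committed guide text, the lossy replacement can be reproduced in plain Rust. The sketch assumes that `String::from_utf16_lossy` from the standard library has the same effect on lone surrogates as the `TextEncoder` conversion described above (each one becomes U+FFFD). `to_rust_lossy` and `round_trips` are hypothetical helper names, and a `&[u16]` slice stands in for the JavaScript string.

```rust
/// Converts UTF-16 code units to a Rust `String`, replacing every lone
/// surrogate with U+FFFD (the replacement character).
fn to_rust_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Returns `true` when converting to a Rust string and re-encoding as
/// UTF-16 gives back exactly the original code units.
fn round_trips(units: &[u16]) -> bool {
    let s = to_rust_lossy(units);
    s.encode_utf16().eq(units.iter().copied())
}
```

For the code units `[0x68, 0xD800, 0x69]`, `to_rust_lossy` yields `"h\u{FFFD}i"`, so `round_trips` reports `false`: the Rust string no longer matches the original.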
3 changes: 3 additions & 0 deletions guide/src/reference/types/string.md
@@ -8,6 +8,9 @@ Copies the string's contents back and forth between the JavaScript
garbage-collected heap and the Wasm linear memory with `TextDecoder` and
`TextEncoder`

> **Note**: Be sure to check out the [documentation for `str`](str.html) to
> learn about some caveats when working with strings between JS and Rust.

## Example Rust Usage

```rust
22 changes: 22 additions & 0 deletions src/lib.rs
@@ -260,6 +260,28 @@ impl JsValue {
///
/// If this JS value is not an instance of a string or if it's not valid
/// utf-8 then this returns `None`.
///
/// # UTF-16 vs UTF-8
///
/// Strings in JavaScript are encoded as UTF-16, but with one major
/// exception: they can contain unpaired surrogates. For some Unicode
/// characters UTF-16 uses two 16-bit values. These are called "surrogate
/// pairs" because they always come in pairs. In JavaScript, it is possible
/// for these surrogate pairs to be missing the other half, creating an
/// "unpaired surrogate".
///
/// This method internally will use `TextEncoder` in JS to convert from
/// UTF-16 to UTF-8. This is normally perfectly fine... unless there are
/// unpaired surrogates. In that case it will replace the unpaired
/// surrogates with U+FFFD (�, the replacement character). That means the
/// string returned from this function is actually different from the string
/// in JS!
///
/// If you want to guarantee that the returned string is the same as the
/// source string, you'll need to use `js_sys::JsString::is_valid_utf16`. If
/// that function returns `true` then this function will be lossless. If the
/// string isn't valid utf-16, however, then you'll need to use
/// `JsString::iter` and retain the u16 values exactly.
#[cfg(feature = "std")]
pub fn as_string(&self) -> Option<String> {
unsafe {