Add warnings about UTF-16 vs UTF-8 strings #1416

Merged: 1 commit, Apr 5, 2019
31 changes: 31 additions & 0 deletions crates/js-sys/src/lib.rs
@@ -3522,6 +3522,37 @@ impl JsString {
None
}
}

/// Returns whether this string is a valid UTF-16 string.
///
/// This is useful for learning whether `String::from(..)` will return a
/// lossless representation of the JS string. If this string contains
/// unpaired surrogates then `String::from` will succeed but it will be a
/// lossy representation of the JS string because unpaired surrogates will
/// become replacement characters.
///
/// If this function returns `false` then to get a lossless representation
/// of the string you'll need to manually use the `iter` method (or the
/// `char_code_at` accessor) to view the raw character codes.
///
/// For more information, see the documentation on [JS strings vs Rust
/// strings][docs].
///
/// [docs]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
pub fn is_valid_utf16(&self) -> bool {
std::char::decode_utf16(self.iter()).all(|i| i.is_ok())
}

/// Returns an iterator over the `u16` character codes that make up this JS
/// string.
///
/// This method will call `char_code_at` for each code in this JS string,
/// returning an iterator of the codes in sequence.
pub fn iter<'a>(
&'a self,
) -> impl ExactSizeIterator<Item = u16> + DoubleEndedIterator<Item = u16> + 'a {
(0..self.length()).map(move |i| self.char_code_at(i) as u16)
}
}

impl PartialEq<str> for JsString {
12 changes: 12 additions & 0 deletions crates/js-sys/tests/wasm/JsString.rs
@@ -541,3 +541,15 @@ fn raw() {
);
assert!(JsString::raw_0(&JsValue::null().unchecked_into()).is_err());
}

#[wasm_bindgen_test]
fn is_valid_utf16() {
assert!(JsString::from("a").is_valid_utf16());
assert!(JsString::from("").is_valid_utf16());
assert!(JsString::from("🥑").is_valid_utf16());
assert!(JsString::from("Why hello there this, 🥑, is 🥑 and is 🥑").is_valid_utf16());

assert!(JsString::from_char_code1(0x00).is_valid_utf16());
assert!(!JsString::from_char_code1(0xd800).is_valid_utf16());
assert!(!JsString::from_char_code1(0xdc00).is_valid_utf16());
}
7 changes: 6 additions & 1 deletion examples/without-a-bundler/index.html
@@ -25,7 +25,12 @@
// Also note that the promise, when resolved, yields the wasm module's
// exports which is the same as importing the `*_bg` module in other
// modes
await init('./pkg/without_a_bundler_bg.wasm');
// await init('./pkg/without_a_bundler_bg.wasm');

const url = await fetch('http://localhost:8001/pkg/without_a_bundler_bg.wasm');
const body = await url.arrayBuffer();
const module = await WebAssembly.compile(body);
await init(module);

// And afterwards we can use all the functionality defined in wasm.
const result = add(1, 2);
27 changes: 27 additions & 0 deletions guide/src/reference/types/str.md
@@ -20,3 +20,30 @@ with handles to JavaScript string values, use the `js_sys::JsString` type.
```js
{{#include ../../../../examples/guide-supported-types-examples/str.js}}
```

## UTF-16 vs UTF-8

Strings in JavaScript are encoded as UTF-16, but with one major exception: they
can contain unpaired surrogates. For some Unicode characters UTF-16 uses two
16-bit values. These are called "surrogate pairs" because they always come in
pairs. In JavaScript, it is possible for one half of a surrogate pair to be
missing, creating an "unpaired surrogate".

When passing a string from JavaScript to Rust, wasm-bindgen uses the
`TextEncoder` API to convert from UTF-16 to UTF-8. This is normally perfectly
fine, unless there are unpaired surrogates. In that case each unpaired
surrogate is replaced with U+FFFD (�, the replacement character). That means
the string in Rust is now different from the string in JavaScript!
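
The replacement behaviour described above can be reproduced with the Rust
standard library alone. The following is only a sketch: it models a JavaScript
string as a slice of `u16` code units, and the helper name
`to_rust_string_lossy` is hypothetical (it is not part of `wasm-bindgen` or
`js-sys`). It uses `String::from_utf16_lossy`, which, like the conversion
described here, substitutes U+FFFD for each unpaired surrogate.

```rust
/// Hypothetical helper (not a wasm-bindgen API): lossily converts UTF-16
/// code units into a Rust `String`. Every unpaired surrogate becomes
/// U+FFFD, so the original code unit cannot be recovered afterwards.
fn to_rust_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For example, the code units `[0x61, 0xD800, 0x62]` ("a", a lone high
surrogate, "b") come back as `"a\u{FFFD}b"`, which no longer matches the
original sequence.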

If you want to guarantee that the Rust string is the same as the JavaScript
string, you should instead use `js_sys::JsString` (which keeps the string in
JavaScript and doesn't copy it into Rust).

If you want to access the raw value of a JS string, you can use `JsString::iter`,
which returns an `Iterator<Item = u16>`. This perfectly preserves everything
(including unpaired surrogates), but it does not do any encoding (so you
have to do that yourself!).

If you simply want to ignore strings which contain unpaired surrogates, you can
use `JsString::is_valid_utf16` to test whether the string contains unpaired
surrogates or not.
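
As a rough illustration of doing the decoding yourself, the sketch below
decodes a sequence of `u16` code units strictly, failing on the first unpaired
surrogate instead of replacing it. It assumes the code units have already been
collected into a slice (for instance from `JsString::iter`), and the function
name `decode_strict` is hypothetical. It relies on `std::char::decode_utf16`,
the same standard-library function that `is_valid_utf16` uses in this PR.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Hypothetical helper (not a js-sys API): strictly decodes UTF-16 code
/// units. Returns the decoded string when every surrogate is correctly
/// paired, or the first unpaired surrogate code unit otherwise.
fn decode_strict(units: &[u16]) -> Result<String, u16> {
    decode_utf16(units.iter().copied())
        // Collecting into `Result` stops at the first decoding error.
        .collect::<Result<String, DecodeUtf16Error>>()
        // Report the offending code unit rather than the opaque error type.
        .map_err(|e| e.unpaired_surrogate())
}
```

A caller could use the `Err` case to skip the string, or fall back to working
with the raw code units when a lossless representation is required.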
3 changes: 3 additions & 0 deletions guide/src/reference/types/string.md
@@ -8,6 +8,9 @@ Copies the string's contents back and forth between the JavaScript
garbage-collected heap and the Wasm linear memory with `TextDecoder` and
`TextEncoder`

> **Note**: Be sure to check out the [documentation for `str`](str.html) to
> learn about some caveats when working with strings between JS and Rust.

## Example Rust Usage

```rust
10 changes: 10 additions & 0 deletions src/lib.rs
@@ -260,6 +260,16 @@ impl JsValue {
///
/// If this JS value is not an instance of a string or if it's not valid
/// utf-8 then this returns `None`.
///
/// # UTF-16 vs UTF-8
///
/// JavaScript strings in general are encoded as UTF-16, but Rust strings
/// are encoded as UTF-8. As a result, the Rust string can sometimes differ
/// from the JS string. For more details see the
/// [documentation about the `str` type][caveats] which contains a few
/// caveats about the encodings.
///
/// [caveats]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
#[cfg(feature = "std")]
pub fn as_string(&self) -> Option<String> {
unsafe {