Passing a String to JS removes BOM (FEFF) #1729

amesgen · 2019-08-21T16:11:11Z

Summary

Consider this function:

#[wasm_bindgen]
fn foo() -> String { "\u{feff}bar".to_owned() }

But in JS, I only get the string "bar": the BOM is stripped. Is this intended? I can't find any information about this, either here or here.

Additional Details

I use wasm-bindgen = "0.2.50" and rustc 1.39.0-nightly (bea0372a1 2019-08-20).


Pauan · 2019-08-21T17:56:07Z

This is expected. Rust strings are UTF-8, but JavaScript strings are UTF-16, so wasm-bindgen has to transcode the string.

This transcoding doesn't necessarily round-trip (e.g. it strips out a leading BOM, and also replaces unpaired surrogates with the replacement character).

This isn't caused by Rust or wasm-bindgen; it's how JavaScript's TextDecoder behaves by default. Here's a JavaScript program to demonstrate:

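var input = "\u{feff}bar";
// TextEncoder always produces UTF-8 and takes no encoding argument.
var output = new TextDecoder("utf-8").decode(new TextEncoder().encode(input));
console.log(input[0]);  // "\ufeff" (the BOM)
console.log(output[0]); // "b" (the BOM is gone)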

As you can see, it stripped out the BOM.
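The other lossy case mentioned above, unpaired surrogates, can be checked the same way. This is a minimal sketch, assuming an environment with global TextEncoder and TextDecoder (a browser or a recent Node.js); the variable names are illustrative only:

```javascript
// A lone high surrogate is a valid JS string element but has no UTF-8 encoding.
const lone = "a\ud800b";
const bytes = new TextEncoder().encode(lone);

// TextEncoder substitutes U+FFFD (bytes EF BF BD) for the unpaired surrogate.
console.log(Array.from(bytes, (b) => b.toString(16)).join(" ")); // 61 ef bf bd 62

// Decoding therefore yields the replacement character, not the original code unit.
const back = new TextDecoder("utf-8").decode(bytes);
console.log(back === "a\ufffdb"); // true
```

Unlike the BOM case, this loss happens on the encoding side (JS to UTF-8), so no TextDecoder option can undo it.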

Pauan · 2019-08-21T17:59:11Z

After looking into it some more, there is an ignoreBOM option, which causes the BOM to round-trip:

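var input = "\u{feff}bar";
// ignoreBOM: true tells the decoder to keep a leading BOM in the output.
var output = new TextDecoder("utf-8", { ignoreBOM: true }).decode(new TextEncoder().encode(input));
console.log(input[0]);  // "\ufeff" (the BOM)
console.log(output[0]); // "\ufeff" (the BOM is preserved)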

So we could change wasm-bindgen so it does that. But I'm curious: what's your use case?

(We should probably also set fatal: true, so it will catch any weird encoding bugs)
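To illustrate what fatal: true changes, here is a small sketch that is not taken from wasm-bindgen itself. It assumes global TextDecoder support for the fatal option, and the byte values are made up for the example:

```javascript
// 0xFF can never appear in well-formed UTF-8.
const invalid = new Uint8Array([0x62, 0xff, 0x61]);

// Default (fatal: false): the bad byte silently becomes U+FFFD.
const lenient = new TextDecoder("utf-8").decode(invalid);
console.log(lenient === "b\ufffda"); // true

// fatal: true: decoding throws a TypeError instead of substituting.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalid);
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```

Since a Rust String is always valid UTF-8, the strict decoder should only ever throw if the bytes were corrupted or the wrong memory range was read, which is the kind of bug worth surfacing loudly.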

amesgen · 2019-08-21T18:07:12Z

I think setting ignoreBOM to true is a good idea, as there is no fundamental reason why UTF-8 (which has no BOM in Rust) -> UTF-16 should remove BOMs (in contrast to the unpaired surrogates, which can't be ignored when converting UTF-16 -> UTF-8).

IMO it would be nice to guarantee that Rust -> JS -> Rust is the identity, and JS -> Rust -> JS replaces unpaired surrogates with �, but nothing else.
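The two properties can be sketched in plain JavaScript. The toJs and toRust helpers below are hypothetical stand-ins for the string boundary, not wasm-bindgen's actual glue code, and the sketch assumes the decoder is configured with both ignoreBOM and fatal as discussed above:

```javascript
// Hypothetical helpers standing in for the Rust <-> JS string boundary.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8", { ignoreBOM: true, fatal: true });

// Rust -> JS: UTF-8 bytes become a JS string.
function toJs(utf8Bytes) {
  return decoder.decode(utf8Bytes);
}

// JS -> Rust: a JS string becomes UTF-8 bytes.
function toRust(jsString) {
  return encoder.encode(jsString);
}

// Rust -> JS -> Rust: the bytes come back unchanged, BOM included.
const rustBytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x62, 0x61, 0x72]); // "\u{feff}bar"
const roundTripped = toRust(toJs(rustBytes));
console.log(roundTripped.join() === rustBytes.join()); // true

// JS -> Rust -> JS: only the unpaired surrogate changes; the BOM survives.
const jsResult = toJs(toRust("\ufeffa\udc00"));
console.log(jsResult === "\ufeffa\ufffd"); // true
```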

Concerning my "use case": I just pass (user-supplied) strings from Rust to JS and back again, and I was very confused that this changes some strings - this is just a minimal repro.

amesgen added the question label Aug 21, 2019

Pauan mentioned this issue Aug 21, 2019

Adding ignoreBOM and fatal to TextDecoder #1730

Merged

alexcrichton closed this as completed in #1730 Aug 23, 2019
