Passing a String to JS removes BOM (FEFF) #1729

Closed
amesgen opened this issue Aug 21, 2019 · 3 comments · Fixed by #1730

amesgen commented Aug 21, 2019

Summary

Consider this function:

#[wasm_bindgen]
fn foo() -> String { "\u{feff}bar".to_owned() }

In JS, however, I only get the string "bar": the BOM is stripped. Is this intended? I can't find any information about this either here or here.

Additional Details

I use wasm-bindgen = "0.2.50" and rustc 1.39.0-nightly (bea0372a1 2019-08-20).

Contributor

Pauan commented Aug 21, 2019

This is expected. Rust strings are UTF-8, but JavaScript strings are UTF-16, so it has to transcode the string.

This transcoding doesn't necessarily round-trip (e.g. it strips out BOM, and also replaces unpaired surrogates with the replacement character).

This isn't caused by Rust or wasm-bindgen; it's the default behavior of JavaScript's TextDecoder. Here's a JavaScript program to demonstrate:

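var input = "\u{feff}bar";
// TextEncoder always produces UTF-8 and takes no arguments.
var output = new TextDecoder("utf-8").decode(new TextEncoder().encode(input));
console.log(input[0]);  // "\uFEFF" (the BOM)
console.log(output[0]); // "b" (the BOM was stripped)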

As you can see, it stripped out the BOM.
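The unpaired-surrogate case mentioned above can be checked the same way. This is a minimal sketch in plain JavaScript, not wasm-bindgen's actual glue code; the variable names are chosen here for illustration only.

```javascript
// A lone high surrogate is allowed in a JavaScript string,
// but it has no valid UTF-8 encoding.
const lone = "a\uD800b";

// TextEncoder substitutes U+FFFD (bytes EF BF BD) for the lone surrogate.
const bytes = new TextEncoder().encode(lone);
const roundTripped = new TextDecoder("utf-8").decode(bytes);

console.log(bytes.length);                 // 5 (1 + 3 + 1 bytes)
console.log(roundTripped === "a\uFFFDb");  // true
```

Unlike the BOM case, no decoder option brings the surrogate back, because it is already lost at the encoding step.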

Contributor

Pauan commented Aug 21, 2019

After looking into it some more, there is an ignoreBOM option, which causes BOM to round-trip:

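var input = "\u{feff}bar";
// ignoreBOM: true keeps a leading BOM in the decoded string.
var output = new TextDecoder("utf-8", { ignoreBOM: true }).decode(new TextEncoder().encode(input));
console.log(input[0]);  // "\uFEFF" (the BOM)
console.log(output[0]); // "\uFEFF" (the BOM survived)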

So we could change wasm-bindgen so it does that. But I'm curious: what's your use case?

(We should probably also set fatal: true, so that invalid UTF-8 throws an error instead of being silently replaced with the replacement character.)
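As a rough illustration of what fatal: true changes, the sketch below decodes a deliberately invalid byte sequence with and without the option. The names `strict`, `lenient` and `invalid` are made up for this example, and nothing here is taken from wasm-bindgen's generated code.

```javascript
const lenient = new TextDecoder("utf-8", { ignoreBOM: true });
const strict = new TextDecoder("utf-8", { ignoreBOM: true, fatal: true });

// 0xFF can never appear in well-formed UTF-8.
const invalid = new Uint8Array([0x62, 0xff, 0x61]);

// Without fatal, the bad byte silently becomes U+FFFD.
console.log(lenient.decode(invalid) === "b\uFFFDa"); // true

// With fatal, decoding throws a TypeError instead.
let threw = false;
try {
  strict.decode(invalid);
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```

Since a Rust String is always valid UTF-8, the strict decoder should only throw when something has gone wrong in between, which is the kind of bug the option is meant to surface.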

Author

amesgen commented Aug 21, 2019

I think setting ignoreBOM to true is a good idea, as there is no fundamental reason why UTF-8 (which has no BOM in Rust) -> UTF-16 should remove BOMs (in contrast to the unpaired surrogates, which can't be ignored when converting UTF-16 -> UTF-8).

IMO it would be nice to guarantee that Rust -> JS -> Rust is the identity, and JS -> Rust -> JS replaces unpaired surrogates with �, but nothing else.

Concerning my "use case": I just pass (user-supplied) strings from Rust to JS and back again, and I was very confused that this changes some strings - this is just a minimal repro.
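The Rust -> JS -> Rust identity proposed above can be modeled at the byte level in plain JavaScript. This is only a sketch under the assumption that the boundary amounts to a UTF-8 decode followed by a UTF-8 encode; `roundTrip` and `sameBytes` are hypothetical helpers, not wasm-bindgen functions.

```javascript
// UTF-8 bytes of the Rust string "\u{feff}bar": EF BB BF 62 61 72
const fromRust = new Uint8Array([0xef, 0xbb, 0xbf, 0x62, 0x61, 0x72]);

// Decode to a JS string, then encode back to UTF-8 bytes.
function roundTrip(bytes, ignoreBOM) {
  const s = new TextDecoder("utf-8", { ignoreBOM }).decode(bytes);
  return new TextEncoder().encode(s);
}

// Element-wise comparison of two byte arrays.
const sameBytes = (a, b) =>
  a.length === b.length && a.every((x, i) => x === b[i]);

console.log(sameBytes(roundTrip(fromRust, true), fromRust));  // true
console.log(sameBytes(roundTrip(fromRust, false), fromRust)); // false
```

With ignoreBOM set to false, the three BOM bytes are dropped and only the bytes for "bar" come back.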
