Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document and address WebAssembly constraints #73

Closed
sffc opened this issue May 2, 2020 · 10 comments
Closed

Document and address WebAssembly constraints #73

sffc opened this issue May 2, 2020 · 10 comments
Assignees
Labels
A-design Area: Architecture or design A-ffi Area: FFI, WebAssembly, Transpilation C-meta Component: Relating to ICU4X as a whole R-fixed-elsewhere Resolution: issue was fixed outside the core repo T-docs-tests Type: Code change outside core library
Milestone

Comments

@sffc
Copy link
Member

sffc commented May 2, 2020

@hagbard wrote a doc with some constraints that WebAssembly brings that make it not a viable option for ICU4X porting to the Web Platform. We should:

  1. Add this document to GitHub
  2. Raise these concerns to the WebAssembly standards group to make progress on addressing those problems in future WebAssembly releases
@sffc sffc added the T-docs-tests Type: Code change outside core library label May 2, 2020
@hsivonen
Copy link
Member

hsivonen commented May 5, 2020

Raise these concerns to the WebAssembly standards group to make progress on addressing those problems in future WebAssembly releases

These numbers are relevant for crossing the WebIDL boundary. In that case, it was assumed that the wasm side needs to see UTF-8 anyway.

Since ICU4X will be able to operate on UTF-16, it would be worthwhile to consider a non-WebIDL method on JS String that takes a Uint16Array view to a SharedArrayBuffer and writes the string contents there.

TC39 may want to consider:

  • A method on JS String that takes an enable-shared Uint8Array and returns in the most efficient way possible (two-item array?) the number of code units read and the number of code units written with the semantics copied from TextEncoder.encodeInto.
  • A method on JS String that takes an enable-shared Uint16Array and writes as many code units as fit as UTF-16. (The number of code units written is implicitly the minimum of the string length and the buffer length.)
  • A constructor that takes an enable-shared Uint16Array and constructs a string with the potentially-invalid UTF-16 contents of that buffer. (Probably preserving unpaired surrogates or having an option to replace them with the REPLACEMENT CHARACTER.)
  • A constructor that takes an enable-shared Uint8Array and constructs a string with the potentially-invalid UTF-8 contents of the buffer replacing ill-formed sequences with the REPLACEMENT CHARACTER according to the Encoding Standard.

CC @annevk

@hsivonen
Copy link
Member

hsivonen commented May 5, 2020

writes as many code units as fit as UTF-16

Correction: Writes the code units as 16-bit units without ensuring UTF-16 validity.

@sffc sffc added A-ffi Area: FFI, WebAssembly, Transpilation C-meta Component: Relating to ICU4X as a whole A-design Area: Architecture or design labels May 7, 2020
@annevk
Copy link
Contributor

annevk commented May 7, 2020

As part of this it would be ideal if the feature request in whatwg/encoding#174 could also be considered. The most natural place for that seems to be some extension to JavaScript strings rather than a new feature on the Encoding API objects.

cc @alexcrichton @Pauan

@Pauan
Copy link

Pauan commented May 7, 2020

@hsivonen That just seems like it's reinventing TextEncoder and TextDecoder.

Since TextEncoder already accepts an encoding argument, why can't we just add new TextEncoder("utf16")? The API doesn't need to change (Uint8Array is fine), it just needs a new encoding added.

@hsivonen
Copy link
Member

hsivonen commented May 7, 2020

That just seems like it's reinventing TextEncoder and TextDecoder.

For UTF-8, yes, except without WebIDL. How big of a performance win not having WebIDL in there would be should be benchmarked before committing.

For UTF-16, no. TextEncoder writes to Uint8Array. This is about writing to a Uint16Array, which is guaranteed to be aligned right. I.e. this is about Unicode Encoding Forms, not about Unicode Encoding Schemes.

@Pauan
Copy link

Pauan commented May 7, 2020

This is about writing to a Uint16Array, which is guaranteed to be aligned right. I.e. this is about Unicode Encoding Forms, not about Unicode Encoding Schemes.

WebAssembly guarantees little endian (since no mainstream CPUs use big endian), so there's no problem with TextEncoder just writing into the Uint8Array with little endian order.

Or are you concerned about something else?

@hsivonen
Copy link
Member

hsivonen commented May 7, 2020

Or are you concerned about something else?

Yes: Alignment. A UTF-16LE encoder writing by byte to a buffer that is actually aligned to even addresses is either more complex (if it checks for alignment and has multiple paths) or less efficient (if it writes by byte) than one that gets to assume it can do aligned 16-bit writes.

@Pauan
Copy link

Pauan commented May 7, 2020

@hsivonen Okay, but I don't see why the encoder can't internally use Uint16Array (even if it returns Uint8Array). Since converting from Uint16Array to Uint8Array doesn't copy (since they can share the same underlying ArrayBuffer).

But even if there was some reason that wouldn't work, TextEncoder could just return Uint16Array if the encoding is "utf16". That sort of polymorphism is very common (even in web APIs).

So why do we need a new API that just does the same thing as TextEncoder/TextDecoder?

@sffc sffc added this to the 2020 Q2 milestone Jun 17, 2020
@sffc
Copy link
Member Author

sffc commented Jun 26, 2020

CC @littledan

@sffc sffc assigned nciric and unassigned hagbard Jul 23, 2020
@sffc sffc modified the milestones: 2020 Q2, 2020 Q3 Jul 23, 2020
@nciric nciric assigned hagbard and unassigned nciric Jul 24, 2020
@sffc
Copy link
Member Author

sffc commented Oct 22, 2020

See #207 for additional discussion on this subject.

@sffc sffc closed this as completed Oct 22, 2020
@sffc sffc added the R-fixed-elsewhere Resolution: issue was fixed outside the core repo label Oct 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design A-ffi Area: FFI, WebAssembly, Transpilation C-meta Component: Relating to ICU4X as a whole R-fixed-elsewhere Resolution: issue was fixed outside the core repo T-docs-tests Type: Code change outside core library
Projects
None yet
Development

No branches or pull requests

6 participants