
String encodings (UTF-8/UTF-16) #14

Closed
sffc opened this issue Mar 24, 2020 · 12 comments · Fixed by #49
@sffc (Member) commented Mar 24, 2020

We want OmnICU to be portable to environments where strings are not necessarily UTF-8. For example, if called via FFI from Dart or WebAssembly, we may want to process strings that are UTF-16. How should we handle strings in OmnICU so that it's easy to target UTF-16 environments while not hurting performance in UTF-8 environments?

sffc changed the title from "How to do string handling?" to "String encodings (UTF-8/UTF-16)" Mar 24, 2020
sffc added this to the 2020 Q1 milestone Mar 24, 2020
@macchiati (Member) commented Mar 25, 2020 via email

@zbraniecki (Member):

@Manishearth @hsivonen

@sffc (Member, Author) commented Mar 25, 2020

I was thinking one option here could be to have a file like:

// typedefs.rs

// Rust UTF-8 environment
#[cfg(...)]
pub use std::string::String;

// UTF-16 environment
#[cfg(...)]
pub use omnicu::types::U16String as String;

And then, all public OmnICU APIs would import the String symbol from typedefs.rs, with the definition varying by configuration option. As long as omnicu::types::U16String is duck-type-compatible with std::string::String, we should be OK, and we don't cause any performance regressions in Rust environments.

@Manishearth (Member):

IMO it would be better to use a trait than cfgs.
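
For illustration, a minimal sketch of what a trait-based design could look like; the EncodedStr trait and every name below are hypothetical, not proposed API:

// Hypothetical trait abstracting over the code-unit type, so APIs are
// written once and monomorphized per encoding instead of being switched
// by a build-time cfg.
pub trait EncodedStr {
    type CodeUnit;
    fn as_units(&self) -> &[Self::CodeUnit];
}

// UTF-8 strings expose their bytes...
impl EncodedStr for str {
    type CodeUnit = u8;
    fn as_units(&self) -> &[u8] {
        self.as_bytes()
    }
}

// ...and UTF-16 strings expose their u16 code units.
impl EncodedStr for [u16] {
    type CodeUnit = u16;
    fn as_units(&self) -> &[u16] {
        self
    }
}

// An API written once against the trait.
pub fn count_units<S: EncodedStr + ?Sized>(s: &S) -> usize {
    s.as_units().len()
}

fn main() {
    assert_eq!(count_units("héllo"), 6); // 6 UTF-8 bytes
    let utf16: Vec<u16> = "héllo".encode_utf16().collect();
    assert_eq!(count_units(&utf16[..]), 5); // 5 UTF-16 code units
}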

@hsivonen (Member):

I expect in Gecko whether we want to perform a given operation on UTF-8, on UTF-16, or, depending on caller, both, will vary over time. In the case of encoding_rs, Gecko is pretty firmly in the "both" situation, but for specific operations to be provided by OmnICU, it could easily be one or the other for a given operation.

I also expect there will be some operations where the Gecko caller wants UTF-16 for the time being but performing a conversion at the boundary and running the operation on UTF-8 inside OmnICU would be acceptable. When writing our conversions between UTF-16 and UTF-8, it's been a goal of mine to make it so fast that people wouldn't shy away from using UTF-8 for new code just because the new code needs to be called by old code that has UTF-16 at hand.

That is to say:

  • I think having one big UTF-16 vs. UTF-8 build config flag isn't a good match for Gecko.
  • For performance-sensitive operations, I'd prefer to have both entry points monomorphized from source code written once such that C++/Rust cross-language LTO discards a monomorphization if unused. (For languages that are known to want one or the other and that don't participate in LTO with Rust, the FFI boundary should expose only the entry points for one so that the other gets discarded when building the largest artifact that does participate in LTO with Rust.)
  • If a given operation has to be written for one or the other only, I'd err on the side of writing it for UTF-8.

For encoding_rs, having the UTF-16 entry points means that the naming of the UTF-8 entry points is slightly less elegant for UTF-8-only consumers than it could be if the UTF-16 entry points didn't exist. I think this is acceptable, though. However, in the case of encoding_rs's problem space, there would be multiple different UTF-8 convenience entry points anyway, which creates a need to disambiguate by naming since there isn't argument-type-based overloading. E.g. decode_to_utf16 outputs to a caller-allocated UTF-16 buffer, decode_to_utf8 outputs to a caller-allocated UTF-8 buffer that's not a UTF-8 buffer for type-system purposes, decode_to_str outputs to a caller-allocated UTF-8 buffer that is a UTF-8 buffer for type-system purposes, and decode_to_string decodes by extending a caller-provided String. (The construction is supposed to be such that if you only use the UTF-16 entry point, the code for the UTF-8 entry points, which share the back-end code, is discarded, or vice versa. However, I don't actually have testing infra in place to keep verifying that this is so, since Gecko needs both anyway.)
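
As a hedged sketch of that construction (all names hypothetical): both public entry points are thin wrappers around one generic implementation, so an artifact that only calls one of them lets LTO discard the other monomorphization.

// Business logic written once, generic over the code-unit type.
fn transform_impl<T: Copy>(input: &[T], output: &mut Vec<T>) {
    // placeholder logic; the real operation goes here
    output.extend_from_slice(input);
}

// Thin UTF-8 and UTF-16 wrappers; a build that uses only one of these
// can have the unused monomorphization discarded at (cross-language)
// LTO time.
pub fn transform_utf8(input: &[u8], output: &mut Vec<u8>) {
    transform_impl(input, output)
}

pub fn transform_utf16(input: &[u16], output: &mut Vec<u16>) {
    transform_impl(input, output)
}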

@hsivonen (Member):

I think I should expand on my previous comment:

There aren't just two cases: UTF-16 and UTF-8. There are three cases: potentially-invalid UTF-16, potentially-invalid UTF-8 (invalidity isn't UB), and guaranteed-valid UTF-8 (invalidity is UB). (Guaranteed-valid UTF-16 doesn't exist in practice.) Additionally, there'd be Latin1, which both SpiderMonkey and V8 use internally when possible, but it's unclear if letting that abstraction leak into OmnICU is a good idea.

In the caller-to-OmnICU direction, it's always okay to pass guaranteed-valid UTF-8 to an entry point that takes potentially-invalid UTF-8 if one is willing to leave some performance on the table. In the OmnICU-to-caller direction it isn't quite that simple, because the simplest loop that restores UTF-8 validity after having written to a caller-allocated buffer of guaranteed-valid UTF-8 doesn't necessarily terminate if the buffer wasn't valid to begin with. (I.e. write zeros after your meaningful write until finding a UTF-8 lead byte.)
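
A minimal sketch of that validity-restoring loop, assuming the meaningful write ends at index written; the function name is illustrative:

// Zero out orphaned continuation bytes until the next UTF-8 lead byte
// (any byte that is not 0b10xxxxxx). 0x00 is valid UTF-8 (NUL), so the
// buffer is valid again afterwards. The bounds check makes this
// terminate even on an invalid buffer; a variant relying on the
// buffer's validity invariant (a lead byte within three bytes) would
// not, which is the hazard noted above.
fn restore_utf8_validity(buf: &mut [u8], written: usize) {
    let mut i = written;
    while i < buf.len() && (buf[i] & 0xC0) == 0x80 {
        buf[i] = 0;
        i += 1;
    }
}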

Supporting all three for output results in just two copies of the business logic, plus a wrapper that does the zeroing until the next UTF-8 lead byte for the third case. Supporting all three on the input side without leaving any performance on the table results in three monomorphizations. (encoding_rs doesn't support the case of encoding potentially-invalid UTF-8 into a legacy encoding.)

What travels best over the FFI is a pointer to a caller-owned buffer and a length. That is, for input the three cases are &[u16], &[u8], and &str, and for output &mut [u16], &mut [u8], and &mut str. &mut str isn't really idiomatic Rust, and encoding_rs provides an entry point that appends to a String instead. For the output case, this means that there has to be a way to obtain an estimate of the worst-case code unit length so that the caller can allocate a buffer. If the caller is unhappy with the slop left at the end of the buffer, the caller has to reallocate an exact-length buffer afterwards and copy into it. This kind of write-and-shrink allocation approach isn't really worse than the usual write-expand-write-more approach.

So I would suggest that the FFI layer not try to expose either iterators or growable buffers, but instead work with pointer-and-length, caller-allocated buffers. When the internals use iterators, I'm fine with exposing the iterators to Rust callers. I think it's also nice to provide wrappers that append into a String for Rust callers.
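
To make the shape concrete, here is a hypothetical pointer-and-length FFI sketch along those lines; none of these names (omnicu_op_worst_case_len, omnicu_op_utf16) are real ICU4X/OmnICU API, and the bound and logic are placeholders:

use core::slice;

// Worst-case output length so the caller can allocate up front.
#[no_mangle]
pub extern "C" fn omnicu_op_worst_case_len(input_len: usize) -> usize {
    input_len.checked_mul(3).unwrap_or(usize::MAX) // illustrative bound only
}

// Input and output are caller-owned buffers passed as pointer + length;
// the return value is the number of code units actually written.
#[no_mangle]
pub unsafe extern "C" fn omnicu_op_utf16(
    input: *const u16,
    input_len: usize,
    output: *mut u16,
    output_len: usize,
) -> usize {
    let src = slice::from_raw_parts(input, input_len);
    let dst = slice::from_raw_parts_mut(output, output_len);
    // placeholder business logic: copy as much as fits
    let n = src.len().min(dst.len());
    dst[..n].copy_from_slice(&src[..n]);
    n
}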

sffc added the T-docs-tests (Type: Code change outside core library) label Apr 16, 2020
sffc assigned hsivonen and unassigned zbraniecki and Manishearth Apr 16, 2020
@sffc (Member, Author) commented Apr 16, 2020

Reassigning this issue to Henri as the primary owner. Can you write up the recommended approach to ICU4X string encodings in a PR that we can review?

@hsivonen (Member):

@anba, my recollection is that you've worked on optimizing SpiderMonkey cases where the ECMA Intl API is called with input that SpiderMonkey represents as Latin1. Are there operations for which it would be worth a binary size increase to optimize performance by making the back-end library work directly with the Latin1 representation?

@zbraniecki (Member):

There's also prior work on Gecko's nsstring crate - https://searchfox.org/mozilla-central/source/xpcom/rust/nsstring/src

I asked @emilio for early feedback since he's one of the authors of that FFI lib for strings. His response:

Henri's comment is on point.
You want either different entry points or generics to deal with UTF-8 / UTF-16.
(...)
Given this is intended to be a "base library", I'd try for the APIs not to have much say in how the caller's strings should look.
So output should probably allow the caller to give it a buffer or something. ICU does some of this with their "byte sink" stuff.
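
A hedged sketch of what such a "byte sink" output abstraction could look like in Rust, loosely modeled on ICU4C's ByteSink; the trait and all names here are hypothetical:

// Minimal output-sink abstraction: the library writes bytes, the caller
// decides where they go.
pub trait ByteSink {
    fn append(&mut self, bytes: &[u8]);
}

// A Rust caller can sink straight into a Vec...
impl ByteSink for Vec<u8> {
    fn append(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes);
    }
}

// ...while an FFI caller can supply a sink backed by its own string type.
pub fn write_output(sink: &mut dyn ByteSink) {
    sink.append(b"formatted output"); // placeholder payload
}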

@hsivonen (Member):

Created a PR with a proposal. It still lacks a sample C++ wrapper.

@hsivonen (Member):

There's also prior work on Gecko's nsstring crate

I updated the proposal to use an allocation strategy similar to nsstring. The difference is that nsstring is aware of jemalloc bucket rounding, but the sample code I included is not. From API docs, it seems that reserve on Vec should query the allocator and round up to bucket size, but I don't see that actually happening.
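
One way to observe this yourself is a quick probe like the following; reserve only guarantees capacity of at least the requested amount, and the observed capacity may vary by allocator and Rust version:

fn main() {
    let mut v: Vec<u8> = Vec::new();
    v.reserve(100);
    // With jemalloc-style bucket rounding one might hope for, e.g., 112
    // or 128 here, but reserve only guarantees capacity >= 100.
    println!("capacity after reserve(100): {}", v.capacity());
}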

@hsivonen (Member):

From API docs, it seems that reserve on Vec should query the allocator and round up to bucket size, but I don't see that actually happening.

Filed a Rust issue.

sffc added the C-meta (Component: Relating to ICU4X as a whole) and A-design (Area: Architecture or design) labels May 7, 2020