Reconsider UTF-32 support #545

Open
dpk opened this issue Mar 14, 2021 · 9 comments

Labels
A-ffi (Area: FFI, WebAssembly, Transpilation) · C-meta (Component: Relating to ICU4X as a whole) · question (Unresolved questions; type unclear)

Comments

@dpk

dpk commented Mar 14, 2021

string_representation.md:

The use of UTF-32 is rare enough that it's not worth supporting.

There is one significant use of UTF-32 in the real world: Python’s so-called ‘flexible string representation’. See PEP 393. The short version: Python internally stores strings as Latin-1 if they only contain characters ≤ U+00FF; as UTF-16 (guaranteed valid, fwiw) if they contain only characters in the BMP; or otherwise as UTF-32. This is intended to provide the most efficient representation for a majority of strings while retaining O(1) string indexing — it’s much like what the document says about what SpiderMonkey and V8 do, but since Python string indexing, unlike JS string indexing, returns real codepoints and not UTF-16 code units, it adds an extra upgrade to UTF-32.
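
For illustration, the effect of this per-string storage choice is visible from Python via sys.getsizeof (the exact byte counts depend on the CPython version; the 1 → 2 → 4 bytes-per-character steps are the point):

   import sys

   ascii_s  = "a" * 100           # all characters <= U+00FF: 1 byte each
   latin1_s = "\u00e9" * 100      # é is still <= U+00FF: 1 byte each
   bmp_s    = "\u20ac" * 100      # € is in the BMP but > U+00FF: 2 bytes each
   astral_s = "\U0001f600" * 100  # outside the BMP: 4 bytes each (UTF-32)

   for s in (ascii_s, latin1_s, bmp_s, astral_s):
       # Totals include a fixed per-object header, but the per-character
       # cost visibly steps from 1 to 2 to 4 bytes.
       print(sys.getsizeof(s))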

In the Scheme world, R7RS Large can reasonably be expected to require that codepoint indexing into strings (or some variant of strings — it’s possible we’ll end up with a string/text split like Haskell’s) be O(1), so I expect UTF-32 or Python-style flexible string representation to become common in that context, too.

(Also, before flexible string representation was introduced into Python, UTF-32 was used for all strings.)

@sffc
Member

sffc commented Mar 14, 2021

CC @hsivonen

@sffc sffc added the discuss (Discuss at a future ICU4X-SC meeting) and A-ffi (Area: FFI, WebAssembly, Transpilation) labels Mar 14, 2021
@hsivonen
Member

If there is interest in interfacing with Python on that level instead of going via UTF-8, I guess that's a use case, then.

Note: Python doesn't guarantee UTF-32 validity: the 32-bit-code-unit strings can contain lone surrogates, so if the use case is interfacing with Python without UTF-8 conversion, ICU4X would need to check that each code unit is in the valid range for Rust's char instead of assuming validity.
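
For illustration, a lone surrogate is perfectly storable in a Python str even though it is not a valid Unicode scalar value (plain Python, just demonstrating the point):

   s = "a\ud800b"          # contains a lone surrogate
   print(len(s))           # 3
   try:
       s.encode("utf-8")
   except UnicodeEncodeError as e:
       print(e)            # 'utf-8' codec can't encode character '\ud800' ...
   # Rust's char (like well-formed UTF-32) excludes U+D800..U+DFFF, so a
   # binding would have to validate or replace such code units.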

@dpk
Author

dpk commented Mar 15, 2021

I would hypothetically be interested in writing a Python binding once ICU4X has a C API, as an alternative to the very un-Pythonic and under-documented PyICU. (I’m the author of the PyICU cheat sheet, which is afaik the only API documentation specific to ICU in Python — otherwise, you’re just referred to the C++ API and left to work out how it maps onto Python for yourself.)

@ovalhub

ovalhub commented Mar 15, 2021

Author of PyICU here: if you find PyICU very un-Pythonic, please provide concrete examples of how you're doing something with PyICU and how you'd suggest it be done instead in order to be more Pythonic. I'm happy to either fix actual un-Pythonic examples of PyICU use cases or show you how it's done. Please be very specific; there is already a lot of built-in Python iterator support, for example, that you may just not know about. It's ok to ask and suggest improvements (!)
I'm pretty sure that PyICU being very under-documented also contributes to your perceiving it as un-Pythonic. I claimed PyICU documentation bankruptcy over a decade ago, as the ICU API surface is huge and keeps growing. I cannot provide another set of docs; it's hard enough to keep up with ICU proper, and the ICU docs themselves are pretty good. This is open source, and the PyICU C++ wrappers around C/C++ ICU are fairly regular, so I encourage you to read the code to see what is possible, what is supported, and how to use it.

@ovalhub

ovalhub commented Mar 15, 2021

For example, from your cheat sheet, you seem not to know that BreakIterator is a Python iterator:

   from icu import *

   de_words = BreakIterator.createWordInstance(Locale('de_DE'))
   de_words.setText('Bist du in der U-Bahn geboren?')
   print(list(de_words))
   # [4, 5, 7, 8, 10, 11, 14, 15, 16, 17, 21, 22, 29, 30]

Yes, I understand you'd prefer the actual words to be returned, but that's not how ICU designed the BreakIterator; they chose to give you boundaries instead.
It's not that hard, in Python, to then combine these boundaries into words, however.
That being said, since doing this consistently might become a lot of work, adding a higher-level iterator that gives you the words would be nice too.
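
As a minimal sketch of that combining step (the whitespace-filtering heuristic here is mine, not something PyICU provides):

   from icu import BreakIterator, Locale

   text = 'Bist du in der U-Bahn geboren?'
   de_words = BreakIterator.createWordInstance(Locale('de_DE'))
   de_words.setText(text)

   boundaries = [0] + list(de_words)
   # Slice the text between consecutive boundaries and drop the pieces
   # that are only whitespace.
   words = [text[start:end]
            for start, end in zip(boundaries, boundaries[1:])
            if text[start:end].strip()]
   print(words)
   # ['Bist', 'du', 'in', 'der', 'U', '-', 'Bahn', 'geboren', '?']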

@sffc sffc added the backlog, C-meta (Component: Relating to ICU4X as a whole), and question (Unresolved questions; type unclear) labels and removed the discuss (Discuss at a future ICU4X-SC meeting) label Mar 18, 2021
@sffc
Member

sffc commented Mar 18, 2021

We will revisit string encodings as we approach ICU4X v1.

@sffc sffc added the discuss (Discuss at a future ICU4X-SC meeting) label and removed the v1 label Apr 1, 2022
@sffc
Member

sffc commented May 20, 2022

Discussion:

  • @Manishearth - ICU4X is powered by Diplomat, which allows the target to set the encoding for output strings (via Writeable). So the question is mainly scoped to input strings and text processing APIs like Segmenter.
  • @robertbastian - Should we consider having Segmenter return iterator types that abstract over the string encoding?
  • @sffc - Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support. We could do it if it is well-motivated.
  • @robertbastian - Who else needs UTF-32? Are there numbers on the use?
  • @Manishearth - Ruby supports bring-your-own-encoding. Most clients use either UTF-8 or UTF-16. A "rope-based" character encoding was useful in XI, which allowed for incremental segmentation (re-segmenting text after being edited without re-computing the whole string).
  • @sffc - UTF-32 is useful as a code point storage mechanism, like &[char]. I haven't seen it used widely as an encoding for strings.
  • @robertbastian - Based on @hsivonen's comment above, Python does not guarantee validity of UTF-32, so maybe this should just be handled in the Python FFI layer (see the sketch after these notes).
  • @Manishearth - Since it is Python, maybe we just do an in-place conversion.
  • @robertbastian - We should punt the decision until we add Python FFI support.
  • @sffc - Agreed.
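
As a rough illustration of the "handle it in the Python FFI layer" option from these notes, a binding could convert at the boundary before calling ICU4X; the helper name and the U+FFFD policy here are assumptions for illustration, not an ICU4X API:

   def to_utf8_for_icu4x(s: str) -> bytes:
       # Lone surrogates are storable in a Python str but not representable
       # in well-formed UTF-8, so map them to U+FFFD before crossing the
       # FFI boundary.
       cleaned = ''.join(
           '\ufffd' if 0xD800 <= ord(c) <= 0xDFFF else c
           for c in s
       )
       return cleaned.encode('utf-8')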

@sffc sffc removed the discuss (Discuss at a future ICU4X-SC meeting) label May 20, 2022
@hsivonen
Member

This should not be taken as an endorsement of UTF-32, but as a comment on how hard this would be for the collator specifically:

Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

The collator and the decomposing normalizer consume an iterator over char internally (with errors mapped to U+FFFD), so adding UTF-32 support would be trivial. At compile time, there would be separate codegen instances for UTF-32, which would grow the binary size, but those instances should also be eligible to be thrown away by LTO when not used (except over FFI, where there's currently a Rust symbol visibility issue standing in the way of cross-language LTO doing proper dead code analysis).
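
In Python terms, feeding potentially-ill-formed UTF-32 into such a char iterator amounts to something like the following sketch (my own illustration of the U+FFFD mapping, not ICU4X code):

   def utf32_units_to_scalars(units):
       # Surrogates and out-of-range values become U+FFFD; everything else
       # passes through as a Unicode scalar value (what Rust calls a char).
       for u in units:
           yield 0xFFFD if 0xD800 <= u <= 0xDFFF or u > 0x10FFFF else u

   print([hex(c) for c in utf32_units_to_scalars([0x41, 0xD800, 0x1F600])])
   # ['0x41', '0xfffd', '0x1f600']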

@hsivonen
Member

Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: if most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted to UTF-8, unless indices are returned.

Indices are relevant to the segmenter. In that case, it might actually help Python to convert to UTF-32 and then segment that.
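
A small illustration of why code point indices are the convenient unit on the Python side (plain Python, no ICU4X involved):

   text = "naïve text"
   utf8 = text.encode("utf-8")
   # A segmenter working on UTF-8 reports byte offsets; 'ï' takes two bytes,
   # so the boundary after "naïve" is byte offset 6 but code point index 5.
   print(len(utf8[:6].decode("utf-8")))  # 5
   print(text[:5])                       # naïve
   # UTF-32 boundaries are code point indices and could be used to slice
   # the Python string directly.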

Other than that, the case where avoiding conversion to UTF-8 might make sense is the collator, which performs a lot of string reading without modification. However, to have the collator operate without having to create (converted) copies of the Python string data, there'd need to be 6 methods:

  1. Compare potentially-ill-formed UTF-32 and potentially-ill-formed UTF-32.
  2. Compare UCS-2 and UCS-2.
  3. Compare Latin 1 and Latin 1.
  4. Compare potentially-ill-formed UTF-32 and UCS-2.
  5. Compare potentially-ill-formed UTF-32 and Latin 1.
  6. Compare UCS-2 and Latin 1.

The remaining three of the nine cases are mirror images of the last three, so there's no point in generating code for those separately.
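
A rough sketch of how a hypothetical Python binding might pick among those six variants, approximating the PEP 393 kind by each string's largest code point (the variant names are invented; a real binding would read the kind via the CPython C API):

   def pep393_kind(s: str) -> str:
       # CPython chooses the storage kind from the largest code point.
       m = max(map(ord, s), default=0)
       if m <= 0xFF:
           return "latin1"
       if m <= 0xFFFF:
           return "ucs2"
       return "utf32"

   # Only six specializations are needed; the mirrored pairings swap the
   # arguments (and a collator would negate the comparison result).
   ORDER = {"utf32": 0, "ucs2": 1, "latin1": 2}

   def pick_variant(a: str, b: str):
       ka, kb = pep393_kind(a), pep393_kind(b)
       swapped = ORDER[ka] > ORDER[kb]
       if swapped:
           ka, kb = kb, ka
       return f"compare_{ka}_{kb}", swapped

   print(pick_variant("abc", "a\U0001F600"))  # ('compare_utf32_latin1', True)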

Note that a surrogate pair in a Python string has the semantics of a surrogate and another surrogate. The result does not have supplementary-plane semantics. I haven't checked if surrogates promote to 32-bit code units or if the 16-bit-code-unit representation can have surrogates that don't have UTF-16 semantics. That is, it's unclear to me if item 2 can reuse the UTF-16 to UTF-16 comparison specialization.
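
A quick way to see the first point from plain Python (the surrogate escapes are only for illustration):

   s = "\ud83d\ude00"                 # two lone surrogates, not U+1F600
   print(len(s))                      # 2
   print([hex(ord(c)) for c in s])    # ['0xd83d', '0xde00']
   print(s == "\U0001f600")           # False: no supplementary-plane semantics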

Note that the raw Python data representation is available via PyO3 "on non-Py_LIMITED_API and little-endian only".

If someone really cares, it would make sense to benchmark the collator with these 6 variants (out-of-repo) vs. converting to UTF-8 and then using the &str-to-&str comparison.

@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022