Segmenter #109
Comments
Yes. We might want to consider unicode_segmentation for non-line segmentation. Neither of these is locale-aware, however. |
Intl.Segmenter is in scope for 402, and I think it makes sense to include it here in ICU4X as well. I think we should build the segmenter on top of the Unicode properties API that ICU4X provides. |
Adding to backlog along with Normalization (#40). |
@methyl compiled ICU4C Segmenter to WebAssembly as a polyfill for Intl.Segmenter: https://github.com/surferseo/intl-segmenter-polyfill According to tc39/proposal-intl-segmenter#118 (comment), the .wasm file is ~350 KB gzipped, with the Thai dictionary but not the other dictionaries. Just posting this here as some benchmarks to consider for the Rust segmenter. |
@sffc thanks for mention, if you have any questions about the implementation, go ahead. I also explored compiling https://github.com/google/rust_icu to WASM but it was not straightforward and I gave up. |
Which segmentation functionality exactly? There are several types desirable:
cc @raphlinus |
Re line segmentation, I have no particular ego invested, but it is true as @dhardy points out that xi-unicode has a solution to this problem. It's a very optimized, hand-tuned solution. I haven't benchmarked it against uax14_rs, but did benchmark it against ICU a long time ago and it was considerably faster (somewhere in the 2x-3x range, I don't remember the details). I would be pleased and honored if it were adapted for use in icu4x, but relieved if not; I'm busy enough as it is. There's lots more to line breaks than a UAX 14 compliant break iterator. There's all the Southeast Asian dictionary-based breaking stuff, plus other desirable "tailoring." For example, Android has a little parser for URLs and email addresses, and applies different breaking rules for those, because the default UAX 14 behavior is pretty bad in those use cases. I should also point out that ICU has some really complex additional behavior around numbers and ASCII symbols, not covered by UAX 14. I evaluated it when working on xi-unicode and found it not worthwhile, but others with different use cases may feel differently. So whoever adopts this will have their hands full, if the goal is a world-class solution. The Druid text stack currently does nothing sophisticated wrt word advance, but is likely a customer for WordCursor functionality. |
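To make the URL/email "tailoring" idea above concrete, here is a minimal sketch. It is not Android's parser and not UAX 14: `naive_breaks` is a hypothetical stand-in for a generic line breaker that offers breaks after spaces, `/` and `-`, and `tailored_breaks` filters out any opportunity that falls inside a token that looks like a URL or email address. All function names and the detection heuristic are assumptions made for illustration.

```rust
/// Stand-in for a generic line breaker (NOT UAX 14): offers a break after a
/// run of spaces and after '/' or '-', which is the kind of behavior that
/// splits URLs in awkward places.
fn naive_breaks(text: &str) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        let end = i + c.len_utf8();
        let next = match chars.peek() {
            Some(&(_, n)) => n,
            None => break, // never offer a break at the very end of the text
        };
        let after_space = c == ' ' && next != ' ';
        let after_punct = (c == '/' || c == '-') && next != ' ';
        if after_space || after_punct {
            breaks.push(end);
        }
    }
    breaks
}

/// Very rough heuristic (an assumption for this sketch, not a real parser).
fn looks_like_url_or_email(token: &str) -> bool {
    token.contains("://") || token.contains('@')
}

/// Byte ranges of space-delimited tokens that should not be broken internally.
fn protected_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut offset = 0;
    for token in text.split(' ') {
        let end = offset + token.len();
        if looks_like_url_or_email(token) {
            spans.push((offset, end));
        }
        offset = end + 1; // skip the single space that split() consumed
    }
    spans
}

/// Tailoring step: drop break opportunities strictly inside a protected span.
fn tailored_breaks(text: &str) -> Vec<usize> {
    let spans = protected_spans(text);
    naive_breaks(text)
        .into_iter()
        .filter(|&b| !spans.iter().any(|&(s, e)| s < b && b < e))
        .collect()
}
```

Keeping the tailoring as a filter over the base breaker's output means the base rules stay untouched; a real implementation might instead apply different rules inside the detected span, as the comment above describes for Android.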
I'll also add that xi-unicode goes to Unicode 10, while uax14_rs seems to be up to date with Unicode 13. That's another consideration, and makes apples-to-apples comparison harder, as Unicode 13 is substantially more complex. Lastly, xi-unicode has support for break iteration in non-contiguous strings, to support the needs at the time of xi-editor. I think that's probably worth removing going forward; my current thinking is that even when dealing with very large texts, it makes sense to use a contiguous string representation at the paragraph granularity and below. The maintenance burden of keeping the noncontiguous representation up to date is nontrivial. |
One of the main value propositions of ICU4X is to provide Unicode and CLDR data (#217 is a thread specifically about Unicode character property data). In cases where functionality is out of scope for ICU4X, I hope that downstream crates can depend on the ICU4X data provider to get their character properties. For segmentation in particular, as others have mentioned, there is already a strong ecosystem in Rust. I don't think we should reinvent the wheel in ICU4X. There is one area where I'm not aware of strong ecosystem support, which is dictionary-based segmentation (important in many East and Southeast Asian languages). I'm hosting an intern this fall who I hope will be able to build a segmenter for these languages. |
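For readers unfamiliar with dictionary-based segmentation, the simplest form of the idea is greedy longest-match against a word list. The sketch below is only that baseline, not the algorithm ICU or ICU4X uses; the function name, the `max_word_chars` parameter, and the tiny dictionaries in the usage note are all hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation: at each position take the longest
/// dictionary word (up to `max_word_chars` characters); if nothing matches,
/// emit a single character and move on.
fn segment_longest_match<'a>(
    text: &'a str,
    dict: &HashSet<&str>,
    max_word_chars: usize,
) -> Vec<&'a str> {
    // Byte offset of every character boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of characters
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < n {
        let longest = (pos + max_word_chars).min(n);
        let mut end = pos + 1; // fallback: one character
        // Try the longest candidate first and stop at the first hit.
        for cand in (pos + 2..=longest).rev() {
            if dict.contains(&text[bounds[pos]..bounds[cand]]) {
                end = cand;
                break;
            }
        }
        out.push(&text[bounds[pos]..bounds[end]]);
        pos = end;
    }
    out
}
```

With a dictionary containing "ภาษา" and "ไทย", the text "ภาษาไทย" splits into those two words. Greedy matching can pick a long word that prevents a better overall split, which is one reason production engines use more elaborate strategies.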
Hi all, I have a document for implementing segmenters in icu4x. Feedback very welcome. cc @jfkthame |
I intend to review your document in close detail. I'm a Cantonese speaker and have implemented a bad trie version of Cantonese segmentation for JS (using the 402 API), which is still immensely better than ICU's. Apropos of literally nothing in your document, since I've only skimmed it at this point, here is a thread where I've been discussing segmentation API limitations for Cantonese: tc39/proposal-intl-segmenter#133. I'm looking forward to learning from your research! |
Implementation feedback from @aheninger:
|
I did some quick informal performance measurements of ICU4C vs uax14_rs on English text, and came up with similar results to those reported above, in #707. ICU is slower, running at around 60% of the speed of uax14_rs. My suspicion is that part of the difference comes from ICU's caching layer. ICU's API supports forwards and backwards iteration, and testing arbitrary string positions, all from the same iterator. Caching helps some use cases, but not plain forward iteration. The work per character for forward iteration of the underlying engine looks like it should be comparable for the ICU and uax14_rs packages. All this is just a hunch; without some real profiling work, I don't really know and could easily be wrong. As @dhardy and @raphlinus noted earlier, there is also xi-unicode/rust/unicode. The approach it takes is similar to that of uax14_rs; I wonder if there is some common ancestry between the two. In both xi-unicode and uax14_rs, lists of states and character properties are hard coded; you can't make a word or grapheme cluster iterator by just plugging in different rules. Also, both xi-editor and uax14_rs include tests to check the Unicode test file LineBreakTest.txt. uax14_rs is skipping tests that include the line break character classes OP, NU, PO and PR. For uax14_rs I confirmed that the test fails if these cases are added back in. |
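As a rough illustration of what "plugging in different rules" could look like, the sketch below drives one generic boundary finder from a rule set expressed as data: a character classifier plus a pair table saying whether a break is allowed between two adjacent classes. This is a toy under strong assumptions. Real UAX 14 and UAX 29 rules need more context than adjacent pairs, and `PairRules`, `WORD_RULES` and `LINE_RULES` are hypothetical names, not APIs from ICU, uax14_rs or xi-unicode.

```rust
/// A rule set as data (hypothetical type): a classifier and a table where
/// `break_between[a][b]` says whether a break is allowed between a character
/// of class `a` and a following character of class `b`.
struct PairRules {
    classify: fn(char) -> usize,
    break_between: &'static [&'static [bool]],
}

/// Generic engine: returns interior boundary byte offsets for any rule set.
fn boundaries(text: &str, rules: &PairRules) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev: Option<usize> = None;
    for (i, c) in text.char_indices() {
        let class = (rules.classify)(c);
        if let Some(p) = prev {
            if rules.break_between[p][class] {
                out.push(i);
            }
        }
        prev = Some(class);
    }
    out
}

// Toy "word" rules. Classes: 0 = letter/digit, 1 = whitespace, 2 = other.
fn word_class(c: char) -> usize {
    if c.is_alphanumeric() {
        0
    } else if c.is_whitespace() {
        1
    } else {
        2
    }
}

static WORD_RULES: PairRules = PairRules {
    classify: word_class,
    break_between: &[
        &[false, true, true], // after a letter/digit
        &[true, false, true], // after whitespace
        &[true, true, true],  // after anything else
    ],
};

// Toy "line" rules. Classes: 0 = non-whitespace, 1 = whitespace.
fn line_class(c: char) -> usize {
    if c.is_whitespace() {
        1
    } else {
        0
    }
}

static LINE_RULES: PairRules = PairRules {
    classify: line_class,
    break_between: &[
        &[false, false], // after non-whitespace: no break
        &[true, false],  // after whitespace: break before the next word
    ],
};
```

The same `boundaries` function serves both rule sets, which is the flexibility the comment above says the hard-coded state lists lack; the trade-off is that hand-tuned, hard-coded state machines are typically easier to optimize.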
This will be closed after segmenter moves to components in #2259. |
Segmenter has been moved to components and is being released in 1.2. Closing this tracking issue. |
For Segmenter, seems like the first crate we could consider incorporating is https://github.com/makotokato/uax14_rs
@sffc @Manishearth @makotokato - would it make sense to consider it for ICU4X?