Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmenter #109

Closed
zbraniecki opened this issue May 29, 2020 · 17 comments
Closed

Segmenter #109

zbraniecki opened this issue May 29, 2020 · 17 comments
Assignees
Labels
C-segmentation Component: Segmentation C-unicode Component: Props, sets, tries S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality

Comments

@zbraniecki
Copy link
Member

For Segmenter, seems like the first crate we could consider incorporating is https://github.com/makotokato/uax14_rs

@sffc @Manishearth @makotokato - would it make sense to consider it for ICU4X?

@Manishearth
Copy link
Member

Yes. We might want to consider unicode_segmentation for non-line segmentation.

Neither of these are locale aware, however.

@sffc sffc added C-unicode Component: Props, sets, tries T-core Type: Required functionality labels Jun 2, 2020
@sffc
Copy link
Member

sffc commented Jun 2, 2020

Intl.Segmenter is in scope for 402, and I think it makes sense to include here in ICU4X as well. I think we should build the segmenter based on the unicode properties API that ICU4X provides.

@sffc sffc added backlog help wanted Issue needs an assignee labels Jun 4, 2020
@sffc
Copy link
Member

sffc commented Jun 4, 2020

Adding to backlog along with Normalization (#40).

@sffc
Copy link
Member

sffc commented Jun 17, 2020

@methyl compiled ICU4C Segmenter to WebAssembly as a polyfill for Intl.Segmenter:

https://github.com/surferseo/intl-segmenter-polyfill

According to tc39/proposal-intl-segmenter#118 (comment), the .wasm file is ~350 KB gzipped, with the Thai dictionary but not the other dictionaries.

Just posting this here as some benchmarks to consider for the Rust segmenter.

@methyl
Copy link

methyl commented Jun 18, 2020

@sffc thanks for mention, if you have any questions about the implementation, go ahead. I also explored compiling https://github.com/google/rust_icu to WASM but it was not straightforward and I gave up.

@sffc sffc reopened this Sep 4, 2020
@dhardy
Copy link

dhardy commented Sep 5, 2020

Which segmentation functionality exactly? There are several types desirable:

cc @raphlinus

@raphlinus
Copy link

raphlinus commented Sep 5, 2020

Re line segmentation, I have no particular ego invested, but it is true as @dhardy points out that xi-unicode has a solution to this problem. It's a very optimized, hand-tuned solution. I haven't benchmarked it against uax14_rs, but did benchmark it against ICU a long time ago and it was considerably faster (somewhere in the 2x-3x range, I don't remember the details). I would be pleased and honored if it were adapted for use in icu4x, but relieved if not; I'm busy enough as it is.

There's lots more to line breaks than a UAX 14 compliant break iterator. There's all the southeast Asian dictionary-based breaking stuff, plus other desirable "tailoring." For example, Android has a little parser for url and email addresses, and applies different breaking rules for those, because the default UAX 14 behavior is pretty bad in those use cases. I should also point out that ICU has some really complex additional behavior around numbers and ASCII symbols, not covered by UAX 14. I evaluated it when working on xi-unicode and found it not worthwhile, but others with different use cases may feel differently.

So whoever adopts this will have their hands full, if the goal is a world-class solution.

The Druid text stack currently does nothing sophisticated wrt word advance, but is likely a customer for WordCursor functionality.

@raphlinus
Copy link

raphlinus commented Sep 5, 2020

I'll also add that xi-unicode goes to Unicode 10, while uax14_rs seems to be up to date with Unicode 13. That's another consideration, and makes apples-to-apples comparison harder, as Unicode 13 is substantially more complex.

Lastly, xi-unicode has support for break iteration in non-contiguous strings, to support the needs at the time of xi-editor. I think that's probably worth removing going forward; my current thinking is that even when dealing with very large texts, it makes sense to use a contiguous string representation at the paragraph granularity and below. The maintenance burden of keeping the noncontiguous representation up to date is nontrivial.

@sffc
Copy link
Member

sffc commented Sep 6, 2020

One of the main value propositions of ICU4X is to provide Unicode and CLDR data (#217 is a thread specifically about Unicode character property data). In cases where functionality is out of scope for ICU4X, I hope that downstream crates can depend on the ICU4X data provider to get their character properties.

For segmentation in particular, as others have mentioned, there is already a strong ecosystem in Rust. I don't think we should reinvent the wheel in ICU4X.

There is one area where I'm not aware of strong ecosystem support, which is dictionary-based segmentation (important in many East and Southeast Asian languages). I'm hosting an intern this fall who I hope will be able to build a segmenter for these languages.

@aethanyc
Copy link
Contributor

Hi all, I have a document for implementing segmenters in icu4x. Feedback very welcome.
https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit

cc @jfkthame

@sffc
Copy link
Member

sffc commented Feb 22, 2021

CC @nathanhammond

@nathanhammond
Copy link

nathanhammond commented Feb 28, 2021

@aethanyc 👋

I'm intending to review in close detail your document. I'm a Cantonese speaker and have implemented a bad trie version of Cantonese segmentation for JS (using the 402 API) which is still immensely better than ICU's BreakIterator with the default dictionary.

Apropos of literally nothing in your document, since I've only skimmed it at this point, here is a thread where I've been discussing segmentation API limitations for Cantonese:

tc39/proposal-intl-segmenter#133

I'm looking forward to learning from your research!

@sffc sffc added S-epic Size: Major project (create smaller child issues) and removed backlog help wanted Issue needs an assignee labels Apr 22, 2021
@sffc sffc added this to the ICU4X 0.4 milestone Apr 22, 2021
@sffc
Copy link
Member

sffc commented May 4, 2021

Implementation feedback from @aheninger:

Here are a few followup comments to the ICU4X deep dive discussion of April 21.

  1. Testing. For evaluating uax14_rs (or other new implementations), the Unicode segmentation test data at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.txt is very useful. Writing code to parse these text files and drive a test is pretty simple, and will give a quick read on whether an implementation is basically on track.
  2. Little was said about dictionary based word segmentation. Does ICU4X have a strategy separate from classic ICU? How does the LSTM based thing Frank is working on fit in?
    • In ICU, the transitions between text regions requiring rule based and dictionary based breaking is an area with many unaddressed bugs. Something to keep in mind while doing a new implementation
  3. Sequential vs. Parallel application of rules, and the speed vs. complexity tradeoff. While slower, if an engine based on sequential application of the rules turned out to be good enough, the rule maintenance problem would be greatly simplified. I think this would be true even if adding a rule customization involved writing code, or running a tool that produced some code.
    • A good performing sequential-rule engine would probably want to be based on rules that describe how to move from one boundary to the next, rather than on UAX style rules that describe how to test an arbitrary text position for being a boundary. But the transformation between the two styles is much easier than the sequential to parallel transformation involved in maintaining the ICU rules.
  4. Monkey Testing. In ICU, this has proved to be by far the most effective way to find obscure corner case bugs in the rules. It relies on having a reference implementation of the defining UAX rules that closely follows the UAX breaking algorithm, and is used to generate expected results for random test data.
    • The random test text sequences are not completely random. Start with a list of all of the character classes that appear in the rules. When building a test string, for each character to be added, first pick a random character class, then pick a random character from within the class.
    • I would be very hesitant to release a segmentation implementation without having monkey-tested it. Unfortunately, a good monkey test is quite a bit more work to write than a test driver for the Unicode test data, #1 above. Do #1 first.

@sffc
Copy link
Member

sffc commented May 7, 2021

See some benchmark results from @aethanyc in #707.

@devinreams devinreams added the C-segmentation Component: Segmentation label May 10, 2021
@aheninger
Copy link

I did some quick informal performance measurements of ICU4C vs uax14_rs on English text, and came up with similar results to those reported above, in #707. ICU is slower, running at around 60% the speed of uax14_rs.

My suspicion is that part of the difference comes from a ICU's caching layer. ICU's API supports forwards and backwards iteration, and testing arbitrary string positions, all from the same iterator. Caching helps some use cases, but not plain forward iteration. The work per character for forward iteration of the underlying engine looks like it should be comparable for ICU and uax14_rs packages. All this is just hunch - without some real profiling work, I don't really know and could easily be wrong.

As @dhardy and @raphlinus noted earlier, there is also xi-unicode/rust/unicode. The approach it takes is similar to that of uax14_rs - I wonder if there is some common ancestry between the two

In both xi-uniocde and uax14_rs, lists of states and character properties are hard coded; you can't make a word or grapheme cluster iterator by just plugging in different rules.

Also, both xi-editor and uax14_rs include tests to check the Unicode test file LineBreakTest.txt. uax14_rs is skipping tests that include the line break character classes OP, NU, PO and PR. For uax14_rs I confirmed that the test fails if these cases are added back in.

@aethanyc aethanyc removed this from the ICU4X 0.4 milestone Aug 12, 2021
@aethanyc aethanyc added this to the ICU4X 0.5 milestone Aug 12, 2021
@aethanyc aethanyc modified the milestones: ICU4X 0.5, ICU4X 1.0 Jan 20, 2022
@sffc
Copy link
Member

sffc commented Jul 28, 2022

This will be closed after segmenter moves to components in #2259.

@sffc sffc modified the milestones: ICU4X 1.0 (Features), ICU4X 1.1 Sep 26, 2022
@aethanyc aethanyc modified the milestones: ICU4X 1.1, ICU4X 1.2 Dec 15, 2022
@sffc
Copy link
Member

sffc commented Apr 13, 2023

Segmenter has been moved to components and is being released in 1.2. Closing this tracking issue.

@sffc sffc closed this as completed Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation C-unicode Component: Props, sets, tries S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

10 participants