Segmenter #109

zbraniecki · 2020-05-29T19:06:13Z

For Segmenter, seems like the first crate we could consider incorporating is https://github.com/makotokato/uax14_rs

@sffc @Manishearth @makotokato - would it make sense to consider it for ICU4X?

Manishearth · 2020-05-29T22:45:18Z

Yes. We might want to consider unicode_segmentation for non-line segmentation.

Neither of these is locale-aware, however.

sffc · 2020-06-02T03:10:43Z

Intl.Segmenter is in scope for 402, and I think it makes sense to include here in ICU4X as well. I think we should build the segmenter based on the unicode properties API that ICU4X provides.

sffc · 2020-06-04T18:29:46Z

Adding to backlog along with Normalization (#40).

sffc · 2020-06-17T10:21:17Z

@methyl compiled ICU4C Segmenter to WebAssembly as a polyfill for Intl.Segmenter:

https://github.com/surferseo/intl-segmenter-polyfill

According to tc39/proposal-intl-segmenter#118 (comment), the .wasm file is ~350 KB gzipped, with the Thai dictionary but not the other dictionaries.

Just posting this here as some benchmarks to consider for the Rust segmenter.

methyl · 2020-06-18T09:50:24Z

@sffc thanks for the mention. If you have any questions about the implementation, go ahead. I also explored compiling https://github.com/google/rust_icu to WASM, but it was not straightforward and I gave up.

dhardy · 2020-09-05T19:02:17Z

Which segmentation functionality exactly? Several types are desirable:

- line breaking: covered by the mentioned crate, and also by https://crates.io/crates/xi-unicode
- word breaking, useful e.g. to select a whole word: covered by https://crates.io/crates/unicode-segmentation
- word advance forward/back (the result of Ctrl+Left/Right in a text editor): I believe this is not yet covered (see "Word navigation", unicode-rs/unicode-segmentation#83)

cc @raphlinus

raphlinus · 2020-09-05T19:33:36Z

Re line segmentation, I have no particular ego invested, but it is true as @dhardy points out that xi-unicode has a solution to this problem. It's a very optimized, hand-tuned solution. I haven't benchmarked it against uax14_rs, but did benchmark it against ICU a long time ago and it was considerably faster (somewhere in the 2x-3x range, I don't remember the details). I would be pleased and honored if it were adapted for use in icu4x, but relieved if not; I'm busy enough as it is.

There's lots more to line breaks than a UAX 14 compliant break iterator. There's all the southeast Asian dictionary-based breaking stuff, plus other desirable "tailoring." For example, Android has a little parser for url and email addresses, and applies different breaking rules for those, because the default UAX 14 behavior is pretty bad in those use cases. I should also point out that ICU has some really complex additional behavior around numbers and ASCII symbols, not covered by UAX 14. I evaluated it when working on xi-unicode and found it not worthwhile, but others with different use cases may feel differently.

So whoever adopts this will have their hands full, if the goal is a world-class solution.

The Druid text stack currently does nothing sophisticated wrt word advance, but is likely a customer for WordCursor functionality.

raphlinus · 2020-09-05T20:04:04Z

I'll also add that xi-unicode goes to Unicode 10, while uax14_rs seems to be up to date with Unicode 13. That's another consideration, and makes apples-to-apples comparison harder, as Unicode 13 is substantially more complex.

Lastly, xi-unicode has support for break iteration in non-contiguous strings, to support the needs at the time of xi-editor. I think that's probably worth removing going forward; my current thinking is that even when dealing with very large texts, it makes sense to use a contiguous string representation at the paragraph granularity and below. The maintenance burden of keeping the noncontiguous representation up to date is nontrivial.

sffc · 2020-09-06T05:47:58Z

One of the main value propositions of ICU4X is to provide Unicode and CLDR data (#217 is a thread specifically about Unicode character property data). In cases where functionality is out of scope for ICU4X, I hope that downstream crates can depend on the ICU4X data provider to get their character properties.

For segmentation in particular, as others have mentioned, there is already a strong ecosystem in Rust. I don't think we should reinvent the wheel in ICU4X.

There is one area where I'm not aware of strong ecosystem support, which is dictionary-based segmentation (important in many East and Southeast Asian languages). I'm hosting an intern this fall who I hope will be able to build a segmenter for these languages.

aethanyc · 2021-02-22T19:23:23Z

Hi all, I have a document for implementing segmenters in icu4x. Feedback very welcome.
https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit

cc @jfkthame

sffc · 2021-02-22T20:10:01Z

CC @nathanhammond

nathanhammond · 2021-02-28T19:13:23Z

@aethanyc 👋

I intend to review your document in close detail. I'm a Cantonese speaker and have implemented a bad trie version of Cantonese segmentation for JS (using the 402 API), which is still immensely better than ICU's BreakIterator with the default dictionary.

Apropos of literally nothing in your document, since I've only skimmed it at this point, here is a thread where I've been discussing segmentation API limitations for Cantonese:

tc39/proposal-intl-segmenter#133

I'm looking forward to learning from your research!

sffc · 2021-05-04T20:48:32Z

Implementation feedback from @aheninger:

Here are a few followup comments to the ICU4X deep dive discussion of April 21.

Testing. For evaluating uax14_rs (or other new implementations), the Unicode segmentation test data at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.txt is very useful. Writing code to parse these text files and drive a test is pretty simple, and will give a quick read on whether an implementation is basically on track.
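A rough sketch of such a test driver, assuming the usual layout of the Unicode break test files: each data line alternates `÷` (break allowed) and `×` (no break) with hexadecimal code points, and everything after `#` is a comment. The names `BreakTestCase`, `parse_test_line` and `failing_cases` are made up for illustration, and the segmenter under test is passed in as a closure rather than tied to any particular crate.

```rust
/// One parsed test case: the text, plus the byte offsets where a break is allowed.
#[derive(Debug, PartialEq)]
struct BreakTestCase {
    text: String,
    breaks: Vec<usize>,
}

/// Parses one line in the break-test format. Returns `None` for blank or
/// comment-only lines, and for lines this sketch cannot represent
/// (bad hex, or a code point that is not a valid Rust `char`).
fn parse_test_line(line: &str) -> Option<BreakTestCase> {
    // Everything after '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut text = String::new();
    let mut breaks = Vec::new();
    for token in data.split_whitespace() {
        match token {
            // Break opportunity at the current end of the text built so far.
            "÷" => breaks.push(text.len()),
            // No break here; nothing to record.
            "×" => {}
            hex => {
                let code_point = u32::from_str_radix(hex, 16).ok()?;
                text.push(char::from_u32(code_point)?);
            }
        }
    }
    Some(BreakTestCase { text, breaks })
}

/// Runs `segment` (a stand-in for the implementation under test) over every
/// test line and returns the cases where its break offsets differ.
fn failing_cases<F>(data: &str, segment: F) -> Vec<BreakTestCase>
where
    F: Fn(&str) -> Vec<usize>,
{
    data.lines()
        .filter_map(parse_test_line)
        .filter(|case| segment(&case.text) != case.breaks)
        .collect()
}

fn main() {
    // Two lines written in the test-file format, preceded by a comment line.
    let sample = "# sample data\n× 0023 × 0023 ÷\t# AL x AL\n× 0023 ÷ 2014 ÷\t# AL / B2";
    // Trivial stand-in segmenter: only breaks at the end of the text.
    let failures = failing_cases(sample, |s: &str| vec![s.len()]);
    println!("{} failing case(s)", failures.len());
}
```

The offsets are byte offsets into the UTF-8 string, so an implementation that reports character indices would need a conversion step before comparing.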

Little was said about dictionary based word segmentation. Does ICU4X have a strategy separate from classic ICU? How does the LSTM based thing Frank is working on fit in?

In ICU, the transitions between text regions requiring rule based and dictionary based breaking is an area with many unaddressed bugs. Something to keep in mind while doing a new implementation.

Sequential vs. Parallel application of rules, and the speed vs. complexity tradeoff. While slower, if an engine based on sequential application of the rules turned out to be good enough, the rule maintenance problem would be greatly simplified. I think this would be true even if adding a rule customization involved writing code, or running a tool that produced some code.

A good performing sequential-rule engine would probably want to be based on rules that describe how to move from one boundary to the next, rather than on UAX style rules that describe how to test an arbitrary text position for being a boundary. But the transformation between the two styles is much easier than the sequential to parallel transformation involved in maintaining the ICU rules.
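To make the two rule styles concrete, here is a toy sketch. The three character classes and four rules are invented for illustration and are only loosely modelled on UAX 14; they are not the real rule set. `is_break` is the UAX style (test one position, first matching rule wins), while `next_boundary` is a hand-derived boundary-to-boundary version of the same rules. The two should agree on every input.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Space,
    Letter,
    Other,
}

fn classify(c: char) -> Class {
    if c == ' ' {
        Class::Space
    } else if c.is_alphanumeric() {
        Class::Letter
    } else {
        Class::Other
    }
}

/// UAX style: test the position between two classes. The toy rules are
/// tried in order and the first one that applies decides.
fn is_break(before: Class, after: Class) -> bool {
    if after == Class::Space {
        return false; // Rule 1: never break before a space.
    }
    if before == Class::Space {
        return true; // Rule 2: break after a run of spaces.
    }
    if before == Class::Letter && after == Class::Letter {
        return false; // Rule 3: keep letters together.
    }
    true // Rule 4: otherwise, break.
}

/// All boundaries found by testing every position with `is_break`.
fn boundaries_by_testing(text: &str) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev: Option<Class> = None;
    for (i, c) in text.char_indices() {
        let class = classify(c);
        if let Some(before) = prev {
            if is_break(before, class) {
                out.push(i);
            }
        }
        prev = Some(class);
    }
    if !text.is_empty() {
        out.push(text.len()); // End of text is always a boundary.
    }
    out
}

/// Boundary-to-boundary style: starting at a boundary, one segment is a
/// letter run or a single other character, followed by any trailing spaces.
fn next_boundary(text: &str, start: usize) -> Option<usize> {
    let rest = &text[start..];
    let mut chars = rest.char_indices().peekable();
    let (_, first) = chars.peek().copied()?;
    match classify(first) {
        Class::Letter => {
            while matches!(chars.peek(), Some(&(_, c)) if classify(c) == Class::Letter) {
                chars.next();
            }
        }
        Class::Other => {
            chars.next();
        }
        Class::Space => {}
    }
    while matches!(chars.peek(), Some(&(_, c)) if classify(c) == Class::Space) {
        chars.next();
    }
    Some(start + chars.peek().map_or(rest.len(), |&(i, _)| i))
}

/// All boundaries found by repeatedly jumping to the next one.
fn boundaries_by_scanning(text: &str) -> Vec<usize> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(next) = next_boundary(text, pos) {
        out.push(next);
        pos = next;
    }
    out
}

fn main() {
    let text = "ab  cd!e";
    println!("{:?}", boundaries_by_testing(text));
    println!("{:?}", boundaries_by_scanning(text));
}
```

Even at this size, the scanning version had to be derived by reasoning about how the ordered rules interact, which hints at why maintaining a transformed rule set by hand gets expensive as the rules grow.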

Monkey Testing. In ICU, this has proved to be by far the most effective way to find obscure corner case bugs in the rules. It relies on having a reference implementation of the defining UAX rules that closely follows the UAX breaking algorithm, and is used to generate expected results for random test data.

The random test text sequences are not completely random. Start with a list of all of the character classes that appear in the rules. When building a test string, for each character to be added, first pick a random character class, then pick a random character from within the class.

I would be very hesitant to release a segmentation implementation without having monkey-tested it. Unfortunately, a good monkey test is quite a bit more work to write than a test driver for the Unicode test data (the Testing point above). Do the test driver first.
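A minimal sketch of the string-generation step described above: pick a random class, then a random character within it. The `CLASSES` table is a tiny hand-picked subset for illustration (a real test would build it from the line break property data), the xorshift generator is only there to keep the example free of external crates, and the reference and candidate segmenters are passed in as closures. All names here are hypothetical.

```rust
/// Small deterministic xorshift generator; the seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Pseudo-random index in `0..n` (n must be non-zero).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Heavily abbreviated class table: (line break class, sample members).
const CLASSES: &[(&str, &[char])] = &[
    ("AL", &['a', 'Z', '#']),
    ("NU", &['0', '7']),
    ("SP", &[' ']),
    ("OP", &['(', '[']),
    ("CP", &[')', ']']),
];

/// Builds a test string: for each character, first pick a random class,
/// then pick a random character from within that class.
fn random_test_string(rng: &mut XorShift, len: usize) -> String {
    (0..len)
        .map(|_| {
            let (_, members) = CLASSES[rng.below(CLASSES.len())];
            members[rng.below(members.len())]
        })
        .collect()
}

/// Compares a candidate against a reference implementation on random
/// strings and returns the first string on which they disagree, if any.
fn monkey_test<R, C>(seed: u64, iterations: usize, reference: R, candidate: C) -> Option<String>
where
    R: Fn(&str) -> Vec<usize>,
    C: Fn(&str) -> Vec<usize>,
{
    let mut rng = XorShift(seed);
    for _ in 0..iterations {
        let len = 1 + rng.below(20);
        let text = random_test_string(&mut rng, len);
        if reference(&text) != candidate(&text) {
            return Some(text);
        }
    }
    None
}

fn main() {
    // Stand-in segmenters: the "reference" breaks only at the end of the
    // text, the deliberately broken "candidate" reports no breaks at all.
    let failure = monkey_test(7, 100, |s: &str| vec![s.len()], |_: &str| vec![]);
    println!("first disagreement: {:?}", failure);
}
```

Picking the class first gives rare classes the same weight as common ones, which uniform sampling over code points would not. A fixed seed keeps any failure reproducible.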

sffc · 2021-05-07T17:37:41Z

See some benchmark results from @aethanyc in #707.

aheninger · 2021-05-18T00:00:42Z

I did some quick informal performance measurements of ICU4C vs uax14_rs on English text, and came up with similar results to those reported above, in #707. ICU is slower, running at around 60% of the speed of uax14_rs.

My suspicion is that part of the difference comes from ICU's caching layer. ICU's API supports forwards and backwards iteration, and testing arbitrary string positions, all from the same iterator. Caching helps some use cases, but not plain forward iteration. The work per character for forward iteration of the underlying engine looks like it should be comparable for the ICU and uax14_rs packages. All this is just a hunch; without some real profiling work, I don't really know and could easily be wrong.

As @dhardy and @raphlinus noted earlier, there is also xi-unicode/rust/unicode. The approach it takes is similar to that of uax14_rs; I wonder if there is some common ancestry between the two.

In both xi-unicode and uax14_rs, lists of states and character properties are hard coded; you can't make a word or grapheme cluster iterator by just plugging in different rules.

Also, both xi-editor and uax14_rs include tests to check the Unicode test file LineBreakTest.txt. uax14_rs is skipping tests that include the line break character classes OP, NU, PO and PR. For uax14_rs I confirmed that the test fails if these cases are added back in.

sffc · 2022-07-28T17:17:05Z

This will be closed after segmenter moves to components in #2259.

sffc · 2023-04-13T22:25:33Z

Segmenter has been moved to components and is being released in 1.2. Closing this tracking issue.

sffc added C-unicode Component: Props, sets, tries T-core Type: Required functionality labels Jun 2, 2020

sffc added backlog help wanted Issue needs an assignee labels Jun 4, 2020

sffc closed this as completed Jun 4, 2020

filmil mentioned this issue Jun 4, 2020

Import unicode-normalization or re-write from scratch? #40

Closed

sffc reopened this Sep 4, 2020

sffc assigned aethanyc Apr 22, 2021

sffc added S-epic Size: Major project (create smaller child issues) and removed backlog help wanted Issue needs an assignee labels Apr 22, 2021

sffc added this to the ICU4X 0.4 milestone Apr 22, 2021

aethanyc mentioned this issue May 7, 2021

Add uax14_rs as an experiment crate #707

Closed

devinreams added the C-segmentation Component: Segmentation label May 10, 2021

aethanyc removed this from the ICU4X 0.4 milestone Aug 12, 2021

aethanyc added this to the ICU4X 0.5 milestone Aug 12, 2021

aethanyc modified the milestones: ICU4X 0.5, ICU4X 1.0 Jan 20, 2022

sffc modified the milestones: ICU4X 1.0 Untriaged, ICU4X 1.0 (Features) May 25, 2022

sffc modified the milestones: ICU4X 1.0 (Features), ICU4X 1.1 Sep 26, 2022

aethanyc modified the milestones: ICU4X 1.1, ICU4X 1.2 Dec 15, 2022

sffc closed this as completed Apr 13, 2023
