Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document and potentially fix LSTM feature in icu_segmenter #2829

Closed
sffc opened this issue Nov 16, 2022 · 9 comments · Fixed by #3010
Closed

Document and potentially fix LSTM feature in icu_segmenter #2829

sffc opened this issue Nov 16, 2022 · 9 comments · Fixed by #3010
Assignees
Labels
C-segmentation Component: Segmentation

Comments

@sffc
Copy link
Member

sffc commented Nov 16, 2022

We aim to make features additive. However, the LSTM feature in icu_segmenter is in an odd state. We should revisit as a group and arrive at how we want to handle this.

@sffc sffc added C-segmentation Component: Segmentation discuss-priority Discuss at the next ICU4X meeting labels Nov 16, 2022
@sffc
Copy link
Member Author

sffc commented Dec 5, 2022

CC @robertbastian @Manishearth @aethanyc @makotokato

How should we handle this?

I think we should make "dictionary" and "lstm" be separate features that are additive as follows:

  1. LineBreakSegmenter::try_new_dictionary_unstable is enabled with "dictionary"
  2. LineBreakSegmenter::try_new_lstm_unstable is enabled with "lstm"
  3. LineBreakSegmenter::try_new_unstable is hidden, but the _with_any_provider and _with_buffer_provider versions are public, and depending on the enabled features, they try the dictionary first, LSTM second (or vice-versa depending on what we decide), and fall back to UAX 14 if neither dictionary nor LSTM is available

@Manishearth
Copy link
Member

I think that's a good plan

@robertbastian
Copy link
Member

depending on the enabled features, they try the dictionary first, LSTM second

I don't think this is a good idea. This can lead to those functions changing behaviour if some dependency activates a feature on icu_segmenter.

@sffc
Copy link
Member Author

sffc commented Dec 6, 2022

depending on the enabled features, they try the dictionary first, LSTM second

I don't think this is a good idea. This can lead to those functions changing behaviour if some dependency activates a feature on icu_segmenter.

The contract of those functions is "I give no opinion on what data to use, but I will do the best thing given the available data"; if someone else in your program enables LSTM, then it's fine for this function to utilize the LSTM data.

It could be called try_new_auto or try_new_best if you think it needs a scarier name.

@aethanyc
Copy link
Contributor

Per discussion with @makotokato on Dec 12/13.

  1. After #[no_std] for LSTM segmenter #2845, we can enable LSTM by default.
  2. LSTM models should be preferable over dictionaries for East Asian Languages (Thai, Lao, etc.) due to small binary data size. But for CJK, dictionary should be used by default because there's no LSTM model.

Makoto, do I summarize above correctly?

@sffc
Copy link
Member Author

sffc commented Dec 15, 2022

Discussion:

  • @robertbastian - I'm okay with the 'best effort' constructor if it has a scarier name. It just shouldn't be the default one. Suggestion: try_new_best_effort or try_new_automatic
  • @sffc - try_new_auto
  • @aethanyc - I'm okay with the multiple constructors

@makotokato
Copy link
Member

Makoto, do I summarize above correctly?

Yes.

@sffc sffc mentioned this issue Dec 20, 2022
22 tasks
@sffc sffc added this to the ICU4X 1.2 milestone Dec 22, 2022
@sffc
Copy link
Member Author

sffc commented Jan 5, 2023

Consensus on try_new_auto for the combined function.

@aethanyc
Copy link
Contributor

For the record, #3010 closed this issue by adding "auto" cargo feature rather than "dictionary" cargo feature. The discussion was in #3010 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation
Projects
None yet
5 participants