Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Segmenter to Components #2259

Closed
21 of 22 tasks
sffc opened this issue Jul 27, 2022 · 7 comments · Fixed by #3300
Closed
21 of 22 tasks

Move Segmenter to Components #2259

sffc opened this issue Jul 27, 2022 · 7 comments · Fixed by #3300
Assignees
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Jul 27, 2022

Checklist for Segmenter in Components:

@sffc sffc added T-core Type: Required functionality C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 27, 2022
@sffc sffc added this to the ICU4X 1.0 (Features) milestone Jul 27, 2022
@aethanyc aethanyc mentioned this issue Jul 28, 2022
@sffc
Copy link
Member Author

sffc commented Jul 29, 2022

I added one more small bullet point about the symbols module (CC @robertbastian)

@aethanyc
Copy link
Contributor

aethanyc commented Aug 2, 2022

The symbols module is used in datagen.

use icu_segmenter::symbols::*;

I wonder whether it is possible to hide it from icu_segmenter's online doc, but let it remain accessible from icu_datagen?

@sffc
Copy link
Member Author

sffc commented Aug 2, 2022

Yeah, okay, let's mark the module as #[doc(hidden)] (at the point it is pub use'd in lib.rs)

@sffc
Copy link
Member Author

sffc commented Nov 2, 2022

Follow-up from #2716: in LineBreakIterator, the grapheme segmenter data should be pre-borrowed so that we don't need to re-borrow it each time. This is a known performance bottleneck that we don't want to propagate in new code.

@robertbastian
Copy link
Member

I've noticed that the public APIs eagerly return Vec<usize>s for segmentation points. Ideally we'd return some kind of lazy iterator. I don't think this is something we can implement for 1.2, but I'd like to change returns types to impl Iterator<Item = usize> so that we can do this in the future in a semver compatible way. Thoughts?

@aethanyc
Copy link
Contributor

aethanyc commented Apr 6, 2023

I've noticed that the public APIs eagerly return Vecs for segmentation points.

Which user-facing public APIs return Vec<usize>? I found internal APIs such as

pub(crate) fn complex_language_segment_utf16(
dictionary: Option<&Dictionary>,
lstm: Option<&LstmPayloads>,
grapheme: Option<&RuleBreakDataV1>,
input: &[u16],
) -> Vec<usize> {
to be the possible candidates. Returning impl Iterator<Item = usize> seems nice if its caller can benefit from such change.

@robertbastian
Copy link
Member

Sorry I got confused by internal APIs that are marked pub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants