Should "breakType" rename to "segmentType" #44

Closed
FrankYFTang opened this issue Oct 16, 2018 · 5 comments

Comments

@FrankYFTang
Contributor

During one of our design reviews, a colleague asked why we named this API "Segmenter" instead of "BreakIterator" while still using the term "breakType" rather than "segmentType". He suggested that if we name this API "Segmenter", we should make all the names consistent and therefore rename "breakType" in the spec to "segmentType".

@littledan
Member

Good idea. (Or, should we call it, Intl.Breaker???) Added to the October 2018 Intl meeting agenda. https://github.com/tc39/ecma402/blob/master/meetings/agenda-2018-10-18.md

@gibson042
Collaborator

gibson042 commented Nov 6, 2018

UAX #29 employs the following vocabulary:

  • [significant] text element: a sentence, word, or user-perceived character
  • segmentation: the process of boundary determination
  • boundary: a transition point between two segments
  • segment: synonym for [significant] text element
  • break: synonym for boundary
  • grapheme cluster: an algorithmically-defined approximation of a user-perceived character

UAX #14 adds:

  • line break: a position in text where one line ends
  • [line] break opportunity: a position in text where a line is allowed to end
  • mandatory break: a character property that requires an immediately following line break

Since this proposal is derived from those technical reports, it would be nice if the interface introduced by it hewed as closely to them as practical. ICU demonstrates that there is value in providing detail beyond the mere position of boundaries, but taking its interface (which targets low-level languages and has grown organically in specialized directions) as gospel seems like a mistake. And model accuracy is also important... boundaries don't have properties of their own, but their preceding segments do (and in combination rather than as partitioners, cf. getRuleStatusVec)—even mandatory vs. optional line break opportunities (or "hard" vs. "soft" in ICU vocabulary) are determined by whether or not the last character of the preceding segment is a terminator.
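
To illustrate the last point, here is a minimal sketch of how a consumer could tell a "hard" from a "soft" line break by looking only at the end of the preceding segment. The helper name `lineBreakKind` is hypothetical, and the terminator list assumes the UAX #14 classes BK, CR, LF, and NL are the ones that force a break.

```javascript
// Code points whose UAX #14 line breaking class forces a break after them.
// Assumption of this sketch: classes BK, CR, LF, and NL are the relevant ones.
const HARD_TERMINATORS = new Set([
  0x000a, // LINE FEED (LF)
  0x000b, // LINE TABULATION (BK)
  0x000c, // FORM FEED (BK)
  0x000d, // CARRIAGE RETURN (CR)
  0x0085, // NEXT LINE (NL)
  0x2028, // LINE SEPARATOR (BK)
  0x2029, // PARAGRAPH SEPARATOR (BK)
]);

// Hypothetical helper: classify the break that follows `segment` by
// inspecting the segment's last code unit (all terminators are in the BMP).
function lineBreakKind(segment) {
  if (segment.length === 0) return "soft";
  const last = segment.charCodeAt(segment.length - 1);
  return HARD_TERMINATORS.has(last) ? "hard" : "soft";
}

console.log(lineBreakKind("Hello\n")); // "hard"
console.log(lineBreakKind("Hello "));  // "soft"
```

The property being tested belongs to the segment's final character, not to the boundary itself, which is the modelling point made above.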

I'm not sure this proposal should include reflection of segment characteristics, but if it does then we should avoid the singular "type" altogether, in anticipation of future extensions describing segments by multiple dimensions (e.g., a word being foreign to the segmenter locale, having code points from multiple general categories, etc.). Do we want granularity-specific properties (e.g., mandatory: true or terminatingPunctuation: "!")? Or perhaps an array or set that is always present and contains granularity-specific values (e.g., segmentTags: ["word"])? But if you're worried about performance, it might be best to leave such determinations out of the implementation itself, or make them opt-in at iterator construction time.
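
For concreteness, the two shapes floated above might look roughly like this in plain JavaScript. None of this is specified by the proposal: `mandatory`, `terminatingPunctuation`, and `segmentTags` are only the illustrative names from the previous paragraph, and `hasTag` is a hypothetical helper.

```javascript
// Hypothetical shapes only; nothing here is specified by the proposal.

// Option A: granularity-specific properties, present only where they apply.
const lineSegment = { segment: "Hello\n", index: 0, mandatory: true };
const sentenceSegment = { segment: "Stop!", index: 6, terminatingPunctuation: "!" };

// Option B: an always-present collection of granularity-specific tags.
const wordSegment = { segment: "pipe", index: 19, segmentTags: ["word"] };

// With option B, a consumer can test for a tag without knowing which
// granularity produced the segment, and new tags can be added later
// without introducing new property names.
function hasTag(seg, tag) {
  return Array.isArray(seg.segmentTags) && seg.segmentTags.includes(tag);
}

console.log(hasTag(wordSegment, "word")); // true
console.log(hasTag(lineSegment, "word")); // false
```

Option A reads more naturally per granularity, while option B gives every segment the same shape; which trade-off is preferable is exactly the open question.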

@gibson042
Collaborator

gibson042 commented Apr 19, 2019

The current text does a poor job of defining what breakType is. Possible values seem to describe segments rather than boundaries, and it is not specified which boundary-adjacent segment they correspond to. This is especially confusing for backwards iteration: what is the proper value of breakType after (new Intl.Segmenter("fr", {granularity: "word"})).segment("Ceci n'est pas une pipe").preceding(8)? There's also the issue of a missing definition for "numbers, letters, kana characters, ideographic characters, etc" and "sentence terminator ('.', '?', '!', etc.)".
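
The ambiguity can be shown without any segmenter implementation. The sketch below uses a hand-written segmentation of the example sentence (assuming "n'est" stays together, as the default UAX #29 word rules do for an apostrophe between letters) and a hypothetical `precedingBoundary` helper standing in for `preceding(8)`.

```javascript
// Hand-written, illustrative word-granularity segmentation; an actual
// implementation or locale tailoring could split the text differently.
const input = "Ceci n'est pas une pipe";
const segments = ["Ceci", " ", "n'est", " ", "pas", " ", "une", " ", "pipe"];

// Boundary offsets implied by the segments: 0, 4, 5, 10, 11, ...
function boundariesOf(parts) {
  const offsets = [0];
  for (const part of parts) {
    offsets.push(offsets[offsets.length - 1] + part.length);
  }
  return offsets;
}

// Hypothetical stand-in for preceding(position): find the closest boundary
// strictly before `position` and report the segments on both sides of it.
function precedingBoundary(parts, position) {
  const offsets = boundariesOf(parts);
  let i = offsets.length - 1;
  while (i >= 0 && offsets[i] >= position) i--;
  if (i < 0) return null;
  return {
    index: offsets[i],
    segmentBefore: i > 0 ? parts[i - 1] : null,
    segmentAfter: i < parts.length ? parts[i] : null,
  };
}

const result = precedingBoundary(segments, 8);
// The boundary found is at index 5, with " " before it and "n'est" after it.
console.log(result.index, JSON.stringify(result.segmentBefore), JSON.stringify(result.segmentAfter));
```

Under these assumptions the boundary at index 5 sits between a space and a word, so a single breakType value has to pick one of the two neighbours, and the spec text does not say which.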

I am in favor of removing breakType because it is easy for consumers to check the break-preceding code unit at index - 1 on their own, but if breakType or a renamed equivalent remains then it needs a better and more complete specification.
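
As a rough sketch of what that consumer-side check could look like: the helper names below are hypothetical, and treating "letters and numbers" as word-like is an assumption of this example, not a definition taken from the proposal. The code steps back over a surrogate pair so that the check works on whole code points rather than a lone code unit.

```javascript
// Hypothetical helper: return the code point that ends immediately before
// `index`, or undefined when there is nothing before it.
function codePointBefore(text, index) {
  if (index <= 0 || index > text.length) return undefined;
  const unit = text.charCodeAt(index - 1);
  // If index - 1 is a low surrogate preceded by a high surrogate,
  // read the full supplementary code point instead of half of it.
  const isLow = unit >= 0xdc00 && unit <= 0xdfff;
  if (isLow && index >= 2) {
    const prev = text.charCodeAt(index - 2);
    if (prev >= 0xd800 && prev <= 0xdbff) return text.codePointAt(index - 2);
  }
  return unit;
}

// Assumed classification: letters and numbers count as word-like.
function precededByWordLike(text, index) {
  const cp = codePointBefore(text, index);
  return cp !== undefined && /[\p{L}\p{N}]/u.test(String.fromCodePoint(cp));
}

const text = "Ceci n'est pas une pipe";
console.log(precededByWordLike(text, 4)); // true  ("i" precedes the boundary)
console.log(precededByWordLike(text, 5)); // false (a space precedes it)
```

A similar one-line test against a set such as ".?!" would cover the sentence case, although which characters count as terminators is precisely what the spec text leaves undefined.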

@gibson042
Collaborator

The initial question of this issue was resolved by 242ce14.

@littledan
Member

Closing per #72
