Should "breakType" rename to "segmentType" #44

FrankYFTang · 2018-10-16T17:49:33Z

During one of our design reviews, a colleague asked why we name this API "Segmenter" instead of BreakIterator but at the same time use the term "breakType" rather than "segmentType". He suggested that if we name this API "Segmenter", we should make all the names consistent and therefore rename "breakType" in the spec to "segmentType".

littledan · 2018-10-16T18:14:55Z

Good idea. (Or, should we call it, Intl.Breaker???) Added to the October 2018 Intl meeting agenda. https://github.com/tc39/ecma402/blob/master/meetings/agenda-2018-10-18.md

gibson042 · 2018-11-06T23:05:07Z

UAX #29 employs the following vocabulary:

[significant] text element: a sentence, word, or user-perceived character
segmentation: the process of boundary determination
boundary: a transition point between two segments
segment: synonym for [significant] text element
break: synonym for boundary
grapheme cluster: an algorithmically-defined approximation of a user-perceived character

UAX #14 adds:

line break: a position in text where one line ends
[line] break opportunity: a position in text where a line is allowed to end
mandatory break: a character property that requires an immediately following line break

Since this proposal is derived from those technical reports, it would be nice if the interface introduced by it hewed as closely to them as practical. ICU demonstrates that there is value in providing detail beyond the mere position of boundaries, but taking its interface (which targets low-level languages and has grown organically in specialized directions) as gospel seems like a mistake. And model accuracy is also important... boundaries don't have properties of their own, but their preceding segments do (and in combination rather than as partitioners, cf. getRuleStatusVec)—even mandatory vs. optional line break opportunities (or "hard" vs. "soft" in ICU vocabulary) are determined by whether or not the last character of the preceding segment is a terminator.
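
To illustrate the last point, here is a minimal sketch of deriving "mandatory vs. optional" from the preceding segment instead of from the boundary itself. The helper name `isMandatoryBreakAfter` is hypothetical and not part of the proposal, and the character set is a simplification based on the UAX #14 classes BK, CR, LF, and NL; it ignores the rule that the end of text is always a break.

```javascript
// Characters whose UAX #14 line breaking class forces a break after them
// (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR).
const HARD_BREAK_CHARS = new Set([
  "\u000A", "\u000B", "\u000C", "\u000D", "\u0085", "\u2028", "\u2029",
]);

// Hypothetical helper: given the text of the segment that precedes a line
// break opportunity, report whether that opportunity is mandatory ("hard").
function isMandatoryBreakAfter(segment) {
  if (segment.length === 0) return false;
  // The decision depends only on the last code unit of the preceding segment.
  return HARD_BREAK_CHARS.has(segment[segment.length - 1]);
}
```

The segment strings are assumed to come from a line-granularity segmenter; the check itself needs nothing from the boundary beyond knowing which segment precedes it.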

I'm not sure this proposal should include reflection of segment characteristics, but if it does then we should avoid the singular "type" altogether, in anticipation of future extensions describing segments by multiple dimensions (e.g., a word being foreign to the segmenter locale, having code points from multiple general categories, etc.). Do we want granularity-specific properties (e.g., mandatory: true or terminatingPunctuation: "!")? Or perhaps an array or set that is always present and contains granularity-specific values (e.g., segmentTags: ["word"])? But if you're worried about performance, it might be best to leave such determinations out of the implementation itself, or make them opt-in at iterator construction time.
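
As a rough sketch of the two shapes floated above (all field names other than `mandatory`, `terminatingPunctuation`, and `segmentTags` are assumptions made for illustration, and none of this is spec text):

```javascript
// Shape A: granularity-specific properties, present only where they apply.
const lineResultA = { segment: "Stop!\n", index: 0, mandatory: true };
const sentenceResultA = { segment: "Stop! ", index: 0, terminatingPunctuation: "!" };

// Shape B: a collection that is always present and holds
// granularity-specific values, leaving room for several tags per segment.
const wordResultB = { segment: "Stop", index: 0, segmentTags: ["word"] };
const punctuationResultB = { segment: "!", index: 4, segmentTags: [] };

// Hypothetical consumer helper for Shape B.
function hasTag(result, tag) {
  return result.segmentTags.includes(tag);
}
```

With Shape B a consumer never needs to probe for the existence of a property, at the cost of allocating a collection per result, which is where the performance concern comes in.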

gibson042 · 2019-04-19T14:12:53Z

The current text does a poor job of defining what breakType is. Possible values seem to describe segments rather than boundaries, and it is not specified which boundary-adjacent segment they correspond to. This is especially confusing for backwards iteration: what is the proper value of breakType after (new Intl.Segmenter("fr", {granularity: "word"})).segment("Ceci n'est pas une pipe").preceding(8)? There's also the issue of missing definitions for "numbers, letters, kana characters, ideographic characters, etc" and "sentence terminator ('.', '?', '!', etc.)".

I am in favor of removing breakType because it is easy for consumers to check the break-preceding code unit at index - 1 on their own, but if breakType or a renamed equivalent remains then it needs a better and more complete specification.
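
A minimal sketch of what that consumer-side check could look like, assuming the caller already has a break index from the segmenter. The helper name `describePrecedingUnit` and its result fields are hypothetical, and the terminator list is deliberately incomplete, mirroring the "etc." in the current text.

```javascript
// Illustrative, incomplete terminator list.
const SENTENCE_TERMINATORS = [".", "?", "!"];

// Hypothetical helper: describe the code unit immediately before `index`.
function describePrecedingUnit(text, index) {
  // Nothing precedes the start of the string, and indices past the end
  // are not valid break positions.
  if (index <= 0 || index > text.length) {
    return { unit: undefined, wordLike: false, terminator: false };
  }
  const unit = text[index - 1];
  return {
    unit,
    // Letters and numbers only; a lone surrogate half will not match,
    // so supplementary-plane characters need code point handling instead.
    wordLike: /[\p{L}\p{N}]/u.test(unit),
    terminator: SENTENCE_TERMINATORS.includes(unit),
  };
}

const text = "Ceci n'est pas une pipe";
console.log(describePrecedingUnit(text, 4)); // code unit before index 4 is "i"
console.log(describePrecedingUnit(text, 5)); // code unit before index 5 is " "
```

One caveat: sentence segments under UAX #29 can include trailing whitespace, so for sentence granularity a caller may need to skip whitespace before looking for a terminator.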

gibson042 · 2019-05-04T13:14:40Z

The initial question of this issue was resolved by 242ce14.

littledan · 2019-05-05T23:16:51Z

Closing per #72

gibson042 mentioned this issue Nov 8, 2018

Is this an API for iterating segments, or boundaries? #59

Closed

gibson042 mentioned this issue Apr 19, 2019

Normative: Iterate over break locations, rather than segments #67

Closed

littledan closed this as completed May 5, 2019

gibson042 mentioned this issue Mar 26, 2020

More specific definition of isWordLike in the spec? #100

Closed
