Consistency with Number.range model #120

sffc · 2020-07-09T21:14:52Z

@Jack-Works has been working on https://github.com/tc39/proposal-Number.range, in which the method returns an Iterator directly, rather than an object implementing Symbol.iterator. This topic has not been decided (see tc39/proposal-iterator.range#22), but it is up for discussion again at the July TC39. The arguments in favor of this model are mostly around the ergonomics: you can use Iterator methods like .take() on the return value directly, rather than calling [Symbol.iterator] or wrapping the return value in Iterator.from.

We switched from this model to the model with the intermediate immutable type %Segments%.prototype due to concerns I raised in #93, which still stand.

If, and only if, Number.range settles on the "direct-to-iterator" model, then I wanted to think about an alternative API for Intl.Segmenter. What would you think about,

Change Intl.Segmenter.prototype.segment to return a %SegmentIterator% like it did before
- Optional: rename this method to .split(), like String.prototype.split
Add new method Intl.Segmenter.prototype.segmentContaining that returns the segment data object

The cookbook examples in the README would change to:

// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let current = undefined;

current = segmenter.segmentContaining(input, 0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segmenter.segmentContaining(input, 5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segmenter.segmentContaining(input, 6)
// → { index: 6, segment: "-", isWordLike: false }

current = segmenter.segmentContaining(input, current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segmenter.segmentContaining(input, current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segmenter.segmentContaining(input, current.index + current.segment.length)
// → undefined

Also, now that we are diminishing the functionality of %Segments%.prototype due to SES concerns, the case for a standalone type is weaker.

I'm not sure what this would mean for performance. With %Segments%.prototype, the engine has a place to cache info from call to call. Maybe with this new model, the browser can cache things based on the input string.

CC @littledan @rbuckton @hax @mpcsh @Andrew-Cottrell @gsathya @devsnek @tabatkins

The text was updated successfully, but these errors were encountered:

sffc · 2020-07-10T01:21:03Z

To clarify: I very much want Intl.Segmenter to go to Stage 3, and I am happy with the status quo of the proposal, regardless of how we go on Number.range. I opened this issue in order to highlight a potential inconsistency in how we deal with iterator factories in the standard library. I leave it up to others to determine whether this is something worth pursuing.

Andrew-Cottrell · 2020-07-10T09:17:14Z

I have read the README and the rendered spec, but have not yet read the existing issues. My apologies if I have misunderstood some important aspect. These are my initial thoughts.

What would you think about,

Change Intl.Segmenter.prototype.segment to return a %SegmentIterator% like it did before

Optional: rename this method to .split(), like String.prototype.split

Add new method Intl.Segmenter.prototype.segmentContaining that returns the segment data object

My thoughts are about consistency with the RegExp.prototype[@@matchAll] model rather than the Number.range model.

Intl.Segmenter is — on some level — conceptually similar to RegExp.prototype[@@matchAll]. An instance of Intl.Segmenter corresponds to an instance of RegExp, %SegmentIterator%.prototype corresponds to %RegExpStringIteratorPrototype%, and the index parameter of the %Segments%.prototype.containing method corresponds to the lastIndex property of RegExp instances. One immediate difference is lastIndex may usefully be set only before calling matchAll, whereas containing may be called multiple times and independently of any iteration.

I suspect Intl.Segmenter might more often be used without calling the containing method, so — from an ergonomic perspective — it may be preferable to return a %SegmentIterator% and move/rename the containing method, as indicated above.

Regarding renaming Intl.Segmenter.prototype.segment — in addition to the option split — the options match, matchAll, or matchSegments might also be considered. Another option is a combination of Intl.Segmenter.prototype[@@matchSegments] and String.prototype.matchSegments(segmenter), which might be useful (although I don't fully understand why this was done for RegExp.prototype[@@matchAll]).

%Segments%.prototype.containing could be moved to Intl.Segmenter.prototype.segmentAt — rather than Intl.Segmenter.prototype.segmentContaining — as it is shorter and follows the naming convention established by the charAt, charCodeAt, and codePointAt methods of String.prototype. One minor difference is the index argument may point within the matched segment, whereas codePointAt does not return a useful code point when the pos argument points to the second code unit of a surrogate pair. However, these At methods are called on the string and only the index (or pos) needs to be passed as an explicit argument. Perhaps it might be interesting to consider String.prototype.segmentAt(index, segmenter) or similar.

Alternatively, %Segments%.prototype.containing could be moved to %SegmentIterator%.prototype.segmentContaining as segmentContaining is unlikely to conflict with any method added to %IteratorPrototype%. One drawback with this idea is the need to call the iterator factory method when only random access is needed.

Regarding caching in a polyfill/shim implementation: without a %Segments% instance being available to close over both the Intl.Segmenter and the String (via [[SegmentsString]] and [[SegmentsSegmenter]] internal slots), there is no simple place to store a cache that would be automatically garbage collected when no longer needed. A shared bounded data structure with a least-recently-used (LRU) eviction policy might work well. I am not qualified to say much about optimisation of native implementations, but I guess a shared bounded cache might be useful to prevent excessive memory usage in scripts that match segments in a vast number of strings while keeping the %SegmentIterator% instances alive.

I opened this issue in order to highlight a potential inconsistency in how we deal with iterator factories in the standard library. I leave it up to others to determine whether this is something worth pursuing.

I see no particular reason to standardise on returning either an Iterable or Iterator from factories. I think this could be decided on a case-by-case basis. However, when all else is equal, Iterator seems to me to be the preferable option due to proposal-iterator-helpers.

hax · 2020-07-11T06:18:45Z

I truly agree with @Andrew-Cottrell . The only thing I need to add is, even I also feel segment is much like matchAll than range, generally I prefer iterable than iterator because iterable is reusable, and allow us pass the reference of it to many different place. This is especially convenient in segment case because if it's a iterator, u need to pass {segmenter, input} and recreate iterator in every places.

Andrew-Cottrell · 2020-07-13T09:43:16Z

..., generally I prefer iterable than iterator because iterable is reusable, and allow us pass the reference of it to many different place. This is especially convenient in segment case because if it's a iterator, u need to pass {segmenter, input} and recreate iterator in every places.

While it is often the case that we want to operate on the same operands in multiple scopes, I don't think proposals should eagerly specify new types that only serve to close over operands, unless there is a clear benefit such as interoperability or performance. One can create iterable objects in user code without much difficulty.

function matchAll(input, regexp) {
    return {input, regexp, [Symbol.iterator]: () => input.matchAll(regexp)};
}
var iterable = matchAll("simple ASCII words", new RegExp("(?:(\\w+)|(?:\\W+))", "g"));

function matchSegments(input, segmenter) {
    return {input, segmenter, [Symbol.iterator]: () => input.matchSegments(segmenter)};
}
var iterable = matchSegments("Moi?  N'est-ce pas.", new Intl.Segmenter("fr", {granularity: "word"}));

function range(start, end) {
    return {start, end, [Symbol.iterator]: () => Number.range(start, end)};
}
var iterable = range(0, 10);

hax · 2020-07-18T20:34:15Z

Yes, it's easy to create iterable, but only make sense when people understand all such things. On the other hand, I don't see any lost of returning iterable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistency with Number.range model #120

Consistency with Number.range model #120

sffc commented Jul 9, 2020

sffc commented Jul 10, 2020

Andrew-Cottrell commented Jul 10, 2020 •

edited

Loading

hax commented Jul 11, 2020

Andrew-Cottrell commented Jul 13, 2020 •

edited

Loading

hax commented Jul 18, 2020

Consistency with Number.range model #120

Consistency with Number.range model #120

Comments

sffc commented Jul 9, 2020

sffc commented Jul 10, 2020

Andrew-Cottrell commented Jul 10, 2020 • edited Loading

hax commented Jul 11, 2020

Andrew-Cottrell commented Jul 13, 2020 • edited Loading

hax commented Jul 18, 2020

Andrew-Cottrell commented Jul 10, 2020 •

edited

Loading

Andrew-Cottrell commented Jul 13, 2020 •

edited

Loading