Intl.Segmenter with URLs, email addresses, and acronyms #656

sffc · 2022-03-01T01:01:36Z

Consider the following string of text:

Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.

What segments should we produce?

ICU currently produces

Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played

But V8 splits the acronym, URL, and email address into separate tokens.
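For readers who want to compare engines themselves, here is a minimal sketch of how the word-like segments of the string above could be listed with `Intl.Segmenter`. The helper name `wordLikeSegments` and the `"en"` locale are assumptions made for illustration. Because boundary determination is implementation-dependent, no expected output is shown.

```javascript
// Sketch: list the word-like segments that the host's Intl.Segmenter produces.
// Results vary by engine and ICU version, so no output is asserted here.
const text =
  "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. " +
  "or example@gmail.com. www.google.com and she's there and well-played.";

// Hypothetical helper (not part of any spec): keeps only segments that the
// segmenter flags as word-like, dropping spaces and punctuation.
function wordLikeSegments(input, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(input)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(wordLikeSegments(text).join("\n"));
```

Running this in different browsers or runtimes is one way to see whether "P.T.O", "gmail.com", and "www.google.com" stay whole.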

Questions:

Should this be implementation-dependent behavior, or should it go into the spec?
Which behavior should be the default?
Should this be an API option?

See https://bugs.chromium.org/p/chromium/issues/detail?id=1301830

zbraniecki · 2022-03-01T02:51:40Z

FYI @makotokato @aethanyc @jfkthame

srl295 · 2022-03-02T18:33:34Z

I think this should go upstream into CLDR, perhaps as API switchable behavior.

gibson042 · 2022-03-02T19:01:13Z

ECMA-402 text segmentation boundary determination is intentionally implementation-dependent, as documented at https://tc39.es/ecma402/#annex-implementation-dependent-behaviour and https://tc39.es/ecma402/#sec-findboundary .

Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex 29 (available at https://www.unicode.org/reports/tr29/). It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at http://cldr.unicode.org/).

It may also be worth noting that the default word boundary rules of UAX 29 don't allow breaking apart "P.T.O" or "gmail.com" or "www.google.com", per WB6 and WB7—V8's behavior is precisely the sort of behavior intended to be protected by that clause (although I personally would recommend against this particular deviation).
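To make the WB6/WB7 point concrete, the following is a heavily simplified sketch of those two rules only. It is an approximation under stated assumptions: letters are matched with `\p{L}` rather than the real Word_Break property values, only "." and "'" are treated as mid-word characters, and the text is indexed by UTF-16 code unit. The function name `suppressedByWB6or7` is hypothetical.

```javascript
// Simplified illustration of UAX 29 rules WB6 and WB7. This is NOT a full
// word-break implementation; the character classes are rough stand-ins.
const isLetter = (ch) => ch !== undefined && /^\p{L}$/u.test(ch);
// Assumption: only "." and "'" are treated as mid-word punctuation here.
const isMid = (ch) => ch === "." || ch === "'";

// Returns true if a break at offset i (between text[i - 1] and text[i])
// would be forbidden by the simplified WB6 or WB7.
function suppressedByWB6or7(text, i) {
  // WB6: Letter × Mid Letter (no break before the punctuation)
  const wb6 = isLetter(text[i - 1]) && isMid(text[i]) && isLetter(text[i + 1]);
  // WB7: Letter Mid × Letter (no break after the punctuation)
  const wb7 = isLetter(text[i - 2]) && isMid(text[i - 1]) && isLetter(text[i]);
  return wb6 || wb7;
}

console.log(suppressedByWB6or7("P.T.O", 1)); // break before the first "."
console.log(suppressedByWB6or7("times. 123", 5)); // "." followed by a space
```

Under these assumptions, the punctuation inside "P.T.O" or "gmail.com" sits between two letters and so cannot start or end a word, while the sentence-final "." after "times" can.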

I would also discourage getting stuck in the morass of possible API switches, unless it is something as general (and advanced) as explicit specification of boundary rules in a form similar to that of UAX 29.

sffc · 2022-03-02T21:19:31Z

The use case comes from a Google team that is implementing a spell checker via word segmentation.

We want to detect URLs as a whole so that we can refrain from spellchecking them and giving erratic spelling suggestions to users.

It seems like we could form some use cases for word segmentation:

When you want real words (e.g., spell checker)
When you want smaller tokens (e.g., cursor positioning)

I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.

aethanyc · 2022-03-02T23:06:59Z

For the record, ICU4X word segmenter produces the same output as ICU.

Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played

Here is the code that generates this result. (Segments that start with punctuation are removed.)

let s = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
let provider = RuleBreakDataProvider;
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str(s).collect();
for w in breakpoints.windows(2) {
    let begin = w[0];
    let end = w[1];
    if !s[begin..end].starts_with(&[' ', ':', '@', '.', '/', '-']) {
        println!("{}", &s[begin..end]);
    }
}

gibson042 · 2022-03-02T23:36:04Z

I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.

The problem is that there are at least a dozen similar edge cases—domain names, email addresses, IP addresses, ~~URIs~~ IRIs, hashtags, @-references, Markdown/bbcode/etc. formatting, Latin abbreviations, hyphenated compounds, dates/times/datetimes, emoticons/kaomoji, etc.

"Long" vs. "short" seems too coarse, and a collection of fine-tuning options seems even worse.

sffc · 2022-03-03T08:48:11Z

I agree that we don't want an explosion of options for each edge case. I think we could look at use cases, though, and center options around those use cases. We already have "word", "sentence", and "grapheme" (and perhaps "line" at some point); we could add a new one called "token", for example, and say that "word" should be full words only, and "token" can produce smaller tokens.

gibson042 · 2022-03-03T15:39:06Z

That seems like the kind of thing that should be pushed for in Unicode so that ECMA-402 can adopt it as a downstream consumer.

srl295 · 2022-03-03T19:47:32Z

Are some of these things that some kind of independent recognizer should find, i.e., URLs, email addresses, hashtags, etc.? Maybe you want a preprocessing step that detects such items and replaces them before segmentation. Then you could add BTC hashes, Git hashes, or anything new that comes along.

Markdown and bbcode would also fit in this category, and wouldn't make sense for a general-purpose plain-text segmenter.
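One way to picture the independent-recognizer idea is the sketch below: a regular expression pulls out email addresses, URLs, and hashtags first, and only the text in between is handed to `Intl.Segmenter`. The pattern `SPECIAL_TOKEN`, the function `segmentWithRecognizer`, and the `kind` labels are all hypothetical, and the pattern is deliberately simplistic rather than standards-compliant.

```javascript
// Hypothetical recognizer: email addresses, then URLs, then hashtags.
// The pattern is a rough illustration, not an RFC-grade matcher.
const SPECIAL_TOKEN =
  /[\w.+-]+@[\w-]+(?:\.[\w-]+)+|(?:https?:\/\/|www\.)\S*[^\s.,;:!?)]|#\w+/g;

function segmentWithRecognizer(input, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const out = [];
  let cursor = 0;

  // Segment an ordinary stretch of text with the host segmenter.
  // Caveat: each stretch is segmented in isolation, so context across a
  // recognized token is not visible to the segmenter.
  const pushPlain = (chunk) => {
    for (const { segment, isWordLike } of segmenter.segment(chunk)) {
      out.push({ segment, kind: isWordLike ? "word" : "other" });
    }
  };

  for (const match of input.matchAll(SPECIAL_TOKEN)) {
    pushPlain(input.slice(cursor, match.index));
    out.push({ segment: match[0], kind: "special" }); // kept whole
    cursor = match.index + match[0].length;
  }
  pushPlain(input.slice(cursor));
  return out;
}

const demo = "or example@gmail.com. www.google.com and #tag";
for (const { segment, kind } of segmentWithRecognizer(demo)) {
  if (kind !== "other") console.log(kind, segment);
}
```

A spell checker could then skip every segment whose `kind` is "special", and new recognizers could be added by extending the pattern without touching the segmenter.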

gregtatum · 2022-03-04T14:35:38Z

From a web compatibility perspective, it would be bad if a product had a really good spell checker implementation when backed by V8 due to custom Intl.Segmenter changes, but then other engines had a bad spell checker experience due to differences in platform implementation.

zbraniecki · 2022-03-17T20:26:25Z

@aethanyc my understanding was that lwbrk (Gecko's current layout segmenter) does special things for web-compat segmentation (@ sign, :// etc.). I'm not sure if that is specific to CJK or applies across the board. In the original pitch, we talked about upstreaming those customizations to the ICU4X Segmenter either via CLDR or as a "web-mode" overlay. Can you comment on that?

sffc · 2022-03-17T21:09:48Z

TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-03-17.md#intlsegmenter-with-urls-email-addresses-and-acronyms

Conclusion:

FYT to investigate the V8 change
Come back to this group if necessary

gibson042 · 2022-06-16T18:58:23Z

We would also like to push for an improvement in https://www.unicode.org/reports/tr29/#Word_Boundary_Rules that provides an example above WB6 similar to the one above WB8, but don't know how to pursue that.

-Do not break letters across certain punctuation.
+Do not break letters across certain punctuation, such as within “e.g” or “example.com”.

sffc · 2022-06-17T04:36:43Z

Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656

We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.

romulocintra · 2022-06-17T10:44:29Z

Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656

We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.

I wrapped up some basic tests in https://github.com/tc39/test262/pull/3577/files

I found several potential issues stemming from the lack of specific rules for word segmentation, mostly with email addresses. Some results from running the tests in different browsers and in ICU also differed from what I expected as a user.

Example :

"my@mail.com sending message to my@mail.org"

[Log] my
[Log] @
[Log] mail.com
[Log]  
[Log] sending
[Log]  
[Log] message
[Log]  
[Log] to
[Log]  
[Log] my
[Log] @
[Log] mail.org

What are the next steps? IMHO, we should extend the recommendations on how to handle word boundaries so that they cover at least the popular use cases, such as email addresses.

sffc · 2022-06-18T00:45:44Z

@macchiati says to fill in the form at https://corp.unicode.org/reporting/error.html with suggestions.

He says that it may be useful to investigate the @ symbol. He says:

BTW, @ might be a fairly clear case. In the Olden Days, you'd only use @ in cases like "3 pieces @ $15 each", but nowadays by far the most prominent cases are in email or tags (@sffc). So we could consider proposing that something like the following shouldn't break in word segmentation.

Letter Digit* @
and @ Letter

(off the top of my head; would have to look at the details)
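As a rough illustration of what such a rule would change, the sketch below post-processes a list of segments and glues "@" to its neighbors in the two contexts named above. This is only one assumed reading of the off-the-cuff rule, approximating "Letter" with `\p{L}` and "Digit" with `\p{N}`. The function name `joinAcrossAt` is hypothetical, and a real change would live in the boundary rules rather than in a post-processing step.

```javascript
// Approximations of the two contexts (an assumed reading, not normative):
const ENDS_WITH_LETTER_DIGITS = /\p{L}\p{N}*$/u; // "Letter Digit*" before "@"
const STARTS_WITH_LETTER = /^\p{L}/u; // "Letter" after "@"

// Merge "@" with adjacent segments where the sketched rule forbids a break.
function joinAcrossAt(segments) {
  const out = [];
  for (let i = 0; i < segments.length; i++) {
    let current = segments[i];
    if (current === "@") {
      // "Letter Digit* @": attach "@" to the previous segment.
      if (out.length > 0 && ENDS_WITH_LETTER_DIGITS.test(out[out.length - 1])) {
        current = out.pop() + current;
      }
      // "@ Letter": attach the following segment and skip over it.
      if (i + 1 < segments.length && STARTS_WITH_LETTER.test(segments[i + 1])) {
        current += segments[++i];
      }
    }
    out.push(current);
  }
  return out;
}

// Segments as reported earlier in the thread for an email address:
console.log(joinAcrossAt(["my", "@", "mail.com", " ", "sending"]));
// The older usage keeps its breaks because "@" is surrounded by spaces:
console.log(joinAcrossAt(["3", " ", "pieces", " ", "@", " ", "$", "15"]));
```

With this reading, "my", "@", "mail.com" collapse into a single "my@mail.com" segment and "@", "sffc" into "@sffc", while "3 pieces @ $15" is left untouched.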

sffc · 2022-07-07T23:39:07Z

Additional discussion on 2022-07-07: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-07-07.md#intlsegmenter-with-urls-email-addresses-and-acronyms

macchiati · 2022-07-08T16:20:18Z

There's a lot to unpack in that conversation. I'm OOO, but one point to add is that UAX#29 and UAX#14 both specify precise default algorithms, but do and need to allow for a great deal of customization for different environments, different languages, etc. We should discuss what could be done to help with what tc39 is doing, perhaps with some explicit profiles or some other mechanism.


zbraniecki · 2022-07-09T01:46:26Z

Additional point: the icu4x segmenter architecture diverged from icu4c as a result of design decisions driven by Gecko layout segmentation needs, to allow for runtime switches and overlays that make context switching mid-segmentation cheap. This is something for which icu4c requires a build-time dictionary/rules rebuild, and icu4c can do on the fly.

ryzokuken · 2022-07-12T09:34:14Z

@zbraniecki did you miss icu4x in the last comment?

ryzokuken · 2022-07-13T12:31:24Z

@sffc after being so pushy during the monthly meeting, I was scrolling through tests today with @romulocintra and realized that we have tests for locale-specific behavior, for example the output of a NumberFormat#format operation for scientific notation in German.

This challenges my assumption that we should only test specified behavior in test262 and based on that precedent, I'll be quite alright with including the tests in test262. Apologies for being so assertive about this matter without finishing my research.

sffc · 2023-01-12T20:22:06Z

CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15767

Another: https://unicode-org.atlassian.net/browse/CLDR-15839

sffc added the "s: discuss" (Status: TG2 must discuss to move forward) and "c: text" (Component: case mapping, collation, properties) labels Mar 1, 2022

sffc added this to Priority Issues in ECMA-402 Meeting Topics Mar 1, 2022

romulocintra mentioned this issue Jun 17, 2022

Add tests for Intl.Segmenter option granularity - "word" tc39/test262#3577

Closed

sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Jul 7, 2022

sffc moved this from Previously Discussed to Priority Issues in ECMA-402 Meeting Topics Jul 7, 2022

romulocintra moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Oct 5, 2022

sffc moved this from Previously Discussed to Other Issues in ECMA-402 Meeting Topics Oct 6, 2022

sffc added the "s: blocked" (Status: the issue is blocked on upstream) label and removed the "s: discuss" (Status: TG2 must discuss to move forward) label Jan 12, 2023

sffc added this to the ES 2024 milestone Jan 12, 2023

sffc moved this from Other Issues to Previously Discussed in ECMA-402 Meeting Topics Jan 12, 2023
