Intl.Segmenter with URLs, email addresses, and acronyms #656

Open
sffc opened this issue Mar 1, 2022 · 22 comments
Labels
c: text Component: case mapping, collation, properties s: blocked Status: the issue is blocked on upstream

@sffc
Contributor

sffc commented Mar 1, 2022

Consider the following string of text:

Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.

What segments should we produce?

ICU currently produces

Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played

But V8 splits the acronym, URL, and email address into separate tokens.

Questions:

  1. Should this be implementation-dependent behavior, or should it go into the spec?
  2. Which behavior should be the default?
  3. Should this be an API option?

See https://bugs.chromium.org/p/chromium/issues/detail?id=1301830

@sffc sffc added s: discuss Status: TG2 must discuss to move forward c: text Component: case mapping, collation, properties labels Mar 1, 2022
@sffc sffc added this to Priority Issues in ECMA-402 Meeting Topics Mar 1, 2022
@zbraniecki
Member

zbraniecki commented Mar 1, 2022

FYI @makotokato @aethanyc @jfkthame

@srl295
Member

srl295 commented Mar 2, 2022

I think this should go upstream into CLDR, perhaps as API switchable behavior.

@gibson042
Contributor

ECMA-402 text segmentation boundary determination is intentionally implementation-dependent, as documented at https://tc39.es/ecma402/#annex-implementation-dependent-behaviour and https://tc39.es/ecma402/#sec-findboundary .

Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex 29 (available at https://www.unicode.org/reports/tr29/). It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at http://cldr.unicode.org/).

It may also be worth noting that the default word boundary rules of UAX 29 don't allow breaking apart "P.T.O" or "gmail.com" or "www.google.com", per WB6 and WB7—V8's behavior is precisely the sort of behavior intended to be protected by that clause (although I personally would recommend against this particular deviation).

I would also discourage getting stuck in the morass of possible API switches, unless it is something as general (and advanced) as explicit specification of boundary rules in a form similar to that of UAX 29.
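To make the WB6/WB7 point above concrete, here is a minimal sketch of a toy word segmenter that keeps letters together across "." and "'" and digits together across "." and ",". It is only an ASCII approximation under stated assumptions: the `Class` enum, `classify`, `is_boundary`, `segment`, and `words` are hypothetical names, the character classes are hand-picked rather than taken from the real Word_Break property data, and most UAX 29 rules (WB3 through WB4, WB13 through WB16, and others) are omitted.

```rust
/// Simplified character classes, loosely modeled on UAX #29 Word_Break values.
/// Assumption: hand-picked ASCII stand-ins, not the real Unicode property data.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Letter,    // stands in for ALetter
    Digit,     // stands in for Numeric
    MidNumLet, // '.' and '\'' (may join letters or digits)
    MidNum,    // ',' (may join digits only)
    Other,
}

fn classify(c: char) -> Class {
    match c {
        'a'..='z' | 'A'..='Z' => Class::Letter,
        '0'..='9' => Class::Digit,
        '.' | '\'' => Class::MidNumLet,
        ',' => Class::MidNum,
        _ => Class::Other,
    }
}

/// Returns true if a boundary is allowed between chars[i - 1] and chars[i].
/// Requires 1 <= i < chars.len().
fn is_boundary(chars: &[char], i: usize) -> bool {
    use Class::*;
    let class_at = |k: usize| chars.get(k).map(|&c| classify(c));
    let prev = classify(chars[i - 1]);
    let next = classify(chars[i]);
    let before_prev = if i >= 2 { class_at(i - 2) } else { None };
    let after_next = class_at(i + 1);
    match (prev, next) {
        // WB5, WB8, WB9, WB10: letters and digits stick together.
        (Letter | Digit, Letter | Digit) => false,
        // WB6: Letter x MidNumLet Letter
        (Letter, MidNumLet) => after_next != Some(Letter),
        // WB7: Letter MidNumLet x Letter
        (MidNumLet, Letter) => before_prev != Some(Letter),
        // WB12: Digit x (MidNum | MidNumLet) Digit
        (Digit, MidNum | MidNumLet) => after_next != Some(Digit),
        // WB11: Digit (MidNum | MidNumLet) x Digit
        (MidNum | MidNumLet, Digit) => before_prev != Some(Digit),
        // WB999: otherwise, break everywhere.
        _ => true,
    }
}

/// Splits `text` into segments at every allowed boundary.
fn segment(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut segments = Vec::new();
    let mut start = 0;
    for i in 1..=chars.len() {
        if i == chars.len() || is_boundary(&chars, i) {
            segments.push(chars[start..i].iter().collect());
            start = i;
        }
    }
    segments
}

/// Keeps only segments that start with an ASCII letter or digit.
fn words(text: &str) -> Vec<String> {
    segment(text)
        .into_iter()
        .filter(|s| s.chars().next().map_or(false, |c| c.is_ascii_alphanumeric()))
        .collect()
}
```

Under these assumptions, `words("Please P.T.O 123 times.")` yields `Please`, `P.T.O`, `123`, `times`, and `example@gmail.com` still splits at the "@" because "@" falls into `Other`.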

@sffc
Contributor Author

sffc commented Mar 2, 2022

The use case comes from a Google team that is implementing a spell checker via word segmentation.

We want to detect URLs as a whole so that we can refrain from spellchecking them and giving erratic spelling suggestions to users.

It seems like we could form some use cases for word segmentation:

  1. When you want real words (e.g., spell checker)
  2. When you want smaller tokens (e.g., cursor positioning)

I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.

@aethanyc

aethanyc commented Mar 2, 2022

For the record, ICU4X word segmenter produces the same output as ICU.

Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played
The code that generates the result follows. (Segments that start with punctuation or a space are removed.)
let s = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
let provider = RuleBreakDataProvider;
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str(s).collect();
for w in breakpoints.windows(2) {
    let begin = w[0];
    let end = w[1];
    if !s[begin..end].starts_with(&[' ', ':', '@', '.', '/', '-']) {
        println!("{}", &s[begin..end]);
    }
}

@gibson042
Contributor

I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.

The problem is that there are at least a dozen similar edge cases—domain names, email addresses, IP addresses, URIs/IRIs, hashtags, @-references, Markdown/bbcode/etc. formatting, Latin abbreviations, hyphenated compounds, dates/times/datetimes, emoticons/kaomoji, etc.

"Long" vs. "short" seems too coarse, and a collection of fine-tuning options seems even worse.

@sffc
Contributor Author

sffc commented Mar 3, 2022

I agree that we don't want an explosion of options for each edge case. I think we could look at use cases, though, and center options around those use cases. We already have "word", "sentence", and "grapheme" (and perhaps "line" at some point); we could add a new one called "token", for example, and say that "word" should be full words only, and "token" can produce smaller tokens.

@gibson042
Contributor

That seems like the kind of thing that should be pushed for in Unicode so that ECMA-402 can adopt it as a downstream consumer.

@srl295
Member

srl295 commented Mar 3, 2022

Are some of these things that some kind of independent recognizer should find? I.e., URLs, email addresses, hashtags, etc. Maybe you want a preprocessing operation that detects such spans and replaces them for processing. Then you could add BTC hashes, git hashes, or anything new that comes along.

Markdown and bbcode would fit in this category also, and wouldn't make sense for a general-purpose plain-text segmenter.
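A rough sketch of what such a preprocessing recognizer could look like follows. Everything here is an assumption made for illustration: `Kind`, `trim_token`, `recognize`, and `protected_spans` are hypothetical names, tokens are split on single ASCII spaces only, and the heuristics are deliberately naive (a real recognizer would need proper URL and email grammars).

```rust
/// Kinds of "protected" spans a caller (e.g. a spell checker) might skip.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Email,
    Url,
    Hashtag,
}

/// Strips trailing sentence punctuation, which is not part of the entity.
fn trim_token(token: &str) -> &str {
    token.trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | '!' | '?'))
}

/// Naive classification of one already-trimmed token.
fn recognize(token: &str) -> Option<Kind> {
    if let Some((local, domain)) = token.split_once('@') {
        // Something on both sides of a single '@', with a dot in the domain.
        if !local.is_empty() && domain.contains('.') && !domain.contains('@') {
            return Some(Kind::Email);
        }
    }
    if token.contains("://") || token.starts_with("www.") {
        return Some(Kind::Url);
    }
    if token.len() > 1 && token.starts_with('#') {
        return Some(Kind::Hashtag);
    }
    None
}

/// Returns (byte_start, byte_end, kind) for every recognized token in `text`.
fn protected_spans(text: &str) -> Vec<(usize, usize, Kind)> {
    let mut spans = Vec::new();
    let mut offset = 0;
    for token in text.split(' ') {
        let trimmed = trim_token(token);
        if let Some(kind) = recognize(trimmed) {
            spans.push((offset, offset + trimmed.len(), kind));
        }
        offset += token.len() + 1; // +1 for the single-space separator
    }
    spans
}
```

The returned byte ranges could then be masked or skipped before the text is handed to a general-purpose word segmenter, and new kinds could be added to `recognize` without touching the segmenter itself.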

@gregtatum
Member

From a web compatibility perspective, it would be bad if a product had a really good spell checker implementation when backed by V8 due to custom Intl.Segmenter changes, but then other engines had a bad spell checker experience due to differences in platform implementation.

@zbraniecki
Member

@aethanyc my understanding was that lwbrk (Gecko's current layout segmenter) does special things for web-compat segmentation (@ sign, :// etc.) - I'm not sure if it's specific to CJK or applies across the board. In the original pitch, we talked about upstreaming those customizations to ICU4X Segmenter either as CLDR or as a "web-mode" overlay. Can you comment on that?

@sffc
Contributor Author

sffc commented Mar 17, 2022

TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-03-17.md#intlsegmenter-with-urls-email-addresses-and-acronyms

Conclusion:

  • FYT to investigate the V8 change
  • Come back to this group if necessary

@gibson042
Contributor

We would also like to push for an improvement in https://www.unicode.org/reports/tr29/#Word_Boundary_Rules that provides an example above WB6 similar to the one above WB8, but don't know how to pursue that.

-Do not break letters across certain punctuation.
+Do not break letters across certain punctuation, such as within “e.g” or “example.com”.

@sffc
Contributor Author

sffc commented Jun 17, 2022

Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656

We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.

@romulocintra
Member

romulocintra commented Jun 17, 2022

Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656

We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.

I wrapped up some basic tests in https://github.com/tc39/test262/pull/3577/files

I found several potential issues caused by the lack of specific rules for word segmentation, mostly with email addresses. Some results from running the tests in different browsers and in ICU were also different from what I expected as a user.

Example :

"my@mail.com sending message to my@mail.org"

[Log] my
[Log] @
[Log] mail.com
[Log]  
[Log] sending
[Log]  
[Log] message
[Log]  
[Log] to
[Log]  
[Log] my
[Log] @
[Log] mail.org

What are the next steps? IMHO, we should extend the recommendations on how to handle word boundaries so that they cover at least the popular use cases, like email addresses.

@sffc
Contributor Author

sffc commented Jun 18, 2022

@macchiati says to fill in the form at https://corp.unicode.org/reporting/error.html with suggestions.

He says that it may be useful to investigate the @ symbol. He says:

BTW, @ might be a fairly clear case. In the Olden Days, you'd only use @ in cases like "3 pieces @ $15 each", but nowadays by far the most prominent cases are in email or tags (@sffc). So we could consider proposing that something like the following shouldn't break in word segmentation:

Letter Digit* @
and @ Letter

(off the top of my head; would have to look at the details)
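As a sketch of how the tentative rule above would change results, the following post-processing step merges already-segmented pieces around "@". It is an illustration only: `merge_at_sign` is a hypothetical helper, it operates on segment strings rather than on Word_Break properties, and it reads the rule as "no break before @ when preceded by a letter plus optional digits, and no break after @ when followed by a letter".

```rust
/// Merges segments around '@' following the tentative rule quoted above:
/// no break in "Letter Digit* @" and no break in "@ Letter".
fn merge_at_sign(segments: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for &seg in segments {
        let join = match out.last() {
            Some(prev) => {
                // "Letter Digit*": drop trailing digits, then require a letter.
                let prev_is_letter_digits = prev
                    .trim_end_matches(|c: char| c.is_numeric())
                    .chars()
                    .last()
                    .map_or(false, |c| c.is_alphabetic());
                let prev_ends_at = prev.ends_with('@');
                let seg_starts_letter =
                    seg.chars().next().map_or(false, |c| c.is_alphabetic());
                (seg == "@" && prev_is_letter_digits) || (prev_ends_at && seg_starts_letter)
            }
            None => false,
        };
        if join {
            // Safe: `join` is only true when `out` is non-empty.
            out.last_mut().unwrap().push_str(seg);
        } else {
            out.push(seg.to_string());
        }
    }
    out
}
```

Applied to the logged segments `my`, `@`, `mail.com`, this produces the single segment `my@mail.com`, while the spaced-out "3 pieces @ $15" style is left untouched because the "@" there has spaces on both sides.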

@sffc sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Jul 7, 2022
@sffc sffc moved this from Previously Discussed to Priority Issues in ECMA-402 Meeting Topics Jul 7, 2022
@sffc
Contributor Author

sffc commented Jul 7, 2022

@macchiati

macchiati commented Jul 8, 2022 via email

@zbraniecki
Member

Additional point - the icu4x segmenter architecture diverged from icu4c as a result of design decisions driven by Gecko layout segmentation needs, to allow for runtime switches and overlays that enable cheap context switching mid-segmentation. This is something for which icu4c requires a build-time dictionary/rules rebuild, and icu4c can do on the fly.

@ryzokuken
Member

@zbraniecki did you miss icu4x in the last comment?

@ryzokuken
Member

@sffc after being so pushy during the monthly meeting, I was scrolling through tests today with @romulocintra and realized that we have tests for locale-specific behavior, for example the output of a NumberFormat#format operation for scientific notation in German.

This challenges my assumption that we should only test specified behavior in test262 and based on that precedent, I'll be quite alright with including the tests in test262. Apologies for being so assertive about this matter without finishing my research.

@romulocintra romulocintra moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Oct 5, 2022
@sffc sffc moved this from Previously Discussed to Other Issues in ECMA-402 Meeting Topics Oct 6, 2022
@sffc
Contributor Author

sffc commented Jan 12, 2023

@sffc sffc added s: blocked Status: the issue is blocked on upstream and removed s: discuss Status: TG2 must discuss to move forward labels Jan 12, 2023
@sffc sffc added this to the ES 2024 milestone Jan 12, 2023
@sffc sffc moved this from Other Issues to Previously Discussed in ECMA-402 Meeting Topics Jan 12, 2023