Intl.Segmenter with URLs, email addresses, and acronyms #656
Comments
I think this should go upstream into CLDR, perhaps as API switchable behavior. |
ECMA-402 text segmentation boundary determination is intentionally implementation-dependent, as documented at https://tc39.es/ecma402/#annex-implementation-dependent-behaviour and https://tc39.es/ecma402/#sec-findboundary .
It may also be worth noting that the default word boundary rules of UAX 29 don't allow breaking apart "P.T.O" or "gmail.com" or "www.google.com", per WB6 and WB7—V8's behavior is precisely the sort of behavior intended to be protected by that clause (although I personally would recommend against this particular deviation). I would also discourage getting stuck in the morass of possible API switches, unless it is something as general (and advanced) as explicit specification of boundary rules in a form similar to that of UAX 29. |
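For anyone who wants to compare engines, here is a minimal sketch for listing the word-like segments a given implementation produces. The helper name `wordLikeSegments` and the sample string are assumptions made for illustration. Because boundary determination is implementation-dependent, no expected output is given; under the default UAX 29 rules "P.T.O" and "www.google.com" would each stay in one piece.

```javascript
// Hypothetical helper (not part of any spec): collect the word-like segments
// that the current engine's Intl.Segmenter reports for a piece of text.
function wordLikeSegments(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation-only segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Sample input chosen for illustration; the result depends on the engine.
const sample = "Please P.T.O or mail example@gmail.com. See www.google.com";
console.log(wordLikeSegments(sample));
```

Running this in different browsers or Node versions is a quick way to see whether an engine splits the acronym, the email address, or the URL.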
The use case comes from a Google team that is implementing a spell checker via word segmentation.
It seems like we could form some use cases for word segmentation:
I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting. |
For the record, ICU4X word segmenter produces the same output as ICU.
The code that generates the result (segments that start with punctuation are removed):
let s = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
let provider = RuleBreakDataProvider;
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
// Collect every break position, then walk adjacent pairs as [begin, end) ranges.
let breakpoints: Vec<usize> = segmenter.segment_str(s).collect();
for w in breakpoints.windows(2) {
    let begin = w[0];
    let end = w[1];
    // Skip segments that start with a space or punctuation mark.
    if !s[begin..end].starts_with(&[' ', ':', '@', '.', '/', '-']) {
        println!("{}", &s[begin..end]);
    }
} |
The problem is that there are at least a dozen similar edge cases: domain names, email addresses, IP addresses, and so on. "Long" vs. "short" seems too coarse, and a collection of fine-tuning options seems even worse. |
I agree that we don't want an explosion of options for each edge case. I think we could look at use cases, though, and center options around those use cases. We already have "word", "sentence", and "grapheme" (and perhaps "line" at some point); we could add a new one called "token", for example, and say that "word" should be full words only, and "token" can produce smaller tokens. |
That seems like the kind of thing that should be pushed for in Unicode so that ECMA-402 can adopt it as a downstream consumer. |
Are some of these things that some kind of independent recognizer should find, i.e. URLs, email addresses, hashtags, etc.? Maybe you want a preprocessing step that detects them and replaces them before segmentation. Then you could add BTC hashes, git hashes, or anything new that comes along. Markdown and BBCode would fit in this category also, and wouldn't make sense for a general-purpose plain-text segmenter |
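The preprocessing idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `PROTECTED` pattern and the `segmentWithRecognizer` helper are hypothetical names, and the regular expression is a deliberately loose stand-in for real URL and email grammars. Recognized spans are emitted whole, and everything between them is handed to `Intl.Segmenter`.

```javascript
// Illustrative pattern only (an assumption): a loose email matcher, or a
// URL-like run starting with http(s):// or www. and not ending in punctuation.
const PROTECTED =
  /[\w.+-]+@[\w-]+(?:\.[\w-]+)+|(?:https?:\/\/|www\.)[^\s]+[^\s.,;:!?]/g;

// Hypothetical helper: keep recognized spans intact, segment the rest.
function segmentWithRecognizer(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const out = [];
  let last = 0;

  // Segment an unprotected chunk with the engine's own word rules.
  const pushPlain = (chunk) => {
    for (const { segment } of segmenter.segment(chunk)) out.push(segment);
  };

  for (const match of text.matchAll(PROTECTED)) {
    pushPlain(text.slice(last, match.index)); // text before the protected span
    out.push(match[0]);                       // the span itself, unsplit
    last = match.index + match[0].length;
  }
  pushPlain(text.slice(last)); // trailing text after the last protected span
  return out;
}

const tokens = segmentWithRecognizer(
  "or example@gmail.com. www.google.com and she's there"
);
console.log(tokens.includes("example@gmail.com")); // true
console.log(tokens.includes("www.google.com"));    // true
```

New recognizers (hashes, hashtags, markup) would only require extending the pattern or adding another pass, without touching the segmenter itself.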
From a web compatibility perspective, it would be bad if a product had a really good spell checker implementation when backed by V8 due to custom |
@aethanyc my understanding was that lwbrk (current Gecko's layout segmenter) does special things for web-compat segmentation ( |
TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-03-17.md#intlsegmenter-with-urls-email-addresses-and-acronyms Conclusion:
|
We would also like to push for an improvement in https://www.unicode.org/reports/tr29/#Word_Boundary_Rules that provides an example above WB6 similar to the one above WB8, but don't know how to pursue that.
-Do not break letters across certain punctuation.
+Do not break letters across certain punctuation, such as within “e.g” or “example.com”. |
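To make the effect of WB6 and WB7 concrete, here is a heavily simplified sketch, not the UAX 29 algorithm itself. It assumes ASCII letters stand in for AHLetter and that only "." and "'" stand in for the mid-letter punctuation classes; digits and every other rule are ignored. The names `isLetter`, `isMidLetterLike`, and `splitWordsSimplified` are hypothetical.

```javascript
// Simplified stand-ins for UAX 29 property classes (ASCII only; an assumption).
const isLetter = (ch) => /^[A-Za-z]$/.test(ch);
const isMidLetterLike = (ch) => ch === "." || ch === "'";

// Hypothetical helper: return letter runs, joined across "." or "'" only when
// that character has a letter on both sides (the WB6/WB7 condition).
function splitWordsSimplified(text) {
  const words = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const flankedByLetters =
      i > 0 &&
      i + 1 < text.length &&
      isLetter(text[i - 1]) &&
      isLetter(text[i + 1]);
    if (isLetter(ch) || (isMidLetterLike(ch) && flankedByLetters)) {
      current += ch; // no break: letter run, or punctuation between letters
    } else {
      if (current) words.push(current); // any other character ends the word
      current = "";
    }
  }
  if (current) words.push(current);
  return words;
}

console.log(splitWordsSimplified("e.g example.com")); // [ 'e.g', 'example.com' ]
```

Under these assumptions "example@gmail.com." yields "example" and "gmail.com": the "@" is not mid-letter punctuation, and the final "." has no letter after it.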
Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656 We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262. |
I wrapped up some basic tests in https://github.com/tc39/test262/pull/3577/files and found several potential issues caused by the lack of specific rules for word segmentation, mostly with email addresses. Some results from running the tests in different browsers and ICU also differed from what I expected as a user. Example:
What are the next steps? IMHO we should extend the recommendations on how to handle segmentation and word boundaries so that they cover at least popular use cases such as email addresses. |
@macchiati says to fill in the form at https://corp.unicode.org/reporting/error.html with suggestions. He says that it may be useful to investigate the
|
Additional discussion on 2022-07-07: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-07-07.md#intlsegmenter-with-urls-email-addresses-and-acronyms |
There's a lot to unpack in that conversation. I'm OOO, but one point to add
is that UAX#29 and UAX#14 both specify precise default algorithms, but do
and need to allow for a great deal of customization for different
environments, different languages, etc.
We should discuss what could be done to help with what TC39 is doing, perhaps
with some explicit profiles or some other mechanism.
|
Additional point: the icu4x segmenter architecture diverged from icu4c as a result of design decisions driven by Gecko layout segmentation needs, to allow for runtime switches and overlays that make context switching mid-segmentation cheap. This is something for which icu4c requires a build-time dictionary/rules rebuild, and icu4c can do on the fly. |
@zbraniecki did you miss icu4x in the last comment? |
@sffc after being so pushy during the monthly meeting, I was scrolling through tests today with @romulocintra and realized that we have tests for locale-specific behavior, for example the output of a This challenges my assumption that we should only test specified behavior in test262 and based on that precedent, I'll be quite alright with including the tests in test262. Apologies for being so assertive about this matter without finishing my research. |
Consider the following string of text:
What segments should we produce?
ICU currently produces
But V8 splits the acronym, URL, and email address into separate tokens.
Questions:
See https://bugs.chromium.org/p/chromium/issues/detail?id=1301830