Consider making ResolveLocale return the normalized requested locale instead of the available locale #830

Open
sffc opened this issue Aug 31, 2023 · 6 comments
Labels
c: locale Component: locale identifiers s: comment Status: more info is needed to move forward

Comments

@sffc
Contributor

sffc commented Aug 31, 2023

The concept of an "available locale" is not always well defined. For example, DateTimeFormat might have data in more locales for certain numbering systems and calendars than others, or more locales for the inner NumberFormat than for the datetime patterns. Also, engines like ICU4X do not directly expose the resolved locale information.

Currently the ECMA-402 specification requires that the resolved locale in resolvedOptions() be an entry from the availableLocales list, a list which is never directly exposed to the client. However, supportedLocalesOf() echoes back entries from the requestedLocales list.
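The difference between the two APIs can be observed directly. The following is a minimal sketch: `describeResolution` is a hypothetical helper name, and the values it prints depend on the locale data shipped with the engine.

```javascript
// Hypothetical helper: returns both views of "which locale did I get?".
function describeResolution(requested) {
  const options = { localeMatcher: "lookup" };
  // Echoes back canonicalized entries from the requested list.
  const supported = Intl.DateTimeFormat.supportedLocalesOf(requested, options);
  // Reports an entry from the engine's internal available-locales list
  // (or the default locale if nothing matched).
  const resolved = new Intl.DateTimeFormat(requested, options).resolvedOptions().locale;
  return { supported, resolved };
}

console.log(describeResolution(["de-Latn-AT"]));
console.log(describeResolution(["EN-us"]));
```

When the two values differ, `supported` still carries what the caller asked for, while `resolved` reflects how the engine's data happens to be organized.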

The use cases for resolvedOptions().locale are not completely clear, but a few might be

  1. Make other UI elements on the page appear with the same locale as the formatter
  2. Detect when root fallback occurred so that polyfill data can be used (this is tricky to do since it's not possible to distinguish a fallback to DefaultLocale that was intended versus one that was a last resort)

I think both of these cases can be served by making ResolveLocale return an entry from requestedLocales instead of an entry from availableLocales.

CC @hsivonen

ICU4X issue: unicode-org/icu4x#3906

@sffc sffc added s: discuss Status: TG2 must discuss to move forward c: locale Component: locale identifiers labels Aug 31, 2023
@eemeli
Member

eemeli commented Sep 1, 2023

> However, supportedLocalesOf() echoes back entries from the requestedLocales list.

To be specific, from the canonicalized list of requested locales.

Intl.NumberFormat.supportedLocalesOf('EN-us') // → ['en-US']

@anba
Contributor

anba commented Sep 5, 2023

icu::datetime::DateTimeFormatter expects a single DataLocale to construct a new formatter object, but Intl.DateTimeFormat allows a list of locales as its input and then selects the supported locale.

For example:

new Intl.DateTimeFormat(["nds", "ksh", "de"], {localeMatcher: "lookup", month: "long"}).format(0)

Can have the following outputs: (Assuming de is always supported.)

| Resolved locale | Output |
| --- | --- |
| nds | Januaar |
| ksh | Jannewa |
| de | Januar |

Note: CLDR has data for all three locales, but nds is marked as unconfirmed and therefore isn't included in ICU by default. And ksh is removed in V8's copy of ICU, but is available in JSC/SpiderMonkey, so browsers will either return Jannewa or Januar here.
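To see which of these rows a given engine produces, the resolved locale can be printed next to the formatted output. This is only a sketch; the `timeZone: "UTC"` option is an addition not present in the snippet above, included so that timestamp 0 is formatted as January regardless of the host time zone.

```javascript
// Same request as above, with the time zone pinned so that format(0)
// always falls on 1 January 1970.
const dtf = new Intl.DateTimeFormat(["nds", "ksh", "de"], {
  localeMatcher: "lookup",
  month: "long",
  timeZone: "UTC",
});

const resolvedLocale = dtf.resolvedOptions().locale;
const output = dtf.format(0);

// Which pair is printed depends on the locale data bundled with the engine.
console.log(resolvedLocale, output);
```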

For example:

new Intl.DateTimeFormat("co", {localeMatcher: "best fit", month: "long"}).format(0)

Can have the following outputs: (Assuming "it" is the default locale.)

| Resolved locale | Output |
| --- | --- |
| co | ghjennaghju |
| fr | janvier |
| it | gennaio |

Note: CLDR has data for co, but it's marked as unconfirmed and therefore isn't included in ICU by default. The best fit locale matcher can resolve co to fr when using CLDR's locale matching data. When best fit isn't supported, which is the case in all browsers, then co will fall back to the default locale, i.e. "it" in this example.

If there is more than one requested locale and the best fit locale matcher is supported, determining the resolved locale is more tricky.

For example:

new Intl.DateTimeFormat(["co", "es"], {localeMatcher: "best fit"}).resolvedOptions().locale

Assuming co isn't supported, but fr and es are both supported, the best fit matcher could return either locale. (I'd try to return es, because IMO that's a better fit than fr in this case, based on the language distance definitions from "languageInfo.xml".) But as mentioned above, best fit locale matcher isn't supported in browsers, so there's no immediate need to support this in ICU4X.

Using the root locale fallback for either nds or co in the above examples and then returning M01 isn't acceptable.


Browsers are using uloc_getAvailable (*) to compute the list of available locales and just use that list as-is.

This leads to returning non-optimal results in some cases, but nobody has ever complained about this:

The available locales list from uloc_getAvailable includes de and de-AT (when using the default ICU data configuration). de-Latn-AT isn't returned from uloc_getAvailable. And browsers also don't add the default script Latn to de-AT to manually add de-Latn-AT to the list of available locales. So when requesting de-Latn-AT with the lookup matcher, de will be returned from BestAvailableLocale and January will be formatted as Januar instead of Jänner.

new Intl.DateTimeFormat("de-Latn-AT", {localeMatcher: "lookup", month: "long"}).format(0)
| Input Locale | Resolved Locale | Output |
| --- | --- | --- |
| de | de | Januar |
| de-AT | de-AT | Jänner |
| de-Latn-AT | de | Januar |
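The truncation behind the last row can be sketched as follows. The function is modeled on the BestAvailableLocale abstract operation in ECMA-402; the `available` array is an illustrative stand-in for what uloc_getAvailable might report, not actual ICU output.

```javascript
// Sketch of BestAvailableLocale: repeatedly strip the last subtag until
// the candidate is found in the available list.
function bestAvailableLocale(availableLocales, locale) {
  let candidate = locale;
  while (true) {
    if (availableLocales.includes(candidate)) return candidate;
    let pos = candidate.lastIndexOf("-");
    if (pos === -1) return undefined;
    // If truncation would leave a dangling singleton subtag (e.g. "-u"),
    // drop that as well.
    if (pos >= 2 && candidate[pos - 2] === "-") pos -= 2;
    candidate = candidate.slice(0, pos);
  }
}

// Assumed available list: de and de-AT are present, de-Latn-AT is not.
const available = ["de", "de-AT"];

console.log(bestAvailableLocale(available, "de-AT"));      // → "de-AT"
console.log(bestAvailableLocale(available, "de-Latn-AT")); // → "de"
```

Because truncation only ever removes trailing subtags, "de-Latn-AT" passes through "de-Latn" and lands on "de" without ever trying "de-AT".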

(*) And ucol_getAvailable for Intl.Collator. There's also udat_getAvailable and unum_getAvailable, but both return the same results as uloc_getAvailable.


> The concept of an "available locale" is not always well defined.

Conceptually, the available locale indicates that useful/expected results can be generated. For example, when new Intl.DateTimeFormat("fr").resolvedOptions().locale returns "fr", the formatter should use French and not fall back to the root locale. When new Intl.Segmenter("th", {granularity: "word"}).resolvedOptions().locale returns "th", the underlying break iterator should have some sort of Thai dictionary support. Using the root fallback but still returning "th" as the resolved locale will only create user confusion. And new Intl.Collator("tlh").resolvedOptions().locale should always return the default locale, unless someone can validate that the root fallback is always appropriate for tlh.

I wonder if the main issue is just that there's confusion about what is meant by resolved locale? The resolved locale doesn't describe the locale where the underlying data is defined. For example, when requesting de-AT, most locale information is in de, except for some overrides, like "Jänner" instead of "Januar" for the month January. The resolved locale should still be de-AT, even when using some locale information which is actually defined in de. From the ICU4X perspective, the list of available locales is the list of locales given to icu_datagen when generating the locale data.

> 2. Detect when root fallback occurred so that polyfill data can be used (this is tricky to do since it's not possible to distinguish a fallback to DefaultLocale that was intended versus one that was a last resort)

Whether or not root fallback is acceptable depends on the Intl service constructor, used options, and the kind of fallback. For example a root fallback for Intl.DateTimeFormat to an <alias> entry, where the <alias> entry resolves to a non-root locale entry in the actual locale is acceptable. Using the root fallback for Intl.Collator("tlh-Piqd") isn't acceptable, because there's no indication that the root locale collation definitions are useful for tlh words in the Piqd script.

I'm not sure what you mean when mentioning "fallback to DefaultLocale". Root locale fallback is different from default locale fallback. When this is about the default locale fallback in LookupMatcher, it's detectable when the default locale fallback will be used by first calling into LookupSupportedLocales.
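A minimal sketch of that detection, assuming the lookup matcher is in use; `usesDefaultLocaleFallback` is a hypothetical helper name, not an existing API.

```javascript
// Returns true when none of the requested locales has a match in the
// service's available locales, i.e. when the lookup matcher would fall
// back to the default locale.
function usesDefaultLocaleFallback(Service, requestedLocales) {
  const supported = Service.supportedLocalesOf(requestedLocales, {
    localeMatcher: "lookup",
  });
  return supported.length === 0;
}

// The result for "tlh" depends on the engine's locale data.
console.log(usesDefaultLocaleFallback(Intl.Collator, ["tlh"]));
console.log(usesDefaultLocaleFallback(Intl.DateTimeFormat, ["en"]));
```

This only detects the default locale fallback. It says nothing about whether root locale data is used internally for a locale that is reported as supported.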

> I think both of these cases can be served by making ResolveLocale return an entry from requestedLocales instead of an entry from availableLocales.

When LookupMatcher doesn't find an available locale and steps 3-5 are executed:

  3. Let defLocale be ! DefaultLocale().
  4. Set result.[[locale]] to defLocale.
  5. Return result.

Which entry from requestedLocales do you propose to use here?


And because I saw this mentioned in the Matrix logs (here and here):

No browser actually implements the basic format matcher for Intl.DateTimeFormat. All browsers only support the best fit format matcher, which is implemented by returning whatever ICU4C computes. In some cases there's some additional fine-tuning, for example to adjust the hour-cycle or change numeric to 2-digit.

So it's basically the reverse situation when compared to localeMatcher: browsers only support the lookup locale matcher; the best fit locale matcher isn't supported.


V8 has some WIP code to support the best fit locale matcher. It's currently disabled by default but can be enabled through the --harmony-intl-best-fit-matcher command line flag. Also see https://bugs.chromium.org/p/v8/issues/detail?id=7051.

@sffc
Contributor Author

sffc commented Sep 7, 2023

> I'm not sure what you mean when mentioning "fallback to DefaultLocale". Root locale fallback is different from default locale fallback. When this is about the default locale fallback in LookupMatcher, it's detectable when the default locale fallback will be used by first calling into LookupSupportedLocales.

Good observation. This may also rise to the level of a use case for a developer caring about the difference between an availableLocale and a requestedLocale.

> From the ICU4X perspective, the list of available locales is the list of locales given to icu_datagen when generating the locale data.

An issue here is that the default icu_datagen fallback mode is to save space by stripping locales that fall back to a parent containing the same data. ICU4C does the same thing, but it retains memory of a certain locale being present in the form of an empty resource file. icu_datagen has a mode, "hybrid", that disables this stripping (at the cost of slightly larger binary size: no data is duplicated but the lookup structure is larger). Assuming we retain the current behavior in ECMA-402, it would be a requirement for implementers using ICU4X to build their data in hybrid mode.

> Browsers only support the lookup locale matcher, the best fit locale matcher isn't supported.

I find it interesting that all the browsers codify the availableLocales behavior. Do they do this because 402 makes the availableLocales list observable? One could pass in thousands of locales and eventually piece together the list.
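As a rough illustration of how that list is observable, one could probe a set of candidate tags and collect the distinct resolved locales. This is a sketch only: `probeAvailableLocales` is a hypothetical helper, the candidates must be well-formed language tags (malformed ones throw a RangeError), and the result only covers whatever candidates are supplied.

```javascript
// Collects the distinct resolved locales an Intl service reports for a
// list of candidate tags, skipping candidates that would only resolve to
// the default locale.
function probeAvailableLocales(Service, candidates) {
  const options = { localeMatcher: "lookup" };
  const found = new Set();
  for (const tag of candidates) {
    if (Service.supportedLocalesOf([tag], options).length === 0) continue;
    found.add(new Service([tag], options).resolvedOptions().locale);
  }
  return [...found].sort();
}

// The output depends on the engine's locale data.
console.log(probeAvailableLocales(Intl.NumberFormat, ["en", "de-AT", "de-Latn-AT", "tlh"]));
```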

ICU4X implements a fallback which is not the lookup matcher; the "de-Latn-AT" case is one that ICU4X specifically tries to solve. So, an engine wanting a lookup matcher would need to implement this on top of ICU4X, and the "best fit" matcher would be to simply use the ICU4X fallback machinery.

@anba
Contributor

anba commented Sep 8, 2023

> An issue here is that the default icu_datagen fallback mode is to save space by stripping locales that fall back to a parent containing the same data. ICU4C does the same thing, but it retains memory of a certain locale being present in the form of an empty resource file. icu_datagen has a mode, "hybrid", that disables this stripping (at the cost of slightly larger binary size: no data is duplicated but the lookup structure is larger). Assuming we retain the current behavior in ECMA-402, it would be a requirement for implementers using ICU4X to build their data in hybrid mode.

I think there's still some confusion about what the resolved locale should be.

// Uses the <dateFormatItem> data from "de.xml". Only the <month> data is used from "de_AT.xml"
console.log(new Intl.DateTimeFormat("de-AT", {localeMatcher: "lookup", month: "long"}).resolvedOptions().locale);

// Uses the <dateFormatItem> data from "de.xml". <day> is also used from "de.xml".
console.log(new Intl.DateTimeFormat("de-AT", {localeMatcher: "lookup", weekday: "long"}).resolvedOptions().locale);

This should print "de-AT" twice, even though the date-time patterns are stored in the de locale. It should not print "de" just because the formatting data is used from the parent locale de.


Is it possible to store the locales processed by icu_datagen into the ICU4X data file and then provide an ICU4X API to retrieve this list of locales? This should be sufficient for ECMA-402 implementations to compute the resolved locale.

@sffc
Contributor Author

sffc commented Sep 8, 2023

> Is it possible to store the locales processed by icu_datagen into the ICU4X data file and then provide an ICU4X API to retrieve this list of locales? This should be sufficient for ECMA-402 implementations to compute the resolved locale.

Yes, this is doable and should be easy to implement, if we think this is the best path forward.

@sffc
Contributor Author

sffc commented Nov 16, 2023

TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2023-11-16.md#consider-making-resolvelocale-return-the-normalized-requested-locale-instead-of-the-available-locale-830

Conclusion: keep the status quo. Too potentially disruptive and motivation not strong enough right now to make a change. Reconsider if we hear stronger motivation.

@sffc sffc added s: comment Status: more info is needed to move forward and removed s: discuss Status: TG2 must discuss to move forward labels Nov 16, 2023
Projects
Status: Previously Discussed
Development

No branches or pull requests

3 participants