[Proposal] Make an API for locale negotiation #513

Open
longlho opened this issue Nov 12, 2020 · 28 comments

longlho (Collaborator) commented Nov 12, 2020

Resolving from a list of requested locales (via the Accept-Language header, for example) to a list of supported locales for a given website. See vercel/next.js#18676 for an example.

Existing locale matching libraries are fairly rudimentary (exact match only). We already have the LookupMatcher/BestAvailableLocale/ResolveLocale abstract operations, which take extensions into account as well, so IMO we should expose a top-level API for this.
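For a sense of the shape being requested, here is a sketch; Intl.negotiateLocales is a made-up name, and nothing like it exists in ECMA-402 today:

```js
// Hypothetical API shape, for illustration only: negotiate the locales
// parsed from Accept-Language against the site's supported locales.
const requested = ["fr-CH", "fr", "en"]; // from the Accept-Language header
const supported = ["en", "fr", "de"];    // locales the site has bundles for

const matched = Intl.negotiateLocales(requested, supported, {
  localeMatcher: "lookup", // mirroring the option on existing formatters
});
// matched → ["fr"] under an RFC 4647 lookup-style match
```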

zbraniecki (Member):

The proper API for that is very non-trivial. The listed APIs are deeply imperfect, and the exact strategy, especially across multiple locales, may depend heavily on business needs.

That makes me reluctant to "bless" one approach with a standardized API that we'll have to maintain. We already have localeMatcher, which is a cautionary tale about how we once thought this would work best.

I'm not opposed to exposing improved APIs to help with matching algorithms (parent locales, addLikelySubtags, etc.), but I'd like to be cautious about providing a single solution, as I don't think we've found a golden one yet.

longlho (Collaborator, Author) commented Nov 12, 2020

Would exposing a top-level API for just the lookup localeMatcher be possible? IMO that would be a start.

sffc added the labels "c: locale" (Component: locale identifiers), "s: comment" (Status: more info is needed to move forward), and "User Preferences" (Related to user preferences) on Nov 12, 2020
sffc (Contributor) commented Nov 12, 2020

Adding LanguageMatcher was first brought up in #46, so this isn't a new issue; it just needs someone to champion it and shepherd it through the process. As Zibi said, though, we need to be careful in this space, since there are multiple competing implementations of LanguageMatcher with different tradeoffs.

longlho (Collaborator, Author) commented Nov 16, 2020

So is the next step to put together a formal proposal repo?

sffc (Contributor) commented Nov 16, 2020

> So is the next step to put together a formal proposal repo?

Yes; creating a proposal repo with a README explaining the problem space will enable us to continue exploring it. The instructions are here:

https://github.com/tc39/ecma262/blob/master/CONTRIBUTING.md#new-feature-proposals

longlho (Collaborator, Author) commented Nov 17, 2020

Does this work? https://github.com/longlho/proposal-intl-bestavailablelocale

zbraniecki (Member):

> Does this work? https://github.com/longlho/proposal-intl-bestavailablelocale

Your proposal seems to run counter to my recommendation. It introduces an API that looks like it is the "canonical" language negotiation. Since I don't believe the industry has designed such a gold standard yet, trying to expose one via ECMA-402 will lead either to us having to support an inferior algorithm forever, all while likely having to introduce another one some day (Intl.EvenBetterBestAvailableLocale?), or to changing the algorithm internally, which may come to depend on other options or even additional data and make this API hard to advance.

Your proposal also negotiates down to a single locale, which I think is rarely the right call. I explained my rationale here: https://firefox-source-docs.mozilla.org/intl/locale.html#language-negotiation

To sum up, I would much rather see us expose lower-level components such as parent locales, maximize/minimize, etc. You have the right to push for standardization of some negotiation that you consider good enough to be added to ECMA-402 and supported forever, but I'd like my votum separatum (dissenting opinion) to be explicit:
I do not believe that at the moment we should have an API with a name such as Intl.bestAvailableLocale or anything similar.

longlho (Collaborator, Author) commented Dec 1, 2020

So this is already happening with all the Intl APIs that take in a set of locales, so I'm not sure I understand the rationale against exposing locale matching as a top-level API. The proposal also intentionally exposes localeMatcher as an option, with best fit remaining implementation-dependent, which prevents locking in a specific algorithm.

Would you recommend a different name? I do believe that at the moment we need some top-level locale negotiation for i18n sites to serve the right translation bundle, consistent with other formatters like date/time/number. We do have the use cases for it, so can you clarify the further requirements?

zbraniecki (Member) commented Dec 1, 2020

> So this is already happening with all the Intl APIs that take in a set of locales, so I'm not sure I understand the rationale against exposing locale matching as a top-level API.

BestAvailableLocale is an abstract operation that is internal to our API. It handles CLDR data specifically for a "final" matcher and doesn't allow any nesting. It also works only because we generally expose the same set of locales for all formatters in ECMA-402. I believe we made a mistake when we standardized it the way we did, and we can do nothing about it now.

It's going to be a very tricky experience to try to merge this API shape with MessageFormat 2.0, which cannot handle language negotiation the same way: it has to handle the case where we have resources in one locale but require a formatter, like PluralRules, which may have a different set of locales.
Exposing this algorithm to users as the canonical approach means we recommend that they use it for their needs, and I don't believe this algorithm is good for general-purpose needs.
In user land, the list of available resources will differ, and the needs will often nest (a list format that also formats its content cells, where different cells have different available locales).

> Would you recommend a different name?

No, I just don't think we have a single API that we should expose.

> I do believe that at the moment we need some top-level locale negotiation for i18n sites to serve the right translation bundle, consistent with other formatters like date/time/number.

What makes it impossible to implement in user land? We have supportedLocalesOf, which allows you to perform any nesting you want by merging and filtering against the locales available for a given formatter, and we have maximize/minimize. What is missing?
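To illustrate what user land can already do with those primitives, a minimal sketch (deliberately simplistic: it compares maximized language and script only, ignoring regions, extensions, and weighting):

```js
// A minimal user-land matcher built only from existing primitives
// (Intl.Locale#maximize); illustrative, not RFC 4647 or UTS 35.
function matchLocale(requested, supported, defaultLocale = "en") {
  // Compare on maximized language + script, ignoring region for brevity.
  const key = (tag) => {
    const max = new Intl.Locale(tag).maximize();
    return `${max.language}-${max.script}`;
  };
  const supportedKeys = supported.map((tag) => [tag, key(tag)]);
  for (const req of requested) {
    const want = key(req);
    const hit = supportedKeys.find(([, k]) => k === want);
    if (hit) return hit[0];
  }
  return defaultLocale;
}

matchLocale(["de-CH", "fr"], ["de", "fr", "es"]); // → "de" via likely subtags
```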

> We do have the use cases for it, so can you clarify the further requirements?

I believe your questions to me are misguided by the assumption that we should add APIs for everything we can find a "use case" for.

I don't think "has a use case" is a good justification. We have a use case for localization as well, but we don't have a good format to offer, so we shouldn't standardize something just because we can.

I don't have requirements; I have an opinion. There is a large spectrum of possible algorithms for language negotiation, and I believe that, for now, ECMA-402 should refrain from "blessing" any of them. Instead, we should make sure we lower the barrier for user-land algorithms to be written.
If there are missing low-level APIs that are data-heavy and as a result unfit for user land, then I believe we should consider them.

My recommendation would be to develop what you're aiming for as a user-land library and see if it gains universal adoption (here's an example of one we use for Fluent).

What prevents you from doing that?

zbraniecki (Member):

As a possible (long term) path forward:

longlho (Collaborator, Author) commented Dec 1, 2020

I guess I'm basing the requirements on #512, which this seems to satisfy:

  1. There is prior art for locale negotiation.
  2. It is difficult to implement in user land because of the CLDR data required (e.g. aliases, language matching data, parent locales), and the LanguageMatching algorithm itself is certainly tricky.
  3. It does have broad appeal, as demonstrated in the next.js GitHub issue (57K stars). The implementation we're going with is within @formatjs/ecma402-abstract (320K weekly downloads, though that number doesn't matter as far as I'm concerned).

I'm not asking for the blessing of one specific algorithm, but rather a top-level API that allows you to pick one. We already have precedent with the lookup algorithm, which is specified in https://tools.ietf.org/html/rfc4647#section-3.4
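For reference, the RFC 4647 §3.4 Lookup algorithm is small enough to sketch (illustrative only; real implementations also handle wildcards and private-use subtags):

```js
// Sketch of RFC 4647 §3.4 "Lookup": progressively truncate each requested
// range until it matches a supported tag.
function lookup(requested, supported, defaultLocale = "en") {
  const avail = new Set(supported.map((t) => t.toLowerCase()));
  for (const range of requested) {
    let candidate = range;
    while (candidate) {
      if (avail.has(candidate.toLowerCase())) return candidate;
      let cut = candidate.lastIndexOf("-");
      if (cut === -1) break;
      // Per the RFC, also drop a trailing single-character subtag.
      if (cut >= 2 && candidate[cut - 2] === "-") cut -= 2;
      candidate = candidate.slice(0, cut);
    }
  }
  return defaultLocale;
}

lookup(["de-CH", "fr-CH"], ["fr", "es"]); // → "fr"
```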

Going by the list of requirements, I'm not sure what actionable things are missing so that I can go figure them out.

longlho (Collaborator, Author) commented Dec 1, 2020

@zbraniecki @sffc OK, sounds like #512 needs to be updated with something that captures this conversation, since IMO this checks all the requirements but still doesn't seem to qualify.

zbraniecki (Member):

I think we see the evaluation against that checklist differently, then. Here's mine:

  1. There are multiple "prior arts" in the industry, and their authors are unhappy with the current ones. That's a strong red flag against further standardizing one and exposing more API surface.
    You can read Mark's slides from IUC on that topic: https://docs.google.com/presentation/u/1/d/1kdV9ZtVulhg33Z20n9OGJLUzWnbGKYjqYEfN2ZzBNNo/htmlpresent

  2. It's not difficult to implement in user land, given that the CLDR data is exposed as lower-level API bits like parent locales, maximize, etc.
    Your point about the algorithm being tricky is very true. To the point where, as I said, we don't have a good one yet.

  3. I buy the idea that it's a generally useful API to have, but I would like you to show me a single npm module that implements it in user land, is massively popular, and is somehow limited by something that standardization can solve (like a heavy data payload that the browser could provide).

I have not performed such an analysis, but my guess is that we have a plethora of different locale negotiation strategies used by libraries, and no single API/algorithm that dominates, either organically or because of its quality.

> I'm not asking for the blessing of one specific algorithm, but rather a top-level API that allows you to pick one.

The issue is that the way we standardized BestAvailableLocale does provide one, one that is not great and is narrow in use case, plus an alternative in the form of an "implementation dependent" one. I'm not aware of any user of that approach.

> Going by the list of requirements, I'm not sure what actionable things are missing so that I can go figure them out.

I make two claims:

  1. The algorithm does not match criterion (2): it can be implemented in user land.
  2. We don't have a good candidate for standardization yet.

You can address (1) by documenting what makes user-land language matcher(s) impossible or infeasible right now.
You can address (2) by showing that everyone who can is using, and is happy with, a single algorithm that is a good candidate for standardization.

zbraniecki (Member):

I'd really like you to address your claim that requirement (2) is checked. I don't think you've shown yet that this algorithm is somehow hard to implement in user land and requires extending the ECMA-402 API surface.

longlho (Collaborator, Author) commented Dec 1, 2020

So the question I have is: why can't we standardize the API without standardizing the algorithm (effectively making the algorithm implementation-dependent)?

zbraniecki (Member):

> So the question I have is: why can't we standardize the API without standardizing the algorithm (effectively making the algorithm implementation-dependent)?

What's the value of such an exercise?

longlho (Collaborator, Author) commented Dec 1, 2020

I believe the value is to consolidate usage onto a standardized API. It has the same value prop as supportedLocalesOf, right?

zbraniecki (Member):

> I believe the value is to consolidate usage onto a standardized API.

I don't think I understand. If I'm correct that you can create your own npm library for locale negotiation with your own algorithm, and we don't want to standardize the algorithm, then what is the benefit of an ECMA-402 API that doesn't standardize anything except extending the API surface?

> It has the same value prop as supportedLocalesOf, right?

supportedLocalesOf exposes data information about the ECMA-402 APIs. It tells you something you cannot retrieve from anywhere else: no user-land library can be written that would know which locales are supported by Intl.DateTimeFormat in a given environment.

supportedLocalesOf may be used by various language negotiation algorithms (and is necessary for some nesting algorithms, like the ones used by localization systems that want to rely on ECMA-402 APIs).
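For example, using only the real, existing API:

```js
// supportedLocalesOf answers a question no user-land library can answer
// on its own: which of these locales does this engine's DateTimeFormat
// actually have data for?
Intl.DateTimeFormat.supportedLocalesOf(["de-CH", "fr", "tlh"], {
  localeMatcher: "lookup",
});
// → e.g. ["de-CH", "fr"]; "tlh" is dropped if the engine has no data for it
```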

longlho (Collaborator, Author) commented Dec 1, 2020

Can't you achieve the same with resolvedOptions, though?

zbraniecki (Member):

> Can't you achieve the same with resolvedOptions, though?

I don't think I understand.

The use cases for supportedLocalesOf are:

  1. I have a list of locales. Before I show them to the user for selection, I want to check which of them have DateTimeFormat support, and show the user a filtered-down list that I know will have date and time formats.
  2. I have a localization message in 55 locales. The message has a placeholder with a date. To format the date I'll use DateTimeFormat, so I'm going to filter the 55 locales down against the locales the environment has date and time formats for. The resulting list (the subset of locales I have l10n resources for and date/time formats for) will be negotiated against the user's requested locales to come up with the best locale fallback chain that has both l10n resources and date/time formats, so that my resulting message looks consistent (see the sketch after this list).
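A sketch of use case (2), with a deliberately naive exact-match negotiation standing in for a real matcher:

```js
const userRequested = ["de-CH", "fr-CH"];
const l10nLocales = ["de", "fr", "es"]; // imagine 55 of these

// supportedLocalesOf keeps only the locales the engine can format dates in:
const candidates = Intl.DateTimeFormat.supportedLocalesOf(l10nLocales);

// Naive stand-in negotiation: first exact match, else the default locale.
// Note how it falls through to "en" here, which is exactly the implicit
// fallback chain problem discussed later in this thread.
const best = userRequested.find((l) => candidates.includes(l)) ?? "en";
```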

How would you use resolvedOptions to achieve that?
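For comparison, a sketch of what resolvedOptions does report:

```js
// resolvedOptions exposes only the single locale that one formatter
// instance settled on, after negotiation has already happened:
new Intl.DateTimeFormat(["de-CH", "fr"]).resolvedOptions().locale;
// → e.g. "de-CH" or "de", depending on the engine's data

// It cannot enumerate or filter the engine's supported locales the way
// supportedLocalesOf does in the two use cases above.
```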

sffc (Contributor) commented Dec 1, 2020

Thanks for the discussion. Here's my take. Spoiler alert: I stand somewhere in the middle of the spectrum.

Requirement 1 (Prior Art)

There's no doubt that there is prior art for language negotiation, in particular UTS 35 Section 4.4 Language Matching (ca. 2019), as well as other less-standard language negotiation algorithms you can find around the ecosystem.

UTS 35 builds upon experience from previous CLDR language matching algorithms, based on multiple years of work by @macchiati and colleagues. Thanks to @markusicu, ICU4C and ICU4J are up to date with these latest recommendations.

I can also see Zibi's perspective on this point: "There are multiple 'prior arts' in the industry, and the authors are unhappy with the current ones." However, this point is not specified in CONTRIBUTING.md (#512). I will follow up with a revision: #517

Note that the slides Zibi references are from 2010. The latest version of UTS 35 is Mark's latest attempt at solving this problem, the one he was making a case for in 2010.

What I do think would be productive is a discussion about the specific pros and cons of the different language negotiation algorithms. If Zibi has qualms about Mark and Markus's ICU 65 language negotiation algorithm, we should hear them, so that Unicode can iterate on improving the algorithm and the data.

Requirement 2 (Difficult to Implement in Userland)

Older, more simplistic language matching algorithms required less data and were therefore simpler to implement in userland. However, UTS 35 Section 4.4 is quite complex to implement, and it requires a sizeable chunk of locale data. It took multiple engineer-years of work to get ICU4C and ICU4J up to speed with UTS 35 Section 4.4; @markusicu can attest to that.

Requirement 3 (Broad Appeal)

I am convinced that language negotiation has very broad appeal. I don't think this point is disputed.

Conclusion

In my mind, the main problem is point 1: we don't have agreement that the prior art is high-quality. Here's a possible path forward:

  1. Open a Stage 1 proposal for LocaleMatcher so that we can organize these discussions. Please remember that Stage 1 merely means that we think this is a problem worth investigating; it does not mean we have agreed on a final direction.
  2. Collect more data on the language negotiation algorithms in the wild. Summarize them, and propose one or more for inclusion in ECMA-402, using an option to toggle between them, like "basic" vs. "best fit" for Intl.DateTimeFormat skeletons (a possible shape is sketched after this list).
  3. Then, we can make a fully informed decision before moving this proposal to Stage 2.
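A hypothetical shape for the option toggle in step (2); every name here is invented for illustration and is not part of any spec:

```js
const chain = Intl.LocaleMatcher.match(
  ["de-CH", "fr"],         // requested
  ["de", "fr", "es"],      // available
  { algorithm: "lookup" }  // vs. an implementation-defined "best fit"
);
```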

zbraniecki (Member):

Thanks @sffc! I think we definitely agree on (3). I'm awaiting advancement in our conversation about locale matching in ICU4X before I make up my mind on (1), and my approach w.r.t. (2) is that we should identify building blocks we can expose to enable many different algorithms.

In other words, in the context of (2), I believe the problem of data should be separated from the problem of the algorithm. If good locale negotiation requires data, we should consider exposing the data.
But if it requires an algorithm, we should agree on what algorithm fits all needs.

I'm not sure whether I have qualms with the current UTS 35 approach, but the number of different algorithms I've seen, together with your claim that the one in UTS 35 is very complex, makes me think that if we can expose building blocks and make it easy to write a user-land library, then we can try to convince the JS ecosystem to use it, and if they do so successfully, we can claim we have grounds to imprint it in the standard.

longlho (Collaborator, Author) commented Dec 1, 2020

Thanks @sffc. @zbraniecki I myself don't understand the shortcomings of UTS 35 LanguageMatching. Can you clarify? Were there specific bugs filed?

In terms of algorithms that I know of, I've included them in https://github.com/longlho/proposal-intl-localematcher#prior-arts. @zbraniecki, can you point me to any you've seen that are not in this list?

zbraniecki (Member):

> @zbraniecki I myself don't understand the shortcomings of UTS 35 LanguageMatching. Can you clarify?

Sure,

  • The ICU API returns a single locale. I listed above my position that a single locale is almost always an implicit fallback chain of [thatLocale, lastFallbackLocale], and that this implicitness is bad API design that leads to bad UX. We can do better. Here's an example of what fluent-langneg does (see the usage sketch after this list): https://github.com/projectfluent/fluent.js/blob/master/fluent-langneg/README.md#strategies
  • If there are multiple supported locales with the same (language, script, region) likely subtags, the current implementation returns the first of those locales. That's also a choice, and a suboptimal one for the localization use case.
  • Performance: I haven't run any recent performance evaluations, but IIRC the algorithm's design had a high performance impact, requiring a significant amount of CPU time to be spent by default on edge-case scenarios like variants and on debatable features such as distances. The fluent-langneg algorithm mentioned above performed significantly better in our tests against Firefox startup matchings, without any quality degradation on our test corpus.
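A usage sketch of fluent-langneg's list-producing negotiation, assuming the @fluent/langneg package and the strategies its README documents:

```js
import { negotiateLanguages } from "@fluent/langneg";

negotiateLanguages(
  ["de-CH", "fr-CH", "es"],    // requested
  ["de", "fr", "fr-CH", "es"], // available
  { defaultLocale: "en", strategy: "filtering" }
);
// → a fallback chain, e.g. ["de", "fr-CH", "fr", "es", "en"],
//   rather than a single locale
```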

> Were there specific bugs that were filed?

I don't think it's fixable by "bugs"; it's a design decision. As I stated above, I don't believe we're yet at a place where we have a "gold standard" of locale matching that we can standardize in JavaScript forever. Which makes me concerned when you aim to do exactly that.

macchiati commented Jan 4, 2021 via email

zbraniecki (Member) commented Jan 4, 2021

Hi Mark! Thank you for taking the time to respond here!

> Are you saying that the single locale should be "almost always implicit fallback locale chain of [thatLocale, lastFallbackLocale]", or that it is that in UTS #35 but shouldn't be?

I argue that very rarely in the modern localization ecosystem do we encounter a scenario where a single outcome locale is actually useful.

Almost all the cases I've encountered use that outcome locale, which is the output of one negotiation, as an input to another negotiation further down the chain.

I documented it here: https://firefox-source-docs.mozilla.org/intl/locale.html#language-negotiation

> (Also, I don't know what you mean by "thatLocale" (I was guessing the requested locale, but there can be multiple) and "lastFallbackLocale" (no idea).)

thatLocale in my example is the outcome of the negotiation. If the outcome is a single locale, then any operation down the chain that fails to match perfectly will fall back on the last fallback locale, the locale used when nothing matches.

In other words, I'm claiming that if you ever operate on a single locale, in most cases you're actually operating on an implicit two-element fallback chain, and that is a hidden factor that should not stay hidden.

> Can you clarify what the intended usage scenarios are for filtering and matching?

Let me try!

I'll start by quoting UTS #35 for naming-convention alignment:

> Language matching is used to find the best supported locale ID given a requested list of languages.

On the input side we have a list of requested locales (also called desired in the doc) and a list of supported locales for a given scenario. The output of the LanguageMatcher is a single supported locale ID, called the resulting language.

If the scenario is localization (and this has been quoted as the primary scenario for this API), then the resulting language is passed to the localization API as its input.
As the localization performs its operation, it may encounter a chained operation: say, date formatting, pluralization, number formatting, or even localized icon selection, color-scheme choice, or some other chained sub-internationalization operation.

In such a scenario, the library will have access to the list of locales in which it can perform that operation (a new set of supported locales), and it will need to negotiate that list against the list of locales the user prefers.

But that's not trivial. There are actually three sets of locales that may be the right requested-locales input for that negotiation:

a) The single resulting locale that the localization API was given as the outcome of the initial negotiation
b) A wide list of locales that the localization API has resources for, filtered by the user's original requested locales
c) The original list of requested locales from the user

Then there are also two ways to think about what the output should be:

  1. either a list of requested locales,
  2. or a list of supported locales.

Example: Localization -> Date

Drilling into a particular example: let's say we're dealing with a localization that contains a date, and we have the following information:

user_requested = ["de-CH", "fr-CH", "es"]; // ordered
localization_supported = ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"]; // unordered
date_supported = ["es-ES", "fr-CA", "fr-CH", "fr-FR"]; // unordered
default_locale = "en";

If the outcome of the initial negotiation is ["de-CH", "fr-CH", "es"] x ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"] = ["de"], then how should the nested negotiation be performed?

We know we don't have dates in de, so what should happen?

What often happens is that the API will just take en, because we could not match ["de"] against any of the date_supported locales. So we implicitly constructed a two-element fallback chain, ["de", "en"]. I believe that implicit fallback chain is the worst possible chain we could produce, and we should aim to do better.
But if someone wants that behavior, it's what the lookup strategy does.

We might also, instead of passing de, meaning "we localize in de resources", pass de-CH, meaning "we localize in some locale based on the user's requested de-CH locale". The benefit of the latter is that we preserve more information about the user's preference, and the subsequent negotiation has a chance to fine-tune to it better (yes, we use German translations, but we can adapt to Swiss date/time formatting!).

We could also allow the initial negotiation to produce the widest possible list of localization-supported locales, essentially filtering the localization_supported list to get as many ordered locales as possible, allowing the date-and-time negotiation to know much more about what the user wanted and maximizing the chance that it provides a better fallback.

It may be tempting to at least cut off at the language level and say that if our translation resources are in German, then there's no value in providing fr-IT as an optional input; after all, we don't want to end up with a German localization with French dates embedded, do we?
But I think that's a consequence of the hidden implicit chain I described above: we actually will fall back across languages, and, what's worse, we'll fall back to a worse language despite the user having provided us all the information necessary to give them a better output.
We just threw it away in the first negotiation, because we removed all bits but one.
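A sketch of the wide-list option (b) on the example above; the filtering logic is illustrative only:

```js
const user_requested = ["de-CH", "fr-CH", "es"];
const localization_supported = ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"];
const date_supported = ["es-ES", "fr-CA", "fr-CH", "fr-FR"];

// Keep every supported locale whose language matches a requested locale,
// preserving the user's order instead of collapsing to ["de"]:
const lang = (tag) => new Intl.Locale(tag).language;
const wide = user_requested.flatMap((req) =>
  localization_supported.filter((sup) => lang(sup) === lang(req))
);
// wide → ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"]
// A date/time negotiation against this list can now land on fr-FR or
// es-ES instead of falling back to the default "en".
```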

> 1. Is filtering basically like the union of calling lookup for each of the requested locales?

Correct. It offers a mid-point between performance and completeness.

> 1. Is matching something like "return each available locale that is within some distance metric of one of the requested locales"?

Yes. The distance in the Fluent algorithm is a binary term separating languages and scripts.
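A sketch of such a binary distance (0 when maximized language and script agree, 1 otherwise; illustrative only):

```js
function distance(a, b) {
  const A = new Intl.Locale(a).maximize();
  const B = new Intl.Locale(b).maximize();
  return A.language === B.language && A.script === B.script ? 0 : 1;
}

distance("de-CH", "de-AT");     // → 0 (same language and script)
distance("sr-Latn", "sr-Cyrl"); // → 1 (scripts differ)
```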

>   1. The data doesn't produce desired results (but additions of data could address that)
>   2. The data structure doesn't allow for desired results
>   3. The data structure & implementation can't be optimized
>   4. The data structure & implementation produces (or can produce) desired results, but many of the results are edge cases where the performance cost isn't worth it.

I'm not sure I know the answer to that. I feel that the UTS #35 algorithm does its job well for the considerations it was given, and badly for others. I'm not sure whether it's a matter of extending the algorithm, or of accepting that there are different tradeoffs and, as a result, different optimal strategies and algorithms.

The UTS #35 algorithm works well when all resources in a network of components have all the same locales, which is increasingly rare, and I doubt that should be assumed for the future.

I claim no superiority of Fluent's approach over UTS #35, but I do believe there are multiple ways to "choose" the best approach: the current UTS #35 algorithm is just one of them, working well for a particular environment, and Fluent's is another, working well for a different one. There are likely others that work better for other scenarios.

For example, one could pass the original requested locales through a Localization API down to a chained Screenshot Selection API, to select the screenshot in the best-matched locale even if the translated text around it uses some second- or third-preferred locale. In such a case we could say it's beneficial to allow the nested negotiation to produce a better outcome than the parent negotiation was able to.

I believe there should be a way to design an API that allows for the six models I know of (lookup/filtering/matching x requested-as-output/supported-as-output), but I haven't had time to work on that.

There are other consequences of the algorithm producing a single locale besides chaining; chaining is just one example of this complexity.
Another, just to document it, is direct fallback. If we provide the Localization API a single locale as the output of negotiation, and at runtime some error occurs and a single message, or a group of messages, cannot be resolved in that locale, what should the fallback be?
If we have just one locale then, again, the API will most likely fall back on the implicit chain whose second element is the "default locale". If we carry a more detailed fallback chain in the Localization API, the fallback can be more nuanced and result in a better user experience.

Does that help illustrate the problem scope I see?

macchiati commented Jan 4, 2021 via email

zbraniecki (Member):

> Is that the intended reference?

I would be genuinely impressed if it were! Thank you! Fixing :)
