[Proposal] Make an API for locale negotiation #513
Comments
The proper API for that is very non-trivial. The listed APIs are deeply imperfect, and the exact strategy, especially among multiple locales, may depend heavily on the business needs. That makes me reluctant to "bless" one approach with a standardized API that we'll have to maintain. I'm not opposed to exposing improved APIs to help with matching algorithms (parent locales, addLikelySubtags, etc.), but I'd like to be cautious about providing a single solution, as I don't think we've found a golden one yet.
Would exposing a top-level API for just the `lookup` localeMatcher be possible? IMO that would be a start.
Adding LanguageMatcher was first brought up in #46, so it's not a new issue; it just needs someone to champion it and help shepherd it through the process. As Zibi said, though, we need to be careful in this space, since there are multiple competing implementations of LanguageMatcher with different tradeoffs.
So is the next step to put together a formal proposal repo?
Yes; creating a proposal repo with a README explaining the problem space will enable us to continue exploring it. The instructions are here: https://github.com/tc39/ecma262/blob/master/CONTRIBUTING.md#new-feature-proposals
Does this work? https://github.com/longlho/proposal-intl-bestavailablelocale
Your proposal seems to be going counter to my recommendation. It does introduce an API that makes it look like it is the "canonical" language negotiation. Since I do not believe the industry has designed such a golden standard yet, trying to expose one via ECMA-402 will lead either to us having to support an inferior algorithm forever, or to us having to introduce another one someday.

Your proposal also negotiates down to a single locale, which I think is rarely the right call. I explained my rationale here: https://firefox-source-docs.mozilla.org/intl/locale.html#language-negotiation

To sum it up, I would much rather see us exposing lower-level components such as parentLocales, maximize/minimize, etc. You have the right to push for standardization of some negotiation that you consider good enough to be added to ECMA-402 and supported forever, but I'd like my votum separatum (dissenting opinion) to be explicit.
So right now this is already happening with all the Intl APIs that take in a set of locales, so I'm not sure I understand the rationale against exposing locale matching as a top-level API. It also intentionally exposes the matching algorithm as an option. Would you recommend a different name?

I do believe that at the moment we need some top-level locale negotiation for i18n sites to serve the right translation bundle, one that is consistent with other formatters like date/time/number. We do have the use cases for it, so can you clarify the further requirements?
It's going to be a very tricky experience to try to merge this API shape with MessageFormat 2.0, which cannot handle language negotiation the same way: it has to handle cases where we have resources in one locale but require a formatter like PluralRules, which may have a different set of locales.
No, I just don't think we have a single API that we should expose.
What makes it impossible to implement in user land?
I believe that your questions for me are misguided by the assumption that we should add APIs for everything that we can find a "use case" for. I don't think "we have a use case" is a good justification. We have a use case for localization as well, but we don't have a good format to offer, so we don't standardize something just because we can.

I don't have requirements; I have an opinion. There is a large spectrum of possible algorithms for language negotiation, and I believe that, for now, ECMA-402 should refrain from "blessing" any of them. Instead, we should make sure that we lower the barrier for user-land algorithms to be written. My recommendation would be to develop what you're aiming for as a user-land library and see if it gains universal adoption (here's an example of one we use for Fluent). What prevents you from doing that?
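(For illustration only: a minimal user-land "lookup"-style matcher can already be written on top of `Intl.Locale`, the kind of building block mentioned above. The `lookupOne` helper below is invented for this sketch and skips RFC 4647 details such as singleton subtags.)

```js
// A minimal user-land "lookup" matcher built only on Intl.Locale:
// walk each requested tag from most to least specific and return the
// first canonical form that the caller has resources for.
function lookupOne(requestedLocales, availableLocales) {
  const available = new Set(
    availableLocales.map((tag) => new Intl.Locale(tag).toString())
  );
  for (const tag of requestedLocales) {
    let candidate = new Intl.Locale(tag).toString();
    while (candidate) {
      if (available.has(candidate)) return candidate;
      const cut = candidate.lastIndexOf("-"); // "de-CH" -> "de"
      candidate = cut === -1 ? "" : candidate.slice(0, cut);
    }
  }
  return undefined; // caller decides the default-locale fallback
}

lookupOne(["de-CH", "fr-CH"], ["fr", "de", "en"]); // -> "de"
```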
As a possible (long term) path forward:
I guess I'm basing the requirements off of #512 which this seems to satisfy:
I'm not asking for the blessing of one specific algorithm, but rather a top-level API that allows you to pick one. We already have precedent with the `localeMatcher` option. Going by the list of requirements, I'm not sure what actionable things are missing, so that I can go figure them out.
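(For reference, the precedent in action: `supportedLocalesOf` is a shipping API that already takes a `localeMatcher` option; only the negotiation step itself has no standalone entry point. Output is engine-dependent.)

```js
// Every Intl constructor negotiates locales internally; supportedLocalesOf
// exposes which of the requested locales the engine has data for.
Intl.DateTimeFormat.supportedLocalesOf(["de-CH", "fr-CH", "es"], {
  localeMatcher: "lookup",
});
// e.g. ["de-CH", "fr-CH", "es"], depending on the engine's locale data
```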
@zbraniecki @sffc ok sounds like #512 needs to be updated with something to capture this conversation, since IMO this checks all the reqs but still doesn't seem to qualify.
I think we see the evaluation against that checklist differently, then. Here's mine:
I have not performed such an analysis, but my guess is that we have a plethora of locale negotiation strategies used by libraries, and no single API/algorithm that dominates, either organically or because of its quality.
The issue is the way we standardized the existing `localeMatcher` option.
I make two claims:
You can address (1) by documenting what makes user-land language matcher(s) impossible/infeasible right now.
I'd really like you to address your claim that you checked requirement (2). I don't think you've yet proved that this algorithm is somehow hard to implement in user land and requires extending the ECMA-402 API surface.
So the question I have is: why can't we standardize the API without standardizing the algorithm (effectively making the algorithm implementation-dependent)?
What's the value of such an exercise?
I believe the value is to consolidate usage onto a standardized API. It has the same value proposition as the other Intl APIs.
I don't think I understand. If I'm correct that you can create your npm library for locale negotiation with your own algorithm, and we don't want to standardize the algorithm, then what is the benefit of an ECMA-402 API that doesn't standardize anything except extending the API surface?
Can't you achieve the same with `resolvedOptions()`, though?
I don't think I understand the use case. How would you use `resolvedOptions()` to achieve locale negotiation?
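(For context on the `resolvedOptions()` question, here is what it already exposes today: the locale that one formatter's internal negotiation settled on, scoped to that formatter's data set.)

```js
// resolvedOptions() reports the outcome of the formatter's own internal
// negotiation, but only against that one formatter's locale data.
const dtf = new Intl.DateTimeFormat(["de-CH", "fr-CH", "es"]);
console.log(dtf.resolvedOptions().locale); // e.g. "de-CH"
```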
Thanks for the discussion. Here's my take. Spoiler alert: I stand somewhere in the middle of the spectrum.

**Requirement 1 (Prior Art)**

There's no doubt that there is prior art for language negotiation, in particular UTS 35 Section 4.4 Language Matching (ca. 2019), as well as other less-standard language negotiation algorithms you can find around the ecosystem. UTS 35 builds upon experience from previous CLDR language matching algorithms, based on multiple years of work by @macchiati and colleagues. Thanks to @markusicu, ICU4C and ICU4J are up to date with these latest recommendations.

I can also see Zibi's perspective on this point: "There are multiple 'prior arts' in the industry, and the authors are unhappy with the current ones." However, this point is not specified in CONTRIBUTING.md (#512). I will follow up with a revision: #517

Note that the slides Zibi references are from 2010. The latest version of UTS 35 is Mark's latest attempt at solving this problem, the one he was making a case for in 2010.

What I do think would be productive is a discussion about the specific pros and cons of the different language negotiation algorithms. If Zibi has qualms about Mark and Markus's ICU 65 language negotiation algorithm, we should hear them, so that Unicode can iterate on improving the algorithm and the data.

**Requirement 2 (Difficult to Implement in Userland)**

Older, simpler language matching algorithms required less data and were therefore easier to implement in userland. However, UTS 35 Section 4.4 is quite complex to implement, and it requires a sizeable chunk of locale data. It took multiple engineer-years of work to get ICU4C and ICU4J up to speed with UTS 35 Section 4.4; @markusicu can attest to that.

**Requirement 3 (Broad Appeal)**

I am convinced that language negotiation has very broad appeal. I don't think this point is disputed.

**Conclusion**

In my mind, the main problem is point 1: we don't have agreement that the prior art is high-quality. Here's a possible path forward:
Thanks @sffc! I think we definitely agree on (3). I'm awaiting advancement in our conversation about locale matching in ICU4X before I make up my mind on (1), and my approach wrt. (2) is that we should identify building blocks we can expose to enable many different algorithms.

In other words, in the context of (2), I believe that the problem of data should be separated from the problem of algorithm. If good locale negotiation requires data, we should consider exposing the data.

I'm not sure whether I have qualms with the current UTS 35 approach, but given the number of different algorithms I've seen, and your claim that the one in UTS 35 is very complex, I think that if we can expose building blocks and make it easy to write a user-land library, then we can try to convince the JS ecosystem to use it; if they can do so successfully, we can claim we have grounds to imprint it in the standard.
Thanks @sffc. @zbraniecki I myself don't understand the shortcomings of UTS35 LanguageMatching. Can u clarify? In terms of algorithms that I know of, I've included them in https://github.com/longlho/proposal-intl-localematcher#prior-arts. @zbraniecki can u point me to the ones you've seen that are not in this list?
Sure,

- The ICU API returns a single locale. I listed above my position that a single locale is almost always an implicit fallback locale chain of [thatLocale, lastFallbackLocale], and that implicitness is bad API design that leads to bad UX. We can do better. Here's an example of what fluent-langneg does: https://github.com/projectfluent/fluent.js/blob/master/fluent-langneg/README.md#strategies
- If there are multiple supported locales with the same (language, script, region) likely subtags, then the current implementation returns the first of those locales. That's also a choice, and a suboptimal one for the localization use case.
- Performance: I haven't run any recent performance evaluations, but IIRC the algorithm's design had a high impact on performance, requiring a significant amount of CPU time to be spent by default on edge-case scenarios like variants and debatable features such as distances. The mentioned fluent-langneg algorithm performed significantly better (https://docs.google.com/spreadsheets/d/13x2S_xhGCD8ArUz9MiFPMSxCwrIVsbe1vb3Va35vGc8/edit?usp=sharing) in our tests against Firefox startup matchings, without any quality degradation on our test corpus (https://github.com/projectfluent/fluent-langneg-rs/tree/master/tests/fixtures/negotiate).

> Were there specific bugs that were filed?

I don't think it's fixable by "bugs". It's a design decision. As I stated above, I don't believe we're yet at the place where we have a "golden standard" of locale matching that we can use and standardize in JavaScript forever. Which makes me concerned when you aim to do that.
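(For readers unfamiliar with fluent-langneg: its published `negotiateLanguages` function from `@fluent/langneg` takes a strategy option. The outputs below are illustrative; exact results depend on its data-free algorithm.)

```js
import { negotiateLanguages } from "@fluent/langneg";

// "filtering" (the default) returns every supported locale that some
// requested locale matches, in preference order, plus the default.
negotiateLanguages(["de-CH", "fr-CH", "es"], ["de", "fr", "en"], {
  defaultLocale: "en",
  strategy: "filtering",
});
// e.g. ["de", "fr", "en"]

// "lookup" collapses to a single best match plus the default.
negotiateLanguages(["de-CH", "fr-CH"], ["de", "fr", "en"], {
  defaultLocale: "en",
  strategy: "lookup",
});
// e.g. ["de", "en"]
```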
Zibi, as far as the API goes:

> I listed above my position that single locale is almost always implicit fallback locale chain of [thatLocale, lastFallbackLocale]

I don't understand this point. Are you saying that the single locale *should be* "almost always implicit fallback locale chain of [thatLocale, lastFallbackLocale]", or that it *is* that in UTS #35 but shouldn't be? (Also, I don't know what you mean by "thatLocale" (I was guessing the requested locale, but there can be multiple) and "lastFallbackLocale" (no idea).)

Can you clarify what the intended usage scenarios are for filtering and matching? Also, your semantics of filtering and matching are unclear to me:

1. Is filtering basically like the union of calling lookup for each of the requested locales?
2. Is matching something like "return each available locale that is within some distance metric of one of the requested locales"?

Can you explain more of the desired semantics?

As far as the implementation of UTS #35 goes, can you spell it out a bit more? Do you mean:

1. The data doesn't produce desired results (but additions of data could address that)
2. The data structure doesn't allow for desired results
3. The data structure & implementation can't be optimized
4. The data structure & implementation produces (or can produce) desired results, but many of the results are edge cases where the cost in performance isn't worth it

Mark
Hi Mark! Thank you for taking the time to respond here!
> Are you saying that the single locale *should be* "almost always implicit fallback locale chain of [thatLocale, lastFallbackLocale]", or that it *is* that in UTS #35 but shouldn't be?

I argue that very rarely in the modern localization ecosystem do we encounter a scenario where a single outcome locale is actually useful. Almost all cases I've encountered use that outcome locale, which is an *output* of the negotiation, as an *input* to another negotiation down the chain. I documented it here: https://firefox-source-docs.mozilla.org/intl/locale.html#language-negotiation
> (Also, I don't know what you mean by "thatLocale" (I was guessing the requested locale, but there can be multiple) and "lastFallbackLocale" (no idea).)

thatLocale in my example is the outcome of the negotiation. If the outcome is a single locale, then any operation down the chain that fails to perfectly match will fall back on the last fallback locale, which is the locale used when nothing matches.

In other words, I'm claiming that if you ever operate on a single locale, in most cases you're actually operating on an implicit two-element fallback chain, and that is a hidden factor that should not be.
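(A minimal sketch of what that implicit chain looks like in code; `formatWithImplicitChain` and its arguments are invented for illustration.)

```js
// Any consumer handed one negotiated locale ends up quietly rebuilding
// a two-element fallback chain: [negotiatedLocale, defaultLocale].
function formatWithImplicitChain(negotiatedLocale, defaultLocale, supported) {
  const hiddenChain = [negotiatedLocale, defaultLocale]; // never user-visible
  return hiddenChain.find((tag) => supported.includes(tag));
}

// The date component can't do "de", so it silently jumps to "en",
// even if the user would have preferred, say, French.
formatWithImplicitChain("de", "en", ["es-ES", "fr-CA", "fr-CH", "en"]); // "en"
```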
> Can you clarify what the intended usage scenarios are for filtering and matching?

Let me try! I'll start by quoting UTS #35 for naming-convention alignment:

> Language matching is used to find the best supported locale ID given a requested list of languages.
On the input we have a list of *requested* locales (also called *desired* in the doc), and a list of *supported* locales in a given scenario. The output of the LanguageMatcher is a single supported locale ID called the *resulting language*.

If the scenario is localization (and this has been quoted as the primary scenario for this API), then the resulting language is passed to the localization API as the input. As the localization performs its operation, it may encounter a chained operation: say, date formatting, pluralization, number formatting, or even localized icon selection, color-scheme selection, or another chained sub-internationalization operation. In such a scenario, the library will have access to a list of locales that it has the ability to perform that operation in (new *supported* locales), and will need to negotiate it against the list of locales that the user prefers.

But that's not trivial. There are actually three sets of locales that may be the right *requested* locales input for that negotiation:

a) The single *resulting* locale that the localization API was provided as an outcome of the initial negotiation
b) A wide list of locales that the localization API has resources for, filtered by the user's original requested locales
c) The original list of *requested* locales from the user

Then there are also two ways to think about what should be the input: either it should be a list of *requested* locales, or the list of *supported* locales.
Example: Localization -> Date

Drilling into a particular example: let's say that we're dealing with a localization which contains a date, and we have the following information:

```js
user_requested = ["de-CH", "fr-CH", "es"]; // ordered
localization_supported = ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"]; // unordered
date_supported = ["es-ES", "fr-CA", "fr-CH", "fr-FR"]; // unordered
default_locale = "en";
```
If the outcome of the initial negotiation is ["de-CH", "fr-CH", "es"] x ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"] = ["de"], then how should the nested negotiation be performed? We know we don't have dates in "de", so what should happen?

What often happens is that the API will just take "en", because we could not match ["de"] against any of the date_supported locales. So we implicitly constructed a fallback chain of two elements: ["de", "en"]. I believe that implicit fallback chain is the worst possible chain produced, and we should aim to do better. But if someone wants to use it, that's what the lookup strategy does.

We might also, instead of passing "de" meaning "we localize in de resources", pass "de-CH" as "we localize in some locale based on the user-requested de-CH locale". The benefit of the latter is that we preserve more information about the user's preference, and the subsequent negotiation has a chance to fine-tune better to their preference (yes, we use German translations, but we can adapt to Swiss date/time formatting!).

We can also allow the initial negotiation to produce the widest possible list of localization-supported locales, basically *filtering* the localization_supported list to get as many ordered locales as possible, allowing the date & time negotiation to know much more about what the user wanted, to maximize the chance it'll provide a better fallback.

It may be tempting to at least cut off on language and say that if our translation resources are in German, then there's no value in providing fr-IT as an optional input; after all, we don't want to end up with a German localization with French dates embedded, do we? But I think that's the consequence of the hidden implicit chain I described above: we actually will fall back across languages, and what's worse, we'll fall back to a worse language despite the user providing us all the information necessary to give them a better output. We just threw it out in the first negotiation because we removed all bits but one.
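(A sketch of the filtering alternative using the example data, assuming fluent-langneg's `negotiateLanguages`; the exact output orderings are illustrative.)

```js
import { negotiateLanguages } from "@fluent/langneg";

const userRequested = ["de-CH", "fr-CH", "es"];
const localizationSupported = ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"];
const dateSupported = ["es-ES", "fr-CA", "fr-CH", "fr-FR"];

// Filtering keeps a wide ordered list instead of collapsing to ["de"]...
const forL10n = negotiateLanguages(userRequested, localizationSupported, {
  strategy: "filtering",
}); // e.g. ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"]

// ...so the nested date negotiation still knows the user accepts French,
// and can do better than the implicit ["de", "en"] chain.
negotiateLanguages(forL10n, dateSupported, { defaultLocale: "en" });
// e.g. ["fr-FR", "fr-CH", "fr-CA", "es-ES", "en"] rather than just ["en"]
```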
> 1. Is filtering basically like the union of calling lookup for each of the requested locales?

Correct. It offers a mid-point between performance and completeness.
> 2. Is matching something like "return each available locale that is within some distance metric of one of the requested locales"?

Yes. The distance in the fluent algorithm is a binary term separating languages and scripts.
> 1. The data doesn't produce desired results (but additions of data could address that)
> 2. The data structure doesn't allow for desired results
> 3. The data structure & implementation can't be optimized
> 4. The data structure & implementation produces (or can produce) desired results, but many of the results are edge cases where the cost in performance isn't worth it

I'm not sure I know the answer to that. I feel that the UTS #35 algorithm does its job well for the given considerations, and badly for others. I'm not sure whether it's a matter of extending the algorithm, or of accepting that there are different tradeoffs and different resulting optimal strategies and algorithms.

The UTS #35 algorithm works well when all resources in a components network have all the same locales, which is increasingly rare, and I doubt that should be assumed for the future. I claim no superiority of Fluent's approach over UTS #35, but I do believe there are multiple ways to "choose" the best approach; the current UTS #35 algorithm is just one of them, working well for a particular environment, and Fluent is another, working well for a different one. There are likely others working better for other scenarios.

For example, one could pass the original requested locales through a Localization API down to a chained Screenshot Selection API, to select the screenshot in the best-matched locale, even if the translation text around it uses some second or third preferred locale. In such a case we could say that it's beneficial to allow the nested negotiation to provide a *better* output than the parent negotiation was able to.

I believe there should be a way to design an API that allows for the six models I know of (lookup/filtering/matching x requested-as-output/supported-as-output), but I haven't had time to work on that.

There are other consequences of the algorithm producing a single locale besides chaining. Chaining is just one example of such complexity. Another, just to document it, is direct fallbacking. If we provide a Localization API a single locale as the output of negotiation, and during runtime some error occurs and a single message, or a group of messages, cannot be resolved in that locale, what should the fallback be? If we have just one locale, then, again, most likely the API will perform a fallback on the implicit chain where the second element is the "default locale". If we carry a more detailed fallback chain in the Localization API, then the fallback can be more nuanced and result in a better user experience.

Does that help illustrate the problem scope I see?
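(To make the six-model matrix concrete, here is one hypothetical shape such an API could take; `negotiate`, its options, and the trivial language-equality matching are all invented for this sketch, and no algorithm is blessed by it.)

```js
// Hypothetical option space covering the 3x2 matrix:
// strategy (lookup | filtering | matching) x output (supported | requested).
function negotiate(requested, supported, opts = {}) {
  const { strategy = "filtering", output = "supported" } = opts;
  const language = (tag) => new Intl.Locale(tag).language;
  const pairs = []; // [requestedTag, supportedTag] in preference order
  for (const req of requested) {
    for (const sup of supported) {
      if (language(req) === language(sup)) {
        pairs.push([req, sup]);
        if (strategy === "lookup") break; // one match is enough
      }
    }
    if (strategy === "lookup" && pairs.length) break;
  }
  // A real "matching" strategy would swap the language-equality test for
  // a distance metric; that is elided in this sketch.
  const picked = pairs.map(([req, sup]) => (output === "requested" ? req : sup));
  return [...new Set(picked)];
}

negotiate(["de-CH", "fr-CH"], ["fr-FR", "de"]);                          // ["de", "fr-FR"]
negotiate(["de-CH", "fr-CH"], ["fr-FR", "de"], { output: "requested" }); // ["de-CH", "fr-CH"]
```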
Thanks for all the information; I'll look this over in the next week. (Just had time to glance over it now.) A couple of quick notes.

1. There is definitely a difference in models. Your example has:

```js
localization_supported = ["de", "de-AT", "de-IT", "de-AR", "fr", "fr-IT", "fr-FR", "es"]; // unordered
date_supported = ["es-ES", "fr-CA", "fr-CH", "fr-FR"]; // unordered
default_locale = "en";
```

In UTS #35, the presumption is that the available list presented to the algorithm would be the list supported by the application as a whole. That is, you wouldn't say that the application supports "de" if it didn't also support "de" date formats. This is to avoid the ransom-note effect where the menus are in French (say) but the dates are in Japanese. (Aside: typically, services like date, time, and number formatting will be available in a much wider set of locales than the localizations.)

There may be times, as with your "screenshot selection", where it is fine to go with any of the user's desired locales even if they are quite different from the encompassing content. The localization and screenshot_selection locales might be disjoint. If we know that the user requested French and Japanese, but we only have German and Japanese screenshots, then we present the Japanese screenshots. If we filtered the user's desired set before getting down to the screenshot server, we wouldn't know to do that.

2. I should have written "most acceptable results" instead of "desired results".

3. From https://firefox-source-docs.mozilla.org/intl/locale.html#language-negotiation:

> Such algorithms may vary in sophistication and number of strategies. Mozilla's solution is based on modified logic from RFC 5656 (https://tools.ietf.org/html/rfc5656).

https://tools.ietf.org/html/rfc5656 is "Elliptic Curve Algorithm Integration in the Secure Shell Transport Layer". Is that the intended reference?

Mark
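(A tiny sketch of the single-list model described in point 1, under the assumption that the app computes one app-wide supported set up front; plain set intersection, names invented.)

```js
// UTS #35 model: negotiate once against what the application supports as
// a whole, so menus and dates can never diverge into a "ransom note".
const localizationSupported = ["de", "de-AT", "fr", "fr-FR", "es"];
const dateSupported = ["de", "fr", "fr-FR", "es", "en"];

const appSupported = localizationSupported.filter((tag) =>
  dateSupported.includes(tag)
); // ["de", "fr", "fr-FR", "es"]: the only list handed to the matcher
```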
I would be genuinely impressed if it was! Thank you! Fixing :)
Resolving from a list of requested locales (via the Accept-Language header, for example) to a list of supported locales for a certain website; see vercel/next.js#18676 for an example. Existing locale matching libraries are fairly rudimentary (exact match only). We already have the LookupMatcher/BestAvailableLocale/ResolveLocale abstract operations, which take extensions into account as well, so IMO we should make a top-level API for this.
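(For concreteness, one hypothetical shape such a top-level API could take; the name `Intl.LocaleMatcher`, the `match` signature, and the output below are illustrative, not settled proposal text.)

```js
// Hypothetical top-level entry point wrapping the existing LookupMatcher /
// BestAvailableLocale / ResolveLocale machinery.
const locale = Intl.LocaleMatcher.match(
  ["fr-XX", "en"],          // requested, e.g. parsed from Accept-Language
  ["fr", "en", "de"],       // locales the site has translation bundles for
  "en",                     // default locale
  { algorithm: "best fit" } // or "lookup"
);
// e.g. "fr"
```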