Support handling underscore as a Locale separator on the input #777

Open
zbraniecki opened this issue Apr 20, 2023 · 9 comments
Labels
c: locale Component: locale identifiers s: comment Status: more info is needed to move forward

Comments

@zbraniecki
Member

We currently reject _ as a subtag separator when parsing a locale.

This has come up in unicode-org/icu4x#3336.

I'm questioning the value of rejecting _ as a subtag separator on input. I would rather follow the Robustness Principle.

We already do not follow Unicode BCP47 Locale strictly: we accept uncanonicalized locales. For example, this works:

new Intl.Locale("EN-fr");
new Intl.DateTimeFormat("aR-Pl");

Both of those work despite not being canonical; we canonicalize them on output.
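As a small illustration of the canonicalization described above, the following sketch shows what the non-canonical tags resolve to. It uses only the standard `Intl.Locale` and `Intl.getCanonicalLocales` APIs; the variable names are chosen for this example.

```javascript
// Non-canonical casing is accepted on input and canonicalized on output.
const locale = new Intl.Locale("EN-fr");
const canonicalTag = locale.toString(); // language lowercased, region uppercased

// Intl.getCanonicalLocales applies the same canonicalization to a list of tags.
const canonicalList = Intl.getCanonicalLocales(["EN-fr", "aR-Pl"]);

console.log(canonicalTag);  // "en-FR"
console.log(canonicalList); // ["en-FR", "ar-PL"]
```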

I suggest we extend support to:

new Intl.Locale("EN_fr");
new Intl.DateTimeFormat("aR_Pl");
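For comparison, here is a sketch of the current behavior that this proposal would change. The helper name `isAcceptedByIntlLocale` is hypothetical and exists only for this example; it relies on `Intl.Locale` throwing a `RangeError` for structurally invalid tags.

```javascript
// Returns true when the Intl.Locale constructor accepts the tag,
// false when it rejects it with a RangeError.
function isAcceptedByIntlLocale(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(isAcceptedByIntlLocale("EN-fr")); // true: hyphen separator, casing is canonicalized
console.log(isAcceptedByIntlLocale("EN_fr")); // false: underscore separator is rejected
```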
@zbraniecki zbraniecki added s: discuss Status: TG2 must discuss to move forward c: locale Component: locale identifiers labels Apr 20, 2023
@aphillips

Note that BCP47 is case insensitive, so the call out about that is incorrect: those are all valid tags.

I tend to agree with canonicalizing underscore to hyphen, since it can be confusing.

@gibson042
Contributor

I am in general much more inclined towards making decisions based on Hyrum's Law than the Postel Principle, because the latter frequently leads to regret such as probably-permanent commitment to the bizarre behavior of Date parsing (ask me how I know). ECMA-402 has made it this far without BCP-47-incompatible Unicode locale identifiers, and I would see little value in pursuing a backwards incompatible syntax at this late stage.

@ptomato
Contributor

ptomato commented Apr 20, 2023

In GNOME's JS environment, we have a use case for accepting _. Information about the current locale comes from a platform API as there is no navigator object. The platform API comes from C and uses _ as a separator (e.g., en_CA) since that's the format accepted by libc's LC_ALL, LC_TIME, etc. environment variables. If you want to use this locale in Intl APIs, you have to transform the _ to a - yourself, which is easy to do incorrectly if you're not familiar with locales. (e.g., assuming the underscore only occurs at position 2)
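The "position 2" mistake mentioned above can be sketched as follows. Both function names are hypothetical, and the second function only addresses separators; it is not a claim about how GNOME's platform API or any library performs the conversion.

```javascript
// Naive conversion: assumes a two-letter language followed by "_" at index 2.
function naiveUnderscoreToHyphen(id) {
  return id[2] === "_" ? id.slice(0, 2) + "-" + id.slice(3) : id;
}

// Replaces every underscore, wherever it occurs.
function allUnderscoresToHyphens(id) {
  return id.replace(/_/g, "-");
}

console.log(naiveUnderscoreToHyphen("en_CA"));      // "en-CA"
console.log(naiveUnderscoreToHyphen("fil_PH"));     // "fil_PH" (three-letter language, unchanged)
console.log(naiveUnderscoreToHyphen("zh_Hant_TW")); // "zh-Hant_TW" (second separator missed)
console.log(allUnderscoresToHyphens("zh_Hant_TW")); // "zh-Hant-TW"
```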

@sffc
Contributor

sffc commented Apr 20, 2023

My personal position started closer to where @gibson042 is coming from, but I found compelling the argument that handling _ is a well defined operation that does not need to complicate a grammar. The only risk is that _ would start being a valid token in a future edition of BCP-47, which I see as unlikely; for example, something like en-GB_ENG where GB_ENG would be a region with subdivision. I therefore do not see risk in accepting _ as part of the grammar upon input in these strings.

@sffc
Contributor

sffc commented Apr 20, 2023

CC @macchiati who (along with @aphillips) is editor of the BCP-47 standard and may have thoughts here. Should Intl accept _ in place of - in locale identifiers that are otherwise interpreted in BCP-47?

@aphillips

@sffc There is no chance that underscore will ever be a valid anything in BCP47. The grammar is purposefully fixed, with pathways available for future expansion via extensions and a few reserved bits in the "normal" grammar.

I do think that the canonical form should never include underscore, but accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore. (@zbraniecki check with Prithvi and Abhijeet for implementation experience and details)

@gibson042
Contributor

> handling _ is a well defined operation that does not need to complicate a grammar

I disagree that it would not complicate any grammar, because UTS 35 deviates from BCP 47 not just in allowing _ as a substitute for -, but also in allowing root as a special standalone language identifier (which would otherwise be syntactically invalid) and optionally allowing language identifiers to start with a script rather than a language (see BCP 47 Conformance). We'd also need to correct all of ECMA-402 to reference "Unicode locale identifiers" rather than "Unicode BCP 47 locale identifiers" and "language tags", because the latter two explicitly exclude identifiers using BCP 47-incompatible syntax such as underscores.
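The three syntax deviations listed above can be sketched as a small classifier. The function name `findBcp47Deviations` and the labels it returns are hypothetical; this is only an illustration of those three cases and not a validator for the rest of an identifier.

```javascript
// Sketch only: flags the UTS 35 backwards-compatibility syntax named above.
// It does not check whether the remaining subtags are well formed.
function findBcp47Deviations(id) {
  const deviations = [];
  if (id.includes("_")) {
    deviations.push("underscore separator");
  }
  const firstSubtag = id.split(/[-_]/)[0];
  if (id.toLowerCase() === "root") {
    // "root" is only special as a standalone identifier.
    deviations.push("root language identifier");
  } else if (/^[A-Za-z]{4}$/.test(firstSubtag)) {
    // A BCP 47 language subtag here is 2-3 or 5-8 letters; 4 letters is a script.
    deviations.push("leading script subtag");
  }
  return deviations;
}

console.log(findBcp47Deviations("en-CA"));   // []
console.log(findBcp47Deviations("en_CA"));   // ["underscore separator"]
console.log(findBcp47Deviations("root"));    // ["root language identifier"]
console.log(findBcp47Deviations("Latn_US")); // ["underscore separator", "leading script subtag"]
```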

> accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore

Isn't that a good thing to encounter, since such identifiers are not valid for general interchange? This is in fact the biggest issue with the robustness principle—it turns deviations into undocumented or poorly documented shadow requirements that spread infectiously but unpredictably throughout an ecosystem.

@sffc
Contributor

sffc commented Nov 16, 2023

TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2023-11-16.md#support-handling-underscore-as-a-locale-separator-on-the-input-777

Conclusion: the committee didn't feel that there was strong enough motivation to make this change at this time. @FrankYFTang also pointed out that it can be done in userland. If there is more evidence to back up making this change we are happy to reconsider it.

@sffc sffc added s: comment Status: more info is needed to move forward and removed s: discuss Status: TG2 must discuss to move forward labels Nov 16, 2023
@srl295
Member

srl295 commented Nov 17, 2023

> > handling _ is a well defined operation that does not need to complicate a grammar
>
> I disagree that it would not complicate any grammar, because UTS 35 deviates from BCP 47 not just in allowing _ as a substitute for -, but also in allowing root as a special standalone language identifier (which would otherwise be syntactically invalid) and optionally allowing language identifiers to start with a script rather than a language (see BCP 47 Conformance). We'd also need to correct all of ECMA-402 to reference "Unicode locale identifiers" rather than "Unicode BCP 47 locale identifiers" and "language tags", because the latter two explicitly exclude identifiers using BCP 47-incompatible syntax such as underscores.

I agree that underscores should NOT be supported. But I would disagree with the unqualified statement that UTS 35 deviates from BCP 47. "Unicode Language and Locale Identifiers" deviate from BCP 47. But UTS 35 does not propose that other processors of BCP 47 (such as ecma402) should also deviate. The introduction to the major section you link to states that "Unicode LDML uses stable identifiers based on BCP47", and the end of the section you link to states that:

> There are thus two subtypes of Unicode locale identifiers:
>
>   • the term Unicode CLDR locale identifier applies where the backwards compatibility syntax is used.
>   • the term Unicode BCP 47 locale identifier applies otherwise. A Unicode BCP 47 locale identifier is also a valid BCP 47 language tag.

Also note that CLDR has a ticket CLDR-15012 Move to BCP47 - CLDR considers the current identifiers to be based on BCP47.

> > accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore
>
> Isn't that a good thing to encounter, since such identifiers are not valid for general interchange? This is in fact the biggest issue with the robustness principle: it turns deviations into undocumented or poorly documented shadow requirements that spread infectiously but unpredictably throughout an ecosystem.

100% this. No BCP47 deviations. That will just hurt users.

@ptomato wrote:

> In GNOME's JS environment, we have a use case for accepting _. Information about the current locale comes from a platform API as there is no navigator object. The platform API comes from C and uses _ as a separator (e.g., en_CA) since that's the format accepted by libc's LC_ALL, LC_TIME, etc. environment variables. If you want to use this locale in Intl APIs, you have to transform the _ to a - yourself, which is easy to do incorrectly if you're not familiar with locales. (e.g., assuming the underscore only occurs at position 2)

Except it's actually much more complex than that. Saying you can transform _ to - yourself really hurts users here, because POSIX locale IDs require a fair amount of processing to convert correctly into ICU locales / Unicode locale identifiers, or into BCP 47. (Ask me how I know!) You should actually be using something like the ICU code I linked to get an ICU locale, and then convert that into BCP 47 using ICU in GNOME's JS environment. Recommending _ to - works in trivial cases, but does other users a disservice.
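A short sketch of why separator replacement alone falls short for POSIX-style locale IDs. The sample IDs (with an encoding suffix, an `@` modifier, and the `C` locale) and both helper names are assumptions made for this example; it does not reproduce the ICU conversion referred to above.

```javascript
// Separator-only conversion, as suggested for the trivial case.
function replaceSeparatorsOnly(posixId) {
  return posixId.replace(/_/g, "-");
}

// Returns true when Intl.Locale accepts the tag, false on RangeError.
function intlAccepts(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err;
  }
}

const samples = ["en_CA", "en_US.UTF-8", "sr_RS@latin", "C"];
const results = samples.map((id) => {
  const candidate = replaceSeparatorsOnly(id);
  return { id, candidate, accepted: intlAccepts(candidate) };
});

// Only the trivial "en_CA" case yields a tag that Intl.Locale accepts.
console.log(results);
```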

Projects
Status: Previously Discussed

6 participants