Support handling underscore as a Locale separator on the input #777
Comments
Note that BCP47 is case insensitive, so the callout about that is incorrect: those are all valid tags. I tend to agree with canonicalizing underscore to hyphen, since it can be confusing.
I am in general much more inclined towards making decisions based on Hyrum's Law than the Postel Principle, because the latter frequently leads to regret such as probably-permanent commitment to the bizarre behavior of
In GNOME's JS environment, we have a use case for accepting
My personal position started closer to where @gibson042 is coming from, but I found compelling the argument that handling
CC @macchiati who (along with @aphillips) is editor of the BCP-47 standard and may have thoughts here. Should Intl accept
@sffc There is no chance that underscore will ever be a valid anything in BCP47. The grammar is purposefully fixed, with pathways available for future expansion via extensions and a few reserved bits in the "normal" grammar. I do think that the canonical form should never include underscore, but accepting underscores would potentially remove a tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore. (@zbraniecki check with Prithvi and Abhijeet for implementation experience and details)
I disagree that it would not complicate any grammar, because UTS 35 deviates from BCP 47 not just in allowing
Isn't that a good thing to encounter, since such identifiers are not valid for general interchange? This is in fact the biggest issue with the robustness principle: it turns deviations into undocumented or poorly documented shadow requirements that spread infectiously but unpredictably throughout an ecosystem.
TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2023-11-16.md#support-handling-underscore-as-a-locale-separator-on-the-input-777
Conclusion: the committee didn't feel that there was strong enough motivation to make this change at this time. @FrankYFTang also pointed out that it can be done in userland. If there is more evidence to back up making this change we are happy to reconsider it.
I agree that underscores should NOT be supported. But I would disagree with the unqualified statement that UTS 35 deviates from BCP 47. "Unicode Language and Locale Identifiers" deviate from BCP 47. But UTS 35 does not propose that other processors of BCP 47 (such as ecma402) should also deviate. The introduction to the major section you link to states that "Unicode LDML uses stable identifiers based on BCP47", and the end of the section you link to states that:
Also note that CLDR has a ticket CLDR-15012 Move to BCP47 - CLDR considers the current identifiers to be based on BCP47.
100% this. No BCP47 deviations; that will just hurt users. @ptomato wrote:
Except it's actually much more complex than that. Saying you can transform _ to - yourself really hurts users here, because POSIX locale IDs require a bit of processing to convert correctly into ICU locales / Unicode locale identifiers, or into BCP 47. (Ask me how I know!) In GNOME's JS environment, you should be using something like the ICU code I linked to get an ICU locale, and then converting that into BCP 47 using ICU. Recommending _ to - works in trivial cases, but does other users a disservice.
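To make the "works in trivial cases" point concrete, here is a minimal sketch of the naive userland approach. The helper names `naiveUnderscoreToHyphen` and `tryConvert` and the sample identifiers are hypothetical and not taken from the thread; the only API used is `Intl.getCanonicalLocales`. It does not attempt the ICU-based conversion recommended above; it only shows where a plain replacement stops working.

```javascript
// Hypothetical helper: replace every underscore with a hyphen, then let
// Intl validate and canonicalize the result.
function naiveUnderscoreToHyphen(id) {
  return Intl.getCanonicalLocales(id.replace(/_/g, "-"))[0];
}

// Hypothetical wrapper: return null instead of throwing when Intl rejects
// the rewritten identifier as structurally invalid.
function tryConvert(id) {
  try {
    return naiveUnderscoreToHyphen(id);
  } catch (e) {
    if (e instanceof RangeError) return null;
    throw e;
  }
}

// Illustrative POSIX-style identifiers (assumed inputs, not from the thread).
const samples = ["en_US", "en_US.UTF-8", "sr_RS@latin", "C"];
for (const id of samples) {
  console.log(id, "->", tryConvert(id));
}
```

Only the first sample converts. The codeset suffix, the `@` modifier, and the bare `C` locale are all rejected after the replacement, which is the kind of input that needs real processing rather than a character swap.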
We currently reject _ as a subtag separator when parsing a locale. This has come up in unicode-org/icu4x#3336.
I'm questioning the value of rejecting _ as a subtag separator on the input in favor of following the Robustness Principle. We already do not follow Unicode BCP47 Locale strictly - we accept uncanonicalized Locale, for example this works:
Both of those work despite not being canonical; we canonicalize them on output.
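As a rough illustration of the current behavior described here, the sketch below uses `Intl.getCanonicalLocales`. The tags are illustrative choices, not the ones from the original example, and `isAccepted` is a hypothetical helper added for the demonstration.

```javascript
// Non-canonical casing is accepted on input and canonicalized on output.
const canonical = Intl.getCanonicalLocales(["EN-us", "en-latn-us"]);
console.log(canonical); // ["en-US", "en-Latn-US"]

// Hypothetical helper: report whether Intl accepts a tag as structurally valid.
function isAccepted(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) return false;
    throw e;
  }
}

console.log(isAccepted("en-US")); // true
console.log(isAccepted("en_US")); // false: underscore is rejected as a separator
```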
I suggest we extend support to: