CanonicalizeLanguageTag should remove duplicate attributes/keywords in a Unicode extension, consistent with Intl.Locale #82

jswalden · 2019-11-18T22:20:28Z

Canonicalization performed by CanonicalizeLanguageTag and that performed by Intl.Locale differ in two unintended ways.

1. CanonicalizeLanguageTag doesn't remove duplicated attributes or keywords, e.g. "en-u-attr-attr" and "en-u-co-dict-co-phonebk" are both considered to be canonical. Intl.Locale does (and almost necessarily must, to integrate keywords in the input tag with keywords specified through the options bag).
2. CanonicalizeLanguageTag doesn't replace aliased subtags in Unicode locale extension sequences with their preferred forms, e.g. "en-u-ms-imperial" is canonical according to CanonicalizeLanguageTag, but Intl.Locale will transform it to "en-u-ms-uksystem". (This latter behavior doesn't exist in the current spec because of changes to TR35 upstream. See "Update references to match current UTS 35 spec" (#77) for dealing with that change.)
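
To make the second difference concrete, here is a minimal sketch of alias replacement for a single keyword. The `typeAliases` table and the `preferredType` helper are hypothetical names used only for illustration; the table holds just the one alias mentioned above, whereas a real implementation would presumably take its data from the upstream locale data.

```typescript
// Hypothetical alias table: [key, aliased type, preferred type].
// Only the entry mentioned in this issue is listed.
const typeAliases: [string, string, string][] = [
  ["ms", "imperial", "uksystem"],
];

// Return the preferred form of `type` for the given keyword `key`,
// or `type` unchanged when no alias is known.
function preferredType(key: string, type: string): string {
  for (const [aliasKey, alias, preferred] of typeAliases) {
    if (aliasKey === key && alias === type) {
      return preferred;
    }
  }
  return type;
}
```

With this table, the keyword "ms-imperial" maps to "ms-uksystem", and every other keyword is left alone.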

On the call last week I had thought the latter TR35 upstream change was something we had accepted, and I didn't understand that the first problem still remained, so I was fine with this proposal moving forward. But the latter change was unintentional (#77 will deal with it), and the first problem is real. We need to fix both of these to move this proposal forward, IMO. :-(

I have a patch that augments this proposal with changes to the existing CanonicalizeLanguageTag algorithm such that duplicate attributes and keywords are removed. I am not sure that this is the most elegant way to implement deduplication. But it gets the job done, and of course implementations will choose whatever approach works best for them in reality. I'll create a PR once I've gotten this issue filed and have an issue number to refer to.


…removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does. Fixes tc39#82.

jswalden · 2019-11-19T02:09:31Z

I can't find anything in TR35 or RFC 6067 that describes the removal of duplicate attributes/keywords. So it doesn't look like there's anything in TR35 we could invoke to perform this operation, and it has to be hand-rolled -- as this patch does for CanonicalizeLanguageTag, and as Intl.Locale in this specification already does. :-|

jswalden · 2019-11-19T03:37:13Z

@anba You should probably take a look at and comment on this, seeing as you understand this stuff better than everyone else here. :-)

aphillips · 2019-11-19T03:43:20Z

@jswalden 6067 says something about this:

Only the first occurrence of an attribute or key conveys meaning in a
language tag. When interpreting tags containing the Unicode locale
extension, duplicate attributes or keywords are ignored in the
following way: ignore any attribute that has already appeared in the
tag and ignore any keyword whose key has already occurred in the tag.

This was intended to allow duplicates to be removed without effect (although I don't think we went on to encourage/permit it).
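
The "first occurrence wins" rule quoted above can be sketched roughly as follows. This is not the patch from #83 or any engine's implementation; `dedupeUnicodeExtension` is a hypothetical helper that assumes its input is the lowercased, well-formed body of a Unicode locale extension without the leading "u-". It also keeps the subtags in their original order rather than sorting them.

```typescript
// Hypothetical helper: drop repeated attributes and repeated keyword keys,
// keeping only the first occurrence of each.
function dedupeUnicodeExtension(body: string): string {
  const attributes: string[] = [];
  const keywords: string[][] = []; // each entry is [key, ...type subtags]
  const seenKeys: string[] = [];
  let current: string[] | null = null; // keyword being read; null while in attributes
  let keep = false; // whether the keyword being read is a first occurrence

  for (const subtag of body.split("-")) {
    if (subtag.length === 2) {
      // A two-character subtag starts a new keyword.
      keep = seenKeys.indexOf(subtag) === -1;
      current = [subtag];
      if (keep) {
        seenKeys.push(subtag);
        keywords.push(current);
      }
    } else if (current === null) {
      // Still before the first key, so this subtag is an attribute.
      if (attributes.indexOf(subtag) === -1) {
        attributes.push(subtag);
      }
    } else if (keep) {
      // Type subtag of a keyword we are keeping.
      current.push(subtag);
    }
    // Otherwise: type subtag of a duplicate keyword, ignored.
  }

  return attributes.concat(keywords.map((k) => k.join("-"))).join("-");
}
```

Applied to the examples from the issue description, "attr-attr" becomes "attr" and "co-dict-co-phonebk" becomes "co-dict".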

jswalden · 2019-11-19T03:49:16Z

@aphillips Yeah, the semantics of a tag with duplicates in it are clear enough even if they aren't removed. Just seems to me if we remove them one place -- and we don't really have a choice about it in Intl.Locale unless we wanted to just insert out-of-band options stringwise at the start of any existing Unicode locale extension, but that seems batty -- we should be consistent and do it everywhere.
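
A rough sketch of why merging forces deduplication, assuming keywords have already been parsed into [key, type] pairs and that a keyword given through the options bag takes precedence over one with the same key in the tag. `Keyword` and `mergeKeywords` are hypothetical names, and the result is not reordered into canonical order.

```typescript
type Keyword = [string, string]; // [key, type]

// Hypothetical helper: put option keywords first, then apply "first
// occurrence wins", so a tag keyword with an already-seen key is dropped.
function mergeKeywords(fromTag: Keyword[], fromOptions: Keyword[]): Keyword[] {
  const merged: Keyword[] = [];
  for (const keyword of fromOptions.concat(fromTag)) {
    const alreadyPresent = merged.some((m) => m[0] === keyword[0]);
    if (!alreadyPresent) {
      merged.push(keyword);
    }
  }
  return merged;
}
```

This gives the same meaning as prepending the options stringwise to the existing extension, but the result carries no leftover duplicate keys.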

aphillips · 2019-11-19T13:26:53Z

The canonical form is in a specific order. I would just go ahead and remove duplicates; extension tags are already long and unwieldy even without useless cruft in them.


zbraniecki · 2020-01-24T09:00:20Z

#83 fixes this.

jswalden mentioned this issue Nov 18, 2019

Change the CanonicalizeLanguageTag operation so that it removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does #83

Merged

zbraniecki added this to the v1 milestone Dec 17, 2019

zbraniecki mentioned this issue Jan 24, 2020

Update references to match current UTS 35 spec #77

Closed

zbraniecki closed this as completed in #83 Jan 24, 2020
