Skip to content

fix: BL-14889 use indiv langtags for rep languages in tag creation#134

Merged
andrew-polk merged 1 commit intomainfrom
BL-14889_langtags_use_indiv
Feb 13, 2026
Merged

fix: BL-14889 use indiv langtags for rep languages in tag creation#134
andrew-polk merged 1 commit intomainfrom
BL-14889_langtags_use_indiv

Conversation

@nabalone
Copy link
Collaborator

@nabalone nabalone commented Feb 13, 2026

This change is Reviewable

@nabalone nabalone marked this pull request as draft February 13, 2026 15:47
Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk made 3 comments.
Reviewable status: 0 of 11 files reviewed, 3 unresolved discussions (waiting on @nabalone).


components/language-chooser/common/find-language/languageTagUtils.ts line 49 at r1 (raw file):

}

export function ensureLangSubtagIsCanonicalForReps(

This function wants a comment.


components/language-chooser/common/find-language/languageTagUtils.ts line 55 at r1 (raw file):

  const canonicalLanguageSubtag = language.isRepresentativeForMacrolanguage
    ? language.alternativeTags?.[0]?.split("-")[0] || language.languageSubtag
    : language.languageSubtag;

pasting devin comment below.
We should at least have a comment?


components/language-chooser/common/find-language/languageTagUtils.ts:R53-55

ensureLangSubtagIsCanonicalForReps relies on alternativeTags[0] ordering convention

The canonical macrolanguage code is derived from language.alternativeTags?.[0]?.split("-")[0] at languageTagUtils.ts:54. This assumes the first entry in alternativeTags always begins with the macrolanguage code (e.g., ar-Arab-EG for Standard Arabic, uz-Latn-UZ for Northern Uzbek). If the data ordering ever changes such that the first alternative tag starts with the individual code (e.g., arb-Brai), the canonical extraction would silently return the wrong code. The current data appears consistent, and the test at languageTagUtils.spec.ts:516-531 covers the empty alternativeTags fallback case, but there is no validation that alternativeTags[0] actually starts with a macrolanguage code.


components/language-chooser/common/find-language/searchForLanguage.ts line 234 at r1 (raw file):

    const canonicalLanguageSubtag = language.isRepresentativeForMacrolanguage
      ? language.alternativeTags?.[0]?.split("-")[0] || language.languageSubtag
      : language.languageSubtag;

devin says


components/language-chooser/common/find-language/searchForLanguage.ts:R232-248

Redundant canonical conversion in parseLangtagFromLangChooser

On line 238, the code manually replaces the language subtag via languageTag.replace(languageSubtag || "", canonicalLanguageSubtag), and then also passes the language object to getMaximalLangtag, which internally calls ensureLangSubtagIsCanonicalForReps again (languageTagUtils.ts:111). The second conversion is always a no-op since the subtag is already canonical after the manual replace, so this is harmless but redundant. Additionally, canonicalLanguageSubtag is computed inline here duplicating the logic from ensureLangSubtagIsCanonicalForReps. The manual replace could be removed entirely since getMaximalLangtag now handles the conversion internally via the language parameter.

@nabalone nabalone marked this pull request as ready for review February 13, 2026 17:39
Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk reviewed 11 files and all commit messages, and made 8 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @nabalone).


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 672 at r1 (raw file):

and so are tested with the modifier instead of here

I think there is a mistake in the wording here. At least, I don't understand what it means


components/language-chooser/common/find-language/searchForLanguage.ts line 227 at r1 (raw file):

The script must be implied, look for the equivalent maximal tag, which will have a script subtag explicit.

I think this should be two sentences? Or at least a semicolon?


components/language-chooser/common/find-language/searchForLanguage.ts line 231 at r1 (raw file):

to to

to do?


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 102 at r1 (raw file):

        NORTHERN_UZBEK_LANGUAGE
      )
    ).toEqual("uz");

For any test cases where the output of the test is not the same as the output of the chooser overall, it seems like we want a clarifying comment?
I.e. I think this is giving uz because it is beyond the scope of createTag to make sure it is using the representative language? But at first glance, it seems like this is validating something different than our rep-first policy.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 196 at r1 (raw file):

    expect(
      getShortestSufficientLangtag("uzn-Latn-UZ", NORTHERN_UZBEK_LANGUAGE)
    ).toContain("uz");

contain here seems odd since it will match uzn and uz


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 202 at r1 (raw file):

    expect(
      getShortestSufficientLangtag("uzn-Cyrl-x-foobar", NORTHERN_UZBEK_LANGUAGE)
    ).toContain("-x-foobar");

Should verify language subtag?


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 258 at r1 (raw file):

    expect(getMaximalLangtag("arb-Arab", STANDARD_ARABIC_LANGUAGE)).toContain(
      "-Arab-EG"
    );

Seems like being explicit about the full tag we expect would be better.
I.e. use toEqual instead of toContain


components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):

#### Language tag shortening

<!-- TODO this should be createTagFromOrthography? -->

Looks like this TODO comment can be removed.

Should the "createTag" in the last sentence also be changed to createTagFromOrthography?

@nabalone nabalone force-pushed the BL-14889_langtags_use_indiv branch from 8f72658 to 096a374 Compare February 13, 2026 19:52
Copy link
Collaborator Author

@nabalone nabalone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nabalone made 8 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @andrew-polk).


components/language-chooser/common/find-language/languageTagUtils.ts line 49 at r1 (raw file):

Previously, andrew-polk wrote…

This function wants a comment.

Done.


components/language-chooser/common/find-language/languageTagUtils.ts line 55 at r1 (raw file):

Previously, andrew-polk wrote…

pasting devin comment below.
We should at least have a comment?


components/language-chooser/common/find-language/languageTagUtils.ts:R53-55

ensureLangSubtagIsCanonicalForReps relies on alternativeTags[0] ordering convention

The canonical macrolanguage code is derived from language.alternativeTags?.[0]?.split("-")[0] at languageTagUtils.ts:54. This assumes the first entry in alternativeTags always begins with the macrolanguage code (e.g., ar-Arab-EG for Standard Arabic, uz-Latn-UZ for Northern Uzbek). If the data ordering ever changes such that the first alternative tag starts with the individual code (e.g., arb-Brai), the canonical extraction would silently return the wrong code. The current data appears consistent, and the test at languageTagUtils.spec.ts:516-531 covers the empty alternativeTags fallback case, but there is no validation that alternativeTags[0] actually starts with a macrolanguage code.

Done.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 102 at r1 (raw file):

Previously, andrew-polk wrote…

For any test cases where the output of the test is not the same as the output of the chooser overall, it seems like we want a clarifying comment?
I.e. I think this is giving uz because it is beyond the scope of createTag to make sure it is using the representative language? But at first glance, it seems like this is validating something different than our rep-first policy.

Done.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 196 at r1 (raw file):

Previously, andrew-polk wrote…

contain here seems odd since it will match uzn and uz

The point of all these functions that I was really trying to test here is that they remove or add the other subtags as appropriate, but I can just use equals if it creates less confusion.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 202 at r1 (raw file):

Previously, andrew-polk wrote…

Should verify language subtag?

Done.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 258 at r1 (raw file):

Previously, andrew-polk wrote…

Seems like being explicit about the full tag we expect would be better.
I.e. use toEqual instead of toContain

Done.


components/language-chooser/common/find-language/languageTagUtils.spec.ts line 672 at r1 (raw file):

Previously, andrew-polk wrote…

and so are tested with the modifier instead of here

I think there is a mistake in the wording here. At least, I don't understand what it means

Hmm, does this make sense?


components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):

Previously, andrew-polk wrote…

Looks like this TODO comment can be removed.

Should the "createTag" in the last sentence also be changed to createTagFromOrthography?

Done. oops

Copy link
Collaborator Author

@nabalone nabalone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nabalone made 1 comment.
Reviewable status: 6 of 14 files reviewed, 12 unresolved discussions (waiting on @andrew-polk).


components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 114 at r2 (raw file):

      isRepresentativeForMacrolanguage: entry.isRepresentativeForMacrolanguage,
      // entry.tag is the canonical tag (https://github.com/silnrsi/langtags/blob/release/doc/langtags.md)
      alternativeTags: new Set([entry.tag, entry.full, ...(entry.tags || [])]),

Note this change tends to bump representative languages up in the results order, e.g. if you type "uz", Northern Uzbek now shows up on top of the Uzbek Macrolanguage whereas before the latter would come up first. But I'm thinking this change is not bad since we want to discourage macrolanguages anyway

Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk reviewed 8 files and all commit messages, made 2 comments, and resolved 7 discussions.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @nabalone).


components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):

Previously, nabalone (Noel) wrote…

Done. oops

The text in this paragraph still references "createTag". I think that's a mistake. But maybe I'm not undestanding.


components/language-chooser/react/language-chooser-react-mui/e2e/languageSearch.e2e.ts line 105 at r2 (raw file):

    await expect(sanCard).not.toContainText("vsn");
  });

Sorry; I should have asked for this in the last round...
Seems worth having one e2e here for selecting a representative language and getting the correct subtag out?

I.e. something which would have failed before these changes?

Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk made 2 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @nabalone).


components/language-chooser/common/find-language/testUtils.ts line 73 at r2 (raw file):

    "arb-Brai",
    "ar",
  ],

devin says the following.
Feel free to say this doesn't matter


components/language-chooser/common/find-language/testUtils.ts:R65-73

STANDARD_ARABIC_LANGUAGE test data has non-canonical first alternativeTag

The STANDARD_ARABIC_LANGUAGE test constant has alternativeTags[0] = "ar-Arab-EG" (a full tag), whereas the actual generated data in languageData.json has "ar" as the first element. The code comment at languageTagUtils.ts:57 states language.alternativeTags[0] is the canonical tag. The test languageSearch.spec.ts:528 verifies real data has the canonical tag first. The test data inconsistency doesn't cause failures because ensureLangSubtagIsCanonicalForReps does .split("-")[0] to extract just the language subtag ("ar" in both cases), but the test data doesn't match real data structure and could mislead future developers.


components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 75 at r2 (raw file):

      entry.full,
      ...(entry.tags ?? []),
    ]);

devin says


Missing `entry.tag` in combine path of `addOrCombineLangtagsEntry` causes canonical short tags to be absent from `alternativeTags`

The PR added entry.tag to the new-entry path (line 114) to ensure the canonical tag appears first in alternativeTags, but the combine path (lines 71-75) was not updated to also include entry.tag. This means canonical short tags from subsequent langtags.json entries for the same language are missing from alternativeTags.

When a language has multiple entries in langtags.json (e.g., Northern Uzbek has entries for uzuz-AFuz-Braiuz-Cyrluz-Sogd), only the first entry goes through the new-entry path at line 114 where entry.tag is now added. Subsequent entries go through the combine path at lines 71-75, which adds entry.full and entry.tags but not entry.tag.

For example, for Northern Uzbek the canonical short tags uz-AFuz-Braiuz-CNuz-Cyrl, and uz-Sogd are confirmed missing from alternativeTags in the generated languageData.json. Since alternativeTags is used as a Fuse search key (searchForLanguage.ts:26), searching for these canonical short forms won't match the language.

The first entry's canonical tag (uz) IS correctly included and first, so ensureLangSubtagIsCanonicalForReps works correctly. The impact is limited to search completeness for secondary canonical tag forms.

Suggested fix

    langs[entry.indivIsoCode].alternativeTags = new Set([
      ...langs[entry.indivIsoCode].alternativeTags,
      entry.tag,
      entry.full,
      ...(entry.tags ?? []),
    ]);

@nabalone nabalone force-pushed the BL-14889_langtags_use_indiv branch from 096a374 to 91c4be9 Compare February 13, 2026 22:24
Copy link
Collaborator Author

@nabalone nabalone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nabalone made 6 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @andrew-polk).


components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):

Previously, andrew-polk wrote…

The text in this paragraph still references "createTag". I think that's a mistake. But maybe I'm not undestanding.

Done.


components/language-chooser/common/find-language/searchForLanguage.ts line 227 at r1 (raw file):

Previously, andrew-polk wrote…

The script must be implied, look for the equivalent maximal tag, which will have a script subtag explicit.

I think this should be two sentences? Or at least a semicolon?

Done.


components/language-chooser/common/find-language/searchForLanguage.ts line 231 at r1 (raw file):

Previously, andrew-polk wrote…

to to

to do?

Done.


components/language-chooser/common/find-language/searchForLanguage.ts line 234 at r1 (raw file):

Previously, andrew-polk wrote…

devin says


components/language-chooser/common/find-language/searchForLanguage.ts:R232-248

Redundant canonical conversion in parseLangtagFromLangChooser

On line 238, the code manually replaces the language subtag via languageTag.replace(languageSubtag || "", canonicalLanguageSubtag), and then also passes the language object to getMaximalLangtag, which internally calls ensureLangSubtagIsCanonicalForReps again (languageTagUtils.ts:111). The second conversion is always a no-op since the subtag is already canonical after the manual replace, so this is harmless but redundant. Additionally, canonicalLanguageSubtag is computed inline here duplicating the logic from ensureLangSubtagIsCanonicalForReps. The manual replace could be removed entirely since getMaximalLangtag now handles the conversion internally via the language parameter.

Done.


components/language-chooser/common/find-language/testUtils.ts line 73 at r2 (raw file):

Previously, andrew-polk wrote…

devin says the following.
Feel free to say this doesn't matter


components/language-chooser/common/find-language/testUtils.ts:R65-73

STANDARD_ARABIC_LANGUAGE test data has non-canonical first alternativeTag

The STANDARD_ARABIC_LANGUAGE test constant has alternativeTags[0] = "ar-Arab-EG" (a full tag), whereas the actual generated data in languageData.json has "ar" as the first element. The code comment at languageTagUtils.ts:57 states language.alternativeTags[0] is the canonical tag. The test languageSearch.spec.ts:528 verifies real data has the canonical tag first. The test data inconsistency doesn't cause failures because ensureLangSubtagIsCanonicalForReps does .split("-")[0] to extract just the language subtag ("ar" in both cases), but the test data doesn't match real data structure and could mislead future developers.

Done. Wow, I didn't think of this, glad Devin caught it! It doesn't make a practical difference now but is good for consistency and potential future use.


components/language-chooser/react/language-chooser-react-mui/e2e/languageSearch.e2e.ts line 105 at r2 (raw file):

Previously, andrew-polk wrote…

Sorry; I should have asked for this in the last round...
Seems worth having one e2e here for selecting a representative language and getting the correct subtag out?

I.e. something which would have failed before these changes?

Done. There are a lot because copilot wrote a lot

@nabalone nabalone force-pushed the BL-14889_langtags_use_indiv branch from 91c4be9 to 537a292 Compare February 13, 2026 22:32
@nabalone nabalone force-pushed the BL-14889_langtags_use_indiv branch from 537a292 to 38baaa0 Compare February 13, 2026 23:09
Copy link
Collaborator Author

@nabalone nabalone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nabalone made 1 comment.
Reviewable status: 11 of 15 files reviewed, 8 unresolved discussions (waiting on @andrew-polk).


components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 75 at r2 (raw file):

Previously, andrew-polk wrote…

devin says


Missing `entry.tag` in combine path of `addOrCombineLangtagsEntry` causes canonical short tags to be absent from `alternativeTags`

The PR added entry.tag to the new-entry path (line 114) to ensure the canonical tag appears first in alternativeTags, but the combine path (lines 71-75) was not updated to also include entry.tag. This means canonical short tags from subsequent langtags.json entries for the same language are missing from alternativeTags.

When a language has multiple entries in langtags.json (e.g., Northern Uzbek has entries for uzuz-AFuz-Braiuz-Cyrluz-Sogd), only the first entry goes through the new-entry path at line 114 where entry.tag is now added. Subsequent entries go through the combine path at lines 71-75, which adds entry.full and entry.tags but not entry.tag.

For example, for Northern Uzbek the canonical short tags uz-AFuz-Braiuz-CNuz-Cyrl, and uz-Sogd are confirmed missing from alternativeTags in the generated languageData.json. Since alternativeTags is used as a Fuse search key (searchForLanguage.ts:26), searching for these canonical short forms won't match the language.

The first entry's canonical tag (uz) IS correctly included and first, so ensureLangSubtagIsCanonicalForReps works correctly. The impact is limited to search completeness for secondary canonical tag forms.

Suggested fix

    langs[entry.indivIsoCode].alternativeTags = new Set([
      ...langs[entry.indivIsoCode].alternativeTags,
      entry.tag,
      entry.full,
      ...(entry.tags ?? []),
    ]);

Done.

Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk reviewed 7 files and all commit messages, and resolved 7 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion.

Copy link
Contributor

@andrew-polk andrew-polk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrew-polk resolved 1 discussion and dismissed @nabalone from a discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @nabalone).

@andrew-polk andrew-polk merged commit 5db1c59 into main Feb 13, 2026
2 checks passed
@andrew-polk andrew-polk deleted the BL-14889_langtags_use_indiv branch February 13, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants