Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft for implementing Local Displayname Algorithm #3587

Merged
merged 14 commits into from
Aug 23, 2023

Conversation

snktd
Copy link
Member

@snktd snktd commented Jun 27, 2023

@snktd snktd requested a review from robertbastian June 27, 2023 19:09
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general feedback is that you should be working with Locale objects instead of strings. Locales are always canonicalized, and encapsulate invariants that you need (i.e. that the language tag is present). It also allows you to reduce intermediate String allocations. See https://docs.rs/icu_locid/latest/icu_locid/zerovec/index.html for how to store Locales in ZeroVecs.

You might have to make the method accept &mut Locale so that you can borrow from the locale and at the same time modify it.

Eventually this should probably return a impl Writeable instead of a Cow<str>, which owns the &mut Locale.

@snktd
Copy link
Member Author

snktd commented Jul 11, 2023

My general feedback is that you should be working with Locale objects instead of strings. Locales are always canonicalized, and encapsulate invariants that you need (i.e. that the language tag is present). It also allows you to reduce intermediate String allocations. See https://docs.rs/icu_locid/latest/icu_locid/zerovec/index.html for how to store Locales in ZeroVecs.You might have to make the method accept &mut Locale so that you can borrow from the locale and at the same time modify it. Eventually this should probably return a impl Writeable instead of a Cow, which owns the &mut Locale.

I understand that allocating is in general a bad idea in ICU4X. However, I am not sure how I can use Locale type as is in this algorithm. The main confusion stems from the fact that the longest_matching_prefix can be composed of multiple subtags (language-region, language-script) and based on that the other values such as LDN (Locale Display Name) and LQS (longest qualifying substring) are derived.

For example, locale!("de-CH-EMODENG") is equivalent to
{ "locale": { "type" : "LanguageIdentifier", "language": "German", "region": "Switzerland", "variants": ["Early Modern English"] } }
However, because "de-CH" exists in locale_data, the structure we ideally need is:
{ "locale": { "type" : "LanguageIdentifier", "language": "Swiss High German", "variants": ["Early Modern English"] } }

There can be multiple such combinations here. Like "zh-Hans" which is composition of the language and script subtags, but is part of locale data which translates to "Simplified Mandarin Chinese". So, I am not sure how I can create longest_matching_prefix without creating a new string and use that to derive LDN.

Note that the longest_matching_prefix is used to derive LQS as well and is still important here as we should avoid using any subtags which were already used to compute LDN. In the example above, the translation for the "region" subtag should not be part of LQS.

I am not sure how I can achieve this without creating new interim strings.

@sffc sffc self-requested a review July 12, 2023 11:28
@robertbastian
Copy link
Member

longest_matching_prefix can be composed of multiple subtags (language-region, language-script) and based on that the other values such as LDN (Locale Display Name) and LQS (longest qualifying substring) are derived.

You can probably model this on the stack, if you know what the possible cases are. If you have something like

enum LongestMatchingSubtag {
  LangRegion(Language, Region),
  LangScript(Language, Script),
  ...
}

you won't need to heap-allocate. Not sure if this is enough to solve your problem.

Another option could be to expose a subtags iterator on Locale.

@snktd
Copy link
Member Author

snktd commented Aug 12, 2023

@robertbastian @sffc updated the PR to follow Robert's suggestion. Basically
(1) I added the code to convert between L-S and L-R subtags to and from LanguageIdentifier.
(2) Used ZeroMap::get_by() to lookup into the map using the LanguageIdentifier.
(3) Followed Robert's suggestion to create an enum: LongestMatchingSubtag {LangRegion, LangScript, Lang} which helped me to use the LongestMatchingSubtag across different methods.

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is much clearer and avoids a lot of allocations

experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
components/locid/src/langid.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
let lang_script_identifier: LanguageIdentifier = (locale.id.language, script).into();
if locale_data
.get()
.names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you not need to check short_names, long_names, and menu_names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alt variants are subsets of the actual names. I haven't found an example for which the alt variant exist and the version with the variant is missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does the spec allow it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found anything related in the spec.

Comment on lines 426 to 430
let longest_matching_subtag = find_longest_matching_subtag(&locale, &self);

// TODO: This binary search needs to return the longest matching found prefix
// instead of just perfect matches
if let Some(displayname) = match self.options.style {
Some(Style::Short) => self
.locale_data
.get()
.short_names
.get_by(|bytes| locale.strict_cmp(bytes).reverse()),
Some(Style::Long) => self
.locale_data
.get()
.long_names
.get_by(|bytes| locale.strict_cmp(bytes).reverse()),
Some(Style::Menu) => self
.locale_data
.get()
.menu_names
.get_by(|bytes| locale.strict_cmp(bytes).reverse()),
_ => None,
// Step - 1: Construct a locale display name string (LDN).
// Find the displayname for the longest_matching_subtag which was derived above.
let ldn = get_locale_display_name(&locale, &longest_matching_subtag, &self);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth combining these two calls so you don't have to do the map lookup twice. You can add a &'a str field to the LongestMatchingSubtag enum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this earlier but it was making the code more cluttered. Moving the other method to the enum impl makes this code much cleaner. Map lookup should be O(1) in this case, if I am not missing any other overhead that comes with the map lookup, I would prefer keeping this as separate calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map lookup is O(nlogn), and the names map is probably the biggest, so ideally we wouldn't do that lookup twice.

experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/Cargo.toml Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
Comment on lines +476 to +478
result.len() + " (".len() + lqs.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
- ", ".len()
+ ")".len(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings should come from locale data. Could be a follow-up but we definitely need this before this goes stable.

https://github.com/unicode-org/cldr-json/blob/80a94b0f6c3a34d6e2dc0dca8639a54babc87f94/cldr-json/cldr-localenames-full/main/zh/localeDisplayNames.json#L12C1-L14C44

Copy link
Member Author

@snktd snktd Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

experimental/displaynames/tests/tests.rs Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
Comment on lines 596 to 598
if let Some(scriptdn) = scriptdisplay {
lqs.push(scriptdn);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use fallback, also for region below

Suggested change
if let Some(scriptdn) = scriptdisplay {
lqs.push(scriptdn);
}
lqs.push(scriptdn.unwrap_or(script.as_str()));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started initially with adding the fallback, but couldn't get pass the borrow checker without using to_string().
In this case,

{
   let mut lqs: Vec<&str> = vec![];
   if let Some(script) = locale.id.script {
       lqs.push(scriptdn.unwrap_or(script.as_str()));
   }
    return lqs;
}

Error:
"lqs.push(scriptdisplay.unwrap_or(script.as_str()));
--------------- script is borrowed here

return lqs;
| ^^^ returns a value referencing data owned by the current function"

Which makes sense. But I couldn't figure how to get around this without allocating a new string for script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a fix

experimental/displaynames/src/displaynames.rs Outdated Show resolved Hide resolved
utils/tzif/src/lib.rs Show resolved Hide resolved
experimental/displaynames/tests/tests.rs Show resolved Hide resolved
experimental/displaynames/tests/tests.rs Outdated Show resolved Hide resolved
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

}
// Throw an error if the LDN is none as it is not possible to have a locale string without the language.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

_ => None,
/// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
let LocaleDisplayNamesFormatter { locale_data, .. } = self;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I personally would just use self.locale_data everywhere instead of doing this, but up to you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 596 to 598
if let Some(scriptdn) = scriptdisplay {
lqs.push(scriptdn);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a fix

@snktd snktd requested a review from sffc August 22, 2023 16:44
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
// NOTE: The subtag ordering of the canonical locale is `language_script_region + variants + extensions`.
// The logic to find the longest matching subtag is based on this ordering.
if let Some(script) = locale.id.script {
Copy link
Member

@robertbastian robertbastian Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you technically have to try all these combinations according to the algorithm?

  • LSRV
  • LSR
  • LSV
  • LRV
  • LS (x)
  • LR (x)
  • LV
  • SR
  • SV
  • RV
  • L (x)
  • S
  • R
  • V

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no explicit mention in the algorithm for what is covered under LocaleDisplayName (LDN). But because we are looking for the longest matching string in the locale and language data, I couldn't find any example in the data which has any other combinations then what is already covered in this implementation. However, we do need to technically support the other combinations as the data may change in future. I think a better way to implement the support for this is to use the subtag iterator.
Sketching the algorithm:

  1. Try matching the entire locale first and return the languageIdentifier if found in the locale data.
  2. If not, remove the last subtag and lookup for the remaining locale string in the locale data, if found then construct the languageIdentifier and return.
  3. Continue step-2 until all the subtags are removed.
  4. Fallback if no match is found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what we have is more efficient than the general solution, I'm okay landing this and fixing the algorithm later if these types of cases come up. The way you've written this, I think old-code-new-data should just ignore the new entries, which is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chopping off subtags will only cover LSRV, LSR, LS, L.

sffc
sffc previously approved these changes Aug 22, 2023
Comment on lines +476 to +478
result.len() + " (".len() + lqs.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
- ", ".len()
+ ")".len(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestCase {
input_1: &locale!("zh_Hans"),
expected: "Simplified Chinese",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: one other case could be "es-Latn-419"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +444 to +445
/// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: this only ever returns one of the following:

  • language-script
  • language-region
  • language

But does the spec allow us to return language-script-region, etc? Consider this in #3913

@@ -406,23 +406,24 @@ impl LocaleDisplayNamesFormatter {

/// Returns the display name of a locale.
/// This implementation is based on the algorithm described in
/// https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm
/// `<https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't want tick marks around this; just the <> is sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -47,6 +47,10 @@ fn test_concatenate() {
input_1: &Locale::from_str("es-419-fonipa").unwrap(),
expected: "Latin American Spanish (IPA Phonetics)",
},
TestCase {
input_1: &Locale::from_str("es-Latn-419").unwrap(),
expected: "Spanish (Latin, Latin America)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be Latin American Spanish (Latin) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But "es-Latn" doesn't exist in locale data? Don't we consider just the sequential subtags to get the longest matching string for locale display name? For example the test for "es-Cyrl-MX" returns "Spanish (Cyrillic, Mexico)" and not the "Mexican Spanish (Cyrillic)".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think you're right: the CLDR test data does the same thing

https://github.com/unicode-org/cldr/blob/main/common/testData/localeIdentifiers/localeDisplayName.txt#L15

this is different than my intuitive sense but if it's what's in the spec then that's what we should implement!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locale data contains

"es-419": "Latin American Spanish",

so the result should be "Latin American Spanish (Latin)". The condition that stops this from happening is in line 463: LR is only tried if the locales has no script. I don't think that's following the spec, because es-419, not es should be the longest matching subtag

Match the L subtags against the type values in the elements. Pick the element with the most subtags matching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think non-sequential subtags can be a valid match, because the spec also says

If there is more than one such element, pick the one that has subtypes matching earlier.

This kind of implies that there can be multiple matches that aren't prefixes of one another, so subtags can be skipped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think the inconsistency with the CLDR test data is that it is running in non-dialect mode. In dialect mode, it's pretty clear that "Latin American Spanish (Latin)" is the correct output. This can be seen by the fact that the CLDR test data for es-419 also says "Spanish (Latin America)" instead of "Latin American Spanish".

So, yes, please fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(happy doing this in a follow-up in #3913)

Comment on lines +425 to +426
#[allow(clippy::indexing_slicing)] // indexes in range
if !lqs.is_empty() {
Copy link
Member

@sffc sffc Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (optional): you can use split_first to avoid the need to unwrap. split_first returns an Option<&T> with the first and an Option<&[T]> with the remainder.

@robertbastian robertbastian marked this pull request as ready for review August 23, 2023 06:43
@robertbastian robertbastian requested a review from a team as a code owner August 23, 2023 06:43
@sffc sffc merged commit 4750225 into unicode-org:main Aug 23, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants