First draft for implementing Local Displayname Algorithm #3587

snktd · 2023-06-27T19:09:04Z

This is based on https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm.

experimental/displaynames/src/displaynames.rs

robertbastian

My general feedback is that you should be working with Locale objects instead of strings. Locales are always canonicalized, and encapsulate invariants that you need (i.e. that the language tag is present). It also allows you to reduce intermediate String allocations. See https://docs.rs/icu_locid/latest/icu_locid/zerovec/index.html for how to store Locales in ZeroVecs.

You might have to make the method accept &mut Locale so that you can borrow from the locale and at the same time modify it.

Eventually this should probably return an impl Writeable instead of a Cow<str>, which owns the &mut Locale.

snktd · 2023-07-11T19:26:12Z

My general feedback is that you should be working with Locale objects instead of strings. Locales are always canonicalized, and encapsulate invariants that you need (i.e. that the language tag is present). It also allows you to reduce intermediate String allocations. See https://docs.rs/icu_locid/latest/icu_locid/zerovec/index.html for how to store Locales in ZeroVecs. You might have to make the method accept &mut Locale so that you can borrow from the locale and at the same time modify it. Eventually this should probably return an impl Writeable instead of a Cow, which owns the &mut Locale.

I understand that allocating is in general a bad idea in ICU4X. However, I am not sure how I can use the Locale type as is in this algorithm. The main confusion stems from the fact that the longest_matching_prefix can be composed of multiple subtags (language-region, language-script), and the other values, such as LDN (Locale Display Name) and LQS (longest qualifying substring), are derived from it.

For example, locale!("de-CH-EMODENG") is equivalent to
{ "locale": { "type" : "LanguageIdentifier", "language": "German", "region": "Switzerland", "variants": ["Early Modern English"] } }
However, because "de-CH" exists in locale_data, the structure we ideally need is:
{ "locale": { "type" : "LanguageIdentifier", "language": "Swiss High German", "variants": ["Early Modern English"] } }

There can be multiple such combinations here. For example, "zh-Hans" is a composition of the language and script subtags, but it is also a key in the locale data, where it translates to "Simplified Mandarin Chinese". So, I am not sure how I can create longest_matching_prefix without creating a new string and using that to derive LDN.

Note that the longest_matching_prefix is used to derive LQS as well and is still important here as we should avoid using any subtags which were already used to compute LDN. In the example above, the translation for the "region" subtag should not be part of LQS.

I am not sure how I can achieve this without creating new interim strings.

robertbastian · 2023-07-12T12:41:45Z

longest_matching_prefix can be composed of multiple subtags (language-region, language-script) and based on that the other values such as LDN (Locale Display Name) and LQS (longest qualifying substring) are derived.

You can probably model this on the stack, if you know what the possible cases are. If you have something like

enum LongestMatchingSubtag {
  LangRegion(Language, Region),
  LangScript(Language, Script),
  ...
}

you won't need to heap-allocate. Not sure if this is enough to solve your problem.
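To make the stack-only idea concrete, here is a minimal sketch. It does not use the real icu_locid types: `Subtag`, `Language`, `Script`, `Region`, and `matches_key` are hypothetical stand-ins, and the three variants follow the `LangRegion`/`LangScript`/`Lang` set mentioned later in this thread. The point is only that a key such as "de-CH" can be compared against the enum without building an intermediate String.

```rust
/// Hypothetical fixed-size ASCII subtag (a stand-in for the real subtag types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Subtag<const N: usize> {
    bytes: [u8; N],
    len: u8,
}

impl<const N: usize> Subtag<N> {
    fn new(s: &str) -> Option<Self> {
        if s.is_empty() || s.len() > N || !s.is_ascii() {
            return None;
        }
        let mut bytes = [0u8; N];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Some(Self { bytes, len: s.len() as u8 })
    }

    fn as_str(&self) -> &str {
        // The bytes are ASCII by construction, so this never falls back.
        std::str::from_utf8(&self.bytes[..self.len as usize]).unwrap_or("")
    }
}

type Language = Subtag<3>;
type Script = Subtag<4>;
type Region = Subtag<3>;

/// Lives entirely on the stack: no String or Vec inside.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LongestMatchingSubtag {
    LangRegion(Language, Region),
    LangScript(Language, Script),
    Lang(Language),
}

impl LongestMatchingSubtag {
    /// Compares against a data key such as "de-CH" subtag by subtag,
    /// without allocating a joined string.
    fn matches_key(&self, key: &str) -> bool {
        let mut parts = key.split('-');
        let subtags_match = match self {
            Self::LangRegion(l, r) => {
                parts.next() == Some(l.as_str()) && parts.next() == Some(r.as_str())
            }
            Self::LangScript(l, s) => {
                parts.next() == Some(l.as_str()) && parts.next() == Some(s.as_str())
            }
            Self::Lang(l) => parts.next() == Some(l.as_str()),
        };
        // The key must not contain extra subtags beyond the matched ones.
        subtags_match && parts.next().is_none()
    }
}
```

Because every variant is `Copy`, the value can be passed between helper methods freely, which is roughly what the enum is used for in the later revision of this PR.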

Another option could be to expose a subtags iterator on Locale.

snktd · 2023-08-12T01:17:59Z

@robertbastian @sffc updated the PR to follow Robert's suggestion. Basically
(1) I added the code to convert between L-S and L-R subtags to and from LanguageIdentifier.
(2) Used ZeroMap::get_by() to look up entries in the map using the LanguageIdentifier.
(3) Followed Robert's suggestion to create an enum: LongestMatchingSubtag {LangRegion, LangScript, Lang} which helped me to use the LongestMatchingSubtag across different methods.

robertbastian

Nice, this is much clearer and avoids a lot of allocations

robertbastian · 2023-08-14T08:59:57Z

experimental/displaynames/src/displaynames.rs

+        let lang_script_identifier: LanguageIdentifier = (locale.id.language, script).into();
+        if locale_data
+            .get()
+            .names


do you not need to check short_names, long_names, and menu_names?

The alt variants are subsets of the actual names. I haven't found an example for which the alt variant exist and the version with the variant is missing.

But does the spec allow it?

I haven't found anything related in the spec.

robertbastian · 2023-08-14T09:01:41Z

experimental/displaynames/src/displaynames.rs

+        let longest_matching_subtag = find_longest_matching_subtag(&locale, &self);

-        // TODO: This binary search needs to return the longest matching found prefix
-        // instead of just perfect matches
-        if let Some(displayname) = match self.options.style {
-            Some(Style::Short) => self
-                .locale_data
-                .get()
-                .short_names
-                .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
-            Some(Style::Long) => self
-                .locale_data
-                .get()
-                .long_names
-                .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
-            Some(Style::Menu) => self
-                .locale_data
-                .get()
-                .menu_names
-                .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
-            _ => None,
+        // Step - 1: Construct a locale display name string (LDN).
+        // Find the displayname for the longest_matching_subtag which was derived above.
+        let ldn = get_locale_display_name(&locale, &longest_matching_subtag, &self);


It might be worth combining these two calls so you don't have to do the map lookup twice. You can add a &'a str field to the LongestMatchingSubtag enum.

I tried this earlier, but it made the code more cluttered. Moving the other method to the enum impl makes this code much cleaner. Map lookup should be O(1) in this case; unless I am missing some other overhead that comes with the map lookup, I would prefer keeping these as separate calls.

Map lookup is O(log n) (it is a binary search over sorted keys, not a hash lookup), and the names map is probably the biggest, so ideally we wouldn't do that lookup twice.
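A rough sketch of the single-lookup shape being suggested, under some loud assumptions: a `BTreeMap` stands in for the real data map, keys are built with `format!` purely for readability (the real code avoids that allocation), and the try order language-script, then language-region, then language is an illustration rather than the PR's exact logic. The `&'a str` payload on each variant is the field proposed above.

```rust
use std::collections::BTreeMap;

/// Hypothetical variant of the enum that also carries the display name
/// found during the search, so the caller never queries the map again.
#[derive(Debug, PartialEq)]
enum LongestMatchingSubtag<'a> {
    LangScript(&'a str),
    LangRegion(&'a str),
    Lang(&'a str),
}

impl<'a> LongestMatchingSubtag<'a> {
    fn display_name(&self) -> &'a str {
        match *self {
            Self::LangScript(name) | Self::LangRegion(name) | Self::Lang(name) => name,
        }
    }
}

/// One pass over the candidates; each candidate costs a single map lookup
/// and the matching name is returned together with the kind of match.
fn find_longest_matching_subtag<'a>(
    names: &BTreeMap<&str, &'a str>,
    language: &str,
    script: Option<&str>,
    region: Option<&str>,
) -> Option<LongestMatchingSubtag<'a>> {
    if let Some(script) = script {
        let key = format!("{language}-{script}");
        if let Some(name) = names.get(key.as_str()) {
            return Some(LongestMatchingSubtag::LangScript(*name));
        }
    }
    if let Some(region) = region {
        let key = format!("{language}-{region}");
        if let Some(name) = names.get(key.as_str()) {
            return Some(LongestMatchingSubtag::LangRegion(*name));
        }
    }
    names.get(language).map(|name| LongestMatchingSubtag::Lang(*name))
}
```

With this shape, the step that constructs the LDN becomes a call to `display_name()` on the result instead of a second search through the names map.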

sffc · 2023-08-15T21:50:31Z

experimental/displaynames/src/displaynames.rs

+                result.len() + " (".len() + lqs.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
+                    - ", ".len()
+                    + ")".len(),


These strings should come from locale data. Could be a follow-up but we definitely need this before this goes stable.

https://github.com/unicode-org/cldr-json/blob/80a94b0f6c3a34d6e2dc0dca8639a54babc87f94/cldr-json/cldr-localenames-full/main/zh/localeDisplayNames.json#L12C1-L14C44

Acknowledged
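For the follow-up, a sketch of what pattern-driven formatting might look like. The function name and signature are hypothetical, and the patterns are passed in as plain strings such as "{0} ({1})" and "{0}, {1}" (the English forms of the hard-coded literals above). How the patterns are actually loaded from the data provider is out of scope here, and the placeholder substitution is deliberately naive.

```rust
/// Joins the qualifiers with `separator_pattern`, then wraps the result
/// around `ldn` with `locale_pattern`.
/// NOTE: naive substitution; it assumes names never contain "{0}" or "{1}".
fn format_with_patterns(
    ldn: &str,
    lqs: &[&str],
    locale_pattern: &str,
    separator_pattern: &str,
) -> String {
    let mut qualifiers = lqs.iter();
    let Some(first) = qualifiers.next() else {
        // No qualifiers: the display name is just the LDN.
        return ldn.to_string();
    };
    // Fold the separator pattern over the remaining qualifiers:
    // "{0}, {1}" applied to ("Latin", "Latin America") gives "Latin, Latin America".
    let joined = qualifiers.fold(first.to_string(), |acc, q| {
        separator_pattern.replacen("{0}", &acc, 1).replacen("{1}", q, 1)
    });
    locale_pattern.replacen("{0}", ldn, 1).replacen("{1}", &joined, 1)
}
```

A locale whose data uses different brackets or separators would then only need different pattern strings, not different code.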

robertbastian · 2023-08-16T09:27:16Z

experimental/displaynames/src/displaynames.rs

+            if let Some(scriptdn) = scriptdisplay {
+                lqs.push(scriptdn);
+            }


Use fallback, also for region below

Suggested change

-            if let Some(scriptdn) = scriptdisplay {
-                lqs.push(scriptdn);
-            }
+            lqs.push(scriptdn.unwrap_or(script.as_str()));

I started initially with adding the fallback, but couldn't get past the borrow checker without using to_string().
In this case,

{
    let mut lqs: Vec<&str> = vec![];
    if let Some(script) = locale.id.script {
        lqs.push(scriptdn.unwrap_or(script.as_str()));
    }
    return lqs;
}

Error:

lqs.push(scriptdisplay.unwrap_or(script.as_str()));
                                 --------------- `script` is borrowed here

return lqs;
       ^^^ returns a value referencing data owned by the current function

Which makes sense. But I couldn't figure out how to get around this without allocating a new string for script.

pushed a fix
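The thread does not show what the pushed fix looked like, so the following is only one possible way around the error, with hypothetical `Script` and `Locale` stand-ins. The error arises because `if let Some(script) = locale.id.script` copies the subtag into a local variable, and `as_str()` then borrows from that local. Borrowing the subtag through the locale reference instead ties the `&str` to the locale's lifetime, so it can be returned.

```rust
/// Hypothetical stand-in for a `Copy` script subtag.
#[derive(Clone, Copy)]
struct Script([u8; 4]);

impl Script {
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.0).unwrap_or("")
    }
}

/// Hypothetical stand-in for the locale; only the script matters here.
struct Locale {
    script: Option<Script>,
}

/// Returns the qualifier list, falling back to the raw subtag when no
/// display name is available. No String is allocated.
fn qualifiers<'a>(locale: &'a Locale, script_display: Option<&'a str>) -> Vec<&'a str> {
    let mut lqs = Vec::new();
    // `as_ref()` yields `Option<&'a Script>`: a borrow from `locale`,
    // not from a copy that dies at the end of this function.
    if let Some(script) = locale.script.as_ref() {
        lqs.push(script_display.unwrap_or(script.as_str()));
    }
    lqs
}
```

Writing `if let Some(script) = &locale.script` has the same effect as `as_ref()`; either way the returned slices outlive the function because they point into the caller's locale.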

robertbastian

Nice

robertbastian · 2023-08-22T08:40:56Z

experimental/displaynames/src/displaynames.rs

        }
+        // Throw an error if the LDN is none as it is not possible to have a locale string without the language.


nit: update comment

robertbastian · 2023-08-22T08:41:32Z

experimental/displaynames/src/displaynames.rs

-        _ => None,
+    /// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
+    pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
+        let LocaleDisplayNamesFormatter { locale_data, .. } = self;


nit: I personally would just use self.locale_data everywhere instead of doing this, but up to you

robertbastian · 2023-08-22T08:58:05Z

experimental/displaynames/src/displaynames.rs

+            if let Some(scriptdn) = scriptdisplay {
+                lqs.push(scriptdn);
+            }


pushed a fix

robertbastian · 2023-08-22T21:42:54Z

experimental/displaynames/src/displaynames.rs

+    pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
+        // NOTE: The subtag ordering of the canonical locale is `language_script_region + variants + extensions`.
+        // The logic to find the longest matching subtag is based on this ordering.
+        if let Some(script) = locale.id.script {


Don't you technically have to try all these combinations according to the algorithm?

LSRV
LSR
LSV
LRV
LS (x)
LR (x)
LV
SR
SV
RV
L (x)
S
R
V

There is no explicit mention in the algorithm of what is covered under LocaleDisplayName (LDN). Looking for the longest matching string in the locale and language data, I couldn't find any example in the data that has combinations other than those already covered in this implementation. However, we do technically need to support the other combinations, as the data may change in the future. I think a better way to implement support for this is to use the subtag iterator.
Sketching the algorithm:

(1) Try matching the entire locale first and return the LanguageIdentifier if it is found in the locale data.
(2) If not, remove the last subtag and look up the remaining locale string in the locale data; if it is found, construct the LanguageIdentifier and return it.
(3) Repeat step (2) until all the subtags are removed.
(4) Fall back if no match is found.

If what we have is more efficient than the general solution, I'm okay landing this and fixing the algorithm later if these types of cases come up. The way you've written this, I think old-code-new-data should just ignore the new entries, which is fine.

Chopping off subtags will only cover LSRV, LSR, LS, L.
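To illustrate the limitation, here is a small sketch of the candidates that the chop-the-last-subtag approach generates. The helper name is hypothetical and the subtags are plain strings rather than real locale types.

```rust
/// Candidates produced by repeatedly removing the last subtag,
/// from the full list down to the language alone.
fn truncation_candidates(subtags: &[&str]) -> Vec<String> {
    (1..=subtags.len())
        .rev()
        .map(|n| subtags[..n].join("-"))
        .collect()
}
```

For `["es", "Latn", "419"]` this yields "es-Latn-419", "es-Latn", and "es". A language-region key such as "es-419" is never tried, because it is not a prefix of the subtag list.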

sffc · 2023-08-22T21:40:40Z

experimental/displaynames/src/displaynames.rs

+                result.len() + " (".len() + lqs.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
+                    - ", ".len()
+                    + ")".len(),


sffc · 2023-08-22T21:46:58Z

experimental/displaynames/tests/tests.rs

+        TestCase {
+            input_1: &locale!("zh_Hans"),
+            expected: "Simplified Chinese",
+        },


Thought: one other case could be "es-Latn-419"

sffc · 2023-08-22T21:48:32Z

experimental/displaynames/src/displaynames.rs

+    /// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
+    pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {


Observation: this only ever returns one of the following:

language-script

language-region

language

But does the spec allow us to return language-script-region, etc? Consider this in #3913

sffc · 2023-08-23T00:33:34Z

experimental/displaynames/src/displaynames.rs

@@ -406,23 +406,24 @@ impl LocaleDisplayNamesFormatter {

    /// Returns the display name of a locale.
    /// This implementation is based on the algorithm described in
-    /// https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm
+    /// `<https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm>`


nit: you don't want tick marks around this; just the <> is sufficient.

sffc · 2023-08-23T00:34:21Z

experimental/displaynames/tests/tests.rs

@@ -47,6 +47,10 @@ fn test_concatenate() {
            input_1: &Locale::from_str("es-419-fonipa").unwrap(),
            expected: "Latin American Spanish (IPA Phonetics)",
        },
+        TestCase {
+            input_1: &Locale::from_str("es-Latn-419").unwrap(),
+            expected: "Spanish (Latin, Latin America)",


Shouldn't this be Latin American Spanish (Latin) ?

But "es-Latn" doesn't exist in locale data? Don't we consider just the sequential subtags to get the longest matching string for locale display name? For example the test for "es-Cyrl-MX" returns "Spanish (Cyrillic, Mexico)" and not the "Mexican Spanish (Cyrillic)".

hmm, I think you're right: the CLDR test data does the same thing

https://github.com/unicode-org/cldr/blob/main/common/testData/localeIdentifiers/localeDisplayName.txt#L15

this is different than my intuitive sense but if it's what's in the spec then that's what we should implement!

Locale data contains

"es-419": "Latin American Spanish",

so the result should be "Latin American Spanish (Latin)". The condition that stops this from happening is in line 463: LR is only tried if the locale has no script. I don't think that follows the spec, because es-419, not es, should be the longest matching subtag:

Match the L subtags against the type values in the elements. Pick the element with the most subtags matching.

I do think non-sequential subtags can be a valid match, because the spec also says

If there is more than one such element, pick the one that has subtypes matching earlier.

This kind of implies that there can be multiple matches that aren't prefixes of one another, so subtags can be skipped.

Okay, I think the inconsistency with the CLDR test data is that it is running in non-dialect mode. In dialect mode, it's pretty clear that "Latin American Spanish (Latin)" is the correct output. This can be seen by the fact that the CLDR test data for es-419 also says "Spanish (Latin America)" instead of "Latin American Spanish".

So, yes, please fix this.

(happy doing this in a follow-up in #3913)
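A sketch of one reading of the two spec sentences quoted above, for the follow-up: prefer the candidate with the most matching subtags, and among equals prefer the one that uses the earlier subtag. Everything here is an assumption for illustration: `Lsr` and `best_match` are hypothetical, variants are ignored, a `BTreeMap` replaces the real data map, and keys are built as Strings only to keep the example short.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a language identifier (variants omitted).
struct Lsr<'a> {
    language: &'a str,
    script: Option<&'a str>,
    region: Option<&'a str>,
}

/// Returns the display name of the best-matching key plus the subtags the
/// match did not consume (those are left over for the qualifier list).
fn best_match<'a>(
    names: &BTreeMap<&str, &'a str>,
    id: &Lsr<'a>,
) -> Option<(&'a str, Vec<&'a str>)> {
    // (use_script, use_region), ordered by most subtags first and then by
    // which candidate uses the earlier subtag: LSR, LS, LR, L.
    const CANDIDATES: [(bool, bool); 4] =
        [(true, true), (true, false), (false, true), (false, false)];

    for (use_script, use_region) in CANDIDATES {
        // Skip candidates that need a subtag the locale does not have.
        if (use_script && id.script.is_none()) || (use_region && id.region.is_none()) {
            continue;
        }
        let mut key = String::from(id.language);
        let mut unused = Vec::new();
        for (subtag, used) in [(id.script, use_script), (id.region, use_region)] {
            match subtag {
                Some(s) if used => {
                    key.push('-');
                    key.push_str(s);
                }
                Some(s) => unused.push(s),
                None => {}
            }
        }
        if let Some(name) = names.get(key.as_str()) {
            return Some((*name, unused));
        }
    }
    None
}
```

With data containing "es" and "es-419", the locale es-Latn-419 skips the missing "es-Latn-419" and "es-Latn" keys, matches "es-419", and leaves "Latn" unused, which corresponds to "Latin American Spanish (Latin)" once the leftover subtag is given its own display name.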

sffc · 2023-08-23T00:35:12Z

experimental/displaynames/src/displaynames.rs

+        #[allow(clippy::indexing_slicing)] // indexes in range
+        if !lqs.is_empty() {


Suggestion (optional): you can use split_first to avoid the need to index or unwrap. split_first returns an Option<(&T, &[T])>: the first element together with a slice of the remainder, or None if the slice is empty.
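A minimal sketch of how that could look, reusing the capacity arithmetic from the diff above. The function name is hypothetical, and the " (", ", ", and ")" literals are still hard-coded, as they are in the PR at this point.

```rust
/// Appends the qualifiers to the LDN as "LDN (q1, q2, ...)".
/// `split_first` handles the empty case, so no indexing (and no
/// `clippy::indexing_slicing` allow) is needed.
fn append_qualifiers(ldn: &str, lqs: &[&str]) -> String {
    match lqs.split_first() {
        None => ldn.to_string(),
        Some((first, rest)) => {
            // Exact output length, so the String is allocated once.
            let capacity = ldn.len()
                + " (".len()
                + first.len()
                + rest.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
                + ")".len();
            let mut out = String::with_capacity(capacity);
            out.push_str(ldn);
            out.push_str(" (");
            out.push_str(first);
            for qualifier in rest {
                out.push_str(", ");
                out.push_str(qualifier);
            }
            out.push(')');
            out
        }
    }
}
```

Handling the first element separately also removes the `- ", ".len()` correction from the capacity computation, since the separator is only counted for the remaining elements.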

snktd added 2 commits June 27, 2023 12:04

First draft of locale displayname algorithm

5c33872

Minor change in comment

17f19d9

snktd requested a review from robertbastian June 27, 2023 19:09

snktd commented Jun 27, 2023

robertbastian reviewed Jul 3, 2023

Avoid constructing a new cannonical locale String.

0141ac6

sffc self-requested a review July 12, 2023 11:28

addressing review comments

4a85149

snktd requested a review from robertbastian August 12, 2023 01:18

robertbastian reviewed Aug 14, 2023

Addressing comments - round 2

19ea386

snktd requested a review from robertbastian August 15, 2023 18:19

sffc reviewed Aug 15, 2023

robertbastian reviewed Aug 16, 2023

snktd added 4 commits August 21, 2023 16:12

Merge remote-tracking branch 'upstream/main' into dialect

d0d867e

Moving methods to LocaleDisplayNamesFormatter

4a00a05

Removing the unused enum

6989da0

Adding a few more test cases

6855ceb

snktd requested a review from robertbastian August 22, 2023 03:01

fallback

0f7c4ac

robertbastian reviewed Aug 22, 2023

Address comments

94b311a

snktd requested a review from sffc August 22, 2023 16:44

Running ci-job-tidyi

2bc861b

robertbastian reviewed Aug 22, 2023

sffc previously approved these changes Aug 22, 2023

Making clippy happy and adding one more test case.

fed1622

snktd dismissed sffc’s stale review via fed1622 August 22, 2023 23:05

sffc reviewed Aug 23, 2023

Addressing minor nit comment

a1d9fdc

sffc approved these changes Aug 23, 2023

robertbastian marked this pull request as ready for review August 23, 2023 06:43

robertbastian requested a review from a team as a code owner August 23, 2023 06:43

sffc merged commit 4750225 into unicode-org:main Aug 23, 2023
26 checks passed

