First draft for implementing Local Displayname Algorithm #3587
My general feedback is that you should be working with Locale objects instead of strings. Locales are always canonicalized, and encapsulate invariants that you need (i.e. that the language tag is present). It also allows you to reduce intermediate String allocations. See https://docs.rs/icu_locid/latest/icu_locid/zerovec/index.html for how to store Locales in ZeroVecs.
You might have to make the method accept &mut Locale so that you can borrow from the locale and at the same time modify it.
Eventually this should probably return an impl Writeable instead of a Cow<str>, which owns the &mut Locale.
I understand that allocating is in general a bad idea in ICU4X. However, I am not sure how I can use the Locale type as is in this algorithm. The main confusion stems from the fact that the longest_matching_prefix can be composed of multiple subtags (language-region, language-script), and the other values, such as the LDN (Locale Display Name) and LQS (longest qualifying substring), are derived from it. For example, locale!("de-CH-EMODENG") is equivalent to There can be multiple such combinations here. For instance, "zh-Hans" is a composition of the language and script subtags, but it is part of the locale data, which translates it to "Simplified Mandarin Chinese". So I am not sure how I can create the longest_matching_prefix and use it to derive the LDN without creating a new string.
Note that the longest_matching_prefix is used to derive the LQS as well, and it is still important here because we should avoid using any subtags that were already used to compute the LDN. In the example above, the translation for the "region" subtag should not be part of the LQS. I am not sure how I can achieve this without creating new interim strings.
You can probably model this on the stack, if you know what the possible cases are. If you have something like
enum LongestMatchingSubtag {
    LangRegion(Language, Region),
    LangScript(Language, Script),
    ...
}
you won't need to heap-allocate. Not sure if this is enough to solve your problem. Another option could be to expose a subtags iterator on
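A minimal sketch of the stack-only idea above, under stated assumptions: `Language`, `Script`, and `Region` are hypothetical stand-ins for the icu_locid types, `has_name` is a made-up helper standing in for the data lookup, and the order in which the cases are tried (language-script, then language-region, then language) is an assumption of this sketch rather than what the PR does.

```rust
// Hypothetical stand-ins for the icu_locid subtag types (small `Copy` values).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Language(pub &'static str);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Script(pub &'static str);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Region(pub &'static str);

/// The matched subtags live in the enum itself, so nothing is heap-allocated.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LongestMatchingSubtag {
    LangRegion(Language, Region),
    LangScript(Language, Script),
    Lang(Language),
}

/// Hypothetical data lookup: does a key made of exactly these subtags exist?
/// The key is compared subtag by subtag, so no joined `String` is built.
fn has_name(keys: &[&str], parts: &[&str]) -> bool {
    keys.iter()
        .any(|key| key.split('-').eq(parts.iter().copied()))
}

pub fn find_longest_matching_subtag(
    keys: &[&str],
    language: Language,
    script: Option<Script>,
    region: Option<Region>,
) -> LongestMatchingSubtag {
    if let Some(script) = script {
        if has_name(keys, &[language.0, script.0]) {
            return LongestMatchingSubtag::LangScript(language, script);
        }
    }
    if let Some(region) = region {
        if has_name(keys, &[language.0, region.0]) {
            return LongestMatchingSubtag::LangRegion(language, region);
        }
    }
    LongestMatchingSubtag::Lang(language)
}
```

The caller can then match on the variant to know which subtags were consumed by the display name and which ones still need to go into the qualifier list.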
@robertbastian @sffc updated the PR to follow Robert's suggestion. Basically
Nice, this is much clearer and avoids a lot of allocations
let lang_script_identifier: LanguageIdentifier = (locale.id.language, script).into();
if locale_data
    .get()
    .names
Do you not need to check short_names, long_names, and menu_names?
The alt variants are subsets of the actual names. I haven't found an example for which the alt variant exists and the version with the variant is missing.
But does the spec allow it?
I haven't found anything related in the spec.
let longest_matching_subtag = find_longest_matching_subtag(&locale, &self);

// TODO: This binary search needs to return the longest matching found prefix
// instead of just perfect matches
if let Some(displayname) = match self.options.style {
    Some(Style::Short) => self
        .locale_data
        .get()
        .short_names
        .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
    Some(Style::Long) => self
        .locale_data
        .get()
        .long_names
        .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
    Some(Style::Menu) => self
        .locale_data
        .get()
        .menu_names
        .get_by(|bytes| locale.strict_cmp(bytes).reverse()),
    _ => None,
// Step - 1: Construct a locale display name string (LDN).
// Find the displayname for the longest_matching_subtag which was derived above.
let ldn = get_locale_display_name(&locale, &longest_matching_subtag, &self);
It might be worth combining these two calls so you don't have to do the map lookup twice. You can add a &'a str field to the LongestMatchingSubtag enum.
I tried this earlier, but it made the code more cluttered. Moving the other method to the enum impl makes this code much cleaner. Map lookup should be O(1) in this case; unless I am missing some other overhead that comes with the map lookup, I would prefer keeping these as separate calls.
Map lookup is O(log n), and the names map is probably the biggest, so ideally we wouldn't do that lookup twice.
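To illustrate the single-lookup suggestion, here is a rough sketch in which the match result carries the display name found by the same lookup. Everything in it is an assumption for illustration: `MatchKind`, `Match`, and `find_longest_match` are made-up names, a `BTreeMap` stands in for the real data map, and the caller is assumed to supply candidate keys ordered from longest to shortest.

```rust
use std::collections::BTreeMap;

/// Which combination of subtags matched (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MatchKind {
    LangScript,
    LangRegion,
    Lang,
}

/// The match together with the display name found by the same lookup,
/// so the caller never has to query the map a second time.
pub struct Match<'a> {
    pub kind: MatchKind,
    pub display_name: &'a str,
}

/// `candidates` is expected to be ordered from longest to shortest key.
pub fn find_longest_match<'a>(
    names: &BTreeMap<&str, &'a str>,
    candidates: &[(MatchKind, &str)],
) -> Option<Match<'a>> {
    for &(kind, key) in candidates {
        // One lookup per candidate; the value is kept instead of discarded.
        if let Some(&display_name) = names.get(key) {
            return Some(Match { kind, display_name });
        }
    }
    None
}
```

With this shape, the step that builds the LDN can read `display_name` directly from the result instead of repeating the search.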
result.len() + " (".len() + lqs.iter().map(|s| ", ".len() + s.len()).sum::<usize>()
    - ", ".len()
    + ")".len(),
These strings should come from locale data. Could be a follow-up but we definitely need this before this goes stable.
Acknowledged
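For context, the capacity expression above can be sketched as a standalone function. This is only an illustration: `concatenate` and the three constants are hypothetical names, and the pattern pieces are hard-coded here, which is exactly what the review asks to replace with locale data later.

```rust
// Hard-coded pattern pieces (assumption of this sketch; the review asks for
// these to come from locale data eventually).
const OPEN: &str = " (";
const SEPARATOR: &str = ", ";
const CLOSE: &str = ")";

/// Builds "LDN (q1, q2, ...)" with a single, exactly sized allocation.
pub fn concatenate(ldn: &str, lqs: &[&str]) -> String {
    if lqs.is_empty() {
        return ldn.to_string();
    }
    // Each qualifier is counted with a separator, then one separator is
    // subtracted because there is none before the first qualifier.
    let capacity = ldn.len()
        + OPEN.len()
        + lqs.iter().map(|s| SEPARATOR.len() + s.len()).sum::<usize>()
        - SEPARATOR.len()
        + CLOSE.len();
    let mut result = String::with_capacity(capacity);
    result.push_str(ldn);
    result.push_str(OPEN);
    for (i, s) in lqs.iter().enumerate() {
        if i > 0 {
            result.push_str(SEPARATOR);
        }
        result.push_str(s);
    }
    result.push_str(CLOSE);
    debug_assert_eq!(result.len(), capacity);
    result
}
```

The subtraction is safe only because the empty case returns early; with an empty list the sum would be zero and the expression would underflow.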
if let Some(scriptdn) = scriptdisplay {
    lqs.push(scriptdn);
}
Use fallback, also for region below
- if let Some(scriptdn) = scriptdisplay {
-     lqs.push(scriptdn);
- }
+ lqs.push(scriptdn.unwrap_or(script.as_str()));
I initially started by adding the fallback, but couldn't get past the borrow checker without using to_string().
In this case,
{
    let mut lqs: Vec<&str> = vec![];
    if let Some(script) = locale.id.script {
        lqs.push(scriptdn.unwrap_or(script.as_str()));
    }
    return lqs;
}
Error:
"lqs.push(scriptdisplay.unwrap_or(script.as_str()));
                                  --------------- script is borrowed here
 return lqs;
        ^^^ returns a value referencing data owned by the current function"
Which makes sense. But I couldn't figure out how to get around this without allocating a new string for script.
pushed a fix
Nice
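One common way around the borrow error discussed in this thread, not necessarily the fix that was pushed: match on a reference to the script so that `as_str()` borrows from the locale rather than from a local copy. The `Script` and `Locale` types below are hypothetical stand-ins (a `Copy` type that stores its bytes inline), and `qualifiers` is a made-up function name.

```rust
/// Hypothetical stand-in for a script subtag: a small `Copy` type that stores
/// its bytes inline, so `as_str` borrows from the value itself.
#[derive(Clone, Copy)]
pub struct Script(pub [u8; 4]);

impl Script {
    pub fn as_str(&self) -> &str {
        // The bytes are ASCII in this sketch; fall back to "" otherwise.
        std::str::from_utf8(&self.0).unwrap_or("")
    }
}

/// Hypothetical, heavily reduced locale type.
pub struct Locale {
    pub script: Option<Script>,
}

pub fn qualifiers<'a>(locale: &'a Locale, script_display: Option<&'a str>) -> Vec<&'a str> {
    let mut lqs = Vec::new();
    // `if let Some(script) = locale.script` would copy the `Script` into a
    // local variable, and `script.as_str()` would then borrow from that local.
    // Matching on a reference keeps the borrow tied to `locale` instead.
    if let Some(script) = &locale.script {
        lqs.push(script_display.unwrap_or(script.as_str()));
    }
    lqs
}
```

Because the returned slices now share the lifetime of the `locale` argument, the vector can be returned without calling `to_string()`.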
}
// Throw an error if the LDN is none as it is not possible to have a locale string without the language.
nit: update comment
Done
    _ => None,
/// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
    let LocaleDisplayNamesFormatter { locale_data, .. } = self;
nit: I personally would just use self.locale_data everywhere instead of doing this, but up to you
Done
if let Some(scriptdn) = scriptdisplay {
    lqs.push(scriptdn);
}
pushed a fix
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
    // NOTE: The subtag ordering of the canonical locale is `language_script_region + variants + extensions`.
    // The logic to find the longest matching subtag is based on this ordering.
    if let Some(script) = locale.id.script {
Don't you technically have to try all these combinations according to the algorithm?
- LSRV
- LSR
- LSV
- LRV
- LS (x)
- LR (x)
- LV
- SR
- SV
- RV
- L (x)
- S
- R
- V
There is no explicit mention in the algorithm of what is covered under LocaleDisplayName (LDN). We are looking for the longest matching string in the locale and language data, and I couldn't find any example in the data with combinations other than those already covered in this implementation. However, we technically do need to support the other combinations, as the data may change in the future. I think a better way to implement support for this is to use the subtag iterator.
Sketching the algorithm:
- Try matching the entire locale first and return the LanguageIdentifier if it is found in the locale data.
- If not, remove the last subtag and look up the remaining locale string in the locale data; if it is found, construct the LanguageIdentifier and return it.
- Repeat step 2 until all the subtags are removed.
- Fall back if no match is found.
If what we have is more efficient than the general solution, I'm okay landing this and fixing the algorithm later if these types of cases come up. The way you've written this, I think old-code-new-data should just ignore the new entries, which is fine.
Chopping off subtags will only cover LSRV, LSR, LS, L.
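The subtag-chopping approach sketched in this thread could look roughly like the following. It is only an illustration under stated assumptions: the locale is treated as a plain hyphen-separated string, `keys` is a hypothetical flat list of data keys, and `longest_matching_prefix` is a made-up name. It also makes the limitation visible: only prefixes of the locale are ever tried.

```rust
/// Returns the longest prefix of `locale`, cut at subtag boundaries, that is
/// present in `keys`. Candidates are slices of `locale`, so nothing is allocated.
pub fn longest_matching_prefix<'a>(locale: &'a str, keys: &[&str]) -> Option<&'a str> {
    let mut candidate = locale;
    loop {
        if keys.contains(&candidate) {
            return Some(candidate);
        }
        // Drop the last subtag; give up once no separator is left.
        match candidate.rfind('-') {
            Some(idx) => candidate = &candidate[..idx],
            None => return None,
        }
    }
}
```

For "es-Latn-419" the candidates are "es-Latn-419", "es-Latn", and "es"; a key such as "es-419" is never considered because it skips the script subtag.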
TestCase {
    input_1: &locale!("zh_Hans"),
    expected: "Simplified Chinese",
},
Thought: one other case could be "es-Latn-419"
Done
/// For a given locale and the data, find the longest prefix of the string that exists as a key in the CLDR locale data.
pub fn find_longest_matching_subtag(&self, locale: &Locale) -> LanguageIdentifier {
Observation: this only ever returns one of the following:
- language-script
- language-region
- language
But does the spec allow us to return language-script-region, etc? Consider this in #3913
@@ -406,23 +406,24 @@ impl LocaleDisplayNamesFormatter {

     /// Returns the display name of a locale.
     /// This implementation is based on the algorithm described in
-    /// https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm
+    /// `<https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm>`
nit: you don't want tick marks around this; just the <> is sufficient.
Done
@@ -47,6 +47,10 @@ fn test_concatenate() {
             input_1: &Locale::from_str("es-419-fonipa").unwrap(),
             expected: "Latin American Spanish (IPA Phonetics)",
         },
+        TestCase {
+            input_1: &Locale::from_str("es-Latn-419").unwrap(),
+            expected: "Spanish (Latin, Latin America)",
Shouldn't this be "Latin American Spanish (Latin)"?
But "es-Latn" doesn't exist in the locale data. Don't we consider just the sequential subtags to get the longest matching string for the locale display name? For example, the test for "es-Cyrl-MX" returns "Spanish (Cyrillic, Mexico)" and not "Mexican Spanish (Cyrillic)".
hmm, I think you're right: the CLDR test data does the same thing
this is different than my intuitive sense but if it's what's in the spec then that's what we should implement!
Locale data contains
"es-419": "Latin American Spanish",
so the result should be "Latin American Spanish (Latin)". The condition that stops this from happening is in line 463: LR is only tried if the locale has no script. I don't think that follows the spec, because es-419, not es, should be the longest matching subtag. The spec says:
"Match the L subtags against the type values in the elements. Pick the element with the most subtags matching."
I do think non-sequential subtags can be a valid match, because the spec also says:
"If there is more than one such element, pick the one that has subtypes matching earlier."
This kind of implies that there can be multiple matches that aren't prefixes of one another, so subtags can be skipped.
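A rough sketch of one possible reading of the two quoted rules, offered as an illustration and not as the spec's or the PR's algorithm. Assumptions: locales and keys are plain hyphen-separated canonical strings, a key matches only if its language equals the locale's language and all its other subtags occur in the locale, and "matching earlier" is approximated by the smallest sum of subtag positions. `best_match` is a hypothetical name.

```rust
/// Picks the key with the most matching subtags; ties go to the key whose
/// subtags appear earlier in `locale` (approximated by the position sum).
pub fn best_match<'k>(locale: &str, keys: &[&'k str]) -> Option<&'k str> {
    let language = locale.split('-').next()?;
    // (number of matching subtags, sum of their positions in `locale`, key)
    let mut best: Option<(usize, usize, &'k str)> = None;
    'keys: for &key in keys {
        let mut key_subtags = key.split('-');
        // The language subtag must match; it sits at position 0.
        if key_subtags.next() != Some(language) {
            continue;
        }
        let (mut count, mut position_sum) = (1, 0);
        for subtag in key_subtags {
            // Exact comparison: canonical casing keeps "de" and "DE" apart.
            match locale.split('-').skip(1).position(|s| s == subtag) {
                Some(pos) => {
                    count += 1;
                    position_sum += pos + 1;
                }
                // A key with a subtag the locale lacks is not a match.
                None => continue 'keys,
            }
        }
        let is_better = match best {
            None => true,
            Some((c, p, _)) => count > c || (count == c && position_sum < p),
        };
        if is_better {
            best = Some((count, position_sum, key));
        }
    }
    best.map(|(_, _, key)| key)
}
```

Under this reading, "es-Latn-419" with keys "es" and "es-419" selects "es-419", which leaves only the script for the qualifier list.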
Okay, I think the inconsistency with the CLDR test data is that it is running in non-dialect mode. In dialect mode, it's pretty clear that "Latin American Spanish (Latin)" is the correct output. This can be seen by the fact that the CLDR test data for es-419 also says "Spanish (Latin America)" instead of "Latin American Spanish".
So, yes, please fix this.
(happy doing this in a follow-up in #3913)
#[allow(clippy::indexing_slicing)] // indexes in range
if !lqs.is_empty() {
Suggestion (optional): you can use split_first to avoid the need to unwrap. split_first returns an Option<(&T, &[T])>: the first element together with a slice of the remainder.
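A small sketch of the split_first suggestion, assuming the qualifiers are held in a `&[&str]`; `join_qualifiers` and the hard-coded ", " separator are illustrative choices, not code from the PR.

```rust
/// Joins the qualifiers with ", " without indexing into the slice.
/// Returns `None` when there are no qualifiers.
pub fn join_qualifiers(lqs: &[&str]) -> Option<String> {
    // `split_first` yields `None` for an empty slice, so neither an
    // `is_empty` check nor `lqs[0]` is needed.
    let (first, rest) = lqs.split_first()?;
    let mut out = String::from(*first);
    for s in rest {
        out.push_str(", ");
        out.push_str(s);
    }
    Some(out)
}
```

Since the empty case is handled by the `Option`, the `clippy::indexing_slicing` allowance would no longer be needed for this part.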
This is based on https://www.unicode.org/reports/tr35/tr35-general.html#locale_display_name_algorithm.