Add NaiveFallback #2686

Open
zbraniecki opened this issue Sep 29, 2022 · 18 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters needs-approval One or more stakeholders need to approve proposal

Comments

@zbraniecki
Member

In relation to #2683.

The current fallback mechanism is quite costly in binary size and data payload to support. In order to enable customers to use ICU4X without it, we by default do not deduplicate data that would rely on runtime fallbacking.

I believe we can (in a true Rust fashion!) resolve this dichotomy by introducing The Third Way between no fallback and full fallback. I dubbed it naive fallback.

Naive fallback works only one way - minimizing tags, and contains a very short list of exceptions.

The algorithm works like this:

  1. Take the requested locale and strip everything except the language, script, and region subtags.
  2. Check if there is a match.
  3. If not, cut out the region.
  4. Check if there is a match.
  5. If not, check if the given language-region pair is in the exception list.
    5.1 If it is, use the language-script pair from that exception.
    5.2 If not, remove script.
  6. Check if there is a match.
  7. If not, use und.
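
A minimal Rust sketch of the steps above, following their order literally. It is only an illustration and not ICU4X code: the `Lsr` type, the string-based `available` set, and the `exceptions` map (keyed by "language-region", with the script as value) are all hypothetical, and the parser assumes well-formed, canonically cased tags.

```rust
use std::collections::{BTreeSet, HashMap};

/// Language, optional script, optional region: the only subtags the naive
/// algorithm keeps (step 1 drops variants and extensions).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Lsr {
    language: String,
    script: Option<String>,
    region: Option<String>,
}

impl Lsr {
    /// Simplified parser for well-formed tags such as "sr-Latn-RS".
    fn parse(locale: &str) -> Lsr {
        let mut parts = locale.split('-');
        let language = parts.next().unwrap_or("und").to_string();
        let (mut script, mut region) = (None, None);
        for part in parts {
            let alpha = part.chars().all(|c| c.is_ascii_alphabetic());
            let digits = part.chars().all(|c| c.is_ascii_digit());
            if script.is_none() && region.is_none() && part.len() == 4 && alpha {
                script = Some(part.to_string());
            } else if region.is_none()
                && ((part.len() == 2 && alpha) || (part.len() == 3 && digits))
            {
                region = Some(part.to_string());
            } else {
                break; // variants and extensions are ignored
            }
        }
        Lsr { language, script, region }
    }

    /// Joins the subtags that are present back into a locale string.
    fn id(&self) -> String {
        let mut out = self.language.clone();
        if let Some(script) = &self.script {
            out.push('-');
            out.push_str(script);
        }
        if let Some(region) = &self.region {
            out.push('-');
            out.push_str(region);
        }
        out
    }
}

/// Returns the locale whose data should be used for `requested`.
fn naive_fallback(
    requested: &str,
    available: &BTreeSet<&str>,
    exceptions: &HashMap<&str, &str>, // hypothetical: "zh-TW" -> "Hant"
) -> String {
    // Step 1: keep only language, script, and region.
    let mut lsr = Lsr::parse(requested);
    // Step 2: exact match?
    if available.contains(lsr.id().as_str()) {
        return lsr.id();
    }
    // Step 3: cut out the region (remember it for the exception lookup).
    let region = lsr.region.take();
    // Step 4: match without the region?
    if available.contains(lsr.id().as_str()) {
        return lsr.id();
    }
    // Step 5: use the script from the exception list (5.1) or remove it (5.2).
    let key = region.map(|r| format!("{}-{}", lsr.language, r));
    lsr.script = key
        .and_then(|k| exceptions.get(k.as_str()))
        .map(|s| s.to_string());
    // Step 6: match after the script adjustment?
    if available.contains(lsr.id().as_str()) {
        return lsr.id();
    }
    // Step 7: give up and use the root locale.
    "und".to_string()
}
```

With `available = {en, en-GB, sr-Latn, zh-Hant, und}` and the exceptions `zh-TW -> Hant` and `sr-ME -> Latn`, this sketch resolves `en-US` to `en`, `zh-TW` to `zh-Hant`, and `fr-FR` to `und`.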

This will cater to exceptions in sr and zh, but not much more. For everything else it will just cut off subtags from right to left and eventually fall back to und.

The algorithm is super small, the data is super small (maybe even baked in by default?), and if used in both datagen and runtime it allows us to cut out a huge portion of locales, which in turn reduces the number of keys in the key table in the data payload.
This reduction has two benefits:

  1. It makes the data payload smaller, which is especially noticeable in smaller payloads (decimal/symbols is predominantly keys).
  2. It makes the runtime locale selection faster, which reduces the cost of the constructor.
@zbraniecki
Member Author

I created a small script that takes all locale data for a given key, hashes the content, and deduplicates it to the smallest number of data files needed to cover all keys.

In case of decimal/symbols@1 the ideal deduplication would reduce the number of keys from 683 to 49.
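
The script itself is not shown in this thread. As a rough sketch of the idea, assuming each locale's data is available as raw bytes, locales can be grouped by identical content; the number of groups is the "ideal" count. The function name and input shape are hypothetical. A `HashMap` keyed by the payload bytes hashes the content internally and still compares bytes on a collision.

```rust
use std::collections::{BTreeMap, HashMap};

/// Groups locales whose payload bytes are identical.
/// The number of returned groups is the ideal number of data files.
fn group_by_content<'a>(data: &BTreeMap<&'a str, Vec<u8>>) -> Vec<Vec<&'a str>> {
    let mut groups: HashMap<&[u8], Vec<&'a str>> = HashMap::new();
    for (locale, payload) in data {
        // Locales arrive in sorted order, so each group stays sorted.
        groups.entry(payload.as_slice()).or_default().push(*locale);
    }
    let mut out: Vec<Vec<&'a str>> = groups.into_values().collect();
    out.sort(); // deterministic order of groups
    out
}
```

For six locales that share three distinct payloads, `group_by_content(&data).len()` is 3, which corresponds to the "683 to 49" style of reduction quoted above.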

@zbraniecki
Member Author

  • datetime/gregory/datesymbols@1 - 574 -> 257
  • datetime/gregory/datelengths@1 - 574 -> 187
  • datetime/week_data@1 - 155 -> 6
  • datetime/timesymbols@1 - 574 -> 205
  • plurals/cardinal@1 - 215 -> 39
  • plurals/ordinal@1 - 100 -> 24
  • collator/data@1 - 132 -> 122
  • collator/meta@1 - 142 -> 9
  • time_zone/formats@1 - 574 -> 141
  • time_zone/generic_long@1 - 574 -> 162
  • list/and@1 - 574 -> 116

I assume that this deduplication is already happening for values.
I also want to say that my script provides ideal optimization, which the naive algorithm is likely not going to achieve.
I do believe, though, that we can develop it so that it provides full conformance (falls back as well as a full fallback would) while getting very close to the ideal key set.

@sffc
Member

sffc commented Sep 29, 2022

#834

@sffc
Member

sffc commented Sep 29, 2022

tl;dr, locale fallback is very, very complicated. I have a plan for how to solve the size and speed issues. It's something we can improve incrementally.

@zbraniecki
Member Author

I don't think this is the same as #834. That one is about including locales based on allowlist, this is about deduplicating.

@sffc
Member

sffc commented Sep 29, 2022

It's the same in the sense that these problems can be solved by making DataExporter smarter. We can either pre-populate locales or strip them if they have duplicates.

Note that "naive fallback" is already supported; see the no_data constructor on FallbackProvider.

@sffc
Member

sffc commented Sep 30, 2022

Back on my computer so I can write a more complete response…

> The current fallback mechanism is quite costly in binary size and data payload to support.

It has some cost in binary size, but not too much. The data size cost is about 10 kB.

> In order to enable customers to use ICU4X without it, we by default do not deduplicate data that would rely on runtime fallbacking.

We do deduplicate data; we store essentially just a pointer for these extra locales. It does mean that lookup is slower if we use binary search. This can be mitigated by shipping locales as separate language packs or by adopting ZeroHashMap for locale lookup (#2579).
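
To make the trade-off concrete, here is a simplified sketch of a table where payloads are stored once and every locale holds only an index ("pointer") into them. This is not the actual ICU4X blob layout; `LocaleTable` and its fields are hypothetical. It shows why values are deduplicated while the sorted locale list, which the binary search walks, still grows with every extra locale.

```rust
/// Payloads are stored once; each locale only carries an index into them.
struct LocaleTable {
    /// Sorted by locale string; the second field indexes into `payloads`.
    locales: Vec<(String, usize)>,
    payloads: Vec<String>,
}

impl LocaleTable {
    fn build(entries: &[(&str, &str)]) -> LocaleTable {
        let mut locales = Vec::new();
        let mut payloads: Vec<String> = Vec::new();
        for &(locale, payload) in entries {
            // Reuse an existing payload if the content is identical.
            let index = match payloads.iter().position(|p| p.as_str() == payload) {
                Some(i) => i,
                None => {
                    payloads.push(payload.to_string());
                    payloads.len() - 1
                }
            };
            locales.push((locale.to_string(), index));
        }
        locales.sort();
        LocaleTable { locales, payloads }
    }

    /// Exact-match lookup; cost grows with the number of locale entries.
    fn get(&self, locale: &str) -> Option<&str> {
        let slot = self
            .locales
            .binary_search_by(|entry| entry.0.as_str().cmp(locale))
            .ok()?;
        Some(self.payloads[self.locales[slot].1].as_str())
    }
}
```

Building the table from four locales with two distinct payloads yields four locale entries but only two stored payloads.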

> I believe we can (in a true Rust fashion!) resolve this dichotomy by introducing The Third Way between no fallback and full fallback. I dubbed it naive fallback.

> Naive fallback works only one way - minimizing tags, and contains a very short list of exceptions. …

This is exactly what our current algorithm does when you run it in no-data mode. You add data for it to handle scripts and parent locales correctly.

> 5.2 If not, remove script.

We don't remove script without language because that's a well documented footgun. The only exception is for collation data.

> This will cater to exceptions in sr and zh, but not much more. For everything else it will just cut off subtags from right to left and eventually fall back to und. …

Again, this is what we already do. sr and zh are handled in a data-driven pre-processing step. It uses a subset of likely subtags data, and I want to make it such that if your app is already carrying likely subtags, we don't even ship that fallback data at all.

I guess what I'm missing is, what are you suggesting that your "naive fallback" is improving on? Your wish list is basically what I've already implemented. You can see full details in flexible vertical fallback (which I'm pretty sure you've reviewed before). Note that the algorithm itself is deceptively simple. It boils down to naive subtag removal, but with additional support for extension keywords.

@sffc
Member

sffc commented Sep 30, 2022

Another point on this topic. I think there have traditionally been two ways of looking at locales in software, and ICU4X adds a third:

  1. Ship all available locales all the time
  2. Ship a minimal set of locales based on product needs
  3. (New with ICU4X) Ship a core set of locales with more loaded on demand

For case (1), these clients are already eating a large cost by shipping all locales, so shipping a little extra code and data to help them with fallback is a small price percentage-wise. I hear you about the performance hit, but there are ways to solve this without changing the fallback algorithm.

For case (2), fallbacking should be pre-computed in datagen based on the desired set of locales such that we don't need to ship any fallbacking code or data, not even a "naive" version.
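
As an illustration of what pre-computing could look like for case (2), the sketch below resolves each desired locale at build time and stores the result under the desired locale itself, so the runtime only needs exact-match lookup. It is a simplification under stated assumptions: `precompute` and `parent` are hypothetical names, and `parent` uses plain right-to-left truncation rather than the real ICU4X fallback rules.

```rust
use std::collections::BTreeMap;

/// Simplified parent rule: drop the last subtag, then fall back to "und".
fn parent(locale: &str) -> Option<&str> {
    if locale == "und" {
        return None;
    }
    Some(match locale.rfind('-') {
        Some(i) => &locale[..i],
        None => "und",
    })
}

/// Resolves every desired locale against `source` at build time.
fn precompute(source: &BTreeMap<&str, &str>, desired: &[&str]) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    for &locale in desired {
        let mut current = Some(locale);
        while let Some(candidate) = current {
            if let Some(payload) = source.get(candidate) {
                // Store the resolved payload under the requested locale.
                out.insert(locale.to_string(), payload.to_string());
                break;
            }
            current = parent(candidate);
        }
    }
    out
}
```

With source data for `und`, `en`, and `en-GB`, asking for `en-US`, `en-GB`, and `fr-FR` produces exactly three entries, each already carrying the payload that fallback would have selected.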

For case (3), we would likely employ a hybrid approach where desired locales are pre-computed in their respective language packs, but clients can opt in to the runtime locale fallback in order to get the full behavior of case (1).

@zbraniecki
Member Author

> just a pointer for these extra locales.

My claim is that storing 500 locale strings when only 100 are needed takes up a substantial portion of the postcard payload, and it adds cost for the component constructor, which has to select from them.

My hope is that we can do better. If you think that the fallback is not costly, I'd love to see the size of the postcard files when we use the full fallback to deduplicate and reduce the number of keys to minimum. Can we do that now? If so, how?

@zbraniecki
Member Author

> I guess what I'm missing is, what are you suggesting that your "naive fallback" is improving on?

It's improving on no key deduplication which is the default mode of datagen.

In case of decimal/symbols@1 we are currently carrying around 684 keys, while ideally we would only carry 49. We can't do that because we need to annotate fallback, but among those 684 keys, we have only 87 language keys. I assume that in almost all cases more specific locales can fall back to the language itself, and many can fall back to und, so with a fallback we should be able to carry only 60-70 keys rather than 684.

That's a substantial payload decrease, memory cost reduction and I suspect constructor perf win.

@sffc
Member

sffc commented Sep 30, 2022

Yeah, the empty pointers are a significant source of data size and especially lookup speed issues in certain keys including number format and date format. This is a known issue in datagen. The empty pointers should be calculated and removed at datagen time, and a subset may be added back if compile-time fallback resolution is requested.

> It's improving on no key deduplication which is the default mode of datagen. … In case of decimal/symbols@1 we are currently carrying around 684 keys, while ideally we would only carry 49. We can't do that because we need to annotate fallback, but among those 684 keys, we have only 87 language keys.

Nit: Let's be careful on language here. decimal/symbols@1 is a key, there are currently 684 locales for that key, but we should be able to prune that to 87 locales if we prune locales based on fallback at datagen time.
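
A sketch of what pruning locales based on fallback at datagen time could look like: a locale is dropped when its payload equals what it would inherit through fallback anyway. This is an illustration only. `prune`, `resolve`, and `parent` are hypothetical names, and `parent` uses plain right-to-left truncation, whereas the real fallbacker also consults likely-subtags and parent-locale data.

```rust
use std::collections::BTreeMap;

/// Simplified parent rule: drop the last subtag, then fall back to "und".
fn parent(locale: &str) -> Option<&str> {
    if locale == "und" {
        return None;
    }
    Some(match locale.rfind('-') {
        Some(i) => &locale[..i],
        None => "und",
    })
}

/// Walks the fallback chain until a locale with data is found.
fn resolve<'a>(data: &'a BTreeMap<String, String>, locale: &str) -> Option<&'a str> {
    let mut current = Some(locale);
    while let Some(candidate) = current {
        if let Some(payload) = data.get(candidate) {
            return Some(payload.as_str());
        }
        current = parent(candidate);
    }
    None
}

/// Keeps only locales whose payload differs from the one they would inherit.
fn prune(data: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    data.iter()
        .filter(|(locale, payload)| {
            // What the locale would get if its own entry did not exist.
            let inherited = parent(locale).and_then(|p| resolve(data, p));
            inherited != Some(payload.as_str())
        })
        .map(|(locale, payload)| (locale.clone(), payload.clone()))
        .collect()
}
```

The property worth checking is that every original locale still resolves to the same payload against the pruned map, which holds here because a dropped locale always equals its nearest ancestor's payload.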

@sffc
Member

sffc commented Sep 30, 2022

I guess what I'm trying to say is, I'm thinking from the angle that the bug in datagen, the one where empty locale pointers are unnecessarily generated, which I've known has existed since basically the inception of datagen, will be fixed. Fixing it requires some care and design, but it is totally fixable in a dot release. My above claims about fallback being a fairly low cost relative to the cost of adding all locales are based on the world in which the bug in datagen is fixed.

@zbraniecki
Member Author

Gotcha. I think we agree.

I have a slight concern about what you're proposing as naive fallback since it seems to ignore the edge cases of multi-script languages. There are very few of them, and I'd be comfortable baking this data in since I expect it to be ~9 keys. But it seems like you're saying that your naive fallback doesn't recognize those 9 keys and produces wrong results for cases like sr-SR or zh-TW. I think that in my mind the naive fallback is still between full and none but is a bit more informed to accommodate for most common exceptions.

@sffc
Member

sffc commented Sep 30, 2022

> I have a slight concern about what you're proposing as naive fallback since it seems to ignore the edge cases of multi-script languages. There are very few of them, and I'd be comfortable baking this data in since I expect it to be ~9 keys. But it seems like you're saying that your naive fallback doesn't recognize those 9 keys and produces wrong results for cases like sr-SR or zh-TW. I think that in my mind the naive fallback is still between full and none but is a bit more informed to accommodate for most common exceptions.

OK, sure, I could see a third LocaleFallbacker constructor that narrows the likely subtags data to only the multi-script languages. You can get rid of a few K's of data this way. It would cause locales like "de-Latn-LI" to fail (it should be able to find "de-LI"), so it should be used only in cases where you know that the script is only ever specified in multi-script languages like zh and sr. I guess an issue here is that languages may change from single-script to multi-script in future CLDR releases, as has happened in the past.

@sffc
Member

sffc commented Sep 30, 2022

The nice thing is that this is all totally configurable since locale fallback is its own separate component in the data provider. If someone comes up with a better fallback mechanism, we can plug it right in.

@sffc
Member

sffc commented Sep 30, 2022

By the way, I see 18 multi-script languages in CLDR 41. The majority are Cyrl/Latn/Arab hybrids (it's not just Serbian); the rest are scattered in Africa, the Middle East, India, and China.

  • Azerbaijani: Latn (default), Cyrl (Russia), Arab (Iran)
  • Hausa: Latn (default), Arab (Sudan)
  • Kazakh: Cyrl (default), Arab (China)
  • Kurdish: Latn (default), Arab (Lebanon), Yezi (Georgia)
  • Kyrgyz: Cyrl (default), Arab (China), Latn (Turkey)
  • Manding: Latn (default), Nkoo (Guinea)
  • … more African, Central Asian, and Eastern European languages
  • Mundari: Beng (default), Deva (Nepal)
  • Chinese and Cantonese

@sffc sffc added needs-approval One or more stakeholders need to approve proposal discuss Discuss at a future ICU4X-SC meeting and removed needs-approval One or more stakeholders need to approve proposal labels Oct 17, 2022
@sffc sffc added the C-data-infra Component: provider, datagen, fallback, adapters label Dec 22, 2022
@sffc
Member

sffc commented Jun 15, 2023

Discuss with:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 15, 2023
@robertbastian
Member

We have an approximation of this now. With #5114 you can pass a LocaleFallbacker to the exporter, and it will deduplicate according to that. You can pass LocaleFallbacker::new_without_data() to basically get naive fallback (although not quite as outlined by Zibi in the OP).

@robertbastian robertbastian removed the discuss Discuss at a future ICU4X-SC meeting label Jun 27, 2024
@robertbastian robertbastian added needs-approval One or more stakeholders need to approve proposal and removed discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jun 27, 2024