
Speed up resource lookup and fallback at runtime #2699

Open
sffc opened this issue Oct 1, 2022 · 10 comments
Assignees
Labels
A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring) T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Member

sffc commented Oct 1, 2022

#2683, #834, and #2686 are largely centered on optimizations performed at datagen time.

We may want to consider opportunities for runtime improvements as well. Ideas include:

Use hash maps: The lowest-hanging fruit here would be to leverage ZeroHashMap to make resource lookups more constant-time, especially if you're in a situation where you ship hundreds of locales in a single data file. We should at least do this in BakedDataProvider since we can more easily change the data model post 1.0 and if people are building hundreds of locales, they probably should be using BakedDataProvider anyway.
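To illustrate the trade-off, here is a minimal sketch using std's `HashMap` as a stand-in for `ZeroHashMap` (which is zero-copy and serializable, unlike this toy). `LocaleTable` and its layout are hypothetical, not the ICU4X data model:

```rust
use std::collections::HashMap;

/// Hypothetical lookup table mapping locale strings to payload indices.
struct LocaleTable {
    sorted: Vec<(&'static str, usize)>,   // current binary-search layout
    hashed: HashMap<&'static str, usize>, // proposed hash layout
}

impl LocaleTable {
    fn new(locales: &[&'static str]) -> Self {
        let mut sorted: Vec<_> =
            locales.iter().enumerate().map(|(i, &l)| (l, i)).collect();
        sorted.sort_by_key(|&(l, _)| l);
        let hashed = locales.iter().enumerate().map(|(i, &l)| (l, i)).collect();
        Self { sorted, hashed }
    }

    /// O(log n) comparisons, comparable to a sorted-map layout.
    fn get_binary(&self, locale: &str) -> Option<usize> {
        self.sorted
            .binary_search_by_key(&locale, |&(l, _)| l)
            .ok()
            .map(|pos| self.sorted[pos].1)
    }

    /// Amortized O(1), the behavior a ZeroHashMap-backed lookup would give.
    fn get_hashed(&self, locale: &str) -> Option<usize> {
        self.hashed.get(locale).copied()
    }
}
```

With hundreds of locales, the hash lookup does constant work per request while the binary search pays roughly 8–10 string comparisons.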

Micro-optimizations in the fallback algorithm: For example, we can probably try the input DataLocale directly in the resource file and return fast if there is an exact match before performing the normalization and fallback steps.
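A minimal sketch of that fast path. The lowercasing "normalization" and the truncate-at-`'-'` fallback chain are toy stand-ins for the real ICU4X algorithms, and `lookup` plus the `HashMap` layout are hypothetical:

```rust
use std::collections::HashMap;

/// Try the requested locale verbatim and return early on an exact match,
/// before paying for normalization and fallback.
fn lookup(data: &HashMap<String, u32>, requested: &str) -> Option<u32> {
    // Fast path: exact match on the unnormalized input.
    if let Some(&v) = data.get(requested) {
        return Some(v);
    }
    // Slow path: normalize, then walk the fallback chain.
    let mut locale = requested.to_ascii_lowercase(); // toy "normalization"
    loop {
        if let Some(&v) = data.get(&locale) {
            return Some(v);
        }
        match locale.rfind('-') {
            Some(i) => locale.truncate(i), // drop the last subtag
            None => return None,
        }
    }
}
```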

Micro-optimizations in cmp_bytes (aka strict_cmp): Another bottleneck in resource lookup is the cmp_bytes function in DataLocale. There may be opportunities to improve the performance of this operation.

I'll also point out that one might naively assume that hitting the data provider multiple times per constructor (e.g., loading date symbols, time symbols, and number symbols independently) hurts performance. However, previous evidence shows that smaller keys save on deserialization/validation cost, and that smaller keys typically produce smaller data files due to more deduplication opportunities (though whenever a key split is proposed, its size impact should be measured).

@sffc sffc added T-enhancement Type: Nice-to-have but not required A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring) labels Oct 1, 2022
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Oct 17, 2022
@sffc sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 16, 2022
@sffc
Member Author

sffc commented Dec 16, 2022

@Manishearth @robertbastian Can you weigh in and suggest alternatives?

@robertbastian
Member

> Use hash maps: The lowest-hanging fruit here would be to leverage ZeroHashMap to make resource lookups more constant-time, especially if you're in a situation where you ship hundreds of locales in a single data file. We should at least do this in BakedDataProvider since we can more easily change the data model post 1.0 and if people are building hundreds of locales, they probably should be using BakedDataProvider anyway.

Let's do it.

> Micro-optimizations in the fallback algorithm: For example, we can probably try the input DataLocale directly in the resource file and return fast if there is an exact match before performing the normalization and fallback steps.

This makes sense and I'd have expected this to happen anyway.

> Micro-optimizations in cmp_bytes (aka strict_cmp): Another bottleneck in resource lookup is the cmp_bytes function in DataLocale. There may be opportunities to improve the performance of this operation.

Do we actually have measurements confirming this as a bottleneck?


One other improvement I can think of: many keys will have identical locale sets. Our providers could cache the last requested locale and index, and if the next key uses the same locales (as determined at datagen time), we already know its index. This would only help in non-fallback lookup though.

For fallback we could cache the last key+locale+index, and if the next request is for the same key, we can start the search where we expected the last locale to be. As it's fallback, we usually look for a prefix of the previous key, so that should be in the same location.
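A minimal sketch of that cache idea. All names here are hypothetical, and a linear scan stands in for whatever search the real provider uses:

```rust
/// Remember the last (key, locale, index) hit and start the next search
/// there when the same key is requested again.
struct FallbackCache {
    last: Option<(u32, String, usize)>, // (key id, locale, index)
}

impl FallbackCache {
    fn new() -> Self {
        Self { last: None }
    }

    /// Starting index for a search over the key's locale list.
    fn start_index(&self, key: u32) -> usize {
        match &self.last {
            Some((k, _, idx)) if *k == key => *idx,
            _ => 0,
        }
    }

    fn record_hit(&mut self, key: u32, locale: &str, index: usize) {
        self.last = Some((key, locale.to_owned(), index));
    }
}

/// Search near the cached position first: fallback usually requests a
/// prefix of the previous locale, which sorts at or just before it.
fn find(locales: &[&str], cache: &FallbackCache, key: u32, wanted: &str) -> Option<usize> {
    if locales.is_empty() {
        return None;
    }
    let start = cache.start_index(key).min(locales.len() - 1);
    (0..=start)
        .rev()
        .chain(start + 1..locales.len())
        .find(|&i| locales[i] == wanted)
}
```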

@sffc
Member Author

sffc commented Dec 16, 2022

Discussion:

  • We can carry info forward between requests using DataRequestMetadata
  • We could borrow a locale, etc., in that metadata.

@sffc sffc self-assigned this Dec 16, 2022
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 16, 2022
@sffc sffc added this to the ICU4X 1.2 milestone Dec 16, 2022
@sffc
Member Author

sffc commented Oct 24, 2023

In #4207 I'm implementing a ZeroTrie variant which, if the previous benchmarks hold, is faster and smaller than the current ZeroMap2d variant. ZeroHashMap is likely still the fastest option, but it would incur some data size cost. So I'm keeping this issue open but lowering the priority.

sffc added a commit that referenced this issue Oct 25, 2023
Related: #4216, #4207, #2699

See the numbers in locale_aux_test.rs
@sffc
Member Author

sffc commented Dec 23, 2023

Raising the priority again because the ZeroTrie optimization was only applied to blob provider. We should still explore a solution for baked provider.

This performance impact manifests in NeoDateTimeFormatter, which hits the provider more times than the previous iteration of DateTimeFormatter did.

@sffc
Member Author

sffc commented Apr 19, 2024

An issue I'm encountering now is that in datetime format, the ZeroTrie lookup tables for skeleta are fairly large. If we were able to share strings between keys (calendars), they could be made smaller. For example, with one level of indirection, we could store a table of locale strings, and then the lookup table would have a u32 index into the locale table. This would probably amortize to be smaller than ZeroTrie. Not sure about impact on lookup speed.
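A minimal sketch of that indirection: locale strings stored once in a shared table, with per-key lookup tables holding u32 indices into it. The structs and linear scans are illustrative stand-ins, not the ICU4X layout:

```rust
/// Locale strings stored once, shared across keys/calendars.
struct SharedLocaleTable {
    locales: Vec<String>,
}

/// Per-key table: each entry holds a u32 index into the shared locale
/// table instead of repeating the locale string.
struct KeyLookup {
    entries: Vec<(u32, &'static str)>, // (locale index, payload)
}

impl SharedLocaleTable {
    fn index_of(&self, locale: &str) -> Option<u32> {
        self.locales.iter().position(|l| l == locale).map(|i| i as u32)
    }
}

impl KeyLookup {
    fn get(&self, table: &SharedLocaleTable, locale: &str) -> Option<&'static str> {
        let idx = table.index_of(locale)?;
        self.entries.iter().find(|&&(i, _)| i == idx).map(|&(_, p)| p)
    }
}
```

The per-request cost splits into one shared-table resolution plus a cheap integer lookup per key, which is where the amortization would come from.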

@Manishearth
Member

That would be cool. I've wanted to amortize space between aux keys by storing them at a different layer in the blob.

@sffc
Member Author

sffc commented May 4, 2024

Just an idea for how to do this in Blob Provider:

  1. Have a blob-local global ZeroTrie for language identifiers and another blob-local global ZeroTrie for aux keys
  2. The key-specific tries are indexed by a two-character UTF-8 string: one character encoding the index of the langid, one encoding the index of the aux key

I think this should amortize the cost of the aux-key-heavy data keys quite nicely.
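A minimal sketch of step 2: given the positions of a langid and an aux key in their blob-global tables, build the two-character string used to index the key-specific trie. The offset-by-1 (to avoid NUL) and the use of `char::from_u32` are illustrative choices, not the actual ZeroTrie encoding:

```rust
/// Encode (langid index, aux key index) as a two-character string.
/// Returns None if an index lands outside the valid scalar-value range
/// (e.g., in the surrogate range), which a real encoding would avoid by
/// construction.
fn encode_key(langid_index: u32, aux_index: u32) -> Option<String> {
    let l = char::from_u32(langid_index + 1)?;
    let a = char::from_u32(aux_index + 1)?;
    Some(format!("{l}{a}"))
}
```

Since each index compresses to one scalar (1–4 UTF-8 bytes), the per-key trie keys stay short regardless of how long the underlying langid or aux key strings are.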

@robertbastian
Member

Is (2.) using ZeroTrie as a ZeroMap2d<char, char, usize>?

@sffc
Member Author

sffc commented May 10, 2024

I did some experimentation. For an export of all date skeleton keys over all locales and maximal deduplication, here's what I get:

|                                       | Blob      | Blob2     | #2699 (comment) |
|---------------------------------------|-----------|-----------|-----------------|
| Postcard Size                         | 567405 B  | 218376 B  | 196215 B        |
| Load for sr-Latn-x-ym0d (no fallback) | 247.58 ns | 175.20 ns | 203.23 ns       |
| Load for sr-ME-x-ym0d (with fallback) | 719.68 ns | 421.64 ns | 451.99 ns       |

These figures aren't compelling enough for me to continue work on this blob layout. I archived my code here:

https://github.com/sffc/omnicu/tree/archive/blob3
