Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further optimize BlobDataProvider constructor #2625

Open
sffc opened this issue Sep 24, 2022 · 2 comments
Open

Further optimize BlobDataProvider constructor #2625

sffc opened this issue Sep 24, 2022 · 2 comments
Assignees
Labels
A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Copy link
Member

sffc commented Sep 24, 2022

In #2603, I improved the performance of the BlobDataProvider constructor by 60%. However, there is still a lot more that can be done. In principle, this should be a constant-time operation, except that we need to do minimal validation of the bytes.

There are two parts of the constructor that take the most time:

  1. Checking that the indices list in the VarZeroVec is sorted (a bigger chunk)
  2. Checking that the joiner list in the ZeroMap2d is sorted (relatively smaller)

Both of these get slower as the data file gets bigger, especially (1).

If we could skip this during construction and delay the check to data load time, we should be able to get the BlobDataProvider construction time down to 100 ns or less.

I did one attempt at this in #2609 by changing the data model. However, this can be done without touching the data model, meaning we can do this later and still read the same data files.

CC @zbraniecki

@sffc sffc added T-enhancement Type: Nice-to-have but not required A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) labels Sep 24, 2022
@sffc sffc self-assigned this Sep 24, 2022
@sffc sffc added this to the ICU4X 1.1 milestone Oct 17, 2022
@sffc sffc assigned robertbastian and unassigned sffc Dec 22, 2022
@sffc sffc modified the milestones: ICU4X 1.1, ICU4X 1.x Dec 22, 2022
@robertbastian
Copy link
Member

Uhm I'm not sure why this is assigned to me. Is there an AI?

We could have a GIGO version of ZeroMap that doesn't check the sorting on deserialization.

@sffc
Copy link
Member Author

sffc commented Feb 9, 2023

I think when I assigned this issue I thought it was something different. I'll take it again.

@sffc sffc assigned sffc and unassigned robertbastian Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Projects
None yet
Development

No branches or pull requests

2 participants