Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow2's estimated_size_bytes performance issues #1738

Closed
teh-cmc opened this issue Mar 30, 2023 · 3 comments · Fixed by #1739
Closed

arrow2's estimated_size_bytes performance issues #1738

teh-cmc opened this issue Mar 30, 2023 · 3 comments · Fixed by #1739
Assignees
Labels
😤 annoying Something in the UI / SDK is annoying to use 🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc

Comments

@teh-cmc
Copy link
Member

teh-cmc commented Mar 30, 2023

The problem is so severe that it is pretty much impossible to use it anywhere in the store without everything slowing down to a crawl, which makes implementing statistics really, really painful.

Even for incremental measurements this is way too slow, orders of magnitudes slower than everything else on the write path.

See #1743 for detailed benchmarks.


If we cannot optimize it any further, my proposal is to move the problem upstream: compute byte sizes within the batching system, therefore:

  1. distributing the load to the clients
  2. making its cost irrelevant, since batching happens on a separate thread
@teh-cmc teh-cmc added 🏹 arrow concerning arrow 😤 annoying Something in the UI / SDK is annoying to use 📉 performance Optimization, memory use, etc labels Mar 30, 2023
@emilk
Copy link
Member

emilk commented Mar 31, 2023

Can't we fix this in arrow2? It sounds like a performance bug.

@teh-cmc
Copy link
Member Author

teh-cmc commented Mar 31, 2023

Can't we fix this in arrow2? It sounds like a performance bug.

Yep, I want to investigate this tomorrow after a good night's sleep.
Sounds like a good opportunity to become really intimate with arrow2's internals too, which I've been meaning to do for a while anyhow.

@teh-cmc teh-cmc self-assigned this Mar 31, 2023
@teh-cmc teh-cmc changed the title arrow2's estimated_size_bytes is _extremely_ slow arrow2's estimated_size_bytes performance issues Mar 31, 2023
@teh-cmc
Copy link
Member Author

teh-cmc commented Apr 1, 2023

Long-winded explanation here, but the TL;DR is that, as far as I can tell, A) there is nothing wrong with how estimated_bytes_size is implemented and B) performance is getting absolutely wrecked by nested virtual calls.

Due to the way arrow2 is structured, the number of virtual calls necessary to execute Array::clone or estimated_bytes_size scales linearly with the number of fields in a type, while at the same time the performance of these methods decreases pretty much linearly with the number of virtual calls involved.
I.e. cloning or estimating the size of a Point3D will be much slower than doing the same thing on an InstanceKey.

At least in the short-term I think the solution stays the same: get these operations out of the fast path, cache the results in DataCells, use DataCells everywhere, avoid working with raw arrow arrays as much as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
😤 annoying Something in the UI / SDK is annoying to use 🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants