Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datatype deduplication 2: switch to re_arrow2 #4883

Merged
merged 5 commits into from
Jan 24, 2024
Merged

Datatype deduplication 2: switch to re_arrow2 #4883

merged 5 commits into from
Jan 24, 2024

Conversation

teh-cmc
Copy link
Member

@teh-cmc teh-cmc commented Jan 22, 2024

Grunt work to switch to re_arrow2 and all the breaking changes that come with it.


Part of the tiny datatype deduplication PR series:

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

@teh-cmc teh-cmc added dependencies concerning crates, pip packages etc 🚜 refactor Change the code, not the functionality exclude from changelog PRs with this won't show up in CHANGELOG.md labels Jan 22, 2024
@teh-cmc teh-cmc changed the base branch from main to cmc/polarless January 22, 2024 14:38
@teh-cmc teh-cmc changed the title Datatype deduplication 1: switch to re_arrow2 Datatype deduplication e: switch to re_arrow2 Jan 22, 2024
@teh-cmc teh-cmc changed the title Datatype deduplication e: switch to re_arrow2 Datatype deduplication 2: switch to re_arrow2 Jan 22, 2024
@teh-cmc teh-cmc added the do-not-merge Do not merge this PR label Jan 22, 2024
Copy link

github-actions bot commented Jan 22, 2024

Size changes

Name main 4883/merge Change
plots.rrd 204.80 kiB 194.56 kiB -5.00%

Copy link
Member

@emilk emilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay!

Cargo.lock Outdated Show resolved Hide resolved
@teh-cmc
Copy link
Member Author

teh-cmc commented Jan 23, 2024

This gives us a 20% memory usage improvement in the best case as is (instead of 50% back in the day, because we've optimized things since then, most notably we don't have any arrow extensions anymore).

We've basically reached the point where DataTypes have no heap overhead anymore, now the issue is that they have massive stack overhead (std::mem::size_of::<DataType>() = 48!) which turns into heap overhead everytime we slice data and get a dyn Array out.

Trying to remove the stack overhead from arrow2 is extremely painful, it breaks basically every line of code; and will be nullified when we switch to arrow-rs anyhow.
The only long-term viable solution I can think of is to implement our own slicing mechanism that isn't Box<dyn Array>, which has been the source of most of our compute and space problems since the beginning.

Base automatically changed from cmc/polarless to main January 23, 2024 17:24
teh-cmc added a commit that referenced this pull request Jan 23, 2024
All the grunt work left to get rid of polars.

- Remove all helpers and APIs built specifically for polars'
`DataFrame`.
- Refactor tests that rely on dataframe joins to not require join
semantics in the first place (`re_data_store` has no knowledge of those
anyway).
- The one test that does require join semantics has moved over to
`re_query`, where join semantics belong.
- All `polars-*` dep have been removed.

Don't look at the commit log as it makes no sense: i changed strategies
a bunch of times on the way.

---

- Part of #4789
- DNR: requires #4856

---

Part of the tiny datatype deduplication PR series:
- #4880
- #4883
@emilk
Copy link
Member

emilk commented Jan 24, 2024

Trying to remove the stack overhead from arrow2 is extremely painful, it breaks basically every line of code; and will be nullified when we switch to arrow-rs anyhow. The only long-term viable solution I can think of is to implement our own slicing mechanism that isn't Box<dyn Array>, which has been the source of most of our compute and space problems since the beginning.

Isn't the long-term solution to switch to arrow-rs?

@teh-cmc
Copy link
Member Author

teh-cmc commented Jan 24, 2024

Trying to remove the stack overhead from arrow2 is extremely painful, it breaks basically every line of code; and will be nullified when we switch to arrow-rs anyhow. The only long-term viable solution I can think of is to implement our own slicing mechanism that isn't Box<dyn Array>, which has been the source of most of our compute and space problems since the beginning.

Isn't the long-term solution to switch to arrow-rs?

I'd say that switching to arrow-rs is the medium-term solution even, and should carry us a long-ish way.

Specifically, arrow-rs..:

  • ..works with Arc<dyn Array> instead of Box<dyn Array>, so no costly virtual clones.
  • ..makes sure to arc-ify DataType's internal fields to amortize heap costs (so same as what we did just now).
  • ..reduced the stack size of DataType from 48 to 24 bytes.

But, there is still a lot of overhead when it comes to slicing data, e.g. a u32 slice is still std::mem::size_of::<PrimitiveArray<UInt32Type>>() = 96 bytes per cell...

@teh-cmc teh-cmc removed the do-not-merge Do not merge this PR label Jan 24, 2024
@teh-cmc teh-cmc merged commit c4dc7ad into main Jan 24, 2024
40 checks passed
@teh-cmc teh-cmc deleted the cmc/re_arrow2 branch January 24, 2024 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies concerning crates, pip packages etc exclude from changelog PRs with this won't show up in CHANGELOG.md 🚜 refactor Change the code, not the functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fork arrow2 and get rid of polars
2 participants