
concat many hdf5 files fast #1910

Draft
wants to merge 1 commit into base: master
Conversation

Ben-Epstein
Contributor

Helper function to concatenate many hdf5 files. Tested against hundreds of thousands of files.

I could imagine using this when a user globs with a .open: vaex could call this to concatenate the files (perhaps making the os.remove optional) and create the final file for vaex.
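For readers of this thread, here is a minimal sketch (not the PR's actual implementation) of the append-style concatenation described above, assuming the files share a schema and use vaex's usual `/table/columns/<name>/data` layout for numeric columns. `concat_hdf5` is a hypothetical name:

```python
import h5py
import numpy as np


def concat_hdf5(paths, out_path, group="/table/columns"):
    """Concatenate same-schema HDF5 files by appending along axis 0.

    Assumes each source file stores numeric columns as
    <group>/<name>/data, as vaex-exported files typically do.
    """
    with h5py.File(out_path, "w") as out:
        for path in paths:
            with h5py.File(path, "r") as src:
                grp = src[group]
                for name in grp:
                    data = grp[name]["data"][:]
                    key = f"{group}/{name}/data"
                    if key not in out:
                        # resizable (chunked) dataset so later files can append
                        out.create_dataset(
                            key, data=data,
                            maxshape=(None,) + data.shape[1:],
                        )
                    else:
                        ds = out[key]
                        n = ds.shape[0]
                        ds.resize(n + data.shape[0], axis=0)
                        ds[n:] = data
    return out_path
```

Note the output datasets are necessarily chunked (a `maxshape` requires chunking), which is exactly why the mmap concern for string columns mentioned below may apply.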

TODO: You'll see in the code that I handle string columns less than ideally. I know that vaex creates a data and an indices group for string columns. I was able to recreate and append to those successfully, but was unable to get vaex to read them back properly. I believe that is because vaex cannot mmap string columns from chunked hdf5 files, but that may be incorrect (just my best guess from reading the source code).

So currently those columns come back as byte arrays and need to be cast like so:

df[col] = df[col].to_arrow().cast(pa.large_string())

I'm sure we can figure out a better solution here.

CC @maartenbreddels
