Concatenating large files #2158
Hi, I noticed the tutorials mention that performance may be better if multiple smaller files are combined into one larger file.
I could open many files and then export them to a single file, but this involves a lot of disk duplication, which doesn't scale.
Is there a technique that could combine multiple smaller files by manipulating the individual hdf5 files directly? I have looked at some Linux HDF tools, but I'm wary of clobbering the data format (especially when it comes to rows).
Ideally there would be a way to join them by moving chunks on disk - I do not need to retain the original hdf5 files, so the raw bytes could be moved.
I couldn't see anything in the tutorials/FAQ on this.
Thanks
Simon
I am not aware of any such technique or tool. I understand your concerns about data duplication/redundancy, but keep in mind that the sort of conversion that is implemented now (which leads to data duplication) is much safer, i.e. if something goes wrong the original data is unaffected. Besides, the concatenation needs to figure out whether the schema is consistent across all of the files, and what to do if it is not. Having said that, if anyone has a good idea on how to improve this, PRs are very welcome.
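For reference, the conversion-by-export route described above looks roughly like this (a minimal sketch; the file names are placeholders):

```python
import vaex

# Open several smaller files as one virtually concatenated DataFrame.
df = vaex.open_many(["part1.hdf5", "part2.hdf5", "part3.hdf5"])

# Export to a single file. This rewrites (duplicates) the data on disk,
# which is the safety trade-off mentioned above: if the export fails,
# the original files are untouched.
df.export_hdf5("combined.hdf5")
```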
Thanks for the quick response. BTW - entirely off topic - but I saw you're in Amsterdam. I'm over in Amsterdam for Sigcomm in a couple of weeks; it would be nice to say hi (I'm usually in Australia).
Ah, I see your point. Indeed, appending incoming data to an already existing blob is something we've been thinking about. It has come up in a few discussions in the past as well. I am afraid at this time I do not have a full solution.
I've never really worked with hdf5 files outside of vaex (or very little, at least when doing some testing), and it is used here:
Oh cool, very nice. Coming all the way. Yeah, definitely send a message when you are around. Feel free to join our slack also (link on the front page of the repo).
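vaex itself does not expose an append-in-place operation, but to illustrate the mechanism at the raw HDF5 level, appending to a resizable dataset with h5py looks like this (a sketch only; it does not match vaex's internal hdf5 column layout):

```python
import h5py
import numpy as np

# Create a dataset that can grow along its first axis; maxshape=(None,)
# means unlimited, and chunking is required for resizable datasets.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("x", shape=(0,), maxshape=(None,), chunks=(1024,), dtype="f8")

# Later, append a block of incoming data in place, without rewriting
# what is already there.
new_block = np.random.rand(500)
with h5py.File("data.h5", "a") as f:
    ds = f["x"]
    old = ds.shape[0]
    ds.resize(old + len(new_block), axis=0)
    ds[old:] = new_block
```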
Hi, thanks. I'm trying the basic way, and both:
and
are giving me strange behaviour. If I compare against each file:
I get a total length of
whereas the open-many gives:
Is there a potential limit on the total amount that can be concatenated? The unusual thing is that if I export the combined hdf5 file, it is about the combined size of the smaller .hdf5 files - but it doesn't seem to have all the data. Thanks for the feedback on adding to an existing blob; I'll take a look at the link.
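For what it's worth, the per-file comparison being described can be sketched like this (the column name and glob pattern are placeholders):

```python
import glob
import vaex

paths = sorted(glob.glob("*.hdf5"))

# Statistics per individual file...
per_file_max = [vaex.open(p)["some_column"].max() for p in paths]
total_rows = sum(len(vaex.open(p)) for p in paths)

# ...versus the same statistics on the virtually concatenated DataFrame.
df = vaex.open_many(paths)
print(total_rows, len(df))                          # lengths should agree
print(max(per_file_max), df["some_column"].max())   # and so should the max
```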
So, looking at the code, I would expect this:
Also, looking at the code, you are relying on:
I would rely on:
If there are any problems via exporting (like unsupported types etc.) there should be a clear warning / log in your console/notebook.
Hi, yeah, I also tried with open("*.hdf5") and the max function didn't return the max I expected. I expect df = vaex.open("*.hdf5") to open all hdf5 files in that directory, and the shape of the df indicates this (it is the sum of the lengths of each single .hdf5 file). However, the max value within df for a column is not the same max value as when I check each .hdf5 file individually. Should I open this as a new bug issue? Thanks
If you can make a reproducible example that would be great!
I'll see if I can put together a reproducible example - given the large file sizes it might be tricky with synthetic data.
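In case it helps, a scaled-down synthetic reproducer could look like the following (whether the problem shows up at small sizes is exactly the open question):

```python
import numpy as np
import vaex

# Write a few small hdf5 files with a known maximum in each.
for i in range(3):
    x = np.random.rand(1_000_000) + i  # file i has values in [i, i + 1)
    vaex.from_arrays(x=x).export_hdf5(f"part_{i}.hdf5")

# Reopen them together and check that length and max survive concatenation.
df = vaex.open("part_*.hdf5")
assert len(df) == 3 * 1_000_000
print(df.x.max())  # expected: just under 3.0
```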
@sk2 I ran into a similar issue, and indeed the regular concat did not scale. I am using this internally and it has not caused any issues for me. I opened a PR, but I'm not sure if Jovan or Maarten wanted it in the official code (it was a draft PR, just to share it with people that may need something similar).
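For context, a low-memory concat at the raw HDF5 level (streaming each column across in bounded chunks instead of materializing everything) might look like the sketch below. This is not the code from the PR; it assumes a flat layout with identical 1-D dataset names and dtypes across files, which is not how vaex organizes its hdf5 files internally.

```python
import h5py

def concat_column(in_paths, out_path, name, chunk=1_000_000):
    """Concatenate the 1-D dataset `name` from each input file into
    `out_path`, copying `chunk` rows at a time to bound memory use."""
    # First pass: total length and dtype, without reading any data.
    total, dtype = 0, None
    for p in in_paths:
        with h5py.File(p, "r") as f:
            total += f[name].shape[0]
            dtype = f[name].dtype
    # Second pass: stream the rows across.
    with h5py.File(out_path, "w") as out:
        dst = out.create_dataset(name, shape=(total,), dtype=dtype)
        offset = 0
        for p in in_paths:
            with h5py.File(p, "r") as f:
                src = f[name]
                for start in range(0, len(src), chunk):
                    stop = min(start + chunk, len(src))
                    dst[offset + start:offset + stop] = src[start:stop]
                offset += len(src)
```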