
Concatenating large files #2158

Open
sk2 opened this issue Aug 7, 2022 · 9 comments

@sk2

sk2 commented Aug 7, 2022

Hi, I noticed the tutorials mention that performance may be better if multiple smaller files are combined into one larger file.
I could open many files and then save them, but this involves a lot of disk duplication, which doesn't scale.

Is there a technique that could combine multiple smaller files by manipulating the individual hdf5 files? I have looked at some Linux hdf tools but I'm wary of clobbering the data format (especially when it comes to rows).

Ideally there's a way to join them by moving chunks of disk - I do not need to retain the original hdf5 files so the raw bytes could be moved.

I couldn't see anything in the tutorials/faq on this.

Thanks
Simon

@JovanVeljanoski
Member

I am not aware of any such technique or tool. I understand your concerns about data duplication/redundancy, but keep in mind that the sort of conversion that is implemented now (which leads to data duplication) is much safer, i.e. if something goes wrong the original data is unaffected.

Besides, the concatenation needs to figure out whether the schema is consistent across all of the files and what to do if it is not.

Having said that, if anyone has a good idea on how to improve this, PRs are very welcome.
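
For reference, the duplication-based route that exists today amounts to something like this sketch (the paths are illustrative; it assumes all files share the same schema):

import glob
import vaex

# Open the individual HDF5 files as one lazily-concatenated dataframe.
files = sorted(glob.glob("data/part-*.hdf5"))
df = vaex.open_many(files)

# Export rewrites every row into a single new file; the originals are untouched.
df.export_hdf5("data/combined.hdf5")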

@sk2
Author

sk2 commented Aug 8, 2022

Thanks for the quick response.
That makes sense for the reasons you outline.
I'm looking to append incoming data to a file where I know the structure is the same, so copying the whole thing each time wouldn't work as well.
In this case I'd probably be best going down the route outlined at https://forum.hdfgroup.org/t/merge-multiple-h5-files-containing-groups-into-one-h5/6911
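From what I can tell, that route boils down to copying or appending datasets at the HDF5 level with h5py. A rough sketch (the group layout and file names are assumptions, and it only works if the destination datasets were created as resizable, i.e. chunked with maxshape, which a Vaex export may not be):

import h5py

# Append each column of src.hdf5 onto the matching column of dst.hdf5.
with h5py.File("src.hdf5", "r") as src, h5py.File("dst.hdf5", "a") as dst:
    for name in src["/table/columns"]:              # assumed Vaex-style layout
        new = src[f"/table/columns/{name}/data"][:]
        d = dst[f"/table/columns/{name}/data"]
        old = d.shape[0]
        d.resize(old + new.shape[0], axis=0)        # requires maxshape=(None,)
        d[old:] = new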
Is the internal structure used when exporting an HDF5 file from Vaex fairly easy to understand?

BTW - entirely off topic - but I saw you're in Amsterdam. I'm over in Amsterdam for Sigcomm in a couple of weeks, would be nice to say hi (I'm usually in Australia).

@JovanVeljanoski
Member

Ah I see your point. Indeed, appending incoming data to an already existing blob is something we've been thinking about. It has come up in a few discussions in the past as well. I'm afraid at this time I do not have a full solution.

Is the internal structure used when exporting an HDF5 file from Vaex fairly easy to understand?

I've never really worked with hdf5 files outside of vaex (or only very little, when doing some testing).
The structure should be quite simple: it is a column-oriented container. For more info on how this looks, you can peek here: https://github.com/vaexio/vaex/blob/master/packages/vaex-hdf5/vaex/hdf5/writer.py#L20

and it is used here:
https://github.com/vaexio/vaex/blob/master/packages/vaex-core/vaex/dataframe.py#L6904
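
If you want to poke at that layout yourself, a small sketch like this (assuming h5py is installed; the file name is illustrative) prints every group and dataset in an exported file:

import h5py

def show(name, obj):
    # Datasets get their shape and dtype printed; groups just show their path.
    if isinstance(obj, h5py.Dataset):
        print(f"{name}  shape={obj.shape}  dtype={obj.dtype}")
    else:
        print(name)

with h5py.File("exported.hdf5", "r") as f:   # file name is illustrative
    f.visititems(show)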


Oh cool, very nice. Coming all the way! Yeah, definitely send a message when you are around. Feel free to join our Slack as well (link on the front page of the repo).

@sk2
Author

sk2 commented Aug 9, 2022

Hi, thanks, I'm trying the basic way, and both:

import glob
import vaex

all_files = sorted(glob.glob("combined/*hdf5"))
df1 = vaex.open_many(all_files)

and

df1 = vaex.open_many("combined/*hdf5")

are giving me strange behaviour.
These are time series data, and I am trying to join them together to export to a single hdf5 file for efficiency.
They were all created from the same base SQL query, which produced CSVs that I saved using Vaex.
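
(For context, the CSV-to-HDF5 step was just the usual Vaex conversion, roughly like this; the file name is illustrative:)

import vaex

# convert=True writes an .hdf5 next to the CSV and returns a dataframe backed by it.
df = vaex.from_csv("query_output.csv", convert=True)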

If I compare to each file:

all_files = sorted(glob.glob("combined/*hdf5"))
total_length = 0
for file in all_files:
    df2 = vaex.open(file)
    print("max", df2.end_time.max())
    total_length += df2.shape[0]
print("total", total_length)

I get a total length of

498009636

whereas the open_many result is:

>>> df1.shape[0]
249004818

Is there a potential limit in the total amount that can be concatenated?
Is there a better way to do the "basic" concatenation?

The unusual thing is that if I export the combined hdf5 file, it is about the size of all the smaller .hdf5 files combined - but it doesn't seem to have all the data.

Thanks for the feedback on the adding to an existing blob, I'll take a look at the link.
I'll also join the slack - I didn't notice the link on the website!

@JovanVeljanoski
Member

So looking at the code, vaex.open_many expects a list of file paths. It is vaex.open() that accepts a glob expression as its first argument.

I would expect this df1 = vaex.open_many("combined/*hdf5") to crash or not work reliably.

Also, looking at your code, you are relying on df2.end_time.max(); maybe doing something like len(df) would be better. I don't know if the sorting/ordering of the files matters, glob can be funny sometimes.

I would rely on df.shape (the sum of the individuals vs the concatenated dataframe, via any method) to draw my conclusions.

If there are any problems during export (like unsupported types etc.), there should be a clear warning / log in your console/notebook.
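
A minimal sketch of that comparison, using the same combined/*hdf5 files:

import glob
import vaex

files = sorted(glob.glob("combined/*hdf5"))

# Sum of the individual row counts.
total = sum(len(vaex.open(f)) for f in files)

# open_many takes a list of paths; vaex.open accepts a glob pattern directly.
combined = vaex.open_many(files)
print(total, len(combined), total == len(combined))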

@sk2
Author

sk2 commented Aug 10, 2022

Hi, yeah I also tried with vaex.open("*.hdf5") and the max function didn't return the max I expected.

I expect df = vaex.open("*.hdf5") to open all the hdf5 files in that directory, and the shape of the df indicates this (it is the sum of each single .hdf5 file).

However, the max value within df for a column is not the same max value I get if I check each .hdf5 file individually.
It looks like it's only checking the first three files from the glob and then returning the max from that.

Should I open this as a new issue of type bug?

Thanks

@JovanVeljanoski
Member

If you can make a reproducible example that would be great!

@sk2
Author

sk2 commented Aug 22, 2022

I'll see if I can put together a reproducible example - given the large file sizes it may prove tricky with synthetic data.

@Ben-Epstein
Contributor

Ben-Epstein commented Sep 6, 2022

@sk2 I ran into a similar issue and indeed the regular concat did not scale. I am using this internally and it has not caused any issues for me. I opened a PR, but I'm not sure if Jovan or Maarten wanted it in the official code (it was a draft PR just to share it with people that may need something similar):

#1910
