Add missing documentation: Incrementally build dataframe #377

Open
dreamflasher opened this issue Aug 15, 2019 · 8 comments

Comments

@dreamflasher

Vaex claims to be "a library for dealing with larger than memory DataFrames (out of core)", but never actually shows how to do that. Yes, you can create a DataFrame from other files, but there seems to be no way to create a new file from scratch?

There is df1 = vaex.open("somedata.hdf5"), but dataframes seem to lack an append method.

@JovanVeljanoski
Member

Hi,

Actually, the documentation describes how to create a DataFrame from scratch:
https://docs.vaex.io/en/latest/tutorial.html#Getting-your-data-in

Also, there is a concat method which allows you to add rows from one DataFrame to the end of another, which is perhaps what you are looking for.
API docs

We are working on improving the documentation, it is a high priority for the final quarter of the year.

Cheers,
J.

@dreamflasher
Author

Thank you. I have the impression I didn't express myself well enough. Basically, all projects I have worked with so far have iterative readers/writers:

result = []
with open("file") as f:
    for line in f:
        result.append(return_complex_preprocessing_of_line(line))

Now I have the problem that result does not fit into memory, so I hoped vaex would be suitable for solving that problem. concat does not seem right for that. I want to append to a dataframe that is backed by a file, writing every now and then (intelligently, e.g. whenever there is a batch of rows to write).
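The buffered-write pattern described here can be sketched with the standard library alone (a sketch; the preprocessing step is a stand-in for return_complex_preprocessing_of_line):

```python
import csv
import io

def process_in_batches(lines, writer, batch_size=1000):
    """Buffer processed rows and flush them to `writer` in batches,
    so the full result never has to fit in memory."""
    batch = []
    for line in lines:
        batch.append([line.strip().upper()])  # stand-in for real preprocessing
        if len(batch) >= batch_size:
            writer.writerows(batch)
            batch.clear()
    if batch:  # flush whatever remains
        writer.writerows(batch)

buf = io.StringIO()  # a real workflow would pass an open file on disk
process_in_batches(["a\n", "b\n", "c\n"], csv.writer(buf), batch_size=2)
```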

@JovanVeljanoski
Member

Hi,

I am not sure what your use-case is, but perhaps this would be of help to you?
#369 (comment)

@maartenbreddels
Member

maartenbreddels commented Aug 16, 2019 via email

@JovanVeljanoski
Member

This will be handled by #695 .

@mujina93

mujina93 commented Sep 2, 2020

Bumping this.

I would also need to know whether appending to a file-backed dataframe is possible, in order to allow row-wise read-from-disk, process, write-to-disk workflows, like the one outlined in issue #952.

Was this addressed by #695? Looking at the changes, I couldn't find anything pointing to new functionality, or documentation of existing functionality, for transparent appending/incremental writing.

It seems that there exist solutions for incrementally writing to the hdf5 files that back vaex's DataFrames, like this. Or another workaround could be to write the processed data to a csv with a classic csv writer, followed by instantiating a new vaex DataFrame from the processed hdf5 or csv.
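For reference, the incremental hdf5 writing mentioned here might look like the following (a sketch assuming h5py is installed; the dataset name and batch contents are made up, and an in-memory buffer stands in for a file on disk):

```python
import io

import h5py
import numpy as np

# An HDF5 dataset created with an unlimited first dimension (maxshape=(None,))
# can be grown in place, one batch at a time.
buf = io.BytesIO()  # h5py also accepts a filesystem path here
with h5py.File(buf, "w") as f:
    dset = f.create_dataset("x", shape=(0,), maxshape=(None,), dtype="f8")
    for batch in (np.arange(3.0), np.arange(3.0, 6.0)):
        n = dset.shape[0]
        dset.resize((n + len(batch),))  # extend the dataset
        dset[n:] = batch                # write only the new rows
```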

But this would defeat the purpose of having a library that should handle those things, since in this case I would have to write my own file-backed dataset-like classes wrapping this "low level" functionality.

@mujina93

mujina93 commented Sep 2, 2020

Or perhaps concat is the idiomatic and performant solution that vaex offers? (Seeing that it is given as the closing answer to issue #211, which asks for appending functionality.)

Using the example from @dreamflasher, are you suggesting to go for something like the following?

result = []
df = vaex.DataFrame() # creating empty dataframe, how?
for line in open("file"):
    single_row_df = vaex.DataFrame(
        return_complex_preprocessing_of_line(line)) # most straightforward api to do this?
    df = df.concat(single_row_df)

Is this performant? (As in: how many times does it write to disk, or perform expensive operations like big allocations or resizing/reshaping the underlying files? And wouldn't it perform wasteful steps, like creating a brand-new disk-backed DataFrame for every single row?)

Plus, this idiom doesn't seem possible as creating an empty dataframe doesn't seem supported yet, judging by #936. (Correct me if I'm wrong)

@JovanVeljanoski
Member

I think what is described here is out of scope for vaex, at least for the time being.

The general approach is:

  • You start with a memory-mappable file (hdf5, arrow, parquet).
  • From there, you can do a large number of transformations/preprocessing steps.
  • If you need to work more with the outputs of the transformations, those can be saved to disk as well, in an appropriate format.

If you are starting from a non-memory-mappable file (like csv, json, etc.), the idea is:

  • either convert that data to a supported memory-mappable file format, then follow the process above;
  • if that is not suitable, it is perhaps best to "stream" row by row in Python and do the transformations using any library you find suitable (numpy, pandas, etc.); I guess this can be done by row or in batches. Then output the data to a csv which can be converted to a memory-mappable file format to work with vaex. If you work with reasonably sized batches, a whole batch can be converted to hdf5 at a time.

While vaex.concat can be used to create larger dataframes out of smaller ones, the use-case I imagine is the following: say you have some process that creates a few tens of millions of rows per day, so each day you create an (arrow, hdf5, parquet) file with the data. You then want to analyse what happened this week, month or year, and then you would use vaex.concat to make a big dataframe out of the smaller individual ones.

vaex.concat is not really meant to be used to incrementally create a dataframe. In fact, I don't see vaex supporting this in the short term, since vaex is there to work with data that is already present, in some format.

If you need to accumulate the data, with some preprocessing along the way, I would say look at arrow, since it supports streaming. Then you can export that directly to an arrow or parquet file, and vaex is ready to consume it.
