Add missing documentation: Incrementally build dataframe #377

Open
dreamflasher opened this issue Aug 15, 2019 · 8 comments

Comments

@dreamflasher

Vaex claims to be "a library for dealing with larger than memory DataFrames (out of core)", but never actually shows how to do that. Yes, you can create a DataFrame from other files, but there seems to be no way to create a new file from scratch?

There is df1 = vaex.open("somedata.hdf5"), but dataframes seem to lack an append method.

@JovanVeljanoski
Member

Hi,

Actually, the documentation describes how to create a DataFrame from scratch:
https://docs.vaex.io/en/latest/tutorial.html#Getting-your-data-in

Also, there is a concat method which allows you to add rows from one DataFrame to the end of another, which is perhaps what you are looking for.
API docs

We are working on improving the documentation, it is a high priority for the final quarter of the year.

Cheers,
J.

@dreamflasher
Author

Thank you. I have the impression I didn't express myself well enough. Basically, all projects I have worked with so far have iterative readers/writers:

result = []
with open("file") as f:
    for line in f:
        result.append(return_complex_preprocessing_of_line(line))

Now I have the problem that result does not fit into memory, so I hoped vaex would be suitable for solving that problem. concat does not seem right for that. I want to append to a dataframe that is backed by a file, writing every now and then (intelligently, e.g. whenever there is a batch of rows to write).
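The buffered-write pattern described here can be sketched with the standard library alone (a sketch; the preprocessing step is a stand-in for return_complex_preprocessing_of_line):

```python
import csv
import io

def process_in_batches(lines, writer, batch_size=1000):
    """Buffer processed rows and flush them to `writer` in batches,
    so the full result never has to fit in memory."""
    batch = []
    for line in lines:
        batch.append([line.strip().upper()])  # stand-in for real preprocessing
        if len(batch) >= batch_size:
            writer.writerows(batch)
            batch.clear()
    if batch:  # flush whatever remains
        writer.writerows(batch)

buf = io.StringIO()  # a real workflow would pass an open file on disk
process_in_batches(["a\n", "b\n", "c\n"], csv.writer(buf), batch_size=2)
```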

@JovanVeljanoski
Member

Hi,

I am not sure what your use-case is, but perhaps this would be of help to you?
#369 (comment)

@maartenbreddels
Member

maartenbreddels commented Aug 16, 2019 via email

@JovanVeljanoski
Member

This will be handled by #695 .

@mujina93

mujina93 commented Sep 2, 2020

Bumping this.

I would also need to know whether appending to a file-backed dataframe is possible, in order to allow row-wise read-from-disk, process, write-to-disk workflows, like the one outlined in issue #952.

Was this addressed by #695? Looking at the changes, I couldn't find anything pointing to new functionality, or documentation of existing functionality, for transparent appending/incremental writing.

It seems that there exist solutions for incrementally writing to the hdf5 files that back vaex's DataFrames, like this. Or another workaround could be to write the processed data to a csv with a classic csv writer, followed by instantiating a new vaex DataFrame from the processed hdf5 or csv.
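For reference, the incremental hdf5 writing mentioned here might look like the following (a sketch assuming h5py is installed; the dataset name and batch contents are made up, and an in-memory buffer stands in for a file on disk):

```python
import io

import h5py
import numpy as np

# An HDF5 dataset created with an unlimited first dimension (maxshape=(None,))
# can be grown in place, one batch at a time.
buf = io.BytesIO()  # h5py also accepts a filesystem path here
with h5py.File(buf, "w") as f:
    dset = f.create_dataset("x", shape=(0,), maxshape=(None,), dtype="f8")
    for batch in (np.arange(3.0), np.arange(3.0, 6.0)):
        n = dset.shape[0]
        dset.resize((n + len(batch),))  # extend the dataset
        dset[n:] = batch                # write only the new rows
```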

But this would defeat the purpose of having a library that should handle those things, since in this case I would have to write my own file-backed dataset-like classes wrapping this "low level" functionality.

@mujina93

mujina93 commented Sep 2, 2020

Or perhaps concat is the idiomatic and performant solution that vaex offers? (Seeing that it is given as the closing answer to issue #211, which asks for appending functionality.)

Using the example from @dreamflasher, are you suggesting to go for something like the following?

result = []
df = vaex.DataFrame() # creating empty dataframe, how?
for line in open("file"):
    single_row_df = vaex.DataFrame(
        return_complex_preprocessing_of_line(line)) # most straightforward api to do this?
    df = df.concat(single_row_df)

Is this performant? (As in: how many times does it write to disk, or perform expensive operations like big allocations or resizing/reshaping the underlying files? And wouldn't it perform wasteful steps, like creating a brand-new disk-backed DataFrame for every single row?)

Plus, this idiom doesn't seem possible as creating an empty dataframe doesn't seem supported yet, judging by #936. (Correct me if I'm wrong)

@JovanVeljanoski
Member

I think what is described here is out of scope for vaex, at least for the time being.

The general approach is:

  • You start with a memory-mappable file (hdf5, arrow, parquet).
  • From there, you can do a large number of transformations/preprocessing steps.
  • If you need to work more with the outputs of the transformations, those can be saved to disk as well, in an appropriate format.

If you are starting from a non-memory-mappable file (like csv, json, etc.), the idea is:

  • either convert that data to a supported memory-mappable file format, then follow the process above;
  • if that is not suitable, it is perhaps best to "stream" row by row in Python and do the transformations using any library you find suitable (numpy, pandas, etc.); I guess this can be done by row or in batches. Then output the data to a csv which can be converted to a memory-mappable file format to work with vaex. If you work with reasonably sized batches, a whole batch can be converted to hdf5 at a time.

While vaex.concat can be used to create larger dataframes out of smaller ones, the use-case I imagine is the following: say you have some process that creates a few tens of millions of rows per day, so each day you create an (arrow, hdf5, parquet) file with the data. You then want to analyse what happened this week, month or year, and then you would use vaex.concat to make a big dataframe out of the smaller individual ones.

vaex.concat is not really meant to be used to incrementally create a dataframe. In fact, I don't see vaex supporting this in the short term, since vaex is there to work with data that is already present, in some format.

If you need to accumulate the data, with some preprocessing along the way, I would say look at arrow, since it supports streaming. Then you can export that directly to an arrow or parquet file, and vaex is ready to consume it.
