Add missing documentation: Incrementally build dataframe #377
Comments
Hi, Actually, the documentation describes how to create a DataFrame from scratch: Also, there is a We are working on improving the documentation, it is a high priority for the final quarter of the year. Cheers, |
Thank you. I have the impression I didn't express myself well enough. Basically, in all projects I have worked with so far, one has iterative readers/writers:
Now I have the problem that result does not fit into memory. So I hoped vaex would be suitable for solving that problem. |
Hi, I am not sure what your use case is, but perhaps this would be of help to you? |
I think this comes up in many issues, and it would be great if we elaborate
more on this in the documentation.
|
This will be handled by #695. |
Upping this. I would also need to know whether appending to a file-backed dataframe is possible, in order to allow row-wise read_from_disk-process-write_to_disk workflows like the one outlined in issue #952. Was this addressed by #695? Looking at the changes, I couldn't find anything that adds functionality, or documents existing functionality, for transparent appending/incremental writing. There seem to be solutions for incrementally writing to hdf5 files, like this, which back vaex's DataFrames. Another workaround could be to write the processed data to a csv with a classic csv writer, followed by instantiating a new vaex DataFrame from the processed hdf5 or csv. But this would defeat the purpose of having a library that should handle those things, since I would then have to write my own file-backed, dataset-like classes wrapping this "low level" functionality. |
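The csv workaround mentioned above could look roughly like the sketch below. It uses only the Python standard library; `preprocess` is a hypothetical stand-in for the per-line processing, and the file names and column names are made up for the demo. Only one row is held in memory at a time.

```python
import csv
import os
import tempfile


def preprocess(line):
    # Hypothetical stand-in for the real per-line processing step.
    name, value = line.strip().split(",")
    return {"name": name.upper(), "value": float(value) * 2}


def process_to_csv(src_path, dst_path, fieldnames=("name", "value")):
    """Stream src_path line by line and write each processed row to dst_path."""
    n_rows = 0
    with open(src_path) as src, open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            writer.writerow(preprocess(line))
            n_rows += 1
    return n_rows


# Tiny demo input so the sketch runs on its own.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "processed.csv")
with open(src_path, "w") as f:
    f.write("a,1\nb,2.5\nc,4\n")

rows_written = process_to_csv(src_path, dst_path)
```

The resulting csv would then presumably be handed to vaex in a separate step, for example with vaex.from_csv(dst_path, convert=True), which as far as I know converts the csv to hdf5 in chunks; that step is left out of the sketch and has not been checked here.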
Or perhaps is Using the example from @dreamflasher, are you suggesting to go for something like the following?
result = []
df = vaex.DataFrame()  # creating empty dataframe, how?
for line in open("file"):
    single_row_df = vaex.DataFrame(
        return_complex_preprocessing_of_line(line))  # most straightforward api to do this?
    df = df.concat(single_row_df)
Is this performant? (As in: how many times does it write to disk, or perform expensive things like big allocations or resizing/reshaping underlying files? And wouldn't it perform unneeded, wasteful steps, like creating a brand new disk-backed DataFrame for every single row?) Plus, this idiom doesn't seem possible, as creating an empty dataframe doesn't seem to be supported yet, judging by #936. (Correct me if I'm wrong) |
I think what is described here is out of scope for vaex, at least for the time being. The general approach is:
If you are starting from a non-memory mappable file (like csv, json etc..), the idea is:
While
If you need to accumulate the data, with some preprocessing along the way, I would say look at Arrow, since it supports some streaming functionality. Then you can export that directly to an Arrow file or Parquet, and vaex is ready to consume it. |
Vaex claims "Vaex is a library for dealing with larger than memory DataFrames (out of core).", but never actually shows how to do it. Yes, you can create it from other files, but there seems to be no way to create a new file from scratch?
There is df1 = vaex.open("somedata.hdf5"), but dataframes seem to lack an append method.