read-stream-dataset-inplace fails on large arrow file #126
I created a large arrow file on disk (14 GB) and tried to read it with read-stream-dataset-inplace. It fails with:

Comments
Awesome, at least it failed as opposed to reading bad data. Will dive into this, thanks.
Do you have an R script for creating this file? Maybe I can create one from the taxi cab example earlier.
I created it from a large CSV file quite easily, using 2 R libraries. Starting from a 15 GB CSV (which I just "cat"ed together), I read the CSV in batches and wrote it to arrow format in batches as well.
It looks like this, but without batched reading or writing.
It works similarly for batched reading/writing: use readr::read_csv_chunked and call writer$write(x) in the callback function passed into readr::read_csv_chunked.
Ah, got it. I guess you could also just keep calling
I fixed the exception; the code in in_place.clj calls the wrong sub-buffer function.
But now it returns only the first "batch". I did indeed write the large file in several batches. How am I supposed to get the full dataset (the combination of all batches)?
I attempted to fix the base issue itself and checked it in. I have not yet put in place a pathway to load all record-batches. That involves tracking dictionary-batches, applying differential changes to the dictionaries, and then producing a dataset per batch. So a really large arrow file will produce a sequence of datasets.
That code (read in 0 or more dictionaries, then the next record batch) would just go in a loop or sequence construct in order to return a sequence of datasets instead of only the first one. The arrow specification is conflicted about dictionaries: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout. I am currently finishing up a blog post outlining memory mapping and some basic performance comparisons against the Java Arrow API, so I have not yet gotten into working with large multi-record-batch arrow files. You are brave :-) but I am really excited you are going for it.
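For illustration only, a rough sketch of that loop/sequence construct, assuming a seq of decoded arrow stream messages is available; dictionary-batch?, update-dictionaries, and record-batch->dataset below are hypothetical stand-ins, not functions that exist in in_place.clj:

;; Hypothetical helpers, declared only so the sketch compiles.
(declare dictionary-batch? update-dictionaries record-batch->dataset)

(defn messages->dataset-seq
  "Walk a seq of arrow stream messages, folding dictionary batches (and
  dictionary deltas) into the running dictionary map and lazily emitting
  one dataset per record batch."
  [messages dictionaries]
  (lazy-seq
   (when-let [msg (first messages)]
     (if (dictionary-batch? msg)
       ;; 0 or more dictionary batches may precede a record batch; apply
       ;; each one (including differential updates) to the dictionary map.
       (messages->dataset-seq (rest messages)
                              (update-dictionaries dictionaries msg))
       ;; A record batch becomes one dataset; continue with the rest.
       (cons (record-batch->dataset msg dictionaries)
             (messages->dataset-seq (rest messages) dictionaries))))))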
The work here, I think, will go past getting things to load and into how to build a sane interface for a sequence of datasets as opposed to a single one. I think the dataset protocols will work, but of course without doing it no one knows. Things like filter need to be implemented in a multi-dataset-aware fashion. Currently concat-copying works at the reader level, and this imposes a per-index cost on each element access, which I believe is unacceptable. With a multi-dataset concept, assuming the datasets share a schema, you could do select, filter, and group-by in smarter ways.
Another thing about the article you linked to from R: R's filter takes an AST, so it can analyse the AST to strip out entire datasets when, for instance, a column's min and max lie outside the filter's inclusion set. This is quite an optimization, and you can't do it generically with opaque functions that just return true or false; you would have to know that the filter is a greater-than/less-than style comparison.
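To make both of these points concrete, here is a minimal sketch (not anything the library currently provides): the filter is described as plain data, e.g. [:> "fare_amount" 50.0], so it can be inspected, and column-min, column-max, and filter-dataset are hypothetical stand-ins for per-column statistics and an existing single-dataset filter:

;; Hypothetical helpers, declared only so the sketch compiles.
(declare column-min column-max filter-dataset)

(defn filter-one
  "Apply a data-described filter to one dataset, returning nil when the
  column's min/max prove that no row can possibly match."
  [[op colname value :as filter-desc] ds]
  (case op
    :> (when (> (column-max ds colname) value)
         (filter-dataset filter-desc ds))
    :< (when (< (column-min ds colname) value)
         (filter-dataset filter-desc ds))
    (filter-dataset filter-desc ds)))

(defn filter-many
  "Multi-dataset-aware filter: run a data-described filter over a sequence
  of schema-compatible datasets, dropping whole datasets that cannot match,
  and keep the result as a sequence rather than concat-copying."
  [filter-desc dataset-seq]
  (keep #(filter-one filter-desc %) dataset-seq))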
I slightly modified read-stream-dataset-inplace so that it returns a lazy sequence of datasets; in my 15 GB case I had around 1000 of them. Then I did a frequency count of the values of one column, aggregated over all datasets. It was all very straightforward and very fast: it took roughly 90 seconds to go over the 15 GB. So we are definitely on the right track for working with large data from Clojure.
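For reference, a minimal sketch of that kind of aggregation, assuming the modified read path yields a lazy sequence of datasets and that a column can be pulled from a dataset by name (map lookup) and treated as a seq of values:

(defn column-frequencies
  "Frequency-count the values of one column across a lazy sequence of
  datasets, merging the per-dataset counts; only one record batch's
  dataset needs to be realized at a time."
  [dataset-seq colname]
  (->> dataset-seq
       (map (fn [ds] (frequencies (seq (get ds colname)))))
       (reduce (partial merge-with +) {})))

;; Usage, e.g.:
;; (column-frequencies (read-stream-dataset-inplace-seq "big.arrow")
;;                     "passenger_count")
;; where read-stream-dataset-inplace-seq names the hypothetical lazy variant
;; and "passenger_count" is just an illustrative column name.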
I played a bit more with it and the overall experience is very, very good. Having the "very large data" as a lazy sequence of datasets makes exploratory analysis very convenient and fast. It seems that the combination of lazy Clojure sequences + mmapped files gives very good results.
For me it is therefore perfectly fine to change the behavior of read-stream-dataset-inplace to return a lazy sequence of datasets. The only "issue" with this is that, with an externally produced arrow file, we might not be able to control the batch size and might get the worst case (all the data in one huge batch). In my case I did not see the issue you described above about multiple dictionaries.
I will now test your new code.
This is actually amazing. You may literally be the first person on the JVM to load and work with datasets of this size in this way. Normally this would have meant setting up a Hadoop cluster with Spark and a ton more drama, and you can now do this work on a laptop. What a solid validation of the design; I am just really happy with this outcome so far.
The new code works; we can close this issue.
We were just discussing this precise claim. Very cool.