
read-stream-dataset-inplace fails on large arrow file #126

Closed
behrica opened this issue Jul 31, 2020 · 17 comments

@behrica
Contributor

behrica commented Jul 31, 2020

I created a large arrow file on disk (14 GB) and tried to read it with read-stream-dataset-inplace.

It fails with:

. Unhandled java.lang.IllegalArgumentException
  Value out of range for int: 14779929936

                  RT.java: 1248  clojure.lang.RT/intCast
                base.cljc:  107  tech.v2.datatype.base$sub_buffer/invokeStatic
                base.cljc:  105  tech.v2.datatype.base$sub_buffer/invoke
             datatype.clj:  418  tech.v2.datatype/sub-buffer
             datatype.clj:  413  tech.v2.datatype/sub-buffer
              RestFn.java:  425  clojure.lang.RestFn/invoke
             in_place.clj:   59  tech.libs.arrow.in-place/read-message
             in_place.clj:   46  tech.libs.arrow.in-place/read-message
             in_place.clj:   73  tech.libs.arrow.in-place/message-seq
             in_place.clj:   70  tech.libs.arrow.in-place/message-seq
             in_place.clj:  308  tech.libs.arrow.in-place/read-stream-dataset-inplace
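
For reference, a minimal sketch of the call (assuming read-stream-dataset-inplace is simply handed the path of the .arrow file):

(require '[tech.libs.arrow.in-place :as arrow-in-place])

;; ~14 GB arrow stream file, written in several record batches
(def ds (arrow-in-place/read-stream-dataset-inplace "big.arrow"))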
@cnuernber
Collaborator

Awesome, at least it failed as opposed to reading bad data. Will dive into this, thanks.

@cnuernber
Collaborator

Do you have an R script for creating this file? Maybe I can make one from the taxi cab example earlier.

@behrica
Contributor Author

behrica commented Aug 1, 2020

I created it from a large CSV file quite easily, using two R libraries:
readr::read_csv_chunked

and
arrow::RecordBatchStreamWriter

Starting from a 15 GB CSV (which I just "cat"ed together), I read the CSV in batches and wrote it to the Arrow format in batches as well.
I will send you some example code.

@behrica
Contributor Author

behrica commented Aug 1, 2020

It looks like this, but without batched reading or writing, so it only works for CSVs which fit into R memory:

library(readr)
library(arrow)

df <- readr::read_csv("in.csv")

# write the whole data frame as a single record batch in the Arrow stream format
tf <- "out.arrow"
file_obj <- FileOutputStream$create(tf)
batch <- record_batch(df)
writer <- RecordBatchStreamWriter$create(file_obj, batch$schema)
writer$write(batch)
writer$close()
file_obj$close()

It works similarly with batched reading/writing: use readr::read_csv_chunked and call writer$write(x) in the callback function passed to readr::read_csv_chunked.

@cnuernber
Collaborator

Ah, got it. I guess you could also just keep calling writer$write(batch) in a loop and get a similar result. I am downloading the taxi data; that could also be written to a (quite large) batched arrow file.

@behrica
Contributor Author

behrica commented Aug 1, 2020

I fixed the exception.

The code in in_place.clj calls the wrong sub-buffer function: it does not call the one on the NativeBuffer record, but the generic one in base.cljc, which int-casts the offset (the RT/intCast in the stack trace above) and therefore overflows for offsets beyond 2 GB.
In this line, for example:

(let [new-msg (Message/getRootAsMessage (-> (dtype/sub-buffer data offset msg-size)

But now it returns only the first "batch". I did indeed write the large file in several batches, and I get back a dataset whose number of rows is exactly the batch size.

How am I supposed to get the full dataset (the combination of all batches)?

@cnuernber
Collaborator

cnuernber commented Aug 1, 2020

I attempted to fix the base itself and checked it in. I have not yet put a pathway in place to load all record batches. That involves tracking dictionary batches, applying differential changes to the dictionaries, and then producing a dataset per batch. So a really large arrow file will produce a sequence of datasets -

(defn read-stream-dataset-inplace

That code (read in 0 or more dictionaries, then the next record batch) would just need to go into a loop or sequence construct in order to return a sequence of datasets instead of only the first one (see the sketch at the end of this comment). The Arrow specification is conflicted about dictionaries:

https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout

I am currently finishing up a blog post outlining memory mapping and some basic performance comparisons to the Java Arrow API, so I have not yet gotten into working with large multi-record-batch arrow files. You are brave :-) but I am really excited you are going for it.
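
In outline, the change is roughly this shape (a sketch only; read-next-dataset! is a hypothetical stand-in for the existing "read 0..n dictionary batches plus one record batch" step, not existing API):

(defn dataset-seq
  "Lazily yield one dataset per record batch.  `read-next-dataset!` is a
  hypothetical stand-in for the existing single-batch reading logic and is
  expected to return nil once the message stream is exhausted."
  [read-next-dataset! stream]
  (lazy-seq
    (when-let [dataset (read-next-dataset! stream)]
      (cons dataset (dataset-seq read-next-dataset! stream)))))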

@cnuernber
Collaborator

The work here, I think, will go past just getting things to load and into thinking about how to build a sane interface for a sequence of datasets as opposed to a single one. I think the dataset protocols will work, but of course without doing it no one knows. Things like filter need to be implemented in a multi-dataset-aware fashion. Currently concat-copying works at the reader level, which imposes a per-index cost on each element access, and I believe that is unacceptable. With a multi-dataset concept, assuming the datasets share a schema, you could do select, filter, and group-by in smarter ways.

@cnuernber
Collaborator

cnuernber commented Aug 1, 2020

Another thing about the article you linked to from R: R's filter takes an AST, so it can analyse the AST and strip out entire datasets when, for instance, a column's min and max lie outside the filter inclusion set. This is quite an optimization, and you can't get it generically with plain predicate functions that return true or false; you would have to know the function is a greater-than/less-than type filter.
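
For illustration, the batch-skipping part of such an optimization could look roughly like this over a sequence of per-record-batch datasets (a sketch only; column-min-max and filter-one are hypothetical helpers, not existing API):

(defn range-filter-batches
  "Drop whole per-batch datasets whose [min, max] range for `colname` cannot
  intersect the filter interval [lo, hi], then apply `filter-one` -- whatever
  single-dataset row-level filter you already have -- to the surviving batches.
  `column-min-max` is a hypothetical helper returning {:min ... :max ...}
  for one column of one dataset."
  [column-min-max filter-one colname lo hi dataset-seq]
  (->> dataset-seq
       (remove (fn [dataset]
                 (let [{cmin :min cmax :max} (column-min-max dataset colname)]
                   ;; the whole batch lies outside [lo, hi] -- skip it entirely
                   (or (< cmax lo) (> cmin hi)))))
       (map filter-one)))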

@behrica
Contributor Author

behrica commented Aug 1, 2020

I slightly modified read-stream-dataset-inplace so that it returns a lazy sequence of datasets,
one per record batch.

In my 15 GB case, I had around 1000 of them.

Then I did a frequency count of the values of one column and aggregated over all datasets (so I did not try to concat the datasets).
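
In code that was roughly the following (a sketch; datasets is the lazy sequence of per-record-batch datasets, "some-col" stands in for the actual column name, and ds/column is tech.ml.dataset's column accessor):

(require '[tech.ml.dataset :as ds])

;; per-batch frequency maps, merged across all ~1000 batches
(->> datasets
     (map #(frequencies (ds/column % "some-col")))
     (apply merge-with +))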

It was all very straightforward, and it was very fast: it took roughly 90 seconds to go over the 15 GB.
This is already an amazing result, in my view.

So we are definitely on the right track for working with large data from Clojure.

@behrica
Contributor Author

behrica commented Aug 1, 2020

I played a bit more with it and the overall experience is very, very good.

Having the "very large data" as a lazy sequence of datasets makes exploratory analysis very convenient and fast.

It seems that the combination of "lazy Clojure" + "mmapped files" gives very good results.
Working in the REPL feels very fast, and things are only "slow" when expected (when access to all 15 GB is indeed needed). And it is a "good" slowness, in the sense that the REPL stays stable after the operation finishes. There is of course no Java GC pressure.

@behrica
Contributor Author

behrica commented Aug 1, 2020

For me it is therefore perfectly fine to change the behavior of "read-stream-dataset-inplace" to return a lazy sequence of datasets.
(one dataset per record batch)

The only "issue" with this, that by using an "external arrow file", we might not be able to control the batch size, and might get the worst case (= all data in one huge batch)

In my case I did not see the issue you described above about "Multiple dictionaries".

@behrica
Contributor Author

behrica commented Aug 1, 2020

I will now test your new code.

@cnuernber
Collaborator

This is actually amazing. You may literally be the first person on the JVM to load and work with datasets of this size in this way. Normally this would have meant setting up a Hadoop cluster with Spark and a ton more drama, and you can now do this work on a laptop. What a solid validation of the design; I am just really happy with this outcome so far.

@behrica
Contributor Author

behrica commented Aug 1, 2020

The new code works; we can close this issue.

@behrica behrica closed this as completed Aug 1, 2020
@harold
Contributor

harold commented Aug 1, 2020

It seems that the combination of "lazy Clojure" + "mmapped files" gives very good results.

We were just discussing this precise claim. Very cool.

@behrica
Contributor Author

behrica commented Aug 1, 2020

This is actually amazing. You may literally be the first person on the JVM to load and work with datasets of this size in this way. Normally this would have meant setting up a Hadoop cluster with Spark and a ton more drama, and you can now do this work on a laptop.
-> A VM with 2 cores and 8 GB of RAM in total
