read-stream-dataset-inplace has memory leak #149

Closed · behrica opened this issue Oct 23, 2020 · 18 comments

behrica (Contributor) commented Oct 23, 2020

Running this code on a directory with around 10 GB of arrow files (2000 small ones) fails with OutOfMemory:

(def arrows
  (->> (file-seq (io/file "my-dir"))
       (filter #(.isFile %))
       (mapv #(ds/info (arrow/read-stream-dataset-inplace (.getPath %)))) ;; leaks memory
       ;; (mapv #(arrow/visualize-arrow-stream %))                        ;; does not leak
       ))

The created datasets are not fully garbage collected.
Using visualize-arrow-stream instead works, with no OOM.

behrica (Author) commented Oct 24, 2020

It happens with both resource types, :gc and the stack-based context.
But maybe I am doing something wrong.

The above code should run with little need for heap space, shouldn't it?
I read one small dataset after another and then discard it.
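
For reference, a stack-scoped version of the loop might look like the sketch below (assuming the tech.v3.resource API; note that only the primitive row count escapes the scope, since returning the dataset itself would hand out buffers that the closing scope releases):

(require '[tech.v3.resource :as resource])

;; Read each file inside its own stack scope so native buffers are
;; released deterministically when the scope exits.
(def row-counts
  (->> (file-seq (io/file "my-dir"))
       (filter #(.isFile %))
       (mapv (fn [f]
               (resource/stack-resource-context
                (ds/row-count
                 (arrow/read-stream-dataset-inplace (.getPath f))))))))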

behrica (Author) commented Oct 24, 2020

I am not at all an expert in memory-leak analysis, but I used VisualVM and saw a large number of tech.resource.GCReference objects that were not garbage collected:
[screenshot: VisualVM heap dump showing many uncollected tech.resource.GCReference instances]

cnuernber (Collaborator) commented:

Hmm, definitely should not leak regardless.

behrica (Author) commented Oct 24, 2020

And they seem to be referenced by (or to reference) the NativeBuffers:

[screenshot: heap dump showing GCReference objects linked to NativeBuffers]

behrica (Author) commented Oct 24, 2020

It is related to this line:

(resource/track (constantly fdata)))))

Removing it makes the mapv finish.
So it leaks less, but at the end I am still at 3.5 GB heap use, while I would expect nearly the same as before (500 MB on a fresh JVM).
The ds/info data is very small.

cnuernber (Collaborator) commented Oct 24, 2020

This is because, in my haste, I used resource/track, which defaults to stack-based resources, and you do not have a stack-based resource context open. I am going to rework the resource system to log warnings if stack-based resource management is used while no resource context is open.

In general, gc-based resource management works fine, and stack-based management is only necessary at times. There are other places where resource/track may be used, and it is a bit of a hidden snake in the grass. Really, only the top-level object (like the mmap'd file) needs to be tracked this way; the rest of the resources can be chained to the top one via the gc mechanism.
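
A minimal sketch of that chaining idea (the :track-type and :dispose-fn option keys are from tech.v3.resource; the release helper is hypothetical):

(require '[tech.v3.resource :as resource])

(defn release-native-region!
  "Hypothetical stand-in for whatever actually unmaps the native
  region at `address`."
  [address]
  (println "releasing native region at" (Long/toHexString address)))

;; GC-track only the top-level mapping. The dispose-fn must capture the
;; raw address, never the tracked object itself; closing over the object
;; would keep it strongly reachable and it would never be collected.
(defn track-mapping [fdata address]
  (resource/track fdata {:track-type :gc
                         :dispose-fn #(release-native-region! address)}))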

cnuernber (Collaborator) commented:

This is similar to the other issue yesterday; I am really glad you are finding these now, in reproducible ways. These are the types of issues I sometimes get an email about when someone is way deep into a problem, and then it takes forever to tease out the reason.

cnuernber (Collaborator) commented:

Fixed in 5.00-alpha-11

behrica (Author) commented Oct 24, 2020

I think this is not fully solved.

On my directory of example files, it still ends with OOM.

I replicated your test:

(vec (repeatedly 200 #(mapv meta (vals (arrow/read-stream-dataset-inplace "10m.arrow")))))

and it indeed works.

But it fails with my files.

I think it got "better" compared to the last version:

[screenshot: heap usage graph over both runs]

The first peaks were from the simple example files, the last from "my" files.

The last peak corresponds to this piece of code:

(def copipes-of-files
  (->> (file-seq (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/"))
       (filter #(.isFile %))
       (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %))))))

I will share my files with you, so you can reproduce it.

behrica (Author) commented Oct 24, 2020

Maybe I was too quick. I see that the heap usage slowly goes down by itself.
[screenshot: heap usage slowly decreasing over time]

I will try with only 2 GB of heap.

cnuernber (Collaborator) commented Oct 24, 2020

You should not get an out-of-memory error. The GC may start running a lot, however. If things can be GC'd at all, then I would say there isn't a leak, because I switched everything to simply use the GC for cleaning up the resources; there is no stack resource context in play any more.

behrica (Author) commented Oct 25, 2020

@cnuernber
With this data: https://www.dropbox.com/s/6f6mz250t97ealx/allscreenings.zip?dl=0

The following code gives an OOM for me with a 2 GB heap.
All of the arrow files are small; the largest is 44 MB.

(ns tech.v3.hanging
  (:require [clojure.java.io :as io]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

(defn count-rows-arrow []
  (println
   (->> (file-seq (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/"))
        (filter #(.isFile %))
        (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %)))))))

So there is a memory leak somewhere, in my view.
At least with some types of arrow files.

I did experiments with multiple copies of other arrow files, the same amount of data in total, and that worked well.

So it seems to be related to the content of the arrow files.

I would suggest reopening the issue.

behrica (Author) commented Oct 25, 2020

I found a single arrow file which results in OOM when read repeatedly:

./screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow

https://www.dropbox.com/s/8g86n593jenrnvt/screenings.arrow?dl=0

(ns tech.v3.hanging
  (:require [clojure.java.io :as io]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

(defn count-rows-arrow [args]
  (println
   (->> (repeat 2000 (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow"))
        (filter #(.isFile %))
        (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %)))))))

behrica (Author) commented Oct 25, 2020

[screenshot: heap dump at the moment of OOM]

cnuernber reopened this Oct 25, 2020
cnuernber (Collaborator) commented Oct 25, 2020

OK, looking into this again. Perhaps using the resource system to track native-buffer derivatives is a mistake; I could just have a member variable on the child that points to the parent. Thanks for your patience in staying on this!
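
The parent-pointer idea in sketch form (illustrative names only, not the actual dtype-next types):

;; The child holds a strong reference to its parent, so the parent's
;; native memory cannot be reclaimed while any child is reachable --
;; no resource/track call per derived buffer is needed.
(deftype SubBuffer [parent ^long offset ^long n-elems])

(defn sub-buffer
  "Create a view over `parent` without registering a new resource;
  the `parent` field alone keeps the parent alive."
  [parent ^long offset ^long n-elems]
  (SubBuffer. parent offset n-elems))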

cnuernber (Collaborator) commented:

5.00-alpha-12 made your example much, much faster and stopped it from thrashing the GC.

cnuernber (Collaborator) commented:

This issue is due to loading the dictionaries out of the file. We cannot currently load those on demand, so we load them as fast as possible while the rest of the file is loaded on demand.

For larger files, avoiding dictionaries for string columns would avoid problems like this and allow a potentially faster streaming pathway, iff the column has a low repetition count among its string members.
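
A hedged helper for making that call at write time: a distinct/row ratio near 1.0 means the strings are almost unique, so a dictionary mostly adds load-time overhead (the column name in the usage comment is made up):

(defn distinct-ratio
  "Fraction of rows whose string value is distinct; near 1.0 means the
  column is a poor fit for dictionary encoding."
  [dataset colname]
  (/ (double (count (distinct (ds/column dataset colname))))
     (double (ds/row-count dataset))))

;; e.g. (distinct-ratio screenings "test-name") ;; hypothetical column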

behrica (Author) commented Oct 26, 2020

I can confirm the problem is solved.
I see very nice behaviour sweeping over 10 GB of arrow files on disk.
