read-stream-dataset-inplace has memory leak #149

Closed · behrica opened this issue Oct 23, 2020 · 18 comments

behrica (Contributor) commented Oct 23, 2020

Running this code on a directory with around 10 GB of arrow files (2000 small ones) fails with OutOfMemory:

(def arrows
  (->> (file-seq (io/file "my-dir"))
       (filter #(.isFile %))
       (mapv #(ds/info (arrow/read-stream-dataset-inplace (.getPath %)))) ;; leaks memory
       ;; (mapv #(arrow/visualize-arrow-stream %))                        ;; does not leak
       ))

The created datasets are not fully garbage collected.
Using visualize-arrow-stream instead works, with no OOM.

behrica (Author) commented Oct 24, 2020

It happens with both resource types, :gc and the stack-based context.
But maybe I am doing something wrong.

The above code should run with little need for heap space, shouldn't it?
I read one small dataset after another and then discard it.
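
For reference, a stack-scoped version of the loop might look like the sketch below (assuming the tech.v3.resource API; note that only the primitive row count escapes the scope, since returning the dataset itself would hand out buffers that the closing scope releases):

(require '[tech.v3.resource :as resource])

;; Read each file inside its own stack scope so native buffers are
;; released deterministically when the scope exits.
(def row-counts
  (->> (file-seq (io/file "my-dir"))
       (filter #(.isFile %))
       (mapv (fn [f]
               (resource/stack-resource-context
                (ds/row-count
                 (arrow/read-stream-dataset-inplace (.getPath f))))))))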

behrica (Author) commented Oct 24, 2020

I am not at all an expert in memory-leak analysis, but I used VisualVM and saw a large number of tech.resource.GCReference objects that were not garbage collected:
[screenshot: VisualVM heap dump showing many uncollected tech.resource.GCReference instances]

cnuernber (Collaborator) commented:

Hmm, definitely should not leak regardless.

behrica (Author) commented Oct 24, 2020

And they seem to be referenced by (or to reference) the NativeBuffers:

[screenshot: heap dump showing GCReference objects linked to NativeBuffers]

behrica (Author) commented Oct 24, 2020

It is related to this line:

(resource/track (constantly fdata)))))

Removing it makes the mapv finish.
So it leaks less, but at the end I am still at 3.5 GB heap use, while I would expect nearly the same as before (500 MB on a fresh JVM).
The ds/info data is very small.

cnuernber (Collaborator) commented Oct 24, 2020

This is because, in my haste, I used resource/track, which defaults to stack-based resources, and you do not have a stack-based resource context open. I am going to rework the resource system to log warnings if stack-based resource management is used while no resource context is open.

In general, gc-based resource management works fine, and stack-based management is only necessary at times. There are other places where resource/track may be used, and it is a bit of a hidden snake in the grass. Really, only the top-level object (like the mmap'd file) needs to be tracked this way; the rest of the resources can be chained to the top one via the gc mechanism.
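
A minimal sketch of that chaining idea (the :track-type and :dispose-fn option keys are from tech.v3.resource; the release helper is hypothetical):

(require '[tech.v3.resource :as resource])

(defn release-native-region!
  "Hypothetical stand-in for whatever actually unmaps the native
  region at `address`."
  [address]
  (println "releasing native region at" (Long/toHexString address)))

;; GC-track only the top-level mapping. The dispose-fn must capture the
;; raw address, never the tracked object itself; closing over the object
;; would keep it strongly reachable and it would never be collected.
(defn track-mapping [fdata address]
  (resource/track fdata {:track-type :gc
                         :dispose-fn #(release-native-region! address)}))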

cnuernber (Collaborator) commented:

This is similar to the other issue yesterday; I am really glad you are finding these now, in reproducible ways. These are the types of issues I sometimes get an email about when someone is way deep into a problem, and then it takes forever to tease out the reason.

cnuernber (Collaborator) commented:

Fixed in 5.00-alpha-11

behrica (Author) commented Oct 24, 2020

I think this is not fully solved.

On my directory of example files, it still ends with OOM.

I replicated your test:

(vec (repeatedly 200 #(mapv meta (vals (arrow/read-stream-dataset-inplace "10m.arrow")))))

and it indeed works.

But it fails with my files.

I think it got "better" compared to the last version:

[screenshot: heap usage graph over both runs]

The first peaks were from the simple example files, the last from "my" files.

The last peak corresponds to this piece of code:

(def copipes-of-files
  (->> (file-seq (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/"))
       (filter #(.isFile %))
       (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %))))))

I will share my files with you, so you can reproduce it.

behrica (Author) commented Oct 24, 2020

Maybe I was too quick. I see that the heap usage slowly goes down by itself.
[screenshot: heap usage slowly decreasing over time]

I will try with only 2 GB of heap.

cnuernber (Collaborator) commented Oct 24, 2020

You should not get an out-of-memory error. The GC may start running a lot, however. If things can be GC'd at all, then I would say there isn't a leak, because I switched everything to simply use the GC for cleaning up the resources; there is no stack resource context in play any more.

behrica (Author) commented Oct 25, 2020

@cnuernber
With this data: https://www.dropbox.com/s/6f6mz250t97ealx/allscreenings.zip?dl=0

The following code gives an OOM for me with a 2 GB heap.
All of the arrow files are small; the largest is 44 MB.

(ns tech.v3.hanging
  (:require [clojure.java.io :as io]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

(defn count-rows-arrow []
  (println
   (->> (file-seq (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/"))
        (filter #(.isFile %))
        (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %)))))))

So there is a memory leak somewhere, in my view.
At least with some types of arrow files.

I did experiments with multiple copies of other arrow files, the same amount of data in total, and that worked well.

So it seems to be related to the content of the arrow files.

I would suggest reopening the issue.

behrica (Author) commented Oct 25, 2020

I found a single arrow file which results in OOM when read repeatedly:

./screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow

https://www.dropbox.com/s/8g86n593jenrnvt/screenings.arrow?dl=0

(ns tech.v3.hanging
  (:require [clojure.java.io :as io]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

(defn count-rows-arrow [args]
  (println
   (->> (repeat 2000 (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow"))
        (filter #(.isFile %))
        (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %)))))))

behrica (Author) commented Oct 25, 2020

[screenshot: heap dump at the moment of OOM]

cnuernber reopened this Oct 25, 2020
cnuernber (Collaborator) commented Oct 25, 2020

OK, looking into this again. Perhaps using the resource system to track native-buffer derivatives is a mistake; I could just have a member variable on the child that points to the parent. Thanks for your patience in staying on this!
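
The parent-pointer idea in sketch form (illustrative names only, not the actual dtype-next types):

;; The child holds a strong reference to its parent, so the parent's
;; native memory cannot be reclaimed while any child is reachable --
;; no resource/track call per derived buffer is needed.
(deftype SubBuffer [parent ^long offset ^long n-elems])

(defn sub-buffer
  "Create a view over `parent` without registering a new resource;
  the `parent` field alone keeps the parent alive."
  [parent ^long offset ^long n-elems]
  (SubBuffer. parent offset n-elems))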

cnuernber (Collaborator) commented:

5.00-alpha-12 made your example much, much faster and stopped it from thrashing the GC.

cnuernber (Collaborator) commented:

This issue is due to loading the dictionaries out of the file. We cannot currently load those on demand, so we load them as fast as possible while the rest of the file is loaded on demand.

For larger files, avoiding dictionaries for string columns would avoid problems like this and allow a potentially faster streaming pathway, iff the column has a low repetition count among its string members.
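
A hedged helper for making that call at write time: a distinct/row ratio near 1.0 means the strings are almost unique, so a dictionary mostly adds load-time overhead (the column name in the usage comment is made up):

(defn distinct-ratio
  "Fraction of rows whose string value is distinct; near 1.0 means the
  column is a poor fit for dictionary encoding."
  [dataset colname]
  (/ (double (count (distinct (ds/column dataset colname))))
     (double (ds/row-count dataset))))

;; e.g. (distinct-ratio screenings "test-name") ;; hypothetical column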

behrica (Author) commented Oct 26, 2020

I can confirm the problem is solved.
I see very nice behaviour sweeping over 10 GB of arrow files on disk.
