read-stream-dataset-inplace has memory leak #149
It happens with both resource types, :gc and stacked context. The above code should run with little need for heap space, shouldn't it?
Hmm, it definitely should not leak regardless.
It is related to this line:
Removing that one makes the mapv finish.
This is because, in my haste, I used resource/track, which defaults to stack-based resources, and you do not have a stack-based resource context open. I am going to rework the resource system to log warnings when stack-based resource management is used and no resource context is open. In general, gc-based resource management works fine; stack-based management is only necessary at certain times. There are other places where resource/track may be used, and it is a bit of a hidden snake in the grass. Only the top-level object (like the mmap file) really needs to be tracked this way; the rest of the resources can be chained to the top one via the gc mechanism.
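For reference, the difference between the two track types can be sketched like this. A minimal sketch: the `tech.v3.resource` option names are from memory, and `alloc!`, `release!`, and `use-buf` are hypothetical placeholders for a native allocation, its cleanup, and its use.

```clojure
(require '[tech.v3.resource :as resource])

;; GC-based tracking needs no open context -- the dispose-fn runs
;; sometime after the tracked object becomes unreachable:
(def buf (resource/track (alloc!) {:track-type :gc
                                   :dispose-fn release!}))

;; Stack-based tracking only works inside an open resource context;
;; the dispose-fn runs deterministically when the context unwinds:
(resource/stack-resource-context
 (let [buf (resource/track (alloc!) {:track-type :stack
                                     :dispose-fn release!})]
   (use-buf buf)))
```

Tracking with `:track-type :stack` outside any `stack-resource-context` is the situation described above, where nothing ever releases the resource.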
This is similar to the other issue yesterday. I am really glad you are finding these now in reproducible ways. These are the types of issues I sometimes get an email about when someone is way deep into a problem, and then it takes forever to tease out the reason.
Fixed in
I think this is not fully solved. On my directory of example files, it still ends with OOM. I replicated your test:
and it works indeed. But it fails with my files. I think it got "better" compared to the last version: the first peaks were from simple example files, the last from "my" files. The last peak corresponds to this piece of code:
I will share my files with you, so you can reproduce it.
You should not get an out-of-memory error. The GC may start running a lot, however. If things can be GC'd at all, then I would say there isn't a leak, because I switched everything to simply use the GC for cleaning up resources; there is no stack resource context in play any more.
@cnuernber The following code gives OOM for me with 2 GB heap.
So there is a memory leak somewhere, in my view. I did experiments with multiple copies of other arrow files with the same amount of data in total, so it seems to be related to the content of the arrow files. I would suggest reopening the issue.
I found a single arrow file which results in OOM when read repeatedly: ./screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow https://www.dropbox.com/s/8g86n593jenrnvt/screenings.arrow?dl=0

```clojure
(ns tech.v3.hanging
  (:require [clojure.java.io :as io]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

(defn count-rows-arrow [args]
  (println
   (->> (repeat 2000 (io/file "/home/carsten/Dropbox/sources/tablecloth/allscreenings/screenings/TEST_SUGARS_METABOLIC_DISEASE/screenings.arrow"))
        (filter #(.isFile %))
        (mapv #(ds/row-count (arrow/read-stream-dataset-inplace (.getPath %)))))))
```
OK, looking into this again. Perhaps using the resource system to track native buffer derivatives is a mistake; I could just have a member variable on the child that points to the parent. Thanks for your patience in keeping on this!
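The alternative mentioned here, keeping a strong reference from the child buffer to its parent, might look roughly like the following sketch. All names are illustrative, not the library's actual API: as long as a derived view is reachable, its parent (and the parent's native memory) stays alive, with no resource-system bookkeeping needed for the child at all.

```clojure
;; A derived view holds its parent in a field.  The JVM GC will not
;; reclaim the parent while any SubBuffer referencing it is reachable,
;; so only the top-level parent needs a tracked dispose-fn.
(defrecord SubBuffer [parent offset length])

(defn sub-buffer
  "Return a view into parent covering [offset, offset+length)."
  [parent offset length]
  (->SubBuffer parent offset length))
```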
This issue is due to loading the dictionaries out of the file. We cannot currently load those on demand, so we load them as fast as possible while the rest of the file is loaded on demand. For larger files, avoiding dictionaries for string columns would avoid problems like this and allow a potentially faster streaming pathway iff the column has a low repetition count among its string members.
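If you write the files yourself, one way to sidestep dictionaries entirely is to write string columns as plain text. A minimal sketch, assuming a `:strings-as-text?` option on the tech.v3.libs.arrow writer; both the writer function name and the option name are from memory and worth verifying against the library's documentation:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; Write the dataset with string columns inlined as text instead of
;; dictionary-encoded, so readers need not load dictionaries eagerly.
(let [ds (ds/->dataset {:name ["a" "b" "a"] :value [1 2 3]})]
  (arrow/write-dataset-to-stream! ds "no-dict.arrow"
                                  {:strings-as-text? true}))
```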
I can confirm the problem is solved.
Running this code on a directory with around 10G of arrow files (2000 small ones)
fails with OutOfMemory.
The created datasets are not fully garbage collected.
It works using visualize-arrow-stream; no OOM.