
[Regression] Memory Leak in decision trees #2787

Closed
ogrisel opened this issue Jan 23, 2014 · 18 comments · Fixed by #2790
ogrisel commented Jan 23, 2014

Here is a reproduction script. I used a random tree to make it run much faster but I think this impacts all tree implementations. I used a very large output dimension to make the leak more visible:

import gc
import os
import psutil
import numpy as np
from sklearn.tree import ExtraTreeRegressor

X = np.random.normal(size=(100, 50))
Y = np.random.normal(size=(100, int(5e4)))

p = psutil.Process(os.getpid())

def print_mem():
    # psutil >= 2.0 renamed get_memory_info() to memory_info()
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()

for i in range(3):
    et = ExtraTreeRegressor(max_features=1).fit(X, Y)
    del et
    gc.collect()
    print_mem()

Here is the output on the current master:

86MB
171MB
251MB
331MB

The output on sklearn 0.14.1:

74MB
74MB
74MB
74MB
@jnothman (Member)

Are you able to run the same test before my data structure PR was merged?


ogrisel commented Jan 23, 2014

I will adapt the script to make it bisectable.

@jnothman (Member)

Great.


@larsmans (Member)

I just added a print statement to each __cinit__ and each __dealloc__:

71MB
alloc RegressionCriterion
alloc Splitter
alloc Tree
dealloc Splitter
dealloc RegressionCriterion
156MB
alloc RegressionCriterion
alloc Splitter
alloc Tree
dealloc Splitter
dealloc RegressionCriterion
237MB
alloc RegressionCriterion
alloc Splitter
alloc Tree
dealloc Splitter
dealloc RegressionCriterion
317MB

Looks like Tree.__dealloc__ isn't being called.
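A lightweight way to confirm from pure Python that an object is never actually destroyed (a hypothetical illustration with a stand-in class, not scikit-learn code) is `weakref.finalize`, whose callback fires only when the object is truly deallocated:

```python
import weakref

class Tree:  # stand-in for the Cython Tree class
    pass

t = Tree()
# the callback runs only when t is actually destroyed; if a leak
# keeps t alive, fin.alive stays True even after `del t`
fin = weakref.finalize(t, print, "Tree deallocated")
del t
print(fin.alive)  # False: CPython's refcounting freed it immediately
```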


ogrisel commented Jan 23, 2014

I confirm this is the culprit:

e7e7133912332f5ee7933963081ac20c0eacd450 is the first bad commit
commit e7e7133912332f5ee7933963081ac20c0eacd450
Author: Joel Nothman <joel.nothman@gmail.com>
Date:   Thu Jan 16 09:32:16 2014 +1100

    ENH/FIX Change Tree underlying data structure

    * avoid segfault (#2726) by setting base on numpy arrays
    * A struct array for tree nodes and values array (size indeterminate at
    compile time) are the only underlying structure, so each node is locally
    grouped memory, and joblib dumps will save only two numpy files per
    tree.
    * predict() uses the value array's take method, reducing code repetition

:040000 040000 88c8b1171cb2c35865d57c6856e0be65346bc73b 826be8ab9fab8c6f8c29b0d4b1e3e75e66cb4595 M  sklearn
bisect run success

@larsmans (Member)

It looks like arr.base = <PyObject*> self in Tree._finalize is creating a reference cycle, but sys.getrefcount doesn't report a higher value than when I don't call fit...

@larsmans (Member)

Oh wait, I should be looking at the refcount of et.tree_. That one's 4 after fit.


ogrisel commented Jan 23, 2014

@larsmans have you tried valgrind with http://svn.python.org/projects/stackless/trunk/Misc/valgrind-python.supp ? I have not tried yet myself but I have to go offline now. Maybe later.


glouppe commented Jan 23, 2014

Well indeed, since value_ndarray and node_ndarray both hold a reference to the tree object, its reference counter never goes to zero, and __dealloc__ never gets called. Any way to solve this cycle? @jnothman
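The shape of the problem can be illustrated in pure Python (hypothetical `Tree`/`ArrayWrapper` names, not the actual Cython classes). Note that a pure-Python cycle is still reclaimed by the cyclic collector; the Cython case was worse because even an explicit `gc.collect()` did not free the trees, as the reproduction script shows:

```python
import gc
import weakref

class ArrayWrapper:
    """Stand-in for an ndarray whose .base points back at its owner."""
    def __init__(self, owner):
        self.base = owner  # back-reference: this closes the cycle

class Tree:
    def __init__(self):
        self.values = ArrayWrapper(owner=self)  # Tree -> array -> Tree

t = Tree()
r = weakref.ref(t)
del t
print(r() is not None)  # True: refcounting alone cannot free the cycle
gc.collect()
print(r() is None)      # True here; in the Cython case even gc.collect()
                        # failed, so __dealloc__ never ran
```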

@larsmans (Member)

Never mind my previous comment, there was a flaw in my reasoning.

Still, the trick to break the cycle is, I think, to introduce an intermediate class, say Owner, that owns the ndarrays as well as the malloc'd buffers they refer to. Each Tree instance keeps an Owner instance and accesses the buffers through it, while the ndarrays have it as their base. The Owner should not itself refer to any Python object.
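A pure-Python sketch of that Owner idea (names hypothetical; the real classes are Cython, and a bytearray stands in for the malloc'd buffer): the Owner holds only raw storage and never references the Tree, so no reference chain ever points back at the Tree and no cycle can form:

```python
import weakref
import numpy as np

class Owner:
    """Owns the raw storage; crucially, it never references the Tree."""
    def __init__(self, n_bytes):
        self.buf = bytearray(n_bytes)  # stand-in for a malloc'd buffer

class Tree:
    def __init__(self, n_bytes=64):
        self.owner = Owner(n_bytes)
        # the exported array's lifetime is tied to the buffer, not the Tree
        self.values = np.frombuffer(self.owner.buf, dtype=np.uint8)

t = Tree()
v = t.values        # a user keeps the array around
r = weakref.ref(t)
del t
print(r() is None)  # True: no cycle, plain refcounting frees the Tree
print(int(v.sum()))  # 0: the array keeps the underlying buffer alive
```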

@jnothman (Member)

Sorry about this. I've just skimmed your comments and will try to look into it a little later. Right, of course there's a cycle... A solution worth trying is to create the wrapper ndarrays (and set base) only where we use them.



glouppe commented Jan 23, 2014

@larsmans I am not sure your solution is correct. Since there is still a cycle between Owner and the arrays (they reference each other), both will always have a positive reference count and will never be deallocated.


glouppe commented Jan 23, 2014

> A solution worth trying is to create the wrapper ndarrays (and set base) only where we use them.

Indeed, this will work. Never use (and store) arrays within the tree, but wrap the memory segment when exposing it to the external world. Reminds me of something... ;)
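A minimal sketch of that on-demand approach (a hypothetical pure-Python stand-in for the Cython Tree, with a bytearray in place of the malloc'd node storage): nothing cached on the instance ever has the instance as its base, so no cycle can form:

```python
import weakref
import numpy as np

class Tree:
    """Stores only the raw buffer; ndarray views are built on demand."""
    def __init__(self, n=16):
        self._raw = bytearray(n)  # stand-in for the malloc'd node storage

    @property
    def value(self):
        # a fresh, short-lived wrapper per access; never stored on self
        return np.frombuffer(self._raw, dtype=np.uint8)

t = Tree()
assert int(t.value.sum()) == 0
r = weakref.ref(t)
del t
print(r() is None)  # True: freed by refcounting alone, no GC pass needed
```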


glouppe commented Jan 23, 2014

> @larsmans I am not sure your solution is correct. Since there is still a cycle between Owner and the arrays (they reference each other), both will always have a positive reference count and will never be deallocated.

After some thought, I think we can make it work. Since with this solution the tree gets deallocated, the trick would be to break the cycle in Tree.__dealloc__, i.e. remove the references in Owner to the ndarrays. This way, the ndarrays get deallocated if there is no external reference to them, which in turn makes Owner deallocatable as well.

@jnothman (Member)

I still don't see the need for an intermediary.

However, a first hack (https://github.com/jnothman/scikit-learn/commits/tree_without_cycles) at generating the ndarray wrappers only on demand (never mind that predict is a little less efficient; we'll fix that soon if it shows) is segfaulting in tests where node attributes are retrieved, and I've not worked out why yet.



glouppe commented Jan 23, 2014

> I still don't see the need for an intermediary.

You are right, this is not strictly necessary. I would be happier without adding such an intermediary.

@larsmans (Member)

@glouppe In the setup I had in mind, the Owner would not actually keep a reference to the arrays, just a pointer to their memory. But if we can do without such an ugly hack, then by all means we should :)

@jnothman (Member)

See #2790 where I've got a version without reference cycles.


@amueller modified the milestones: 0.16, 0.15 (Jul 15, 2014)