Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) #8002

Merged
merged 19 commits into from Jan 18, 2017

Conversation

@raghavrv
Copy link
Member

@raghavrv raghavrv commented Dec 7, 2016

Fixes #7811

  • Fixes memory leak in MAE (Using safe_realloc instead of master's alloc/calloc-and-forget-about-previously-held-memory approach took care of the memory leak.)
  • Uniformly use safe_realloc everywhere (Added StackRecord* and PriorityHeapRecord* to fused type realloc_ptr`)
  • Acquire GIL only when an error needs to be raised
  • Propagate all errors to python interpreter level. (Use except * when appropriate)

cc: @glouppe @jmschrei @nelson-liu @agramfort @lesteve

Is it okay to release gil in safe_realloc? Any reason why it was not done before?

Also ref my mails to cython-devel:

@raghavrv raghavrv added the Bug label Dec 7, 2016
@raghavrv raghavrv added this to the 0.19 milestone Dec 7, 2016
@nelson-liu
Copy link
Contributor

@nelson-liu nelson-liu commented Dec 7, 2016

Thanks for tackling this, @raghavrv . Did you run memory benchmarks to verify that safe_realloc fixes the issue? I thought it was more deep seated than that...(though I'd be happy if it was not 😄 )

@lesteve
Copy link
Member

@lesteve lesteve commented Dec 7, 2016

I ran the snippet from #7811 (comment) on this PR and the memory leak seems to be taken care of.

Not a cython expert so I won't comment about the actual changes introduced by this PR.

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 7, 2016

Do you think such a snippet could be added as a test? I don't think there is precedence for memory leak tests. But I think we should have... Maybe as a more general test as a part of #4841?

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 7, 2016

@nelson-liu Yeah! Loic's snippet seems to be stable in this PR...

@lesteve
Copy link
Member

@lesteve lesteve commented Dec 7, 2016

Do you think such a snippet could be added as a test?

It takes plus than one minute to run so the short answer is probably not in its current form.

We have some memory usage based tests in joblib. Look at https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L368 and https://github.com/joblib/joblib/blob/master/joblib/test/common.py#L40 for example. In our experience they tend to be a bit brittle (you need the memory peak to be not too short-lived for memory_profiler to have a chance to see it ...).

@jmschrei
Copy link
Member

@jmschrei jmschrei commented Dec 7, 2016

From a cython POV this looks good, but I haven't tested it to make sure it fixes the issue.

if array == NULL:
# no free; __dealloc__ handles that
return -1
self.array_ = array

This comment has been minimized.

@nelson-liu

nelson-liu Dec 7, 2016
Contributor

not quite relevant to the issue at hand, but do you think these same fixes should be applied for the other data structures in utils.pyx?

This comment has been minimized.

@raghavrv

raghavrv Dec 8, 2016
Author Member

Indeed it can be extended to all the data structures but I first need to confirm if releasing gil in safe_realloc is safe... I can't figure out why it was not done so far... @glouppe or @jmschrei can help us figure that out maybe? :) If the current changes look okay, then I can replace all malloc calls with safe_realloc... There is also yet another line in tree.pyx where safe_realloc was explicitly avoided because it used to not release gil...

The thing is tree tests are not very extensive as they need to be and I am paranoid that these kind of clean ups extraneous to issue at hand might break user code in subtle ways...

Thx for raising the question!

This comment has been minimized.

@jmschrei

jmschrei Dec 8, 2016
Member

It should be fine. I'm guessing it wasn't put in originally because there was no need to. If it's not using and GIL operations anyway then there shouldn't be a difference.

@raghavrv raghavrv requested a review from glouppe Dec 8, 2016
return -1
self.array_ = array
with gil:
safe_realloc(&self.array_, self.capacity)

This comment has been minimized.

@tguillemot

tguillemot Dec 8, 2016
Contributor

That's a stupid question (I'm not an expert in Cython :[ ): why do you need with gil here and not in l346 ?

This comment has been minimized.

@lesteve

lesteve Dec 8, 2016
Member

As a non expert in cython, I had exactly the same thought :)

This comment has been minimized.

@nelson-liu

nelson-liu Dec 8, 2016
Contributor

I think that some of the changes proposed include editing safe_realloc to be a nogil function, so as a nonexpert in cython I feel like l346 is correct and the GIL block isn't necessary here...

This comment has been minimized.

@raghavrv

raghavrv Dec 8, 2016
Author Member

I started out with a gil blocking safe_realloc and tried to make it release gil to see how it performs. I forgot to update here... Will push a commit. Thanks for the catch!

This comment has been minimized.

@tguillemot

tguillemot Dec 8, 2016
Contributor

No problem it's also a way to me to learn cython :)

raise MemoryError("could not allocate (%d * %d) bytes"
% (nelems, sizeof(p[0][0])))
with gil:
raise MemoryError("could not allocate (%d * %d) bytes"

This comment has been minimized.

@jnothman

jnothman Dec 8, 2016
Member

These exceptions aren't propagated i.e. I don't think it will actually end the function's execution.

This comment has been minimized.

@jnothman

jnothman Dec 8, 2016
Member

Try

%%cython
cdef a() nogil:
  with gil:
    raise ValueError('Something broke')
def b():
    a()

This comment has been minimized.

@raghavrv

raghavrv Dec 9, 2016
Author Member

True but having except * ensures that it propagates back to the calling function (and then to interpreter)

%%cython
cdef void a() nogil except *:
  with gil:
    raise ValueError('Something broke')
def b():
    a()
b()

This comment has been minimized.

@jnothman

jnothman Dec 9, 2016
Member

Okay, I forgot those details. Haven't checked now. Just check that all nogil paths continue to propagate by that means.

This comment has been minimized.

@raghavrv

raghavrv Dec 12, 2016
Author Member

Okay. Thanks!

And I think as long as gil is acquired when we raise the error, it will propagate when except * construct is used... Infact raising cannot be done without gil for the same reason... I think this is why the whole safe_realloc was not made to release gil? I now changed it to acquire gil only when the memory is not sufficient and hence error has to be raised...

This comment has been minimized.

@raghavrv

raghavrv Dec 12, 2016
Author Member

With that confirmation could I have your +1 for merge? This would be nice to have as long as there aren't any side effects...

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 12, 2016

From a cython POV this looks good, but I haven't tested it to make sure it fixes the issue.

@jmschrei With Loic's confirmation that it fixes the issue, are you +1 for merge?

@@ -341,8 +343,7 @@ cdef class WeightedPQueue:
cdef void reset(self) nogil:

This comment has been minimized.

@jnothman

jnothman Dec 12, 2016
Member

but doesn't this require an except clause??

This comment has been minimized.

@jnothman

jnothman Dec 12, 2016
Member

And same for push?

This comment has been minimized.

@jnothman

jnothman Dec 12, 2016
Member

and every nogil function that calls them?

This comment has been minimized.

@jmschrei

jmschrei Dec 12, 2016
Member

This actually brings up a bigger issue. Will you need to change every function/method in the tree directory which calls safe_realloc to make sure it propogates the error back?

@jnothman
Copy link
Member

@jnothman jnothman commented Dec 13, 2016

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 20, 2016

Sorry for the delay in the response...

I think we should use except * by default for all compiled methods / functions written in cython... But releasing the gil for the whole function should not affect propagation of error. It's the except *... Releasing gil raises a compile time exception that errors cannot be raised without gil... Hence apart from the current changes (which acquire gil only when an error needs to be raised), I think adding except * for all cdef functions should solve it... I'll push a commit in a moment...

Eitherways I think we need to fix this memory leak soon...

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 20, 2016

Or for this PR let's have the except * only for functions that would end up calling safe_realloc and revisit this issue of adding "except * in every cdef" at a later time...

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 20, 2016

In nested calls, it seems like the calling function need not have except Arghh I was wrong the try.. except makes it print. The calling function and every higher level function needs to have this except *

%%cython

cdef void gil_noexcept():
    raise ValueError('At gil_noexcept: This will not be propagated')
    
cdef void nogil_noexcept() nogil:
    # Acquire gil only when you want to raise
    with gil:
        raise ValueError('At nogil_noexcept: This will also not be propagated')
    
cdef void gil_except() except *:
    raise ValueError('At gil_except: This will be propagated')

cdef void nogil_except() nogil except *:
    # Acquire gil only when you want to raise
    with gil:
        raise ValueError('At nogil_except: This will also be propagated')
            
cdef void super_function() except *:
    for fn in (gil_noexcept, nogil_noexcept,
               gil_except, nogil_except):
        try:
            fn()
        except Exception as e:
            print(e)

def python_fn():
    super_function()

python_fn()
At gil_except: This will be propagated
At nogil_except: This will also be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.gil_noexcept'
ValueError: At gil_noexcept: This will not be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.nogil_noexcept'
ValueError: At nogil_noexcept: This will also not be propagated
@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 21, 2016

So from cython docs -

If you don’t do anything special, a function declared with cdef that does not return a Python object has no way of reporting Python exceptions to its caller. If an exception is detected in such a function, a warning message is printed and the exception is ignored. If you want a C function that does not return a Python object to be able to propagate exceptions to its caller, you need to declare an exception value for it.

There is also a third form of exception value declaration:

cdef int spam() except *:

This form causes Cython to generate a call to PyErr_Occurred() after every call to spam, regardless of what value it returns. If you have a function returning void that needs to propagate errors, you will have to use this form, since there isn’t any return value to test. Otherwise there is little use for this form.

I think it's safe to assume that all cdef declarations need an except * (not except -1) since some functions return -1 and some have void as the return type...

@jnothman @glouppe @jmschrei WDYT? Can I also request a comment from other cython experts (@larsmans or @jakevdp) here please :)

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Dec 21, 2016

(Manually checking which ones call functions (that call others maybe and so on) that raise an error is error-prone / could change in the future and we do not have tests for these kind of memory errors / leaks.

Which is why I'm suggesting adding "except *" for all cdef declarations as a best practice in tree code...)

@raghavrv raghavrv force-pushed the raghavrv:mae_mem_leak branch from be31ba0 to 086506c Dec 21, 2016
@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

I was thinking of merging this and fixing it in another PR... But appveyor is not green anyhow. I'll fix it here... And thanks for pointing out, that is correct safe_realloc and except would suffice to raise MemoryErrors...

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

And done... The appveyor seems to be stuck in the queue for a long time...

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

(Or it is taking it's revenge on me for pushing so many times :@)

@jnothman
Copy link
Member

@jnothman jnothman commented Jan 18, 2017

A bit of both. I've just cancelled a whole lot of AppVeyor runs to see if it'll come unstuck. I've not yet found one in progress (not queued, not finished, not cancelled), so I don't know why it's stuck.

@NelleV
Copy link
Member

@NelleV NelleV commented Jan 18, 2017

I don' think the appveyor problem is related to scikit-learn. Matplotlib currently has the same problem.

@jnothman
Copy link
Member

@jnothman jnothman commented Jan 18, 2017

Indeed even the cancelling operation has very slow throughput.

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

According to them, this build seems to be the issue... Anyway now I can see one build running...

@jnothman
Copy link
Member

@jnothman jnothman commented Jan 18, 2017

But things are starting again, now.

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

Thanks!! Meanwhile can you take a look at the final commit and approve. The appveyor should pass in about 30 minutes. I'll merge in an hour...

Copy link
Member

@jnothman jnothman left a comment

Yes, I think those duplicate errors are correctly removed.

@jnothman jnothman changed the title [MRG+2] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) [MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) Jan 18, 2017
@raghavrv raghavrv merged commit 4907029 into scikit-learn:master Jan 18, 2017
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 18, 2017

Thanks a lot for the reviews everyone!

@raghavrv raghavrv deleted the raghavrv:mae_mem_leak branch Jan 18, 2017
@glouppe
Copy link
Member

@glouppe glouppe commented Jan 19, 2017

Thanks for the fix!

(and sorry about not having time for a review, my bandwidth has been quite limited these days for scikit-learn, unfortunately...)

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 19, 2017

@jnothman Do we need a bugfix whatsnew for this?

@raghavrv
Copy link
Member Author

@raghavrv raghavrv commented Jan 19, 2017

(and sorry about not having time for a review, my bandwidth has been quite limited these days for scikit-learn, unfortunately...)

@glouppe No problem :) I'll have a new PR very soon for you ;P

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
@Tommalla
Copy link

@Tommalla Tommalla commented Aug 14, 2017

Not sure whether this or #8623 is the more appropriate place to write this, but I believe I have an issue with RandomForestRegressor leaking memory when used with the MAE criterion.

My setup is the following: I'm drawing some learning curves, so I'm training and evaluating the model on increasingly bigger fragments of the dataset (I divide it into 30 "chunks").

My dataset is 6640 x 7, numpy.float32.

With every new iteration of train -> predict -> replace the model variable with a new instance, I get a significant increase in memory consumption, up to the point when I run out of memory (10GB, around 9GB free for the experiment).

If I skip the learning curves and just train the model once on the whole dataset, it fits nicely and consumes around 3GB of memory. Likewise, if I change the criterion to MSE and leave the other parameters unchanged, I can run the whole program, including incremental training, and the memory usage doesn't go above 200MB.

I believe the leak happens somewhere in the predict method, as reducing the number of predict calls after the model is trained significantly reduces the memory consumption (around 800MB per predict for that dataset).

My current configuration:

>>> import platform; print(platform.platform())
Linux-4.12.4-1-ARCH-x86_64-with-arch-Arch-Linux
>>> import sys; print("Python", sys.version)
Python 3.6.2 (default, Jul 20 2017, 03:52:27) 
[GCC 7.1.1 20170630]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.19.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.2

Parameters passed to the RandomForestRegessor are: n_estimators=5, max_features='auto', criterion='mae'

@jnothman
Copy link
Member

@jnothman jnothman commented Aug 15, 2017

@Tommalla
Copy link

@Tommalla Tommalla commented Aug 15, 2017

@jnothman My bad, missed the new release and thought this was included in 0.18.2. Works fine now.

Thank you!

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

10 participants