
[MRG+2] Fix #6420 Cloning decision tree estimators breaks criterion objects #7680

Merged
merged 1 commit into scikit-learn:master from olologin:6420_dt_criterion on Oct 19, 2016

Conversation

olologin commented Oct 16, 2016

Reference Issue

Fixes #6420

What does this implement/fix? Explain your changes.

This fix changes the __reduce__ methods of the criterion objects, which are used for pickling/cloning.

There was a memory-management problem: the function sizet_ptr_to_ndarray from tree/_utils.pyx returned an np.array object that did not own its underlying memory. In some cases we call __reduce__ on a criterion object; __reduce__ returns the info needed for pickling/cloning, including the n_classes array (for ClassificationCriterion), and then the Criterion object gets deallocated. If we read the info returned by __reduce__ after that, n_classes is filled with garbage:

from sklearn.tree import tree
import copy
import numpy as np

gini = tree._criterion.ClassificationCriterion(n_outputs=1, n_classes=np.array([2]))
gini.__reduce__()
copy.deepcopy(gini).__reduce__()
    Out[1]: (sklearn.tree._criterion.ClassificationCriterion, (1, array([4314681712])), {})

I also added __reduce__ methods to the classes which didn't have them, because otherwise the class of an object changes to a more general one after unpickling/cloning. For example, after cloning a Gini criterion we would get a ClassificationCriterion:

from sklearn.tree import tree
import copy
import numpy as np

gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))
gini_copy = copy.deepcopy(gini)
gini_copy.__class__
    Out[1]: sklearn.tree._criterion.ClassificationCriterion

Any other comments?

Make suggestions

olologin commented Oct 16, 2016

@amueller, @jnothman

Hmm, travis-ci fails with:

ValueError: sklearn.tree._criterion.Criterion has the wrong size, try recompiling. Expected 160, got 152
ImportError: cannot import name _tree

I saw this error on my local machine (Python 2.7, Ubuntu 16.04), but it disappeared after make clean. Can we do something similar on the server? It looks like a cache-related problem to me.

nelson-liu commented Oct 17, 2016

@olologin , yup that's an issue with the travis cache. maybe time to look into #7094 again?

@olologin olologin changed the title [WIP] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 17, 2016

raghavrv commented Oct 17, 2016

Wait, I can reset the cache for you. Thanks for working on this; I'll review in a while.

@amueller amueller added this to the 0.18.1 milestone Oct 17, 2016

amueller commented Oct 17, 2016

Maybe @jmschrei also wants to check it out, or @glouppe ?

amueller commented Oct 17, 2016

please add a test!

         free(self.n_classes)

     def __reduce__(self):
         return (ClassificationCriterion,
                 (self.n_outputs,
-                 sizet_ptr_to_ndarray(self.n_classes, self.n_outputs)),
+                 sizet_ptr_to_ndarray(self.n_classes, self.n_outputs).copy()),

raghavrv Oct 17, 2016

Why do you make a copy?

olologin Oct 18, 2016

Because the numpy array returned by

sizet_ptr_to_ndarray(self.n_classes, self.n_outputs)

doesn't own its underlying memory (I've added a comment in sizet_ptr_to_ndarray); it just holds a pointer to the values stored in the C array self.n_classes. If you get this tuple from __reduce__ and the object it came from is then deallocated, you can't safely read the tuple anymore (you can, actually, but self.n_classes has been freed, so the returned numpy array shows garbage, because it still dereferences the C pointer it was given). The copy method makes a deep copy, and the resulting array owns its memory.

I could alternatively create arr = np.zeros(self.n_outputs) and copy all the values into it in a loop. But the sizet_ptr_to_ndarray().copy() approach already existed in _tree.pyx before my change, so I thought it would be more consistent to use it here too.
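
The ownership distinction described above can be demonstrated in pure NumPy (a minimal sketch, not the actual Cython code): np.frombuffer builds an array that merely views an external buffer, while .copy() produces one that owns its data and is unaffected by what later happens to the buffer.

```python
import numpy as np

# A mutable bytearray stands in for the C array self.n_classes.
buf = bytearray(np.array([2, 3, 4], dtype=np.intp).tobytes())

view = np.frombuffer(buf, dtype=np.intp)  # wraps buf, no copy made
owned = view.copy()                       # allocates fresh memory

print(view.flags["OWNDATA"], owned.flags["OWNDATA"])  # False True

# Mutating the buffer (standing in for the freed-and-reused C array)
# shows through the view but not through the copy.
buf[0] = 9
print(view[0], owned[0])  # 9 2
```

The garbage values in the bug report are exactly what `view` shows once the buffer it wraps is gone.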

@@ -596,6 +602,12 @@ cdef class Gini(ClassificationCriterion):
            = 1 - \sum_{k=0}^{K-1} count_k ** 2
     """

+    def __reduce__(self):

jnothman Oct 18, 2016

Why did we not need these before? Is it not enough to use type(self) as the first member of the tuple in {Classification,Regression}Criterion?

olologin Oct 18, 2016

Hmm, you're right, it's enough to use type(self) in {Classification,Regression}Criterion.
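
The type(self) idiom suggested here can be sketched in plain Python (hypothetical Criterion/Gini classes, not the real Cython ones): because __reduce__ reports the dynamic type, every subclass reconstructs as itself without needing its own __reduce__.

```python
import copy
import pickle

class Criterion:
    def __init__(self, n_outputs):
        self.n_outputs = n_outputs

    def __reduce__(self):
        # type(self), not a hard-coded Criterion, so subclasses
        # round-trip through pickle/copy as their own class.
        return (type(self), (self.n_outputs,))

class Gini(Criterion):
    pass

g = Gini(1)
print(type(pickle.loads(pickle.dumps(g))).__name__)  # Gini
print(type(copy.deepcopy(g)).__name__)               # Gini
```

Hard-coding the base class in the returned tuple is precisely what collapsed Gini into ClassificationCriterion in the original report.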

@@ -62,7 +62,10 @@ cdef inline UINT32_t our_rand_r(UINT32_t* seed) nogil:


 cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
-    """Encapsulate data into a 1D numpy array of intp's."""
+    """Encapsulate data into a 1D numpy array of intp's.
+
+    You should ensure that the provided data pointer is not freed while the returned array
jnothman Oct 18, 2016

It seems like everywhere this is used it is now used with a copy. I propose we just do the copying here. WDYT? @raghavrv?

jnothman commented Oct 18, 2016

Thanks for both your great investigation and patch!

jnothman left a comment

I'm not sure if we want a separate non-regression test for clone... And apart from these nitpicks and making sizet_ptr_to_ndarray do the copying, this LGTM

@@ -1609,3 +1612,22 @@ def test_mae():
    dt_mae.fit([[3],[5],[3],[8],[5]],[6,7,3,4,3], [0.6,0.3,0.1,1.0,0.3])
    assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
    assert_array_equal(dt_mae.tree_.value.flat, [4.0, 6.0, 4.0])

def test_criterion_copy():

jnothman Oct 18, 2016

PEP8: extra blank line, please

    n_classes = np.arange(3)
    for _, typename in CRITERIA_CLF.items():
        criteria = typename(n_outputs, n_classes)
        typename_, (n_outputs_, n_classes_), _ = copy(criteria).__reduce__()

jnothman Oct 18, 2016

Please check deepcopy too, just to be sure

typename_, (n_outputs_, n_samples_), _ = copy(criteria).__reduce__()
assert_equal(typename, typename_)
assert_equal(n_outputs, n_outputs_)
assert_equal(n_samples, n_samples_)

jnothman Oct 18, 2016

pedantic: should have newline at end of file

@olologin olologin force-pushed the olologin:6420_dt_criterion branch 4 times, most recently from 0b7992c to 04a5437 Oct 18, 2016

@@ -65,7 +65,7 @@ cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
     """Encapsulate data into a 1D numpy array of intp's."""
     cdef np.npy_intp shape[1]
     shape[0] = <np.npy_intp> size
-    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data)
+    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data).copy()

raghavrv Oct 19, 2016

@jnothman Thanks for suggesting this change. Indeed this is better.

@olologin Could you also remove the copy from this line, and update the comment there to note that the function now returns a copy and does not refer to the passed pointer...

Additionally, it would be nice to change the docstring of this helper to note that we return a copy of the numpy array, rather than simply 'Encapsulate data' as it currently says...

olologin Oct 19, 2016

Done.

    for copy_func in [copy.copy, copy.deepcopy]:
        for _, typename in CRITERIA_CLF.items():
            criteria = typename(n_outputs, n_classes)
            result = copy_func(criteria).__reduce__()

raghavrv Oct 19, 2016

Rather could you pickle, load all the Criteria?

Currently this fails in master...

import pickle
from sklearn.tree._criterion import FriedmanMSE

fmse = FriedmanMSE(1, 10)
pickle.loads(pickle.dumps(fmse))

olologin Oct 19, 2016

I've added pickling to this list.
Pickle and copy can behave differently in general, but for the criteria in particular they use the same __reduce__ method.
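
That pickle and copy converge on the same hook can be checked directly (a toy class, not the sklearn criteria): when no __copy__/__deepcopy__ methods are defined, copy.copy and copy.deepcopy fall back to the __reduce_ex__/__reduce__ protocol, just as pickle does.

```python
import copy
import pickle

calls = []

class Tracked:
    def __init__(self, x=0):
        self.x = x

    def __reduce__(self):
        calls.append("reduce")  # record each time the hook fires
        return (Tracked, (self.x,))

t = Tracked(5)
pickle.dumps(t)
copy.copy(t)
copy.deepcopy(t)
print(calls)  # the hook fired once per operation: three entries
```

This is why fixing __reduce__ repairs cloning and pickling in one stroke.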

from sklearn import datasets

from sklearn.utils import compute_sample_weight

 CLF_CRITERIONS = ("gini", "entropy")
-REG_CRITERIONS = ("mse", "mae")
+REG_CRITERIONS = ("mse", "mae", "friedman_mse")

raghavrv Oct 19, 2016

Oh boy wasn't this tested at all ;(

jnothman commented Oct 19, 2016

LGTM, thanks!

@jnothman jnothman changed the title [MRG] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG+1] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 19, 2016

@olologin olologin force-pushed the olologin:6420_dt_criterion branch 2 times, most recently from ac24db4 to bc3c530 Oct 19, 2016

olologin commented Oct 19, 2016

Build on one of travis-ci configurations fails with:

build_tools/travis/test_script.sh: line 23: 4113 Segmentation fault (core dumped) nosetests -s --with-timer --timer-top-n 20 sklearn

I wonder if it's related to my changes. I didn't see this error on travis-ci after the first commits; it appeared only when I changed the new-array creation to .copy(). Is it a coincidence? If not, I think I should rewrite sizet_ptr_to_ndarray so that it just copies all values from the C array into a freshly created ndarray, avoiding the fancy PyArray_SimpleNewFromData machinery.

jnothman commented Oct 19, 2016

Yes, I was a little uncertain about the .copy approach. Perhaps it's not so reliable, despite my LGTM. Use np.frombuffer?

raghavrv left a comment

I've cleared the cache for this PR and restarted the travis build. I don't think the current changes cause the failure.

@@ -542,13 +542,12 @@ cdef class Tree:
reaching node i.
"""
     # Wrap for outside world.
-    # WARNING: these reference the current `nodes` and `value` buffers, which
+    # WARNING: these copy the current `nodes` and `value` buffers, which

raghavrv Oct 19, 2016

No they still reference. Only the n_classes is copied. This comment should be left intact :)

jnothman commented Oct 19, 2016

I don't think the current changes cause the failure.

I suspect they do :)

raghavrv commented Oct 19, 2016

I pulled this PR and pushed it to my fork; Travis passes smoothly there. The restart-build option in Travis doesn't clear the cache before retrying, which is why this PR still fails after I cleared its cache :)

I think @olologin needs to force push to rebuild this PR without any cache side effects...

jnothman commented Oct 19, 2016

Oh, okay. You're probably right.


olologin commented Oct 19, 2016

@raghavrv mystical behaviour of Travis :), it fails with the same problems here again.

@olologin olologin force-pushed the olologin:6420_dt_criterion branch from cdb6133 to a895cd4 Oct 19, 2016

raghavrv commented Oct 19, 2016

Argh it now reuses the old travis cache. Apologies for putting you through this... :/ As a final attempt, could you squash all commits into one and force push? (I've cleared your cache again)

nelson-liu commented Oct 19, 2016

What usually worked for me is to push trivial commits to the files that weren't being recythonized (a comment will do), let them pass and get cached, then revert those changes.

@olologin olologin force-pushed the olologin:6420_dt_criterion branch from a895cd4 to da5a85b Oct 19, 2016

raghavrv commented Oct 19, 2016

I was looking for an automated way to handle all such future failures and came up with this:

curl -Ls https://goo.gl/jhGwkV | git apply -v --index; git commit -m "NOMERGE recythonize all";

nelson-liu commented Oct 19, 2016

I think you also need to recythonize _splitter, no?

@olologin olologin force-pushed the olologin:6420_dt_criterion branch from b41f9e3 to a5fe72e Oct 19, 2016

olologin commented Oct 19, 2016

@nelson-liu, I didn't change that file, so I don't think it needs recompilation. I'll try the approach proposed by @raghavrv now.

nelson-liu commented Oct 19, 2016

Hmm, the errors look like you should have to, but I just always recythonized all the tree files whenever I got the error, for good measure.

@@ -56,6 +56,10 @@ Bug fixes
<https://github.com/scikit-learn/scikit-learn/pull/6178>`_) by `Bertrand
Thirion`_

- Tree splitting criterion classes cloning/pickling are now memory safe

nelson-liu Oct 19, 2016

classes -> classes'?

olologin Oct 19, 2016

You're right, thanks!

@olologin olologin force-pushed the olologin:6420_dt_criterion branch from a5fe72e to 98eaad3 Oct 19, 2016

raghavrv commented Oct 19, 2016

Finally travis passes :D After #7708, one should be able to add [ci recythonize] and force a recythonizing of all the cython sources. That should save all of us some trouble in the future...

@raghavrv raghavrv changed the title [MRG+1] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG+2] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 19, 2016

raghavrv commented Oct 19, 2016

@jnothman please merge if happy with the changes...

jnothman commented Oct 19, 2016

Thanks again, @olologin

@jnothman jnothman merged commit 74e4c42 into scikit-learn:master Oct 19, 2016

3 checks passed: ci/circleci (tests passed on CircleCI), continuous-integration/appveyor/pr (AppVeyor build succeeded), continuous-integration/travis-ci/pr (Travis CI build passed)

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 25, 2016

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge tag '0.18.1' into releases
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge branch 'releases' into dfsg
* releases: (144 commits)

 Conflicts:
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge branch 'dfsg' into debian
* dfsg: (144 commits)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

MLopez-Ibanez pushed a commit to MLopez-Ibanez/scikit-learn that referenced this pull request Feb 9, 2019

MLopez-Ibanez pushed a commit to MLopez-Ibanez/scikit-learn that referenced this pull request Feb 9, 2019
