New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] remove complicated equality checks in clone as __init__ shouldn't touch anything. #5540

Merged
merged 7 commits into from Sep 8, 2016

Conversation

Projects
None yet
8 participants
@amueller
Member

amueller commented Oct 22, 2015

Simplifies the check in clone that checks if __init__ and get_params play nicely together.

Previously, we could clone objects that copy numpy arrays in __init__. Afterwards we can not.

Doing anything in init is not a good idea.
I have an additional check for == because Pipeline is evil.
Working on it.

Note: this is a fix for #5522.

@amueller

This comment has been minimized.

Member

amueller commented Oct 22, 2015

Pipeline is only evil if steps is a numpy array, in which case it copies it. Or if something is passed that is not a sequence, in which case it calls list() on it.

The fact that tests pass means that tosequence here always ends up in the second branch.

We could remove the call to tosequence, which will likely not change anything, at least nothing that is tested.

Or, if we want to continue to support steps being non-sequences [I have no idea what that would mean] or make sure that if steps is an array, that it is copied in __init__, then we need a more complicated check in clone.

My intuition is that we should remove the call to tosequenceand we'll be fine.

Feedback very welcome, in particular @GaelVaroquaux and @jnothman.

@amueller amueller changed the title from [WIP] remove complicated equality checks in clone as __init__ shouldn't touch anything. to [MRG] remove complicated equality checks in clone as __init__ shouldn't touch anything. Oct 23, 2015

@amueller

This comment has been minimized.

Member

amueller commented Oct 23, 2015

Opened #5547 for VBGMM and #5549 for GP.

@amueller

This comment has been minimized.

Member

amueller commented Oct 23, 2015

I'm wondering if we should keep the old stuff and raise a deprecation warning if param1 is not param2. Otherwise we could break a lot of users estimators.

@amueller

This comment has been minimized.

Member

amueller commented Oct 23, 2015

Ok after fighting with myself, I left all the old stuff in and adapted the test from #5525.
We should raise a deprecation warning if param1 is not param2 once we fixed our own stuff.

@ogrisel

This comment has been minimized.

Member

ogrisel commented Oct 23, 2015

I opened a PR to your PR here: amueller#29

@ogrisel

This comment has been minimized.

Member

ogrisel commented Oct 23, 2015

Then +1.

@ogrisel

This comment has been minimized.

Member

ogrisel commented Oct 23, 2015

I opened a new issue to discuss the need for deepcopy in clone: #5563.

@amueller

This comment has been minimized.

Member

amueller commented Aug 17, 2016

This should be ready now. @jnothman ?

@amueller amueller changed the title from [MRG] remove complicated equality checks in clone as __init__ shouldn't touch anything. to [MRG + 1] remove complicated equality checks in clone as __init__ shouldn't touch anything. Aug 17, 2016

equality_test = (new_obj_val == params_set_val or
new_obj_val is params_set_val)
# fall back on standard equality
equality_test = param1 == param2

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Aug 17, 2016

Member

is this not going to fail for np.nan? These are not numpy arrays, hence this part of the if/else clause is explored, and:

In [3]: np.nan.__class__.__mro__
Out[3]: (float, object)

In [4]: np.nan == np.nan
Out[4]: False

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Aug 17, 2016

Member

I get it: the "is" test has already been perform above, right?

I am slow.

Maybe a test case with a nan would be a good thing.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Aug 17, 2016

Minor bug that breaks CI, but is probably trivial to fix.

I have a request: could a test be added with a nan as an attribute, so that we are certain that this iteration of the code, or the others, does not break clone with nans (which is a bug that's easy to generate).

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 20, 2016

This is a good idea. I gather the deprecation warning is not happening during testing, except in this case where the test failure is. Can you add an assert_warns there?

@amueller

This comment has been minimized.

Member

amueller commented Aug 24, 2016

Sorry for the slow reply, I'll add the test and fix CI.

@amueller

This comment has been minimized.

Member

amueller commented Aug 24, 2016

There is already a test for estimators with nan here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_base.py#L140

And the deprecation warning is actually raised in a test where clone is supposed to fail. So if your clone fails you will now also always get a deprecation warning in addition to the error. I just pushed a fix for that, and a test.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Aug 24, 2016

@agramfort

This comment has been minimized.

Member

agramfort commented Aug 27, 2016

@amueller you need to rebase.

then +1 for merge

@amueller

This comment has been minimized.

Member

amueller commented Aug 31, 2016

should be good to merge once lights are green

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 31, 2016

./sklearn/base.py:119:80: E501 line too long (81 > 79 characters)

@amueller

This comment has been minimized.

Member

amueller commented Sep 6, 2016

done

@jnothman

This comment has been minimized.

Member

jnothman commented Sep 6, 2016

Test failures!

@jnothman

This comment has been minimized.

Member

jnothman commented Sep 6, 2016

Failure due to whitespace inconsistency in warning message vs test.

@amueller

This comment has been minimized.

Member

amueller commented Sep 7, 2016

should work now.

@jnothman

This comment has been minimized.

Member

jnothman commented Sep 8, 2016

LGTM

@jnothman jnothman merged commit 680ab51 into scikit-learn:master Sep 8, 2016

2 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@amueller

This comment has been minimized.

Member

amueller commented Sep 8, 2016

thanks for the reviews. This is soooo much more robust now :)

@amueller amueller referenced this pull request Sep 13, 2016

Closed

Refactor clone testing #1219

rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016

[MRG + 2] remove complicated equality checks in clone as __init__ sho…
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG + 2] remove complicated equality checks in clone as __init__ sho…
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error
@stdkoehler

This comment has been minimized.

stdkoehler commented Jan 17, 2017

I'm not sure if this is the right place to ask this question. I always receive the

DeprecationWarning: Estimator KerasRegressor modifies parameters in init. This behavior is deprecated as of 0.18 and support for this behavior will be removed in 0.20. % type(estimator).name, DeprecationWarning)

when a parameter of my estimator is a list or an np.array, because the "param1 is param2" gives a false in such case. Is this intended behavior? My example is with KerasRegressor:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
bostondata = load_boston()

X = bostondata.data
y = bostondata.target

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

def dynamic_model(nInputLayer=10, hiddenLayers=[10]):
    # create model
    model = Sequential()
    model.add(Dense(nInputLayer, input_dim=13, init='normal', activation='relu'))
    for hLayer in hiddenLayers:
        model.add(Dense(hLayer, init='normal', activation='relu'))
    model.add(Dense(1, init='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

nInputLayerList = [8, 32]
hiddenLayers = [np.array([5]), np.array([5,5])]
for nlayer in nInputLayerList:
    for hlayer in hiddenLayers:
        krmlp = KerasRegressor(build_fn=dynamic_model, nb_epoch=100, batch_size=5, verbose=0)
        krmlp.set_params(**{'nInputLayer': nlayer, 'hiddenLayers': hlayer})
        estimators = []
        estimators.append(('standardize', StandardScaler()))
        estimators.append(('mlp', krmlp))
        pipeline = Pipeline(estimators)
        kfold = KFold(n_splits=10, random_state=seed)
        results = cross_val_score(pipeline, X, y, cv=kfold)
        print("Dynamic %d input, %s hidden: %.2f (%.2f) MSE" % (nlayer, str(hlayer).strip('[]'), results.mean(), results.std()))

Sorry if this is the wrong place to ask the question. Please delete if so.

@jnothman

This comment has been minimized.

Member

jnothman commented Jan 18, 2017

I presume the issue is not a list or array being passed, but your Sequence instance. Yes, I think that after doing a deepcopy, it's wrong for the code to claim that param1 is param2 "should always be true" :\ @amueller, is the motivation of this PR correct?

@dukebody

This comment has been minimized.

dukebody commented May 13, 2017

Hi! Not sure if this is the right place to ask about this - if not, please point me to the right location.

I'm the maintainer of the sklearn-pandas library. Due to the changes in this PR the DataFrameMapper now emits warnings whenever it's used in some kind of cross-validation: scikit-learn-contrib/sklearn-pandas#76

It is true that the DataFrameMapper constructor modifies parameters in the init method (see https://github.com/paulgb/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40), but those parameters are not numpy ndarrays or sparse matrices, therefore I don't really understand how the current implementation of the clone function (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py#L29) can not raise the mentioned warning, even if the parameters of the old and the new (cloned) dataframe mapper are equal.

Where can I get some guidance about how to make sklearn-pandas compatible with sklearn>=0.20? Thanks.

@jnothman

This comment has been minimized.

Member

jnothman commented May 14, 2017

I'm a little confused. Do you object to the special handling of numpy arrays and sparse matrices because these are not checked for equality, and you have another special case for that? Or do you object to not modifying parameters in __init__.

The principle is to avoid modifying parameters in __init__, and delaying this until fit. A key reason to do so is that those same parameters might be specified through set_params (or even explicit setting of attributes), not through __init__ and hence should only undergo transformation when they need to be interpreted during fit. This could be achieved through properties, but validating/transforming only in fit is the simpler solution which scikit-learn has chosen for a convention.

@dukebody

This comment has been minimized.

dukebody commented Jun 10, 2017

@jnothman sorry, I was not objecting to special handling of numpy arrays and sparse matrices or not modifying parameters in __init__, I was just confused about how the check was implemented and didn't know how to modify the library I'm maintaining to comply with the new requisites.

Now I see that delaying the modification of parameters until the fit method is the way to go and why it was decided this way. Thanks for the explanation! :)

@jnothman

This comment has been minimized.

Member

jnothman commented Jun 11, 2017

dukebody added a commit to scikit-learn-contrib/sklearn-pandas that referenced this pull request Jun 24, 2017

Do not mutate features in __init__
This is to comply with sklearn>=0.20 which requires that
no parameters are mutated in __init__ to be able to make
cross-validation wrappers work easily.

See scikit-learn/scikit-learn#5540

Resolves #76.

Code based on PR #105 by @jimmywan
@dukebody

This comment has been minimized.

dukebody commented Jun 24, 2017

@jnothman Where can I find the documentation about what has to be done to fit these new requirements? We finally figured out how to fix it in scikit-learn-contrib/sklearn-pandas#76 but having some docs explaining what you explained here would surely help other developers.

@jnothman

This comment has been minimized.

Member

jnothman commented Jun 24, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG + 2] remove complicated equality checks in clone as __init__ sho…
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment