New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SelectFromModel - hyperparameters of an estimator are not changed in a Pipeline #7756

Closed
jirispilka opened this Issue Oct 26, 2016 · 10 comments

Comments

Projects
None yet
3 participants
@jirispilka

jirispilka commented Oct 26, 2016

Description

feature_selection.SelectFromModel within Pipeline.
The hyperparameters of estimator (used in SelectFromModel) are not changed in the Pipeline.

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid

N = 100
X = np.random.randn(N, 50)
y = np.random.randint(0, 2, N)

print X.shape, y.shape

pipe = Pipeline([('fs', SelectFromModel(estimator=LogisticRegression(penalty='l1'))),
                 ('lr', LogisticRegression(penalty='l1'))])

parameters = {'fs__estimator__C': [1, 10], 'lr__C': [100]}
param_grid = ParameterGrid(parameters)

for param in param_grid:

    pipe.set_params(**param)
    d = pipe.get_params()
    print('SELECTED:')
    print('SelectFromModel - LogisticRegression: C', d['fs__estimator__C'])

    pipe.fit(X, y)
    print('ACTUAL:')
    print(pipe.named_steps['fs'].estimator_)

Expected Results

The hyperparameters of SelectFromModel.estimator_ should change with set_params method.

Actual Results

Function set_params actually modifies SelectFromModel.estimator but SelectFromModel.estimator_ is used instead with preset params.

The repeated use of SelectFromModel does not change SelectFromModel.estimator_
cf. fit method:

    if not hasattr(self, "estimator_"):
        self.estimator_ = clone(self.estimator)
    self.estimator_.fit(X, y, **fit_params)

Versions

Linux-4.4.0-45-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Jul 1 2016, 15:12:24) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.19.dev0')

@amueller

This comment has been minimized.

Member

amueller commented Oct 26, 2016

that looks like a pretty bad bug....

@amueller amueller added this to the 0.18.1 milestone Oct 26, 2016

@amueller

This comment has been minimized.

Member

amueller commented Oct 26, 2016

sfm = SelectFromModel(estimator=LogisticRegression(penalty='l1'))
sfm.fit(X, y)
sfm.set_params(estimator__C=100)
sfm.fit(X, y)
sfm.estimator_.C

1

That means that in the first call to fit, SelectFromModel creates estimator_ and reuses it later, I think.

@jirispilka

This comment has been minimized.

jirispilka commented Oct 26, 2016

That means that in the first call to fit, SelectFromModel creates estimator_ and reuses it later, I think.

Yes and sorry for mixing this with a pipeline, your code is simpler.
Thanks.

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 26, 2016

Indeed, yuck. I can't remember what motivated this logic, but it's the problem. Calling fit a second time should not reuse an existing estimator_.

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 26, 2016

Yes, a misguided feature added here: https://github.com/scikit-learn/scikit-learn/pull/4242/files#diff-f85e12a6fa5fdaf04b1f164fcc3a636dR119

Should remove it, or permit only with a warm_start=True option.

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 26, 2016

Such an option had existed :( 459cb9b

@amueller

This comment has been minimized.

Member

amueller commented Oct 26, 2016

:( @MechCoder do you remember what happened to 459cb9b? Revert that commit and add a regression test?

@spilkjir no worries, it took me a while to figure out what's going on. Thanks for the report.

@amueller

This comment has been minimized.

Member

amueller commented Oct 26, 2016

Or maybe let's just remove the warm-starting. If someone want's to use the warm-start, they can either use the prefit option, or they can call fit on estimator_.

@amueller

This comment has been minimized.

Member

amueller commented Oct 26, 2016

(or they could abuse partial_fit). Is calling partial_fit after calling fit defined behavior?

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 26, 2016

But this is as released in 0.17.0 so I don't really think it's a 0.18.1 blocker.

Recall that, thanks to clone, this does not break CV, grid search.

It does emphasise the need for tests that fit() is usually from cold.

@amueller amueller removed the Blocker label Oct 27, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment