Avoid calculating feature importance for multiple times in SelectFromModel #15169

Open
qinhanmin2014 opened this issue Oct 10, 2019 · 6 comments


Currently, SelectFromModel calculates feature importance during transform. If users fit the selector on the training set and then transform both the training set and the test set, feature importance is calculated multiple times. Calculating feature importance can be time-consuming (e.g., for a large xgboost model), so I think we should find a way to avoid this.
I don't have a solution. We can't calculate and store feature importance during fit, because with prefit=True we allow users to call transform directly without fitting. We can't store it during transform either, because we're not supposed to add or modify attributes outside of fit.
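
To make the cost concrete, here is a minimal sketch (not from the issue; the dataset and estimator choices are arbitrary) of the pattern described above: the selector is fit once, but each transform call re-reads the estimator's feature importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
selector.fit(X_train, y_train)

# Both calls below re-derive the importances from the fitted estimator;
# with a large model (e.g. xgboost) this repeated work can be expensive.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```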

qinhanmin2014 changed the title from "Avoid Calculating feature importance for multiple times in SelectFromModel" to "Avoid calculating feature importance for multiple times in SelectFromModel" on Oct 10, 2019

jnothman commented Oct 10, 2019 via email

@qinhanmin2014 (Member, Author)

> so in theory it is the estimator's responsibility to memoise.

Are there any examples of this in scikit-learn? I agree it's a good idea, but aren't we only allowed to add/modify attributes in fit?
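
For illustration, a rough sketch of what estimator-side memoisation could look like (this is not an existing scikit-learn pattern; the class and the `_cached_importances_` attribute are made up). Note that it caches outside of fit, which is exactly the convention questioned above.

```python
import numpy as np


class MemoisingEstimator:
    """Toy estimator whose feature_importances_ is expensive but cached."""

    def fit(self, X, y):
        X = np.asarray(X)
        # Stand-in for real fitting work.
        self.coef_ = np.linalg.lstsq(X, y, rcond=None)[0]
        return self

    @property
    def feature_importances_(self):
        # Compute once, then reuse the cached value on later accesses.
        if not hasattr(self, "_cached_importances_"):
            self._cached_importances_ = np.abs(self.coef_)  # pretend this is slow
        return self._cached_importances_
```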


jnothman commented Oct 10, 2019 via email

@qinhanmin2014 (Member, Author)

> I thought you were talking about the expense of retrieving feature_importances_

Yes, so we should avoid retrieving feature_importances_ multiple times. For classes that implement it with @property, feature importance has to be recalculated every time the attribute is accessed.

> That's the attribute I mean which can be implemented as an expensive property... What expense are you referring to?

Sorry, I can't understand this part.
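
To make the @property point concrete, a small sketch (timings are illustrative only): RandomForestClassifier.feature_importances_ is a property that re-averages the per-tree importances, so every access pays the full cost again.

```python
import timeit

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Each access to the property re-averages the importances of all 500 trees.
print(timeit.timeit(lambda: rf.feature_importances_, number=10))
print(timeit.timeit(lambda: rf.feature_importances_, number=10))  # same cost again
```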


jnothman commented Oct 10, 2019 via email

@qinhanmin2014 (Member, Author)

> in theory it is the estimator's responsibility to memoise.

@jnothman do you think it's possible to store feature importance after we calculate it? That seems like a good idea.
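
A hypothetical selector-side workaround, for illustration only (the class and the `_importances_cache_` name are made up, and the threshold handling is simplified relative to the real SelectFromModel): cache the importances the first time the support mask is computed, so later transform calls reuse them. It stores an attribute outside of fit, which is the convention concern discussed above.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel


class CachedSelectFromModel(SelectFromModel):
    """Hypothetical variant that reads feature_importances_ only once."""

    def _get_support_mask(self):
        if getattr(self, "_importances_cache_", None) is None:
            est = getattr(self, "estimator_", self.estimator)
            # Read the (possibly expensive) importances exactly once.
            self._importances_cache_ = np.asarray(est.feature_importances_)
        importances = self._importances_cache_
        # Simplified threshold handling: only the default "mean" behaviour.
        threshold = np.mean(importances) if self.threshold is None else self.threshold
        return importances >= threshold
```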
