[WIP] Implemented SelectFromModel meta-transformer #3011

Closed
wants to merge 17 commits into scikit-learn:master from maheshakya:select_from_model

5 participants

@maheshakya

Fix for #2160
Implemented a meta-transformer to work with _LearntSelectorMixin. Test cases are included.
Documentation (with examples) needs to be completed.

@jnothman
Owner

Thanks for tackling this. A number of points:

  • please put this in the same file as _LearntSelectorMixin
  • although I suggested inheriting from _LearntSelectorMixin, it isn't sufficient for correct operation. _LearntSelectorMixin.transform inspects the estimator directly, e.g. to inspect the penalty parameter. You will need to turn _LearntSelectorMixin.transform into one or more functions.
  • You shouldn't set any underscore-suffixed attributes in your estimator's __init__. This should be done in fit. It's fine to postpone validation to fit.
  • You shouldn't need to have estimator_params. clone will keep any settings, and BaseEstimator will handle the delegation of attribute setting so that this can work in grid search.
  • SelectFromModel should accept two parameters: estimator and threshold.
  • I am not sure about duplicating the coefficients here to store them as an attribute. It would be more relevant to store the aggregate feature importances (because _LearntSelectorMixin currently has to sum over the coefficients for multiclass linear models, it might be useful to see the summed features). But it's not essential to store these locally.
  • It would also be nice to have an attribute threshold_ since _LearntSelectorMixin's calculation of the actual threshold can be non-trivial.
  • SelectFromModel should inherit from sklearn.feature_selection.base.SelectorMixin and implement _get_support_mask.
  • This PR should include the deprecation of _LearntSelectorMixin. Use the @deprecated decorator on transform.

I'm changing this PR to a [WIP]. I hope that's alright with you.
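
To make the target shape concrete, here is a rough sketch of what these points add up to (illustrative only; the minimal fit and the placeholder _get_support_mask are not the requested implementation):

from sklearn.base import BaseEstimator, clone
from sklearn.feature_selection.base import SelectorMixin


class SelectFromModel(BaseEstimator, SelectorMixin):
    def __init__(self, estimator, threshold=None):
        # No validation and no trailing-underscore attributes here;
        # both are postponed to fit.
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y=None, **fit_params):
        # clone keeps whatever settings were passed on the estimator.
        self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
        return self

    def _get_support_mask(self):
        # Compare the (aggregated) feature importances of estimator_ with
        # the resolved threshold_; details deliberately omitted in this sketch.
        raise NotImplementedError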

@jnothman
Owner

Also, the failing tests happen because this can't be constructed as it is. I think it should be included here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L441.

And although it does nothing at the moment, please inherit from sklearn.base.MetaEstimatorMixin.

@jnothman jnothman changed the title from [MRG] Implemented SelectFromModel meta-transformer to [WIP] Implemented SelectFromModel meta-transformer
@maheshakya

I have separated out the section of _LearntSelectorMixin where the threshold, importances and penalty are retrieved into two functions. One of them uses the @deprecated decorator.

Other issues are fixed.

sklearn/feature_selection/from_model.py
((109 lines not shown))
+        if self.estimator is not None:
+            self.estimator_ = self.estimator
+        else:
+            self.estimator_ = default
+
+        if self.estimator_ is None:
+            raise ValueError("estimator cannot be None")
+
+    def _make_estimator(self):
+        """Make and configure a copy of the `estimator_` attribute."""
+
+        estimator = clone(self.estimator_)
+
+        return estimator
+
+    def __init__(self, estimator=None, threshold=None):
@jnothman Owner

Please move this method to right below the class docstring.

estimator=None is not useful. Leave it without a default.

sklearn/utils/testing.py
@@ -439,7 +439,7 @@ def uninstall_mldata_mock():
# Meta estimators need another estimator to be instantiated.
meta_estimators = ["OneVsOneClassifier",
"OutputCodeClassifier", "OneVsRestClassifier", "RFE",
- "RFECV", "BaseEnsemble"]
+ "RFECV", "BaseEnsemble", "SelectFromModel"]
@jnothman Owner

The test failure suggests this is not the right solution. Put it in other instead, I think.

@maheshakya

Since threshold is now a parameter to SelectFromModel, the transform function should use that value. So in that case the threshold value passed to the transform function will be useless.
Is this the correct way?

When the transform function is called from another estimator, it will act in the usual way.

@coveralls

Coverage remained the same when pulling 9506120 on maheshakya:select_from_model into eb10c4c on scikit-learn:master.

@jnothman
Owner

So in that case the value taken from the transform function will be useless.

The correct way is to redefine transform in the metaestimator not to take a threshold parameter.
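
With the SelectorMixin-based design suggested earlier, that essentially means the metaestimator's transform ends up looking roughly like this (a simplified, dense-only sketch):

def transform(self, X):
    # No ``threshold`` argument here: the value stored on the object at
    # construction time (self.threshold) is the only one ever used,
    # via _get_support_mask / get_support.
    return X[:, self.get_support()]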

sklearn/feature_selection/from_model.py
((106 lines not shown))
+
+    def __init__(self, estimator, threshold=None):
+        self.estimator = estimator
+        self.threshold = threshold
+
+    def _validate_estimator(self, default=None):
+        """Check the estimator and set the `estimator_` attribute."""
+        if self.estimator is not None:
+            self.estimator_ = self.estimator
+        else:
+            self.estimator_ = default
+
+        if self.estimator_ is None:
+            raise ValueError("estimator cannot be None")
+
+    def _make_estimator(self):
@jnothman Owner

Please inline this method.

sklearn/feature_selection/from_model.py
((19 lines not shown))
"""Transformer mixin selecting features based on importance weights.
This implementation can be mixin on any estimator that exposes a
``feature_importances_`` or ``coef_`` attribute to evaluate the relative
importance of individual features for feature selection.
+
+    Attributes
+    ----------
+    `threshold_`: float
+        The threshold value used for feature selection.
+
+    `support_mask_`: an estimator
@jnothman Owner

You have documented these attributes in the wrong class.

@jnothman Owner

And support_mask_ is not an estimator.

@jnothman
Owner

I don't think it's a good idea to store support_mask_. It is inconsistent with other feature selectors, duplicates the get_support method provided by SelectorMixin, and means you can't change the threshold and just call get_support to reevaluate it.
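
For illustration, computing the mask on demand could look roughly like this (a method sketch; assumes ``import numpy as np``, and only a numeric or None threshold is handled here):

def _get_support_mask(self):
    # Nothing is cached: change ``self.threshold`` and call get_support()
    # again to obtain the selection for the new threshold.
    if hasattr(self.estimator_, "feature_importances_"):
        importances = self.estimator_.feature_importances_
    else:
        # Sum absolute coefficients over classes for multiclass linear models.
        importances = np.sum(np.abs(np.atleast_2d(self.estimator_.coef_)), axis=0)
    threshold = np.mean(importances) if self.threshold is None else self.threshold
    return importances >= threshold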

sklearn/feature_selection/from_model.py
((142 lines not shown))
+            classification, real numbers in regression).
+
+        **fit_params : Other estimator specific parameters
+
+        Returns
+        -------
+        self : object
+            Returns self.
+        """
+
+        # Validate and make estimator
+        self._validate_estimator()
+        self.estimator = self._make_estimator()
+
+        # Convert data
+        X, y = check_arrays(X, y)
@jnothman Owner

I don't think this belongs here. It's the base estimator's business.

sklearn/feature_selection/from_model.py
((96 lines not shown))
+        the median (resp. the mean) of the feature importances. A scaling
+        factor (e.g., "1.25*mean") may also be used. If None and if
+        available, the object attribute ``threshold`` is used. Otherwise,
+        "mean" is used by default.
+
+    Attributes
+    ----------
+    `estimator_`: an estimator
+        The base estimator from which the transformer is built.
+    """
+
+    def __init__(self, estimator, threshold=None):
+        self.estimator = estimator
+        self.threshold = threshold
+
+    def _validate_estimator(self, default=None):
@jnothman Owner

I don't think this method really needs to exist. There is no public way to set a default; estimator will have to be set explicitly.

sklearn/feature_selection/from_model.py
((29 lines not shown))
+ "estimator's parameter to compute it?")
+ if len(importances) != X.shape[1]:
+ raise ValueError("X has different number of features than"
+ " during model fitting.")
+
+ # Retrieve threshold
+ if threshold is None:
+ if hasattr(self, "penalty") and self.penalty == "l1":
+ # the natural default threshold is 0 when l1 penalty was used
+ threshold = getattr(self, "threshold", 1e-5)
+ else:
+ threshold = getattr(self, "threshold", "mean")
+
+ return importances, threshold
+
+ def _set_parameters_meta_transfomer(self, X, threshold):
@jnothman Owner

You have invented a lot of code duplication, when you could just allow the estimator to be passed as an argument to a shared staticmethod or module-level function. Please also choose a more descriptive name than _set_parameters. (If you use a module-level function, you don't have a problem of potentially overriding a method in a class using the mixin.)
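
For illustration, the shared piece could be as small as a module-level helper that takes the estimator explicitly (a sketch; the exact signature in the PR may differ):

import numpy as np


def _get_feature_importances(estimator):
    # Shared by the deprecated mixin and SelectFromModel, so the
    # coef_/feature_importances_ handling lives in exactly one place.
    if hasattr(estimator, "feature_importances_"):
        return estimator.feature_importances_
    # Aggregate by summing absolute coefficients over classes.
    return np.sum(np.abs(np.atleast_2d(estimator.coef_)), axis=0)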

@jnothman
Owner

Btw, it may be cleaner to implement this with the metaestimator not inheriting from the mixin. Please feel free to do it that way if you think it improves the code.

@maheshakya

I couldn't find a way to implement the _get_support_mask function without storing the mask in the transform function, short of duplicating the code. So I made it a private member.

@coveralls

Coverage remained the same when pulling be4f070 on maheshakya:select_from_model into fbe974b on scikit-learn:master.

@jnothman
Owner

This should probably support partial_fit. It should possibly also support warm_start, which would involve not using clone.

@maheshakya

@jnothman, I apologize for the inconvenience. Can you explain what exactly needs to be done in partial_fit? I checked out several estimators that define it, but they seem to perform different tasks.

sklearn/feature_selection/from_model.py
((9 lines not shown))
from ..externals import six
-from ..utils import safe_mask, atleast2d_or_csc
+from ..utils import safe_mask, atleast2d_or_csc, deprecated
+from .base import SelectorMixin
+
+
+def _get_feature_importances(estimator, X):
+ """Retrieve or aggregate feature importances from estimator"""
+ if hasattr(estimator, "feature_importances_"):
+ importances = estimator.feature_importances_
+ if importances is None:
+ raise ValueError("Importance weights not computed. Please set "
+ "the compute_importances parameter before fit.")
@glouppe Owner
glouppe added a note

This is deprecated in forests. You can remove this if-statement.

@jnothman
Owner

partial_fit allows a model to be trained without keeping all the training data in memory (or providing it) at once. Here, partial_fit should create estimator_ only on the first call, and call estimator_.partial_fit for each call.

The only reason I say it's necessary is because it's a way the current mixin can be used.
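
For this PR, that amounts to roughly the following method on the metaestimator (a sketch; assumes ``clone`` from sklearn.base):

def partial_fit(self, X, y=None, **fit_params):
    # Clone the base estimator only on the first call, then keep updating
    # the same fitted object incrementally on every subsequent call.
    if not hasattr(self, "estimator_"):
        self.estimator_ = clone(self.estimator)
    self.estimator_.partial_fit(X, y, **fit_params)
    return self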

@maheshakya

I have added partial_fit and warm_start.

BTW is there something wrong with Travis CI? It doesn't seem to be building my last two commits.

@jnothman
Owner

BTW is there something wrong with Travis CI? It doesn't seem to be building my last two commits.

It's highly non-user-friendly, but this means that Travis can't automatically merge your work with master, which it does before testing. Could you please rebase on the current master and force-push the rebased branch?

@coveralls

Coverage remained the same when pulling 9a41f2f on maheshakya:select_from_model into fec2867 on scikit-learn:master.

@coveralls

Coverage remained the same when pulling 1469c48 on maheshakya:select_from_model into fec2867 on scikit-learn:master.

sklearn/feature_selection/from_model.py
((82 lines not shown))
+        X : array-like of shape = [n_samples, n_features]
+            The training input samples.
+
+        y : array-like, shape = [n_samples]
+            The target values (integers that correspond to classes in
+            classification, real numbers in regression).
+
+        **fit_params : Other estimator specific parameters
+
+        Returns
+        -------
+        self : object
+            Returns self.
+        """
+        if not hasattr(self.estimator, "partial_fit"):
+            raise(AttributeError, "estimator does not have"
@jnothman Owner
jnothman added a note

That ( can't be there for this to be correct syntax in Python 3.

But I think it's fine if you don't explicitly check this case. The AttributeError produced by self.estimator_.partial_fit (e.g. "'LinearSVC' object has no attribute 'partial_fit'") is clear enough.
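
For reference, a form of that check valid in both Python 2 and 3 would be something like:

if not hasattr(self.estimator, "partial_fit"):
    raise AttributeError("estimator does not implement partial_fit")

but, as agreed, the explicit check can simply be dropped.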

@maheshakya

Yes, I agree.
I removed the test case for that as well.

@coveralls

Coverage remained the same when pulling ccbab10 on maheshakya:select_from_model into fec2867 on scikit-learn:master.

@coveralls

Coverage remained the same when pulling 77b30e4 on maheshakya:select_from_model into fec2867 on scikit-learn:master.

@jnothman
Owner

I'm not sure the tests are quite satisfactory yet (and I might not be around for the next few days to take another look), but I think we need to seek opinions as to whether a meta-estimator improves on the mixin. @glouppe, WDYT, and who else is likely to be opinionated on this?

@maheshakya

What are the tests that need to be improved? I can work on those.

@jnothman
Owner
@maheshakya

Thanks. I will give it a shot (improved test cases, examples and documentation).

@coveralls

Coverage remained the same when pulling 7f12af3 on maheshakya:select_from_model into fec2867 on scikit-learn:master.

@coveralls

Coverage remained the same when pulling 11d1e81 on maheshakya:select_from_model into fec2867 on scikit-learn:master.

@maheshakya

I suppose we need to change every example that uses feature selection based on LearntSelectorMixin, right?

@jnothman
Owner

Yes, unfortunately.

@MechCoder
Owner

It seems that there are unrelated changes. I'm not sure how adding a meta-transformer should affect .travis.yml.

@MechCoder
Owner

Closed in favor of #4242. I rescued this PR from git hell!

@MechCoder MechCoder closed this