Feature Selection ease-of-use Improvements #2167

Closed
wants to merge 12 commits

Conversation

Contributor

oddskool commented Jul 18, 2013

As mentioned on the mailing list, this is a PR to improve the ease of use of feature selection in sklearn:

  1. Documentation
  • put a bit more detailed explanations in the narrative docs
  2. GradientBoostingClassifier
  • should implement transform() to perform feature selection based on the computed feature importances

For 1) I would basically touch a few areas of the docs pointed out on the mailing list.

For 2) I'm considering a simple strategy: make GradientBoostingClassifier derive from _LearntSelectorMixin and see if the tests pass. If anyone has more clever ideas, let me know :)
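For reference, a minimal sketch of the idea behind 2) follows. This is not the actual _LearntSelectorMixin code (the real mixin also handles coef_-based estimators and string thresholds such as "mean" or "median"); it only illustrates how a transform() based on feature_importances_ could look:

    # Illustrative sketch only -- not the actual scikit-learn mixin.
    import numpy as np

    class FeatureImportanceSelectorSketch(object):
        """Mixin-style transform() for estimators exposing feature_importances_."""

        def transform(self, X, threshold=None):
            importances = self.feature_importances_      # set by fit()
            if threshold is None:
                threshold = np.mean(importances)         # default cut-off
            mask = importances >= threshold
            return np.asarray(X)[:, mask]                # keep important columns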

Contributor

oddskool commented Jul 18, 2013

On my machine tests are OK after my last commits.

I'm going to write a new test for FS of Gradient* classes now to verify everything works.

Owner

jnothman commented Jul 20, 2013

IMO, as long as there are tests for _LearntSelectorMixin (which there are), and there are tests for GradientBoosting....feature_importances_, you shouldn't need separate tests to ensure the mixin behaviour is correct.

Are there other estimators which store feature_importances_ but do not inherit from _LearntSelectorMixin?

Owner

GaelVaroquaux commented Jul 20, 2013

IMO, as long as there are tests for _LearntSelectorMixin (which there are), and there are tests for GradientBoosting....feature_importances_, you shouldn't need separate tests to ensure the mixin behaviour is correct.

The correct behaviour stems from GradientBoosting... implementing the right interface, and that can go wrong independently of _LearntSelectorMixin. Thus I advocate for tests :)
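A smoke test along these lines would exercise the interface end to end. This is a sketch only, not the test that actually landed in the PR; it assumes GradientBoostingClassifier gains transform() via the mixin (which is what this PR adds), a method that later scikit-learn releases deprecated in favour of SelectFromModel:

    # Sketch of a feature-selection smoke test for GradientBoostingClassifier.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    def test_gradient_boosting_feature_selection():
        X, y = make_classification(n_samples=200, n_features=20,
                                   n_informative=3, random_state=0)
        clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
        X_reduced = clf.transform(X)   # provided by _LearntSelectorMixin
        # with the default "mean importance" threshold, some features are dropped
        assert X_reduced.shape[1] < X.shape[1]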

@larsmans larsmans commented on an outdated diff Jul 22, 2013

doc/modules/feature_selection.rst
@@ -213,3 +213,24 @@ features::
* :ref:`example_ensemble_plot_forest_importances_faces.py`: example
on face recognition data.
+
+Feature selection as part of a pipeline
+=======================================
+
+Feature selection is usually used as a pre-processing step before doing
+the actual learning. The recommended way to do this in scikit-learn is
+to use a :class:`sklearn.pipeline.Pipeline`::
+
+ clf = Pipeline([
+ ('feature_selection', RandomForestClassifier()),
@larsmans

larsmans Jul 22, 2013

Owner

Doesn't this create a rather huge model? Maybe a LinearSVC(penalty="l1") is more appropriate as the first step, since it's much more compact.

(This is an artifact of our implementation, but let's keep quiet about that...)
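For reference, here is a self-contained version of the suggested pipeline. The imports and toy data are added purely for illustration; note that LinearSVC needs dual=False when penalty="l1" is used, and that plugging the bare estimator in as a selection step relies on the transform() supplied by _LearntSelectorMixin at the time (current scikit-learn wraps the estimator in SelectFromModel instead):

    # Self-contained sketch of the documented pipeline; the doc snippet in the
    # PR shows only the Pipeline construction itself.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=5, random_state=0)

    clf = Pipeline([
        # the l1 penalty drives many coefficients to zero, so the fitted
        # selector stays compact (dual=False is required with penalty="l1")
        ('feature_selection', LinearSVC(penalty="l1", dual=False)),
        ('classification', RandomForestClassifier()),
    ])
    clf.fit(X, y)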

Contributor

oddskool commented Jul 23, 2013

@GaelVaroquaux, @jnothman: regarding tests I'm quite open to both options. Would you both be happy if I found a way to factorize the tests for all children of _LearntSelectorMixin?

This would just check that all estimators that derive from it are indeed able to do their work.
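Such a factorized test could take roughly this shape. This is a sketch only: the list of estimators is written out by hand here rather than discovered from the mixin, and the transform() it relies on was later deprecated in favour of SelectFromModel:

    # Sketch of a shared check: every estimator that derives from
    # _LearntSelectorMixin should be able to reduce the number of features.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.tree import DecisionTreeClassifier

    SELECTOR_ESTIMATORS = [DecisionTreeClassifier, RandomForestClassifier,
                           ExtraTreesClassifier, GradientBoostingClassifier]

    def check_learnt_selectors():
        X, y = make_classification(n_samples=200, n_features=20,
                                   n_informative=3, random_state=0)
        for Estimator in SELECTOR_ESTIMATORS:
            est = Estimator(random_state=0).fit(X, y)
            X_reduced = est.transform(X)    # supplied by _LearntSelectorMixin
            assert X_reduced.shape[1] <= X.shape[1]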

Owner

jnothman commented Jul 23, 2013

Just test that it works locally, and everyone is happy, especially Gaël. The _LearntSelectorMixin is already tested.

Owner

GaelVaroquaux commented Jul 23, 2013

Just test that it works locally, and everyone is happy, especially Gaël.

:)

Contributor

oddskool commented Jul 23, 2013

You mean I should revert the test added in 99ac326?

Contributor

oddskool commented Jul 24, 2013

Ok, @GaelVaroquaux, @jnothman, I've reverted the previously added unit test, since the tests pass locally and the mixin is already tested elsewhere.

Then, @larsmans, I've switched the Pipeline example to a LinearSVC: it now looks a bit less tied to forest learners, and hence perhaps less strange to new readers (who would otherwise see the same classifier used for two different things).

Now, is there anything else that prevents the PR from being merged?

Contributor

oddskool commented Jul 25, 2013

Do I have to open an issue in the tracker to make this ship with 0.14 ?

Owner

amueller commented Jul 25, 2013

I just marked it. I think it is a nice addition to the docs and looks quite good.

Owner

amueller commented Jul 25, 2013

@GaelVaroquaux what do you think about including it?

Owner

larsmans commented Jul 25, 2013

I'm +1 on this. Shall I rebase it?

Owner

amueller commented Jul 25, 2013

@larsmans feel free, I haven't really looked into the comments.

@amueller amueller commented on an outdated diff Jul 25, 2013

doc/modules/feature_selection.rst
@@ -213,3 +213,25 @@ features::
* :ref:`example_ensemble_plot_forest_importances_faces.py`: example
on face recognition data.
+
+Feature selection as part of a pipeline
+=======================================
+
+Feature selection is usually used as a pre-processing step before doing
+the actual learning. The recommended way to do this in scikit-learn is
+to use a :class:`sklearn.pipeline.Pipeline`::
+
+ clf = Pipeline([
+ ('feature_selection', LinearSVC(penalty="l1")),
+ ('classification', RandomForestClassifier())
+ ])
+ clf.fit(X, y)
+
+In this snippet we make use of a first :class:`sklearn.svm.LinearSVC`
@amueller

amueller Jul 25, 2013

Owner

I would remove the word "first".

@amueller

amueller Jul 25, 2013

Owner

Or "we first make use of"

@amueller amueller commented on an outdated diff Jul 25, 2013

doc/modules/feature_selection.rst
+=======================================
+
+Feature selection is usually used as a pre-processing step before doing
+the actual learning. The recommended way to do this in scikit-learn is
+to use a :class:`sklearn.pipeline.Pipeline`::
+
+ clf = Pipeline([
+ ('feature_selection', LinearSVC(penalty="l1")),
+ ('classification', RandomForestClassifier())
+ ])
+ clf.fit(X, y)
+
+In this snippet we make use of a first :class:`sklearn.svm.LinearSVC`
+to evaluate feature importances and select the most relevant features.
+Then, a second classifier
+(:class:`sklearn.ensemble.GradientBoostingClassifier`) is trained on the
@amueller

amueller Jul 25, 2013

Owner

I wouldn't use parentheses here.

@amueller amueller commented on an outdated diff Jul 25, 2013

sklearn/ensemble/gradient_boosting.py
@@ -693,7 +695,8 @@ def feature_importances_(self):
return importances
-class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
+class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin,
@amueller

amueller Jul 25, 2013

Owner

I think there should be tests for that.

Owner

amueller commented Jul 25, 2013

Apart from minor copyediting and missing tests I'm +1.
I would really like the tests, though.

Owner

amueller commented Jul 25, 2013

we could also have common tests for feature selection ;)

Owner

larsmans commented Jul 25, 2013

I'm cherry-picking/rebasing the doc improvements as I'm not in the mood to review the mixin stuff right now. Maybe I'll check that stuff tomorrow.

Contributor

oddskool commented Jul 26, 2013

Oops... It seems I've just pushed a mess to my branch when I tried to pull in the changes others had made to the current master :(

Contributor

oddskool commented Jul 26, 2013

Unfortunately I have to go deal with other chores and I'll be AFK most of the weekend :(

It seems @larsmans has already cherry-picked the doc additions into master, so we are OK docs-wise. @amueller's edits are worthwhile, but they can go in later IMHO.

On the mixin/tests side, I guess committers have to agree on whether tests are needed and what form they should take!

I provided a commit that ports a FS test from RandomForest* to GradientBoosting*.

If committers agree that providing unified FS tests for all FS-capable estimators is a good way to go, I can look into it next week.

Owner

larsmans commented Aug 12, 2014

Closing because ffb89eb already introduced feature selection using GB and the doc improvements were cherry-picked into master separately.

@larsmans larsmans closed this Aug 12, 2014
