[MRG + 2] ENH enable setting pipeline components as parameters #1769

Merged
merged 29 commits on Aug 29, 2016

@jnothman
Member

jnothman commented Mar 13, 2013

Until now, get_params() would return the steps of a pipeline by name, but setting them would fail silently (by setting an unused attribute); fixes bug #1800.

This allows users to grid search over alternative estimators for some step, as illustrated in the included example, or even to delete a step in a search by setting it to None.
But it may also be more directly practical: while a user may currently use get_params to extract an estimator from a nested pipeline using the double-underscore naming convention, e.g. for selective serialisation, they cannot use the reciprocal set_params which this PR enables.

This changeset also prohibits step names that equal initialisation parameters of the pipeline; otherwise FeatureUnion.set_params(transformer_weights=foo) would be ambiguous.
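A minimal sketch of the usage this enables (written against a current scikit-learn release; the step names and estimators are illustrative, not taken from this PR's diff):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Build a two-step pipeline, then swap the estimator step entirely.
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
pipe.set_params(clf=LinearSVC())   # replace the whole 'clf' step
pipe.set_params(clf__C=0.5)        # then tune a parameter of the new estimator

assert isinstance(pipe.named_steps['clf'], LinearSVC)
assert pipe.named_steps['clf'].C == 0.5
```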

@larsmans

sklearn/pipeline.py
@@ -140,47 +196,46 @@ def fit_transform(self, X, y=None, **fit_params):
else:
return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)
+ def run_pipeline(self, X, est_method, est_args=(), est_kwargs={}):

@larsmans

larsmans Mar 13, 2013

Member

This causes a run_pipeline to appear on FeatureUnion, so run would be a better name. I don't really see the use of this at present, though. Also, the docstring is incomplete (the types should be documented).

@jnothman

jnothman Mar 13, 2013

Member

Its only purpose was to refactor. I made it public as a "why not?" decision, which is probably why I forgot the docstring. Or perhaps because parameter descriptions are lacking throughout the class, and are incomplete in FeatureUnion.

@jnothman

jnothman Mar 13, 2013

Member

Btw, this only applies to Pipeline, not _BasePipeline, so run_pipeline may indeed be appropriate (unless you think run is sufficiently descriptive). An equivalent does not easily apply to FeatureUnion.

@larsmans

larsmans Mar 13, 2013

Member

I don't like the way this overloads the meaning of set_params. I'm thinking instead of either a list-like or dict-like interface with a __setitem__ to set a step to a different estimator. WDYT?

@jnothman

jnothman Mar 13, 2013

Member

@larsmans, the problem is that in order for the underscore notation to work, let's say param "X__Y", apparently "X" needs to be a valid param (though perhaps this is a bug in BaseEstimator.set_params, but I have no idea what motivation there is for the current code and so wouldn't dare remove it). Therefore "X" needs to be returned by get_params(deep=True), as it had been previously. But calling pipeline.set_params(X=blah) would just perform pipeline.X = blah to no meaningful effect.

Now, one option is to make pipeline.X = blah meaningful, one of:

  • add BasePipeline.__{g,s}etattr__ which identifies step names and handles them differently
  • actually just store the steps by name in pipeline.__dict__, and compile pipeline.steps (and the unused pipeline.named_steps) on the fly: steps = [getattr(self, name) for name in self._step_names].

I personally think actually having the steps as attributes on the object is ugly. Already having to have the step names and the initialiser arguments in the same namespace is ugly. Having to also make sure step names don't conflict with actual attributes of the classifier (including things like 'fit', 'transform' which are certainly possible step names in existing code) seems unreasonable.

As much as modifying some of the standard semantics of set_params is undesirable, breaking existing code is less desirable. But I'm open to suggestions.

So I pass the ball back to you...
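The double-underscore routing described above can be sketched with a toy (this is purely illustrative, not sklearn's actual implementation; the class and step names are made up):

```python
# Toy illustration of why 'X' must be a known parameter for 'X__Y'
# to be routed: the prefix before '__' is looked up first.
class Step:
    def __init__(self, C=1.0):
        self.C = C

class Nested:
    def __init__(self, **steps):
        self._steps = steps

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub = key.partition('__')
            if name not in self._steps:
                # BaseEstimator-style validation: the prefix must be
                # a parameter the estimator knows about.
                raise ValueError('Invalid parameter %r' % name)
            if sep:
                setattr(self._steps[name], sub, value)  # nested: X__Y
            else:
                self._steps[name] = value               # whole step: X
        return self

n = Nested(est=Step())
n.set_params(est__C=10.0)
assert n._steps['est'].C == 10.0
n.set_params(est=Step(C=2.0))
assert n._steps['est'].C == 2.0
```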

@amueller

amueller Mar 13, 2013

Member

I like the feature a lot; didn't have time to look into the implementation, though... and won't have until next week :-/

@amueller

amueller Mar 13, 2013

Member

It would be nice if we could also set an estimator in the pipeline to None, meaning a step should be skipped.

@jnothman

jnothman Mar 13, 2013

Member

Please help with the test failure: it's failing because _BasePipeline inherits from BaseEstimator, and hence is presumed to be testable among all estimators in sklearn/tests/test_common.py. Is the correct fix to make _BasePipeline some form of mixin (and hence not use super to call BaseEstimator.set_params), or to add _BasePipeline to sklearn.test_common.dont_test?

@amueller

amueller Mar 13, 2013

Member

The correct method is to make _BasePipeline an abstract base class by making the constructor an abstract method. Look at other base classes for examples.
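The suggested fix can be sketched as follows (Python 3 syntax; the codebase of the time used Python 2 metaclass declarations, and the class names here are illustrative). An abstract __init__ prevents the base class from being instantiated by test machinery that constructs every estimator:

```python
from abc import ABCMeta, abstractmethod

class _BasePipeline(metaclass=ABCMeta):
    @abstractmethod
    def __init__(self):
        pass

class ConcretePipeline(_BasePipeline):
    def __init__(self, steps):
        self.steps = steps

# The abstract base cannot be instantiated...
try:
    _BasePipeline()
    ok = False
except TypeError:
    ok = True
assert ok
# ...but a concrete subclass can.
assert ConcretePipeline([('clf', None)]).steps == [('clf', None)]
```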

@jnothman

jnothman Mar 13, 2013

Member

@amueller I like the idea of setting a step to None, but I think there's too much risk of some function -- be it a method on a _BasePipeline descendant, or some external library -- forgetting to handle that case. Why not just let the user supply a dummy? If sklearn lacks a library of dummies, this is a use-case worth considering.
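The "supply a dummy" alternative could be as small as a no-op transformer; a sketch (the name IdentityTransformer is hypothetical, not an existing sklearn class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# A do-nothing step a user could substitute instead of None.
class IdentityTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

X = np.arange(6).reshape(3, 2)
# fit_transform is inherited from TransformerMixin.
assert (IdentityTransformer().fit_transform(X) == X).all()
```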

@amueller

amueller Mar 13, 2013

Member

Yeah I also thought about that. I am not sure I buy your argument, though ;)
It would mean some extra code in the Pipeline, but it would be much more user friendly.

@jnothman

jnothman Mar 13, 2013

Member

Presumably you would also need to raise an error if the final step in the pipeline is None... How about, when we're happy with this patch, we consider the ramifications of that enhancement.

@jnothman

jnothman Mar 14, 2013

Member

Related to @larsmans' comment, set_params order of setting is now significant:

grid_clf = GridSearchCV(
    Pipeline([('sel', SelectKBest(chi2)), ('est', LogisticRegression())]),
    param_grid={'est': [LogisticRegression(), LinearSVC()], 'est__C': [0.1, 1.0]}
)

If 'est__C' is set before 'est' -- which, were it left to BaseEstimator.set_params, would depend on dict iteration order -- it won't have any effect (and worse, GridSearchCV will act as if it did the right thing).

It is also possible with this patch to set 'steps' as well as one of the steps, and the current implementation doesn't address this in set_params.

We could ignore this last somewhat-pathological case and generally ensure that BaseEstimator.set_params iterates in order of string length, or do those without underscores before those with. Or we can implement orderings on a per-estimator basis (which overriding set_params here does, but there needs to be a comment to that effect; an alternative would be for estimators to explicitly define a parameter ordering when appropriate).

And it's worth considering whether there's a use-case where the user would require an explicit parameter ordering, and whether we care to facilitate that...
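The "shallow keys first" ordering floated above can be sketched in one line: sort candidate parameter names by nesting depth so 'est' is always set before 'est__C' (parameter names here are illustrative):

```python
# Keys with fewer '__' separators (whole steps) sort before nested
# parameters, so replacing a step happens before tuning it.
params = {'est__C': 0.1, 'est': 'LinearSVC()', 'steps': '[...]'}
ordered = sorted(params, key=lambda k: k.count('__'))
assert ordered == ['est', 'steps', 'est__C']
```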

@jnothman

jnothman Apr 7, 2013

Member

These latest changes attempt to:

  • implement @amueller's suggestion of allowing steps to be set to None;
  • conform to #1805's comment on methods in meta-estimators (which obviates @larsmans' comment about naming run_pipeline by removing it).
@jnothman

jnothman Apr 8, 2013

Member

And this should possibly be augmented by an example of using advanced grid_search and pipeline features such as:

pipeline = Pipeline([('sel', SelectKBest()), ('clf', LinearSVC())])
param_grid = [{'sel__k': [5, 10, 20], 'sel__score_func': [chi2, f_classif], 'clf__C': [.1, 1, 10]},
              {'sel': None, 'clf__C': [.1, 1, 10]}]
search = GridSearchCV(pipeline, param_grid=param_grid)

not that I think this is a particularly strong motivating example.

@amueller

amueller Apr 8, 2013

Member

This is great :)
Unfortunately there is a deadline for ICCV next week, so I won't have much time this week either :-/ Sorry!

@GaelVaroquaux

sklearn/pipeline.py
- def score(self, X, y=None):
+ @property

@GaelVaroquaux

GaelVaroquaux Apr 8, 2013

Member

Why use a property here? What's the benefit compared to a method?

@jnothman

jnothman Apr 8, 2013

Member

As elsewhere, it ensures that hasattr(pipeline, 'inverse_transform') iff hasattr(step, 'inverse_transform') for all step in pipeline. Makes ducktyping meaningful.
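The ducktyping point can be demonstrated with a toy (illustrative only; class names are made up): a property that raises AttributeError makes hasattr() on the wrapper mirror hasattr() on the wrapped step.

```python
# hasattr() returns False when the attribute lookup raises
# AttributeError -- including from inside a property getter.
class Wrapper:
    def __init__(self, last_step):
        self.last_step = last_step

    @property
    def inverse_transform(self):
        if not hasattr(self.last_step, 'inverse_transform'):
            raise AttributeError('last step lacks inverse_transform')
        return self.last_step.inverse_transform

class Invertible:
    def inverse_transform(self, X):
        return X

assert hasattr(Wrapper(Invertible()), 'inverse_transform')
assert not hasattr(Wrapper(object()), 'inverse_transform')
```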

@GaelVaroquaux

sklearn/pipeline.py
+ return last_step.fit(Xt, y, **fit_params).transform(Xt)
+ return fn
+
+ def _pipeline_property(est_method, doc):

@GaelVaroquaux

GaelVaroquaux Apr 8, 2013

Member

This meta-programming, relying on a custom vocabulary to populate objects, makes them really hard to understand and introspect. In addition, it is quite tricky code to write and debug.

May I suggest something less sophisticated: for each of these methods, say 'transform', have a corresponding '_transform' method. Then a property:

@property
def _get_transform(self):
    if hasattr(self.steps[-1], 'transform'):
        return self._transform
    else:
        raise AttributeError

I believe that this should do the trick.

@GaelVaroquaux

GaelVaroquaux Apr 8, 2013

Member

Fixed my code above, which wasn't correct :)

@jnothman

jnothman Apr 8, 2013

Member

I assume you mean:

@property
def transform(self):  # Note: transform, not _get_transform
    """documentation relating to transform"""
    if hasattr(self.steps[-1], 'transform'):
        return self._transform
    else:
        raise AttributeError

Sure that'll work, assuming a sensible definition for _transform. But seeing as both _transform and transform would then be templated code repeated for predict, predict_proba, predict_log_proba, decision_function, score, and potentially other methods not yet invented, why is repeated code better or less vulnerable to bugs? Most of these methods in the current version of Pipeline are not separately tested, and no one would notice if they had small aberrations. This way, _pipeline_property makes it clear that all these methods are nearly identical; though it should have a comment to that effect.

Of course, you could do similar with something like:

@property
def transform(self):
    """documentation relating to transform"""
    return partial(self._run_pipeline, self.steps[-1][1].transform)

def _run_pipeline(self, est_fn, X, *args, **kwargs):
    Xt = X
    for name, transform in self.steps[:-1]:
        if transform is not None:
            Xt = transform.transform(Xt)
    return est_fn(Xt, *args, **kwargs)

Is that preferred?

@amueller

amueller May 7, 2013

Member

Currently this PR does three things, if I'm correct:

  • add meaningful docstrings
  • fix the duck typing
  • make it possible to replace estimators with set_params.

Could you maybe cherry-pick the first into master? The diff is very hard to read :-/
Also, it might be worth splitting the other two parts into two independent PRs if that is not too much work.
Did you have any comments on @GaelVaroquaux's approach?
For the third part, I am really for the feature, but I think we need to discuss the API for it.

@jnothman

jnothman May 7, 2013

Member

Yes, the PR is bloated. If I find the time, I'll try to separate out the stuff that's not in the title of this PR.

Anyone who doesn't think steps should be set as parameters needs to contend with some other fix for #1800. So as far as I'm concerned, it's a must, or a might-as-well. I am not sure what API issues you mean. Do you mean:

  • should it actually set an attribute on the estimator for consistency with BaseEstimator.set_params? The current implementation can be changed to do that easily, but trying to do the reverse creates backwards-compatibility issues.
  • should there be some other way of specifying parameter-setting priority, rather than overriding set_params? I consider this an implementation detail.
  • should we support setting to None? I can't see why not.

Gaël's approach is about how to fix the duck-typing and keep the code readable (which is one reason I can't just cherry-pick it into master). He tests hasattr explicitly and raises an AttributeError in the else clause, which seems redundant and verbose. But if this promises substantially more readable, explicit code, I have no problem with it (except that it consumes three more repeated lines per method). He also delegates operation to an underscore-prefixed version of each method, which I think should be avoided: all these methods do the same thing, modulo the method called on the final estimator, so the code is more explicit, and less bug-prone, when refactored into something like _run_pipeline. (But again, this is not precisely the topic of this PR.)

@amueller

amueller May 8, 2013

Member

Thanks for your comments. I think we should fix the duck-typing first, and then discuss your improvement.
I must admit I did not look too closely into the implementation of set_params; I was just a bit uneasy with it ;) There is probably no better way to resolve #1800. I think I just have to read the code in detail to convince myself that it's a good idea ;)

@jnothman

jnothman May 8, 2013

Member

Okay, I've squashed the #1805 changes and review comments, and the parameter setting, into separate commits. When you (or others) give that first commit the okay, I'll pull it into master. Then we can discuss the second commit separately.

@amueller

amueller May 8, 2013

Member

Thanks for the quick update. I think it looks good but I'll have a closer look later. @GaelVaroquaux, any opinion? Still the same?

@GaelVaroquaux

GaelVaroquaux May 17, 2014

Member

I had completely forgotten about this issue. I can try to find time to have a new close look at it (it will be new because I forgot about it). Before I do that, I would like to have other people's opinion: are people still excited about it, and do they see it as an important addition?

@larsmans

larsmans May 17, 2014

Member

Yes, I think this is a good addition.

@jnothman

jnothman May 17, 2014

Member

I might have a look at it and see if I am aware of new issues since I wrote this!

@jnothman

jnothman May 18, 2014

Member

One thing this lacks is an example (perhaps grid search over dimensionality reductions, or comparing linear and nonlinear models). Perhaps there are existing examples that could be modified to employ this feature.

@rmcgibbo

rmcgibbo May 19, 2014

Contributor

This is a pretty cool feature. Here's an example showing it off for grid search.

rmcgibbo@Computer-2 ~/projects/scikit-learn/examples
$ pep8 reduction_pipeline.py

rmcgibbo@Computer-2 ~/projects/scikit-learn/examples
$ cat reduction_pipeline.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
========================================================================
Optimizing over dimensionality reductions with Pipeline and GridSearchCV
========================================================================

This example constructs a pipeline that does unsupervised dimensionality
reduction followed by prediction with a support vector classifier. It
demonstrates the use of GridSearchCV to optimize over different classes of
estimators in a single CV run -- both PCA and NMF dimensionality reductions
are explored during the grid search.
"""
from __future__ import print_function
print(__doc__)

from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA, NMF

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('svc', SVC())
])

digits = load_digits()
X_digits = digits.data
y_digits = digits.target


grid = GridSearchCV(pipe, cv=3, n_jobs=-1, param_grid={
    'reduce_dim': [PCA(), NMF()],
    'reduce_dim__n_components': [2, 4, 8],
    'svc__C': [1, 10, 100, 1000]
})
grid.fit(X_digits, y_digits)

print("Grid scores:")
print()
for params, mean_score, scores in grid.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() / 2, params))
print()
Member

jnothman commented May 19, 2014

Thanks for that. I'm not sure it's a good idea to showcase the rarely applicable case where the alternative estimators share a common parameter name. For example, if we had a SelectKBest in there as well, we'd unfortunately need a more verbose parameter grid:

param_grid=[
    {
        'reduce_dim': [PCA(), NMF()],
        'reduce_dim__n_components': [2, 4, 8],
        'svc__C': [1, 10, 100, 1000]
    },
    {
        'reduce_dim': [SelectKBest()],
        'reduce_dim__k': [2, 4, 8],
        'svc__C': [1, 10, 100, 1000]
    },
]

which is by no means as pretty, but is the more common case :s
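[Editor's note] The routing that makes both grids above work can be illustrated with a minimal, hypothetical mock. This is not sklearn's actual implementation (class names `TinyPipeline`/`TinyStep` are invented for illustration); it only sketches the rule this PR introduces: a `step__param` key is forwarded to the named step, while a bare step name replaces the step's estimator entirely (or deletes it via None).

```python
class TinyStep:
    """A stand-in estimator that just records its parameters."""
    def __init__(self, **params):
        self.params = params

    def set_params(self, **params):
        self.params.update(params)
        return self


class TinyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator-or-None) pairs

    def set_params(self, **params):
        names = [name for name, _ in self.steps]
        for key, value in params.items():
            if '__' in key:
                # 'reduce_dim__n_components' -> delegate to step 'reduce_dim'
                step_name, sub_key = key.split('__', 1)
                dict(self.steps)[step_name].set_params(**{sub_key: value})
            elif key in names:
                # bare step name -> swap out the estimator itself (or None)
                self.steps[names.index(key)] = (key, value)
            else:
                raise ValueError('Invalid parameter %r' % key)
        return self


pipe = TinyPipeline([('reduce_dim', TinyStep(n_components=2)),
                     ('svc', TinyStep(C=1))])
pipe.set_params(svc__C=10)                  # tune a step's parameter
pipe.set_params(reduce_dim=TinyStep(k=4))   # replace a whole step
pipe.set_params(svc=None)                   # delete a step entirely
```

A grid search simply calls `set_params` with each candidate combination, which is why `'reduce_dim': [PCA(), NMF()]` and `'reduce_dim__n_components': [2, 4, 8]` can coexist in one grid.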

Contributor

rmcgibbo commented May 19, 2014

Yeah, that's probably more useful.

Member

jnothman commented Aug 24, 2016

TODO:

  • allow transform and fit_transform with last estimator None
  • test 'Unknown step name' error not traversable
  • check if we test get_params where an estimator is None
  • test step names conflict with constructor arguments
  • test step names not containing __
  • test Pipeline all intermediate steps should be transformers error
  • test FeatureUnion all steps should be transformers error
  • check test coverage for "if not Xs" case in FeatureUnion
  • separate issue for _BasePipeline in VotingClassifier
  • separate issue for hasattr(FeatureUnion, 'get_feature_names')
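[Editor's note] Two of the TODO items above concern name validation: step names may not contain `__` (it is the parameter-routing delimiter) and may not shadow constructor arguments, since `FeatureUnion.set_params(transformer_weights=...)` would otherwise be ambiguous. A hedged sketch of such a check (illustrative names and messages, not sklearn's exact code):

```python
def validate_names(names, constructor_params=('steps',)):
    """Reject step names that would break set_params routing."""
    invalid = [name for name in names if '__' in name]
    if invalid:
        raise ValueError('Estimator names must not contain __: got %r'
                         % invalid)
    conflicts = [name for name in names if name in constructor_params]
    if conflicts:
        raise ValueError('Estimator names conflict with constructor '
                         'arguments: %r' % conflicts)


validate_names(['reduce_dim', 'svc'])   # OK
# validate_names(['bad__name'])         # would raise ValueError
# validate_names(['steps'])             # would raise ValueError
```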
Member

jnothman commented Aug 25, 2016

Thanks @amueller, your demands for better tests revealed a few errors and inconsistencies.

Should ideally still change error messages to be more consistent about showing step name vs object repr vs type name...

Member

jnothman commented Aug 25, 2016

@MechCoder, if you will allow me, I'd like to leave off allowing transform and fit_transform with last estimator None. It's a logical extension, I admit, and I could put it in there, but it needs to be done carefully.

Please, @amueller and @MechCoder look over the recent test improvements and fixes.
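[Editor's note] The None-step semantics under discussion can be sketched in a few lines. This is an illustrative mock, not the real Pipeline code: a step set to None is simply skipped, so data passes through that position unchanged.

```python
def apply_steps(steps, X):
    """Apply each transform in order, skipping steps that are None."""
    for name, transform in steps:
        if transform is None:  # step deleted via set_params(name=None)
            continue
        X = transform(X)
    return X


steps = [('double', lambda X: [x * 2 for x in X]),
         ('reduce_dim', None),  # effectively removed
         ('shift', lambda X: [x + 1 for x in X])]
print(apply_steps(steps, [1, 2, 3]))  # -> [3, 5, 7]
```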

- for name, trans in self.transformer_list)
+ delayed(_transform_one)(trans, name, weight, X)
+ for name, trans, weight in self._iter())
+ if not Xs:

amueller (Member) commented Aug 25, 2016

Is this tested now? Now this happens when all transformers are None, right? I think I understand now what happens ;) (still maybe add a one-line comment "all transformers are None"? Is having an empty list of transformers allowed?)

jnothman (Member) commented Aug 26, 2016

Commented. I suppose an empty list of transformers is allowed but untested.
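[Editor's note] A hypothetical illustration of the `if not Xs` branch being discussed (a mock, not FeatureUnion's real code): when every transformer is None there is nothing to stack, so the result degenerates to a zero-width feature matrix with one row per sample.

```python
def union_transform(transformer_list, X):
    """Apply non-None transformers and concatenate their features per row."""
    Xs = [transform(X) for name, transform in transformer_list
          if transform is not None]
    if not Xs:  # all transformers are None
        return [[] for _ in X]  # n_samples rows, zero features
    # Concatenate features row-wise across transformer outputs.
    return [sum(rows, []) for rows in zip(*Xs)]


double = lambda X: [[x * 2] for x in X]
print(union_transform([('d', double), ('gone', None)], [1, 2]))  # [[2], [4]]
print(union_transform([('a', None), ('b', None)], [1, 2]))       # [[], []]
```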

sklearn/pipeline.py
+ Parameters of the transformers may be set using its name and the parameter
+ name separated by a '__'. A transformer may be replaced entirely by
+ setting the parameter with its name to another transformer,
+ or removed by setting to None.

amueller (Member) commented Aug 25, 2016

nitpick: "setting it to None"?

sklearn/pipeline.py
transforms = estimators[:-1]
estimator = estimators[-1]
for t in transforms:
+ if t is None:
+ continue

amueller (Member) commented Aug 25, 2016

This line is not tested, which is slightly confusing to me.

jnothman (Member) commented Aug 26, 2016

My coverage check says this is tested...

sklearn/pipeline.py
+ # validate names
+ self._validate_names(names)
+
+ # validate estimators
transforms = estimators[:-1]

amueller (Member) commented Aug 25, 2016

transformers?

+ mult5 = Mult(mult=5)
+
+ def make():
+ return Pipeline([('m2', mult2), ('m3', mult3), ('last', mult5)])

amueller (Member) commented Aug 25, 2016

You should fit the pipeline, so that the steps are checked.

self.steps[-1][-1].fit(Xt, y, **fit_params)
return self
def fit_transform(self, X, y=None, **fit_params):
- """Fit all the transforms one after the other and transform the
- data, then use fit_transform on transformed data using the final
+ """Fit the model and transform with the final estimator

amueller (Member) commented Aug 25, 2016

This should also call validate_steps, right? Is there a test that fails in validate_steps?

jnothman (Member) commented Aug 26, 2016

Good catch. I've added a fit_transform check to test_step_name_validation.

jnothman (Member) commented Aug 26, 2016

The tricky thing is that because these have legacy parameter validation in __init__, testing all validation cases takes some effort!

Member

amueller commented Aug 25, 2016

I think tests that are missing are:

  • init pipeline, set steps to something weird, call fit_transform (this is currently not guarded against, I think), because no call to validate_steps in fit_transform.
  • init pipeline, set some steps to None, fit and predict/transform. (validate_steps is never called with a step being None).
  • maybe set a step to None at creation time?
Member

jnothman commented Aug 26, 2016

I've fixed those things. Also, despite my comment to @MechCoder, I've now added support for None as the last estimator when [inverse_]transform is called...

Member

amueller commented Aug 26, 2016

LGTM though python 2.6 complains ImportError: cannot import name assert_dict_equal

Member

jnothman commented Aug 27, 2016

I've submitted a patch that just uses assert_equal when assert_dict_equal is not importable; it's only cosmetic anyway.

Member

jnothman commented Aug 28, 2016

I'm astounded that this might actually, finally, be merged. If @MechCoder or @amueller would like to give its recent changes a final pass, that'd be great.

Member

amueller commented Aug 28, 2016

LGTM.

Sorry, apparently I've been putting out fires in other places for the last 3 years (I've wanted this for a while).

Member

jnothman commented Aug 28, 2016

hahaha ;)

Member

jnothman commented Aug 29, 2016

Well, I suppose I'd better merge before what's new merge conflicts appear!

@jnothman jnothman merged commit 5b20d48 into scikit-learn:master Aug 29, 2016

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
Member

MechCoder commented Aug 29, 2016

Sorry for the delay. Does it still need a review? :p

Member

MechCoder commented Aug 29, 2016

Thanks for your effort and long wait!

Member

amueller commented Aug 29, 2016

Wohoo!

Contributor

betatim commented Aug 29, 2016

This will be great!

Member

jnothman commented Aug 30, 2016

Is it worth highlighting this feature alongside model_selection changes in what's new? It happens to fit very nicely with being able to access values of each parameter searched by grid search.

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG] ENH enable setting pipeline components as parameters (#1769)
Pipeline and FeatureUnion steps may now be set with set_params, and transformers may be replaced with None to effectively remove them.

Also test and improve ducktyping of Pipeline methods

@jnothman jnothman referenced this pull request in TeamHG-Memex/eli5 Jan 31, 2017

Closed

Support explain_weights(pipeline) #158

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG] ENH enable setting pipeline components as parameters (#1769)
Pipeline and FeatureUnion steps may now be set with set_params, and transformers may be replaced with None to effectively remove them.

Also test and improve ducktyping of Pipeline methods