[MRG] ENH enable setting pipeline components as parameters #1769

Open
wants to merge 5 commits into from

7 participants

@jnothman
scikit-learn member

Until now, get_params() would return the steps of a pipeline by name, but setting them would fail silently (by setting an unused attribute); fixes bug #1800.

This allows users to grid search over alternative estimators for some step, as illustrated in the included example, or even to delete a step in a search by setting it to None.
But it may also be more directly practical: while a user may currently use get_params to extract an estimator from a nested pipeline using the double-underscore naming convention, e.g. for selective serialisation, they cannot use the reciprocal set_params which this PR enables.
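For illustration, here is a minimal sketch of the round-trip this enables. `MiniPipeline` and `SimpleEstimator` are hypothetical stand-ins, not the actual sklearn classes:

```python
# Illustrative sketch only: a toy pipeline whose steps are exposed as
# parameters, so they can be both extracted and replaced by name.

class SimpleEstimator:
    def __init__(self, C=1.0):
        self.C = C

class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def get_params(self, deep=True):
        params = {'steps': self.steps}
        if deep:
            for name, est in self.steps:
                params[name] = est  # each step is itself a parameter
                for key, value in vars(est).items():
                    params['%s__%s' % (name, key)] = value
        return params

    def set_params(self, **params):
        # Apply whole-step parameters first, so 'clf__C' targets the
        # newly substituted object rather than the old one.
        by_depth = sorted(params.items(), key=lambda kv: kv[0].count('__'))
        for key, value in by_depth:
            if '__' in key:
                name, _, subkey = key.partition('__')
                setattr(dict(self.steps)[name], subkey, value)
            else:
                self.steps = [(n, value if n == key else e)
                              for n, e in self.steps]
        return self

pipe = MiniPipeline([('clf', SimpleEstimator())])
est = pipe.get_params()['clf']                        # extract a step by name
pipe.set_params(clf=SimpleEstimator(), clf__C=10.0)   # replace it, then tune it
```

The key point is the reciprocity: any name that `get_params(deep=True)` returns can also be passed to `set_params`.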

This changeset also prohibits step names that equal initialisation parameters of the pipeline; otherwise FeatureUnion.set_params(transformer_weights=foo) would be ambiguous.
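The prohibition could be sketched like this (hypothetical `MiniFeatureUnion`, not the actual implementation): reject any step name that collides with the container's own constructor parameters.

```python
# Illustrative sketch of the name check described above.
import inspect

class MiniFeatureUnion:
    def __init__(self, transformer_list, transformer_weights=None):
        init_params = set(
            inspect.signature(MiniFeatureUnion.__init__).parameters) - {'self'}
        for name, _ in transformer_list:
            if name in init_params:
                # e.g. a step named 'transformer_weights' would make
                # set_params(transformer_weights=...) ambiguous
                raise ValueError(
                    'Step name %r clashes with a constructor parameter' % name)
        self.transformer_list = transformer_list
        self.transformer_weights = transformer_weights
```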

@larsmans larsmans and 1 other commented on an outdated diff Mar 13, 2013
sklearn/pipeline.py
@@ -140,47 +196,46 @@ def fit_transform(self, X, y=None, **fit_params):
else:
return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)
+ def run_pipeline(self, X, est_method, est_args=(), est_kwargs={}):
@larsmans
scikit-learn member
larsmans added a note Mar 13, 2013

This causes a run_pipeline to appear on FeatureUnion, so run would be a better name. I don't really see the use of this at present, though. Also, the docstring is incomplete (the types should be documented).

@jnothman
scikit-learn member
jnothman added a note Mar 13, 2013

Its only purpose was to refactor. I made it public as a "why not?" decision, which is probably why I forgot the docstring. Or perhaps because parameter descriptions are lacking throughout the class, and are incomplete in FeatureUnion.

@jnothman
scikit-learn member
jnothman added a note Mar 13, 2013

Btw, this only applies to Pipeline, not _BasePipeline, so run_pipeline may indeed be appropriate (unless you think run is sufficiently descriptive). An equivalent does not easily apply to FeatureUnion.

@larsmans
scikit-learn member

I don't like the way this overloads the meaning of set_params. I'm thinking instead of an either list-like or dict-like interface with a __setitem__ to set a step to a different estimator. WDYT?

@jnothman
scikit-learn member

@larsmans, the problem is that in order for the underscore notation to work, let's say param "X__Y", apparently "X" needs to be a valid param (though perhaps this is a bug in BaseEstimator.set_params, but I have no idea what motivation there is for the current code and so wouldn't dare remove it). Therefore "X" needs to be returned by get_params(deep=True), as it had been previously. But calling pipeline.set_params(X=blah) would just perform pipeline.X = blah to no meaningful effect.

Now, one option is to make pipeline.X = blah meaningful, one of:

  • add BasePipeline.__{g,s}etattr__ which identifies step names and handles them differently
  • actually just store the steps by name in pipeline.__dict__, and compile pipeline.steps (and the unused pipeline.named_steps) on the fly: steps = [getattr(self, name) for name in self._step_names].

I personally think actually having the steps as attributes on the object is ugly. Already having to have the step names and the initialiser arguments in the same namespace is ugly. Having to also make sure step names don't conflict with actual attributes of the classifier (including things like 'fit', 'transform' which are certainly possible step names in existing code) seems unreasonable.

As much as modifying some of the standard semantics of set_params is undesirable, breaking existing code is less desirable. But I'm open to suggestions.

So I pass the ball back to you...
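Concretely, the behaviour described above boils down to something like this (a simplified sketch, not the real BaseEstimator code):

```python
# Illustrative sketch: why "X" must be a valid parameter for "X__Y" to work.

class Estimator:
    def get_params(self, deep=True):
        return dict(vars(self))

    def set_params(self, **params):
        valid = self.get_params(deep=True)
        for key, value in params.items():
            name, delim, sub_key = key.partition('__')
            if name not in valid:
                raise ValueError('Invalid parameter %s' % name)
            if delim:
                # 'X__Y' delegates to the sub-estimator stored as X ...
                getattr(self, name).set_params(**{sub_key: value})
            else:
                # ... but plain 'X' just sets an attribute, which for a
                # pipeline step name has no meaningful effect.
                setattr(self, name, value)
        return self

inner = Estimator(); inner.C = 1.0
outer = Estimator(); outer.est = inner
outer.set_params(est__C=5.0)  # works only because 'est' is a valid param
```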

@amueller
scikit-learn member

I like the feature a lot; didn't have time to look into the implementation, though... and won't have until next week :-/

@amueller
scikit-learn member

It would be nice if we could also set an estimator in the pipeline to None, meaning a step should be skipped.

@jnothman
scikit-learn member

Please help with the test failure: it's failing because _BasePipeline inherits from BaseEstimator, and hence is presumed to be testable among all estimators in sklearn/tests/test_common.py. Is the correct fix to make _BasePipeline some form of mixin (and hence not use super to call BaseEstimator.set_params), or to add _BasePipeline to sklearn.test_common.dont_test?

@amueller
scikit-learn member

The correct method is to make _BasePipeline an abstract base class by making the constructor an abstract method. Look at other base classes for examples.

@jnothman
scikit-learn member

@amueller I like the idea of setting a step to None, but I think there's too much risk of some function -- be it a method on a _BasePipeline descendant, or some external library -- forgetting to handle that case. Why not just let the user supply a dummy? If sklearn lacks a library of dummies, this is a use-case worth considering.
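A user-supplied dummy of the kind suggested above might be as simple as this (hypothetical class; sklearn did not ship one at the time):

```python
# Illustrative sketch: a no-op transformer that passes data through
# unchanged, usable as an explicit stand-in for a skipped step.

class IdentityTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

Swapping such a dummy into a step keeps every method call well-defined, at the cost of the user having to supply it.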

@amueller
scikit-learn member

Yeah I also thought about that. I am not sure I buy your argument, though ;)
It would mean some extra code in the Pipeline, but it would be much more user friendly.

@jnothman
scikit-learn member

Presumably you would also need to raise an error if the final step in the pipeline is None... How about, when we're happy with this patch, we consider the ramifications of that enhancement?

@jnothman
scikit-learn member

Related to @larsmans' comment: the order in which set_params applies its arguments is now significant:

grid_clf = GridSearchCV(
    Pipeline([('sel', SelectKBest(chi2)), ('est', LogisticRegression())]),
    param_grid={'est': [LogisticRegression(), LinearSVC()], 'est__C': [0.1, 1.0]}
)

If 'est__C' is set before 'est' -- dependent on the implementation of dict.iterkeys were it left to BaseEstimator.set_params -- it won't have any effect (and worse, GridSearchCV will act as if it did the right thing).

It is also possible with this patch to set 'steps' as well as one of the steps, and the current implementation doesn't address this in set_params.

We could ignore this last somewhat-pathological case and generally ensure that BaseEstimator.set_params iterates in order of string length, or do those without underscores before those with. Or we can implement orderings on a per-estimator basis (which overriding set_params here does, but there needs to be a comment to that effect; an alternative would be for estimators to explicitly define a parameter ordering when appropriate).

And it's worth considering whether there's a use-case where the user would require an explicit parameter ordering, and whether we care to facilitate that...
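One possible ordering rule, as a sketch only (not necessarily what should be adopted): sort parameter names so that whole-step names precede their nested `__` parameters.

```python
# Illustrative sketch: apply parameters in order of nesting depth, so
# 'est' is always set before 'est__C', regardless of dict iteration order.

def ordered_items(params):
    return sorted(params.items(), key=lambda kv: (kv[0].count('__'), kv[0]))

params = {'est__C': 0.1, 'est': 'LinearSVC()', 'sel__k': 5}
print([k for k, _ in ordered_items(params)])  # → ['est', 'est__C', 'sel__k']
```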

@jnothman
scikit-learn member
jnothman commented Apr 7, 2013

These latest changes attempt to:

  • implement @amueller's suggestion of allowing steps to be set to None;
  • conform to #1805's comment on methods in meta-estimators (which obviates @larsmans' comment about naming run_pipeline by removing it).
@jnothman
scikit-learn member
jnothman commented Apr 8, 2013

And this should possibly be augmented by an example of using advanced grid_search and pipeline features such as:

pipeline = Pipeline([('sel', SelectKBest()), ('clf', LinearSVC())])
param_grid = [{'sel__k': [5, 10, 20], 'sel__score_func': [chi2, f_classif], 'clf__C': [.1, 1, 10]},
              {'sel': None, 'clf__C': [.1, 1, 10]}]
search = GridSearchCV(pipeline, param_grid=param_grid)

not that I think this is a particularly strong motivating example.

@amueller
scikit-learn member
amueller commented Apr 8, 2013

This is great :)
Unfortunately there is a deadline for ICCV next week, so I won't have much time this week either :-/ Sorry!

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Apr 8, 2013
sklearn/pipeline.py
- def score(self, X, y=None):
+ @property
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Apr 8, 2013

Why use a property here? What's the benefit compared to a method?

@jnothman
scikit-learn member
jnothman added a note Apr 8, 2013

As elsewhere, it ensures that hasattr(pipeline, 'inverse_transform') holds iff hasattr(step, 'inverse_transform') for every step in the pipeline. That makes duck typing meaningful.
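The mechanism relies on a property that raises AttributeError, which makes `hasattr()` return False. A self-contained sketch with hypothetical classes:

```python
# Illustrative sketch: the pipeline only appears to support
# inverse_transform when every step does.

class Step:
    """A step with no inverse_transform."""

class Invertible:
    def inverse_transform(self, X):
        return X

class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps

    @property
    def inverse_transform(self):
        # Raising AttributeError inside a property makes hasattr() False.
        if not all(hasattr(s, 'inverse_transform') for s in self.steps):
            raise AttributeError('a step lacks inverse_transform')
        def inverse(X):
            for step in reversed(self.steps):
                X = step.inverse_transform(X)
            return X
        return inverse

print(hasattr(MiniPipeline([Step()]), 'inverse_transform'))        # False
print(hasattr(MiniPipeline([Invertible()]), 'inverse_transform'))  # True
```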

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Apr 8, 2013
sklearn/pipeline.py
+ Parameters passed to the `fit` method of each step, where
+ each parameter name is prefixed such that parameter ``p`` for step
+ ``s`` has key ``s__p``.
+ """
+ last_step = self.steps[-1][-1]
+ if not hasattr(last_step, 'fit_transform') and not hasattr(last_step, 'transform'):
+ raise AttributeError('last step has neither `transform` nor `fit_transform`')
+ def fn(X, y=None, **fit_params):
+ Xt, fit_params = self._pre_transform(X, y, **fit_params)
+ if hasattr(last_step, 'fit_transform'):
+ return last_step.fit_transform(Xt, y, **fit_params)
+ else:
+ return last_step.fit(Xt, y, **fit_params).transform(Xt)
+ return fn
+
+ def _pipeline_property(est_method, doc):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Apr 8, 2013

This meta-programming, relying on a custom vocabulary to populate objects, makes them really hard to understand and to introspect. In addition, it is quite tricky code to write and to debug.

May I suggest something less sophisticated: for each of these methods, say 'transform', have a corresponding '_transform' method. Then a property:

@property
def _get_transform(self):
    if hasattr(self.steps[-1], 'transform'):
        return self._transform
    else:
        raise AttributeError
I believe that this should do the trick.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Apr 8, 2013

Fixed my code above, which wasn't correct :)

@jnothman
scikit-learn member
jnothman added a note Apr 8, 2013

I assume you mean:

@property
def transform(self):  # Note: transform, not _get_transform
    """documentation relating to transform"""
    if hasattr(self.steps[-1], 'transform'):
        return self._transform
    else:
        raise AttributeError

Sure that'll work, assuming a sensible definition for _transform. But seeing as both _transform and transform would then be templated code repeated for predict, predict_proba, predict_log_proba, decision_function, score, and potentially other methods not yet invented, why is repeated code better or less vulnerable to bugs? Most of these methods in the current version of Pipeline are not separately tested, and no one would notice if they had small aberrations. This way, _pipeline_property makes it clear that all these methods are nearly identical; though it should have a comment to that effect.

Of course, you could do similar with something like:

@property
def transform(self):
    """documentation relating to transform"""
    return partial(self._run_pipeline, self.steps[-1][1].transform)

def _run_pipeline(self, est_fn, X, *args, **kwargs):
    Xt = X
    for name, transform in self.steps[:-1]:
        if transform is not None:
            Xt = transform.transform(Xt)
    return est_fn(Xt, *args, **kwargs)

Is that preferred?

@amueller
scikit-learn member
amueller commented May 7, 2013

Currently this PR does three things, if I'm correct:

  • add meaningful docstrings
  • fix the duck typing
  • make it possible to replace estimators with set_params.

Could you maybe cherry-pick the first into master? The diff is very hard to read :-/
Also, it might be worth splitting up the other two parts into two independent PRs if that is not too much work.
Did you have any comments on @GaelVaroquaux's approach?
For the third part, I am really for the feature, but I think we need to discuss the API for it.

@jnothman
scikit-learn member
jnothman commented May 7, 2013

Yes, the PR is bloated. If I find the time, I'll try to separate out the stuff that's not in the title of this PR.

Anyone that doesn't think steps should be set as parameters needs to contend with some other fix for #1800. So as far as I'm concerned, it's a must, or a might-as-well. I am not sure what API issues you mean. Do you mean:

  • should it actually set an attribute on the estimator for consistency with BaseEstimator.set_params? The current implementation can be changed to do that easily, but trying to do the reverse creates backwards-compatibility issues.
  • should there be some other way of specifying parameter-setting priority, rather than overwriting set_params? I consider this an implementation detail.
  • should we support setting to None? I can't see why not.

Gaël's approach is about how to fix the duck-typing and keep the code readable (which is one reason I can't just cherry-pick it into master). He tests hasattr explicitly and raises an AttributeError in the else clause, which seems redundant and verbose. But if this promises substantially more readable, explicit code, I have no problem with it (except that it consumes three more repeated lines per method). He also delegates operation to an underscore-prefixed version of each method, which I think should be avoided: all these methods do the same thing, modulo the method called on the final estimator, so the code is more explicit, and less bug-prone, when refactored into something like _run_pipeline. (But again, this is not precisely the topic of this PR.)

@amueller
scikit-learn member
amueller commented May 8, 2013

Thanks for your comments. I think we should fix the duck-typing first, and then discuss your improvement.
I must admit I did not look too closely into the implementation of the set_params; I was just a bit uneasy with it ;) There is probably no better way to resolve #1800. I think I just have to read the code in detail to convince myself that it's a good idea ;)

@jnothman
scikit-learn member
jnothman commented May 8, 2013

Okay, I've squashed the #1805-related changes and comments, and the parameter setting, into separate commits. When you (or others) give that first commit the okay, I'll pull it into master. Then we can discuss the second commit separately.

@amueller
scikit-learn member
amueller commented May 8, 2013

Thanks for the quick update. I think it looks good but I'll have a closer look later. @GaelVaroquaux, any opinion? Still the same?

@GaelVaroquaux
scikit-learn member

I have completely forgotten about this issue. I can try to find time to have a new close look at it (it will be new because I forgot about it). Before I do that, I would like to have other people's opinion: are people still excited about it and see it as an important addition?

@larsmans
scikit-learn member

Yes, I think this is a good addition.

@jnothman
scikit-learn member

One thing this lacks is an example (perhaps grid search over dimensionality reductions, or comparing linear and nonlinear models). Perhaps there are existing examples that could be modified to employ this feature.

@rmcgibbo

This is a pretty cool feature. Here's an example showing it off for grid search.

rmcgibbo@Computer-2 ~/projects/scikit-learn/examples
$ pep8 reduction_pipeline.py

rmcgibbo@Computer-2 ~/projects/scikit-learn/examples
$ cat reduction_pipeline.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
========================================================================
Optimizing over dimensionality reductions with Pipeline and GridSearchCV
========================================================================

This example constructs a pipeline that does unsupervised dimensionality
reduction followed by prediction with a support vector classifier. It
demonstrates the use of GridSearchCV to optimize over different classes of
estimators in a single CV run -- both PCA and NMF dimensionality reductions
are explored during the grid search.
"""
from __future__ import print_function
print(__doc__)

from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA, NMF

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('svc', SVC())
])

digits = load_digits()
X_digits = digits.data
y_digits = digits.target


grid = GridSearchCV(pipe, cv=3, n_jobs=-1, param_grid={
    'reduce_dim': [PCA(), NMF()],
    'reduce_dim__n_components': [2, 4, 8],
    'svc__C': [1, 10, 100, 1000]
})
grid.fit(X_digits, y_digits)

print("Grid scores:")
print()
for params, mean_score, scores in grid.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() / 2, params))
print()
@jnothman
scikit-learn member

Thanks for that. I'm not sure if it's a good idea to show off the case that's rarely applicable: where between the alternative estimators there is a common parameter name. For example, if we had a SelectKBest in there as well, we'd unfortunately need a more verbose parameter grid:

param_grid=[
    {
        'reduce_dim': [PCA(), NMF()],
        'reduce_dim__n_components': [2, 4, 8],
        'svc__C': [1, 10, 100, 1000]
    },
    {
        'reduce_dim': [SelectKBest()],
        'reduce_dim__k': [2, 4, 8],
        'svc__C': [1, 10, 100, 1000]
    },
]

which is by no means as pretty, but is the more common case :s

@rmcgibbo

Yeah, that's probably more useful.

@jnothman
scikit-learn member

@GaelVaroquaux, @larsmans: I'd like to clean this code a little, but one main factor in its obfuscation is the use of Pipeline.steps but FeatureUnion.transformer_list. [Neither of FeatureUnion's parameter names is optimal: if nothing else, transformer_weights should not end with s to be consistent with class_weight.]

Would it be appropriate to rename transformer_list to steps? Or to rename both to components or elements?

@larsmans
scikit-learn member

Components is our name for one of the matrices produced by matrix decomposition, so that's going to be confusing. Also I don't have a strong opinion on FeatureUnion, but Pipeline is so old and so widely used that we shouldn't change its API.

@jnothman
scikit-learn member

I rewrote what I had in 5be20ac to prefer slightly more readable code over refactored code.

I have also not fixed the ducktyping issues which remain to be reviewed at #XXXXX

I have also included an enhanced version of @rmcgibbo's example.

And of course, I have rebased.

I will add the following to what's new when I know which section to put it in:

   - Added support for substituting or disabling :class:`pipeline.Pipeline`
     and :class:`pipeline.FeatureUnion` components using the ``set_params``
     interface that powers :mod:`sklearn.grid_search`.
     See :ref:`example_plot_compare_reduction.py`. By `Joel Nothman`_ and
     `Robert McGibbon`_.

@GaelVaroquaux, @larsmans, @rmcgibbo et al.: please review!

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling c55b4bd on jnothman:pipeline_params into acac25d on scikit-learn:master.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling d3bd699 on jnothman:pipeline_params into acac25d on scikit-learn:master.

@jnothman jnothman changed the title from ENH enable setting pipeline components as parameters to [MRG] ENH enable setting pipeline components as parameters Aug 10, 2014
@jnothman
scikit-learn member

Note that it appears it would also be useful for users to set steps by name manually. Anyone else think it's time to get this PR back up? (Perhaps especially as it's a bug fix?)

@jnothman
scikit-learn member

I have, probably as a waste of time, rebased on master.

jnothman added some commits May 9, 2013
@jnothman jnothman ENH enable setting pipeline components as parameters 0b554a3
@jnothman jnothman DOC add what's new entry 100ba8b
@jnothman jnothman FIX avoid dict comprehension for Py2.6 2feb77b
@jnothman jnothman FIX remove debug print statements :/ 75ca029
@jnothman
scikit-learn member

Glad to have the tentative support of @larsmans, @rmcgibbo, @amueller and @GaelVaroquaux, but a review in some year would also be nice. Perhaps this is of interest to @MechCoder.

Note, if it saves time for reviewers, that most of the early comments are no longer applicable.

@jnothman jnothman FIX dummy estimator must support transform to go in FeatureUnion e5236cf
@MechCoder
scikit-learn member

Soon!

@jnothman
scikit-learn member

Soon!

That's what he said. And what he said too.
