[MRG + 2] ENH enable setting pipeline components as parameters #1769
@@ -140,47 +196,46 @@ def fit_transform(self, X, y=None, **fit_params): | |||
else: | |||
return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt) | |||
|
|||
def run_pipeline(self, X, est_method, est_args=(), est_kwargs={}): |
This causes a run_pipeline to appear on FeatureUnion, so run would be a better name. I don't really see the use of this at present, though. Also, the docstring is incomplete (the types should be documented).
Its only purpose was to refactor. I made it public as a "why not?" decision, which is probably why I forgot the docstring. Or perhaps because parameter descriptions are lacking throughout the class, and are incomplete in FeatureUnion.
Btw, this only applies to Pipeline, not _BasePipeline, so run_pipeline may indeed be appropriate (unless you think run is sufficiently descriptive). An equivalent does not easily apply to FeatureUnion.
I don't like the way this overloads the meaning of |
@larsmans, the problem is that in order for the underscore notation to work, let's say param "X__Y", apparently "X" needs to be a valid param (though perhaps this is a bug in Now, one option is to make
I personally think actually having the steps as attributes on the object is ugly. Already having to have the step names and the initialiser arguments in the same namespace is ugly. Having to also make sure step names don't conflict with actual attributes of the classifier (including things like 'fit', 'transform' which are certainly possible step names in existing code) seems unreasonable. As much as modifying some of the standard semantics of So I pass the ball back to you... |
I like the feature a lot, didn't have time to look into the implementation, though... and won't have until next week :-/ |
It would be nice if we could also set an estimator in the pipeline to None, meaning a step should be skipped. |
Please help with test fail: it's failing because |
The correct method is to make _BasePipeline an abstract base class by making the constructor an abstract function. Look at other base classes for examples. |
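A minimal sketch of that suggestion, using only the standard library's abc module. The class names and the steps argument are illustrative stand-ins, not the actual scikit-learn code, and the helper can_instantiate exists only to demonstrate the effect.

```python
from abc import ABC, abstractmethod


class BasePipelineSketch(ABC):
    """Hypothetical stand-in for _BasePipeline; not the scikit-learn class."""

    @abstractmethod
    def __init__(self):
        """Abstract constructor: the base class cannot be instantiated."""


class PipelineSketch(BasePipelineSketch):
    """Hypothetical concrete subclass that supplies its own constructor."""

    def __init__(self, steps):
        self.steps = steps


def can_instantiate(cls, *args):
    """Return True if cls(*args) succeeds, False if it raises TypeError."""
    try:
        cls(*args)
    except TypeError:
        return False
    return True
```

With this arrangement, generic tests that try to construct every estimator class can skip the base class because it is abstract, while concrete subclasses remain constructible.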
@amueller I like the idea of setting a step to |
Yeah I also thought about that. I am not sure I buy your argument, though ;) |
Presumably you would also need to raise an error if the final step in the pipeline is None... How about, when we're happy with this patch, we consider the ramifications of that enhancement. |
Related to @larsmans' comment,
If 'est__C' is set before 'est' -- dependent on the implementation of It is also possible with this patch to set 'steps' as well as one of the steps, and the current implementation doesn't address this in We could ignore this last somewhat-pathological case and generally ensure that And it's worth considering whether there's a use-case where the user would require an explicit parameter ordering, and whether we care to facilitate that... |
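The ordering concern can be illustrated with a small sketch. It assumes one possible resolution, namely applying whole-step replacements before any nested 'step__param' values, so that the outcome does not depend on dictionary order. ToyEstimator and set_params_ordered are hypothetical names, and only a single level of nesting is handled.

```python
class ToyEstimator:
    """Hypothetical estimator holding a single parameter C."""

    def __init__(self, C=1.0):
        self.C = C


def set_params_ordered(steps, **params):
    """Apply step replacements first, then nested 'step__param' values.

    `steps` is a list of (name, estimator) pairs; a new list is returned.
    """
    steps = list(steps)
    names = [name for name, _ in steps]
    # Pass 1: replace whole steps, e.g. est=ToyEstimator().
    for key, value in params.items():
        if "__" not in key:
            if key not in names:
                raise ValueError("Unknown step: %r" % key)
            steps[names.index(key)] = (key, value)
    # Pass 2: set nested parameters on whichever estimator is now in place.
    lookup = dict(steps)
    for key, value in params.items():
        if "__" in key:
            name, _, sub = key.partition("__")
            if name not in lookup:
                raise ValueError("Unknown step: %r" % name)
            setattr(lookup[name], sub, value)
    return steps
```

Because the replacement happens in the first pass, 'est__C' always lands on the new estimator, whichever key comes first in the parameter dictionary.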
These latest changes attempt to:
|
And this should possibly be augmented by an example of using advanced
pipeline = Pipeline([('sel', SelectKBest()), ('clf', LinearSVC())])
param_grid = [{'sel__k': [5, 10, 20], 'sel__score_func': [chi2, f_classif], 'clf__C': [.1, 1, 10]}, {'sel': None, 'clf__C': [.1, 1, 10]}]
search = GridSearchCV(pipeline, param_grid=param_grid)
not that I think this is a particularly strong motivating example. |
This is great :) |
|
||
def score(self, X, y=None): | ||
@property |
Why use a property here? What's the benefit compared to a method?
As elsewhere, it ensures that hasattr(pipeline, 'inverse_transform') iff hasattr(step, 'inverse_transform') for all step in pipeline. Makes ducktyping meaningful.
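A rough sketch of why a property gives that behaviour, using hypothetical toy classes rather than the actual Pipeline code. The property raises AttributeError when any step lacks the method, and hasattr reports False whenever attribute access raises AttributeError.

```python
class Scaler:
    """Hypothetical invertible transformer."""

    def transform(self, xs):
        return [x * 2 for x in xs]

    def inverse_transform(self, xs):
        return [x / 2 for x in xs]


class Rounder:
    """Hypothetical transformer with no inverse."""

    def transform(self, xs):
        return [round(x) for x in xs]


class ToyPipeline:
    """Hypothetical pipeline; only the duck-typing mechanism is shown."""

    def __init__(self, steps):
        self.steps = steps

    @property
    def inverse_transform(self):
        # Touch the attribute on every step; a missing one raises
        # AttributeError, which makes hasattr(pipeline, ...) return False.
        for _, step in self.steps:
            getattr(step, "inverse_transform")
        return self._inverse_transform

    def _inverse_transform(self, xs):
        # Undo the steps in reverse order.
        for _, step in reversed(self.steps):
            xs = step.inverse_transform(xs)
        return xs
```

A plain method would exist on every pipeline regardless of its steps, so hasattr would always be True and the failure would only surface at call time.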
Currently this PR does 3 things if I'm correct:
Could you maybe cherry-pick the first into master? The diff is very hard to read :-/ |
Yes, the PR is bloated. If I find the time, I'll try to separate out the stuff that's not in the title of this PR. Anyone that doesn't think steps should be set as parameters needs to contend with some other fix for #1800. So as far as I'm concerned, it's a must, or a might-as-well. I am not sure what API issues you mean. Do you mean:
Gaël's approach is about how to fix the duck-typing and keep the code readable (which is one reason I can't just cherry-pick it into master). He tests |
Thanks for your comments. I think we should fix the duck-typing first, and then discuss your improvement. |
Okay, I've squashed the #1805 and comments and the parameter setting separately. When you (or others) give that first commit the okay, I'll pull it into master. Then we can discuss the second commit separately. |
Thanks for the quick update. I think it looks good but I'll have a closer look later. @GaelVaroquaux, any opinion? Still the same? |
for name, trans in self.transformer_list) | ||
delayed(_transform_one)(trans, name, weight, X) | ||
for name, trans, weight in self._iter()) | ||
if not Xs: |
Is this tested now? Now this happens when all transformers are None, right? I think I understand now what happens ;) (still maybe add a one-line comment "all transformers are None"? Is having an empty list of transformers allowed?)
Commented. I suppose an empty list of transformers is allowed but untested.
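To make the `if not Xs` case concrete, here is a toy sketch with plain lists standing in for arrays. The function name and the callable transformers are hypothetical; the point is only that when every transformer is None (or the list is empty) there are no blocks to concatenate, so one empty row per sample is returned.

```python
def toy_union_transform(transformer_list, rows):
    """Concatenate the column blocks produced by each non-None transformer.

    Each transformer is a hypothetical callable mapping a list of rows
    to a list of rows of the same length.
    """
    blocks = [trans(rows) for _, trans in transformer_list if trans is not None]
    if not blocks:
        # All transformers are None (or the list is empty):
        # return one empty row per sample.
        return [[] for _ in rows]
    # Join the i-th row of every block side by side.
    return [sum((block[i] for block in blocks), []) for i in range(len(rows))]
```

An empty transformer list falls into the same branch as a list of None entries, which is why one test can cover both situations.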
I think tests that are missing are:
|
Also test and improve ducktyping of Pipeline methods
I've fixed those things. Also, despite my comment to @MechCoder, I've now made support for None as last estimator when |
017e5ef to 33c7993
LGTM though python 2.6 complains |
I've submitted a patch that just uses |
I'm astounded that this might actually, finally, be merged. If @MechCoder or @amueller would like to give its recent changes a final pass, that'd be great. |
LGTM. Sorry, apparently I've been putting out fires in other places for the last 3 years (I've wanted this for a while). |
hahaha ;) On 29 August 2016 at 02:47, Andreas Mueller notifications@github.com
|
Well, I suppose I'd better merge before what's new merge conflicts appear! |
Sorry for the delay. Does it still need a review? :p |
Thanks for your effort and long wait! |
Wohoo! |
This will be great! |
Is it worth highlighting this feature alongside |
…arn#1769) Pipeline and FeatureUnion steps may now be set with set_params, and transformers may be replaced with None to effectively remove them. Also test and improve ducktyping of Pipeline methods
Until now, get_params() would return the steps of a pipeline by name, but setting them would fail silently (by setting an unused attribute); fixes bug #1800.
This allows users to grid search over alternative estimators for some step, as illustrated in the included example, or even to delete a step in a search by setting it to None.
But it may also be more directly practical: while a user may currently use get_params to extract an estimator from a nested pipeline using the double-underscore naming convention, e.g. for selective serialisation, they cannot use the reciprocal set_params which this PR enables.
This changeset also prohibits step names that equal initialisation parameters of the pipeline; otherwise FeatureUnion.set_params(transformer_weights=foo) would be ambiguous.
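The behaviour described above can be sketched with a toy container. ToySettablePipeline and its callable steps are hypothetical and much simpler than the real classes; the sketch only shows steps being replaced by name through set_params, a None step being skipped, and step names that collide with a constructor argument being rejected.

```python
class ToySettablePipeline:
    """Hypothetical pipeline-like container; not the scikit-learn class."""

    def __init__(self, steps):
        self.steps = list(steps)
        self._validate_names()

    def _validate_names(self):
        names = [name for name, _ in self.steps]
        if len(set(names)) != len(names):
            raise ValueError("Step names must be unique: %r" % names)
        # 'steps' is this toy's only constructor argument.
        clash = set(names) & {"steps"}
        if clash:
            raise ValueError(
                "Step names conflict with constructor arguments: %r" % sorted(clash)
            )

    def get_params(self):
        # Expose the step list itself plus each step under its own name.
        params = {"steps": self.steps}
        params.update(self.steps)
        return params

    def set_params(self, **params):
        if "steps" in params:
            self.steps = list(params.pop("steps"))
            self._validate_names()
        names = [name for name, _ in self.steps]
        for name, est in params.items():
            if name not in names:
                raise ValueError("Invalid parameter %r" % name)
            self.steps[names.index(name)] = (name, est)
        return self

    def transform(self, value):
        # A step set to None is skipped.
        for _, step in self.steps:
            if step is not None:
                value = step(value)
        return value
```

Without the name check, a step called 'steps' would make set_params(steps=...) ambiguous between replacing the whole list and replacing that one step, which mirrors the transformer_weights case mentioned in the description.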