Make Pipeline compatible with AdaBoost #2630
Conversation
…ing" This reverts commit 87db631.
Getting recent changes from scikit-learn people.
Thanks for the PR! This relates to some issues that have been under discussion for some time. I'll review your implementation and then try to point some of them out.
@@ -52,3 +52,5 @@ benchmarks/bench_covertype_data/
*.prefs
.pydevproject
.idea

I'm not sure you meant to commit this file. At least, please remove this blank line.
Overriding __getattribute__ (or __getattr__) is really not something that should be done. In general, I am pretty much opposed to overriding any __foo__ method: thou shalt not change how objects behave. The delegation mechanism to be used in scikit-learn is the get_params/set_params mechanism, but anyhow, it is only for model parameters.
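For context, a minimal sketch of the get_params/set_params mechanism referred to above (standard scikit-learn usage, unrelated to this PR's changes):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Nested parameters are addressed with the step-name prefix.
pipe.set_params(clf__C=0.1)
print(pipe.get_params()["clf__C"])  # -> 0.1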
        except AttributeError:
            pass
    raise AttributeError("Neither the '%s' object nor any of its steps has the attribute '%s'"
                         % (self.__class__.__name__, name))
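For readers following along, a rough sketch of the kind of attribute delegation under discussion (an illustration only, not necessarily the PR's exact code):

from sklearn.pipeline import Pipeline

class DelegatingPipeline(Pipeline):
    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails on the pipeline itself;
        # self.__dict__ is used here to avoid recursive lookups of 'steps'.
        for _, step in self.__dict__.get("steps", []):
            try:
                return getattr(step, name)
            except AttributeError:
                pass
        raise AttributeError("Neither the '%s' object nor any of its steps has the "
                             "attribute '%s'" % (self.__class__.__name__, name))

A drawback of this pattern, and part of the objection raised below, is that the set of available attributes cannot be stated precisely and changes with the steps.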
That's really doing way too much magic. You have here a behavior that is impossible to define in a precise way, and thus a potential source of bugs. It is something to avoid.
Maybe it would be better just to have a classes_ property that explicitly gets the classes_ from the last step? Like this (untested):
@property
def classes_(self):
    return self.steps[-1][1].classes_
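For context, a minimal self-contained sketch of how such a property would behave (using an assumed LogisticRegression final step; untested against this PR's branch):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

class PipelineWithClasses(Pipeline):
    @property
    def classes_(self):
        # Explicitly expose the classes_ of the final step, as suggested above.
        return self.steps[-1][1].classes_

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe = PipelineWithClasses([("clf", LogisticRegression())]).fit(X, y)
print(pipe.classes_)  # -> [0 1]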
As you've discovered, this approach has issues of its own. Maybe your problem is more readily solved by modifying AdaBoost instead of Pipeline, even if that design seems awkward.
@@ -141,64 +168,71 @@ def fit_transform(self, X, y=None, **fit_params):
        else:
            return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

-    def predict(self, X):
+    def predict(self, X, **params):
In which situation do we need parameters on predict?
Good point. It is not needed for AdaBoost. The reason it's here is that I sometimes use the scikit-learn Pipeline as a sort of glue to combine different kinds of models from other packages. For example, I have some wrappers that I use with the statsmodels GLM models to make them work in the Pipeline. The GLM models can take offset or exposure parameters when making predictions. So, it's not really needed for scikit-learn by itself but has been a useful extension for me when combining scikit-learn with other packages. It may be out of scope.
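As a toy illustration of the kind of estimator that motivates predict-time keyword arguments (a made-up adapter for illustration only, not statsmodels' API and not this PR's code):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression

class OffsetRegressor(BaseEstimator, RegressorMixin):
    """Toy regressor whose predict accepts an extra per-sample 'offset' argument."""

    def fit(self, X, y):
        self.model_ = LinearRegression().fit(X, y)
        return self

    def predict(self, X, offset=None):
        pred = self.model_.predict(X)
        # A pipeline that forwards **params to the final step could supply 'offset' here.
        return pred if offset is None else pred + offset

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
est = OffsetRegressor().fit(X, y)
print(est.predict(X[:3], offset=np.ones(3)))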
@jnothman In this case, wouldn't your code be non-functional before sample_weight is implemented for your transformer? As it's coded here, sample_weight goes to all steps if it's provided. If any step does not accept sample_weight, an error would be raised. I did consider the alternative, having sample_weight go only to those steps that could accept it, but that approach seemed rather more problematic (as you point out).

Perhaps a better solution is to explicitly set up which arguments get sent to which steps when the Pipeline is created. Something like this:

t1 = SomeTransformer()
…

where I've assumed that t1 and c get sample_weight and t2 does not.

Another use case, which I didn't properly explain initially but mentioned in my response to @GaelVaroquaux above, is using scikit-learn as a sort of glue to connect other packages. For example, I currently use this modified Pipeline to connect py-earth models to GLMs from statsmodels, using a simple adapter around GLM to make it conform to the scikit-learn API. The adapted fit and predict methods of the GLMs can take offset or exposure arguments, which these changes to Pipeline also allow me to handle easily. I don't know, though, how many other users are interested in doing things like this.
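Returning to the constructor-time routing idea above, one hedged sketch of how it could look, using real scikit-learn estimators in place of the placeholders (ArgRoutingPipeline and arg_routes are hypothetical names; StandardScaler's sample_weight support assumes scikit-learn 0.24 or later):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class ArgRoutingPipeline(Pipeline):
    """Hypothetical pipeline: route each named fit argument only to the listed steps."""

    def __init__(self, steps, arg_routes=None):
        super().__init__(steps)
        self.arg_routes = arg_routes or {}  # e.g. {"sample_weight": ["t1", "c"]}

    def _step_args(self, name, fit_args):
        return {a: v for a, v in fit_args.items()
                if name in self.arg_routes.get(a, [])}

    def fit(self, X, y=None, **fit_args):
        Xt = X
        for name, step in self.steps[:-1]:
            Xt = step.fit(Xt, y, **self._step_args(name, fit_args)).transform(Xt)
        last_name, last_step = self.steps[-1]
        last_step.fit(Xt, y, **self._step_args(last_name, fit_args))
        return self

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = (X[:, 0] > 0).astype(int)
w = rng.rand(30)

pipe = ArgRoutingPipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())],
    arg_routes={"sample_weight": ["scale", "clf"]},  # only these steps receive the weights
)
pipe.fit(X, y, sample_weight=w)
print(pipe.predict(X[:5]))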
I can see the merit in this approach, but naming that parameter will be hard :) I also can only assume it would be adhered to whenever a … One could also take a similarly explicit approach to pass-through attributes.

@GaelVaroquaux, do you think we should be compiling a document of API design issues for pipelines and other meta-estimators?
Yes. Please ping @ogrisel about this. We have had a lot of discussions on this topic. Thanks a lot!
Link to the document created by @jnothman: https://github.com/scikit-learn/scikit-learn/wiki/Pipeline-et-al.-design-issues
I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!
@hshteingart "I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!"

I do not know if you ever received a reply to your question. I have recently begun using Pipelines and found the following cases to work using sklearn version 0.19.dev0. For example:
# Case 1: plain estimator; sample_weight is passed directly to fit
model = GradientBoostingClassifier()
model.fit(X_train, y_train, **{'sample_weight': sample_weights_train})
model.predict(X_test)
# Case 2: Pipeline; sample_weight is prefixed with the step name 'classifier'
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])
pipe.fit(X_train, y_train, **{'classifier__sample_weight': sample_weights_train})
pipe.predict(X_test)
# Case 3: make_pipeline; the prefix is the lower-cased class name
model = GradientBoostingClassifier()
pipe = make_pipeline(model)
pipe.fit(X_train, y_train,
**{'gradientboostingclassifier__sample_weight': sample_weights_train})
pipe.predict(X_test)
# Case 4: GridSearchCV on a bare estimator; sample_weight given via fit_params
model = GradientBoostingClassifier()
param_grid = [{'learning_rate': [0.01]}]
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring="accuracy",
fit_params={'sample_weight': sample_weights_train})
grid.fit(X_train, y_train)
grid.predict(X_test)
# Case 5: GridSearchCV on a Pipeline; fit_params uses the 'classifier__' prefix
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])
param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]
grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
fit_params={'classifier__sample_weight': sample_weights_train})
grid.fit(X_train, y_train)
grid.predict(X_test)
# Case 6: GridSearchCV on make_pipeline; fit_params uses the class-name prefix
model = GradientBoostingClassifier()
pipe = make_pipeline(model)
param_grid = [{'gradientboostingclassifier__learning_rate': [0.01]}]
grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
fit_params={'gradientboostingclassifier__sample_weight': sample_weights_train})
grid.fit(X_train, y_train)
grid.predict(X_test)

NB: Cases 5 and 6 are pending API syntax verification. But from what I have read it does not seem that … @amueller, is this a correct assessment?

UPDATE 1: The syntax for cases 5 and 6 has been verified to work.

UPDATE 2: I was incorrect to believe that the …
@ecampana, I just tried your examples 5 and 6 on the master branch of scikit-learn, and they both work. However, you should pass fit parameters from the grid searcher via the fit call:

grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)

and

grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)

PR #8278 added the ability for grid searchers to pass through fit parameters to the estimators they wrap.
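For completeness, a minimal end-to-end sketch of this pattern (assuming a scikit-learn release that includes #8278, i.e. 0.19 or later; data and names are made up for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
sample_weights = rng.rand(60)

pipe = Pipeline([("classifier", GradientBoostingClassifier())])
grid = GridSearchCV(pipe,
                    param_grid={"classifier__learning_rate": [0.01, 0.1]},
                    cv=3, scoring="accuracy")

# Fit parameters passed to grid.fit are forwarded to the wrapped pipeline,
# which routes them to the step named 'classifier'.
grid.fit(X, y, classifier__sample_weight=sample_weights)
print(grid.best_params_)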
@stephen-hoover, I have verified that cases 5 and 6 of my original post work as you said they would. Thank you for confirming my syntax and for clarifying that … However, when I try your suggested syntax I get the following error:

TypeError: fit() got an unexpected keyword argument 'classifier__sample_weight'

I used the following python code:
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])
param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]
grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy")
# None of the below variations on the syntax work
#grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, sample_weight=sample_weights_train)
grid.predict(X_test)

I am using sklearn version 0.19.dev0, and maybe this is the reason why your suggested syntax is not working. Would you happen to know what I may be doing incorrectly? My last question is regarding the …

UPDATE 1: I was incorrect to believe that the …
@ecampana, if I run the code you've provided, the grid.fit call with classifier__sample_weight works for me, so perhaps you need a newer version of scikit-learn.

What would sample weights on predict mean? Predictions happen on a per-sample basis.
@stephen-hoover, when I have a chance I will update to a newer version of scikit-learn as you suggested, but I am confident that the syntax you recommend works. Thank you for pointing out an alternative solution. By the way, you made me realize what my confusion was, and you are correct that predictions happen on a per-sample basis. The sample weights I am using are physics event weights, so I would in principle apply them when constructing an event histogram of the model's predicted probabilities. For example, sample_weights_test would be applied as follows:

plt.hist(grid.predict_proba(X_test), weights=sample_weights_test, bins=bins,
         histtype='stepfilled', normed=False)

The following methods/functions accept sample weights: … But calibration_curve does not.

I have one last observation. I noticed that average_precision_score does not allow for the possibility of negative event weights. The topic of negative event weights was mentioned in #3774. In my case they constitute less than 1% of events, so for such situations I just set the negative event weights to 1.0 as a temporary solution.
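A small sketch of the temporary workaround described above (toy labels, scores, and weights are made up; average_precision_score does accept a sample_weight argument):

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.8, 0.4, 0.3, 0.9])
event_weights = np.array([1.2, 0.7, -0.3, 1.0, 0.5])  # a small fraction may be negative

# Temporary workaround: replace negative event weights with 1.0 before scoring.
safe_weights = np.where(event_weights < 0, 1.0, event_weights)
print(average_precision_score(y_true, y_score, sample_weight=safe_weights))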
@ecampana, I see you're having a discussion about … Could you open a new issue for that?
@stephen-hoover, thank you for your reply. I will open up a new issue. Hopefully when I put in a new PR I can give an explanation that will clarify the motivation behind my request to add in sample weight support.
What is the status of this issue of sample weights and pipelines? @ecampana, is the sample weight request for calibration curves resolved?
@CentralLT I believe the actual code for implementing sample weights for the calibration_curve method was completed a few years ago (i.e., all checks have passed). I think @stephen-hoover recommended a minor change, which I implemented on my end long ago but could not validate because I was running an older version of scikit-learn. I would like to pick this task up again if anyone is willing to help answer questions I may have in order to complete the PR. I am willing to do the necessary legwork. Is there any objection, at least in principle, to this PR?
This pull request makes the following changes to the Pipeline class:
My objective was to make changes that would generalize well to other potential uses of the Pipeline. For example, it is now possible to pass an argument to the fit method of a particular step by passing a parameter named step__argname instead of just argname. It is also possible to access step attributes other than the classes_ attribute of the final step. Neither of these features is strictly necessary to get AdaBoost compatibility, but they seemed like reasonable generalizations of the necessary functionality.

I think sklearn devs should pay particular attention to change 3 above and make sure they're comfortable with it. There are definitely other generalizations that might be better, such as only allowing access to the attributes of the final step, or even only forwarding the classes_ attribute.
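For illustration, the step__argname convention described above reads roughly as follows in use (a minimal sketch with assumed step names; present-day scikit-learn Pipelines accept fit parameters in this prefixed form):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
y = (X[:, 0] > 0).astype(int)
w = rng.rand(40)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# 'svc__sample_weight' is routed only to the fit method of the 'svc' step.
pipe.fit(X, y, svc__sample_weight=w)
print(pipe.score(X, y))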
A gist showing the intended usage of this enhancement is located here:
https://gist.github.com/jcrudy/7756798
The gist uses py-earth because I was not familiar with a scikit-learn transformer that takes sample_weight (I don't know them all, though). If the EarthRegressor (#2285) pull request is merged then there will be at least one such transformer.