
Make Pipeline compatible with AdaBoost #2630

Closed. jcrudy wants to merge 10 commits.

Conversation

@jcrudy commented Dec 2, 2013

This pull request makes the following changes to the Pipeline class:

  1. Allow passing arguments to the methods of steps, not just to their constructors. This is necessary so that AdaBoost can pass the sample_weight argument.
  2. If an argument is not of the form step__argname, it is passed to all steps. Again, this is necessary so that AdaBoost can pass the sample_weight argument.
  3. Forward attribute access to the steps when the attribute is not found on the Pipeline object itself. Steps are searched backwards, starting with the final step. This allows AdaBoostClassifier to access the classes_ attribute of the final classifier in the Pipeline.

My objective was to make changes that would generalize well to other potential uses of the Pipeline. For example, it is now possible to pass an argument to the fit method of a particular step by passing a parameter named step__argname instead of just argname. It is also possible to access step attributes other than the classes_ attribute of the final step. Neither of these features is strictly necessary for AdaBoost compatibility, but they seemed like reasonable generalizations of the necessary functionality. I think the sklearn devs should pay particular attention to change 3 above and make sure they're comfortable with it. There are other possible generalizations that might be better, such as only allowing access to the attributes of the final step, or even only forwarding the classes_ attribute.

A gist showing the intended usage of this enhancement is located here:

https://gist.github.com/jcrudy/7756798

The gist uses py-earth because I was not familiar with a scikit-learn transformer that takes sample_weight (I don't know them all, though). If the EarthRegressor (#2285) pull request is merged, there will be at least one such transformer.
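A condensed sketch of the intended usage (stand-in data and step names; the bare sample_weight broadcast of change 2 and the attribute fall-through of change 3 exist only on this branch, not in released scikit-learn):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
w = np.random.rand(100)

pipe = Pipeline([('clf', DecisionTreeClassifier())])

pipe.fit(X, y, clf__sample_weight=w)   # change 1: target one step's fit
pipe.fit(X, y, sample_weight=w)        # change 2: broadcast to every step
print(pipe.classes_)                   # change 3: forwarded from the last step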

@jnothman (Member) commented Dec 2, 2013

Thanks for the PR! This relates to some issues that have been under discussion for some time. I'll review your implementation and then try to point some of them out.

@@ -52,3 +52,5 @@ benchmarks/bench_covertype_data/
*.prefs
.pydevproject
.idea

A maintainer commented on the diff:
I'm not sure you meant to commit this file. At least, please remove this blank line.

@GaelVaroquaux (Member) commented:
Overriding __getattribute__ (or __getattr__) is really not something that should be done. In general, I am pretty much opposed to overriding any __foo__ method: thou shalt not change how objects behave.

The delegation mechanism to be used in scikit-learn is the get_params/set_params mechanism, but anyhow, it is only for model parameters.

        except AttributeError:
            pass
        raise AttributeError("Neither the '%s' object nor any of its steps has the attribute '%s'"
                             % (self.__class__.__name__, name))
A maintainer commented on the diff:
That's really doing way too much magic. You have here a behavior that is impossible to define in a precise way, and thus a potential source of bugs. It is something to avoid.

@jcrudy (Author) replied:
Maybe it would be better just to have a classes_ property that explicitly gets the classes_ from the last step? Like this (untested):

@property
def classes_(self):
    return self.steps[-1][1].classes_

@jnothman (Member) commented Dec 2, 2013

As you've discovered, the issue with sample_weight in Pipeline is that most transformers don't accept -- let alone use -- it. So we either need to support sample_weight more universally, or Pipeline needs to inspect method signatures or similar to work out where it can pass which parameters (and then **kwargs approaches are even more problematic). And again, this implicit approach creates version compatibility issues: I have a Pipeline working one way, then someone includes sample_weight support in a transformer, and my Pipeline's behaviour changes.

Maybe your problem is more readily solved by modifying AdaBoost instead of Pipeline, even if that design seems awkward.
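To make the hazard concrete, a minimal sketch of that kind of signature inspection (fit_accepts_sample_weight is a hypothetical helper, not scikit-learn API):

import inspect
from sklearn.preprocessing import StandardScaler

def fit_accepts_sample_weight(estimator):
    # True if the estimator's fit signature declares a sample_weight parameter.
    return 'sample_weight' in inspect.signature(estimator.fit).parameters

# True on recent versions (StandardScaler gained sample_weight in 0.24),
# False on older ones -- so the same Pipeline would silently change behaviour.
print(fit_accepts_sample_weight(StandardScaler()))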

@@ -141,64 +168,71 @@ def fit_transform(self, X, y=None, **fit_params):
         else:
             return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

-    def predict(self, X):
+    def predict(self, X, **params):
A maintainer commented on the diff:
In which situation do we need parameters on predict?

@jcrudy (Author) replied:
Good point. It is not needed for AdaBoost. The reason it's here is that I sometimes use the scikit-learn Pipeline as a sort of glue to combine different kinds of models from other packages. For example, I have some wrappers that I use with the statsmodels GLM models to make them work in the Pipeline. The GLM models can take offset or exposure parameters when making predictions. So it's not really needed for scikit-learn by itself, but it has been a useful extension for me when combining scikit-learn with other packages. It may be out of scope.

@coveralls commented:
Coverage Status

Coverage remained the same when pulling 451df83 on jcrudy:pipeline into 06a1eaf on scikit-learn:master.

@jcrudy (Author) commented Dec 3, 2013

@jnothman: "And again, this implicit approach creates version compatibility issues: I have a Pipeline working one way, then someone includes sample_weight support in a transformer, and my Pipeline's behaviour changes."

In this case, wouldn't your code be non-functional before sample_weight is implemented for your transformer? As coded here, sample_weight goes to all steps if it's provided, and if any step does not accept sample_weight, an error is raised. I did consider the alternative, having sample_weight go only to those steps that can accept it, but that approach seemed rather more problematic (as you point out).

Perhaps a better solution is to explicitly set up which arguments get sent to which steps when the Pipeline is created. Something like this:

t1 = SomeTransformer()
t2 = SomeOtherTransformer()
c = SomeClassifier()
p = Pipeline([('t1', t1), ('t2', t2), ('c', c)], [('sample_weight', ('t1', 'c'))])

where I've assumed that t1 and c get sample_weight and t2 does not.

Another use case, which I didn't properly explain initially but mentioned in my response to @GaelVaroquaux above, is using scikit-learn as a sort of glue to connect other packages. For example, I currently use this modified Pipeline to connect py-earth models to GLMs from statsmodels, using a simple adapter around GLM to make it conform to the scikit-learn api. The adapted fit and predict methods of the GLMs can take offset or exposure arguments, which these changes to Pipeline also allow me to handle easily. I don't know, though, how many other users are interested in doing things like this.

@jnothman (Member) commented Dec 3, 2013

p = Pipeline([('t1', t1), ('t2', t2), ('c', c)], [('sample_weight', ('t1', 'c'))])

I can see the merit in this approach, but naming that parameter will be hard :) I can also only assume it would be adhered to whenever a sample_weight kwarg is passed in, whether to score, fit or transform. It's not very pretty either (though it should be a dict, not a list of tuples).

One could also take a similarly explicit approach to pass-through attributes: attribute_mapping={'classes_': 'clf.classes_'}.
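For concreteness, a hypothetical spelling of both suggestions as dicts (fit_param_routing and attribute_mapping are illustrative names only; this API was never adopted):

p = Pipeline([('t1', t1), ('t2', t2), ('c', c)],
             fit_param_routing={'sample_weight': ['t1', 'c']},
             attribute_mapping={'classes_': 'c.classes_'})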

@GaelVaroquaux, do you think we should be compiling a document of API design issues for pipelines and other meta-estimators?

@GaelVaroquaux (Member) commented:
@jnothman: "@GaelVaroquaux, do you think we should be compiling a document of API design issues for pipelines and other meta-estimators?"

Yes. Please ping @ogrisel about this. We have had a lot of discussions about API issues, and I think that the sample_weight problem that you are having is related to some of our discussions.

Thanks a lot!

@jcrudy (Author) commented Dec 5, 2013

@hshteingart commented:
I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!

@ecampana commented Mar 9, 2017

@hshteingart: "I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!"

I do not know if you ever received a reply to your question. I have recently begun using Pipelines and found the following to work using sklearn version 0.19.dev0. For example:

  1. Classifier
model = GradientBoostingClassifier()

model.fit(X_train, y_train, **{'sample_weight': sample_weights_train})

model.predict(X_test)
  2. Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

pipe.fit(X_train, y_train, **{'classifier__sample_weight': sample_weights_train})

pipe.predict(X_test)
  3. make_pipeline
model = GradientBoostingClassifier()
pipe = make_pipeline(model)

pipe.fit(X_train, y_train, 
         **{'gradientboostingclassifier__sample_weight': sample_weights_train})

pipe.predict(X_test)
  4. GridSearchCV with Classifier
model = GradientBoostingClassifier()

param_grid = [{'learning_rate': [0.01]}]

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)
  5. GridSearchCV with Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'classifier__sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)
  6. GridSearchCV with make_pipeline
model = GradientBoostingClassifier()
pipe = make_pipeline(model)

param_grid = [{'gradientboostingclassifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'gradientboostingclassifier__sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)

NB: Cases 5 and 6 are pending API syntax verification. From what I have read, it does not seem that GridSearchCV passes fit_params to the fit method of Pipeline or make_pipeline objects. I have so far not found a workaround. Additionally, the classifiers do not seem to allow the test sample weights to be propagated to the prediction method. This appears to be related to #4497.

@amueller, is this a correct assessment of GridSearchCV using Pipeline and fit_params? Thank you for any information that you may be able to provide.

UPDATE 1: The syntax for cases 5 and 6 has been verified to work.

UPDATE 2: I was incorrect to believe that the predict, predict_proba, and decision_function methods needed a sample_weight argument for the use case of physics event weights.

@stephen-hoover (Contributor) commented Mar 11, 2017

@ecampana, I just tried your examples 5 and 6 on the master branch of scikit-learn, and they both work. However, you should pass fit parameters from the grid searcher via the fit method rather than in the constructor, e.g.

grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)

and

grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)

PR #8278 added the ability for grid searchers to pass through fit parameters to the estimators they wrap.

@ecampana commented Mar 16, 2017

@stephen-hoover, I have verified that cases 5 and 6 of my original post work as you said they would. Thanks for confirming my syntax and for clarifying that GridSearchCV and Pipeline work together as a user would intend. However, I did try to pass sample_weights_train to the fit method of the GridSearchCV object instead of its constructor, as you had recommended, but unfortunately I received the following error message when I attempted it.

TypeError: fit() got an unexpected keyword argument 'classifier__sample_weight'

I used the following Python code:

  5. GridSearchCV with Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy")

# None of the below variations on the syntax work
#grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, sample_weight=sample_weights_train)

grid.predict(X_test)

I am using sklearn version 0.19.dev0, and maybe this is the reason your suggested syntax is not working. Would you happen to know what I may be doing incorrectly?

My last question is about the predict method. Currently it does not take sample_weight as an argument. Do you happen to know if there is any interest in adding this feature to scikit-learn, or if there is a development timeline for it? Is there a recommended workaround for this use case, or have people in general implemented their own version of the function for the time being?

UPDATE 1: I was incorrect to believe that the predict, predict_proba, and decision_function methods needed a sample_weight argument for the use case of physics event weights.

@stephen-hoover (Contributor) commented:
@ecampana, if I run the code you've provided, grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train) works. You're probably using a version of scikit-learn which doesn't include the code necessary to run it. I believe "0.19.dev0" means "between 0.18.1 and 0.19.0" and doesn't specify an exact state of the code. Update to the latest master and it should work.

What would sample weights on predict do? Predictions happen one sample at a time, so I don't know how including sample weights would change the output.

@ecampana commented Mar 22, 2017

@stephen-hoover, when I have a chance I will update to a newer version of scikit-learn as you suggested, but I am confident that the syntax you recommend works. Thank you for pointing out an alternative solution.

By the way, you made me realize what my confusion was, and you are correct that predictions happen on a per-sample basis. The sample weights I am using are physics event weights, so I would in principle apply them when constructing an event histogram of the model's predicted probabilities. For example, sample_weights_test would be applied as follows:

plt.hist(grid.predict_proba(X_test)[:, 1],  # positive-class probabilities
         weights=sample_weights_test, bins=bins,
         histtype='stepfilled', normed=False)

The following methods/functions all accept sample_weight and so cover applying the sample_weights_test values: model.score, cross_val_score, classification_report, roc_auc_score, roc_curve, average_precision_score, precision_recall_curve, brier_score_loss, precision_score, recall_score, and f1_score.
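For example, a weighted evaluation would look like this (reusing grid, X_test, y_test, and sample_weights_test from the snippets above):

from sklearn.metrics import roc_auc_score

proba_test = grid.predict_proba(X_test)[:, 1]  # positive-class probabilities
print(roc_auc_score(y_test, proba_test, sample_weight=sample_weights_test))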

But the calibration_curve function does not allow sample_weight to be included. Should this function not, in principle, also allow such an argument?

I have one last observation. I noticed that average_precision_score does not allow for the possibility of negative event weights. The topic of negative event weights was mentioned in #3774. In my case they constitute less than 1% of events, so for such situations I just set the negative event weights to 1.0 as a temporary solution.

@stephen-hoover (Contributor) commented:
@ecampana, I see you're having a discussion about calibration_curve elsewhere. The suggestion to open a new issue for discussion is a good one. I don't think it makes sense to include sample weights in the calibration curve. The calibration curve compares events with similar predicted probabilities to the fraction of those events which are truly positive. One could make an argument for doing some kind of weighting there, but I'm not sure I would believe a "weighted" calibration curve.

@ecampana commented:
@stephen-hoover, thank you for your reply. I will open a new issue. Hopefully, when I put in a new PR, I can give an explanation that clarifies the motivation behind my request to add sample_weight as a parameter for calibration_curve.

@jnothman mentioned this pull request on Aug 16, 2017.
@amueller added the API and Needs Decision labels on Aug 5, 2019.
@CentralLT commented:
What is the status of this issue of sample weights and pipelines? @ecampana, is the sample weight request for calibration curves resolved?

@ecampana commented:
@CentralLT I believe the actual code implementing sample weights for the calibration_curve method has been complete for a few years now (i.e., all checks have passed). I think @stephen-hoover recommended a minor change, which I implemented on my end long ago but could not validate because I was running an older version of scikit-learn. I would like to pick this task up again if anyone is willing to help answer questions so I can complete the PR. I am willing to do the necessary legwork. Is there any objection, at least in principle, to this PR?

@haiatn (Contributor) commented Aug 25, 2023

As I understand it, this code is out of date, and the feature turned out to be nontrivial, as can be seen from these:
#24026
#25776
I think we can close this PR.

@adrinjalali (Member) commented:
Pipeline has already implemented metadata routing, but AdaBoost has not yet. Regardless, we can close this PR, since the solution is now very different. Thanks for your triaging efforts, @haiatn.
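A sketch of what that solution looks like with metadata routing (scikit-learn 1.4 or later, where Pipeline supports routing; the API was still marked experimental at the time of writing, and the toy data here is illustrative):

import numpy as np
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

set_config(enable_metadata_routing=True)

# Every step whose fit can consume sample_weight must state explicitly
# whether it wants it; the Pipeline then routes the weights accordingly.
pipe = Pipeline([
    ('scale', StandardScaler().set_fit_request(sample_weight=False)),
    ('clf', LogisticRegression().set_fit_request(sample_weight=True)),
])

X = np.random.rand(50, 3)
y = np.random.randint(0, 2, size=50)
w = np.random.rand(50)

pipe.fit(X, y, sample_weight=w)  # weights reach only the final classifier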
