
Make Pipeline compatible with AdaBoost #2630

Closed. jcrudy wants to merge 10 commits.

Conversation

@jcrudy commented Dec 2, 2013

This pull request makes the following changes to the Pipeline class:

  1. Allow passing arguments to the methods of steps, not just to their constructors. This is necessary so that AdaBoost can pass the sample_weight argument.
  2. If an argument is not of the form step__argname, it is passed to all steps. Again, this is necessary so that AdaBoost can pass the sample_weight argument.
  3. Forward attribute access to the steps when the attribute is not found on the Pipeline object itself. Steps are searched backwards, starting with the final step. This allows AdaBoostClassifier to access the classes_ attribute of the final classifier in the Pipeline.

My objective was to make changes that would generalize well to other potential uses of the Pipeline. For example, it is now possible to pass an argument to the fit method of a particular step by passing a parameter named step__argname instead of just argname. It is also possible to access step attributes other than the classes_ attribute of the final step. Neither of these features is strictly necessary for AdaBoost compatibility, but they seemed like reasonable generalizations of the necessary functionality. I think the sklearn devs should pay particular attention to change 3 above and make sure they're comfortable with it. There are other possible generalizations that might be better, such as only allowing access to the attributes of the final step, or even only forwarding the classes_ attribute.

A gist showing the intended usage of this enhancement is located here:

https://gist.github.com/jcrudy/7756798

The gist uses py-earth because I was not familiar with a scikit-learn transformer that takes sample_weight (I don't know them all, though). If the EarthRegressor (#2285) pull request is merged, there will be at least one such transformer.
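A condensed sketch of the intended usage (stand-in data and step names; the bare sample_weight broadcast of change 2 and the attribute fall-through of change 3 exist only on this branch, not in released scikit-learn):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
w = np.random.rand(100)

pipe = Pipeline([('clf', DecisionTreeClassifier())])

pipe.fit(X, y, clf__sample_weight=w)   # change 1: target one step's fit
pipe.fit(X, y, sample_weight=w)        # change 2: broadcast to every step
print(pipe.classes_)                   # change 3: forwarded from the last step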

@jnothman (Member) commented Dec 2, 2013

Thanks for the PR! This relates to some issues that have been under discussion for some time. I'll review your implementation and then try to point some of them out.

@@ -52,3 +52,5 @@ benchmarks/bench_covertype_data/
*.prefs
.pydevproject
.idea

A maintainer commented on the diff:
I'm not sure you meant to commit this file. At least, please remove this blank line.

@GaelVaroquaux (Member) commented:
Overriding __getattribute__ (or __getattr__) is really not something that should be done. In general, I am pretty much opposed to overriding any __foo__ method: thou shalt not change how objects behave.

The delegation mechanism to be used in scikit-learn is the get_params/set_params mechanism, but anyhow, it is only for model parameters.

        except AttributeError:
            pass
        raise AttributeError("Neither the '%s' object nor any of its steps has the attribute '%s'"
                             % (self.__class__.__name__, name))
A maintainer commented on the diff:
That's really doing way too much magic. You have here a behavior that is impossible to define in a precise way, and thus a potential source of bugs. It is something to avoid.

@jcrudy (Author) replied:
Maybe it would be better just to have a classes_ property that explicitly gets the classes_ from the last step? Like this (untested):

@property
def classes_(self):
    return self.steps[-1][1].classes_

@jnothman (Member) commented Dec 2, 2013

As you've discovered, the issue with sample_weight in Pipeline is that most transformers don't accept -- let alone use -- it. So we either need to support sample_weight more universally, or Pipeline needs to inspect method signatures or similar to work out where it can pass which parameters (and then **kwargs approaches are even more problematic). And again, this implicit approach creates version compatibility issues: I have a Pipeline working one way, then someone includes sample_weight support in a transformer, and my Pipeline's behaviour changes.

Maybe your problem is more readily solved by modifying AdaBoost instead of Pipeline, even if that design seems awkward.
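To make the hazard concrete, a minimal sketch of that kind of signature inspection (fit_accepts_sample_weight is a hypothetical helper, not scikit-learn API):

import inspect
from sklearn.preprocessing import StandardScaler

def fit_accepts_sample_weight(estimator):
    # True if the estimator's fit signature declares a sample_weight parameter.
    return 'sample_weight' in inspect.signature(estimator.fit).parameters

# True on recent versions (StandardScaler gained sample_weight in 0.24),
# False on older ones -- so the same Pipeline would silently change behaviour.
print(fit_accepts_sample_weight(StandardScaler()))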

@@ -141,64 +168,71 @@ def fit_transform(self, X, y=None, **fit_params):
         else:
             return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

-    def predict(self, X):
+    def predict(self, X, **params):
A maintainer commented on the diff:
In which situation do we need parameters on predict?

@jcrudy (Author) replied:
Good point. It is not needed for AdaBoost. The reason it's here is that I sometimes use the scikit-learn Pipeline as a sort of glue to combine different kinds of models from other packages. For example, I have some wrappers that I use with the statsmodels GLM models to make them work in the Pipeline. The GLM models can take offset or exposure parameters when making predictions. So it's not really needed for scikit-learn by itself, but it has been a useful extension for me when combining scikit-learn with other packages. It may be out of scope.

@coveralls commented:
Coverage Status

Coverage remained the same when pulling 451df83 on jcrudy:pipeline into 06a1eaf on scikit-learn:master.

@jcrudy (Author) commented Dec 3, 2013

@jnothman: "And again, this implicit approach creates version compatibility issues: I have a Pipeline working one way, then someone includes sample_weight support in a transformer, and my Pipeline's behaviour changes."

In this case, wouldn't your code be non-functional before sample_weight is implemented for your transformer? As coded here, sample_weight goes to all steps if it's provided, and if any step does not accept sample_weight, an error is raised. I did consider the alternative, having sample_weight go only to those steps that can accept it, but that approach seemed rather more problematic (as you point out).

Perhaps a better solution is to explicitly set up which arguments get sent to which steps when the Pipeline is created. Something like this:

t1 = SomeTransformer()
t2 = SomeOtherTransformer()
c = SomeClassifier()
p = Pipeline([('t1', t1), ('t2', t2), ('c', c)], [('sample_weight', ('t1', 'c'))])

where I've assumed that t1 and c get sample_weight and t2 does not.

Another use case, which I didn't properly explain initially but mentioned in my response to @GaelVaroquaux above, is using scikit-learn as a sort of glue to connect other packages. For example, I currently use this modified Pipeline to connect py-earth models to GLMs from statsmodels, using a simple adapter around GLM to make it conform to the scikit-learn api. The adapted fit and predict methods of the GLMs can take offset or exposure arguments, which these changes to Pipeline also allow me to handle easily. I don't know, though, how many other users are interested in doing things like this.

@jnothman (Member) commented Dec 3, 2013

p = Pipeline([('t1', t1), ('t2', t2), ('c', c)], [('sample_weight', ('t1', 'c'))])

I can see the merit in this approach, but naming that parameter will be hard :) I can also only assume it would be adhered to whenever a sample_weight kwarg is passed in, whether to score, fit or transform. It's not very pretty either (though it should be a dict, not a list of tuples).

One could also take a similarly explicit approach to pass-through attributes: attribute_mapping={'classes_': 'clf.classes_'}.
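For concreteness, a hypothetical spelling of both suggestions as dicts (fit_param_routing and attribute_mapping are illustrative names only; this API was never adopted):

p = Pipeline([('t1', t1), ('t2', t2), ('c', c)],
             fit_param_routing={'sample_weight': ['t1', 'c']},
             attribute_mapping={'classes_': 'c.classes_'})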

@GaelVaroquaux, do you think we should be compiling a document of API design issues for pipelines and other meta-estimators?

@GaelVaroquaux (Member) commented:
@jnothman: "@GaelVaroquaux, do you think we should be compiling a document of API design issues for pipelines and other meta-estimators?"

Yes. Please ping @ogrisel about this. We have had a lot of discussions about API issues, and I think that the sample_weight problem that you are having is related to some of our discussions.

Thanks a lot!

@jcrudy (Author) commented Dec 5, 2013

@hshteingart commented:
I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!

@ecampana commented Mar 9, 2017

@hshteingart: "I don't understand what the status is. Does Pipeline support sample_weight or not, and if it does, what is the syntax? Thanks!"

I do not know if you ever received a reply to your question. I have recently begun using Pipelines and found the following to work using sklearn version 0.19.dev0. For example:

  1. Classifier
model = GradientBoostingClassifier()

model.fit(X_train, y_train, **{'sample_weight': sample_weights_train})

model.predict(X_test)
  2. Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

pipe.fit(X_train, y_train, **{'classifier__sample_weight': sample_weights_train})

pipe.predict(X_test)
  3. make_pipeline
model = GradientBoostingClassifier()
pipe = make_pipeline(model)

pipe.fit(X_train, y_train, 
         **{'gradientboostingclassifier__sample_weight': sample_weights_train})

pipe.predict(X_test)
  4. GridSearchCV with Classifier
model = GradientBoostingClassifier()

param_grid = [{'learning_rate': [0.01]}]

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)
  5. GridSearchCV with Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'classifier__sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)
  6. GridSearchCV with make_pipeline
model = GradientBoostingClassifier()
pipe = make_pipeline(model)

param_grid = [{'gradientboostingclassifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy",
                    fit_params={'gradientboostingclassifier__sample_weight': sample_weights_train})

grid.fit(X_train, y_train)

grid.predict(X_test)

NB: Cases 5 and 6 are pending API syntax verification. From what I have read, it does not seem that GridSearchCV passes fit_params to the fit method of Pipeline or make_pipeline objects. I have so far not found a workaround. Additionally, the classifiers do not seem to allow the test sample weights to be propagated to the prediction method. This appears to be related to #4497.

@amueller, is this a correct assessment of GridSearchCV using Pipeline and fit_params? Thank you for any information that you may be able to provide.

UPDATE 1: The syntax for cases 5 and 6 has been verified to work.

UPDATE 2: I was incorrect to believe that the predict, predict_proba, and decision_function methods needed a sample_weight argument for the use case of physics event weights.

@stephen-hoover (Contributor) commented Mar 11, 2017

@ecampana, I just tried your examples 5 and 6 on the master branch of scikit-learn, and they both work. However, you should pass fit parameters from the grid searcher via the fit method rather than in the constructor, e.g.

grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)

and

grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)

PR #8278 added the ability for grid searchers to pass through fit parameters to the estimators they wrap.

@ecampana commented Mar 16, 2017

@stephen-hoover, I have verified that cases 5 and 6 of my original post work as you said they would. Thanks for confirming my syntax and for clarifying that GridSearchCV and Pipeline work together as a user would intend. However, I did try to pass sample_weights_train to the fit method of the GridSearchCV object instead of its constructor, as you had recommended, but unfortunately I received the following error message when I attempted it.

TypeError: fit() got an unexpected keyword argument 'classifier__sample_weight'

I used the following Python code:

  5. GridSearchCV with Pipeline
model = GradientBoostingClassifier()
pipe = Pipeline([('classifier', model)])

param_grid = [{'classifier': [model], 'classifier__learning_rate': [0.01]}]

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring="accuracy")

# None of the below variations on the syntax work
#grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, gradientboostingclassifier__sample_weight=sample_weights_train)
#grid.fit(X_train, y_train, sample_weight=sample_weights_train)

grid.predict(X_test)

I am using sklearn version 0.19.dev0, and maybe this is the reason your suggested syntax is not working. Would you happen to know what I may be doing incorrectly?

My last question is about the predict method. Currently it does not take sample_weight as an argument. Do you happen to know if there is any interest in adding this feature to scikit-learn, or if there is a development timeline for it? Is there a recommended workaround for this use case, or have people in general implemented their own version of the function for the time being?

UPDATE 1: I was incorrect to believe that the predict, predict_proba, and decision_function methods needed a sample_weight argument for the use case of physics event weights.

@stephen-hoover (Contributor) commented:
@ecampana, if I run the code you've provided, grid.fit(X_train, y_train, classifier__sample_weight=sample_weights_train) works. You're probably using a version of scikit-learn which doesn't include the code necessary to run it. I believe "0.19.dev0" means "between 0.18.1 and 0.19.0" and doesn't specify an exact state of the code. Update to the latest master and it should work.

What would sample weights on predict do? Predictions happen one sample at a time, so I don't know how including sample weights would change the output.

@ecampana commented Mar 22, 2017

@stephen-hoover, when I have a chance I will update to a newer version of scikit-learn as you suggested, but I am confident that the syntax you recommend works. Thank you for pointing out an alternative solution.

By the way, you made me realize what my confusion was, and you are correct that predictions happen on a per-sample basis. The sample weights I am using are physics event weights, so I would in principle apply them when constructing an event histogram of the model's predicted probabilities. For example, sample_weights_test would be applied as follows:

plt.hist(grid.predict_proba(X_test)[:, 1],  # positive-class probabilities
         weights=sample_weights_test, bins=bins,
         histtype='stepfilled', normed=False)

The following methods/functions all accept sample_weight and so cover applying the sample_weights_test values: model.score, cross_val_score, classification_report, roc_auc_score, roc_curve, average_precision_score, precision_recall_curve, brier_score_loss, precision_score, recall_score, and f1_score.
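For example, a weighted evaluation would look like this (reusing grid, X_test, y_test, and sample_weights_test from the snippets above):

from sklearn.metrics import roc_auc_score

proba_test = grid.predict_proba(X_test)[:, 1]  # positive-class probabilities
print(roc_auc_score(y_test, proba_test, sample_weight=sample_weights_test))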

But the calibration_curve function does not allow sample_weight to be included. Should this function not, in principle, also allow such an argument?

I have one last observation. I noticed that average_precision_score does not allow for the possibility of negative event weights. The topic of negative event weights was mentioned in #3774. In my case they constitute less than 1% of events, so for such situations I just set the negative event weights to 1.0 as a temporary solution.

@stephen-hoover (Contributor) commented:
@ecampana, I see you're having a discussion about calibration_curve elsewhere. The suggestion to open a new issue for discussion is a good one. I don't think it makes sense to include sample weights in the calibration curve. The calibration curve compares events with similar predicted probabilities to the fraction of those events which are truly positive. One could make an argument for doing some kind of weighting there, but I'm not sure I would believe a "weighted" calibration curve.

@ecampana commented:
@stephen-hoover, thank you for your reply. I will open a new issue. Hopefully, when I put in a new PR, I can give an explanation that clarifies the motivation behind my request to add sample_weight as a parameter for calibration_curve.

@jnothman mentioned this pull request on Aug 16, 2017.
@amueller added the API and Needs Decision labels on Aug 5, 2019.
@CentralLT commented:
What is the status of this issue of sample weights and pipelines? @ecampana, is the sample weight request for calibration curves resolved?

@ecampana commented:
@CentralLT I believe the actual code implementing sample weights for the calibration_curve method has been complete for a few years now (i.e., all checks have passed). I think @stephen-hoover recommended a minor change, which I implemented on my end long ago but could not validate because I was running an older version of scikit-learn. I would like to pick this task up again if anyone is willing to help answer questions so I can complete the PR. I am willing to do the necessary legwork. Is there any objection, at least in principle, to this PR?

@haiatn (Contributor) commented Aug 25, 2023

As I understand it, this code is out of date, and the feature turned out to be nontrivial, as can be seen from these:
#24026
#25776
I think we can close this PR.

@adrinjalali (Member) commented:
Pipeline has already implemented metadata routing, but AdaBoost has not yet. Regardless, we can close this PR, since the solution is now very different. Thanks for your triaging efforts, @haiatn.
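A sketch of what that solution looks like with metadata routing (scikit-learn 1.4 or later, where Pipeline supports routing; the API was still marked experimental at the time of writing, and the toy data here is illustrative):

import numpy as np
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

set_config(enable_metadata_routing=True)

# Every step whose fit can consume sample_weight must state explicitly
# whether it wants it; the Pipeline then routes the weights accordingly.
pipe = Pipeline([
    ('scale', StandardScaler().set_fit_request(sample_weight=False)),
    ('clf', LogisticRegression().set_fit_request(sample_weight=True)),
])

X = np.random.rand(50, 3)
y = np.random.randint(0, 2, size=50)
w = np.random.rand(50)

pipe.fit(X, y, sample_weight=w)  # weights reach only the final classifier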
