
Some ideas for breaking changes for 2.0 #28394

Open
amueller opened this issue Feb 9, 2024 · 9 comments

Comments

@amueller
Member

amueller commented Feb 9, 2024

There are some breaking changes that we have wanted to make for a long time but didn't get to for 1.0. I figure it might be worth starting a list somewhere for 2.0, since they keep coming up and we might forget them.

  • In OneHotEncoder, change handle_unknown to warn by default (and maybe recommend TargetEncoder or max_categories or something); see the sketch after this list.
  • Make train_test_split assume the second entry is y and stratify if the estimator is a classifier, like all other CV methods do. Maybe call the parameters X and y and allow arbitrary additional entries.
  • Make Pipeline do a clone in __init__.
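
A minimal sketch of the pain point behind the first bullet, using only the current API (handle_unknown="ignore" is today's opt-in workaround):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A category that is absent from the training split but shows up later,
# which is exactly what happens in cross-validation with rare categories.
X_train = np.array([["cat"], ["dog"], ["dog"]])
X_test = np.array([["fish"]])

# Current default handle_unknown="error" raises on the unseen category,
# so a pipeline can fail depending on how the data happened to be split.
enc = OneHotEncoder(handle_unknown="error").fit(X_train)
try:
    enc.transform(X_test)
except ValueError as exc:
    print("fails:", exc)

# Existing opt-in that users have to discover themselves today:
enc = OneHotEncoder(handle_unknown="ignore").fit(X_train)
print(enc.transform(X_test).toarray())  # all-zero row for the unseen category
```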
@lorentzenchr
Member

A few more:

  • For classifiers, make predict return probabilities and introduce decide with an explicit threshold/cutoff argument returning classes (see the sketch after this list).
  • Make it possible to produce/create a metadata object in the middle of a pipeline and pass that via, e.g., set_fit_request to later steps (sounds a lot like a DAG).
  • Introduce a new interface for linear models in order to specify (could deviate from traditional API)
    • terms (polynomial, splines, categorical)
    • interactions (between terms)
    • (term specific) penalties
  • Allow passing inhomogeneous data along a pipeline, e.g., categorical columns.
  • Align max_features of RandomForestClassifier and RandomForestRegressor, see DEP max_features="auto" for regression trees and forests #20255.
  • Align more parameters between GradientBoostingX and HistGradientBoostingX, see RFC Unify old GradientBoosting estimators and HGBT #27873.
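
A hypothetical sketch of the predict/decide idea from the first bullet, written as a thin wrapper around the current API; the class and method names are made up for illustration and nothing like this exists in scikit-learn today:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class ThresholdedClassifier:
    """Hypothetical sketch: predict returns probabilities, decide applies
    an explicit cutoff and returns classes (binary case only)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Probability of the positive class.
        return self.estimator.predict_proba(X)[:, 1]

    def decide(self, X, threshold):
        return (self.predict(X) >= threshold).astype(int)


X, y = make_classification(random_state=0)
clf = ThresholdedClassifier(LogisticRegression()).fit(X, y)
print(clf.predict(X[:3]))                # probabilities
print(clf.decide(X[:3], threshold=0.3))  # classes from an explicit cutoff
```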

@betatim
Member

betatim commented Feb 12, 2024

Because "breaking changes" (backwards incompatible change without deprecation cycle) triggers a pet peeve in me: can we approach this from a "benefit for the user" point of view? With this I mean thinking about what the benefit to the user is and then why it can't be done via deprecation cycle. IMHO a new (major) version should be about all the great (new) things people can do, not about all the things that they can't do anymore. Maybe it is only a small mindset shift or people are already thinking this way and just not writing it down like that (because it is obvious to them).

Besides making me happier, I think a positive message around changes makes it easier to "sell" those changes to users (has to be genuine though, not "shit sandwich" style ;)

For example, #27873 (which I like and want) could be framed as: use the easy-to-discover name for the estimator that is our "best option". This means more users will discover and use what is likely to be the right tool for the job.

For me that is an example of approaching it from the point of view of "what does a user gain from this". I don't know yet if we can't do this with a normal deprecation cycle, or at least I haven't given up hope that we might discover a way to do it.

@amueller
Member Author

@lorentzenchr I'd have to check on the max_features thing; the discussion was like 3 years ago, but I didn't remember that the outcome was that they should be the same for classification and regression. Consistency seems kinda counter-intuitive?

I wanted to list things that are "only" changes without the need to design new APIs; there's a bunch more of those in the 1.0 milestone but they are ... not super easy to achieve.

@betatim The things I cited were things where we evaluated long ago (and I'm happy to reevaluate that at any point) that a deprecation cycle would not be feasible. But maybe we are considering different trade-offs now.
The reason why I want the changes is that the change is from "really weird unexpected behavior" to "the thing the user expected to happen all along". I.e. this will fix bugs that you didn't know you had, because you didn't realize the weird things that scikit-learn does.

All meta-estimators clone, except Pipeline; we shied away from that since we couldn't support pre-trained feature extractors before introducing the clone protocol. Now there's no reason (in my eyes) to keep this really weird quirk of Pipeline.
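
A minimal sketch of that quirk with the current behavior: Pipeline fits the very objects you pass in, while a cloning meta-estimator such as GridSearchCV leaves them untouched:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
pipe = Pipeline([("clf", clf)]).fit(X, y)
# Pipeline did not clone: the object we passed in was fitted in place.
print(pipe.named_steps["clf"] is clf, hasattr(clf, "coef_"))  # True True

clf2 = LogisticRegression(max_iter=1000)
GridSearchCV(clf2, {"C": [0.1, 1.0]}).fit(X, y)
# GridSearchCV cloned: the original estimator is left untouched.
print(hasattr(clf2, "coef_"))  # False
```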

train_test_split is the most used function in scikit-learn according to any statistics I've ever seen, but it doesn't do the same thing as cross_validate or cross_val_score or any other cross-validation, even though train_test_split calls cross-validation under the hood. The API of train_test_split is not actually to take X and y, but to take any list of arrays; hence the somewhat strange return order as well. And for that reason, we don't know whether any of the arrays is a y, so we can't decide whether to use stratified or non-stratified cross-validation the way we do in any other data-splitting situation in scikit-learn.
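
A minimal sketch of that inconsistency with the current behavior: cross_val_score stratifies automatically for classifiers, while train_test_split only stratifies when you spell it out:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Heavily imbalanced binary target.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# cross_val_score sees a classifier and uses StratifiedKFold by default.
cross_val_score(LogisticRegression(), X, y, cv=5)

# train_test_split has no idea which array is y, so it does not stratify
# unless told to; the class balance of the test split is left to chance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(y_te.mean())  # may land far from the overall 10% positive rate

# The existing opt-in:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
print(y_te.mean())  # ~0.1 by construction
```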

Oh and OneHotEncoder. Well basically every time I give a scikit-learn tutorial I have to tell people to set that option very early on in their scikit-learn journey because... if you do cross-validation there's no reason that all categories would appear in all splits, and having a pipeline fail randomly depending on how the data is sampled is not a good default.

@betatim
Member

betatim commented Feb 13, 2024

I'm not against (or for) particular breaking changes. The thing I see happening too often for my taste is some kind of "free for all" when there is a major version change. Suddenly everything gets done via a breaking change without deprecation cycles "because it's a new version, we are allowed to do it!". This is the thing I'm against.

If we can agree that a particular change will improve the life of a user and we can't think of a way to do it with the normal deprecation cycle methods, that is a prime candidate for a change in a 2.0 release.

You don't want to end up being Perl 6.

@thomasjpfan
Member

thomasjpfan commented Feb 14, 2024

We went with deprecations over breaking changes for 1.0. I feel like it will be hard to change this philosophy for 2.0. If we must have breaking changes, I would only want <=2 of them and only if they are a major benefit to our users.

As for the items in #28394 (comment), I see the following deprecation strategies:

in OneHotEncoder change handle_unknown to warn by default

We can add the new option now and deprecate the default.

Make train_test_split assume the second entry is y

Update the signature to (X, y=None, *arrays) and add a stratify_y option, then go through a deprecation cycle for stratify_y's default.
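
A hypothetical sketch of that signature and deprecation step, wrapping the current function purely for illustration (the name stratify_y and the warning text are assumptions, not real scikit-learn API):

```python
import warnings

from sklearn import model_selection


# Hypothetical sketch of the proposed signature, not real scikit-learn code.
def train_test_split(X, y=None, *arrays, stratify_y="warn", **options):
    """Split X, y and any extra arrays, optionally stratifying on y."""
    if stratify_y == "warn":
        warnings.warn(
            "The default of stratify_y will change to True in a future "
            "release; set it explicitly to silence this warning.",
            FutureWarning,
        )
        stratify_y = False
    all_arrays = [X] + ([y] if y is not None else []) + list(arrays)
    stratify = y if stratify_y else None
    return model_selection.train_test_split(
        *all_arrays, stratify=stratify, **options
    )
```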

Make Pipeline do a clone in __init__.

Introduce a new compose.Pipeline and compose.FeatureUnion that clone, and deprecate pipeline.*.
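
A minimal sketch of a cloning pipeline as a subclass of the current Pipeline; the class name and cloning at fit time (rather than in __init__) are assumptions made to keep the example short:

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CloningPipeline(Pipeline):
    """Hypothetical sketch: fit clones of the steps so the estimator
    objects passed in by the user are never mutated."""

    def fit(self, X, y=None, **params):
        self.steps = [(name, clone(est)) for name, est in self.steps]
        return super().fit(X, y, **params)


X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
CloningPipeline([("scale", StandardScaler()), ("clf", clf)]).fit(X, y)
print(hasattr(clf, "coef_"))  # False: the user's object was left untouched
```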

Most of the suggestions in #28394 (comment) are new API, so there should be no breaking changes.

@amueller
Member Author

@thomasjpfan That was actually the original reason to have it be compose.ColumnTransformer. If we can make it happen then even better. But the pipeline move might still be a good 2.0 candidate even if it's a deprecation.

Btw, another one just came up: the score method, which is bad, and its use by default in GridSearchCV, which is also bad.

@betatim
Member

betatim commented Mar 11, 2024

An idea I had just now regarding making life easier for people transitioning from 1.x to 2.0 in the situation where there are breaking changes: we could investigate making a small tool like 2to3 to transform users' code (or point out locations that need attention). I'm not sure if it is always possible to have a tool make a transformation, nor if it is less work to create such a tool vs dealing with deprecations. However, I thought it was a neat idea.

(I got the idea while thinking about HistGradientBoostingClassifier and how to transfer from the current exact gradient boosting estimators to the binned ones.)
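
A minimal sketch of the "point out locations that need attention" flavour using the standard-library ast module; the specific check (flagging train_test_split calls without an explicit stratify=) is just an illustrative stand-in for whatever 2.0 migrations would need:

```python
import ast

SOURCE = """
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
"""


class SplitChecker(ast.NodeVisitor):
    """Flag train_test_split calls that do not pass stratify= explicitly."""

    def visit_Call(self, node):
        name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
        if name == "train_test_split":
            if not any(kw.arg == "stratify" for kw in node.keywords):
                print(f"line {node.lineno}: train_test_split without explicit stratify=")
        self.generic_visit(node)


SplitChecker().visit(ast.parse(SOURCE))
```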

@lorentzenchr
Member

lorentzenchr commented Mar 11, 2024

Another idea:

@lcrmorin

lcrmorin commented May 4, 2024

Nice idea to have a list like that :) Here are the things that I would change:

  • Predict API. This has been mentioned. I would do something similar: use 'predict' or 'predict_score' for raw outputs, 'decide' or 'predict_binary' for binary decisions (after threshold selection or tuning), and 'predict_proba' after probability calibration. On top of that I would make the threshold for binarisation a mandatory parameter, or at least issue a warning when it is not set explicitly.

  • More generally there could be a 'Data Science Warning' meaning 'it is possible to do this but this is probably wrong'. It would trigger on predict when the threshold is set to the default / untuned, on predict_proba when no probability calibration step has been performed, or more generally when there is something wrong in a pipeline. I suspect this warning could already be implemented (if threshold == 0.5 then throw 'Data Science Warning: threshold is set to the default value, which is likely not optimal. You can refer to the documentation (link) to tune the decision threshold'); see the sketch after this list.

  • Better handling of data: categoricals, types (if I reduce my float types, does it reduce memory usage and run time in the same way?), infinite values (which should not break tree models), and sentinel values (those hardcoded '99999').
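
A minimal sketch of the 'Data Science Warning' idea; the warning class and the helper are made up for illustration, nothing like this exists in scikit-learn:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class DataScienceWarning(UserWarning):
    """Hypothetical: 'this runs, but it is probably not what you want'."""


def predict_with_threshold(clf, X, threshold=0.5):
    # Hypothetical helper: warn when the cutoff is left at the default.
    if threshold == 0.5:
        warnings.warn(
            "threshold is set to the default 0.5, which is likely not optimal; "
            "consider tuning the decision threshold (see the docs).",
            DataScienceWarning,
        )
    return (clf.predict_proba(X)[:, 1] >= threshold).astype(int)


X, y = make_classification(random_state=0)
clf = LogisticRegression().fit(X, y)
predict_with_threshold(clf, X)  # emits DataScienceWarning
```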
