Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update scikit-learn to 1.4 #5851

Merged
merged 11 commits into from
May 29, 2024

Conversation

betatim
Copy link
Member

@betatim betatim commented Apr 17, 2024

This is an attempt to update the scikit-learn dependency from 1.2 to 1.4. Most changes are related to constructor arguments that were deprecated in 1.2 and in 1.4 have changed/been removed.

A question I have is what cuml's deprecation policy is? I've gone with "two releases" for parameters where we can easily do so (deprecated in 24.06 and then remove them in 24.10). However that is only about 4 months of deprecation which could be a bit short.

Some of the changes would be hard to do as a deprecation (with 1.4 there is no way to provide the "old way"), we'd have to stick with 1.3 for now. I think this is a bit of a bummer but maybe the price to pay for not keeping on top of deprecations. And it seems like there is no deprecation policy in the docs/towards users? So maybe we can play this card once now, to catch up and at the same time introduce a deprecation policy.

The SHAP test needed its reference updating. I am not sure why, at least I couldn't quickly find a reason for why you'd have to do this.

I am not sure how possible it would be to support a range of scikit-learn versions (say 1.2 - 1.4). Would be cool but maybe not worth the added complexity?

Todo:

  • add deprecation warning in AgglomerativeClustering
  • add tests for deprecations
    • RF regressor
    • RF classifier
    • LARS - LARS is experimental, so no need for deprecation
    • LogisticRegression
    • OneHotEncoder
    • AgglomerativeClustering
  • think about how to combine this with Enable pytest failures on FutureWarnings/DeprecationWarnings #5799
  • decide deprecation cycle length - copy cudf, so 24.06 -> 24.08
  • update "expiry" version in the warnings
  • update doc strings

xref #5799

@betatim betatim requested a review from a team as a code owner April 17, 2024 13:38
Copy link

copy-pr-bot bot commented Apr 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Apr 17, 2024
@bdice
Copy link
Contributor

bdice commented Apr 17, 2024

cuML closely mirrors scikit-learn, just as cuDF closely mirrors pandas. This alignment led us to a one-release deprecation policy in cuDF. I think it is totally fine to favor alignment with the target library over the stability of slower deprecation/removal. I would vote to deprecate anything you can in 24.06 with removal in 24.08. For anything you can’t safely deprecate, I would consider just making breaking changes to align with the latest scikit-learn.

https://docs.rapids.ai/api/cudf/stable/developer_guide/contributing_guide/#deprecating-and-removing-code

@dantegd dantegd added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 17, 2024
@betatim
Copy link
Member Author

betatim commented Apr 19, 2024

If there is prior art within RAPIDS then it makes some amount of sense to follow that (aka copy cudf). For me the important factor in deciding is how much time (in months) do we need to/want to give users to deal with deprecations. scikit-learn is quite conservative and gives you ~1 year to deal with things (1.2 deprecates something with a warning, six months later 1.3 continues to warn, another six months later 1.4 removes the warning and old behaviour). This means you have to deal with deprecations for quite a while (as a scikit-learn developer).

How did cudf decide on the mark as deprecated and warn in 22.08, remove in 22.10 cycle? That seems much quicker than pandas. From the doc you linked it sounds like they deprecate something with a warning in v1.x and will continue to warn until they release 2.0. My assumption is that a lot more than a few months pass between v1.x and v2.0. Potentially even longer than scikit-learn's ~1 year policy.

Overall I'd vote for using the same deprecation cycle and policy for cudf and cuml (and other RAPIDS packages). What ever the pros and cons of each policy might be, they are all outweighed by having something consistent across RAPIDS IMHO.

@vyasr
Copy link
Contributor

vyasr commented Apr 20, 2024

tl;dr I think a 1-release deprecation cycle for cuml-specific behavior that is not matched by scikit-learn is fine, but if you have a deprecated code path that provides support for behavior that is deprecated in scikit-learn too, it's probably worth the effort of maintaining it as long as they do unless it starts posing a heavy undue burden.

We should differentiate between cudf's own deprecation policy and the policy cudf follows for matching pandas deprecations. If pandas deprecates something, we deprecate it as well in cudf, but we won't remove it until pandas does. That can be a very long time, e.g. DataFrame.append was deprecated in cudf for over a year. OTOH, if cudf has behavior that doesn't currently match pandas exactly and we decide to make some change in that realm, we follow the deprecation policy that Bradley linked to above. IOW, if you have to break current users of some cuml API to better align with scikit-learn and help future users migrate seamlessly from scikit-learn to cuml, that is a good tradeoff to make on a short (one-release) timescale.

While I do think we should try and align cudf and cuml (and other RAPIDS libraries) as much as possible in this regard, it is worth keeping in mind that the situations are a bit different. In general the scikit-learn API is far less expansive and has fewer sharp edges than the pandas API. Many pandas functions have very large API surfaces, both in terms of the number of parameters and the degree of flexibility in the ways that the corresponding arguments may be specified. Furthermore, there are a lot more behavioral differences that often cause headaches, usually around cases like null handling, nulls vs nans, empty columns, all-null columns, etc. We have historically had many cases where either specific combinations of parameters were intentionally doing something that we considered more sensible than the default pandas choice (possibly because we supported truly nullable data while pandas historically conflated nulls and nans), or some combination of arguments hits a data-specific edge case that we simply hadn't considered before. Especially now with cudf.pandas, more of those are things we would classify as bugs, but in the past some of these are cases that we classified as deprecating behavior in cudf to better align with pandas. I don't think you'll have nearly as much of an issue with scikit-learn, which in my experience has made very carefully designed API choices from the beginning and also has a somewhat simpler problem to deal with from an API perspective (data wrangling APIs are some of the hardest to design, whereas in ML much more of the complexity is algorithmic in nature).

@betatim
Copy link
Member Author

betatim commented Apr 22, 2024

Thinking out loud:

In vX of cuml it would depend on the latest released version of scikit-learn, say vY. The deprecations would mirror those of vY. Say some class' constructor argument is deprecated in vY, then cuml will also emit a warning. At some point in the future cuml vX+N will be released that depends on the then current version of scikit-learn which is vY+2 and the deprecations from vY will now be gone. This is because scikit-learn deprecations go for two releases.

This means work on cuml now, should attempt to match the deprecations of scikit-learn v1.5 which will probably be released in May, before cuml 24.06 comes out.

During the transition phase from the current state of cuml to this "in sync" state there will be a few breaking changes. This is because cuml currently depends on scikit-learn 1.2, so in order to switch to v1.5 we will have to make breaking changes. I think even changing to v1.4 requires a breaking change, but it is in LARS which is marked experimental.

I think matching the deprecations makes sense if you are in sync with scikit-learn. What do people think of this? I think it is probably desirable to be in sync.

A practical snag in the above, but solvable somehow, is that it is unknowable when a version of scikit-learn will be released. The target is to release roughly every six months, but this isn't a firm commitment. For example if it looks like a release would come out just before the end of December, it is likely to be moved to January. This means when we write the deprecation warning in cuml, we need to make an educated guess as to the version of cuml in which the deprecation expires.

@vyasr
Copy link
Contributor

vyasr commented Apr 23, 2024

I think that strategy makes sense. One good way that you can verify this is by running your tests with warnings as errors. cudf already does this. That forces us to catch warnings from APIs that we expect to throw them, and when we write such tests we also catch the warnings from the corresponding cudf API. That allows us to verify that we're throwing the same warning as pandas. cuml might need to do more work to get its tests running without warnings before that is viable, though.

You're right that not knowing the exact schedule is a bit tricky. At that point I think it's mostly a social question of what you want to promise. You could avoid putting a specific removal timeline in the deprecation message, you could include a best-effort estimate, or you could say something like "will be removed by X if not sooner" if you can confidently upper bound. I think that's largely up to you.

@dantegd dantegd added the 2 - In Progress Currenty a work in progress label Apr 24, 2024
@betatim
Copy link
Member Author

betatim commented May 3, 2024

I've updated the "will be removed by" version to 25.08. My best guess for when scikit-learn 1.7 will be released.

This PR updates cuml to scikit-learn 1.4, but 1.5 will probably be out in a week or two. Likely before 24.06 will be released, so maybe we merge this PR now and update cuml to scikit-learn 1.5 if it has been released in time.

@dantegd
Copy link
Member

dantegd commented May 6, 2024

@betatim merged branch-24.06 into the PR to trigger CI

@bdice
Copy link
Contributor

bdice commented May 7, 2024

/ok to test

Mostly deals with parameters that were deprecated in scikit-learn 1.2
and are now no longer available.

Update end version of deprecation

Update version in which deprecation expires

Change argument name to deal with deprecation
@betatim
Copy link
Member Author

betatim commented May 23, 2024

I started working on updating to 1.5 which was released yesterday

@betatim betatim requested a review from a team as a code owner May 24, 2024 07:07
@github-actions github-actions bot added the conda conda issue label May 24, 2024
@betatim
Copy link
Member Author

betatim commented May 24, 2024

I'll take a look at the failing metrics comparison tests on Monday if no one else has an idea. A little unclear why suddenly the metrics should be different/need less precision to get the comparison to be equal.

@dantegd
Copy link
Member

dantegd commented May 24, 2024

The distance metrics seem to only differ very slightly on ARM, I wonder if it's a change in sklearn or a dependency of it, but from a first looks seems too small to warrant worries

The arguments to r2_score were the wrong way around, this might explain
the sudden decrease in the score which leads to failing tests.
@betatim
Copy link
Member Author

betatim commented May 27, 2024

I reduced the precision for the metrics tests. Looks like they pass now, except for the two ARM jobs that failed to start (is there a way to restart/re-run them?). Also fixed a few uses of deprecated constructor arguments and the usage of r2_score in the dask tests. You need to pass "the truth" as first argument and the model's predictions as second. I think something in the computation in scikit-learn changed that lead to this making a difference now (my guess based on the fact that changing the order seems to have fixed it).

@dantegd
Copy link
Member

dantegd commented May 29, 2024

/merge

@rapids-bot rapids-bot bot merged commit 326b049 into rapidsai:branch-24.06 May 29, 2024
65 of 68 checks passed
@betatim betatim deleted the upgrade-sklearn-1.4 branch May 29, 2024 16:09
@trxcllnt trxcllnt removed their assignment May 29, 2024
@mmccarty
Copy link

@betatim You may want to update the title of this PR to say 1.5. I didn't realize you went to the latest release until reading through the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currenty a work in progress conda conda issue Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants