RFC Support for Some Developer Utilities #15801

Open
thomasjpfan opened this issue Dec 5, 2019 · 14 comments
Labels
API Hard Hard level of difficulty module:utils Needs Decision Requires decision RFC

Comments

@thomasjpfan
Member

On the Utilities for Developers page, it states:

Warning: These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

If we want to provide utilities to support third-party estimators, we should treat some of these utilities as "first class" citizens.

For example, safe_indexing would be extremely useful for third parties that want to support DataFrames as input. Currently, the options for third-party developers are to build their own "safe_indexing" or to depend on our private version, which may not be stable.
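
To make the need concrete, here is a minimal sketch of the kind of helper a third party ends up writing. The name `index_rows` and its duck-typing rules are assumptions for illustration, not scikit-learn's implementation; it only covers row selection by integer position.

```python
def index_rows(X, indices):
    """Select rows of X by integer position (hypothetical helper).

    Duck-typed so it needs neither pandas nor NumPy installed:
    - objects exposing `.iloc` (e.g. pandas DataFrames) use positional indexing,
    - objects exposing `.shape` (e.g. NumPy arrays) use integer-array indexing,
    - anything else is treated as a plain sequence.
    """
    indices = list(indices)
    if hasattr(X, "iloc"):
        return X.iloc[indices]
    if hasattr(X, "shape"):
        return X[indices]
    return [X[i] for i in indices]


# Plain nested lists work without any third-party dependency.
rows = [[1, 2], [3, 4], [5, 6]]
selected = index_rows(rows, [0, 2])
```

Every project that writes its own variant of this has to track container types on its own, which is the maintenance cost a supported utility would remove.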

Another example is scikit-learn/enhancement_proposals#22, which defines an n_features_in_ contract that we will satisfy internally through private methods. Third-party estimators would need to build their own methods or functions to comply with the SLEP.
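
As a rough illustration of what such a contract asks of an estimator, the sketch below records the number of features seen in `fit` and rejects mismatched input later. The mixin, its method name, and the list-of-lists input are assumptions made for this example; the SLEP itself remains the authority on the actual contract.

```python
class NFeaturesMixin:
    """Hypothetical mixin that records and validates the input width."""

    def _check_n_features(self, X, reset):
        n_features = len(X[0])
        if reset:
            # Called from fit: remember how many features were seen.
            self.n_features_in_ = n_features
        elif n_features != self.n_features_in_:
            # Called from predict/transform: input must match what fit saw.
            raise ValueError(
                f"X has {n_features} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )


class MeanPredictor(NFeaturesMixin):
    """Toy estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self._check_n_features(X, reset=True)
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        self._check_n_features(X, reset=False)
        return [self.mean_ for _ in X]


model = MeanPredictor().fit([[0.0, 1.0], [2.0, 3.0]], [1.0, 3.0])
```

Without a supported helper, each third-party project would carry its own copy of logic like `_check_n_features`.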

TLDR: Now that many of the utilities are "private", we can make deliberate decisions about which utilities should be public and supported by us. Supporting them would mean deprecation cycles, etc. If we support some of the utils module, it will be easier to build estimators, which will enrich the ecosystem of scikit-learn compatible estimators.

CC @scikit-learn/core-devs

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 5, 2019 via email

@NicolasHug
Member

Warning: These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

In practice, that's not true. We've been treating all these utilities as public, and we've always deprecated things smoothly under the two-version rule. Case in point: safe_indexing, which we just deprecated in 0.22.

It seems that the questions are:

  • Do we want a developer API? If yes, what are the rules regarding deprecations?
  • What utilities become part of the developer API (if any), which ones go private, and which ones stay public?

I'm happy to consider a proposal :)

@jnothman
Member

jnothman commented Dec 6, 2019 via email

@glemaitre
Member

a developer API more clearly

I'm +1 on that. We should then guarantee a deprecation cycle for breaking changes, but we could treat it as we do experimental features to allow fast adoption (and we can use DeprecationWarning to warn the end-user :)).

I think that we can come up with a list of utilities. Another way would be to ask third-party developers which tools they are using.

@qinhanmin2014
Member

In practice, that's not true. We've been treating all these utilities as public, and we've always deprecated things smoothly under the two-version rule. Case in point: safe_indexing, which we just deprecated in 0.22.

I guess not, @NicolasHug: see #14545, where we changed the behaviour of check_is_fitted without a deprecation cycle (you could say that there was a deprecation cycle, but I'd argue that since we no longer support the old behaviour, the deprecation cycle there is meaningless).

Perhaps we should introduce a deprecation cycle if we want to change public functions in sklearn.utils.

@thomasjpfan
Member Author

With 1.0, let's see if we can revisit this topic. From the comment at #20695:

I am not sure if we should either make it public or available to developers.
I think that I would prefer to define a developer API with utilities that should be used by people developing scikit-learn estimators.

My current definition of public API is "always deprecate with 2 cycles before changing behavior". We have mostly been treating items in sklearn.utils as public API under that definition.
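
A minimal sketch of what that definition implies mechanically: the old name keeps working for two releases while warning callers about the planned removal. The decorator `deprecated_in`, the function `old_utility`, and the version strings are all hypothetical and only illustrate the pattern.

```python
import functools
import warnings


def deprecated_in(version, removed_in):
    """Mark a function as deprecated (hypothetical decorator)."""

    def decorator(func):
        message = (
            f"{func.__name__} is deprecated in version {version} "
            f"and will be removed in version {removed_in}."
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(message, category=FutureWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Illustrative versions: deprecated in one release, removed two releases later.
@deprecated_in("1.0", removed_in="1.2")
def old_utility(x):
    return x * 2
```

During the two intermediate releases a third-party package can call `old_utility` unchanged and use the warning as its signal to migrate.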

Before we decide which utilities are developer API, I think we need to decide what the API contract is for items listed as "developer API". For me it comes down to, "how much work does a third-party estimator developer have to do to support the latest version of sklearn". I am happy with following the 2 cycle deprecation rule as we do with "public API". If we want to add features to a developer utility, we use our standard practices for ensuring backward compatibility.

@adrinjalali
Member

Another related discussion is here: #20657 : how do we signal something is developer API?

@glemaitre
Member

how much work does a third-party estimator developer have to do to support the latest version of sklearn

Thinking about what we encounter with imbalanced-learn, the issue is not supporting the latest version of scikit-learn but staying compatible with older versions. We chose to support only the latest version of scikit-learn because that makes changes easy. The pain in supporting older versions is that we need to backport old, non-backwards-compatible functionality. I'm not saying that we made a good choice, but it is the least painful one on our side given the available workforce.

I assume that the impact of 1 cycle vs. 2 cycles is that 1 cycle allows third parties to support only the latest scikit-learn release, while 2 cycles would allow them to support the 2 latest versions easily.
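
To show where the backporting pain comes from, here is a sketch of the compatibility shim a third-party package typically needs when a utility moves or is renamed between releases. The helper `import_first_available` is hypothetical, and the example uses standard-library names as stand-ins so that it runs without scikit-learn installed.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that can be imported (hypothetical helper).

    `candidates` is an iterable of (module_name, attribute_name) pairs,
    ordered from the newest location to the oldest.
    """
    candidates = list(candidates)
    for module_name, attribute_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute_name)
        except (ImportError, AttributeError):
            # This location does not exist in the installed version; try the next.
            continue
    raise ImportError(f"none of {candidates!r} could be imported")


# Stand-ins: the first location does not exist, so the second one is used.
sqrt = import_first_available([("math", "square_root"), ("math", "sqrt")])
```

With a longer deprecation window, both the old and the new location exist in adjacent releases, so a shim like this is needed less often.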

@glemaitre
Member

how do we signal something is developer API?

something documented publicly but starting with an underscore :)

@WenjieZ

WenjieZ commented Aug 11, 2021

Let me weigh in here. The point of a Developer API is to tell the developers that these functionalities are quite stable and will not undergo changes without notification and consulting the public's opinion. The developers want to know what they can depend on to write their programs, just like foreign entrepreneurs want to know the local law and legislation before investing.

@glemaitre
Member

the point of a Developer API is to tell the developers that these functionalities are quite stable and will not undergo changes without notification

I think that we agree on this. What would be interesting to know is whether we can shorten the deprecation period to 1 cycle or should be a bit more conservative and use 2 cycles, as in the public API.

@adrinjalali
Member

something documented publicly but starting with an underscore :)

What I'm trying to say is that documentation is great, but something else would be nice as well. As in, it'd be nice for the user to be able to inspect a class and tell which things are private, as opposed to those that are part of the developer API. That's why I'm very happy with the dunder notation in #20657.
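
What "inspecting a class" could look like in practice is sketched below. The `__mylib_is_fitted__` hook name, the `HOOK_PREFIX` constant, and both helper functions are hypothetical, chosen only to illustrate the idea; they are not taken from #20657.

```python
HOOK_PREFIX = "__mylib_"  # hypothetical naming scheme for developer-API hooks


def developer_hooks(cls):
    """Names on `cls` that follow the dunder developer-API convention."""
    return sorted(
        name
        for name in dir(cls)
        if name.startswith(HOOK_PREFIX) and name.endswith("__")
    )


def private_names(cls):
    """Single-underscore names defined directly on `cls` (no stability promise)."""
    return sorted(
        name
        for name in vars(cls)
        if name.startswith("_") and not name.startswith("__")
    )


class ExampleEstimator:
    def fit(self, X, y=None):
        self._fitted = True
        return self

    def _validate(self, X):  # private helper
        return X

    def __mylib_is_fitted__(self):  # hypothetical developer-API hook
        return getattr(self, "_fitted", False)


hooks = developer_hooks(ExampleEstimator)
private = private_names(ExampleEstimator)
```

The point of the dunder form is that the two groups can be told apart from the names alone, without consulting the documentation.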

@lorentzenchr changed the title from "Support for Some Developer Utilities" to "RFC Support for Some Developer Utilities" on Mar 26, 2023
@lorentzenchr
Member

If we agree on the usual 2 release cycle for a developer API, then there is hardly any difference from the user API except where to find it.
We could collect the developer API in a (sub)module though:

  • If an existing one, then sklearn.utils.
  • If a new one, then maybe sklearn.dev.

It would be nice to resolve this issue somehow. Maybe use the next dev meeting?

@glemaitre
Member

I like the idea of having a new module, which might make it more straightforward to create documentation.

Maybe use the next dev meeting?

It would be nice. I'll support any way forward there (I will be travelling during the dev meeting).

9 participants