Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Definition of public API #12927

Closed
rth opened this issue Jan 5, 2019 · 21 comments
Closed

RFC Definition of public API #12927

rth opened this issue Jan 5, 2019 · 21 comments

Comments

@rth
Copy link
Member

@rth rth commented Jan 5, 2019

In several PRs (e.g. #11182 #12916) the question arises whether we need to deprecate some object before it can be removed or changed. This goes back to defining what is public API in scikit-learn.

The lest controversial definition is that,

  • import paths with that include a module with a leading _ are private
  • other modules are public.

However, you could do, for instance,

from sklearn.cluster.dbscan_ import NearestNeighbors

Does it mean that we are supposed to preserve backward compatibility on sklearn.cluster.dbscan_.NearestNeighbors in terms of import path? How about sklearn.preprocessing.data.sparse (scipy.sparse)?

I guess not, meaning that just because we have an import path without an underscore does not mean that it is part of the public API. At the very least it also needs to be documented or used in examples.

If we take this definition,

This would mean that we can e.g. remove sklearn.externals.six in #12916 without a deprecation warning (but possibly with a what's new entry). I have a hard time seeing a user reasonably complaining that we didn't go through a deprecation cycle there.

This would also help resolving the "public vs private utils" discussion in #6616

WDYT, do you have other ideas of how we should define what is public API in scikit-learn?

cc @scikit-learn/core-devs

@adrinjalali

This comment has been minimized.

Copy link
Member

@adrinjalali adrinjalali commented Jan 5, 2019

I'd rather see a definition which is reflected in the code and import paths, for it to be clear from the code that people write, and not only some documentation.

I'd even throw the wild idea of having v0.21 as the last in 0.xx series and have it as an LTS release, and simply add an _ to everything which is not supposed to be public in v1.0 (and work on other major API changes waiting for the v1.0 release).

If not, we can try the hack you mentioned in #11182 (comment), doesn't sound too bad to me.

@rth

This comment has been minimized.

Copy link
Member Author

@rth rth commented Jan 5, 2019

Thanks for the feedback!

I'd rather see a definition which is reflected in the code and import paths, for it to be clear from the code that people write, and not only some documentation.

How would that handle the case of sklearn.cluster.dbscan_.NearestNeighbors or sklearn.preprocessing.data.sparse then?

I'd even throw the wild idea of having v0.21 as the last in 0.xx series and have it as an LTS release, and simply add an _ to everything which is not supposed to be public in v1.0

yeah, v1.0 is a possibility, but ideally, we would need some way to handle the deprecation/removal of sklearn.externals.{six,joblib} and some unused sklearn.utils.fixes (#12928) already in v0.21.

If not, we can try the hack you mentioned in #11182 (comment), doesn't sound too bad to me.

The most general way of moving imports including nested subdirectories I'm aware of is with import hooks. That's definitely not trivial and error-prone (e.g. for pickling, or some systems that implement their own hooks such as pyinstaller) and I don't think that it would be justified in this case.

@adrinjalali

This comment has been minimized.

Copy link
Member

@adrinjalali adrinjalali commented Jan 6, 2019

How would that handle the case of sklearn.cluster.dbscan_.NearestNeighbors or sklearn.preprocessing.data.sparse then?

My proposal is that we make all files to follow _blahblah.py style, which would mean to rename dbscan_.py to _dbscan.py and data.py to _data.py.

This means the imports would look like:

sklearn.cluster._dbscan.NearestNeighbors
sklearn.preprocessing._data.sparse

And I'd say both are private because a part of the import path has an _ at the beginning, and that's clear in the user's code.

yeah, v1.0 is a possibility, but ideally, we would need some way to handle the deprecation/removal of sklearn.externals.{six,joblib} and some unused sklearn.utils.fixes (#12928) already in v0.21.

If v0.21 is going to be the last in the 0.xx series, I guess having a way to handle removal/deprecation of code from externals becomes less critical, i.e. worst case scenario we keep them there, but they're removed/moved to _externals right after for v1.0.

@glemaitre

This comment has been minimized.

Copy link
Contributor

@glemaitre glemaitre commented Jan 6, 2019

My proposal is that we make all files to follow blahblah.py style, which would mean to rename dbscan.py to _dbscan.py and data.py to _data.py

I am +1 for this. We are not really consistent up to now but changes all the naming will be costly for all open PRs. The proposal to make it happen at 1.0 might be a way.

This would mean that we can e.g. remove sklearn.externals.six in #12916 without a deprecation warning (but possibly with a what's new entry). I have a hard time seeing a user reasonably complaining that we didn't go through a deprecation cycle there.

I agree with this point and I think that it could be extended for sklearn.utils.fixes as well. joblib might be a bit more tricky.

@rth

This comment has been minimized.

Copy link
Member Author

@rth rth commented Jan 6, 2019

Actually, PEP8 make the distinction between public / private API using the documentation only: https://www.python.org/dev/peps/pep-0008/#public-and-internal-interfaces

Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.

Regarding,

My proposal is that we make all files to follow blahblah.py style, which would mean to rename dbscan.py to _dbscan.py and data.py to _data.py

This would break a lot of existing code though, which can make users unhappy even in a major 1.0 release.

@GaelVaroquaux

This comment has been minimized.

Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jan 7, 2019

jnothman added a commit that referenced this issue Jan 8, 2019
This continues the work done in #12639 on dropping the python 2 support by,
 - ~~removing unnecessary `from __future__` imports~~
 - removing unused `sklearn.utils.fixes` assuming we can agree in #12927 that `sklearn.utils.fixes` are private as was stated e.g. in #6616 (comment)
@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented Jan 8, 2019

@rth

This comment has been minimized.

Copy link
Member Author

@rth rth commented Jan 16, 2019

On this subject https://github.com/glyph/publication is an intersting read (even if I'm not sure I agree with the proposed final solution). In particular, see the section about disadvanges of using underscores everwhere.

@amueller amueller changed the title RFC Defintion of public API RFC Definition of public API Mar 4, 2019
@amueller

This comment has been minimized.

Copy link
Member

@amueller amueller commented Mar 8, 2019

My proposal is that we make all files to follow blahblah.py style, which would mean to rename dbscan.py to _dbscan.py and data.py to _data.py

You mean my proposal ;P #9250

@amueller

This comment has been minimized.

Copy link
Member

@amueller amueller commented Mar 8, 2019

One of the reasons I proposed the underscore is that it simplifies tab-completion, but I think the definition of __all__ might fix that in many IDEs?

If we do that and have otherwise empty __init__ files, we won't actually have any public module which contains numpy any more potentially. Not sure if that is something to strive for, though.

@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented Mar 10, 2019

xhlulu added a commit to xhlulu/scikit-learn that referenced this issue Apr 28, 2019
This continues the work done in scikit-learn#12639 on dropping the python 2 support by,
 - ~~removing unnecessary `from __future__` imports~~
 - removing unused `sklearn.utils.fixes` assuming we can agree in scikit-learn#12927 that `sklearn.utils.fixes` are private as was stated e.g. in scikit-learn#6616 (comment)
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this issue Jul 12, 2019
This continues the work done in scikit-learn#12639 on dropping the python 2 support by,
 - ~~removing unnecessary `from __future__` imports~~
 - removing unused `sklearn.utils.fixes` assuming we can agree in scikit-learn#12927 that `sklearn.utils.fixes` are private as was stated e.g. in scikit-learn#6616 (comment)
@NicolasHug

This comment has been minimized.

Copy link
Contributor

@NicolasHug NicolasHug commented Sep 8, 2019

I have opened #14913 that renames rbm.py into _rbm.py and multilayer_perceptron.py into _multilayer_perceptron.py.

Imports from neural_network.rbm and neural_network.multilayer_perceptron are still supported but deprecated.

I would like to do that for all modules. Please comment whether you'd like to see this happen.


Advantages:

  • Anything that is not in an __init__ file is now private. No more surprises. No more "ugh, this is public, I can't change it."
  • All these tools that are public without us knowing (and that are undocumented) are now private.
  • Private things cannot be imported unless you start with a _ somewhere. The distinction between public and private API is now very clear and explicit.
  • It's pretty easy to do, and PRs are pretty easy to review.

Drawbacks:

  • might make some users unhappy?? But if they make a case, we can still consider making something public again.
  • Breaks blame (requires to create new files)
  • Creates merge conflicts (which are arguably easy to fix).
  • will break pickles when deprecation ends (from #12927 (comment))
@GaelVaroquaux

This comment has been minimized.

Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Sep 8, 2019

@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented Sep 8, 2019

@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented Sep 8, 2019

@thomasjpfan

This comment has been minimized.

Copy link
Member

@thomasjpfan thomasjpfan commented Sep 9, 2019

+1 on moving forward with this.

@amueller

This comment has been minimized.

Copy link
Member

@amueller amueller commented Sep 9, 2019

I proposed this in 2017 (#9250) and I'm still in favor.

@rth

This comment has been minimized.

Copy link
Member Author

@rth rth commented Sep 16, 2019

I would like to do that for all modules
[...]
Creates merge conflicts (which are arguably easy to fix).

I'm wondering about the effect on merge conflicts of this. If it was just a rename, git would probably handle it fine, but we are both moving file.py to _file.py and re-creating file.py to raise the deprecation warning there. I think that would result in quite bad conflicts. For instance #10665 that changes neural_network/multilayer_perceptron.py now has a merge conflict since #14939 was merged. The proposed merge solution proposes to remove most code from multilayer_perceptron.py without adding any of the changes. So to solve it one would need to manually apply all changes to _multilayer_perceptron.py. That is doable there, but will be very annoying for larger PRs if we do this everywhere.

So I think we should at least evaluate some approaches that would preserve backward compatibility of imports from file.py without actually re-creating that file. What we might want is some form of import redirect from file.py to _file.py with a deprecation warning.

@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented Sep 16, 2019

@NicolasHug

This comment has been minimized.

Copy link
Contributor

@NicolasHug NicolasHug commented Sep 16, 2019

I agree that the conflicts aren't as trivial as I thought.

The solution I could come up with so far is that the PR authors must create a commit where they rename file.py into _file.py and create a new empty file.py (or better check it out from master).

Then the merge with master is easy.

But the process isn't necessarily obvious yeah

@rth

This comment has been minimized.

Copy link
Member Author

@rth rth commented Sep 18, 2019

Closing as a duplicate of #9250

If interested, please subscribe to that issue, as currently the conversation is split between here and there, which is confusing to follow.

@rth rth closed this Sep 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.