
Version 1.0 of scikit-learn #14386

Open
MartinThoma opened this issue Jul 17, 2019 · 13 comments

@MartinThoma (Contributor) commented Jul 17, 2019

I just realized (by looking at 0ver.org) that scikit-learn is still at version 0.x. I could not find any discussion about version 1.0 in the issues.

I would like to understand the reasoning and see whether there is another channel where this topic is being discussed.

Why it matters

Semantic Versioning is widespread. Even people who are new to Python know (parts of) semantic versioning. Having software at a 0.x version makes it feel brittle / prone to breaking changes.

scikit-learn does not use any of the Development Status :: trove classifiers (setup.py, list of trove classifiers). Although I guess anybody working with Python has heard of scikit-learn, it might be hard for a newcomer to quickly assess the maturity of the project.
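
For illustration, declaring a Development Status classifier is a one-line change in the package metadata. The following is only a hypothetical sketch of what a setup.py excerpt with such a classifier could look like (the package name and version are placeholders, not scikit-learn's real packaging code):

```python
# Hypothetical setup.py excerpt -- placeholder metadata, not scikit-learn's.
from setuptools import setup

setup(
    name="example-package",
    version="1.0.0",
    classifiers=[
        # Declares maturity explicitly instead of relying on the 0.x/1.x number.
        "Development Status :: 5 - Production/Stable",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: BSD License",
    ],
)
```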

An alternative is calendar-based versioning.

Why scikit-learn should be 1.0

  • Widespread: 35,895 stars on GitHub
  • Maturity:
    • First release in 2010
    • Releases so far: 29
    • A lot of software relies on it (according to GitHub: 61,504 repositories!)
    • 17,910 articles have cited the version 0.8 publication

The Process to get to 1.0

SciPy handled this really nicely. I guess some of the developers there also keep an eye on scikit-learn, so I hope to get more details.

From my perspective, it looked as if the SciPy community took the following steps to get to 1.0:

  • Code changes (see 1.0.0 Milestone of scipy):
    • Are there key features missing?
    • Are there important interface changes that should be done?
    • Are there any other issues that need to be solved before 1.0?
  • Add a community governance document (scipy)
  • Write a version 1.0 paper (scipy) - this might be a nice reward for a couple of contributors, if they are in academia. Lasagne (a deep learning library) did a simpler version of this (the Lasagne software publication), which is still nice because people can cite what they used. scikit-learn did that a while ago as well. There is also a nice TensorFlow whitepaper.
@amueller (Member) commented Jul 17, 2019

There's a milestone:
https://github.com/scikit-learn/scikit-learn/issues?q=is%3

Personally, I think #7242 and #10603 need to be fixed.
Right now it's not possible to train a pipeline with preprocessing and logistic regression on the Titanic dataset and figure out what the coefficients mean. This is work in progress, and we have already made strides. Once we have support for feature names, I think we're at a reasonable point.
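
To make the problem concrete, here is a minimal sketch (with made-up toy data standing in for the Titanic columns, not taken from the issue tracker) of why the coefficients are hard to interpret today: after one-hot encoding inside a ColumnTransformer, the fitted coefficients no longer line up one-to-one with the input columns, and the pipeline offers no built-in way to recover the expanded feature names.

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few Titanic-like columns -- illustration only.
X = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, 22.0, 35.0, 27.0],
})
y = [1, 0, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["pclass", "age"]),
    ("cat", OneHotEncoder(), ["sex"]),
])
model = Pipeline([("preprocess", preprocess), ("lr", LogisticRegression())])
model.fit(X, y)

# One coefficient per *transformed* column (here 2 scaled numeric + 2 one-hot),
# with no built-in mapping back to the original column names.
print(model.named_steps["lr"].coef_)
```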

I know some other people, including @adrinjalali and @GaelVaroquaux, feel strongly about #4497 and #4143. As you can see from the numbers, these issues are quite old. There is no consensus yet on how to address them.
These also relate to being able to undersample and oversample for imbalanced data, which scikit-learn doesn't support.

We have delayed 1.0 to allow a breaking change to fix these issues. Whether this is (still) a good strategy is debatable.

We very recently introduced a governance document, a roadmap, and an enhancement proposal formalism.

These have actually allowed us to discuss some of the long-standing issues in a more productive way. We could decide to postpone some of the issues, make a polished 1.0, and then address them in 2.0.
Or we could keep working on them and release 1.0 once we have addressed or punted on them.
It is helpful, I think, to consider a timeline for 1.0 and what we want from it.

There are actually two separate things we might desire from a 1.0: stable interfaces and reliable implementations. So far most of our discussion has been around having the right interfaces, but there are also issues with our implementations. There are issues in LatentDirichletAllocation, in much of the cross_decomposition module, and in some of the Bayesian linear models, and there are pretty annoying issues with respect to convergence and solver choices in LogisticRegression and LinearSVC.

I would at least like to resolve the issues in LogisticRegression and LinearSVC before we do a 1.0.
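
As a small, hedged illustration of the convergence point above (synthetic data, not taken from any scikit-learn issue): with badly scaled features and the default solvers and iteration limits, both estimators can stop before converging and emit a ConvergenceWarning instead of silently doing the right thing.

```python
# Synthetic illustration of the convergence/solver issue described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X *= 1e3  # deliberately badly scaled features

# With the default solver and max_iter, both fits may raise a
# ConvergenceWarning on data like this.
LogisticRegression().fit(X, y)
LinearSVC().fit(X, y)
```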

I'm not sure if writing a 1.0 paper is helpful, but it's something to consider.

@MartinThoma (Contributor, Author) commented Jul 17, 2019

Cool, I missed the 1.0 milestone - let's see if I can contribute :-)

It's awesome to see that this is already in progress. scikit-learn is a project that helped me a lot during my studies / career; I will try to find some time to give something back.

> I'm not sure if writing a 1.0 paper is helpful, but it's something to consider.

Personally, I would consider this the "cherry on top": very nice to have and very rewarding, but probably less useful than many (all?) other things on the issue list. It is also something that can be done at any point in time.

I'm not sure if this "issue" should be closed then. Maybe it is a good way to channel comments / suggestions?

@amueller (Member) commented Jul 17, 2019

One of the issues with adding additional papers is that it becomes less clear for users what to cite, and it splits our citation count.
On the other hand, it allows new contributors to share in the citations (@jnothman and I are not on the published journal version of the previous paper).
These are somewhat tangential issues, though.

Having an issue to discuss 1.0 is not a bad idea, so I think it's fine to leave this open as a central place for discussion.

@amueller (Member) commented Dec 4, 2019

Since this came up again today: I'm a bit torn between wanting to have something I'm really happy with and getting a 1.0 out the door.

I don't think the wish-list items will be done for the next release (currently called 0.22), and there's maybe a slight chance they will be done for the one after that.

If we want 1.0 to be stable in some sense, then we would really need to prioritize those issues, which we haven't done so far (as far as I can tell).

@jnothman (Member) commented Dec 4, 2019

I think I have come to agree that we should just do 1.0, and if we want to make any big changes, those should go into 2.0.

We've certainly got enough content and enough quality-assurance tooling to suggest that we can be 1.0. If we're aiming for 1.0, we should work out what we want to include, focusing, I think, more on consistency than on features. 1.0, for instance, might be a good opportunity to improve parameter name/definition consistency, scale (and sample weight) invariance in parameter definitions, etc.

FWIW, some of the changes around sample props may be best done with backwards incompatibility. The change to NamedArray may also introduce backwards incompatibility that would deserve a major release. But, indeed, there would be no great harm if that major release were 1.x to 2.0 rather than 0.x to 1.0.

@GaelVaroquaux (Member) commented Dec 5, 2019

@ahowe42 commented Dec 5, 2019

Looking over the issues mentioned by @amueller in July, I wouldn't be concerned about #7242; ensuring that the columns used for training / testing / inference are consistent is pretty basic. Regarding #10603, that is a valid point, and I think it should hold for a 1.0 release. Issue #4497 seems more like something that should not hold up a 1.0 release, while I do think #4143 is important enough that I'd like to see it in 1.0.

With the prevalence of pandas, I do have to say that named features are probably important enough to ensure they make it into a 1.0 release.

@NicolasHug (Contributor) commented Dec 5, 2019

Another feature I'd personally like to see before 1.0 is native support for categorical data (in tree models, or at least some of them), which is sort of a prerequisite for @amueller's #10603. I'd also like us to make an informed decision on randomness handling (scikit-learn/enhancement_proposals#24).
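
For context, a minimal sketch (toy data, not from this thread) of the current workaround: without native categorical support, categorical columns have to be encoded, e.g. with an OrdinalEncoder, before they reach a tree ensemble, and the trees then split on the encoded integers as if they were ordered values.

```python
# Status quo: categorical columns must be encoded manually before a
# tree-based model can consume them.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"color": ["red", "blue", "green", "blue"],
                  "size": ["S", "M", "L", "M"]})
y = [0, 1, 1, 0]

# OrdinalEncoder maps categories to arbitrary integers, which the trees
# then treat as ordered -- exactly the limitation that native categorical
# support would remove.
model = make_pipeline(OrdinalEncoder(), RandomForestClassifier(random_state=0))
model.fit(X, y)
```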

I agree with most of what has been said and I'm very happy to start considering 1.0 right now.

Let's bring up the 1.0 topic during the next meeting so we can start figuring out what could / should be in there.

@agramfort (Member) commented Dec 5, 2019

@qinhanmin2014 (Member) commented Dec 5, 2019

+1 to releasing 1.0 ASAP; two questions:
(1) Is it acceptable to have experimental features in 1.0? (I guess we have to.)
(2) We mention things like "XXX is deprecated in 0.22 and will be removed in 0.24", so are we promising that there will be a 0.24?
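
For reference on (1): experimental features are currently gated behind an explicit opt-in import, which keeps them outside the usual deprecation guarantees. A minimal example, as the API looks in the 0.22 era:

```python
# Experimental estimators must be enabled explicitly before import.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()
```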

@NicolasHug (Contributor) commented Dec 5, 2019

(1) Ideally these would be stable by then, IMO.
(2) There will probably be two major releases between the time we decide on 1.0 and the time we release it, so that might not be a problem.

@jnothman (Member) commented Dec 6, 2019

@adrinjalali added this to To do in Meeting Issues on Jan 6, 2020
@VarIr (Contributor) commented Jan 19, 2020

I would like to second the proposal for a version 1.0 paper, as publications are still an essential cornerstone of the academic world.

I am a PhD student considering an academic career and a non-core developer of scikit-learn; my contributions currently work like this:

  1. Stumble upon some issue that must be solved for my own projects building upon scikit-learn
  2. Fix the code for my project during working hours
  3. Create a PR outside working hours, because there are always so many other tasks, and those for which I can get academic credit take precedence. In the end, I want to contribute, so I do this in my free time.

If there was a clear commitment to a publication, I would have leverage in discussions with my supervisor/faculty about allocating more time towards contributing to scikit-learn. I imagine other contributors are in similar situations.

> One of the issues with adding additional papers is that it becomes less clear for users what to cite, and it splits our citation count.

I think these issues can be addressed. In my field (computational biology), papers about public resources are often updated every few years, i.e. there might be "The XY database in 2017", "... in 2019", etc. One typically cites the latest iteration/highest version, which could easily be pointed to at https://scikit-learn.org/stable/about.html#citing-scikit-learn.
Aggregating two (later on, a handful of) numbers into a global scikit-learn citation count should be doable as well.
In addition, a number of academic metrics only take into account publications from the last five years, which already excludes the JMLR paper.
