I just realized (by looking at 0ver.org ) that scikit-learn is also in Version 0.x. I could not find any discussion about version 1.0 in the issues.
I would like to understand the reasoning / see if there is any other channel where this topic is discussed.
Why it matters
Semantic Versioning is widespread. People who are new to Python still know (parts of) semantic versioning. Having software at a 0.x version makes it feel as if the software is brittle and prone to breaking changes.
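To make the SemVer expectation concrete, here is a minimal, dependency-free sketch (the `parse_semver` and `may_break` helpers are hypothetical, written only for illustration and not part of scikit-learn): under SemVer, compatibility is only promised within a major version, and a major version of 0 promises nothing at all.

```python
# Hypothetical helpers illustrating SemVer semantics: MAJOR.MINOR.PATCH,
# where MAJOR == 0 signals that any release may contain breaking changes.

def parse_semver(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def may_break(old, new):
    """May upgrading from `old` to `new` break user code under SemVer?"""
    old_major = parse_semver(old)[0]
    new_major = parse_semver(new)[0]
    # 0.x makes no compatibility promise; otherwise only a major bump may break.
    return old_major == 0 or old_major != new_major

print(may_break("0.21.3", "0.22.0"))  # True: 0.x gives no guarantee
print(may_break("1.0.0", "1.1.0"))    # False: same major version
print(may_break("1.2.0", "2.0.0"))    # True: major version bump
```

This is exactly the perception problem described above: as long as the major version is 0, every scikit-learn release looks like it could break user code, regardless of how stable the project actually is.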
scikit-learn does not use any of the
An alternative is calendar based versioning.
Why scikit-learn should be 1.0
The Process to get to 1.0
scipy made this really nice. I guess some of the developers there also have a look at scikit-learn, so I hope to get more details.
From my perspective, it looked as if the scipy community made the following steps to get to 1.0:
There's a milestone:
Personally, I think #7242 and #10603 need to be fixed.
I know some other people, including @adrinjalali and @GaelVaroquaux feel strongly about #4497 and #4143. As you can see from the numbers, these issues are quite old. There is no consensus yet on how to address these.
We have delayed 1.0 to allow a breaking change to fix these issues. Whether this is (still) a good strategy is debatable.
These have actually allowed us to discuss some of the longstanding issues in a more productive way. We could decide to postpone some of the issues, make a polished 1.0 and then address them in 2.0.
There are actually two separate things we might desire from a 1.0: stable interfaces and reliable implementations. So far most of our discussion has been about having the right interfaces, but there are also issues with our implementations. There's issues in
I would at least like to resolve the issues in
I'm not sure if writing a 1.0 paper is helpful, but it's something to consider.
Cool, I missed the 1.0 milestone - let's see if I can contribute :-)
It's awesome to see that this is already in progress. scikit-learn is a project that helped me a lot during my studies / career; I will try to find some time to give something back.
Personally, I would consider this as the "cherry on the top": Very nice to have, a very rewarding thing to do, probably less useful than many (all?) other things in the issue list. And also something that can be done at any point in time.
I'm not sure if this "issue" should be closed then. Maybe it is a good way to channel comments / suggestions?
One of the issues with adding additional papers is that it becomes less clear for users what to cite, and it splits our citation count.
I think having an issue to discuss 1.0 is not a bad idea so I think it's fine to leave this open to have a central place for discussion.
Since this came up again today: I'm a bit torn between wanting to have something I'm really happy with and getting a 1.0 out of the door.
I don't think the wish-list items will be done for the next release (currently called 0.22), and there's maybe a slight chance they will be done for the one after that.
If we want 1.0 to be stable in some sense, then we would really need to prioritize those issues, which we haven't done so far (from what I can tell).
We've certainly got enough content and enough quality-assurance tooling to suggest that we can be 1.0. If we're aiming for 1.0 we should work out what we want to include, focusing, I think, more on consistency than on features. 1.0, for instance, might be a good opportunity to improve parameter name/definition consistency, scale (and sample weight) invariance in parameter definitions, etc.
FWIW, some of the changes around sample props may be best done with backwards incompatibility. The change to NamedArray may also introduce backwards incompatibility that would deserve a major release. But, indeed, there would be no great harm if that major release was 1.x to 2.x rather than 0.x to 1.0.
Looking over the issues mentioned by @amueller in July, I wouldn't be concerned about 7242. Ensuring that the columns used for training / testing / inference are consistent is pretty basic. Regarding 10603, that is a valid point, and I think it should be true for a 1.0 release. Issue 4497 seems more like something that should not hold up a 1.0 release, while I do think 4143 is important enough that I'd like to see it in 1.0.
With the prevalence of pandas, I do have to say that named features is probably important enough to ensure that's in a 1.0 release.
Another feature I'd personally like to see before 1.0 is native support for categorical data (in tree models, or at least some of them), which is sort of a prerequisite for @amueller's #10603. We should also make an informed decision on randomness handling (scikit-learn/enhancement_proposals#24).
I agree with most of what has been said and I'm very happy to start considering 1.0 right now.
Let's bring up the 1.0 topic during the next meeting so we can start figuring out what could / should be in there.
#4143 (transforming y) is always *possible* already with an appropriate meta-estimator designed for a specific use-case (and the resampling components mostly just need decisions, although there are open questions about handling props aside from X and y), while #4497 (sample props) is more or less impossible for a user to achieve without rewriting our model selection tools. #7242 should be doable by the next release. #10603 has come a long way, but better handling of feature names would be good either for v1 or v2.
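To illustrate why #4143 is "always possible already": the meta-estimator pattern wraps any estimator, fits it on a transformed `y`, and maps predictions back. scikit-learn ships a real version of this idea as `sklearn.compose.TransformedTargetRegressor`; the `MeanRegressor` and `TransformedTargetWrapper` classes below are a dependency-free toy sketch, not library code.

```python
import math

class MeanRegressor:
    """Toy estimator: always predicts the mean of the training targets."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_ for _ in X]

class TransformedTargetWrapper:
    """Meta-estimator sketch: fit on func(y), predict through inverse_func."""
    def __init__(self, estimator, func, inverse_func):
        self.estimator = estimator
        self.func = func
        self.inverse_func = inverse_func
    def fit(self, X, y):
        self.estimator.fit(X, [self.func(v) for v in y])
        return self
    def predict(self, X):
        return [self.inverse_func(p) for p in self.estimator.predict(X)]

# Train on log-transformed targets, predict back on the original scale.
model = TransformedTargetWrapper(MeanRegressor(), math.log, math.exp)
model.fit([[0], [1], [2]], [1.0, math.e, math.e ** 2])
print(model.predict([[3]]))  # [e]: the geometric mean of the targets
```

The point of the contrast above is that no such user-side wrapper can fix #4497: routing sample props correctly requires cooperation from the model selection tools themselves.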
(2) We mention things like "XXX is deprecated in 0.22 and will be removed in 0.24" so we promise that there will be 0.24?
I don't think that's a problem. There are lots of valid solutions, but apart from anything else those messages are entirely about ensuring some local backwards compatibility *within* a major version. Once we jump to 1.0 we can make whatever choices we like (within reasonable risk).
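For context on what those deprecation messages do mechanically, here is a sketch of the common Python deprecation pattern behind "XXX is deprecated in 0.22 and will be removed in 0.24" (the names `old_function`/`new_function` are hypothetical, and scikit-learn's actual deprecation machinery may differ in detail): the old name keeps working for the stated window while emitting a `FutureWarning`.

```python
import warnings

def new_function(x):
    """The replacement API."""
    return x * 2

def old_function(x):
    """Deprecated alias: warns, then delegates to the new API."""
    warnings.warn(
        "old_function is deprecated in 0.22 and will be removed in 0.24; "
        "use new_function instead.",
        FutureWarning,
    )
    return new_function(x)

# Capture the warning to show the old name still works during the window.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_function(21)

print(result)                       # 42
print(caught[0].category.__name__)  # FutureWarning
```

As the reply above notes, such messages only promise local compatibility within a major version; a jump to 1.0 can resolve the pending removals however the project chooses.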
I would like to second the proposal for a version 1.0 paper, as publications are still an essential cornerstone of the academic world.
As a PhD student considering an academic career, and non-core developer of scikit-learn, my contributions currently work like this:
If there was a clear commitment to a publication, I would have leverage in discussions with my supervisor/faculty about allocating more time towards contributing to scikit-learn. I imagine other contributors are in similar situations.
I think these issues can be addressed. In my field (computational biology), papers about public resources are often updated every few years, i.e. there might be "The XY database in 2017", "in 2019", etc. One typically cites the latest iteration/highest version, which could be easily provided at https://scikit-learn.org/stable/about.html#citing-scikit-learn.