Add SPEC7: Seeding pseudo-random number generation #180
That's a good idea to try to uniformize all that 👍
(I suppose you meant SPEC and not NEP.)
spec-0007/index.md
Outdated
1. Because `np.random.seed` is so often used in practice, no seed means
   using the global `RandomState` object, `np.random.mtrand._rand`.
2. (Option a) When a seed is provided, a `RandomState` object is initialized with that seed.
3. (Option b) When a seed is provided, a `Generator` object is initialized with that seed.
4. If an instance of `RandomState` is provided, it is used as-is.
5. If an instance of `Generator` is provided, it is used as-is.
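As a rough illustration, the rules listed above could be sketched as a single normalization helper. The name `normalize_seed` and the `use_generator` switch (standing in for the choice between option a and option b) are hypothetical and not part of the SPEC draft.

```python
import numpy as np


def normalize_seed(seed=None, *, use_generator=False):
    """Hypothetical helper: turn a `seed` argument into a random number source."""
    if seed is None:
        # Rule 1: no seed means the global RandomState object.
        return np.random.mtrand._rand
    if isinstance(seed, (np.random.RandomState, np.random.Generator)):
        # Rules 4 and 5: existing instances are used as-is.
        return seed
    if use_generator:
        # Rule 3 (option b): seed a new Generator.
        return np.random.default_rng(seed)
    # Rule 2 (option a): seed a new RandomState.
    return np.random.RandomState(seed)
```

Note that rule 1 is what ties the helper to global state: any caller that omits the seed shares one `RandomState` with every other such caller.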
This is describing the current state in some libraries. But is it where we want to see this 10 years from now?
I am personally against any global state and advertising of any "legacy" behaviours.
I also feel we may want to think about a new keyword argument instead, one that adopts recommended best practices.
I am not sure I see how that could help.
If you add an `rng` argument to a function which already has `seed` or `random_state`, we don't avoid raising some warning about deprecation, etc.
The big problem with `random_state` is that it allows for `None`, which then grabs global state. So, that will always conflict with an `rng=None` kwarg.
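A small sketch of the conflict described above, assuming the legacy convention that `random_state=None` resolves to the global `RandomState`. The two functions are hypothetical stand-ins for library code, not taken from any project.

```python
import numpy as np


def legacy_draw(random_state=None):
    """Hypothetical legacy-style function: None falls back to global state."""
    if random_state is None:
        rs = np.random.mtrand._rand  # the object that np.random.seed reseeds
    else:
        rs = np.random.RandomState(random_state)
    return rs.uniform()


def modern_draw(rng=None):
    """Hypothetical new-style function: None means fresh OS entropy."""
    return np.random.default_rng(rng).uniform()


# Reseeding the global state makes the legacy calls repeat...
np.random.seed(123)
first = legacy_draw()
np.random.seed(123)
second = legacy_draw()
print(first == second)

# ...but it has no effect on a Generator created by default_rng().
np.random.seed(123)
third = modern_draw()
np.random.seed(123)
fourth = modern_draw()
print(third == fourth)
```

The same `None` default therefore means "shared, reseedable global state" in one convention and "independent, unseeded stream" in the other, which is why the two cannot share a keyword without a behavior change.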
To me, the ideal API (using today's tools at least) would be that `random_state=None` would give you `np.random.default_rng()`.
There is also the crazy thought, which I kind of like, from @ilayn: do not accept integers, only a `Generator` (or other object). Point being, you must provide an RNG if you want any reproducibility.
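To make the "only a `Generator`" idea concrete, here is a minimal sketch under the assumption that anything other than a `Generator` instance is rejected outright. The function `draw_strict` is hypothetical.

```python
import numpy as np


def draw_strict(rng):
    """Hypothetical function that accepts only a Generator, never an int or None."""
    if not isinstance(rng, np.random.Generator):
        raise TypeError("rng must be a numpy.random.Generator instance")
    return rng.uniform()


# Reproducibility is then entirely in the caller's hands:
# two Generators built from the same seed produce the same draw.
a = draw_strict(np.random.default_rng(2023))
b = draw_strict(np.random.default_rng(2023))
```

With this design there is no implicit seeding path at all, so the question of what `None` or an integer should mean never arises inside the library.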
I don't think we can do that, though, because it would be a backward incompatible change?
Accepting only Generator objects could work, but we still have to deal with None.
Both integer and global state behaviour are BC issues yep.
It would be interesting to see how painful the move to `Generator` was in real life, taking into account a large sample of projects, folks, etc.
Sometimes I feel like we concern ourselves too much with BC, while for users it might be super easy and accepted to make the change. It's mostly a communication issue to me. I keep coming back to my backend example, but those projects do break production code (intentionally or not) all the time. There are complaints, yes, but it's mostly OK: the ball keeps rolling, and these projects are loved and praised (FastAPI is known for doing this often and is seen as THE thing).
Yes, and don't expect the same favors from commercial code. It's shackles that we choose to put on, sometimes at great time cost to (oftentimes volunteer) developers.
There's a deprecation strategy that can work to migrate from |
To make sure I understand, this will change the return values of functions ( |
Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy.
@stefanv thanks for starting to summarize that long and complex discussion!
Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact. I guess the point here is:
And then there's the question of whether these scenarios matter. It may impact exact reproducibility of some scientific result. However, that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware. I'd suggest finding the most compelling scenario here, one that makes it as easy as possible to say that it's not acceptable, and hence we must change from |
The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using |
Right, and my take was that this is a desired outcome from our perspective.
By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust.
I am willing to spend time on the rewriting. The test suite is seriously out-of-date in many places anyway. You can even smell the year just by reading the comments.
I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward!
Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what |
Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using |
Why don't we emit a warning from numpy random.seed?
Because it will create utter havoc in the many valid uses in test suites?
Isn't that what you want, eventually?
Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed? |
Yes. There are plenty of ML programs (in particular) that call
One enormous hurdle at a time, please. 😉
Fair enough :)
There is no plan to do so. Deprecating |
Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer |
I think that would not prevent us from having a user warning. Average users don't read docs and keep copy-pasting old code until "something" gets in their way. So until it's visible in their code that something is legacy, they will keep using it, I am afraid. Also, reading NEP 19, to me it's really not clear that the global state would not change. The fate of
There seems to be some vague consensus around the deprecation approach. I don't want to run things ahead, but at the same time scikit-image has to make a calculated guess of what to do for its forthcoming release. So, without holding anyone to the fire, I will propose that we make the I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome. Thanks! |
Use `rng` consistently, replacing `random_state` and `seed`. See also scientific-python/specs#180
* Unify pseudo-random seeding interface

  Use `rng` consistently, replacing `random_state` and `seed`. See also scientific-python/specs#180

* Fix seeding in examples
skimage completed the transition for our 0.21 release (scikit-image/scikit-image#6922). Maybe premature, we'll see. But at least usage is consistent now.
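For readers unfamiliar with this kind of keyword migration, the following is a hedged sketch of what a transition from `random_state` to `rng` might look like in a single function. It is not scikit-image's actual implementation; the function `add_noise`, the warning text, and the choice of `FutureWarning` are all assumptions made for illustration. The sketch simply forwards the old keyword to `np.random.default_rng`, so an integer seed yields different numbers than `RandomState`-based code would.

```python
import warnings

import numpy as np


def add_noise(values, rng=None, random_state=None):
    """Hypothetical function migrating from `random_state` to `rng`."""
    if random_state is not None:
        if rng is not None:
            raise TypeError("pass either `rng` or `random_state`, not both")
        warnings.warn(
            "`random_state` is deprecated; use `rng` instead",
            FutureWarning,
            stacklevel=2,
        )
        rng = random_state
    # None, an int seed, or an existing Generator are all accepted here.
    rng = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    return values + rng.normal(size=values.shape)
```

Callers using the old keyword keep working but see a warning, while callers who never seeded are unaffected apart from no longer touching global state.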
I wonder whether this converged enough to be merged as a draft, certainly has more content and usage than some of the other specs?
Let me chat to the NetworkX people at their community call next week, and see what their thoughts are. While this is now implemented in scikit-image, it would be good to have at least multi-project consensus on this being a viable route forward.
Comment from the sideline (and as library author who recently added a first method to geopandas where we needed to decide which keyword name to use for this, without being constrained by existing methods): |
As you point out, if you start from scratch, using `random_state` is an option. But many libraries already use `random_state=None` to refer to the old `RandomState`. They cannot easily keep using that keyword without breaking backward compatibility in an unexpected way. So, the question is not so much finding an optimal argument name for one library, but coming to consensus over a consistent pattern that can be adopted across the ecosystem.
Yes, I fully understand that the deprecation process limits what can be used, but I hope someone can come up with a better name than `rng` that is also not `random_state`... (although I don't directly have an idea myself, other than spelling it out, which would make it rather long)
FWIW, I responded to a similar comment on the related scipy thread.
I agree that ideally we would want to have a descriptive name for non experts and newcomers. Has the ship really sailed to still use Otherwise I would be +1 on trying to brainstorm a new name.
Environment variables are never an option: it has to be explicit in the code.
Yes, that ship is long gone over the horizon. I think the other ship is gone, too. With TBF, I wanted |
FWIW, I don't think rng will be hard to teach: users will read it once and get it, especially if we make it a common pattern across the ecosystem:
`rng`, the thing we want as input, is already enshrined in the NumPy function name too. `prng` would have been slightly better from an educational point of view.
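A minimal sketch of what such a common pattern could look like, assuming functions accept an `rng` keyword and normalize it with `np.random.default_rng`. The function `shuffled` is hypothetical and only illustrates the shape of the API.

```python
import numpy as np


def shuffled(values, rng=None):
    """Hypothetical library function following the `rng` keyword pattern."""
    rng = np.random.default_rng(rng)  # accepts None, an int seed, or a Generator
    return rng.permutation(values)


# An integer seed gives repeatable results.
run_a = shuffled(np.arange(10), rng=42)
run_b = shuffled(np.arange(10), rng=42)

# An existing Generator is returned unaltered by default_rng, so one
# stream can be threaded through several calls.
gen = np.random.default_rng(42)
same_object = np.random.default_rng(gen) is gen
```

Because `default_rng` handles all three input kinds, the normalization is a single line in each function.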
Ah, sorry Robert, didn't see your message there.
For what it is worth, I think |
I went over the current document and strategy. I am +1.
I don't like the name 😅 but yes, this is something users will eventually get with the warnings; and the fact that it is written all over NumPy's docs helps us sell it.
The array API discussion about random number generation and the differences between the NumPy-style and JAX-style APIs may or may not be relevant here: data-apis/array-api#431. I don't know if it matters for seeding specifically, but I also know a lot of the same people on that discussion are already on this one.
Thanks for this effort! It might be useful to add a Motivation and/or Context section to this spec. I think it makes it a bit clearer why this spec was created and where it came from.
Under discussion at scipy/scipy#14322