Add SPEC7: Seeding pseudo-random number generation #180

stefanv · 2023-04-19T18:24:28Z

tupui

That's a good idea to try to uniformize all that 👍

(I suppose you meant SPEC and not NEP.)

tupui · 2023-04-19T18:31:38Z

spec-0007/index.md

+1. Because `np.random.seed` is so often used in practice, no seed means
+   using the global `RandomState` object, `np.random.mtrand._rand`.
+2. (Option a) When a seed is provided, a `RandomState` object is initialized with that seed.
+3. (Option b) When a seed is provided, a `Generator` object is initialized with that seed.
+4. If an instance of `RandomState` is provided, it is used as-is.
+5. If an instance of `Generator` is provided, it is used as-is.


This is describing the current state in some libraries. But is it where we want to see this 10 years from now?

I am personally against any global state and advertising of any "legacy" behaviours.

I also feel we may want to think about a new keyword argument instead, that adopts recommended best practices instead.

I am not sure I see how that could help.

If you add an rng argument to a function that already has seed or random_state, we still can't avoid raising deprecation warnings, etc.

The big problem with random_state is that it allows for None, which then grabs global state. So, that will always conflict with an rng=None kwarg.

To me, the ideal API (using today's tools at least) would be that random_state=None gives you np.random.default_rng().

There is also the crazy thought, which I kind of like, from @ilayn: do not accept integers, only a Generator (or other object). Point being, you must provide a RNG if you want any reproducibility.

I don't think we can do that, though, because it would be a backward incompatible change?

Accepting only Generator objects could work, but we still have to deal with None.
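For illustration, the Generator-only idea with None still handled might look like the sketch below. The function name `draw_uniform` is hypothetical, and treating None as "fresh, OS-seeded entropy" is an assumption for the sketch, not something settled in this thread.

```python
import numpy as np


def draw_uniform(n, rng=None):
    """Hypothetical function that accepts only a Generator (or None)."""
    if rng is None:
        # Assumption: None means fresh entropy, never the global state.
        rng = np.random.default_rng()
    elif not isinstance(rng, np.random.Generator):
        # Integers and RandomState instances are rejected on purpose.
        raise TypeError(
            "rng must be a numpy.random.Generator or None; "
            f"got {type(rng).__name__}"
        )
    return rng.uniform(size=n)


# Reproducibility requires the caller to build and pass the Generator.
a = draw_uniform(3, rng=np.random.default_rng(2023))
b = draw_uniform(3, rng=np.random.default_rng(2023))
assert np.array_equal(a, b)
```

Under this design an integer seed is an error, so the only way to get repeatable numbers is to construct the Generator explicitly.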

Both integer and global state behaviour are BC issues yep.

It would be interesting to see how painful the move to Generator actually was in real life, taking into account a large sample of projects, folks, etc.

I.e., sometimes I feel like we are concerning ourselves too much with BC, while for users it might be super easy and acceptable to make the change. It's mostly a communication issue to me. I keep coming back to my backend example, but projects there do break production code (intentionally or not) all the time. There are complaints, yes, but it's mostly OK: the ball keeps rolling and these projects are loved and praised (FastAPI is known for doing this often and is seen as THE thing).

Yes, and don't expect the same favors from commercial code. It's shackles that we choose to put on, sometimes at great time cost to (oftentimes volunteer) developers.

rkern · 2023-04-19T19:18:06Z

> The big problem with random_state is that it allows for None, which then grabs global state. So, that will always conflict with an rng=None kwarg.

There's a deprecation strategy that can work to migrate from random_state to rng, if one wants that. Functions will (for a time) take both random_state and rng arguments. Have a check_rng(rng=None, random_state=None) function, which function authors can use, that returns a Generator given its arguments. You can change the behavior over time, with DeprecationWarnings. So on the first release, if rng=None, it looks at random_state and starts issuing DeprecationWarnings, but otherwise uses the same semantics as check_random_state(random_state) (but taking the BitGenerator out of the resulting RandomState and wrapping it in a Generator instead). Then, when you enforce the deprecations, you can migrate to just returning default_rng(rng) and start raising informative errors if something other than None is passed to random_state. Eventually you can drop the random_state= argument entirely.
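A rough sketch of the two phases described above, under stated assumptions: the names `check_rng_transition` and `check_rng_enforced` are hypothetical, the sketch warns whenever the legacy path is taken (including when both arguments are None), and it reaches the legacy BitGenerator through the private `RandomState._bit_generator` attribute. None of these choices is prescribed by the comment; they are one way to read it.

```python
import warnings

import numpy as np


def check_rng_transition(rng=None, random_state=None):
    """Transition phase: accept both keywords, always return a Generator."""
    if rng is not None:
        if random_state is not None:
            # Assumption: passing both keywords is treated as an error.
            raise TypeError("pass only one of `rng` and `random_state`")
        return np.random.default_rng(rng)

    # Legacy path: same semantics as check_random_state(random_state).
    warnings.warn(
        "`random_state` (including the implicit global state) is "
        "deprecated; pass `rng=` instead",
        DeprecationWarning,
        stacklevel=2,
    )
    if isinstance(random_state, np.random.Generator):
        return random_state
    if random_state is None:
        legacy = np.random.mtrand._rand  # the global RandomState
    elif isinstance(random_state, np.random.RandomState):
        legacy = random_state
    else:
        legacy = np.random.RandomState(random_state)
    # Share the legacy BitGenerator so np.random.seed() still has an effect.
    # `_bit_generator` is a private attribute (assumption of this sketch).
    return np.random.Generator(legacy._bit_generator)


def check_rng_enforced(rng=None, random_state=None):
    """After the deprecation is enforced: only `rng` is honored."""
    if random_state is not None:
        raise TypeError("`random_state` is no longer supported; use `rng=`")
    return np.random.default_rng(rng)
```

During the transition, a script that calls `np.random.seed(...)` once at the top keeps getting a repeatable stream from `check_rng_transition()`, although the exact numbers may differ from what `RandomState` methods produced. Once the enforced version is in place, that global seed no longer has any effect.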

stefanv · 2023-04-20T04:15:06Z

> There's a deprecation strategy that can work to migrate from random_state to rng

To make sure I understand: this will change the return values of functions called with rng=None, random_state=None over time (i.e., violate the Hinsen principle), but we can choose how long that time period is.

rkern · 2023-04-20T04:37:10Z

Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy.

rgommers · 2023-04-25T17:20:59Z

@stefanv thanks for starting to summarize that long and complex discussion!

> Specifically, behavior will change over a long period of time in the case where no seed is specified

Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact.

I guess the point here is:

- the user may be seeding elsewhere with np.random.seed(a_number),
- and not threading the state for that seed through to this API call,
- and executing without multiprocessing or a similar parallel mechanism,
- hence was relying implicitly on the global state controlled by seeding,
- this global state is now no longer determining the exact random numbers for the unseeded API call,
- hence the numerical result returned by that API call changes.

And then there's the question whether this scenario matters. It may impact exact reproducibility of some scientific result. However, that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware.

I'd suggest finding the most compelling scenario here, that makes it as easy as possible to say that that's not acceptable, and hence we must change from random_state to rng.

rkern · 2023-04-25T19:22:50Z

The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using scipy and sklearn components that use check_random_state()) that rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

stefanv · 2023-04-25T19:44:56Z

> Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Right, and my take was that this is a desired outcome from our perspective.

stefanv · 2023-04-25T19:47:39Z

> And then there's the question whether this scenario matters.

By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust.

ilayn · 2023-04-25T19:50:04Z

I am willing to spend time on the rewriting. The test suite is seriously out of date in many places anyway. You can even smell the year just by reading the comments.

stefanv · 2023-04-25T20:07:58Z

I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward!

rkern · 2023-04-25T20:57:24Z

> Right, and my take was that this is a desired outcome from our perspective.

Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what random_state=None does since stream reproducibility across versions is not something that most of the downstream libraries using the check_random_state() pattern guarantee. I was confirming that there was something beyond the stream reproducibility that is a concern.

rgommers · 2023-04-27T22:53:17Z

> rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using multiprocessing. So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-run experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't. Right?

stefanv · 2023-04-27T23:02:58Z

Why don't we emit a warning from numpy random.seed?

rgommers · 2023-04-27T23:07:08Z

Because it will create utter havoc in the many valid uses in test suites?

stefanv · 2023-04-27T23:35:48Z

Isn't that what you want, eventually?

stefanv · 2023-04-27T23:42:45Z

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

rkern · 2023-04-28T00:21:39Z

> So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-run experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't.

Yes. There are plenty of ML programs (in particular) that call np.random.seed(seed) (and random.seed(seed) and torch.manual_seed(seed), etc.) at the top because that's what they've been told to do (and encouraged to do by well-meaning frameworks) and call sklearn and scipy functions, and those scripts are deterministic (and mostly fine). That will silently change if we just change check_random_state(None) to do default_rng(None) without a deprecation switcheroo. I care less about the results they get from any one run (they'll be perfectly valid) than about the fact that these scripts will now show unexpected and hard-to-debug behavior as everyone chases down a bunch of red herrings.
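To make the silent change concrete, here is a small sketch. Both helpers are simplified, hypothetical stand-ins (not the actual scipy or sklearn `check_random_state`), and `run_script` mimics a script that seeds globally at the top and then calls a library function without passing any seed.

```python
import numpy as np


def check_random_state_legacy(seed=None):
    """Simplified stand-in for the existing helper: None -> global state."""
    if seed is None:
        return np.random.mtrand._rand
    if isinstance(seed, np.random.RandomState):
        return seed
    return np.random.RandomState(seed)


def check_random_state_changed(seed=None):
    """Hypothetical silent change: None -> fresh, OS-seeded Generator."""
    return np.random.default_rng(seed)


def run_script(helper):
    np.random.seed(1234)  # the customary line at the top of a script
    return helper(None).uniform(size=3)  # library call left unseeded


legacy_same = np.array_equal(
    run_script(check_random_state_legacy),
    run_script(check_random_state_legacy),
)
changed_same = np.array_equal(
    run_script(check_random_state_changed),
    run_script(check_random_state_changed),
)

assert legacy_same  # the global seed controls the legacy path
# With the changed helper the two runs differ (with overwhelming
# probability), and nothing warns the user about it.
print(legacy_same, changed_same)
```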

> Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

One enormous hurdle at a time, please. 😉

stefanv · 2023-04-28T00:59:53Z

> One enormous hurdle at a time, please. 😉

Fair enough :)

rgommers · 2023-04-28T08:38:23Z

> Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

There is no plan to do so. Deprecating random_state=None in scipy and (hopefully after that) scikit-learn is a very different thing from deprecating the legacy numpy.random functionality. There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

rgommers · 2023-04-28T08:43:59Z

> and those scripts are deterministic (and mostly fine)

Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with random_state, I think the average reader will understand this well enough.

That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer rng over random_state.

tupui · 2023-04-28T08:51:35Z

> There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

I think that would not prevent us from having a user warning. Average users don't read docs and keep copy-pasting old code until "something" gets in their way. So until it's visible in their code that something is legacy, they will keep using it, I am afraid.

Also, reading NEP 19, it's really not clear to me that the global state would not change. The fate of RandomState is clear, but the rest of the global section even ends with:

> This NEP does not propose that these requirements remain in perpetuity. After we have experience with the new PRNG subsystem, we can and should revisit these issues in future NEPs.

stefanv · 2023-05-02T23:12:16Z

There seems to be some vague consensus around the deprecation approach. I don't want to get ahead of things, but at the same time scikit-image has to make a calculated guess about what to do for its forthcoming release. So, without holding anyone's feet to the fire, I will propose that we make the seed (random_state) to rng transition there, and hope that it doesn't cause too much work in the future should the decision here go differently.

I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome.

Thanks!


stefanv · 2023-05-10T16:50:23Z

skimage completed the transition for our 0.21 release (scikit-image/scikit-image#6922). Maybe premature, we'll see. But at least usage is consistent now.

bsipocz · 2023-05-31T18:03:21Z

I wonder whether this has converged enough to be merged as a draft. It certainly has more content and usage than some of the other specs.

stefanv · 2023-05-31T18:53:38Z

Let me chat to the NetworkX people at their community call next week, and see what their thoughts are. While this is now implemented in scikit-image, it would be good to have at least multi-project consensus on this being a viable route forward.

jorisvandenbossche · 2023-06-01T12:49:17Z

Comment from the sideline (and as library author who recently added a first method to geopandas where we needed to decide which keyword name to use for this, without being constrained by existing methods):
From a user API perspective, I find the keyword name "rng" rather unclear / unintuitive. You already need to know it is short for "random number generator" (I assume), but I think many users won't know that. In contrast, something like "random_state" is much clearer from the name that it has to do with the random aspect of the function.

stefanv · 2023-06-01T15:35:52Z

As you point out, if you start from scratch, using random_state is an option. But many libraries already use random_state=None to refer to the old RandomState. They cannot easily keep using that keyword without breaking backward compatibility in an unexpected way.

So, the question is not so much finding an optimal argument name for one library, but coming to consensus over a consistent pattern that can be adopted across the ecosystem.

jorisvandenbossche · 2023-06-01T15:43:41Z

Yes, I fully understand that the deprecation process limits what can be used, but I hope someone could come up with a better name than "rng" and which is not "random_state" .. (although I don't directly have an idea myself, except from spelling it out, which would make it rather long)

rkern · 2023-06-01T15:48:11Z

FWIW, I responded to a similar comment on the related scipy thread.

tupui · 2023-06-01T15:49:45Z

I agree that ideally we would want to have a descriptive name for non experts and newcomers.

Has the ship really sailed on keeping random_state but changing the semantics progressively? (There are a few options, like using an env variable.) Because at some point, whatever we do, it seems like we are going to introduce some churn.

Otherwise I would be +1 on trying to brainstorm a new name.

randomness, chaos_state, etc. I am sure we can come up with something 😃

stefanv · 2023-06-01T15:58:41Z

Environment variables are never an option: it has to be explicit in the code.

rkern · 2023-06-01T15:59:12Z

Yes, that ship is long gone over the horizon.

I think the other ship is gone, too. With default_rng() and all of the numpy documentation using rng = default_rng(), rng is here to stay as a term that everyone will become familiar with.

TBF, I wanted prng = default_prng() to emphasize the pseudo aspect of the pseudorandomness, but people wanted something even shorter than that.

stefanv · 2023-06-01T16:04:57Z

FWIW, I don't think rng will be hard to teach: users will read it once and get it, especially if we make it a common pattern across the ecosystem:

rng = np.random.default_rng(x)

Rng, the thing we want as input, is already enshrined in the NumPy function name too.

prng would have been slightly better from an educational point of view.

random_generator is a longer form that is quite intuitive and also captures the notion of a seed adequately.
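As a small illustration of how the `rng = np.random.default_rng(x)` pattern reads in user code, here is a sketch with a hypothetical library function `add_noise`. It relies only on the documented behavior that `default_rng` accepts None, an integer seed, or an existing Generator (which it returns unchanged).

```python
import numpy as np


def add_noise(values, scale=0.1, rng=None):
    """Hypothetical library function following the `rng` pattern."""
    # None -> fresh entropy, int -> seeded, Generator -> used as-is.
    rng = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    return values + rng.normal(scale=scale, size=values.shape)


data = np.zeros(4)

rng = np.random.default_rng(12345)
first = add_noise(data, rng=rng)
second = add_noise(data, rng=rng)  # continues the same stream

replay = add_noise(data, rng=12345)  # new Generator from the same seed

assert np.array_equal(first, replay)
assert not np.array_equal(first, second)
```

The user creates one Generator and threads it through every call, so the whole run is controlled by a single explicit seed rather than by global state.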

stefanv · 2023-06-01T16:06:38Z

Ah, sorry Robert, didn't see your message there.

betatim · 2023-06-05T09:56:03Z

For what it is worth, I think rng is no more or less arbitrary (some might say random ..) than random_state. You gotta learn it once and then you know it. Consistency across the ecosystem will be super useful here (just like it is for random_state).

I think rng has a nice ... ring to it ;) (I couldn't resist)

tupui

I went over the current document and strategy. I am +1.

I don't like the name 😅 but yes this is something users will eventually get with the warnings; and being written all over in NumPy's doc is helping us to sell it.

asmeurer · 2023-07-13T18:36:48Z

The array API discussion about random number generation and the differences between the NumPy-style and JAX-style APIs may or may not be relevant here: data-apis/array-api#431. I don't know if it matters for seeding specifically, but I also know a lot of the same people on that discussion are already on this one.

EwoutH · 2023-10-15T10:10:52Z

Thanks for this effort! It might be useful to add a Motivation and/or Context section to this spec. I think it makes it a bit clearer why this spec was created and where it came from.

stefanv mentioned this pull request Apr 19, 2023

ENH: refactor RNG usage to use np.random.Generator scipy/scipy#14322

Open

tupui reviewed Apr 19, 2023


stefanv changed the title ~~Add draft NEP7: seeding pseudo-random number generation~~ Add draft SPEC7: seeding pseudo-random number generation Apr 19, 2023

Add draft SPEC7: seeding pseudo-random number generation

39eb105

stefanv force-pushed the nep7-prng branch from 3533db4 to 39eb105 Compare April 19, 2023 18:34

stefanv closed this Apr 19, 2023

stefanv deleted the nep7-prng branch April 19, 2023 18:36

stefanv restored the nep7-prng branch April 19, 2023 18:37

stefanv reopened this Apr 19, 2023

Add deprecation strategy

f3906b1

Explain why globally seeded code will become unseeded

8f3500b

stefanv added a commit to stefanv/scikit-image that referenced this pull request May 9, 2023

Unify pseudo-random seeding interface

395e64b

Use `rng` consistently, replacing `random_state` and `seed`. See also scientific-python/specs#180

stefanv mentioned this pull request May 9, 2023

Unify pseudo-random seeding interface scikit-image/scikit-image#6922

Merged

mkcor mentioned this pull request May 13, 2023

set random seed for np.random.default_rng scikit-image/scikit-image#6929

Open

lorentzenchr mentioned this pull request May 17, 2023

Support numpy.random.Generator and/or BitGenerator for random number generation scikit-learn/scikit-learn#16988

Open

stefanv mentioned this pull request May 31, 2023

Add missing PRs to release notes scikit-image/scikit-image#6949

Merged

martinfleis mentioned this pull request Jun 1, 2023

REF: deprecate seed keyword to rng in sample_points geopandas/geopandas#2913

Merged

tupui approved these changes Jun 26, 2023


rkern mentioned this pull request Jul 31, 2023

Stochastic tutorial added numpy/numpy-tutorials#185

Open

rgommers mentioned this pull request Sep 3, 2023

ENH: constructors for sparse arrays scipy/scipy#19171

Merged

tupui mentioned this pull request Sep 20, 2023

Numpy RNGs blog post scientific-python/blog.scientific-python.org#135

Merged

jni mentioned this pull request Sep 26, 2023

Implement direct color calculation in shaders for Labels auto color mode napari/napari#6179

Merged

jarrodmillman changed the title ~~Add draft SPEC7: seeding pseudo-random number generation~~ Add SPEC7: Seeding pseudo-random number generation Mar 5, 2024
