Add SPEC7: Seeding pseudo-random number generation #180

Draft: wants to merge 3 commits into base: main
Conversation

stefanv
Member

@stefanv stefanv commented Apr 19, 2023

Under discussion at scipy/scipy#14322

Member

@tupui tupui left a comment


That's a good idea to try to uniformize all that 👍

(I suppose you meant SPEC and not NEP.)

Comment on lines 43 to 48
1. Because `np.random.seed` is so often used in practice, no seed means
using the global `RandomState` object, `np.random.mtrand._rand`.
2. (Option a) When a seed is provided, a `RandomState` object is initialized with that seed.
3. (Option b) When a seed is provided, a `Generator` object is initialized with that seed.
4. If an instance of `RandomState` is provided, it is used as-is.
5. If an instance of `Generator` is provided, it is used as-is.
Member

This is describing the current state in some libraries. But is it where we want to see this 10 years from now?

I am personally against any global state and advertising of any "legacy" behaviours.

Member Author

I also feel we may want to think about a new keyword argument instead, that adopts recommended best practices instead.

Member

I am not sure I see how that could help.

If you add rng to a function that already has seed or random_state, we still can't avoid raising deprecation warnings, etc.

Member Author

The big problem with random_state is that it allows for None, which then grabs global state. So, that will always conflict with an rng=None kwarg.

Member

To me, the ideal API (using today's tools at least) would be that random_state=None would give you np.random.default_rng().

There is also the crazy thought, which I kind of like, from @ilayn: do not accept integers, only a Generator (or other object). Point being, you must provide a RNG if you want any reproducibility.
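
A minimal sketch of the two ideas in this comment, for illustration only. The function name `sample_noise` and its `rng` parameter are hypothetical; the sketch assumes that `None` maps to a fresh `np.random.default_rng()` and that anything other than a `Generator` is rejected, so reproducibility requires the caller to supply a generator.

```python
import numpy as np


def sample_noise(n, rng=None):
    """Hypothetical API: accepts only None or a np.random.Generator."""
    if rng is None:
        # No global state: a fresh generator seeded from OS entropy.
        rng = np.random.default_rng()
    elif not isinstance(rng, np.random.Generator):
        # Integers are deliberately rejected in this sketch.
        raise TypeError("rng must be None or a numpy.random.Generator")
    return rng.standard_normal(n)


# Reproducibility only comes from an explicitly constructed Generator.
first = sample_noise(3, rng=np.random.default_rng(2023))
second = sample_noise(3, rng=np.random.default_rng(2023))
print(np.array_equal(first, second))
```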

Member Author

I don't think we can do that, though, because it would be a backward incompatible change?

Member Author

Accepting only Generator objects could work, but we still have to deal with None.

Member

@tupui tupui Apr 19, 2023

Both integer and global state behaviour are BC issues yep.

It would be interesting to see how in real life Generator was painful to move to, taking into account a large sample of projects, folks, etc.

i.e. sometimes I feel like we are concerning ourselves too much with BC, while for users it might be super easy and accepted to make the change. It's mostly a communication issue to me. Always taking my backend example, but there they do break (intentionally or not) production code like all the time. There are complaints, yep, but it's mostly ok and the ball is rolling and these projects are loved and praised (FastAPI is known for doing this often and seen as THE thing.)

Member Author

Yes, and don't expect the same favors from commercial code. It's shackles that we choose to put on, sometimes at great time cost to (oftentimes volunteer) developers.

@stefanv stefanv changed the title Add draft NEP7: seeding pseudo-random number generation Add draft SPEC7: seeding pseudo-random number generation Apr 19, 2023
@stefanv stefanv closed this Apr 19, 2023
@stefanv stefanv deleted the nep7-prng branch April 19, 2023 18:36
@stefanv stefanv restored the nep7-prng branch April 19, 2023 18:37
@stefanv stefanv reopened this Apr 19, 2023
@rkern

rkern commented Apr 19, 2023

The big problem with random_state is that it allows for None, which then grabs global state. So, that will always conflict with an rng=None kwarg.

There's a deprecation strategy that can work to migrate from random_state to rng, if one wants that. Functions will (for a time) take both random_state and rng arguments. Have a check_rng(rng=None, random_state=None) function, which function authors can use, that returns a Generator given its arguments. You can change the behavior over time, with DeprecationWarnings. So in the first release, if rng=None, it looks at random_state and starts issuing DeprecationWarnings, but otherwise uses the same semantics as check_random_state(random_state) (but taking the BitGenerator out of the resulting RandomState and wrapping it in a Generator instead). Then, when you enforce the deprecations, you can migrate to just returning default_rng(rng) and start raising informative errors if something other than None is passed to random_state; eventually you can drop the random_state= argument entirely.
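
The first-release stage of the strategy described above could be sketched roughly as follows. This is an illustration under assumptions, not an implementation from NumPy, SciPy, or the SPEC: `check_rng` is the hypothetical helper named in the comment, the handling of each `random_state` type is a simplified stand-in for `check_random_state`, warning on every legacy-path call is a choice made here, and the sketch reaches into the private `RandomState._bit_generator` attribute and `np.random.mtrand._rand`.

```python
import numbers
import warnings

import numpy as np


def check_rng(rng=None, random_state=None):
    """Hypothetical transition helper; always returns a np.random.Generator."""
    if rng is not None:
        if random_state is not None:
            raise TypeError("Pass only one of `rng` and `random_state`.")
        # New-style path: ints, SeedSequences, BitGenerators and Generators.
        return np.random.default_rng(rng)

    # Legacy path (assumption: warn whenever it is taken).
    warnings.warn(
        "`random_state` semantics are deprecated; pass `rng` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    if random_state is None:
        # Legacy default: the global RandomState behind np.random.seed().
        legacy = np.random.mtrand._rand
    elif isinstance(random_state, numbers.Integral):
        legacy = np.random.RandomState(random_state)
    elif isinstance(random_state, np.random.RandomState):
        legacy = random_state
    elif isinstance(random_state, np.random.Generator):
        return random_state
    else:
        raise TypeError(f"Cannot use {random_state!r} to seed a generator.")
    # Reuse the legacy BitGenerator (private attribute) inside a Generator,
    # so the stream is still controlled by the legacy seed.
    return np.random.Generator(legacy._bit_generator)
```

In the enforcement stage, the body would shrink to returning `np.random.default_rng(rng)` after raising an error for any non-`None` `random_state`.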

@stefanv
Member Author

stefanv commented Apr 20, 2023

There's a deprecation strategy that can work to migrate from random_state to rng

To make sure I understand, this will change the return values of functions (rng=None, random_state=None) over time (i.e., violate the Hinsen principle), but we can choose how long that time period is.

@rkern

rkern commented Apr 20, 2023

Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy.

@rgommers
Contributor

@stefanv thanks for starting to summarize that long and complex discussion!

Specifically, behavior will change over a long period of time in the case where no seed is specified

Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact.

I guess the point here is:

  • the user may be seeding elsewhere with np.random.seed(a_number),
  • and not threading the state for that seed through to this API call,
  • and executing without multiprocessing or a similar parallel mechanism,
  • hence was relying implicitly on the global state controlled by seeding,
  • this global state no longer determines the exact random numbers drawn by the unseeded API call,
  • hence the numerical result returned by that API call changes.

And then there's the question whether these scenarios matter. It may impact exact reproducibility of some scientific result. However, that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware.

I'd suggest finding the most compelling scenario here, that makes it as easy as possible to say that that's not acceptable, and hence we must change from random_state to rng.
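
The scenario in the bullet points can be reproduced with a small sketch. The two functions are hypothetical stand-ins for a library call made without a seed: one draws from NumPy's global `RandomState` (the current `random_state=None` behavior), the other from a fresh `np.random.default_rng()`.

```python
import numpy as np


def legacy_api_call():
    # Stand-in for a library function called with random_state=None:
    # it draws from NumPy's global RandomState.
    return np.random.random_sample()


def generator_api_call():
    # Stand-in for the same function once None means default_rng():
    # a new generator seeded from OS entropy on every call.
    return np.random.default_rng().random()


np.random.seed(42)
legacy_first = legacy_api_call()
np.random.seed(42)
legacy_second = legacy_api_call()

np.random.seed(42)
new_first = generator_api_call()
np.random.seed(42)
new_second = generator_api_call()

# The global seed pins down the legacy path but not the Generator path.
print(legacy_first == legacy_second)
print(new_first == new_second)
```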

@rkern

rkern commented Apr 25, 2023

The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using scipy and sklearn components that use check_random_state()) that rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.
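
The "(small) rewriting" mentioned here might look like the sketch below: the program creates one `Generator` at the top and threads it through each call instead of calling `np.random.seed(seed)` once. The names `simulate` and `main` are hypothetical, and the `rng` keyword is assumed to follow the pattern discussed in this thread.

```python
import numpy as np


def simulate(n, rng=None):
    # default_rng passes an existing Generator through unchanged,
    # and creates a new one from None or an integer seed.
    rng = np.random.default_rng(rng)
    return rng.normal(size=n)


def main(seed):
    # One generator, created once and passed explicitly to every call.
    rng = np.random.default_rng(seed)
    first = simulate(3, rng=rng)
    second = simulate(3, rng=rng)
    return first, second


run_a = main(0)
run_b = main(0)
print(all(np.array_equal(x, y) for x, y in zip(run_a, run_b)))
```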

@stefanv
Member Author

stefanv commented Apr 25, 2023

Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Right, and my take was that this is a desired outcome from our perspective.

@stefanv
Member Author

stefanv commented Apr 25, 2023

And then there's the question whether these scenarios matter.

By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust.

@ilayn

ilayn commented Apr 25, 2023

I am willing to spend time on the rewriting. The test suite is seriously out-of-date in many places anyway. You can even smell the year just by reading the comments.

@stefanv
Member Author

stefanv commented Apr 25, 2023

I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward!

@rkern

rkern commented Apr 25, 2023

Right, and my take was that this is a desired outcome from our perspective.

Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what random_state=None does since stream reproducibility across versions is not something that most of the downstream libraries using the check_random_state() pattern guarantee. I was confirming that there was something beyond the stream reproducibility that is a concern.

@rgommers
Contributor

rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using multiprocessing. So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-running experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't. Right?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Why don't we emit a warning from numpy random.seed?

@rgommers
Contributor

Because it will create utter havoc in the many valid uses in test suites?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Isn't that what you want, eventually?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

@rkern

rkern commented Apr 28, 2023

So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-running experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't.

Yes. There are plenty of ML programs (in particular) that call np.random.seed(seed) (and random.seed(seed) and torch.manual_seed(seed), etc.) at the top because that's what they've been told to do (and encouraged to do by well-meaning frameworks) and call sklearn and scipy functions, and those scripts are deterministic (and mostly fine). That will silently change if we just change check_random_state(None) to do default_rng(None) without a deprecation switcheroo. I care less about the results they get from any one run (they'll be perfectly valid) so much as they will now show unexpected and hard-to-debug behavior as everyone chases down a bunch of red herrings.

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

One enormous hurdle at a time, please. 😉

@stefanv
Member Author

stefanv commented Apr 28, 2023

One enormous hurdle at a time, please. 😉

Fair enough :)

@rgommers
Contributor

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

There is no plan to do so. Deprecating random_state=None in scipy and (hopefully after that) scikit-learn is a very different thing from deprecating the legacy numpy.random functionality. There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

@rgommers
Contributor

and those scripts are deterministic (and mostly fine)

Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with random_state, I think the average reader will understand this well enough.

That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer rng over random_state.

@tupui
Member

tupui commented Apr 28, 2023

There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

I think that would not prevent us from having a user warning. Average users don't read docs and keep copy-pasting old code until "something" gets in their way. So until it's visible in their code that something is legacy, they will keep using it, I am afraid.

Also, reading NEP 19, to me it's really not clear that the global state would not change. The fate of RandomState is clear, but the rest of the global section even ends with:

This NEP does not propose that these requirements remain in perpetuity. After we have experience with the new PRNG subsystem, we can and should revisit these issues in future NEPs.

@stefanv
Member Author

stefanv commented May 2, 2023

There seems to be some vague consensus around the deprecation approach. I don't want to run things ahead, but at the same time scikit-image has to make a calculated guess of what to do for its forthcoming release. So, without holding anyone to the fire, I will propose that we make the seed (random_state) to rng transition there, and hope that it doesn't cause too much work in the future should the decision here go differently.

I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome.

Thanks!

stefanv added a commit to stefanv/scikit-image that referenced this pull request May 9, 2023
Use `rng` consistently, replacing `random_state` and `seed`.

See also scientific-python/specs#180
jarrodmillman pushed a commit to scikit-image/scikit-image that referenced this pull request May 10, 2023
* Unify pseudo-random seeding interface

Use `rng` consistently, replacing `random_state` and `seed`.

See also scientific-python/specs#180

* Fix seeding in examples
@stefanv
Member Author

stefanv commented May 10, 2023

skimage completed the transition for our 0.21 release (scikit-image/scikit-image#6922). Maybe premature, we'll see. But at least usage is consistent now.

@bsipocz
Member

bsipocz commented May 31, 2023

I wonder whether this converged enough to be merged as a draft, certainly has more content and usage than some of the other specs?

@stefanv
Member Author

stefanv commented May 31, 2023

Let me chat to the NetworkX people at their community call next week, and see what their thoughts are. While this is now implemented in scikit-image, it would be good to have at least multi-project consensus on this being a viable route forward.

@jorisvandenbossche
Member

Comment from the sideline (and as library author who recently added a first method to geopandas where we needed to decide which keyword name to use for this, without being constrained by existing methods):
From a user API perspective, I find the keyword name "rng" rather unclear / unintuitive. You already need to know it is short for "random number generator" (I assume), but I think many users won't know that. In contrast, something like "random_state" is much clearer from the name that it has to do with the random aspect of the function.

@stefanv
Member Author

stefanv commented Jun 1, 2023

As you point out, if you start from scratch, using random_state is an option. But many libraries already use random_state=None to refer to the old RandomState. They cannot easily keep using that keyword without breaking backward compatibility in an unexpected way.

So, the question is not so much finding an optimal argument name for one library, but coming to consensus over a consistent pattern that can be adopted across the ecosystem.

@jorisvandenbossche
Member

Yes, I fully understand that the deprecation process limits what can be used, but I hope someone can come up with a better name than "rng" that is not "random_state" (although I don't directly have an idea myself, except for spelling it out, which would make it rather long).

@rkern

rkern commented Jun 1, 2023

FWIW, I responded to a similar comment on the related scipy thread.

@tupui
Member

tupui commented Jun 1, 2023

I agree that ideally we would want to have a descriptive name for non experts and newcomers.

Has the ship really sailed on still using random_state but changing the semantics progressively? (There are a few options, like using an env variable.) Because at some point, whatever we do, it seems like we are going to introduce some churn.

Otherwise I would be +1 on trying to brainstorm a new name.

randomness, chaos_state, etc. I am sure we can come up with something 😃

@stefanv
Member Author

stefanv commented Jun 1, 2023

Environment variables are never an option: it has to be explicit in the code.

@rkern

rkern commented Jun 1, 2023

Yes, that ship is long gone over the horizon.

I think the other ship is gone, too. With default_rng() and all of the numpy documentation using rng = default_rng(), rng is here to stay as a term that everyone will become familiar with.

TBF, I wanted prng = default_prng() to emphasize the pseudo aspect of the pseudorandomness, but people wanted something even shorter than that.

@stefanv
Member Author

stefanv commented Jun 1, 2023

FWIW, I don't think rng will be hard to teach: users will read it once and get it, especially if we make it a common pattern across the ecosystem:

rng = np.random.default_rng(x)

Rng, the thing we want as input, is already enshrined in the NumPy function name too.

prng would have been slightly better from an educational point of view.

random_generator is a longer form that is quite intuitive and also captures the notion of a seed adequately.
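
As a small illustration of the pattern above, here is what `np.random.default_rng(x)` does for a few kinds of `x`. The seed value 2023 is arbitrary and the variable names are only for this sketch.

```python
import numpy as np

# The same integer seed gives two generators with identical streams.
rng_a = np.random.default_rng(2023)
rng_b = np.random.default_rng(2023)
same_stream = np.array_equal(
    rng_a.integers(0, 100, size=5),
    rng_b.integers(0, 100, size=5),
)

# An existing Generator is returned unaltered, so a function can call
# default_rng(rng) on whatever the caller passed in.
rng_c = np.random.default_rng(rng_a)
passed_through = rng_c is rng_a

# None draws fresh entropy from the operating system.
rng_fresh = np.random.default_rng()

print(same_stream, passed_through, type(rng_fresh).__name__)
```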

@stefanv
Member Author

stefanv commented Jun 1, 2023

Ah, sorry Robert, didn't see your message there.

@betatim
Contributor

betatim commented Jun 5, 2023

For what it is worth, I think rng is no more or less arbitrary (some might say random ..) than random_state. You gotta learn it once and then you know it. Consistency across the ecosystem will be super useful here (just like it is for random_state).

I think rng has a nice ... ring to it ;) (I couldn't resist)

Member

@tupui tupui left a comment


I went over the current document and strategy. I am +1.

I don't like the name 😅 but yes this is something users will eventually get with the warnings; and being written all over in NumPy's doc is helping us to sell it.

@asmeurer

The array API discussion about random number generation and the differences between the NumPy-style and JAX-style APIs may or may not be relevant here: data-apis/array-api#431. I don't know if it matters for seeding specifically, but I also know a lot of the same people on that discussion are already on this one.

@EwoutH

EwoutH commented Oct 15, 2023

Thanks for this effort! It might be useful to add a Motivation and/or Context section to this spec. I think it makes it a bit clearer why this spec was created and where it came from.

@jarrodmillman jarrodmillman changed the title Add draft SPEC7: seeding pseudo-random number generation Add SPEC7: Seeding pseudo-random number generation Mar 5, 2024

10 participants