
Support RandomState._bit_generator as input in check_random_state for users with numpy version >= 1.17.0 #20669

Open
zoj613 opened this issue Aug 3, 2021 · 15 comments


@zoj613
Contributor

zoj613 commented Aug 3, 2021

Describe the workflow you want to enable

Since numpy version 1.17.0, np.random.RandomState can accept a BitGenerator (the object stored in the ._bit_generator attribute) as input to its constructor. This can be a plus for those who use np.random.Generator in their code and want to use the same bit generator with scikit-learn's estimators. Currently this is not possible, see:

import numpy as np

from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=150, n_features=5, n_informative=5,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

rng = np.random.default_rng(12345)
tsne = TSNE()
# some piece of code here
# then later we use our own rng to set the seed of `tsne`
# notice `_bit_generator` used here, which is compatible with RandomState
tsne.set_params(random_state=rng._bit_generator)
tsne.fit_transform(X, y)

This leads to the error:

  File "/home/python3.6/site-packages/sklearn/manifold/_t_sne.py", line 932, in fit_transform
    embedding = self._fit(X)
  File "/home/python3.6/site-packages/sklearn/manifold/_t_sne.py", line 728, in _fit
    random_state = check_random_state(self.random_state)
  File "/home/python3.6/site-packages/sklearn/utils/validation.py", line 944, in check_random_state
    ' instance' % seed)
ValueError: <numpy.random.pcg64.PCG64 object at 0x7ffa3ab471b8> cannot be used to seed a numpy.random.RandomState instance

Describe your proposed solution

I propose we add a conditional in check_random_state that supports an instance of BitGenerator, see:

def check_random_state(seed):
    """Turn seed into a np.random.RandomState instance

    Parameters
    ----------
    seed : None, int or instance of RandomState
        If seed is None, return the RandomState singleton used by np.random.
        If seed is an int, return a new RandomState instance seeded with seed.
        If seed is already a RandomState instance, return it.
        Otherwise raise ValueError.
    """
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)

something like

supported_bitgenerators = {'PCG64', 'SFC64', 'Philox', ...}

def check_random_state(seed):
    ...
    if seed.__class__.__name__ in supported_bitgenerators:
        return np.random.RandomState(seed)  # should work if numpy>=1.17.0
    ...
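For concreteness, here is a rough sketch of how the whole function might look with such a conditional added. This assumes numpy >= 1.17.0, and the set of class names is only illustrative (it would not cover third-party bit generators):

import numbers

import numpy as np

# Illustrative only; third-party BitGenerator subclasses would not appear here.
supported_bitgenerators = {'MT19937', 'PCG64', 'Philox', 'SFC64'}


def check_random_state(seed):
    """Turn seed into a np.random.RandomState instance (sketch)."""
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    if seed.__class__.__name__ in supported_bitgenerators:
        # RandomState accepts a BitGenerator directly since numpy 1.17.0.
        return np.random.RandomState(seed)
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)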

Describe alternatives you've considered, if relevant

I know there is an issue regarding supporting the new numpy Generator interface but I feel this is slightly different since it does not attempt to replace RandomState.

@zoj613 zoj613 changed the title Support an instance/subclass of np.random.BitGenerator in check_random_state for users with numpy version >= 1.17.0 Support RandomState._bit_generator as input in check_random_state for users with numpy version >= 1.17.0 Aug 3, 2021
@NicolasHug
Member

Thanks for the proposal @zoj613 ,

tsne.set_params(random_state=rng._bit_generator)

The _bit_generator is a private attribute. I'm not sure users should be using it directly, and even less sure that we should encourage users to do so. A cleaner solution would be to fully support Generator objects #16988, but it will involve some work.

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

@rkern What is your take on the fragility of the _bit_generator attribute? Is it going to change anytime in the near future? Are developers supposed to use it in 3rd party libraries to wrap Generator objects that can be passed to sklearn estimators as a RandomState instance? I wrote a wrapper so I can use it with the current version of sklearn here: https://github.com/zoj613/pyloras/blob/52661a530493754969ddb008388dc585ac833f0f/pyloras/_common.py#L4-L19

However, based on @NicolasHug's comment, I'm not sure if this is the best option.

@rkern

rkern commented Aug 4, 2021

Note that the case in the code snippet would actually be rng.bit_generator; rng is a Generator, and Generator exposes bit_generator as a property alias. So for this case, turning a Generator into a RandomState for consumption by otherwise unmodified sklearn code that is expecting a RandomState, you can just access bit_generator. I would recommend that sklearn's check_random_state() accept either a BitGenerator or a Generator (doing the attribute access itself in the latter case).
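For illustration, a small hypothetical helper doing exactly that coercion (the name to_random_state is made up, and np.random.BitGenerator is only exposed at the top level on recent numpy versions):

import numpy as np


def to_random_state(rng):
    """Hypothetical helper: coerce a Generator or BitGenerator to a RandomState."""
    if isinstance(rng, np.random.Generator):
        # A Generator exposes its BitGenerator via the public bit_generator property.
        return np.random.RandomState(rng.bit_generator)
    if isinstance(rng, np.random.BitGenerator):
        return np.random.RandomState(rng)
    raise TypeError('expected a Generator or BitGenerator, got %r' % rng)


# e.g. to_random_state(np.random.default_rng(0)).standard_normal(3)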

Now for the opposite case, turning RandomState into a Generator: RandomState._bit_generator won't go away, per se. We'll probably promote it to bit_generator with a property like we do in Generator (without removing _bit_generator). In order to enable the transition to Generator, we definitely do want people to be able to take a RandomState and wrap its BitGenerator with a Generator. We'll probably make default_rng() just accept a RandomState and do the _bit_generator access ourselves, but for you to support current versions of numpy, you can safely use RandomState._bit_generator. It will remain cdef public at least until the first release with bit_generator becomes the oldest-supported-numpy. I do recommend supporting this with one of these utility functions so users can just pass in the RandomState and not think about the details.
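As a sketch, such a utility might look like this (the name as_generator is made up for illustration; the _bit_generator access is the current-numpy fallback mentioned above):

import numpy as np


def as_generator(random_state):
    """Hypothetical helper: wrap a RandomState's BitGenerator in a Generator."""
    if isinstance(random_state, np.random.Generator):
        return random_state
    # Prefer a public bit_generator property if/when RandomState grows one;
    # fall back to the private attribute on current numpy versions.
    bitgen = getattr(random_state, 'bit_generator', random_state._bit_generator)
    # The returned Generator shares the same underlying bit stream.
    return np.random.Generator(bitgen)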

For detecting BitGenerators, I would recommend trying to import numpy.random.bit_generator to get BitGenerator and using isinstance() instead of checking the name of the type. We want to support other people creating their own BitGenerators, so you will never have a complete list of names. If that module doesn't exist, then you can be sure the value you were given isn't a BitGenerator!
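Roughly, such a check could look like this (is_bit_generator is just an illustrative name):

def is_bit_generator(value):
    """Return True if value is a BitGenerator instance (rough sketch)."""
    try:
        from numpy.random.bit_generator import BitGenerator
    except ImportError:
        # The installed numpy predates BitGenerators entirely.
        return False
    return isinstance(value, BitGenerator)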

@rkern

rkern commented Aug 4, 2021

If that module doesn't exist, then you can be sure the value you were given isn't a BitGenerator!

Ugh, except for numpy 1.18.

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

Thank you @rkern for the detailed explanation, I was not aware that _bit_generator has a public alias. @NicolasHug, based on Robert's recommendation it sounds to me like supporting numpy's new random interface is much more straightforward than we anticipated. I am happy to help undertake this task.

@NicolasHug
Member

based on Robert's recommendation it sounds to me like supporting numpy's new random interface is much more straightforward that we anticipated

turning a Generator into a RandomState for consumption by otherwise unmodified sklearn code that is expecting a RandomState, you can just access bit_generator

I might be missing something but I don't see how the bit_generator exposes any of the RandomState methods

In [11]: dir(np.random.default_rng(0).bit_generator)
Out[11]:
['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_cython__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__setstate_cython__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_benchmark',
 '_cffi',
 '_ctypes',
 '_seed_seq',
 'advance',
 'capsule',
 'cffi',
 'ctypes',
 'jumped',
 'lock',
 'random_raw',
 'state']

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

The proposal is not to replace RandomState, but to allow it to take the bit_generator attribute as an argument in its constructor, so that when users pass in a Generator object, check_random_state can use its bit_generator attribute to convert the Generator into a RandomState to be used by the estimators. This does not require changing the behavior of the current random number generator.

See the snippet below @NicolasHug :

In [1]: import numpy as np

In [2]: new_rng = np.random.default_rng(0)

In [3]: old_rng = np.random.RandomState(new_rng.bit_generator)

In [4]: dir(old_rng)
Out[4]: 
['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_bit_generator',
 '_poisson_lam_max',
 'beta',
 'binomial',
 'bytes',
 'chisquare',
 'choice',
 'dirichlet',
 'exponential',
 'f',
 'gamma',
 'geometric',
 'get_state',
 'gumbel',
 'hypergeometric',
 'laplace',
 'logistic',
 'lognormal',
 'logseries',
 'multinomial',
 'multivariate_normal',
 'negative_binomial',
 'noncentral_chisquare',
 'noncentral_f',
 'normal',
 'pareto',
 'permutation',
 'poisson',
 'power',
 'rand',
 'randint',
 'randn',
 'random',
 'random_integers',
 'random_sample',
 'rayleigh',
 'seed',
 'set_state',
 'shuffle',
 'standard_cauchy',
 'standard_exponential',
 'standard_gamma',
 'standard_normal',
 'standard_t',
 'tomaxint',
 'triangular',
 'uniform',
 'vonmises',
 'wald',
 'weibull',
 'zipf']

The code above uses the Generator's underlying bit generator inside the RandomState instance while still exposing the old functionality used in scikit-learn.

@NicolasHug
Member

oh I see, thanks. That's an interesting development.

One good thing is that we shouldn't expect a breaking change (i.e. a different RNG) if one day we choose to natively support Generators:

In [39]: np.random.RandomState(np.random.default_rng(0).bit_generator).random(10)
Out[39]:
array([0.63696169, 0.26978671, 0.04097352, 0.01652764, 0.81327024,
       0.91275558, 0.60663578, 0.72949656, 0.54362499, 0.93507242])

In [40]: np.random.default_rng(0).random(10)
Out[40]:
array([0.63696169, 0.26978671, 0.04097352, 0.01652764, 0.81327024,
       0.91275558, 0.60663578, 0.72949656, 0.54362499, 0.93507242])

Is this correct @rkern ?

Also, does that mean that a RandomState(default_rng(0).bit_generator) object has the same "nice RNG properties" as a Generator object mentioned in #16988 (comment) ? (CC @lorentzenchr )?

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

As far as I am aware, numpy has used the PCG64 bit generator as Generator's default since 1.17.0. There have been talks about replacing it with its stronger variant, PCG64DXSM, in the future, but I'm not sure in which numpy version that will take effect.

By nice properties, I'm assuming you're referring to the new methods and parameters added in the Generator class (e.g. being able to select the covariance factorization algorithm in multivariate_normal via the method parameter)? If so, then no. Those new features are a result of the Generator interface and have nothing to do with the underlying bit_generator used by both Generator and RandomState.

One nice property this proposal does bring is that random number generation in RandomState can become faster when passed an efficient bit generator like PCG64 (Generator's default) instead of RandomState's slower default, the Mersenne Twister (np.random.MT19937).
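A quick, unscientific way to check the speed difference on one's own machine (exact timings will vary):

import timeit

import numpy as np

n = 1_000_000
mt_state = np.random.RandomState()                    # default: MT19937
pcg_state = np.random.RandomState(np.random.PCG64())  # PCG64 bit generator

mt = timeit.timeit(lambda: mt_state.standard_normal(n), number=20)
pcg = timeit.timeit(lambda: pcg_state.standard_normal(n), number=20)
print(f"MT19937: {mt:.3f}s  PCG64: {pcg:.3f}s")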

@NicolasHug
Member

By nice properties, Im assuming you're referring to the new methods and parameters added in the Generator class

I'm referring to #16988 (comment):

But those have better properties of randomness and are faster in several instances

@rkern

rkern commented Aug 4, 2021

  1. I'm not sure I understand what you're asking. Yes, wrapping a passed BitGenerator with RandomState now will help smooth things towards using Generator natively in sklearn, but there are differences (improvements) in some of the methods (say, standard_normal()). And per NEP 19, there will be other changes that come, including the default BitGenerator chosen by default_rng().
  2. Half and half. The BitGenerator used by default_rng() is statistically a little better and faster than the Mersenne Twister beneath RandomState (by default), but that comment also refers to the improved method implementations in Generator that are not available to RandomState.

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

By nice properties, Im assuming you're referring to the new methods and parameters added in the Generator class

I'm referring to #16988 (comment):

But those have better properties of randomness and are faster in several instances

Yes, that too. The other bit generators are faster and of higher quality than MT19937, which is RandomState's default bit generator.

@zoj613 zoj613 changed the title Support RandomState._bit_generator as input in check_random_state for users with numpy version >= 1.17.0 Support RandomState.bit_generator as input in check_random_state for users with numpy version >= 1.17.0 Aug 4, 2021
@NicolasHug
Member

Thanks for the replies, now I see that the underlying implementation is not always the same:

In [18]: np.random.RandomState(np.random.default_rng(0).bit_generator).standard_normal(10)
Out[18]:
array([-1.35785352,  0.80783308,  1.49674359,  0.6954632 ,  0.72872164,
        0.07306939, -0.78215698,  0.55381447,  0.12696293,  1.11212987])

In [19]: np.random.default_rng(0).standard_normal(10)
Out[19]:
array([ 0.12573022, -0.13210486,  0.64042265,  0.10490012, -0.53566937,
        0.36159505,  1.30400005,  0.94708096, -0.70373524, -1.26542147])

and the 2 methods don't even have the same API (one has extra parameters). So we should expect RNG changes if we ever choose to natively support Generators.

@rkern

rkern commented Aug 4, 2021

Yes, NEP 19 lays out the policy and the reasons for it. It was well-deliberated.

@zoj613
Contributor Author

zoj613 commented Aug 4, 2021

Thanks for the replies, now I see that the underlying implementation is not always the same:

In [18]: np.random.RandomState(np.random.default_rng(0).bit_generator).standard_normal(10)
Out[18]:
array([-1.35785352,  0.80783308,  1.49674359,  0.6954632 ,  0.72872164,
        0.07306939, -0.78215698,  0.55381447,  0.12696293,  1.11212987])

In [19]: np.random.default_rng(0).standard_normal(10)
Out[19]:
array([ 0.12573022, -0.13210486,  0.64042265,  0.10490012, -0.53566937,
        0.36159505,  1.30400005,  0.94708096, -0.70373524, -1.26542147])

and the 2 methods don't even have the same API (one has extra parameters). So we should expect RNG changes if we ever choose to natively support Generators.

Correct, the Generator implementations of standard_normal, exponential and gamma use the faster ziggurat algorithm instead of the legacy algorithms used in RandomState (e.g. the Box-Muller/polar method for normals), so the random stream is going to be different.

EDIT: oops, didn't realize it's already answered above.
