Description

I was expecting the `sample` methods to allow the user to pass a seed for the random number generators.

What I Did

Instead, we have to call `numpy.random.seed(seed_value)` and `random.setstate(state_tuple)` outside the function call. This is bad practice from a software engineering standpoint, and it is very error prone because it affects the global state. It can also negatively impact experiment reproducibility and debugging.
Recommendations
Currently, in order to get the same sample from the sampling methods, we need to:

- invoke `np.random.seed(seed_value)`
- invoke `random.setstate(random_state_tuple)`

outside of the sampling function being invoked (i.e. `sample`). This results in what software engineering calls a leaky abstraction. There are (at least) two approaches to solve this seed-control issue:
- Create a parameter named `seed` or `random_state` in the `sample` methods.
- Create a parameter named `seed` or `random_state` in the constructor of the classes offering the `sample` method, IF the distribution's `fit` method requires some sort of stochastic process.
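The second approach could look something like the sketch below. `Distribution` and its `normal` draw are purely illustrative, not the library's actual API; the point is that every stochastic call goes through an instance-owned generator instead of the global state:

```python
import numpy as np

class Distribution:
    """Illustrative sketch of approach 2: seed control via the constructor."""

    def __init__(self, random_state=None):
        # RandomState accepts None (OS entropy) or an int seed, so the
        # instance owns its generator and never touches np.random's globals.
        self.random_state = np.random.RandomState(random_state)

    def sample(self, n_samples=1):
        # All randomness is drawn from the instance's own generator.
        return self.random_state.normal(size=n_samples)
```

Two instances built with the same seed then produce identical samples, with no `np.random.seed` call anywhere.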
In scikit-learn and other popular Python machine learning tools, the convention is the following:

When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the `random_state` value. This value can be one of three things: `None`, an integer, or a `numpy.random.RandomState` instance. Whatever the value, it is checked and processed by `sklearn.utils.check_random_state`, which outputs a `numpy.random.RandomState` instance. Note that `sklearn.utils.check_random_state` is invoked at the beginning of the `fit` method (see this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set `random_state` to an integer, every `fit` call has to be deterministic in its behavior and output.
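The documented behavior of `check_random_state` can be reproduced in a few lines of plain numpy (a minimal re-implementation for illustration; use the sklearn original in practice):

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Resolve None / int / RandomState into a RandomState instance,
    mirroring sklearn.utils.check_random_state's documented behavior."""
    if seed is None or seed is np.random.mtrand._rand:
        # None means "use numpy's global RandomState singleton".
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        # An integer seeds a fresh, independent generator.
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # An existing generator is passed through unchanged.
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")
```

Whatever the caller passes, downstream code only ever sees a `RandomState` object, which is what keeps every `fit` deterministic for an integer seed.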
If, besides the `fit` method, there is another method that depends on stochastic processes (e.g. the `sample` method of `sklearn.neighbors.KernelDensity`), we are also allowed to control the seed through a `random_state` parameter.
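For example, `KernelDensity.sample` takes `random_state` directly, so repeated calls with the same value are reproducible without any global seeding:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Fit a KDE on some fixed data (seeded locally, not globally).
X = np.random.RandomState(0).normal(size=(100, 1))
kde = KernelDensity(bandwidth=0.5).fit(X)

# The same random_state yields identical draws; np.random's
# global state is never touched.
s1 = kde.sample(n_samples=5, random_state=42)
s2 = kde.sample(n_samples=5, random_state=42)
assert np.allclose(s1, s2)
```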
In other, lower-level APIs such as scipy, the seed must be an integer or `None`.
I also advise against using the `random` and `numpy.random` modules at the same time, because it makes seed management harder.
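To make the cost of mixing the two modules concrete: reproducing a run then requires seeding two independent global generators, whereas a single `RandomState` instance covers the same needs with one object and one seed (a sketch, not a prescription for this codebase):

```python
import random
import numpy as np

# Mixed approach: two separate global generators, both of which
# must be seeded (and kept in sync) for a run to be reproducible.
random.seed(0)
np.random.seed(0)

# Single-generator approach: one local object, one seed.
rng = np.random.RandomState(0)
a = rng.uniform()        # replaces random.random() / np.random.uniform()
b = rng.randint(0, 10)   # replaces random.randrange(10)
```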
I agree that being able to set the `random_state` to ensure reproducibility of your results is indeed a useful feature to have.
Of the approaches you propose, I'll definitely pick the second one: having `random_state` as a constructor argument on the classes that use some kind of stochastic process.
I would also take your suggestion of using only one of `random` and `numpy.random`. Would you be open to making the changes yourself?
- Change the `__new__` method of the classes `bivariate.base.Bivariate` and `multivariate.tree.Tree` so that they accept arbitrary positional and keyword arguments (`*args`, `**kwargs`).
- Add the argument `random_state=None` to `__init__` for both classes, setting its value to the attribute of the same name.
- In their subclasses, replace all calls to functions that depend on `np.random` with calls to `random_state_call`.
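A helper like `random_state_call` might be shaped as follows. This is a hypothetical sketch of what the name suggests; the actual helper's signature and location in the codebase may differ:

```python
import numpy as np

def random_state_call(random_state, method_name, *args, **kwargs):
    """Hypothetical helper: resolve `random_state` into a RandomState
    and dispatch the named generator method on it."""
    if random_state is None:
        random_state = np.random.RandomState()
    elif isinstance(random_state, int):
        random_state = np.random.RandomState(random_state)
    # Call e.g. rng.uniform(...) / rng.normal(...) on the local generator,
    # so subclasses never touch the global np.random module.
    return getattr(random_state, method_name)(*args, **kwargs)
```

With this shape, a subclass would replace `np.random.uniform(0, 1)` by `random_state_call(self.random_state, 'uniform', 0, 1)`.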
EDIT: Current fix available at #62