Description

I was expecting the `sample` methods to allow the user to pass a seed for the random number generators.

What I Did

Instead, we have to call `numpy.random.seed(seed_value)` and `random.setstate(state_tuple)` outside the function call. This is bad practice from a software engineering standpoint, and it is very error prone because it affects the global state. It can also negatively impact experiment reproducibility and debugging.
Recommendations
Currently, in order to get the same sample from the sampling methods, we need to:

- invoke `np.random.seed(seed_value)`
- invoke `random.setstate(random_state_tuple)`

outside of the sampling function being invoked (i.e. `sample`). This results in what software engineering calls a leaky abstraction. There are (at least) two approaches to solve this seed-control issue:
- Create a parameter named `seed` or `random_state` in the `sample` methods.
- Create a parameter named `seed` or `random_state` in the constructor of the classes offering the `sample` method, IF the distribution's `fit` method requires some sort of stochastic process.
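The second approach could look something like the sketch below. `Distribution` and its `normal` draw are purely illustrative, not the library's actual API; the point is that every stochastic call goes through an instance-owned generator instead of the global state:

```python
import numpy as np

class Distribution:
    """Illustrative sketch of approach 2: seed control via the constructor."""

    def __init__(self, random_state=None):
        # RandomState accepts None (OS entropy) or an int seed, so the
        # instance owns its generator and never touches np.random's globals.
        self.random_state = np.random.RandomState(random_state)

    def sample(self, n_samples=1):
        # All randomness is drawn from the instance's own generator.
        return self.random_state.normal(size=n_samples)
```

Two instances built with the same seed then produce identical samples, with no `np.random.seed` call anywhere.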
In scikit-learn and other popular Python machine learning tools, the convention is the following:

When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the `random_state` value. This value can be one of three things: `None`, an integer, or a `numpy.random.RandomState` instance. Whatever the value, it is checked and processed by `sklearn.utils.check_random_state`, which outputs a `numpy.random.RandomState` instance. Note that `sklearn.utils.check_random_state` is invoked at the beginning of the `fit` method (see this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set `random_state` to an integer, every `fit` call has to be deterministic in its behavior and output.
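The documented behavior of `check_random_state` can be reproduced in a few lines of plain numpy (a minimal re-implementation for illustration; use the sklearn original in practice):

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Resolve None / int / RandomState into a RandomState instance,
    mirroring sklearn.utils.check_random_state's documented behavior."""
    if seed is None or seed is np.random.mtrand._rand:
        # None means "use numpy's global RandomState singleton".
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        # An integer seeds a fresh, independent generator.
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # An existing generator is passed through unchanged.
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")
```

Whatever the caller passes, downstream code only ever sees a `RandomState` object, which is what keeps every `fit` deterministic for an integer seed.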
If, besides the `fit` method, there is another method that depends on stochastic processes (e.g. the `sample` method of `sklearn.neighbors.KernelDensity`), we are also allowed to control the seed through a `random_state` parameter.
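For example, `KernelDensity.sample` takes `random_state` directly, so repeated calls with the same value are reproducible without any global seeding:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Fit a KDE on some fixed data (seeded locally, not globally).
X = np.random.RandomState(0).normal(size=(100, 1))
kde = KernelDensity(bandwidth=0.5).fit(X)

# The same random_state yields identical draws; np.random's
# global state is never touched.
s1 = kde.sample(n_samples=5, random_state=42)
s2 = kde.sample(n_samples=5, random_state=42)
assert np.allclose(s1, s2)
```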
In other, lower-level APIs such as scipy, the seed must be an integer or `None`.
I also advise against using the `random` and `numpy.random` modules at the same time, because it makes seed management harder.
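To make the cost of mixing the two modules concrete: reproducing a run then requires seeding two independent global generators, whereas a single `RandomState` instance covers the same needs with one object and one seed (a sketch, not a prescription for this codebase):

```python
import random
import numpy as np

# Mixed approach: two separate global generators, both of which
# must be seeded (and kept in sync) for a run to be reproducible.
random.seed(0)
np.random.seed(0)

# Single-generator approach: one local object, one seed.
rng = np.random.RandomState(0)
a = rng.uniform()        # replaces random.random() / np.random.uniform()
b = rng.randint(0, 10)   # replaces random.randrange(10)
```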
I agree that being able to set the `random_state` to ensure reproducibility of your results is indeed a useful feature to have.
Of the approaches you propose, I'll definitely pick the second one: having `random_state` as a constructor argument on the classes that use some kind of stochastic process.
I would also take your suggestion of using only one of `random` and `numpy.random`. Would you be open to making the changes yourself?
- Change the `__new__` method of the classes `bivariate.base.Bivariate` and `multivariate.tree.Tree` so that they accept arbitrary positional and keyword arguments (`*args`, `**kwargs`).
- Add the argument `random_state=None` to `__init__` for both classes, setting its value to the attribute of the same name.
- In their subclasses, replace all calls to functions that depend on `np.random` with calls to `random_state_call`.
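A helper like `random_state_call` might be shaped as follows. This is a hypothetical sketch of what the name suggests; the actual helper's signature and location in the codebase may differ:

```python
import numpy as np

def random_state_call(random_state, method_name, *args, **kwargs):
    """Hypothetical helper: resolve `random_state` into a RandomState
    and dispatch the named generator method on it."""
    if random_state is None:
        random_state = np.random.RandomState()
    elif isinstance(random_state, int):
        random_state = np.random.RandomState(random_state)
    # Call e.g. rng.uniform(...) / rng.normal(...) on the local generator,
    # so subclasses never touch the global np.random module.
    return getattr(random_state, method_name)(*args, **kwargs)
```

With this shape, a subclass would replace `np.random.uniform(0, 1)` by `random_state_call(self.random_state, 'uniform', 0, 1)`.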
EDIT: Current fix available at #62