Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seed for the random numbers generators #63

Closed
echo66 opened this issue Oct 11, 2018 · 2 comments
Closed

seed for the random numbers generators #63

echo66 opened this issue Oct 11, 2018 · 2 comments
Assignees
Labels
internal The issue doesn't change the API or functionality
Milestone

Comments

@echo66
Copy link

echo66 commented Oct 11, 2018

  • Copulas version: 0.1
  • Python version: 3.6.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

I was expecting the sample methods to allow the user to pass a seed for the random numbers generators.

What I Did

Instead, we have to use, outside the function call, numpy.random.seed(seed_value) and random.setstate(seed_value). This is a bad practise from a software engineering standpoint and it is very error prone because it affects the global state. Also, this can negatively impact experiment reproducibility and the debugging stages.

Recommendations

Currently, in order to get the same sample from the sampling methods, we need to

  1. invoke np.random.seed(seed_value)
  2. invoke random.setstate(random_state_tuple)

outside of the sampling function being invoked (i.e. sample). This results in what is called, in software engineering, a leaky abstraction. In order to solve this issue with seed control, there are (at least) two approaches:

  1. Create a parameter, in the sample methods, named seed or random_state.
  2. Create a parameter in the constructor of classes offering the sample method named seed or random_state IF the distribution fit method requires some sort of stochastic process.

In scikit-learn and other popular python machine learning tools, what happens is the following

  1. When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the random_state value. This value can be one of 3 things: None, an integer or an instance of numpy.random.RandomState instance. No matter what the value is, it will be checked and processed by sklearn.utils.check_random_state, which will output a numpy.random.RandomState instance. Note that the sklearn.utils.check_random_state method will be invoked at the beginning of the fit method (check this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set random_state as an integer, every fit has to be deterministic in its behavior and output.

  2. If, besides the fit method, there is another method that depends on stochastic processes (e.g. the sample method in sklearn.neighbors.KernelDensity), we are allowed to control the seed through the random_state parameter.

  3. In other more low level APIs like scipy, the seed must be an integer or None.

I also advise against using both the random and numpy.random modules at the same time because it makes the seed management harder.

EDIT: Current fix available at #62

@ManuelAlvarezC
Copy link
Contributor

Hi @echo66

I agree that being able to set the random_state to ensure reproducibility of your results is indeed a useful feature to have.

From the approaches you propose, I'll definetely pick the second one: having random_state as an constructor argument on the classes that use some kind of stochastic process.

I would also take your suggestion of using only one of random and numpy.random. Will you be open to do the changes yourself?

@ManuelAlvarezC
Copy link
Contributor

The required changes should be the following:

  1. Replace all the ocurrences of random with np.random
  2. Create a method random_state_call on copulas.__init__, like this:
def random_state_call(random_state, function, *args, **kwargs):
    if random_state is None:
        return function(*args, **kwargs)

    else:
        original_state = np.random.get_state()
        np.random.seed(random_state)
        
        result = function(*args, **kwargs)
        
        np.random.set_state(original_state)
        return result
  1. Change classes bivariate.base.Bivariate and multivariate.tree.Tree __new__ method in order for them to accept arbitrary positional and keyword arguments *args, **kwargs

  2. Add the argument random_state=None on __init__ for both classes, that sets its value to the attribute of the same name.

  3. Replace on their subclasses all calls to functions that depend on their np.random with calls to random_state_call, that is:

np.random.uniform(0, 1, n_samples)

random_state_call(self.random_state, np.random.uniform, 0, 1, n_samples)
  1. Repeat the steps 4 & 5 for univariate.base.Univariate, multivariate.base.Multivariate and multivariate.vine.VineCopulas.

  2. Create unittests for random_state_call and for the sample methods, in which, after a random state is set, the sample values are exactly the same.

  3. Check everything is according to our Contributing Guide.

If you plan on updating your PR with this changes, please let us know and we will assign this issue to you.

@ManuelAlvarezC ManuelAlvarezC added this to the 0.2.1 milestone Nov 19, 2018
@ManuelAlvarezC ManuelAlvarezC self-assigned this Dec 12, 2018
@ManuelAlvarezC ManuelAlvarezC added the internal The issue doesn't change the API or functionality label Dec 12, 2018
@ManuelAlvarezC ManuelAlvarezC removed this from the 0.2.1 milestone Jan 4, 2019
@ManuelAlvarezC ManuelAlvarezC added this to the 0.2.2 milestone Feb 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal The issue doesn't change the API or functionality
Projects
None yet
Development

No branches or pull requests

2 participants