[ENH] handling of random_state in clone #279

Open
fkiraly opened this issue Jan 31, 2024 · 5 comments
Labels
API design API design & software architecture

Comments

@fkiraly
Contributor

fkiraly commented Jan 31, 2024

Opening an issue to discuss API design around a requirement where independent, yet random-state-fixed copies of an estimator need to be obtained.

An example would be the bootstrap clones discussed here: sktime/sktime#5823 - these should be statistically independent and pseudo-random.

Currently, clone copies the random_seed 1:1, so:

  • if random_seed=None, the copies are independent, but not pseudo-random fixed (each run gives different values)
  • if random_seed is set, the copies are pseudo-random fixed, but value-identical rather than statistically independent

Neither meets the requirement above, which needs copies that are both pseudo-random fixed and statistically independent (not value-identical).
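
To make the two cases concrete, here is a minimal sketch using only the standard library random module. It does not call clone itself; the helper draw is a hypothetical stand-in for "an estimator copy that samples with a given seed".

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Seed set: two "copies" sharing the seed produce value-identical draws,
# reproducible across runs, but not independent of each other.
fixed_a, fixed_b = draw(42), draw(42)

# Seed None: each generator is seeded from OS entropy (or the clock),
# so the copies differ from each other, but also differ on every run.
unfixed_a, unfixed_b = draw(None), draw(None)

print(fixed_a == fixed_b)
print(unfixed_a == unfixed_b)
```

The requirement in this issue sits between the two: draws that differ between copies, yet are the same on every run.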

In light of the rework of random_seed functionality (see #268), it is worth discussing what this should look like from the API perspective.

A key problem arises if multiple clones are needed: either their number needs to be known in advance, or the clones need to be sampled in a chain, so that dependent seeds give rise to pseudo-random independent copies.

Further, we cannot change the default behaviour of clone and its current parameters, as it is an interface point of high importance.

Options I can think of:

  • something like clone(deep=True, random_seed="exact_copy", n_clones=None)
  • a new method clone_random(deep=True, n_clones=1)
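
A rough sketch of how the second option could behave, purely for discussion. ToyEstimator and this free function clone_random are hypothetical stand-ins, not existing API, and drawing child seeds in a chain from a parent generator is one pragmatic choice that does not by itself guarantee statistical independence.

```python
import copy
import random

class ToyEstimator:
    """Hypothetical stand-in for an estimator with a random_state parameter."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, n=3):
        rng = random.Random(self.random_state)
        return [rng.random() for _ in range(n)]

def clone_random(estimator, n_clones=1):
    """Return n_clones deep copies with seeds drawn in a chain from the parent's seed."""
    # The parent's seed fixes the whole chain, so repeated calls give the same clones.
    seed_rng = random.Random(estimator.random_state)
    clones = []
    for _ in range(n_clones):
        new = copy.deepcopy(estimator)
        new.random_state = seed_rng.getrandbits(32)  # next seed in the chain
        clones.append(new)
    return clones

clones = clone_random(ToyEstimator(random_state=42), n_clones=3)
print([c.random_state for c in clones])
```

With a fixed parent seed, the clones get different seeds from one another but the same seeds on every run. If the parent's random_state is None, the chain itself is unseeded and the clones are not reproducible.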

FYI @ericjb, @jmwhyte, @tpvasconcelos - since we all discussed either clone or random_seed recently.

@fkiraly fkiraly added the API design API design & software architecture label Jan 31, 2024
@ericjb

ericjb commented Jan 31, 2024

In case you are getting deep into the weeds and need to consider independent random numbers in a distributed computing environment, you may need to think about the PRNG itself and not just the seed. FYI https://www.thesalmons.org/john/random123/papers/random123sc11.pdf

@fkiraly
Contributor Author

fkiraly commented Jan 31, 2024

> deep into the weeds

Ouch, not that deep into the weeds. I think we need to deal with the case of single location/env only, and leave pseudo-random seed handling for distributed environments to backends like joblib.

@fkiraly
Contributor Author

fkiraly commented Feb 20, 2024

@ericjb, I have come to realize that we will indeed probably need a PRNG that guarantees pseudo-random independence within a tree-like hierarchy. As such, the linked paper is exactly the kind I was looking for.

@fkiraly
Contributor Author

fkiraly commented Feb 20, 2024

Pinging @johnsalmon, @moraesmark, @pbelevich regarding the paper - it would be great if there were a tool, possibly with Python bindings, which, for a tree-like structure of sampler objects, can generate independent pseudo-random seeds such that all samplers end up (mutually) independent pseudo-random, if any node in the tree does node.set_random_seed(children, node.random_seed), assuming its own seed has been set by its parent.

(the scope of this package is, in principle, all of ML pipelines)

What is unfortunate is that we do not know the size of the tree in advance, so a node doesn't know the number of its children, nor can it communicate with its parents. Otherwise, the solution would be "fairly trivial" by doing the following at the root: 1. compute the number of nodes, 2. run a PRNG sequence generator, 3. distribute the seeds, for any enumeration, across the tree
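
For illustration, the "fairly trivial" variant with a known tree could look roughly like the sketch below. The dict-of-children tree representation and the function names are assumptions made up for this example, and it deliberately does not solve the actual problem, where the tree size is unknown.

```python
import random

def enumerate_nodes(tree, root):
    """Step 1: list all nodes depth-first; `tree` maps a node to its children."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # reversed() so that children are visited in their listed order
        stack.extend(reversed(tree.get(node, [])))
    return order

def distribute_seeds(tree, root, root_seed):
    """Steps 2 and 3: generate one seed per node and hand them out."""
    nodes = enumerate_nodes(tree, root)
    rng = random.Random(root_seed)
    # sampling without replacement guarantees pairwise distinct seeds
    seeds = rng.sample(range(2**32), len(nodes))
    return dict(zip(nodes, seeds))

# hypothetical pipeline: a forecaster with a bootstrap component, plus a scaler
tree = {"pipeline": ["scaler", "forecaster"], "forecaster": ["bootstrap"]}
seeds = distribute_seeds(tree, "pipeline", root_seed=0)
print(seeds)
```

The assignment depends on the enumeration order, so adding or removing a node changes the seeds of every node enumerated after it.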

@fkiraly
Contributor Author

fkiraly commented Mar 1, 2024

Getting a bit deeper into the woods, one option would be to convolve each call to a dependent random seed with:

  • a hash of the python file it was called from
  • the line number it was called from

That would ensure uniqueness, pseudo-randomness, and pseudo-independence, as long as no line of code contains more than one call, and no two files from which the dependent seed generator is called are identical (e.g., something silly like near-empty __init__ files).

For reference and potential use in that, here is random code from stackoverflow that produces the line a function was called from:

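from inspect import currentframe

def get_linenumber():
    """Return the line number from which this function was called."""
    cf = currentframe()
    # f_back is the caller's frame; f_lineno is the line executing there
    return cf.f_back.f_lineno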
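
Building on that, a possible sketch of such a dependent seed generator. The name dependent_seed is hypothetical, "convolve" is implemented here simply as a SHA-256 hash over the three ingredients, and the fallback to the file name when the file cannot be read is my own assumption. It relies on CPython frame introspection.

```python
import hashlib
from inspect import currentframe

def dependent_seed(parent_seed):
    """Derive a 32-bit seed from parent_seed, the caller's file and the caller's line."""
    caller = currentframe().f_back
    filename = caller.f_code.co_filename
    try:
        with open(filename, "rb") as f:
            file_bytes = f.read()
    except OSError:
        # e.g. interactive sessions or exec'd code: fall back to the file name
        file_bytes = filename.encode()
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    payload = f"{parent_seed}:{file_hash}:{caller.f_lineno}".encode()
    # the first 4 bytes of the digest give an integer in [0, 2**32)
    return int.from_bytes(hashlib.sha256(payload).digest()[:4], "big")

seed_a = dependent_seed(42)  # call site 1
seed_b = dependent_seed(42)  # call site 2: different line, so a different seed
print(seed_a, seed_b)
```

Note that a call site inside a loop yields the same seed on every iteration, which is the same limitation as having more than one call on a line.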

2 participants