[ENH] handling of `random_state` in `clone` #279

fkiraly · 2024-01-31T14:55:17Z

Opening an issue to discuss API design around a requirement where independent, yet random-state-fixed copies of an estimator need to be obtained.

An example would be the bootstrap clones discussed here: sktime/sktime#5823 - these should be statistically independent pseudo-random.

Currently, clone copies the random_seed 1:1, which results in:

- if random_seed=None: independent copies, but not pseudo-random fixed (each run gives different values)
- if random_seed is set: pseudo-random fixed copies, but value-identical rather than statistically independent

Neither meets the requirement above, because that would need to be both pseudo-random fixed and statistically independent (not value-identical).
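To make the two cases concrete, here is a minimal sketch using only the standard library. `ToySampler` and its `sample` method are hypothetical stand-ins, not the actual base class; only the 1:1 copying of `random_seed` in `clone` is taken from the description above.

```python
import random


class ToySampler:
    """Hypothetical stand-in for an estimator with a random_seed parameter."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def clone(self):
        # Mirrors the behaviour described above: the seed is copied 1:1.
        return ToySampler(random_seed=self.random_seed)

    def sample(self, n=3):
        # A fresh generator per call; seed None means OS entropy is used.
        rng = random.Random(self.random_seed)
        return [rng.random() for _ in range(n)]


seeded = ToySampler(random_seed=42)
a, b = seeded.clone(), seeded.clone()
# Reproducible across runs, but the clones are value-identical.
identical = a.sample() == b.sample()

unseeded = ToySampler(random_seed=None)
c, d = unseeded.clone(), unseeded.clone()
# Clones differ from each other, but also differ on every run.
differ = c.sample() != d.sample()
```

Neither `a, b` nor `c, d` is both reproducible and distinct, which is the gap this issue is about.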

In light of the rework of random_seed functionality (see #268), it is worth discussing what this should look like from the API perspective.

A key problem arises if multiple clones are needed - their number needs to be known in advance, or at least the seeds need to be sampled in a chain, to obtain dependent seeds which give rise to pseudo-random independent copies.
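One way to read "sampled in a chain" is sketched below: each seed is derived from the previous one, so the number of clones does not need to be known up front. The helper names `next_seed` and `seed_chain` are hypothetical, and hashing with SHA-256 is an assumption chosen for illustration; it gives reproducible, practically distinct seeds, not a formal independence guarantee.

```python
import hashlib


def next_seed(seed: int) -> int:
    """Derive the next seed in the chain by hashing the previous one."""
    digest = hashlib.sha256(str(seed).encode("utf-8")).digest()
    # Keep 64 bits of the digest as an integer seed.
    return int.from_bytes(digest[:8], "big")


def seed_chain(root_seed: int, n_clones: int) -> list:
    """Return n_clones seeds, each derived from its predecessor."""
    seeds = []
    seed = root_seed
    for _ in range(n_clones):
        seed = next_seed(seed)
        seeds.append(seed)
    return seeds


first_three = seed_chain(42, 3)
first_five = seed_chain(42, 5)
```

Because the chain is deterministic, asking for more clones later extends the same sequence: `first_five[:3]` equals `first_three`.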

Further, we cannot change the default behaviour of clone and its current parameters, as it is an interface point of high importance.

Options I can think of:

- something like clone(deep=True, random_seed="exact_copy", n_clones=None)
- a new method clone_random(deep=True, n_clones=1)
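As a rough illustration of the second option, the sketch below leaves `clone` untouched and adds `clone_random` next to it. `ToyEstimator` is a hypothetical toy class, and drawing the clone seeds from `random.Random(self.random_seed)` is just one assumed way to derive them; nothing here reflects an agreed design.

```python
import copy
import random


class ToyEstimator:
    """Hypothetical estimator used only to illustrate the proposed method."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def clone(self, deep=True):
        # Existing behaviour stays as is: the seed is copied 1:1.
        return copy.deepcopy(self) if deep else copy.copy(self)

    def clone_random(self, deep=True, n_clones=1):
        # Seeds for the clones are drawn from a generator seeded by self.
        seed_rng = random.Random(self.random_seed)
        clones = []
        for _ in range(n_clones):
            new = self.clone(deep=deep)
            new.random_seed = seed_rng.getrandbits(32)
            clones.append(new)
        return clones


est = ToyEstimator(random_seed=7)
clone_seeds = [c.random_seed for c in est.clone_random(n_clones=3)]
```

With a fixed `random_seed` the clone seeds are the same on every run but differ from each other; with `random_seed=None` they differ between runs, as for the original estimator.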

FYI @ericjb, @jmwhyte, @tpvasconcelos - since we all discussed either clone or random_seed recently.


ericjb · 2024-01-31T15:13:51Z

In case you are getting deep into the weeds and need to consider independent random numbers in a distributed computing environment, you may need to think about the PRNG itself and not just the seed. FYI https://www.thesalmons.org/john/random123/papers/random123sc11.pdf

fkiraly · 2024-01-31T15:57:12Z

> deep into the weeds

Ouch, not that deep into the weeds. I think we need to deal with the case of single location/env only, and leave pseudo-random seed handling for distributed environments to backends like joblib.

fkiraly · 2024-02-20T17:08:00Z

@ericjb, I have come to realize that we will indeed probably need a PRNG which guarantees pseudo-random independence in a tree-like hierarchy. As such, the linked paper is exactly of the kind I was looking for.

fkiraly · 2024-02-20T17:21:18Z

Pinging @johnsalmon, @moraesmark, @pbelevich regarding the paper - it would be great if there were a tool, possibly with python bindings, which can generate independent pseudo-random seeds for a tree-like structure of sampler objects, such that all samplers end up (mutually) pseudo-random independent if every node in the tree does node.set_random_seed(children, node.random_seed), assuming its own seed has been set by its parent.

(the scope of this package is in-principle all of ML pipelines)
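To illustrate the kind of node-local propagation meant here, below is a minimal sketch in which each node only needs its own seed and the index of each child. The `Node` class, `derive_seed`, and the single-argument `set_random_seed` are hypothetical simplifications, and SHA-256 hashing is an assumed stand-in for a proper tool: it yields reproducible and practically distinct seeds, without the guarantees a dedicated PRNG design would give.

```python
import hashlib


def derive_seed(parent_seed: int, child_index: int) -> int:
    """Hash the parent's seed together with the child's position."""
    payload = f"{parent_seed}/{child_index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


class Node:
    """Hypothetical sampler node; it knows its children, not the whole tree."""

    def __init__(self, children=None):
        self.children = list(children or [])
        self.random_seed = None

    def set_random_seed(self, seed: int):
        # Store own seed, then hand a derived seed to each child.
        self.random_seed = seed
        for index, child in enumerate(self.children):
            child.set_random_seed(derive_seed(seed, index))


def collect_seeds(node):
    """Gather all seeds in the subtree, depth first."""
    seeds = [node.random_seed]
    for child in node.children:
        seeds.extend(collect_seeds(child))
    return seeds


root = Node([Node([Node(), Node()]), Node()])
root.set_random_seed(42)
tree_seeds = collect_seeds(root)
```

No node needs to know the size of the tree, and a child's seed depends only on the path from the root to that child.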

What is unfortunate is that we do not know the size of the tree in advance, so a node doesn't know the number of its children, nor can it communicate with its parents. Otherwise, the solution would be "fairly trivial" by doing the following at the root: 1. compute the number of nodes, 2. run a PRNG sequence generator, 3. distribute the seeds across the tree, for any enumeration of the nodes.
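For comparison, the "fairly trivial" root-level variant could look roughly like the sketch below, under the assumption that the whole tree is visible from the root. `Node`, `enumerate_nodes` and `seed_tree` are hypothetical names, and depth-first pre-order is an arbitrary choice of enumeration.

```python
import random


class Node:
    """Hypothetical sampler node with a list of children."""

    def __init__(self, children=None):
        self.children = list(children or [])
        self.random_seed = None


def enumerate_nodes(root):
    """Return all nodes of the tree in depth-first pre-order."""
    nodes = [root]
    for child in root.children:
        nodes.extend(enumerate_nodes(child))
    return nodes


def seed_tree(root, root_seed):
    nodes = enumerate_nodes(root)                   # 1. find and count nodes
    rng = random.Random(root_seed)                  # 2. run a PRNG sequence
    seeds = [rng.getrandbits(64) for _ in nodes]
    for node, seed in zip(nodes, seeds):            # 3. distribute the seeds
        node.random_seed = seed
    return nodes


tree = Node([Node([Node()]), Node()])
assigned = [n.random_seed for n in seed_tree(tree, root_seed=0)]
```

This only works when the root can see every node, which is exactly what is not available in the setting described above.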

fkiraly · 2024-03-01T22:24:59Z

Getting a bit deeper in the woods, one option would be to convolve each call to a dependent random seed with:

a hash of the python file it was called from
the line number it was called from

That would ensure uniqueness, pseudo-randomness, and pseudo-independence, as long as no line of code contains more than one call, and no two files the dependent seed generator is called from are identical (e.g., something silly like near-empty __init__ files).

For reference and potential use in that, here is random code from stackoverflow that produces the line a function was called from:

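from inspect import currentframe


def get_linenumber():
    """Return the line number from which this function was called."""
    cf = currentframe()  # frame of get_linenumber itself
    return cf.f_back.f_lineno  # f_back is the caller's frame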
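Putting the pieces together, a minimal sketch of such a call-site-dependent seed could look like this. The function name `dependent_seed` and the use of SHA-256 to mix the base seed, file hash and line number are assumptions for illustration; the fallback to hashing the file name covers code that is not run from a readable file (e.g., a REPL).

```python
import hashlib
from inspect import currentframe


def dependent_seed(base_seed: int) -> int:
    """Mix base_seed with a hash of the calling file and the calling line."""
    # currentframe() may return None on non-CPython implementations.
    caller = currentframe().f_back
    filename = caller.f_code.co_filename
    lineno = caller.f_lineno
    try:
        with open(filename, "rb") as f:
            file_hash = hashlib.sha256(f.read()).hexdigest()
    except OSError:
        # No readable source file: fall back to hashing the name itself.
        file_hash = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    payload = f"{base_seed}:{file_hash}:{lineno}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


seed_a = dependent_seed(42)  # called from one line
seed_b = dependent_seed(42)  # called from the next line, so the seed differs
```

Two calls from the same line of the same file would receive the same seed, which is the limitation noted above.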

fkiraly added the API design (API design & software architecture) label Jan 31, 2024

fkiraly mentioned this issue Feb 20, 2024

[BUG] Time-series classification pipeline (panel data) and GridSearchCV - potential for uninformative grid searching/no reproducibility? sktime/sktime#5702

Open

fkiraly mentioned this issue Apr 8, 2024

[ENH] randomization/derandomization tag and conditional test logic sktime/sktime#6274

Open
