Add `n_jobs` and `random_state` to global config #23732

davidgilbertson · 2022-06-22T21:02:04Z

Describe the workflow you want to enable

I would like to be able to set n_jobs=-1 in one place and have this take effect in any function with an n_jobs parameter. Same for random_state.

Perhaps there are other parameters that fit the theme: "if a user sets this in one instance, they probably want to set it in all instances".

Describe your proposed solution

Expand the accepted parameters of sklearn.set_config, and update all functions to fall back to the config value if the parameter isn't passed.

This would require changing a default of None to a sentinel (à la _NoValue in NumPy), so that a user can override the global config with the value None while the code can still detect whether the argument was passed.

Rudimentary mockup:

_sentinel = object()

config = {}


def resolve_arg_value(arg_name, passed_value, default_value):
    if passed_value is not _sentinel:
        return passed_value
    
    if arg_name in config:
        return config[arg_name]
    
    return default_value


def do_something(random_state=_sentinel):
    random_state = resolve_arg_value("random_state", random_state, None)
    print(f"{random_state!s}")


do_something('Hi')  # 'Hi'
do_something()  # None
do_something(None)  # None

config['random_state'] = 77

do_something('Hi')  # 'Hi'
do_something()  # 77
do_something(None)  # None

config.pop('random_state')

do_something('Hi')  # 'Hi'
do_something()  # None
do_something(None)  # None

I'll admit, having to add something like resolve_arg_value("random_state", random_state, None) to tons of functions sounds painful, but I think for the user, being able to set and forget a random state (or other params), potentially based on an environment variable, would be nice.

import os
import sklearn

# Proposed usage; these parameters are not part of set_config.
sklearn.set_config(
    random_state=None if os.environ.get("PROD") else 0,
    n_jobs=-1,
)

Describe alternatives you've considered, if relevant

No response

Additional context

I see that the tests get a global random seed. #22749


glemaitre · 2022-06-23T12:36:01Z

For n_jobs, we ruled out the possibility here: #23253. It is already possible to do so with the parallel_backend of joblib. However, be extremely careful with n_jobs=-1.
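For reference, a minimal sketch of the joblib approach, using plain joblib rather than an estimator. The "threading" backend and n_jobs=2 are illustrative choices only, not a recommendation:

```python
from joblib import Parallel, delayed, parallel_backend

# Inside the context manager, Parallel calls that leave n_jobs unset
# pick up the value given to parallel_backend.
with parallel_backend("threading", n_jobs=2):
    squares = Parallel()(delayed(pow)(i, 2) for i in range(4))

print(squares)
```

Code that runs joblib internally and leaves n_jobs at None should pick up the setting the same way when called inside the `with` block.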

For random_state, isn't it enough to seed the singleton random generator from NumPy using np.random.seed and leave every random_state=None? In this case, the results will be reproducible, since random_state=None uses the global singleton.
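A small sketch of the mechanism, assuming `sklearn.utils.check_random_state` is what resolves `random_state=None`. It only shows that None maps to NumPy's global RandomState; it does not establish reproducibility for any particular estimator:

```python
import numpy as np
from sklearn.utils import check_random_state

np.random.seed(0)
first = check_random_state(None).uniform(size=3)

# Re-seeding the global singleton replays the same stream.
np.random.seed(0)
second = check_random_state(None).uniform(size=3)

same = np.array_equal(first, second)
print(same)
```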

davidgilbertson · 2022-06-23T21:54:14Z

Thanks, joblib suits my needs!

Re random_state, np.random.seed will work, thanks for that. Although the docs specifically warn against it and recommend passing the random_state arg to every function that takes it.

So the goal of my global config suggestion was to allow the 'recommended' way of getting reproducible results without the clutter of passing a random_state parameter to all the scikit-learn functions that take it (and the mental overhead of thinking about which functions take it and which don't).

Also, if scikit-learn one day moves from the legacy NumPy RandomState to the new Generator, a global config would provide an upgrade path for all the users who were using the discouraged np.random.seed technique.

davidgilbertson · 2022-06-24T23:22:31Z

Update: I've just noticed that setting np.random.seed(77) doesn't give me reproducible results when using GradientBoostingRegressor, but GradientBoostingRegressor(random_state=77) does.

I'm also using HalvingRandomSearchCV with scipy.stats.randint, maybe that complicates things (I'm a newbie, the landscape is still a little foggy). If this isn't a known issue I'll try and create a minimal repro.

davidgilbertson added the Needs Triage and New Feature labels on Jun 22, 2022

thomasjpfan added the API and Needs Decision - API labels and removed the Needs Triage label on Jun 28, 2022

lorentzenchr added the Needs Decision label and removed the Needs Decision - API label on Mar 14, 2024
