Add n_jobs and random_state to global config #23732

Open
davidgilbertson opened this issue Jun 22, 2022 · 3 comments
@davidgilbertson
Contributor

Describe the workflow you want to enable

I would like to be able to set n_jobs=-1 in one place and have this take effect in any function with an n_jobs parameter. Same for random_state.

Perhaps there are other parameters that fit the theme: "if a user sets this in one instance, they probably want to set it in all instances".

Describe your proposed solution

Expand the accepted parameters of sklearn.set_config, and update all functions to fall back to the config value if the parameter isn't passed.

This would require changing a default of None to a sentinel (à la _NoValue in NumPy), so that a user can still override the global config by passing None explicitly, while the code can detect whether the argument was passed at all.

Rudimentary mockup:

_sentinel = object()

config = {}


def resolve_arg_value(arg_name, passed_value, default_value):
    if passed_value is not _sentinel:
        return passed_value
    
    if arg_name in config:
        return config[arg_name]
    
    return default_value


def do_something(random_state=_sentinel):
    random_state = resolve_arg_value("random_state", random_state, None)
    print(f"{random_state!r}")  # repr, so strings print quoted as in the comments below


do_something('Hi')  # 'Hi'
do_something()  # None
do_something(None)  # None

config['random_state'] = 77

do_something('Hi')  # 'Hi'
do_something()  # 77
do_something(None)  # None

config.pop('random_state')

do_something('Hi')  # 'Hi'
do_something()  # None
do_something(None)  # None

I'll admit, having to add something like resolve_arg_value("random_state", random_state, None) to tons of functions sounds painful, but I think that for the user, being able to set and forget a random state or other parameters, potentially based on an environment variable, would be nice.

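import os
import sklearn

# Proposed API: set_config does not currently accept these two parameters.
sklearn.set_config(
    random_state=None if os.environ.get("PROD") else 0,  # .get avoids a KeyError when PROD is unset
    n_jobs=-1,
)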
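The per-function boilerplate mentioned above could perhaps be reduced with a decorator. The following is only a rough sketch extending the mockup: `uses_global_config` is a hypothetical name, `config` is the same plain dict as in the mockup, and none of this is existing scikit-learn API.

```python
import functools
import inspect

_sentinel = object()  # marks "argument was not passed"
config = {}           # hypothetical global config store, as in the mockup


def uses_global_config(**fallbacks):
    """Fill sentinel-valued parameters from `config`, else from `fallbacks`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, fallback in fallbacks.items():
                # Only touch parameters the caller left at the sentinel default.
                if bound.arguments[name] is _sentinel:
                    bound.arguments[name] = config.get(name, fallback)
            return func(*bound.args, **bound.kwargs)

        return wrapper

    return decorator


@uses_global_config(random_state=None, n_jobs=None)
def do_something(random_state=_sentinel, n_jobs=_sentinel):
    return random_state, n_jobs
```

With this, each function declares its fallbacks once in the decorator instead of calling resolve_arg_value in its body. An explicit None still wins over the config value.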

Describe alternatives you've considered, if relevant

No response

Additional context

I see that the tests get a global random seed. #22749

davidgilbertson added the Needs Triage and New Feature labels Jun 22, 2022
@glemaitre
Member

For n_jobs, we ruled out the possibility here: #23253. It is already possible to do so with the parallel_backend of joblib. However, be extremely careful with n_jobs=-1.
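For readers unfamiliar with it, the joblib approach might look like the sketch below. It assumes joblib is installed; the `square` helper, the "threading" backend, and n_jobs=2 are illustrative choices only, not recommendations from this thread.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    # Hypothetical stand-in for real work.
    return x * x


# Parallel calls inside the block that leave n_jobs unset pick up
# the n_jobs given to the context manager.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(square)(i) for i in range(4))

print(results)
```

scikit-learn estimators whose n_jobs is left at None should pick up the same context when they are fitted inside the `with` block, since they delegate to joblib.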

For random_state, isn't it enough to seed NumPy's global singleton random generator using np.random.seed and leave every random_state=None? In this case, the results will be reproducible, since random_state=None will use the global singleton.
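A minimal illustration of that idea, assuming NumPy is installed: `draw` is a hypothetical stand-in for an estimator that falls back to the global generator when random_state is None; it is not scikit-learn code.

```python
import numpy as np


def draw(random_state=None):
    # None falls back to NumPy's global (legacy) generator;
    # an integer seed gets its own RandomState instance.
    rng = np.random if random_state is None else np.random.RandomState(random_state)
    return rng.uniform(size=3)


np.random.seed(0)
first = draw()

np.random.seed(0)  # reseeding the global generator replays the same stream
second = draw()

print(np.array_equal(first, second))
```

Reproducibility here depends on every consumer of the global generator running in the same order, which is one reason explicit seeds are usually easier to reason about.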

@davidgilbertson
Contributor Author

Thanks, joblib suits my needs!

Re random_state, np.random.seed will work, thanks for that, although the docs specifically warn against it and recommend passing the random_state arg to every function that takes it.

So the goal of my global config suggestion was to allow the 'recommended' way of getting reproducible results without the clutter of passing a random_state parameter to all the scikit-learn functions that take it (and the mental overhead of thinking about which functions take it and which don't).

Also if scikit-learn one day moves from the legacy NumPy RandomState to the new Generator, a global config would provide an upgrade path to all the users who were using the discouraged np.random.seed technique.
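To make the distinction concrete, here is a small sketch, assuming NumPy is installed, of the two APIs side by side. The seed 77 is arbitrary, and nothing here reflects how scikit-learn would actually handle a migration.

```python
import numpy as np

legacy = np.random.RandomState(77)  # legacy API
modern = np.random.default_rng(77)  # newer Generator API

legacy_draw = legacy.uniform(size=3)
modern_draw = modern.uniform(size=3)

# Each object is reproducible from its own seed, independent of global state.
print(np.array_equal(legacy_draw, np.random.RandomState(77).uniform(size=3)))
print(np.array_equal(modern_draw, np.random.default_rng(77).uniform(size=3)))
```

np.random.seed only seeds the legacy global RandomState, so it has no effect on Generator instances created with default_rng.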

@davidgilbertson
Contributor Author

Update: I've just noticed that setting np.random.seed(77) doesn't give me reproducible results when using GradientBoostingRegressor, but GradientBoostingRegressor(random_state=77) does.

I'm also using HalvingRandomSearchCV with scipy.stats.randint, maybe that complicates things (I'm a newbie, the landscape is still a little foggy). If this isn't a known issue I'll try and create a minimal repro.

thomasjpfan added the API and Needs Decision - API labels and removed the Needs Triage label Jun 28, 2022
lorentzenchr added the Needs Decision label and removed the Needs Decision - API label Mar 14, 2024