
Results differ at each run #323

Closed
gcasamat opened this issue Nov 18, 2020 · 8 comments · Fixed by #325
Comments

@gcasamat

I train a forest DML estimator and specify random_state:

from econml.dml import ForestDMLCateEstimator
from sklearn.ensemble import RandomForestRegressor

est_forest_rf_train = ForestDMLCateEstimator(
                            model_y = RandomForestRegressor(n_estimators = 200, random_state = 42),
                            model_t = RandomForestRegressor(n_estimators = 200, random_state = 42),
                            n_estimators = 200,
                            random_state = 123)

However, I get different results each time I run the following code:

est_forest_rf_train.fit(Y, T, X, W = 'blb')
est_forest_rf_train.const_marginal_effect_inference(X_test).summary_frame(alpha = 0.05, value = 0, decimals = 3)

This is surprising, as I thought this bug had been corrected, as indicated in #252.

@vsyrgkanis
Collaborator

Seems weird. I checked again quickly and all relevant randomness pulls from the random state. Are you running from the git code or from the release? Maybe try installing the git code, in case this is something we already fixed but have not yet merged into a release.

We’re about to make a release today, so alternatively you could also wait a day.

@gcasamat
Author

I am running from the release (v0.8.0b1). FYI, things are OK when I specify the folds for cross-fitting "manually" (they were not with version 0.7.0). I will wait for the new release and see if the problem persists.
Thanks!

@kbattocchi
Collaborator

Just to clarify, are the results differing when you both initialize and fit the model several times, or are you initializing just once and calling fit repeatedly? In the second case, the stored models' internal random state will evolve over time and so seeing different results across fit calls would be expected.
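This evolving-state behavior can be shown with a minimal, hypothetical sketch (a toy class, not econml's actual internals): if an estimator materializes its RandomState once at construction and draws from it on every fit, each fit consumes fresh entropy from the same stream.

```python
from numpy.random import RandomState

class ToyEstimator:
    """Hypothetical model that seeds its RandomState once, in __init__."""
    def __init__(self, random_state=None):
        self._rng = RandomState(random_state)  # shared across all fits

    def fit(self):
        # Each call advances the shared stream, so the cross-fitting
        # "split" would differ between consecutive fits of one object.
        return self._rng.rand()

est = ToyEstimator(random_state=123)
first, second = est.fit(), est.fit()
print(first == second)                       # False: state evolved between calls

# Re-instantiating with the same seed resets the stream.
fresh = ToyEstimator(random_state=123).fit()
print(first == fresh)                        # True
```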

@vsyrgkanis
Collaborator

@gcasamat after chatting with @kbattocchi, this seems to be by design. You seem to be calling fit on the same object, but the random state is an evolving quantity within the object, and the crossfit splitter is initialized at fit time. So if you construct the object, call fit, and then call fit again, the splitter is initialized with a different random state each time.

To get exactly the same results you should re-instantiate the object, i.e.:

est_forest_rf_train = ForestDMLCateEstimator(
                            model_y = RandomForestRegressor(n_estimators = 200, random_state = 42), 
                            model_t = RandomForestRegressor(n_estimators = 200, random_state = 42),
                            n_estimators = 200,
                            random_state = 123)
est_forest_rf_train.fit(Y, T, X, W = 'blb')
est_forest_rf_train.const_marginal_effect_inference(X_test).summary_frame(alpha = 0.05, value = 0, decimals = 3)
est_forest_rf_train = ForestDMLCateEstimator(
                            model_y = RandomForestRegressor(n_estimators = 200, random_state = 42), 
                            model_t = RandomForestRegressor(n_estimators = 200, random_state = 42),
                            n_estimators = 200,
                            random_state = 123)
est_forest_rf_train.fit(Y, T, X, W = 'blb')
est_forest_rf_train.const_marginal_effect_inference(X_test).summary_frame(alpha = 0.05, value = 0, decimals = 3)

Here you should get the same result both times. Let me know if you don't.

@gcasamat
Author

I have tested what you proposed and indeed it works fine. Thanks to both of you. I think it would be a nice feature to be able to initialize the model only once and then get the same results for each subsequent fit when random_state is specified.

@vsyrgkanis
Collaborator

@kbattocchi that seems reasonable. I checked sklearn and that's what it does. The fix is simple: we just need to move the check_random_state call to the beginning of fit rather than the init of _OrthoLearner.
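A sketch of that pattern, following sklearn's convention (the class and attribute names here are illustrative, not econml's actual code): the constructor stores the seed untouched, and fit materializes the RandomState each time it is called, so repeated fits of one object start from an identical stream.

```python
from sklearn.utils import check_random_state

class ReproducibleEstimator:
    """Illustrative sklearn-style pattern: fit() re-derives its RandomState."""
    def __init__(self, random_state=None):
        # Store the seed as-is; do NOT materialize the stream here.
        self.random_state = random_state

    def fit(self):
        # check_random_state at fit time: every fit of the same object
        # begins from the same stream when an integer seed was given.
        rng = check_random_state(self.random_state)
        return rng.rand()

est = ReproducibleEstimator(random_state=123)
print(est.fit() == est.fit())   # True: repeated fits now agree
```

Note that with `random_state=None`, check_random_state falls back to the global numpy stream, so repeated fits would still (intentionally) differ.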

@vsyrgkanis
Collaborator

@gcasamat I pushed some changes that would fix the issue you raised once the PR is merged.

@gcasamat
Author

This is great! Thanks.
