TST Replace boston in histgradboost test_predictor #16918
Conversation
assert r2_score(y_train, predictor.predict(X_train)) > 0.69
assert r2_score(y_test, predictor.predict(X_test)) > 0.30
Aren't these rather low? Same with the other PR you have; maybe using another dataset would be more reasonable?
Yes, that is the problem with the diabetes dataset. I have changed to california housing and it seems to work reasonably well with the original bins.
- train: (bins=200; 0.8233) (bins=256; 0.8340)
- test: (bins=200; 0.8112) (bins=256; 0.8094)
The downside of using fetch_california_housing is that it requires network access, which means we would need to mark these tests with @pytest.mark.network.
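For illustration, here is a minimal sketch of what marking such a test could look like, assuming scikit-learn's test configuration registers a custom "network" marker (the test name and body below are hypothetical):

```python
import pytest

# Hedged sketch: scikit-learn's test setup is assumed to register a custom
# "network" marker so that CI runs without internet access can deselect
# tests that download data. The test below is illustrative only.
@pytest.mark.network
def test_predictor_on_california_housing():
    # fetch_california_housing() would download the dataset here on first use.
    pass
```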
With some parameter tuning on the diabetes dataset I can get these results:
n_bins=50
train: 0.4253613178731953
test: 0.38498296812822475
n_bins=100
train: 0.4298426536827863
test: 0.3991035630532065
Parameters:
min_samples_leaf = 50
max_leaf_nodes = None
@thomasjpfan and @adrinjalali which dataset do you guys suggest to use?
Would a make_regression with some tuned parameters not be a good option?
Good point, I'll try this tomorrow.
+1 for using make_regression
Thanks @ogrisel and @adrinjalali. I've amended to make_regression
Thanks @lucyleeow. This LGTM
LGTM thank you @lucyleeow !
Reference Issues/PRs
Towards #16155
What does this implement/fix? Explain your changes.
Replace the boston dataset with the ~~diabetes~~ California housing dataset in sklearn/ensemble/_hist_gradient_boosting/tests/test_predictor.py
Any other comments?
Unsure of the best n_bins values / what this is testing. I noticed that the boston features are more spread out and generally have a longer right tail cf. the diabetes dataset. Also, the R2 values with n_bins 200 and 256 with diabetes were the same. The R2 values with California housing are: