Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST Replace boston in histgradboost test_predictor #16918

Merged
merged 5 commits into from Apr 23, 2020

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented Apr 14, 2020

Reference Issues/PRs

Towards #16155

What does this implement/fix? Explain your changes.

Replace boston dataset with diabetes California housing dataset in sklearn/ensemble/_hist_gradient_boosting/tests/test_predictor.py

Any other comments?

Unsure of the best n_bins values/what this is testing. I noticed that the boston features are more spread out and generally has a longer right tail cf the diabetes dataset. Also the R2 values with n_bin 200 and 256 with diabetes were the same.

The R2 values with California housing are:

  • train: (bins=200; 0.8233 (bins=256; 0.8340)
  • test: (bins=200; 0.8112) (bins=256; 0.8094)

Comment on lines 37 to 38
assert r2_score(y_train, predictor.predict(X_train)) > 0.69
assert r2_score(y_test, predictor.predict(X_test)) > 0.30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't these rather low? Same with the other PR you have, maybe using another dataset would be more easonale?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the problem with the diabetes dataset. I have changed to california housing and it seems to work reasonably well with the original bins.

  • train: (bins=200; 0.8233 (bins=256; 0.8340)
  • test: (bins=200; 0.8112) (bins=256; 0.8094)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of using fetch_california_housing is that it requires network access, which means we would need to mark these test with @pytest.mark.network.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some parameter tuning on the diabetes dataset I can get these results:

n_bins=50
train: 0.4253613178731953
test: 0.38498296812822475

n_bins=100
train: 0.4298426536827863
test: 0.3991035630532065

Parameters:
min_samples_leaf = 50
max_leaf_nodes = None

@thomasjpfan and @adrinjalali which dataset do you guys suggest to use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a make_regression with some tuned parameters not be a good option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a make_regression with some tuned parameters not be a good option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a make_regression with some tuned parameters not be a good option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll try this tomorrow.

@ogrisel
Copy link
Member

ogrisel commented Apr 22, 2020

+1 for using make_regression to avoid relying on the network for such tests.

@lucyleeow
Copy link
Member Author

Thanks @ogrisel and @adrinjalali. I've amended to make_regression.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @lucyleeow . This LGTM

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you @lucyleeow !

@thomasjpfan thomasjpfan changed the title Replace boston in histgradboost test_predictor TST Replace boston in histgradboost test_predictor Apr 23, 2020
@thomasjpfan thomasjpfan merged commit 7844d1c into scikit-learn:master Apr 23, 2020
@lucyleeow lucyleeow deleted the test_predictor branch April 24, 2020 11:58
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants