[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774
Conversation
I think it is applicable to lfw. We are not stopping the user from getting the bunch, just making it easier to get the primary tuple of data. And versionadded should be 0.20
@jnothman thanks - will add to lfw and change versionadded to 0.20.
@jnothman I added return_X_y to lfw as well as fixed a couple of pep8 things I missed.
This is still marked WIP. Is there something else you intend to do before considering this sufficient for review and merge?
@@ -77,6 +78,13 @@ def test_20news_vectorized():
    assert_equal(bunch.target.shape[0], 7532)
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option
    X_y_tuple = datasets.fetch_20newsgroups_vectorized(subset='test',return_X_y=True)
space after , please
sklearn/datasets/tests/test_lfw.py
@@ -139,6 +140,15 @@ def test_load_fake_lfw_people():
    ['Abdelatif Smith', 'Abhati Kepler', 'Camara Alvaro',
     'Chen Dupont', 'John Lee', 'Lin Bauman', 'Onur Lopez'])

    # test return_X_y option
    X_y_tuple = fetch_lfw_people(data_home=SCIKIT_LEARN_DATA, resize=None,
        slice_=None, color=True,
this indent should be consistent with the argument on the previous line
@jnothman thanks for the heads up on those two issues and on marking as MRG. I believe this is now ready for review and merge.
flake8 says you have left some lines longer than 79 characters.
The tests are not run by our CI, I think. Have you run them?
It might be nice to have this as a common test for datasets. Otherwise LGTM
@jnothman I have run the tests and they are passing. I just committed a fix for those too-long lines flagged by flake8. A common test for datasets - meaning a CI test that runs sklearn/datasets/tests?
No, by a common test, I mean avoiding the repetition of code in the current tests, instead looping over fetches to confirm that they return the right format when return_X_y is used |
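One way the looping idea described above might look is sketched below. This is a hypothetical illustration, not scikit-learn code: the `Bunch` class, `make_fetcher`, and the `FETCHERS` list are stand-ins for the real fetchers (e.g. `fetch_kddcup99`, `fetch_rcv1`) so the snippet runs offline.

```python
# Hypothetical sketch: a common test that loops over dataset fetchers and
# checks that return_X_y=True yields a tuple matching the Bunch contents.
class Bunch(dict):
    """Tiny stand-in for sklearn.utils.Bunch: a dict with attribute access."""
    def __getattr__(self, key):
        return self[key]


def make_fetcher(n_samples, n_features):
    """Build a fake fetcher mimicking the return_X_y convention."""
    def fetch(return_X_y=False):
        data = [[0.0] * n_features for _ in range(n_samples)]
        target = [0] * n_samples
        if return_X_y:
            return data, target
        return Bunch(data=data, target=target)
    return fetch


# Stand-ins for the real dataset fetchers accepting return_X_y.
FETCHERS = [make_fetcher(20, 5), make_fetcher(7, 3)]


def test_fetchers_return_X_y():
    for fetch in FETCHERS:
        bunch = fetch()
        X, y = fetch(return_X_y=True)
        assert len(X) == len(bunch.data)
        assert len(y) == len(bunch.target)


test_fetchers_return_X_y()
```

As the conversation goes on to note, the real fetchers need different arguments (data_home, subset, etc.), which is what ultimately motivates the partial-function approach instead of a plain loop.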
…ay to get around memory issue. (scikit-learn#10734)
@jnothman "avoiding the repetition of code in the current tests" - Ah I see what you're saying. Certainly could do that - does it make more sense to have every dataset's test in its own test_ file, or to test across datasets for functionality that crosses them, is the question I guess. I'm happy to do that if you think it's a good idea. I made a small change to test_rcv1 where I think the test was running out of memory attempting to test the entire array returned. The codecov/patch that is marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - is that indicating the previous diff does not hit a large enough test coverage percentage?
the codecov failure simply says that the new test code is not being run.
Which is to be expected.
a common test would live in datasets/tests/test_common.py.
@jnothman looking at the datasets/tests/test_common.py idea for return_X_y. Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_*.py files into tests/test_common.py while looping over the limited set of datasets which accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99 and various fetches from base.py).
In sum: while the actual tests of the X_y_tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would lead to code duplication of that part of the test logic. While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would just move the duplication rather than remove it.
If I've misunderstood or mischaracterized your proposal, please let me know. If you feel test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts. Also thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.
Okay. Could also just use a helper called check_return_X_y, called by each dataset's test.
@jnothman great, yes, I will do that - refactor the return_X_y test part called by each test into tests/test_common.py.
thanks!
…parameter into common test function.
@jnothman I refactored as suggested into test_common.py - both the files I had updated (kddcup99, lfw, etc.) and test_base.py now use the new test function.
Thanks!
You have a flake8 failure
def check_return_X_y(bunch, X_y_tuple):
    assert_true(isinstance(X_y_tuple, tuple))
We are trying to phase out these assertion functions. Please use a bare assert
Also fixed in 5f73583
def check_return_X_y(bunch, X_y_tuple):
    assert_true(isinstance(X_y_tuple, tuple))
    assert_array_equal(X_y_tuple[0].shape, bunch.data.shape)
Shapes should not be arrays. Just use assert ==
Ok, switched in 5f73583
@@ -38,6 +40,10 @@ def test_percent10():
    assert_equal(data.data.shape, (9571, 3))
    assert_equal(data.target.shape, (9571,))

    X_y_tuple = fetch_kddcup99('smtp', return_X_y=True)
    bunch = fetch_kddcup99('smtp')
Don't we already have a bunch above? I think it's unclear to the reader why we get this again
Thanks - you're right. Fixed in e68f3f9
from sklearn.utils.testing import assert_true


def check_return_X_y(bunch, X_y_tuple):
I had imagined this would take a partial function and pass in return_X_y=True. You still have left a lot of duplicated idiom in the test functions
By "a lot of duplicated idiom" do you mean the setting up of the bunches and the fetch of the X_y_tuples to compare?
Were you thinking of something like this? Each of the dataset test files (test_20news.py, test_lfw.py, etc.) sets up the appropriate fetch partial:

    from functools import partial

    # test return_X_y option
    fetch_func = partial(datasets.fetch_20newsgroups_vectorized, subset='test')
    check_return_X_y(bunch, fetch_func)

During the check_return_X_y call, the partial is called, passing return_X_y=True (in test_common.py):

    def check_return_X_y(bunch, fetch_func_partial):
        X_y_tuple = fetch_func_partial(return_X_y=True)
        assert isinstance(X_y_tuple, tuple)
        assert X_y_tuple[0].shape == bunch.data.shape
        assert X_y_tuple[1].shape == bunch.target.shape
Wouldn't this run into problems if additional arguments are ever added to those fetch functions after what's currently the last parameter, return_X_y?
No, I don't see how it would run into problems if args are passed by name.
Ok, is that what you're going for as far as the usage of the partial then? If so I'll do that tonight.
I think so, only if it seems to make things neater
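Putting the pieces of the discussion together, the agreed pattern might look like the following self-contained sketch. This is an illustration, not the merged scikit-learn code: `fetch_fake_dataset` and this minimal `Bunch` are hypothetical stand-ins so the example runs without downloading any dataset.

```python
# Sketch of the partial-based pattern: each dataset test builds a partial
# of its own fetch function, and the shared helper invokes it with
# return_X_y=True before comparing against the Bunch.
from functools import partial


class Bunch(dict):
    """Tiny stand-in for sklearn.utils.Bunch: a dict with attribute access."""
    def __getattr__(self, key):
        return self[key]


def fetch_fake_dataset(subset='train', return_X_y=False):
    """Hypothetical fetcher mimicking e.g. fetch_20newsgroups_vectorized."""
    n = 100 if subset == 'train' else 40
    data = [[0.0] * 3 for _ in range(n)]
    target = [0] * n
    if return_X_y:
        return data, target
    return Bunch(data=data, target=target)


def check_return_X_y(bunch, fetch_func_partial):
    """Shared helper (as in tests/test_common.py): call the partial with
    return_X_y=True and check the tuple mirrors the bunch."""
    X_y_tuple = fetch_func_partial(return_X_y=True)
    assert isinstance(X_y_tuple, tuple)
    assert len(X_y_tuple[0]) == len(bunch.data)
    assert len(X_y_tuple[1]) == len(bunch.target)


# What an individual dataset test would do:
bunch = fetch_fake_dataset(subset='test')
check_return_X_y(bunch, partial(fetch_fake_dataset, subset='test'))
```

Because the partial binds its arguments by keyword, the helper stays correct even if new parameters are later appended to a fetcher's signature, which is the point made just above.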
…e partial fetch function (scikit-learn#10734).
@jnothman Ok I've updated each to use the partial function version. How does this look now?
I'm happy with this :)
LGTM overall
    assert_true(isinstance(X_y_tuple, tuple))
    assert_array_equal(X_y_tuple[0], bunch.data)
    assert_array_equal(X_y_tuple[1], bunch.target)

    check_return_X_y(res, partial(load_diabetes))
Why do we need partial here? (and other places in this file)
These test_base.py test functions pass in the partial for the same reason the partial is passed in from the other test_* dataset files like test_20news.py - so that check_return_X_y can call that same dataset fetch function with the additional return_X_y=True parameter and perform the same standard X_y_tuple checks.
Doing it this way keeps the interface to check_return_X_y uniform among the dataset test_* files -- even if these particular fetch functions are relatively simple compared to those in files like test_20news.py.
I don't want to argue on such a minor question (though I still prefer to remove partial if we don't need it). But I think you should try to make the whole file consistent. (e.g., you're not using partial in test_load_digits)
@qinhanmin2014 Aahh - good catch, I had missed that one. Thanks.
I've pushed a fix for that.
@@ -77,6 +79,10 @@ def test_20news_vectorized():
    assert_equal(bunch.target.shape[0], 7532)
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option
How about the training set?
I thought only one subset's return_X_y check was needed, since fetch_20newsgroups_vectorized subsets the fetched dataset before return_X_y is checked.
@ccatalfo Seems that some functions are not included here (e.g., …).
@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing: it looked to me as if there was more than just the X and y to return (for example, the feature names), so it didn't seem to make sense to return just the X_y tuple. I'm not sure which other datasets, beyond that one, might benefit from return_X_y? Maybe covtype.py? So shall I add return_X_y tests to these two?
Yes, it's useful even without feature names.
…igits test to check_return_X_y
I've added return_X_y to covtype. |
…ng datasets test file (scikit-learn#10374)
…alifornia_housing dataset (scikit-learn#10734).
Also added return_X_y to california_housing. In doing that, I also added a new test_california_housing.py test file with a return_X_y check (I didn't see an existing test file for california_housing.py).
LGTM. I can't fully understand the codecov failure but I think it's irrelevant and I'll trust jnothman's comment. Thanks @ccatalfo :)
Thanks for all the suggestions and comments!
thank you for your work and clear communication!
Reference Issues/PRs
Fixes #10734 by implementing return_X_y for kddcup99, twenty_newsgroups, rcv1, and lfw datasets.
What does this implement/fix? Explain your changes.
This replicates the return_X_y parameter that was added to datasets/base.py.
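For reference, the return_X_y convention being replicated can be sketched as follows. This is a minimal illustrative example, not scikit-learn's actual loader code: `load_example` and this `Bunch` stand-in are hypothetical.

```python
# Hedged sketch of the return_X_y pattern from datasets/base.py that this
# PR replicates in the fetchers: build the Bunch as before, but return a
# plain (data, target) tuple when return_X_y=True.
class Bunch(dict):
    """Tiny stand-in for sklearn.utils.Bunch: a dict with attribute access."""
    def __getattr__(self, key):
        return self[key]


def load_example(return_X_y=False):
    """Hypothetical loader following the return_X_y convention."""
    data = [[1.0, 2.0], [3.0, 4.0]]
    target = [0, 1]
    if return_X_y:
        # Skip the Bunch entirely and hand back the primary tuple of data.
        return data, target
    return Bunch(data=data, target=target,
                 DESCR="Example dataset with a return_X_y option.")
```

Callers who only want the supervised-learning inputs can then write `X, y = load_example(return_X_y=True)` instead of unpacking the Bunch.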
Any other comments?
I did not add return_X_y to some of the other datasets as it seemed to make less sense for these.