
[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774

Merged
merged 21 commits into scikit-learn:master on Mar 25, 2018

Conversation

ccatalfo
Contributor

@ccatalfo ccatalfo commented Mar 8, 2018

Reference Issues/PRs

Fixes #10734 by implementing return_X_y for the kddcup99, twenty_newsgroups, rcv1, and lfw datasets.

What does this implement/fix? Explain your changes.

This replicates the return_X_y parameter that was added to datasets/base.py.

Any other comments?

I did not add return_X_y to some of the other datasets as it seemed to make less sense for these.
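For context, the return_X_y idiom being replicated boils down to this minimal self-contained sketch (load_toy and the inline Bunch stand-in are hypothetical illustrations, not scikit-learn code):

```python
class Bunch(dict):
    """Minimal stand-in for sklearn.utils.Bunch: a dict with attribute access."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as exc:
            raise AttributeError(key) from exc


def load_toy(return_X_y=False):
    """Hypothetical loader showing the idiom from datasets/base.py."""
    data = [[0.0, 1.0], [1.0, 0.0]]
    target = [0, 1]
    if return_X_y:
        # Skip the Bunch entirely and hand back the bare (data, target) pair.
        return data, target
    return Bunch(data=data, target=target, DESCR="toy dataset")
```

With return_X_y=True a caller can unpack `X, y = load_toy(return_X_y=True)` directly instead of going through `bunch.data` and `bunch.target`.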

@jnothman
Member

jnothman commented Mar 8, 2018

I think it is applicable to lfw. We are not stopping the user from getting the bunch, just making it easier to get the primary tuple of data.

And versionadded should be 0.20

@ccatalfo
Contributor Author

ccatalfo commented Mar 8, 2018

@jnothman thanks - will add to lfw and change versionadded to 0.20.

@ccatalfo
Contributor Author

ccatalfo commented Mar 9, 2018

@jnothman I added return_X_y to lfw as well as fixed a couple of pep8 things I missed.


@jnothman jnothman left a comment


This is still marked WIP. Is there something else you intend to do before considering this sufficient for review and merge?

@@ -77,6 +78,13 @@ def test_20news_vectorized():
assert_equal(bunch.target.shape[0], 7532)
assert_equal(bunch.data.dtype, np.float64)

# test return_X_y option
X_y_tuple = datasets.fetch_20newsgroups_vectorized(subset='test',return_X_y=True)


space after , please

@@ -139,6 +140,15 @@ def test_load_fake_lfw_people():
['Abdelatif Smith', 'Abhati Kepler', 'Camara Alvaro',
'Chen Dupont', 'John Lee', 'Lin Bauman', 'Onur Lopez'])

# test return_X_y option
X_y_tuple = fetch_lfw_people(data_home=SCIKIT_LEARN_DATA, resize=None,
slice_=None, color=True,


this indent should be consistent with the argument on the previous line

@ccatalfo ccatalfo changed the title [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 11, 2018
@ccatalfo
Contributor Author

@jnothman thanks for the heads up on those two issues and on marking as MRG. I believe this is now ready for review and merge.

@jnothman
Member

flake8 says you have left some lines longer than 79 characters.


@jnothman jnothman left a comment


The tests are not run by our CI, I think. Have you run them?

It might be nice to have this as a common test for datasets. Otherwise LGTM

@ccatalfo
Contributor Author

@jnothman I have run the tests and they are passing.

I just committed a fix for the too-long lines flagged by flake8.

A common test for datasets - meaning a CI test that runs sklearn/datasets/tests?

@jnothman
Member

No, by a common test, I mean avoiding the repetition of code in the current tests, instead looping over fetchers to confirm that they return the right format when return_X_y is used.
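Such a common test might look like the following sketch (the toy loaders and names are illustrative placeholders, not the test_common.py that was eventually merged):

```python
def make_toy_loader(n_samples):
    """Build an illustrative stand-in for a dataset fetcher."""
    def load(return_X_y=False):
        X = [[float(i)] for i in range(n_samples)]
        y = list(range(n_samples))
        if return_X_y:
            return X, y
        return {"data": X, "target": y}
    return load


def check_return_X_y(load):
    """Shared check: the (X, y) tuple must match the bunch's data/target."""
    bunch = load()
    X_y_tuple = load(return_X_y=True)
    assert isinstance(X_y_tuple, tuple)
    assert X_y_tuple[0] == bunch["data"]
    assert X_y_tuple[1] == bunch["target"]


# Loop over fetchers instead of repeating the assertions in every test file.
for loader in (make_toy_loader(3), make_toy_loader(5)):
    check_return_X_y(loader)
```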

@ccatalfo
Contributor Author

ccatalfo commented Mar 12, 2018

@jnothman "avoiding the repetition of code in the current tests" - Ah, I see what you're saying. Certainly could do that. The question, I guess, is whether it makes more sense to keep every dataset's tests in its own test_* file or to test cross-dataset functionality in one place. I'm happy to do that if you think it's a good idea.

I made a small change to test_rcv1, where I think the test was running out of memory comparing the entire returned array.

The codecov/patch check marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - is that indicating the previous diff does not reach a high enough test coverage percentage?

@jnothman
Member

jnothman commented Mar 12, 2018 via email

@ccatalfo
Contributor Author

@jnothman looking at the datasets/tests/test_common.py idea for return_X_y.

Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_*.py files into tests/test_common.py:

While looping over the limited set of datasets which accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99, and various loaders from base.py):

  1. test_common.py would have to catch exceptions from not having each dataset downloaded, as the normal test_lfw.py etc. test files do.

  2. test_common.py would also have to call the particular dataset's Bunch-returning fetch function, each of which takes different parameters, in order to compare the original Bunch with the X_y_tuple returned from the return_X_y fetch.

In sum - while the actual checks on the X_y_tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would lead to duplication of that part of the test logic.

While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would

  1. wind up duplicating nearly as much code as it replaces, and
  2. break up the coherence of having each dataset's tests live in its own test_*.py file.

If I've misunderstood or mischaracterized your proposal, please let me know. If you feel like test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts.

Also thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.

@jnothman
Member

jnothman commented Mar 13, 2018 via email

@ccatalfo
Contributor Author

@jnothman great, yes, I will do that - refactor the return_X_y test logic called by each test into tests/test_common.py.

@jnothman
Member

jnothman commented Mar 13, 2018 via email

@ccatalfo ccatalfo changed the title [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 14, 2018
@ccatalfo ccatalfo changed the title [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 15, 2018
@ccatalfo
Contributor Author

@jnothman I refactored as suggested into test_common.py - updating both the files I had changed (kddcup99, lfw, etc.) and test_base.py to use the new test function.


@jnothman jnothman left a comment


Thanks!
You have a flake8 failure



def check_return_X_y(bunch, X_y_tuple):
assert_true(isinstance(X_y_tuple, tuple))


We are trying to phase out these assertion functions. Please use a bare assert



Also fixed in 5f73583


def check_return_X_y(bunch, X_y_tuple):
assert_true(isinstance(X_y_tuple, tuple))
assert_array_equal(X_y_tuple[0].shape, bunch.data.shape)


Shapes should not be arrays. Just use assert ==
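The point being made is that an ndarray's .shape is a plain tuple, so it compares directly with ==; no array-comparison helper is needed. A small illustration (assuming NumPy is available):

```python
import numpy as np

a = np.zeros((3, 2))
b = np.ones((3, 2))

# .shape is a tuple, so plain == comparison works; assert_array_equal is overkill.
assert a.shape == b.shape
assert a.shape == (3, 2)
```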



Ok, switched in 5f73583

@@ -38,6 +40,10 @@ def test_percent10():
assert_equal(data.data.shape, (9571, 3))
assert_equal(data.target.shape, (9571,))

X_y_tuple = fetch_kddcup99('smtp', return_X_y=True)
bunch = fetch_kddcup99('smtp')


Don't we already have a bunch above? I think it's unclear to the reader why we fetch this again.



Thanks - you're right. Fixed in e68f3f9

from sklearn.utils.testing import assert_true


def check_return_X_y(bunch, X_y_tuple):


I had imagined this would take a partial function and pass in return_X_y=True. You have still left a lot of duplicated idiom in the test functions.



By

a lot of duplicated idiom

do you mean the setting up of the bunches and fetch of the X_y_tuples to compare?

Were you thinking of something like this?

Each of the dataset test files sets up the appropriate fetch partial

  • test_20news.py, test_lfw.py, etc.
from functools import partial
# test return_X_y option
fetch_func = partial(datasets.fetch_20newsgroups_vectorized, subset='test')
check_return_X_y(bunch, fetch_func)

During the check_return_X_y call, the partial is called, passing return_X_y=True

  • test_common.py
def check_return_X_y(bunch, fetch_func_partial):
    X_y_tuple = fetch_func_partial(return_X_y=True)
    assert isinstance(X_y_tuple, tuple)
    assert X_y_tuple[0].shape == bunch.data.shape
    assert X_y_tuple[1].shape == bunch.target.shape

Wouldn't this run into problems if additional arguments are ever added to those fetch functions after return_X_y, which is currently the last parameter?
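One way to sidestep that concern is to bind and pass everything by keyword: a keyword-based partial keeps working even if a fetcher later grows new trailing parameters. A hypothetical illustration (fetch_demo and new_param are made up for the sketch):

```python
from functools import partial


def fetch_demo(subset="train", return_X_y=False, new_param=None):
    """Hypothetical fetcher; new_param stands in for an argument added later."""
    X, y = [[0.0], [1.0]], [0, 1]
    if return_X_y:
        return X, y
    return {"data": X, "target": y}


# Binding by keyword means no positional slot can collide with a newly
# added parameter, regardless of where it lands in the signature.
fetch_func = partial(fetch_demo, subset="test")
X_y_tuple = fetch_func(return_X_y=True)
```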

@jnothman
Member

jnothman commented Mar 15, 2018 via email

@ccatalfo
Contributor Author

OK, is that what you're going for as far as the usage of the partial, then?

If so I’ll do that tonight.

@jnothman
Member

jnothman commented Mar 16, 2018 via email

@ccatalfo
Contributor Author

@jnothman Ok I've updated each to use the partial function version. How does this look now?


@jnothman jnothman left a comment


I'm happy with this :)

@jnothman jnothman changed the title [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) [MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 17, 2018

@qinhanmin2014 qinhanmin2014 left a comment


LGTM overall

assert_true(isinstance(X_y_tuple, tuple))
assert_array_equal(X_y_tuple[0], bunch.data)
assert_array_equal(X_y_tuple[1], bunch.target)
check_return_X_y(res, partial(load_diabetes))


Why do we need partial here? (and other places in this file)



These test_base.py test functions pass in the partial for the same reason it is passed in from the other test_* dataset files like test_20news.py: so that check_return_X_y can call that same dataset fetch function with the additional return_X_y=True parameter and perform the same standard X_y_tuple checks.

Doing it this way keeps the interface to check_return_X_y uniform across the dataset test_* files -- even if these particular loader functions are relatively simple compared to those in files like test_20news.py.



I don't want to argue on such a minor question (though I still prefer to remove partial if we don't need it). But I think you should try to make the whole file consistent. (e.g., you're not using partial in test_load_digits)



@qinhanmin2014 Aahh - good catch, I had missed that one. Thanks.



I've pushed a fix for that.

@@ -77,6 +79,10 @@ def test_20news_vectorized():
assert_equal(bunch.target.shape[0], 7532)
assert_equal(bunch.data.dtype, np.float64)

# test return_X_y option


How about the training set?



I thought only one subset return_X_y check was needed since fetch_20newsgroups_vectorized is subsetting the fetched dataset before return_X_y is checked.

@qinhanmin2014
Member

@ccatalfo Seems that some functions are not included here (e.g., fetch_california_housing). Any specific reason?

@ccatalfo
Contributor Author

@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing, it looked to me as if there were more than just the X and y to return.

For example fetch_california_housing returns

return Bunch(data=data, target=target, feature_names=feature_names, DESCR=MODULE_DOCS)

so it didn't seem to make sense to return just the X=data and y=target and leave out the feature_names. But I could certainly add it to that one, with just the X and y, if you think it's useful.

I'm not sure which other datasets, beyond that one, might benefit from return_X_y?

Maybe covtype.py?

So shall I add return_X_y tests to these two?

  1. fetch_california_housing
  2. covtype
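If return_X_y were added there, the extra Bunch fields would simply be dropped from the tuple, mirroring the base.py pattern. A hypothetical sketch with toy values (not the real fetcher, which downloads the actual dataset):

```python
MODULE_DOCS = "California housing dataset (placeholder docstring)."


def fetch_california_housing_sketch(return_X_y=False):
    """Hypothetical sketch of return_X_y for a fetcher whose bunch also
    carries feature_names and a docstring."""
    data = [[8.3252, 41.0], [8.3014, 21.0]]   # toy rows, not real data
    target = [4.526, 3.585]
    feature_names = ["MedInc", "HouseAge"]
    if return_X_y:
        # feature_names and DESCR are intentionally left out of the tuple.
        return data, target
    return {"data": data, "target": target,
            "feature_names": feature_names, "DESCR": MODULE_DOCS}
```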

@jnothman
Member

jnothman commented Mar 24, 2018 via email

@ccatalfo
Contributor Author

I've added return_X_y to covtype.

@ccatalfo
Contributor Author

Also added return_X_y to california_housing.

In doing that, I also added a new test_california_housing.py test file (I didn't see an existing test file for california_housing.py) with a return_X_y test.


@qinhanmin2014 qinhanmin2014 left a comment


LGTM. I can't fully understand the codecov failure, but I think it's irrelevant and I'll trust jnothman's comment. Thanks @ccatalfo :)

@qinhanmin2014 qinhanmin2014 merged commit ff3230c into scikit-learn:master Mar 25, 2018
@ccatalfo
Contributor Author

Thanks for all the suggestions and comments!

@jnothman
Member

jnothman commented Mar 25, 2018 via email

Successfully merging this pull request may close these issues.

return_X_y should be available on more dataset loaders/fetchers