
[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774

Merged
merged 21 commits into scikit-learn:master on Mar 25, 2018

Conversation

3 participants
@ccatalfo
Contributor

ccatalfo commented Mar 8, 2018

Reference Issues/PRs

Fixes #10734 by implementing return_X_y for kddcup99, twenty_newsgroups, rcv1, and lfw datasets.

What does this implement/fix? Explain your changes.

This replicates the return_X_y parameter that was added to datasets/base.py.

Any other comments?

I did not add return_X_y to some of the other datasets as it seemed to make less sense for these.
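For readers unfamiliar with the pattern: return_X_y, as introduced in sklearn/datasets/base.py, lets a loader hand back just the (data, target) tuple instead of the full Bunch. A minimal sketch of the idea follows; the loader name and data are illustrative, not the PR's actual code:

```python
# Sketch of the return_X_y pattern from sklearn/datasets/base.py.
# The loader builds its Bunch as usual, then optionally returns only
# the primary (data, target) tuple. Names here are illustrative.

class Bunch(dict):
    """Dict with attribute access, as used by scikit-learn loaders."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

def load_example(return_X_y=False):
    data = [[0.0, 1.0], [1.0, 0.0]]   # placeholder feature matrix
    target = [0, 1]                   # placeholder labels
    if return_X_y:
        # Shortcut: just the primary (X, y) tuple, no Bunch.
        return data, target
    return Bunch(data=data, target=target, DESCR="example dataset")

X, y = load_example(return_X_y=True)
bunch = load_example()
```

Note that the Bunch path stays available; return_X_y only adds a convenience for the common case.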


Member

jnothman commented Mar 8, 2018

I think it is applicable to lfw. We are not stopping the user from getting the bunch, just making it easier to get the primary tuple of data

And versionadded should be 0.20
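The versionadded note refers to a numpydoc directive in the new parameter's docstring. A hedged sketch of what that looks like (the function name and wording are illustrative):

```python
def fetch_example(return_X_y=False):
    """Load an example dataset.

    Parameters
    ----------
    return_X_y : boolean, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.

        .. versionadded:: 0.20
    """
    data, target = [[0]], [0]
    if return_X_y:
        return data, target
    return {"data": data, "target": target}
```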


Contributor

ccatalfo commented Mar 8, 2018

@jnothman thanks - will add to lfw and change versionadded to 0.20.


Contributor

ccatalfo commented Mar 9, 2018

@jnothman I added return_X_y to lfw as well as fixed a couple of pep8 things I missed.

@jnothman

This is still marked WIP. Is there something else you intend to do before considering this sufficient for review and merge?

Outdated review threads on: sklearn/datasets/tests/test_20news.py, sklearn/datasets/tests/test_lfw.py

@ccatalfo ccatalfo changed the title from [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) to [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 11, 2018


Contributor

ccatalfo commented Mar 11, 2018

@jnothman thanks for the heads up on those two issues and on marking as MRG. I believe this is now ready for review and merge.


Member

jnothman commented Mar 11, 2018

flake8 says you have left some lines longer than 79 characters.

@jnothman

The tests are not run by our CI, I think. Have you run them?

It might be nice to have this as a common test for datasets. Otherwise LGTM


Contributor

ccatalfo commented Mar 12, 2018

@jnothman I have run the tests and they are passing.

I just committed fix for those too-long lines flagged by flake8.

A common test for datasets - meaning a CI test that runs sklearn/datasets/tests?


Member

jnothman commented Mar 12, 2018

No, by a common test, I mean avoiding the repetition of code in the current tests, instead looping over fetchers to confirm that they return the right format when return_X_y is used
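A minimal sketch of the kind of shared check being proposed here. The helper name check_return_X_y matches later discussion in this thread, but the fetcher stub and assertion details below are assumptions, not the merged code:

```python
from functools import partial

def check_return_X_y(bunch, fetch_func_partial):
    # Re-fetch the same dataset with return_X_y=True and compare
    # the resulting tuple against the previously fetched Bunch.
    X_y_tuple = fetch_func_partial(return_X_y=True)
    assert isinstance(X_y_tuple, tuple)
    # Real tests would use assert_array_equal for ndarrays.
    assert X_y_tuple[0] == bunch["data"]
    assert X_y_tuple[1] == bunch["target"]

# Illustrative stand-in for a dataset fetcher.
def fetch_stub(return_X_y=False):
    data, target = [[1, 2], [3, 4]], [0, 1]
    if return_X_y:
        return data, target
    return {"data": data, "target": target}

# Each dataset's test file would call the shared helper like this.
check_return_X_y(fetch_stub(), partial(fetch_stub))
```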


Contributor

ccatalfo commented Mar 12, 2018

@jnothman "avoiding the repetition of code in the current tests" - Ah I see what you're saying. Certainly could do that - does it make more sense to have every dataset's test in its own test_ file or test across datasets for functionality that crosses them, is the question I guess. I'm happy to do that if you think it's a good idea.

I made a small change to test_rcv1 where I think the test was running out of memory attempting to test the entire array returned.

The codecov/patch that is marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - is that indicating the diff does not hit a large enough test coverage percentage?



Contributor

ccatalfo commented Mar 13, 2018

@jnothman looking at the datasets/tests/test_common.py idea for return_X_y.

Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_*.py files into tests/test_common.py:

While looping over the limited set of datasets which accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99 and various fetches from base.py):

  1. test_common.py would have to catch exceptions from not having each dataset downloaded, as the normal test_lfw.py etc. test files do.

  2. test_common.py would also have to call the particular dataset's Bunch-returning fetch function, each of which takes different parameters, in order to compare the original Bunch with the X_y_tuple returned from the return_X_y fetch.

In sum - while the actual checks on the X_y_tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would lead to code duplication of that part of the test logic.

While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would

  1. wind up having to duplicate nearly as much code as it replaces and
  2. break up the coherence of having each dataset's tests live only in its own test_*.py file.

If I've misunderstood or mischaracterized your proposal, please let me know. If you feel like test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts.

Also thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.


Contributor

ccatalfo commented Mar 13, 2018

@jnothman great, yes, I will do that - refactor the return_X_y test part called by each test into tests/test_common.py.

@ccatalfo ccatalfo changed the title from [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) to [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 14, 2018

@ccatalfo ccatalfo changed the title from [WIP] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) to [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 15, 2018


Contributor

ccatalfo commented Mar 15, 2018

@jnothman I refactored as suggested into test_common.py - updating both the files I had changed (kddcup99, lfw, etc.) and test_base.py to use the new test function.

@jnothman

Thanks!
You have a flake8 failure

Outdated review threads on: sklearn/datasets/tests/test_common.py, sklearn/datasets/tests/test_kddcup99.py

Contributor

ccatalfo commented Mar 16, 2018

Ok is that what you’re going for as far as the usage of the partial then?

If so I’ll do that tonight.


Contributor

ccatalfo commented Mar 16, 2018

@jnothman Ok I've updated each to use the partial function version. How does this look now?

@jnothman

I'm happy with this :)

@jnothman jnothman changed the title from [MRG] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) to [MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) Mar 17, 2018

```python
assert_true(isinstance(X_y_tuple, tuple))
assert_array_equal(X_y_tuple[0], bunch.data)
assert_array_equal(X_y_tuple[1], bunch.target)
check_return_X_y(res, partial(load_diabetes))
```


@qinhanmin2014

qinhanmin2014 Mar 22, 2018

Member

Why do we need partial here? (and other places in this file)


@ccatalfo

ccatalfo Mar 24, 2018

Contributor

These test_base.py test functions pass in the partial for the same reason the partial is passed in from the other test_* dataset files like test_20news.py - so that check_return_X_y can call that same dataset fetch function with the additional return_X_y=True parameter and perform the same standard X_y_tuple checks.

Doing it this way keeps the interface to check_return_X_y uniform among the dataset test_* files -- even if these particular fetch functions are relatively simple compared to the test files like test_20news.py.
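To illustrate the uniform-interface argument: each test file can pre-bind its dataset-specific arguments with functools.partial, so the shared helper only ever adds return_X_y=True. A sketch under those assumptions (the fetcher stub and helper body are illustrative, not the merged code):

```python
from functools import partial

def check_return_X_y(bunch, fetch_func_partial):
    # The helper only adds return_X_y=True; any dataset-specific
    # arguments were already bound into the partial by the caller.
    X, y = fetch_func_partial(return_X_y=True)
    assert (X, y) == (bunch["data"], bunch["target"])

def fetch_20news_stub(subset="train", return_X_y=False):
    # Stand-in for a fetcher like fetch_20newsgroups_vectorized,
    # whose 'subset' argument gets pre-bound via partial.
    data = {"train": [[1]], "test": [[2]]}[subset]
    target = [0]
    if return_X_y:
        return data, target
    return {"data": data, "target": target}

fetch_func = partial(fetch_20news_stub, subset="test")
check_return_X_y(fetch_func(), fetch_func)
```

Even for simple loaders with no extra arguments, wrapping them in partial keeps every call site of check_return_X_y identical.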


@qinhanmin2014

qinhanmin2014 Mar 25, 2018

Member

I don't want to argue on such a minor question (though I still prefer to remove partial if we don't need it). But I think you should try to make the whole file consistent. (e.g., you're not using partial in test_load_digits)


@ccatalfo

ccatalfo Mar 25, 2018

Contributor

@qinhanmin2014 Aahh - good catch, I had missed that one. Thanks.


@ccatalfo

ccatalfo Mar 25, 2018

Contributor

I've pushed a fix for that.

```diff
@@ -77,6 +79,10 @@ def test_20news_vectorized():
     assert_equal(bunch.target.shape[0], 7532)
     assert_equal(bunch.data.dtype, np.float64)
     # test return_X_y option
```


@qinhanmin2014

qinhanmin2014 Mar 22, 2018

Member

How about the training set?


@ccatalfo

ccatalfo Mar 24, 2018

Contributor

I thought only one subset return_X_y check was needed since fetch_20newsgroups_vectorized is subsetting the fetched dataset before return_X_y is checked.


Member

qinhanmin2014 commented Mar 22, 2018

@ccatalfo Seems that some functions are not included here (e.g., fetch_california_housing). Any specific reason?

Contributor

ccatalfo commented Mar 24, 2018

@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing it looked to me as if there were more than just the X and y to return.

For example fetch_california_housing returns

```python
return Bunch(data=data, target=target, feature_names=feature_names,
             DESCR=MODULE_DOCS)
```

so it didn't seem to make sense to return just the X=data and y=target and leave out the feature_names. But I could certainly add to that one if you think it's useful just with the X and y.

I'm not sure which other datasets, beyond that one, might benefit from return_X_y?

Maybe covtype.py?

So shall I add return_X_y tests to these two?

  1. fetch_california_housing
  2. covtype
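Under the convention eventually adopted, return_X_y simply drops the extra Bunch fields and yields only (data, target). A sketch for a california_housing-style loader, with illustrative placeholder values rather than the real dataset:

```python
def fetch_housing_stub(return_X_y=False):
    # Stand-in for a loader like fetch_california_housing, which
    # carries extra fields beyond data and target.
    data = [[8.3, 41.0], [8.3, 21.0]]   # placeholder feature rows
    target = [4.526, 3.585]             # placeholder targets
    feature_names = ["MedInc", "HouseAge"]
    if return_X_y:
        # feature_names and DESCR are intentionally omitted here;
        # callers who need them fetch the full Bunch instead.
        return data, target
    return {"data": data, "target": target,
            "feature_names": feature_names, "DESCR": "module docs"}
```

Users who want feature_names still get them via the default Bunch path, so nothing is lost by offering the shortcut.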

Contributor

ccatalfo commented Mar 25, 2018

I've added return_X_y to covtype.


Contributor

ccatalfo commented Mar 25, 2018

Also added return_X_y to california_housing.

In doing that, I also added a new test_california_housing.py test file (I didn't see an existing test file for california_housing.py) with a return_X_y test.

@qinhanmin2014

LGTM. I can't fully understand the codecov failure but I think it's irrelevant and I'll trust jnothman's comment. Thanks @ccatalfo :)

@qinhanmin2014 qinhanmin2014 merged commit ff3230c into scikit-learn:master Mar 25, 2018

7 of 8 checks passed

- codecov/patch: 70% of diff hit (target 95.01%)
- ci/circleci deploy: Your tests passed on CircleCI!
- ci/circleci python2: Your tests passed on CircleCI!
- ci/circleci python3: Your tests passed on CircleCI!
- codecov/project: 95.08% (+0.06%) compared to ccbf997
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed
- lgtm analysis (Python): No alert changes

Contributor

ccatalfo commented Mar 25, 2018

Thanks for all the suggestions and comments!
