[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774
Fixes #10734 by implementing return_X_y for the kddcup99, twenty_newsgroups, rcv1, and lfw datasets.
What does this implement/fix? Explain your changes.
This replicates the return_X_y parameter that was added to datasets/base.py.
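For readers unfamiliar with the pattern, here is a minimal sketch of what a return_X_y switch on a fetcher can look like. It is not the code from this PR: `Bunch` below is a simplified stand-in and `fetch_toy_dataset` is a hypothetical loader with made-up data, used only to show the two return shapes.

```python
# Hypothetical stand-in for a dataset fetcher; not scikit-learn's actual code.
class Bunch(dict):
    """Minimal dict with attribute access, mimicking a Bunch-style container."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def fetch_toy_dataset(return_X_y=False):
    """Return a tiny made-up dataset, as a Bunch or as an (X, y) tuple."""
    data = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # feature rows (made up)
    target = [0, 1, 1]  # labels (made up)

    if return_X_y:
        # The caller only wants the features and labels, nothing else.
        return data, target

    # Default: the richer container with the description attached.
    return Bunch(data=data, target=target, DESCR="toy dataset for illustration")


bunch = fetch_toy_dataset()
X, y = fetch_toy_dataset(return_X_y=True)
```

With the flag set, the caller gets exactly the `data` and `target` held by the default container, and the extra metadata (here `DESCR`) is dropped.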
Any other comments?
I did not add return_X_y to some of the other datasets as it seemed to make less sense for these.
@jnothman "avoiding the repetition of code in the current tests" - Ah, I see what you're saying. I certainly could do that. I guess the question is whether it makes more sense to keep every dataset's tests in its own test_ file, or to test functionality that spans datasets in one place. I'm happy to do that if you think it's a good idea.
I made a small change to test_rcv1 where I think the test was running out of memory attempting to test the entire array returned.
The codecov/patch check is marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - does that indicate that the previous diff does not reach a high enough test coverage percentage?
@jnothman looking at the datasets/tests/test_common.py idea for return_X_y.
Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_.py files into tests/test_common.py:
While looping over the limited set of datasets which accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99 and various fetches from base.py)
In sum - while the actual checks on the X_y tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would duplicate that part of the test logic.
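One way to picture the trade-off is a shared check that takes an already-configured fetcher, so that only the fetching stays dataset-specific. The sketch below is an assumption about how that could be arranged, not the code in this PR: `check_return_X_y`, `fetch_alpha`, and `fetch_beta` are hypothetical names, and the fetchers return made-up data in plain dicts.

```python
from functools import partial


def check_return_X_y(bunch, fetch_func_partial):
    """Shared check: the (X, y) tuple must match the dict-style result."""
    X_y_tuple = fetch_func_partial(return_X_y=True)
    assert isinstance(X_y_tuple, tuple)
    assert len(X_y_tuple) == 2
    X, y = X_y_tuple
    assert X == bunch["data"]
    assert y == bunch["target"]


# Two hypothetical fetchers whose other arguments differ, as in the real tests.
def fetch_alpha(subset="train", return_X_y=False):
    rows = {"train": [[1, 2], [3, 4]], "test": [[5, 6]]}[subset]
    labels = {"train": [0, 1], "test": [1]}[subset]
    if return_X_y:
        return rows, labels
    return {"data": rows, "target": labels}


def fetch_beta(shuffle=False, return_X_y=False):
    rows, labels = [[0.5], [1.5], [2.5]], [1, 0, 1]
    if shuffle:
        # Deterministic "shuffle" (a reversal) so both calls agree.
        rows, labels = rows[::-1], labels[::-1]
    if return_X_y:
        return rows, labels
    return {"data": rows, "target": labels}


# The dataset-specific part is reduced to binding each fetcher's arguments.
cases = [
    partial(fetch_alpha, subset="test"),
    partial(fetch_beta, shuffle=True),
]
for fetch in cases:
    check_return_X_y(fetch(), fetch)
```

The comparison logic lives in one place, while each test file would still own its fetch call and arguments, which is the part that differs by dataset.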
While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would
If I've misunderstood or mischaracterized your proposal, please let me know. If you feel like test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts.
Also thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.
@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing goes, it looked to me as if there was more than just X and y to return.
so it didn't seem to make sense to return just the
I'm not sure which other datasets, beyond that one, might benefit from return_X_y?
So shall I add return_X_y tests to these two?