[MRG+1] allow y to be a list in GridSearchCV, cross_val_score and train_test_split #2694

Merged
merged 4 commits into from Jul 16, 2014

Conversation

Projects
None yet
6 participants
Owner

amueller commented Dec 27, 2013

Fixes #2508.

This makes input validation consistent across the three cross-validating functions and less intrusive.

I am not entirely sure this is the right thing to do, as it introduces some slight backward incompatibilities. When X or y was given as a list to train_test_split, it used to be converted to an array, but no longer is. In the case of X this could be considered a bug fix (for lists of strings), though there is an off chance that it breaks someone's code. GridSearchCV now passes y through as a list, without converting. I think that is fine and very unlikely to break anyone's code (an estimator should handle exactly the same kind of input GridSearchCV gets).

Coverage Status

Coverage remained the same when pulling d41f245 on amueller:allow_y_lists into e2c0306 on scikit-learn:master.

Owner

agramfort commented Dec 27, 2013

LGTM +1 for merge

Owner

amueller commented Jan 6, 2014

grrr, I missed something: boolean cv masks are still allowed, and I think that will break with this PR. Need to investigate :(

Owner

amueller commented Jan 6, 2014

Huh, never mind... I don't remember what safe_mask was for. I will check, but it didn't do what I thought it did. (GridSearchCV refuses to split sparse X with indices=False.)
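For context, here is a minimal sketch of the idea behind a mask-to-indices helper like safe_mask (this is not scikit-learn's actual implementation, and the function name is hypothetical): sparse matrices historically could not be sliced with a boolean mask, so the mask has to be converted to integer indices first.

```python
import numpy as np

def mask_to_indices(mask):
    """Hypothetical helper: convert a boolean mask into integer indices,
    since sparse matrices could not be sliced with a boolean mask directly."""
    mask = np.asarray(mask)
    if mask.dtype == bool:
        return np.flatnonzero(mask)
    return mask  # already integer indices; pass through unchanged

print(mask_to_indices([True, False, True, True]))  # [0 2 3]
print(mask_to_indices([1, 3]))                     # [1 3]
```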

Owner

amueller commented Jan 6, 2014

Ok, there is/was a redundant check of the indices: once in check_cv (where they might be refused) and once in fit_grid_point (where they are converted if necessary). So if someone previously used fit_grid_point directly with boolean masks on sparse X (which worked despite what check_cv was claiming), this will break it. I will check whether that actually worked before, then fix it and remove the restriction in check_cv because it is stupid ;)

Owner

amueller commented Jan 6, 2014

Actually, I think I'll rather leave it as it is, since that keeps the code simpler. So never mind my rambling.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jan 6, 2014

sklearn/cross_validation.py
@@ -1369,7 +1370,7 @@ def train_test_split(*arrays, **options):
[4, 5],
[6, 7],
[8, 9]])
- >>> list(b)
@GaelVaroquaux

GaelVaroquaux Jan 6, 2014

Owner

That was necessary for Python 3, I believe, as in Python 3 range(5) is not a list.

@amueller

amueller Jan 7, 2014

Owner

oh, I didn't know that.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jan 6, 2014

sklearn/cross_validation.py
@@ -1396,18 +1397,21 @@ def train_test_split(*arrays, **options):
train_size = options.pop('train_size', None)
random_state = options.pop('random_state', None)
options['sparse_format'] = 'csr'
+ if not "allow_lists" in options:
+ options["allow_lists"] = True
@GaelVaroquaux

GaelVaroquaux Jan 6, 2014

Owner

What's the logic for preferring lists? I tend to always prefer arrays, which I know have a uniform type, are more compact in memory, and support fancy indexing.

@amueller

amueller Jan 7, 2014

Owner

The idea was not to prefer lists (I would also prefer arrays), but to be as non-intrusive as possible and delay the conversion.
In particular here any of the inputs might be an X that is a list of strings and should not be converted to an array.

@GaelVaroquaux

GaelVaroquaux Jan 7, 2014

Owner

The idea was not to prefer lists (I would also prefer arrays), but to be as
non-intrusive as possible and delay the conversion.

Right, but it makes it harder to work with the code, because we don't
know what the data type might be, and thus we have to code defensively.

In particular here any of the inputs might be an X that is a list of strings
and should not be converted to an array.

That's a pretty good argument. I would however say that we should convert
to arrays because if you have big data, you shouldn't be working with
strings anyhow.

@amueller

amueller Jan 7, 2014

Owner

No, we don't have to code more defensively than before as the estimators already accept lists as inputs.

This is the behavior of GridSearchCV and cross_val_score. I consider it a bug (possibly my fault) that it was different for train_test_split. I don't understand your argument about strings. This is about doing text processing in a pipeline.

Coverage Status

Coverage remained the same when pulling 52f305f on amueller:allow_y_lists into ca55b40 on scikit-learn:master.

Owner

amueller commented Feb 2, 2014

Could I please get reviews on this one?

Owner

GaelVaroquaux commented Feb 2, 2014

In general I am +1 for merging.

However, I think that if we want to handle lists as inputs to estimators, we need to add a test to test_common, or it will not be homogeneous across the project. That's probably for a later PR.

Owner

agramfort commented Feb 3, 2014

can you rebase as it cannot be merged currently? thx

Owner

jnothman commented Feb 4, 2014

However, I think that if we want to handle lists as inputs to estimators, we need to add a test to test_common, or it will not be homogeneous across the project. That's probably for a later PR.

It's necessary to handle non-arrays specially, or else the legacy list-of-lists multi-label format could break when the number of labels per sample is constant (np.asarray would interpret it as 2d).
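The hazard described above is easy to demonstrate: when every sample has the same number of labels, np.asarray cannot distinguish the list-of-lists multi-label format from an ordinary 2-d target array.

```python
import numpy as np

# Three multilabel samples, each with exactly two labels. np.asarray
# cannot tell this apart from a 2-d array of shape (3, 2), so a naive
# conversion silently changes the meaning of the data.
constant_labels = [[0, 1], [1, 2], [0, 2]]
print(np.asarray(constant_labels).shape)  # (3, 2) -- looks like a 2-d y
```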

Owner

GaelVaroquaux commented Feb 4, 2014

It's necessary to handle non-arrays specially, or else the legacy list
of lists multi-label format could break when the number of labels per
sample is constant (np.asarray would interpret it as 2d).

Ouch. That's lame.

At what version can we stop supporting the list of list multi-label
format?

Owner

jnothman commented Feb 4, 2014

At what version can we stop supporting the list of list multi-label
format?

I guess that depends how quickly we complete and merge the PRs that are
their deprecation (#2657) and replacement (#2458).


Owner

amueller commented Mar 28, 2014

Sorry, I don't think I understand the issue, @jnothman. What will break? With this PR?

Owner

amueller commented Mar 28, 2014

This bug just caused my machine to swap when I tried to do
text_train, text_test, y_train, y_test = train_test_split(text, y)
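The swap is easy to reproduce in miniature: NumPy stores a string array with a fixed-width unicode dtype, padding every element to the length of the longest string, so converting a corpus with even one long document can multiply memory use enormously.

```python
import numpy as np

docs = ["hi", "ok", "x" * 10_000]  # one long document in a small corpus
arr = np.asarray(docs)

# Every element is padded to the longest string, at 4 bytes per character
# for NumPy's unicode dtype, regardless of its actual length.
print(arr.dtype)                   # <U10000
print(arr.nbytes)                  # 3 * 10000 * 4 = 120000 bytes
print(sum(len(d) for d in docs))   # only 10004 characters of real data
```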

Owner

jnothman commented Mar 29, 2014

I didn't say anything would break. I answered a question as to why we need
to support non-arrays.


Owner

amueller commented Apr 28, 2014

rebased. sorry for the delay.

Coverage Status

Coverage remained the same when pulling 71489f3 on amueller:allow_y_lists into b3affdb on scikit-learn:master.

Owner

agramfort commented Apr 28, 2014

LGTM

thanks @amueller

Owner

jnothman commented May 3, 2014

There has been some recent interest in this feature and we should make sure it is merged for the 0.15 release.

However, some new cv functions have been introduced since it was written: the learning curve utilities also need similar support. Please check they have it.

Owner

amueller commented May 4, 2014

Can we not merge this and then check the other functions? Why does this PR need to wait?

Owner

amueller commented May 4, 2014

Btw, they all use _safe_split. It could be that y is converted to an array somewhere in between, though.

Maybe to clear this up again. This is mainly a bug-fix for working with text data, while trying to maintain a reasonable interface.

Previously, doing train_test_split(text_data, y) made the memory blow up because the text was converted to an array.

Owner

jnothman commented May 4, 2014

Well, apart from perhaps covering those new functions, and that I think safe_indexing might be better as safe_take, this looks good to me.

Owner

amueller commented Jun 7, 2014

@ogrisel what do you think about including this in the release?

@jnothman jnothman commented on the diff Jun 9, 2014

sklearn/utils/__init__.py
@@ -129,6 +133,25 @@ def safe_mask(X, mask):
return mask
+def safe_indexing(X, indices):
@jnothman

jnothman Jun 9, 2014

Owner

Looking at this again, shouldn't you still need to call safe_mask? Or are we assured, in all cases where safe_indexing is called, that indices is an integer array, not a boolean mask? I think, for instance, we are not assured of this in _safe_split until the deprecation of cross-validation generators' indices=False is complete.

@amueller

amueller Jul 16, 2014

Owner

Actually, the check_cv function takes care of this: if indices=False, it will bail on lists and sparse matrices.
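For reference, a minimal sketch of the idea behind safe_indexing (not the actual scikit-learn implementation, and the function name here is illustrative): arrays are indexed with take, while plain lists fall back to a comprehension, so nothing is converted to an array up front.

```python
import numpy as np

def safe_indexing_sketch(X, indices):
    # Arrays (and anything exposing .take) support fancy indexing directly;
    # plain Python lists are indexed element by element instead of being
    # converted to an array first.
    if hasattr(X, "take"):
        return X.take(indices, axis=0)
    return [X[i] for i in indices]

print(safe_indexing_sketch(np.arange(10), [0, 2, 5]))  # [0 2 5]
print(safe_indexing_sketch(list("abcdef"), [1, 3]))    # ['b', 'd']
```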

Owner

ogrisel commented Jun 9, 2014

@ogrisel what do you think about including this in the release?

I had not reviewed this PR yet. Will try to have a look at it this week. I put the 0.15 milestone on it to not forget to backport it to 0.15.X.

ogrisel added this to the 0.15 milestone Jun 9, 2014

ogrisel changed the title from MRG allow y to be a list in GridSearchCV, cross_val_score and train_test_split to [MRG] allow y to be a list in GridSearchCV, cross_val_score and train_test_split Jun 9, 2014

Owner

amueller commented Jun 14, 2014

I'll try to address the comments above this weekend, but I can't promise anything :-/

Owner

amueller commented Jul 16, 2014

I added tests. Anything else? This also works with the new learning_curve etc. Should I also add tests using the mock classifiers there? I'm not sure which tests @GaelVaroquaux wanted to see.

Coverage Status

Changes Unknown when pulling 5fe7c2c on amueller:allow_y_lists into * on scikit-learn:master.

@ogrisel ogrisel and 1 other commented on an outdated diff Jul 16, 2014

sklearn/tests/test_grid_search.py
"""
- def __init__(self, foo_param=0):
+ def __init__(self, foo_param=0, check_y=False, check_X=True):
@ogrisel

ogrisel Jul 16, 2014

Owner

Minor suggestion: renaming these to check_y_is_list & check_X_is_list would maybe make the tests easier to follow.

Owner

ogrisel commented Jul 16, 2014

@amueller please consider amueller#21 as a non-regression test for this.

Owner

amueller commented Jul 16, 2014

Merged/pushed -- @ogrisel's PR, that is.

Owner

ogrisel commented Jul 16, 2014

BTW, I agree with @jnothman about renaming safe_indexing to safe_take to better match the naming convention of the numpy.take function.

Owner

ogrisel commented Jul 16, 2014

Other than that +1 for merge and +1 for backporting to 0.15.X for inclusion in 0.15.1 as well.

Owner

ogrisel commented Jul 16, 2014

BTW, I agree with @jnothman about renaming safe_indexing to safe_take to better match the naming convention of the numpy.take function.

I misread @jnothman's comment. Let's forget about this.

Owner

ogrisel commented Jul 16, 2014

@amueller @GaelVaroquaux @jnothman @agramfort tests are green. Ok to merge and backport to 0.15.X?

amueller changed the title from [MRG] allow y to be a list in GridSearchCV, cross_val_score and train_test_split to [MRG+1] allow y to be a list in GridSearchCV, cross_val_score and train_test_split Jul 16, 2014

@GaelVaroquaux GaelVaroquaux commented on the diff Jul 16, 2014

sklearn/feature_extraction/tests/test_text.py
@@ -780,6 +778,21 @@ def test_vectorizer_pipeline_grid_selection():
assert_false(best_vectorizer.fixed_vocabulary)
+def test_vectorizer_pipeline_cross_validation():
+ # raw documents
+ data = JUNK_FOOD_DOCS + NOTJUNK_FOOD_DOCS
+
+ # label junk food as -1, the others as +1
+ target = [-1] * len(JUNK_FOOD_DOCS) + [1] * len(NOTJUNK_FOOD_DOCS)
+
@GaelVaroquaux

GaelVaroquaux Jul 16, 2014

Owner

Extra empty line.

Owner

GaelVaroquaux commented Jul 16, 2014

LGTM. Should I merge?

Owner

amueller commented Jul 16, 2014

feel free. Should also go into 0.15.X methinks.

Owner

GaelVaroquaux commented Jul 16, 2014

feel free. Should also go into 0.15.X methinks.

OK, I'll do this, and the backport.

@GaelVaroquaux GaelVaroquaux added a commit that referenced this pull request Jul 16, 2014

@GaelVaroquaux GaelVaroquaux Merge pull request #2694 from amueller/allow_y_lists
[MRG+1] allow y to be a list in GridSearchCV, cross_val_score and train_test_split
2e7ef3c

@GaelVaroquaux GaelVaroquaux merged commit 2e7ef3c into scikit-learn:master Jul 16, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Owner

GaelVaroquaux commented Jul 16, 2014

Backported to 0.15.X

Owner

amueller commented Jul 16, 2014

thanks :)

Owner

ogrisel commented Jul 17, 2014

thanks!

amueller deleted the amueller:allow_y_lists branch May 19, 2017
