GroupKFold fails in nested cross-validation (similar to #2879) #7646

Open
davidslater opened this issue Oct 11, 2016 · 9 comments · May be fixed by #9566

@davidslater davidslater commented Oct 11, 2016

Description

The groups parameter in model_selection.cross_val_score() is not propagated into the RandomizedSearchCV.fit() call. This is similar to #2879 and probably best addressed in #4497.

Steps/Code to Reproduce

import numpy as np
from sklearn.utils.validation import indexable
from sklearn import linear_model
from sklearn import model_selection

# generate data with simple decision boundary, with 2 labels and 2 groups per label
X = np.array(range(20)).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)
groups = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5)

# run nested cross-validation (works with StratifiedKFold, but not GroupKFold)
clf = linear_model.LogisticRegression()
#cv = model_selection.StratifiedKFold(n_splits=2)
cv = model_selection.GroupKFold(n_splits=2)
param_dist = {'penalty': ['l1', 'l2'], 'C': np.logspace(-3, 3, 13)}
random_search = model_selection.RandomizedSearchCV(clf, cv=cv, param_distributions=param_dist, n_iter=20)
print(model_selection.cross_val_score(random_search, X, y=y, groups=groups, cv=cv))

Expected Results

When StratifiedKFold is used, the output is [ 0.8 0.7]. In general, it should be an array of 2 floats.

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
    for train, test in cv.split(X, y, groups))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 322, in __init__
    self.results = batch()
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 1185, in fit
    return self._fit(X, y, groups, sampled_params)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 562, in _fit
    for parameters in parameter_iterable
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 603, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 127, in __init__
    self.items = list(iterator_slice)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 563, in <genexpr>
    for train, test in cv.split(X, y, groups))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 321, in split
    for train, test in super(_BaseKFold, self).split(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 90, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 102, in _iter_test_masks
    for test_index in self._iter_test_indices(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 474, in _iter_test_indices
    raise ValueError("The groups parameter should not be None")
ValueError: The groups parameter should not be None
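
The last frames of the traceback can be reproduced without any nesting. The snippet below is only a minimal sketch, reusing the toy arrays from the report: it calls GroupKFold.split directly, once with groups=None (which is what the inner search effectively receives) and once with the real groups. The variable names error_message and n_splits_with_groups are illustrative, and the exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Same toy data as in the report above.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)
groups = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5)
cv = GroupKFold(n_splits=2)

# split() is lazy, so the error only appears once the generator is consumed.
try:
    list(cv.split(X, y, groups=None))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With groups supplied, the same splitter yields the requested two folds.
n_splits_with_groups = len(list(cv.split(X, y, groups=groups)))

print(error_message)
print(n_splits_with_groups)
```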

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.11 (default, Jan 22 2016, 08:29:18) \n[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18')

Author

@davidslater davidslater commented Oct 11, 2016

In particular, the estimator.fit(X_train, y_train, **fit_params) call in model_selection._validation.py does not include "groups" in fit_params, so it defaults to None.
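
Until groups are routed automatically, one possible workaround is to write the outer loop by hand and pass each inner search its own slice of groups through the groups argument of fit. This is a sketch under assumptions, not a proposed fix: nested_group_cv is a hypothetical helper, and the toy data is deliberately different from the report (8 groups that each contain both classes) so that every inner training fold sees two labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, RandomizedSearchCV


def nested_group_cv(estimator, param_distributions, X, y, groups,
                    n_outer=2, n_inner=2, n_iter=3, random_state=0):
    """Hypothetical helper: nested CV where both levels split by group."""
    outer_cv = GroupKFold(n_splits=n_outer)
    scores = []
    for train, test in outer_cv.split(X, y, groups):
        search = RandomizedSearchCV(
            estimator,
            param_distributions=param_distributions,
            n_iter=n_iter,
            cv=GroupKFold(n_splits=n_inner),
            random_state=random_state,
        )
        # The inner search receives only the groups of the outer training fold.
        search.fit(X[train], y[train], groups=groups[train])
        # score() uses the estimator refit with the best parameters found.
        scores.append(search.score(X[test], y[test]))
    return np.array(scores)


# Illustrative data: alternating labels, 8 groups of 5 consecutive samples.
y = np.tile([0, 1], 20)
X = (y + 0.01 * np.arange(40)).reshape(-1, 1)
groups = np.repeat(np.arange(8), 5)

scores = nested_group_cv(
    LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, X, y, groups)
print(scores)
```

The cost of this approach is that the convenience of cross_val_score (parallelism, scorer handling) has to be rebuilt by the caller.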

@jnothman jnothman added the Bug label Oct 13, 2016
@jnothman jnothman added this to the 0.18.1 milestone Oct 13, 2016
Member

@jnothman jnothman commented Oct 13, 2016

Thanks for the report @davidslater. I'm not sure if this is to be fixed for 0.18.1 (it only applies to 0.18, but AFAIK this kind of nested CV wasn't possible before, so it's not exactly a regression), but I've labelled as such.

Member

@amueller amueller commented Oct 14, 2016

I agree we need #4497 to fix this, and I don't see how we could do it for 0.18.1 - except special-casing the groups parameter to be passed to the cross-validation but not the estimator.

@amueller amueller added the API label Oct 14, 2016
Member

@amueller amueller commented Oct 14, 2016

"need contributor"? I don't think we agree on a fix, do we?

Member

@jnothman jnothman commented Oct 15, 2016

Or just some kind of routing parameter specific to CV, seeing as we already accept a groups param to GridSearchCV.fit?


Member

@GaelVaroquaux GaelVaroquaux commented Oct 17, 2016

I really do think that we need #4497 to address this. It's an issue that is very dear to my heart, but I don't think that rushing it is a good idea. This is probably material for a sprint discussion.

Member

@raghavrv raghavrv commented Oct 21, 2016

Yes, I don't think this should be tagged 0.18.1...

Contributor

@aryamccarthy aryamccarthy commented Mar 26, 2017

Bump. Is there an agreement on how to approach this?

Member

@jnothman jnothman commented Mar 26, 2017
