[MRG+1] FIX: Handling of multilabel targets in cross_val_score #3128

Closed

nmayorov wants to merge 10 commits into scikit-learn:master from nmayorov:cross_val_score_multilabel_strategy

Contributor

nmayorov commented May 3, 2014

Currently StratifiedKFold is used as the split strategy in cross_val_score, but it won't work correctly if y is multilabel, i.e. a label indicator matrix or a sequence of sequences.

I added a couple of lines that check the type of y; if it's multilabel, the simple KFold strategy is used instead.
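In essence (a minimal sketch of the idea using the 0.15-era sklearn.cross_validation API; the helper name here is illustrative, not the patch's):

from sklearn.cross_validation import KFold, StratifiedKFold
from sklearn.utils.multiclass import is_multilabel

def _choose_cv(y, n_folds=3):
    # Stratification is only defined for single-label targets, so
    # fall back to plain KFold when y is multilabel.
    if is_multilabel(y):
        return KFold(len(y), n_folds)
    return StratifiedKFold(y, n_folds)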

@jnothman jnothman commented on an outdated diff May 3, 2014

sklearn/cross_validation.py
@@ -1327,7 +1328,14 @@ def _check_cv(cv, X=None, y=None, classifier=False, warn_mask=False):
else:
needs_indices = None
if classifier:
- cv = StratifiedKFold(y, cv, indices=needs_indices)
+ if is_multilabel(y):
+ if is_sequence_of_sequences(y):

jnothman May 3, 2014

Owner

Perhaps just factor out n_samples = _num_samples(X) before if is_multilabel.
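That is, roughly (a sketch of the suggested refactor; cv, needs_indices, and _num_samples come from the surrounding _check_cv code):

# Compute the sample count once from X, so it need not be derived
# from a possibly ragged multilabel y.
n_samples = _num_samples(X)
if is_multilabel(y):
    cv = KFold(n_samples, cv, indices=needs_indices)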

Owner

arjoly commented May 4, 2014

Do you think you can come up with a non-regression test?

Nikolay Mayorov Added tests 4186103
Contributor

nmayorov commented May 4, 2014

Hello! I added tests, but I'm not sure how good they are. Please take a look.

Nikolay Mayorov Modified tests 011b5f2

Coverage Status

Coverage remained the same when pulling 011b5f2 on nmayorov:cross_val_score_multilabel_strategy into 0dbbf19 on scikit-learn:master.

@arjoly arjoly commented on an outdated diff May 5, 2014

sklearn/cross_validation.py
@@ -1327,7 +1328,10 @@ def _check_cv(cv, X=None, y=None, classifier=False, warn_mask=False):
else:
needs_indices = None
if classifier:
- cv = StratifiedKFold(y, cv, indices=needs_indices)
+ if is_multilabel(y):

arjoly May 5, 2014

Owner

You could easily fix check_cv for 'multiclass-multioutput' using if type_of_target(y) in ["binary", "multiclass"].
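Spelled out, that suggestion would read roughly as follows (a sketch; cv, needs_indices, and _num_samples as in the surrounding _check_cv code):

from sklearn.utils.multiclass import type_of_target

if type_of_target(y) in ["binary", "multiclass"]:
    cv = StratifiedKFold(y, cv, indices=needs_indices)
else:
    # multilabel and multiclass-multioutput targets cannot be stratified
    cv = KFold(_num_samples(X), cv, indices=needs_indices)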

Owner

arjoly commented May 5, 2014

I would also fix check_cv for 'multiclass-multioutput'.

Otherwise +1.

Nikolay Mayorov added some commits May 5, 2014

Contributor

nmayorov commented May 5, 2014

Good idea, I implemented it.

Owner

arjoly commented May 6, 2014

still +1. I renamed the title with [MRG+1].

arjoly changed the title from FIX: Handling of multilabel targets in cross_val_score to [MRG+1] FIX: Handling of multilabel targets in cross_val_score May 6, 2014

arjoly added the Bug label May 11, 2014

Owner

ogrisel commented May 16, 2014

Could you please add an end-to-end integration test that checks that calling cross_val_score with a multilabel scoring function such as "precision" on a linear classifier trained on toy, hand-verifiable multilabel data works as expected?

Owner

ogrisel commented May 16, 2014

@nmayorov I realize that this last review request was slow to come. If you no longer have the time to work on this, please feel free to tell us and I will finish up the PR myself.

Contributor

nmayorov commented May 16, 2014

Hello. I would like to finish it myself, if you don't mind. I think I'll do that in the next couple of days.

Contributor

nmayorov commented May 28, 2014

Hi! I finally wrote this test.

I decided to use a 1-nearest-neighbor classifier, because it's the easiest one to verify by hand. Is that OK? Please check.
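In spirit, the test does something like this (an illustrative sketch with the 0.15-era API, not the exact test code; the toy data is the one shown in the diff below):

import numpy as np
from sklearn import cross_validation as cval
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, precision_score

X = np.array([[-3, 4], [2, 4], [3, 3], [0, 2], [-3, 1],
              [-2, 1], [0, 0], [-2, -1], [-1, -2], [1, -2]])
y = np.array([[1, 1], [0, 1], [0, 1], [0, 1], [1, 1],
              [0, 1], [1, 0], [1, 1], [1, 0], [0, 0]])

clf = KNeighborsClassifier(n_neighbors=1)
# Sample-averaged precision is easy to verify by hand on 10 points.
scorer = make_scorer(precision_score, average="samples")
# With the fix, the multilabel y makes _check_cv pick plain KFold.
scores = cval.cross_val_score(clf, X, y, scoring=scorer, cv=5)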

Coverage Status

Coverage increased (+0.02%) when pulling 8baa033 on nmayorov:cross_val_score_multilabel_strategy into 0dbbf19 on scikit-learn:master.

Owner

arjoly commented Jun 3, 2014

I decided to use a 1-nearest-neighbor classifier, because it's the easiest one to verify by hand. Is that OK? Please check.

For me, it's ok.

@arjoly arjoly commented on an outdated diff Jun 3, 2014

sklearn/tests/test_cross_validation.py
+ assert_true(isinstance(cv, cval.KFold))
+
+ y_indicator_matrix = LabelBinarizer().fit_transform(y_seq_of_seqs)
+ cv = cval._check_cv(3, X, y_indicator_matrix, classifier=True)
+ assert_true(isinstance(cv, cval.KFold))
+
+ y_multioutput = np.array([[1, 2], [0, 3], [0, 0], [3, 1], [2, 0]])
+ cv = cval._check_cv(3, X, y_multioutput, classifier=True)
+ assert_true(isinstance(cv, cval.KFold))
+
+
+def test_cross_val_score_multilabel():
+ X = [[-3, 4], [2, 4], [3, 3], [0, 2], [-3, 1],
+ [-2, 1], [0, 0], [-2, -1], [-1, -2], [1, -2]]
+ y = [[1, 1], [0, 1], [0, 1], [0, 1], [1, 1],
+ [0, 1], [1, 0], [1, 1], [1, 0], [0, 0]]

arjoly Jun 3, 2014

Owner

Can you wrap this into an np.array?

jnothman added this to the 0.15 milestone Jun 11, 2014

Owner

jnothman commented Jun 11, 2014

@ogrisel, I've tagged this 0.15. I hope we can get it into the final version.

Owner

vene commented Jul 15, 2014

This change influences more than cross_val_score, right? It should also change the behaviour of GridSearchCV and maybe other places. This is good IMO. The default behaviour should be explicitly documented, however.
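For instance (a sketch with the 0.15-era API; X and y as in the multilabel toy data above), GridSearchCV goes through the same _check_cv helper, so it would now also get plain KFold for a multilabel y:

from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3]},
                    cv=3)
# With a multilabel y, the internal _check_cv now selects plain KFold.
grid.fit(X, y)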

Owner

arjoly commented Jul 16, 2014

It would be nice to get this into 0.15.1.

Contributor

nmayorov commented Jul 18, 2014

Hi!

The current documentation in cross_val_score states:

cv : cross-validation generator, optional, default: None

A cross-validation generator. If None, a 3-fold cross validation is used 
or 3-fold stratified cross-validation when y is supplied and estimator is a classifier.

I think it indeed needs to be modified.
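For example, it could be updated along these lines (a hypothetical wording, not the final text):

cv : cross-validation generator or integer, optional, default: None

A cross-validation generator or the number of folds. If None, 3-fold
cross-validation is used. Stratified folds are used only when the estimator
is a classifier and y is binary or multiclass; for multilabel or
multioutput targets, plain (unstratified) KFold is used.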

In GridSearch classes we have

cv : integer or cross-validation generator, optional

If an integer is passed, it is the number of folds (default 3). 
Specific cross-validation objects  can be passed, see sklearn.cross_validation 
module for the list of possible objects

Seems like it doesn't contradict anything. (No change is required.)

Should I try to modify the docstring in cross_val_score? Is there anything else?

Owner

arjoly commented Jul 18, 2014

Yes please :-)

@amueller amueller modified the milestone: 0.15.1, 0.15 Jul 18, 2014

Owner

ogrisel commented Jul 29, 2014

+1 as well. Closing in favor of the squashed / rebased version in #3496. Thanks @nmayorov!

ogrisel closed this Jul 29, 2014

nmayorov deleted the nmayorov:cross_val_score_multilabel_strategy branch Sep 30, 2014
