New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] ENH/MNT Rename labels --> groups #6660

Merged
merged 5 commits into from Sep 11, 2016

Conversation

Projects
None yet
9 participants
@raghavrv
Member

raghavrv commented Apr 14, 2016

Partially fixes #5053, fixes #7210

Vote

Gael : -0
Joel : +0.5?
Alex : -1 +1
Vene : +1
Andy : -1+1

TODO

  • Renames Leave*LabelOut --> Leave*GroupOut
  • Renames labels parameter to groups.
  • Renames LabelKFold --> GroupKFold
  • Renames LabelShuffleSplit --> GroupShuffleSplit
  • Fix doctests
  • Make all the tests pass
  • Fix examples
  • Fix docs
  • Split cv section into two, one for IID data and one for grouped data...

Please take a look at this @jnothman @amueller @MechCoder @vene

REF

I raised this as a separate PR as I felt it warrants its own discussions.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 14, 2016

@raghavrv raghavrv changed the title from [MRG] ENH/MNT Rename labels --> groups to [WIP] ENH/MNT Rename labels --> groups Apr 14, 2016

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Apr 14, 2016

-0

I cringe at breaking backward compatibility for that. People rely on us. Docs and books are written.

@jnothman

This comment has been minimized.

Member

jnothman commented Apr 14, 2016

We're already renaming the module... but yes, searchability is valuable too.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 14, 2016

Copying my comment from the other PR -

I think the motivation for renaming labels --> groups is to demarcate clearly what we mean by labels. (As labels can also mean the target or class labels).

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Apr 14, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 14, 2016

Quoting a previous discussion at #4294

@vene wrote -

Based on what the code does, the labels in permutation_test_score is a "group label" (observable at runtime, sample property, yadda yadda).

Permutation tests are a kind of non-parametric test where you approximate the data distribution by shuffling the "labels" class labels (ys, argh!) of the observed data. If the data is grouped (e.g. multiple measurements of the same fMRI subject, multiple documents for the same query), the labels param is used to only shuffle y within each group. I guess a reason would be so that class probabilities keep the same distribution.

So I think labels should be passed to the inner CV. Prior to this PR, this function was probably used like permutation_test_score(X, y, labels, cv=LeavePLabelOut(len(y), labels...)). Does this pose any problems?

I think the person to ping about this is indeed @agramfort, based on some online threads I've found about permutation tests :)

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 14, 2016

We had this discussion on whether we should use the labels as group labels for the cross-validator of permutation_test_score as well as for the permutation. That also needs to be addressed. Currently we use the labels for both permuting and as the group labels for the cv.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 14, 2016

@GaelVaroquaux I'm just populating this PR with relevant discussions for us to refer to :)

And

Yet, is it worth breaking the dozen of books that have been written using scikit-learn, and user code? Note that the breakage is more subttle than the change that we have just done.

But like Joel said model_selection is not yet released. May we could use this opportunity to make these kind of changes? Once released these issues become more pressing. We will be more compelled to not make these changes. However if the change brought about by this PR is not so useful, maybe we could leave things as such...

@agramfort

This comment has been minimized.

Member

agramfort commented Apr 14, 2016

@raghavrv raghavrv changed the title from [WIP] ENH/MNT Rename labels --> groups to [RFC/WIP] ENH/MNT Rename labels --> groups Apr 14, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Apr 15, 2016

@vene at #5053 wrote -

That's my view as well. labels was a weird, overloaded word in this context. I was personally very confused about what LeavePLabelOut did for a while, because I was thinking of y as labels. I thought it's some sort of zero-shot learning evaluation.

Groups is much better; though there could be confusion with, say, GroupLasso.

Such breakage is annoying on two accounts. (a) user code will need to change: but this will happen anyway because of the move to model_selection. (b) books becoming out of date, as @GaelVaroquaux points out.

Now, my first impulse is to say that (b) is also moot because of model_selection but it's not that simple. It's easier to realize that cross_validation.LeavePLabelOut became model_selection.LeavePLabelOut rather than if it becomes model_selection.LeavePGroupOut.

Still, I think that if you at least know what your intention is, and understand the word "Group", the latter can be easily identified as relevant when tabbing through the imports.

(In light of this, I would have preferred LabelKFold to be KFoldLabels so that it would be easy to type KFold[tab] and find KFoldGroups instead...)

@raghavrv

This comment has been minimized.

Member

raghavrv commented May 11, 2016

I'm closing as we have agreed this is not quite useful. Feel free to comment/reopen.

@raghavrv raghavrv closed this May 11, 2016

@raghavrv raghavrv deleted the raghavrv:rename_labels_to_groups branch May 11, 2016

@raghavrv raghavrv restored the raghavrv:rename_labels_to_groups branch Aug 20, 2016

@raghavrv raghavrv reopened this Aug 20, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Aug 20, 2016

I'm reopening this in light of #7210

@amueller amueller added this to the 0.18 milestone Aug 22, 2016

@amueller

This comment has been minimized.

Member

amueller commented Aug 22, 2016

looks like @agramfort changed his mind between this issue and #7210 ;)

@agramfort

This comment has been minimized.

Member

agramfort commented Aug 23, 2016

@raghavrv raghavrv changed the title from [RFC/WIP] ENH/MNT Rename labels --> groups to [MRG] ENH/MNT Rename labels --> groups Aug 23, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Aug 23, 2016

This is ready for reviews @jnothman @vene @amueller @agramfort

@agramfort

This comment has been minimized.

Member

agramfort commented Aug 23, 2016

LGTM

did you

"git grep Label" in model_selection

?

- :class:`LeavePLabelOut` **(p)**
- :class:`LeavePGroupOut` **(p)**

This comment has been minimized.

@amueller

amueller Aug 23, 2016

Member

Groups?

This comment has been minimized.

@vene
@@ -261,25 +261,25 @@ two slightly unbalanced classes::
[0 1 2 4 5 6 7] [3 8 9]
Label k-fold

This comment has been minimized.

@amueller

amueller Aug 23, 2016

Member

Can you please organize it such that there is a "IID data" and a "Grouped data" section that explains the difference?
Maybe "Splitters for Independent Data (iid)" "Splitters for Grouped Data" and briefly explain each one?

This comment has been minimized.

@vene

vene Aug 24, 2016

Member

Indeed. This way we can have a general explanation of what we mean by "grouped data" with a nice example. I'd prefer it to be example-driven. Currently the example in the group k-fold description is a bit of an afterthought.

Instead we can lead with the grouped subjects example, and explain why, for a fair evaluation of generalization, we must assume that test time subjects are never seen before, not just individual samples. Then explain the classes available to accomplish this.

@amueller

This comment has been minimized.

Member

amueller commented Aug 23, 2016

Maybe "Group indicators" instead of "Group labels"? Or do we want the hint towards the old name?

@amueller

This comment has been minimized.

Member

amueller commented Aug 23, 2016

Sooo... there's no example for any of these? hum....
+1 from me (I did a git grep).

Before merge, I want to hear @GaelVaroquaux and @jnothman though ;)

@amueller

This comment has been minimized.

Member

amueller commented Sep 9, 2016

I vote to merge after @ogrisel's comments are addressed. This is one of the last release blockers, and we can always improve the docstrings later.

Such a grouping of data is domain specific. An example would be when there is
medical data collected from multiple patients, with multiple samples taken from
each patient. And such data is likely to be dependent on the individual group
(generative process). In our example, the patient id for each sample will be

This comment has been minimized.

@ogrisel

ogrisel Sep 9, 2016

Member

Please drop "(generative process)".

medical data collected from multiple patients, with multiple samples taken from
each patient. And such data is likely to be dependent on the individual group
(generative process). In our example, the patient id for each sample will be
it's group identifier.

This comment has been minimized.

@ogrisel

ogrisel Sep 9, 2016

Member

"its"

Data is non-i.i.d. when all the samples are not created from the same
generative process. Sometimes the samples are created from multiple similar but
non-identical generative processes. Each generative process generates a 'group'
of data.

This comment has been minimized.

@ogrisel

ogrisel Sep 9, 2016

Member

Please replace this paragraph by the following sentence:

The i.i.d. assumption is broken if the underlying generative process yield groups of dependent samples.

In this case we would like to know if a model trained on a particular set of
groups generalizes well to the unseen groups. To measure this, we need to
ensure that the testing set will have one (or more) groups that were not
present in the training set.

This comment has been minimized.

@ogrisel

ogrisel Sep 9, 2016

Member

To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Sep 9, 2016

I think I've addressed all the comments. @amueller @jnothman @ogrisel @vene

group is not in both testing and training sets. This is necessary for example
if you obtained data from different subjects and you want to avoid over-fitting
(i.e., learning person specific features) by testing and training on different
subjects.

This comment has been minimized.

@ogrisel

ogrisel Sep 10, 2016

Member

Group k-fold cross-validation will not avoid make the model avoid over-fitting but makes it possible to detect it when it happens. Here is a suggested rephrasing:

class:GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. For example if the data is obtained from different subjects with several samples per-subject and if the model is flexible enough to learn from highly person specific features it could fail to generalize to new subjects. class:GroupKFold makes it possible to detect this kind of overfitting situations.

@ogrisel

This comment has been minimized.

Member

ogrisel commented Sep 10, 2016

raghavrv added some commits Sep 11, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Sep 11, 2016

@ogrisel Done! Thanks for the patience!

train/test set.
groups : array-like, with shape (n_samples,), optional
Group labels used to constrain the permutation to specific subsets of
data.

This comment has been minimized.

@ogrisel

This comment has been minimized.

@raghavrv

raghavrv Sep 11, 2016

Member

This was the improved version :p

Based on your comment, I'm guessing it's not clear? The first line says it's used for permutation and the NOTE which follows it clarifies that it is also used for label-cvs...

@ogrisel

This comment has been minimized.

Member

ogrisel commented Sep 11, 2016

Besides the remaining unaddressed comment by @jnothman, LGTM.

data.
Labels to constrain permutation within groups, i.e. ``y`` values
are permuted among samples with the same group identifier.
When not specified, ``y`` values are permuted among all samples.

This comment has been minimized.

@jnothman

jnothman Sep 11, 2016

Member

It'd be good to add that it's also passed to Group CV splitters...

This comment has been minimized.

@raghavrv

raghavrv Sep 11, 2016

Member

I added a NOTE below specially for that ;)

This comment has been minimized.

@jnothman

jnothman Sep 11, 2016

Member

Ah, sorry. Still not used to github's "viewing a subset of changes". I don't see why this is a NOTE as distinct from a parameter description, though.

This comment has been minimized.

@raghavrv

raghavrv Sep 11, 2016

Member

I felt the usage of GroupCV in permutation_test_score to be not a very common usecase... Would you rather have the NOTE heading removed?

This comment has been minimized.

@jnothman
@raghavrv

This comment has been minimized.

Member

raghavrv commented Sep 11, 2016

@jnothman Done.

@ogrisel

This comment has been minimized.

Member

ogrisel commented Sep 11, 2016

One of the travis job failed randomly (the execution of a timed joblib doctest took 24s instead of less than 1s usually). I restarted it.

@ogrisel ogrisel merged commit 9a12555 into scikit-learn:master Sep 11, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ogrisel

This comment has been minimized.

Member

ogrisel commented Sep 11, 2016

Squash merged! 🍻

@raghavrv raghavrv deleted the raghavrv:rename_labels_to_groups branch Sep 11, 2016

@raghavrv

This comment has been minimized.

Member

raghavrv commented Sep 11, 2016

🍻

@jnothman

This comment has been minimized.

Member

jnothman commented Sep 11, 2016

Thanks, Raghav!

On 12 September 2016 at 03:16, Raghav RV notifications@github.com wrote:

🍻


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#6660 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz64dJjU4zsouN0uOVq8yADwJNKKN9ks5qpDdQgaJpZM4IHUHL
.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Sep 12, 2016

Thank YOU for the patient reviews!

rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment