Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Data independent CV and model_selection module #3340

Conversation

pignacio
Copy link
Contributor

@pignacio pignacio commented Jul 3, 2014

Hi guys,

I'm working on #1848 and #2904, and would like to get some feedback on my changes.

Here's a writeup about the issues/changes: http://pignacio.github.io/jwoc-blog/stories/stuff-about-scikit-learn18482904.html

Any thoughts / suggestions / improvements / rants on the changes are greatly appreciated

@jnothman
Copy link
Member

jnothman commented Jul 4, 2014

Thanks, @pignacio. I can't look right now, but would like to note that we designate PRs with [WIP] in their title as "Work in progress (comments welcome)", as opposed to [MRG]("Ready to review for merge") and [MRG+1]("One reviewer has approved for merge")... So I'm popping [WIP] in front of this PR's title.

@jnothman jnothman changed the title Data independent CV and model_selection module [WIP] Data independent CV and model_selection module Jul 4, 2014
@jnothman
Copy link
Member

jnothman commented Jul 4, 2014

I don't think I get the need for _PartitionTestGenerator. You say:

As the same CV iterator can now be used several times (particularly in a nested way), a new object which controls the generation must be created each time, and must be independent from all others.

If def split(...) has yield statements in it, then calling split(...) will automatically create a new generator object, that is independent from other generators, as long as no shared state is modified.

@jnothman
Copy link
Member

jnothman commented Jul 4, 2014

Still not looking in detail, but I think we want to remove data-dependent parameters (e.g. n in KFold) from CV constructors. Even so, we need to maintain the old interface in sklearn.cross_validation


from sklearn.model_selection.search import GridSearchCV, RandomizedSearchCV
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use relative imports

@jnothman
Copy link
Member

jnothman commented Jul 4, 2014

I note that there will also be a raft of changes to the documentation

@pignacio
Copy link
Contributor Author

  • Fixed the minor issues
  • Leave*Out classes work with split now. *Shuffle* ones need some more work

I don't think I get the need for _PartitionTestGenerator [...]

You are right. _PartitionTestGenerator is not needed. At first glance, i thought there was some shared internal state that might cause problems, but it looks good.

Still not looking in detail, but I think we want to remove data-dependent parameters (e.g. n in KFold) from CV constructors. Even so, we need to maintain the old interface in sklearn.cross_validation

Right now I'm using defaults so that the same class can be used both ways:

for train, test in KFold(n, n_folds=3):
    print train, test

for train, test in KFold(n_folds=3).split(target):
    print train, test

Once the backwards compatibility is not needed anymore, we should be able to remove those parameters. Also, __iter__ now falls back to split(None) (Which means 'old behaviour, class members should be used'), which leads to some extra checks here and there, that will also be removed

I note that there will also be a raft of changes to the documentation

Yes, of course, also tests. Once we get the nasty details of this right, I'll get into that

I can't look right now, but would like to note that we designate PRs with [WIP] in their title as "Work in progress (comments welcome)", as opposed to MRG and MRG+1... So I'm popping [WIP] in front of this PR's title.

Sorry about that, I never intended to tag this like ready to merge or anything.

@jnothman
Copy link
Member

Sorry about that, I never intended to tag this like ready to merge or anything.

No problem, just letting you know why it changed.

@jnothman
Copy link
Member

Once the backwards compatibility is not needed anymore, we should be able to remove those parameters.

So, presumably, we should be raising a DeprecationWarning when any data-dependent arguments are filled. We would also delete these from the Parameters listing in the documentation, and emphasise in the documentation for n_folds that it must be specified by keyword.

I am worried this will lay too many usability traps, and I think @GaelVaroquaux was hoping that in renaming the path to this module, we could keep two separate copies of the functionality while deprecating, and not worry about backwards compatibility in the new one. Similarly, the new version could ignore any handling of the deprecated indices argument, etc.

@@ -388,7 +389,7 @@ def graph_lasso_path(X, alphas, cov_init=None, X_test=None, mode='cd',


class GraphLassoCV(GraphLasso):
"""Sparse inverse covariance w/ cross-validated choice of the l1 penalty
"""Sparse inverse covariance w/ cross-from .partition import _check_cvvalidated choice of the l1 penalty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you have a spurious line here.

@GaelVaroquaux
Copy link
Member

I am worried this will lay too many usability traps, and I think @GaelVaroquaux was hoping that in renaming the path to this module

Yes, this is exactly what I had in mind. It would indeed probably be a bit simpler, although it would temporarily lead to duplicated code.

That said, this approach is also very interesting. I need to think it over a bit more.

I must say that in terms of names, I am not terribly happy with importing the cross_validation objects from model_selection.partition, as they are core objects and it makes them hard to discover. We should import them directly from model_selection (and thus put them in the init of the module).

Thanks a lot @pignacio for tackling this issue. It's a very important one, and you are doing it in a very elegant way.

from .externals.six import with_metaclass
from .externals.six.moves import zip
from .metrics.scorer import check_scoring
from .model_selection.partition import LeaveOneOut, LeavePOut, KFold, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this into separate lines, rather than continuation? i.e.

from .model_selection.partition import LeaveOneOut, LeavePOut, KFold
from .model_selection.partition import StratifiedKFold, LeaveOneLabelOut

etc.

@kastnerkyle
Copy link
Member

I made a few small nitpick comments but this looks really great! I also agree on the "have two versions, deprecate old" approach.

@vene vene mentioned this pull request Jul 16, 2014
6 tasks
@vene
Copy link
Member

vene commented Jul 16, 2014

I must say that in terms of names, I am not terribly happy with importing the cross_validation objects from model_selection.partition, as they are core objects and it makes them hard to discover. We should import them directly from model_selection (and thus put them in the init of the module).

+1, I think this is especially possible thanks to our descriptive function and object names. We can keep a nested structure for code organization, but expose a flat interface:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

Currently, even for the simplest of realistic examples, you'd need to import from 4-5 different submodules. This would be a step in the right direction.

@vene
Copy link
Member

vene commented Jul 24, 2014

This PR is getting hard to review, because the renaming, coupled with the significant change, makes it hard to read the diff. It's a pity, because there is a lot of interest in this enhancement.

I thought for a bit what the best way to proceed would be, and I suggest this:

  1. Have a new PR that just does the renaming + deprecation, but makes copies of the existing code being renamed (e.g., copy sklearn.grid_search.GridSearchCV to sklearn.model_selection.GridSearchCV as it is, and deprecate sklearn.grid_search.GridSearchCV)
  2. Have a PR that implements the functionality by changing the copies (sklearn.model_selection.GridSearchCV).

What do you think?

@jnothman
Copy link
Member

Yes, I think others have been voting for that deprecation strategy in any
case. Certainly as a way forward, being able to focus only on the new API
should be helpful.

Ping @pignacio, are you still with us?

On 24 July 2014 19:25, Vlad Niculae notifications@github.com wrote:

This PR is getting hard to review, because the renaming, coupled with
the significant change, makes it hard to read the diff. It's a pity,
because there is a lot of interest in this enhancement.

I thought for a bit what the best way to proceed would be, and I suggest
this:

  1. Have a new PR that just does the renaming + deprecation, but makes
    copies of the existing code being renamed (e.g., copy
    sklearn.grid_search.GridSearchCV to sklearn.model_selection.GridSearchCV
    as it is, and deprecate sklearn.grid_search.GridSearchCV)
  2. Have a PR that implements the functionality by changing the copies (
    sklearn.model_selection.GridSearchCV).

What do you think?


Reply to this email directly or view it on GitHub
#3340 (comment)
.

@pignacio
Copy link
Contributor Author

Yep, right here.

Been a little busy these last days.

If we all agree, we can split this PR like vene suggests. I'll work on 1) for now so we can close that.

@kastnerkyle

I made a few small nitpick comments but this looks really great! I also agree on the "have two versions, deprecate old" approach.

+1 on comments of any kind. The more eyes looking this the better :)

@jnothman
Copy link
Member

Well, reading it again, I'm not actually sure that there's much point in @vene's two-phase approach. Copying the old versions only means that we're giving users of the development version a false deprecation warning, or rather telling them to use something that doesn't have the right interface yet.

The deprecations are module-level: we can warn when the user imports the module, so it's actually just a two-line header in each module, not a great change.

Rather, I think:

  • deprecate each module as necessary, but don't do everything at once. Much of the problem is just that github doesn't show file moves very clearly.
  • the new CV implementations shouldn't try to be backwards compatible.
  • the new CV implementations are independent of moving scoring, etc., so the API changes should be made without adding noise by moving other files around.

Even if we'd like this move to be released in one version, putting it all in one PR is just too overwhelming. And any/most changes outside of the cross_validation module should be backwards-compatible, as far as I can tell.

@raghavrv
Copy link
Member

@pignacio Would you be able to find time to continue the PR? Or if its okay by you could I continue with your work by cherry-picking your commits?

@jnothman
Copy link
Member

@ragv, just to note this is probably not as simple as it looks...

On 16 February 2015 at 19:28, ragv notifications@github.com wrote:

@pignacio https://github.com/pignacio Would you be able to find time
to continue the PR? Or if its okay by you could I continue with your work
by cherry-picking your commits?


Reply to this email directly or view it on GitHub
#3340 (comment)
.

@pignacio
Copy link
Contributor Author

@ragv No, i couldn't :). Feel free to take from here whatever you feel will help, and let me know if I can help, but as @jnothman notes, this was a somewhat complex change.

Good luck :)

@raghavrv
Copy link
Member

@jnothman and @pignacio Thanks for the heads up :) I'll ping you guys if I run into any trouble.... :p

@pignacio
Copy link
Contributor Author

Hi all, since this pull request was mentioned in the mailing list, and @GaelVaroquaux showed interest on seeing this closed, I'd like to share my thoughts respecting this changes.

A brief recap first:

This pull request intended to fix two issues:

What's done so far

This pull request handles the migration from grid_search to model_selectionand shows an idea of how we could handle both the old and the new cross validation syntax with the same code, to avoid duplication.

The implementation is not bulletproof, but looks promising so far.

But, as @vene points out, having these two changes implemented in the same pull request makes looking at diffs and reviewing the code difficult.

What do you think about splitting this changeset into three parts, as in:

  • Module migration (deprecation warnings in old module, new docs + tests)
  • Data independent CV iterators (deprecation warnings for old methods, docs +
    tests)
  • Extending the allowed data types for the y value.

I think these would make the changes more simple to implement and review.

@ragv has shown interest into finishing this, and I can help too. Splitting the changeset should help us make progress.

Thanks for reading up to here, and feel free to comment your thoughts :)

def __init__(self, n, n_folds, indices, shuffle, random_state):
super(_BaseKFold, self).__init__(n, indices)

if abs(n_folds - int(n_folds)) >= np.finfo('f').eps:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amueller Wouldn't a simple if type(n_folds) is not int: suffice here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type(n_folds) is not good, as it could be a numpy int. A isinstance(n_folds, numeric.Integer) would work, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Thanks :)

@raghavrv
Copy link
Member

raghavrv commented Oct 9, 2015

@ogrisel @amueller This can be closed since we have #4294 which will be merged very soon(?)

@amueller
Copy link
Member

amueller commented Oct 9, 2015

when it's merged

@MechCoder MechCoder closed this Nov 5, 2015
@jnothman
Copy link
Member

jnothman commented Nov 5, 2015

yay!

On 6 November 2015 at 07:25, Manoj Kumar notifications@github.com wrote:

Closed #3340 #3340.


Reply to this email directly or view it on GitHub
#3340 (comment).

@raghavrv
Copy link
Member

raghavrv commented Nov 5, 2015

:D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants