MRG grid_search forgets estimators #770

Merged
Merged 5 commits into scikit-learn:master on Apr 16, 2012

5 participants

@amueller
scikit-learn member

This should address #565 in the least intrusive way.

It is still possible to set refit=False, but then predict and best_estimator_ cannot be used.
Now best_estimator_ is a property that returns the best estimator when fit was called with refit=True and raises an informative error otherwise.
This keeps the API as consistent as possible. It only breaks for anyone who used the best estimator without refitting, which I don't think is a good idea anyway. And at least they now get sensible feedback.

My reasoning was that someone might have a big dataset and use GridSearch with ShuffleSplit and a low train_size; they might then not want to refit on the whole dataset.
This option is now available without really changing anything for the average user.
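The behaviour described above can be sketched with a toy class. This is an illustrative sketch, not the actual scikit-learn implementation; the private attribute name and error message are made up:

```python
class ToyGridSearchCV:
    """Sketch of the refit / best_estimator_ behaviour discussed above."""

    def __init__(self, refit=True):
        self.refit = refit
        self._best_estimator = None  # hypothetical private storage

    def fit(self, X, y):
        # ... run the parameter search over the CV splits ...
        if self.refit:
            # refit the winning parameters on the whole dataset;
            # object() stands in for a fitted clone here
            self._best_estimator = object()
        return self

    @property
    def best_estimator_(self):
        if self._best_estimator is None:
            raise ValueError(
                "best_estimator_ is only available when fit was "
                "called with refit=True.")
        return self._best_estimator
```

With refit=False the property raises instead of silently returning a stale or missing estimator, which is the "sensible feedback" mentioned above.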

Still need to document the changes in whatsnew.

@ogrisel
scikit-learn member

Thanks for working on this. This looks good (but I have not run the code myself).

@mblondel mblondel and 1 other commented on an outdated diff Apr 13, 2012
sklearn/grid_search.py
if self.refit:
# fit the best estimator using the entire dataset
# clone first to work around broken estimators
- best_estimator = clone(best_estimator)
+ best_estimator = base_clf.set_params(**best_params)
@mblondel
scikit-learn member

We may want to clone base_clf to keep the original object unchanged. Other than that, looks good :)
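The mutation hazard under discussion can be shown with a toy estimator; copy.deepcopy stands in here for sklearn's clone, and ToyEstimator is made up for illustration:

```python
import copy

class ToyEstimator:
    """Minimal stand-in for an estimator with set_params."""

    def __init__(self, C=1.0):
        self.C = C

    def set_params(self, **params):
        # Mutates self in place and returns self, like sklearn estimators
        for name, value in params.items():
            setattr(self, name, value)
        return self

base_clf = ToyEstimator(C=1.0)

# set_params alone mutates base_clf in place:
best = base_clf.set_params(C=10.0)
print(base_clf.C)  # 10.0 -- the original object was changed

# Cloning first leaves the original untouched:
base_clf = ToyEstimator(C=1.0)
best = copy.deepcopy(base_clf).set_params(C=10.0)
print(base_clf.C)  # 1.0
```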

@amueller
scikit-learn member

Yeah, I also wondered about that. You're right, it's probably better.

@GaelVaroquaux
scikit-learn member

I have an issue with properties. I don't like them: when I interact with the code, I find them confusing.

To have an understandable error message, could we not simply check whether best_estimator_ is not None in the predict and score methods?
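That alternative would look roughly like this; the class and error message are illustrative only, not the real grid search code:

```python
class ToySearchWithChecks:
    """Sketch of the alternative: a plain attribute plus explicit checks."""

    def __init__(self, refit=True):
        self.refit = refit
        self.best_estimator_ = None  # plain attribute, set by fit when refit=True

    def predict(self, X):
        # Explicit None check instead of a property:
        if self.best_estimator_ is None:
            raise ValueError(
                "predict is only available if fit was called with "
                "refit=True.")
        return self.best_estimator_.predict(X)
```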

@amueller
scikit-learn member

Yeah, I don't like them too much either.
The thing is that I often used best_estimator_ in my code.

Alternatively, I could make best_estimator_ a deprecated property and include the checks in predict and score.
Maybe that's a good idea.

@ogrisel
scikit-learn member

We could keep a property for backward compat that emits a deprecation warning, and tell the user to use grid_search.best_params_ plus a new method fit_with_best_params(X, y=None) to refit the model with the best parameter set on the development set if needed.
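A deprecation shim along those lines might look like this; the warning text and attribute names are illustrative sketches, not the real API:

```python
import warnings

class ToySearchDeprecating:
    """Sketch of keeping best_estimator_ as a deprecated property."""

    def __init__(self):
        self._best_estimator = object()  # stands in for a fitted estimator

    @property
    def best_estimator_(self):
        # Emit a warning on access, but keep the old behaviour working
        warnings.warn(
            "best_estimator_ is deprecated; use best_params_ and "
            "refit manually if needed.",
            DeprecationWarning)
        return self._best_estimator
```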

@amueller
scikit-learn member

@ogrisel why do we need fit_with_best_params? Why not just set refit=True if you want to fit on the whole dataset?

I am just wondering how we should expose the best estimator to the user. I think they should be able to just get the object out. If we keep storing it in best_estimator_, people who upgrade and use refit=False get unexpected errors.

So we would need a new attribute that does exactly the same as best_estimator_ but has a different name, so that people
who use it know it is only available with refit=True? That seems very confusing.

I guess the alternative would be to give users only the parameters rather than the object. I don't like that so much.

@amueller
scikit-learn member

After thinking about it a bit longer, I feel that the current solution is the best for a smooth transition.
In two versions we could make the property a plain attribute again, as people should know by then how to use it.

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

@GaelVaroquaux Sorry for being unspecific, I meant in the PR.

This would mean a property for now that can be made an attribute later on.

@GaelVaroquaux
scikit-learn member
@mblondel
scikit-learn member

@amueller: Is it ready for merge? I need to use SVC on a largish dataset and this PR would help :)

@mblondel mblondel commented on the diff Apr 16, 2012
sklearn/grid_search.py
@@ -144,7 +144,7 @@ def fit_grid_point(X, y, base_clf, clf_params, train, test, loss_func,
logger.short_format_time(time.time() -
start_time))
print "[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg)
- return this_score, clf, this_n_test_samples
@mblondel
scikit-learn member

Would an explicit `del clf` help garbage-collect the classifier faster?
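Whether del helps depends on the interpreter. Under CPython's reference counting, dropping the last strong reference frees the object immediately, which a weakref makes visible; this is a toy classifier, not the grid search code:

```python
import weakref

class ToyClassifier:
    pass

clf = ToyClassifier()
probe = weakref.ref(clf)  # weak reference does not keep clf alive

del clf  # drop the only strong reference

# Under CPython, refcounting reclaims the object right away;
# other interpreters may defer collection to a tracing GC.
print(probe() is None)  # True on CPython
```

In the fit_grid_point case above, simply not returning clf already lets it die with the local frame, which is why explicit deletes often make little difference in practice.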

@GaelVaroquaux
scikit-learn member

Possibly.

@GaelVaroquaux
scikit-learn member

Yes, I think that this can be merged. Thanks @amueller for leading a good discussion on this issue.

@amueller
scikit-learn member

I'll merge (after adding a TODO comment).
@mblondel how about you try out the `del clf`? In my experience, explicit deletes don't help that much.
If it helps, it's easy to add :)

@amueller amueller merged commit d7f86f5 into scikit-learn:master Apr 16, 2012
@tianhuil

Hi, has anyone looked at the picklability of GridSearchCV? When I try (using both pickle and joblib), I get an error:

`TypeError: can't pickle instancemethod objects`

This seems like an important bit of functionality. The issue was mentioned in #565.
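For context, the Python 2 pickler refused bound methods ("instancemethods") outright, which is the error reported above. A similar failure is easy to reproduce today with other non-picklable callables such as lambdas; this is illustrative only, not the GridSearchCV internals:

```python
import pickle

try:
    # Lambdas are pickled by reference, and the reference lookup fails,
    # so this raises much like the instancemethod case did on Python 2
    pickle.dumps(lambda x: x)
    failed = False
except Exception as exc:
    failed = True
    print("pickling failed:", type(exc).__name__)
```

A common fix is to store a picklable reference (a module-level function, or a string naming the scorer) instead of a bound method.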

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

@tianhuil can you please open a new issue?

This is more of a convenience thing, though (still important).
If you want the fitted estimator, pickle `best_estimator_`; if you want the scores, pickle `grid_scores_`.
