
MRG grid_search forgets estimators #770

Merged
merged 5 commits into master from oblivious_grid_search on Apr 16, 2012

Conversation

amueller
Member

This should address #565 in the least intrusive way.

It is still possible to set refit=False, but then neither predict nor best_estimator_ can be used.
Now best_estimator_ is a property that returns the best estimator when fit was called with refit=True, and raises an appropriate error otherwise.
This keeps the API as consistent as possible. It only breaks for someone who used the best estimator without refitting, which I don't think is a good idea anyway, and at least they now get sensible feedback.
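A minimal sketch of the property behavior described here (the class skeleton and error message are illustrative, not the actual scikit-learn code):

```python
class GridSearchCV:
    """Illustrative skeleton only, not the real scikit-learn class."""

    def __init__(self, refit=True):
        self.refit = refit
        self._best_estimator = None

    def fit(self, X, y):
        # ... search over the parameter grid, track the best candidate ...
        if self.refit:
            # Stand-in for refitting the best candidate on the full dataset.
            self._best_estimator = object()
        return self

    @property
    def best_estimator_(self):
        # Raise a sensible error instead of silently missing an attribute.
        if self._best_estimator is None:
            raise ValueError(
                "best_estimator_ is only available when fit was "
                "called with refit=True.")
        return self._best_estimator
```

With refit=False, accessing best_estimator_ fails with an explanatory message instead of an opaque AttributeError.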

My reasoning was that maybe someone has a big dataset and uses GridSearch with ShuffleSplit and a low train_size. Then they might not want to fit to the whole dataset.
This option is now available without really changing anything for the average user.

Still need to document the changes in whatsnew.

@ogrisel
Member

ogrisel commented Apr 13, 2012

Thanks for working on this. This looks good (but I have not run the code myself).


if self.refit:
    # fit the best estimator using the entire dataset
    # clone first to work around broken estimators
    best_estimator = clone(best_estimator)
    best_estimator = base_clf.set_params(**best_params)
Member

We may want to clone base_clf to keep the original object unchanged. Other than that, looks good :)
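The concern is that calling set_params on base_clf mutates the object the user passed in; cloning first avoids that. A small stand-in demonstration (deepcopy is used here in place of sklearn.base.clone, and the Estimator class is invented for illustration):

```python
from copy import deepcopy

class Estimator:
    """Tiny stand-in with a set_params-style mutator."""
    def __init__(self, C=1.0):
        self.C = C

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

base_clf = Estimator(C=1.0)

# Buggy pattern: base_clf.set_params(C=10.0) would change the user's object.
# Safer: work on a copy so base_clf keeps its original settings.
best_estimator = deepcopy(base_clf).set_params(C=10.0)

print(base_clf.C)        # → 1.0
print(best_estimator.C)  # → 10.0
```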

Member Author

Yeah, I also wondered about that. You're right, it's probably better.

@GaelVaroquaux
Member

I have an issue with properties. I don't like them: when I interact with the code, I find them confusing.

To have an understandable error message, could we not simply check if best_estimator_ is not None in the predict and score method?

@amueller
Member Author

Yeah, I don't like them too much, either.
The thing is that I often used best_estimator_ in my code.

Alternatively, I could make best_estimator_ a deprecated property and include the checks in predict and score.
Maybe that's a good idea.

@ogrisel
Member

ogrisel commented Apr 13, 2012

We could just use a property for backward compatibility, with a deprecation warning telling the user to use grid_search.best_params_ and a new method called fit_with_best_params(X, y=None) to refit the model with the best parameter set on the development set if needed.
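The backward-compatible deprecation could look roughly like this (the warning text and the fit_with_best_params name follow the suggestion above; nothing here is shipped scikit-learn API):

```python
import warnings

class GridSearchCV:
    """Illustrative skeleton only, not the real scikit-learn class."""

    def __init__(self):
        # Placeholders standing in for the results of a finished search.
        self.best_params_ = {"C": 10.0}
        self._best_estimator = "refitted-model"

    @property
    def best_estimator_(self):
        # Keep old code working, but steer users to the new interface.
        warnings.warn(
            "best_estimator_ is deprecated; use best_params_ and "
            "fit_with_best_params(X, y) instead.",
            DeprecationWarning)
        return self._best_estimator
```

Old code keeps working but sees a DeprecationWarning, and the property can later be removed or turned back into a plain attribute.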

@amueller
Member Author

@ogrisel why do we need fit_with_best_params? Why not just set refit=True if you want to fit on the whole dataset?

I am just wondering how we should expose the best estimator to the user. I think they should be able to just get the object out. If we still store it in best_estimator_, people who upgrade and use refit=False will get unexpected errors.

So we would need a new attribute that does exactly the same as best_estimator_ but has a different name, so that people who use it know they can only use it if they set refit=True? That seems very confusing.

I guess the solution would be to give users not the object but just the parameters. I don't like that so much.

@amueller
Member Author

After thinking about it a bit longer, I feel that the current solution is the best for a smooth transition.
In two versions, we could just make the property an attribute again, as people should know by then how to use it.

@GaelVaroquaux
Member

On Fri, Apr 13, 2012 at 09:23:19AM -0700, Andreas Mueller wrote:

After thinking about it a bit longer, I feel that the current solution is the best for a smooth transition.

What do you call the 'current situation', the one in the pull request, or
in master?

G

@amueller
Member Author

@GaelVaroquaux Sorry for being unspecific, I meant in the PR.

This would mean a property for now that can be made an attribute later on.

@GaelVaroquaux
Member

This would mean a property for now that can be made an attribute later on.

I am OK with that. Could we add a note that can easily be grepped, so that we don't forget to change this behavior in a few releases?

@mblondel
Member

@amueller: Is it ready for merge? I need to use SVC on a largish dataset and this PR would help :)

@@ -144,7 +144,7 @@ def fit_grid_point(X, y, base_clf, clf_params, train, test, loss_func,
logger.short_format_time(time.time() -
start_time))
print "[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg)
return this_score, clf, this_n_test_samples
Member

Would an explicit del clf help garbage-collect the classifier faster?

Member

Possibly.
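Whether an explicit del helps depends on reference counting: in CPython, deleting the last strong reference frees the object immediately rather than leaving it alive until the name goes out of scope. A quick, CPython-specific way to observe this:

```python
import weakref

class Clf:
    """Stand-in for a fitted classifier holding large arrays."""
    pass

clf = Clf()
ref = weakref.ref(clf)  # weak reference does not keep the object alive

del clf  # drop the only strong reference

# In CPython the object is reclaimed immediately by reference counting,
# so the weak reference is already dead.
print(ref() is None)  # → True
```

Of course, del only helps if the deleted name really was the last reference; an object that is also returned or stored elsewhere stays alive regardless.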

@GaelVaroquaux
Member

Yes, I think that this can be merged. Thanks @amueller for leading a good discussion on this issue.

@amueller
Member Author

I'll merge. (after adding a todo comment)
@mbonde how about you try out the del clf. In my experience, explicit deletes don't help that much.
If it helps, it's easy to add :)

amueller added a commit that referenced this pull request Apr 16, 2012
@amueller amueller merged commit d7f86f5 into scikit-learn:master Apr 16, 2012
@tianhuil

Hi, has anyone looked at the picklability of GridSearchCV? When I try (using both pickle and joblib), I get an error:

TypeError: can't pickle instancemethod objects

This seems like an important bit of functionality. The issue was mentioned in Issue#565.
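The "can't pickle instancemethod objects" wording is from Python 2, where bound methods are unpicklable; the usual cause is that some attribute of the object holds an unpicklable reference, such as a bound method or a lambda passed as a callback. A minimal illustration of the same class of failure, independent of scikit-learn (Wrapper and score_func are invented names):

```python
import pickle

class Wrapper:
    """Object that stores a lambda, which pickle cannot serialize."""
    def __init__(self):
        self.score_func = lambda y_true, y_pred: 0.0

failed = False
try:
    pickle.dumps(Wrapper())
except (pickle.PicklingError, AttributeError, TypeError):
    # Pickling fails because the instance __dict__ contains a lambda.
    failed = True

print(failed)  # → True
```

Replacing such callables with module-level functions (picklable by qualified name) is the standard workaround.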

@GaelVaroquaux
Member

Hi, has anyone looked at the picklability of GridSearchCV? When I try (using both pickle and joblib), I get an error:

TypeError: can't pickle instancemethod objects

Can you give us a full minimal example to reproduce? I am not sure
whether it is a general problem or specific to the setup you used
the GridSearch in.

@amueller
Member Author

@tianhuil can you please open a new issue?

This is more of a convenience thing, though (still important).
If you want the fitted estimator, pickle best_estimator_; if you want the scores, pickle grid_scores_.

@amueller amueller deleted the oblivious_grid_search branch May 19, 2017 20:46
@QtRoS

QtRoS commented Jan 14, 2018

Any news on this? #565 affects me in 2018 :)

@jnothman
Member

Maybe you have misunderstood the issue. I'm pretty sure #565 is not relevant any longer. If you have another issue, please raise it.

@QtRoS

QtRoS commented Jan 15, 2018

@jnothman recently I trained about 1400 fits of MultinomialNB via GridSearchCV. They didn't fit into my 32GB RAM and 8GB swap.

@jnothman
Member

jnothman commented Jan 15, 2018 via email
