MRG grid_search forgets estimators #770

Merged
Merged 5 commits into scikit-learn:master on Apr 16, 2012

5 participants

@amueller
scikit-learn member

This should address #565 in the least intrusive way.

It is still possible to set refit=False, but then predict and best_estimator_ cannot be used.
Now best_estimator_ is a property that returns the best estimator when fit was called with refit=True and raises an informative error otherwise.
This keeps the API as consistent as possible. It only breaks for anyone who used the best estimator without refitting, which I don't think is a good idea anyway. And at least they now get sensible feedback.

My reasoning was that someone might have a big dataset and use GridSearch with ShuffleSplit and a low train_size; they might then not want to refit on the whole dataset.
This option is now available without really changing anything for the average user.
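The behaviour described above can be sketched with a toy class. This is an illustrative sketch, not the actual scikit-learn implementation; the private attribute name and error message are made up:

```python
class ToyGridSearchCV:
    """Sketch of the refit / best_estimator_ behaviour discussed above."""

    def __init__(self, refit=True):
        self.refit = refit
        self._best_estimator = None  # hypothetical private storage

    def fit(self, X, y):
        # ... run the parameter search over the CV splits ...
        if self.refit:
            # refit the winning parameters on the whole dataset;
            # object() stands in for a fitted clone here
            self._best_estimator = object()
        return self

    @property
    def best_estimator_(self):
        if self._best_estimator is None:
            raise ValueError(
                "best_estimator_ is only available when fit was "
                "called with refit=True.")
        return self._best_estimator
```

With refit=False the property raises instead of silently returning a stale or missing estimator, which is the "sensible feedback" mentioned above.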

Still need to document the changes in whatsnew.

@ogrisel
scikit-learn member

Thanks for working on this. This looks good (but I have not run the code myself).

@mblondel mblondel and 1 other commented on an outdated diff Apr 13, 2012
sklearn/grid_search.py
if self.refit:
# fit the best estimator using the entire dataset
# clone first to work around broken estimators
- best_estimator = clone(best_estimator)
+ best_estimator = base_clf.set_params(**best_params)
@mblondel
scikit-learn member

We may want to clone base_clf to keep the original object unchanged. Other than that, looks good :)
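The mutation hazard under discussion can be shown with a toy estimator; copy.deepcopy stands in here for sklearn's clone, and ToyEstimator is made up for illustration:

```python
import copy

class ToyEstimator:
    """Minimal stand-in for an estimator with set_params."""

    def __init__(self, C=1.0):
        self.C = C

    def set_params(self, **params):
        # Mutates self in place and returns self, like sklearn estimators
        for name, value in params.items():
            setattr(self, name, value)
        return self

base_clf = ToyEstimator(C=1.0)

# set_params alone mutates base_clf in place:
best = base_clf.set_params(C=10.0)
print(base_clf.C)  # 10.0 -- the original object was changed

# Cloning first leaves the original untouched:
base_clf = ToyEstimator(C=1.0)
best = copy.deepcopy(base_clf).set_params(C=10.0)
print(base_clf.C)  # 1.0
```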

@amueller
scikit-learn member

Yeah, I also wondered about that. You're right, it's probably better.

@GaelVaroquaux
scikit-learn member

I have an issue with properties. I don't like them: when I interact with the code, I find them confusing.

To have an understandable error message, could we not simply check whether best_estimator_ is not None in the predict and score methods?
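That alternative would look roughly like this; the class and error message are illustrative only, not the real grid search code:

```python
class ToySearchWithChecks:
    """Sketch of the alternative: a plain attribute plus explicit checks."""

    def __init__(self, refit=True):
        self.refit = refit
        self.best_estimator_ = None  # plain attribute, set by fit when refit=True

    def predict(self, X):
        # Explicit None check instead of a property:
        if self.best_estimator_ is None:
            raise ValueError(
                "predict is only available if fit was called with "
                "refit=True.")
        return self.best_estimator_.predict(X)
```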

@amueller
scikit-learn member

Yeah, I don't like them too much either.
The thing is that I often used best_estimator_ in my code.

Alternatively, I could make best_estimator_ a deprecated property and include the checks in predict and score.
Maybe that's a good idea.

@ogrisel
scikit-learn member

We could keep a property for backward compat that emits a deprecation warning, and tell the user to use grid_search.best_params_ plus a new method fit_with_best_params(X, y=None) to refit the model with the best parameter set on the development set if needed.
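A deprecation shim along those lines might look like this; the warning text and attribute names are illustrative sketches, not the real API:

```python
import warnings

class ToySearchDeprecating:
    """Sketch of keeping best_estimator_ as a deprecated property."""

    def __init__(self):
        self._best_estimator = object()  # stands in for a fitted estimator

    @property
    def best_estimator_(self):
        # Emit a warning on access, but keep the old behaviour working
        warnings.warn(
            "best_estimator_ is deprecated; use best_params_ and "
            "refit manually if needed.",
            DeprecationWarning)
        return self._best_estimator
```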

@amueller
scikit-learn member

@ogrisel why do we need fit_with_best_params? Why not just set refit=True if you want to fit on the whole dataset?

I am just wondering how we should expose the best estimator to the user. I think they should be able to just get the object out. If we keep storing it in best_estimator_, people who upgrade and use refit=False get unexpected errors.

So we would need a new attribute that does exactly the same as best_estimator_ but has a different name, so that people
who use it know it is only available with refit=True? That seems very confusing.

I guess the alternative would be to give users only the parameters rather than the object. I don't like that so much.

@amueller
scikit-learn member

After thinking about it a bit longer, I feel that the current solution is the best for a smooth transition.
In two versions we could make the property a plain attribute again, as people should know by then how to use it.

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

@GaelVaroquaux Sorry for being unspecific, I meant in the PR.

This would mean a property for now that can be made an attribute later on.

@GaelVaroquaux
scikit-learn member
@mblondel
scikit-learn member

@amueller: Is it ready for merge? I need to use SVC on a largish dataset and this PR would help :)

@mblondel mblondel commented on the diff Apr 16, 2012
sklearn/grid_search.py
@@ -144,7 +144,7 @@ def fit_grid_point(X, y, base_clf, clf_params, train, test, loss_func,
logger.short_format_time(time.time() -
start_time))
print "[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg)
- return this_score, clf, this_n_test_samples
@mblondel
scikit-learn member

Would an explicit `del clf` help garbage-collect the classifier faster?
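Whether del helps depends on the interpreter. Under CPython's reference counting, dropping the last strong reference frees the object immediately, which a weakref makes visible; this is a toy classifier, not the grid search code:

```python
import weakref

class ToyClassifier:
    pass

clf = ToyClassifier()
probe = weakref.ref(clf)  # weak reference does not keep clf alive

del clf  # drop the only strong reference

# Under CPython, refcounting reclaims the object right away;
# other interpreters may defer collection to a tracing GC.
print(probe() is None)  # True on CPython
```

In the fit_grid_point case above, simply not returning clf already lets it die with the local frame, which is why explicit deletes often make little difference in practice.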

@GaelVaroquaux
scikit-learn member

Possibly.

@GaelVaroquaux
scikit-learn member

Yes, I think that this can be merged. Thanks @amueller for leading a good discussion on this issue.

@amueller
scikit-learn member

I'll merge (after adding a TODO comment).
@mblondel how about you try out the `del clf`? In my experience, explicit deletes don't help that much.
If it helps, it's easy to add :)

@amueller amueller merged commit d7f86f5 into scikit-learn:master Apr 16, 2012
@tianhuil

Hi, has anyone looked at the picklability of GridSearchCV? When I try (using both pickle and joblib), I get an error:

`TypeError: can't pickle instancemethod objects`

This seems like an important bit of functionality. The issue was mentioned in #565.
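For context, the Python 2 pickler refused bound methods ("instancemethods") outright, which is the error reported above. A similar failure is easy to reproduce today with other non-picklable callables such as lambdas; this is illustrative only, not the GridSearchCV internals:

```python
import pickle

try:
    # Lambdas are pickled by reference, and the reference lookup fails,
    # so this raises much like the instancemethod case did on Python 2
    pickle.dumps(lambda x: x)
    failed = False
except Exception as exc:
    failed = True
    print("pickling failed:", type(exc).__name__)
```

A common fix is to store a picklable reference (a module-level function, or a string naming the scorer) instead of a bound method.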

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

@tianhuil can you please open a new issue?

This is more of a convenience thing, though (still important).
If you want the fitted estimator, pickle `best_estimator_`; if you want the scores, pickle `grid_scores_`.
