MRG grid_search forgets estimators #770
Thanks for working on this. This looks good (but I have not run the code myself).
if self.refit:
    # fit the best estimator using the entire dataset
    # clone first to work around broken estimators
    best_estimator = clone(best_estimator)
    best_estimator = base_clf.set_params(**best_params)
We may want to clone base_clf to keep the original object unchanged. Other than that, looks good :)
Yeah, I also wondered about that. You're right, it's probably better.
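The point about keeping the original object unchanged can be sketched with the standard library alone. In this sketch `ToyEstimator` is a hypothetical stand-in for a scikit-learn estimator, and `copy.deepcopy` stands in for `clone`; neither is taken from the pull request itself.

```python
import copy


class ToyEstimator:
    """Hypothetical stand-in for an estimator with get_params/set_params."""

    def __init__(self, C=1.0):
        self.C = C

    def get_params(self):
        return {"C": self.C}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self  # same object, mutated in place


base_clf = ToyEstimator(C=1.0)
best_params = {"C": 10.0}

# Copy first, then set parameters: the user's original object stays untouched.
best_estimator = copy.deepcopy(base_clf).set_params(**best_params)

print(base_clf.get_params())        # parameters of the original object
print(best_estimator.get_params())  # parameters of the copy
```

Calling `base_clf.set_params(**best_params)` directly would instead mutate the object the user passed in, which is the side effect the review comment is worried about.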
I have an issue with properties. I don't like them: when I interact with the code, I find them confusing. To have an understandable error message, could we not simply check that best_estimator_ is not None in the predict and score methods?
Yeah, I don't like them too much, either. Alternatively, I could make
We could just use a property for backward compat with a deprecation warning and tell the user to use
@ogrisel why do we need I am just wondering how we should expose the best estimator to the user. I think they should be able to just get the object out. If we still store them in So we need a new attribute, that does exactly the same as I guess the solution would be not to give users the object but just the parameters. I don't like that so much.
After thinking about it a bit longer, I feel that the current solution is the best for a smooth transition.
On Fri, Apr 13, 2012 at 09:23:19AM -0700, Andreas Mueller wrote:
What do you call the 'current situation', the one in the pull request, or G
@GaelVaroquaux Sorry for being unspecific, I meant in the PR. This would mean a property for now that can be made an attribute later on.
I am OK with that. Could we have a note that can easily be grepped so
@amueller: Is it ready for merge? I need to use SVC on a largish dataset and this PR would help :)
@@ -144,7 +144,7 @@ def fit_grid_point(X, y, base_clf, clf_params, train, test, loss_func,
logger.short_format_time(time.time() -
start_time))
print "[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg)
return this_score, clf, this_n_test_samples
Would an explicit del clf help garbage-collect the classifier faster?
Possibly.
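A rough way to explore this question, assuming CPython's reference counting, is to track a toy object with a weak reference. `ToyEstimator` and both functions are hypothetical and only illustrate the mechanism: a local name is dropped when the function returns, so whether the object survives depends on whether a reference is handed back to the caller.

```python
import gc
import weakref


class ToyEstimator:
    """Hypothetical stand-in for a fitted classifier holding some memory."""

    def __init__(self):
        self.coef = [0.0] * 10_000


def fit_and_return_estimator():
    clf = ToyEstimator()
    return 0.9, clf  # the caller now holds a reference to clf


def fit_and_return_score_only(tracker):
    clf = ToyEstimator()
    tracker.append(weakref.ref(clf))  # weak reference: does not keep clf alive
    return 0.9  # the local name clf goes out of scope here


score, kept_clf = fit_and_return_estimator()
kept_ref = weakref.ref(kept_clf)

tracker = []
score_only = fit_and_return_score_only(tracker)
gc.collect()  # not needed on CPython, included for other interpreters

estimator_still_alive = kept_ref() is not None
estimator_released = tracker[0]() is None
print(estimator_still_alive, estimator_released)
```

Under these assumptions, an explicit `del clf` immediately before `return` would release the object at most a moment earlier than the return itself does.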
Yes, I think that this can be merged. Thanks @amueller for leading a good discussion on this issue.
I'll merge. (after adding a todo comment)
Hi, has anyone looked at the picklability of GridSearchCV? When I try (using both pickle and joblib), I get an error: TypeError: can't pickle instancemethod objects. This seems like an important bit of functionality. The issue was mentioned in issue #565.
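The report does not say which attribute triggers the error. One generic way to narrow such a failure down before filing an issue is to try pickling each instance attribute separately. The helper `find_unpicklable_attributes` and the `ToySearch` class below are hypothetical; the lambda only mimics an unpicklable callable and is not a claim about what GridSearchCV stored.

```python
import pickle


def find_unpicklable_attributes(obj):
    """Return the names of instance attributes that fail to pickle on their own."""
    failing = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:  # pickle raises several different exception types
            failing.append(name)
    return failing


class ToySearch:
    """Hypothetical object with one picklable and one unpicklable attribute."""

    def __init__(self):
        self.param_grid = {"C": [1, 10]}
        self.score_func = lambda y_true, y_pred: 0.0  # lambdas cannot be pickled


print(find_unpicklable_attributes(ToySearch()))
```

The names it returns are a starting point for the minimal reproducing example that maintainers usually ask for.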
Can you give us a full minimum example to reproduce? I am not sure
@tianhuil can you please open a new issue? This is more of a convenience thing, though (still important).
Any news on this? #565 affects me in 2018 :)
Maybe you have misunderstood the issue. I'm pretty sure #565 is not relevant any longer. If you have another issue, please raise it
@jnothman recently I trained about 1400 fits of MultinomialNB via GridSearchCV. They didn't fit into my 32GB RAM and 8GB swap.
Sounds interesting. What makes you think GridSearchCV is at fault? What was n_jobs set to? Come up with a reproducible code snippet and please create a new issue.
This should address #565 in the least intrusive way.
It is still possible to set refit=False, but then it is not possible to use predict or best_estimator_.
Now best_estimator_ is a property that returns the best estimator when fit was called with refit=True and raises an appropriate error if not.
This should keep the API as consistent as possible. It only breaks if someone used the best estimator without refitting, which I don't feel is a very good idea anyway. And at least it gives sensible feedback.
My reasoning was that maybe someone has a big dataset and uses GridSearch with ShuffleSplit and a low train_size. Then they might not want to fit to the whole dataset.
This option is now available without really changing anything for the average user.
Still need to document the changes in whatsnew.