API Proposal: Generalized Cross-Validation and Early Stopping #1626
Comments
As for early stopping, that would not be too hard when there's a …
It is not only …
Regarding early stopping, this is definitely a feature to have. However, I think it should be designed to also allow for resuming the fitting process. Let's say I am building a random forest. Let's assume that … This whole thing makes me also think that we should design this fitting/stopping/resuming process in parallel with a monitor API. They should clearly go hand in hand.
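For illustration, a minimal sketch of that resume-and-grow workflow, assuming the warm_start pattern that scikit-learn's ensembles eventually adopted (the data setup is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=200, random_state=0)

forest = RandomForestClassifier(n_estimators=100, warm_start=True,
                                random_state=0)
forest.fit(X_train, y_train)         # grow the first 100 trees

forest.set_params(n_estimators=150)  # decide the forest should be bigger...
forest.fit(X_train, y_train)         # ...resume: only 50 new trees are grown
```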
You are right, my proposal does not address this very interactive setting where you have a model and want to improve it. Isn't that exactly the setting for which we have …? So what is your usage scenario and what kind of behavior would you like to have? My foremost motivation was cheap parameter selection via grid search. How do you want to control the fitting? Via a stopping criterion that you can change between calls?
Regarding early stopping, I would in addition like to see a timeout parameter. Would that fit in well? To clarify what I mean: every time the validation set is checked, the elapsed wall time could be evaluated.
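Concretely, the check could look like this minimal sketch; max_train_time and the training loop are illustrative, not an existing scikit-learn API:

```python
import time

start = time.monotonic()
max_train_time = 60.0  # hypothetical parameter: wall-time budget in seconds

for epoch in range(1000):
    ...  # one training step plus the usual early_stopping_tolerance check
    if time.monotonic() - start > max_train_time:
        break  # budget exhausted: stop training, keep the model fit so far
```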
Not sure how well that works with multiprocessing...
I guess I don't understand why it wouldn't work. Any time the early_stopping_tolerance is checked, the elapsed wall time can be computed. I didn't mean to use a timer, just grab the current time and see how much time has elapsed. Does that make sense, or am I missing something obvious?
I guess you could do that somehow.
See #930 for a related issue that could possibly be tackled as well.
I think by generalised CV you mean something like memoised prediction with respect to parameter variation: one fit, multiple predict. For a trivial case, consider … And, considering the interaction of such a feature with … For this reason, I can as yet only see a solution along the following lines: …
@jnothman Somewhat slow reply here, sorry ;) I am not sure I agree totally with your view of "one fit, multiple predict". That doesn't seem to cover path algorithms such as LassoCV. You could store the coef_ for each regularization parameter, but I don't think this is how I would do it.
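For context, path algorithms compute the coefficients for an entire regularization grid in one call rather than storing a separately fitted estimator per parameter value; this is what scikit-learn's lasso_path does:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, random_state=0)

# One call fits the whole regularization path: coefs holds one column of
# coefficients per alpha value.
alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
print(alphas.shape, coefs.shape)  # (10,) (20, 10)
```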
Hi guys, can you please comment on the current status of this feature?
@fingoldo there's early stopping in the neural nets and gradient boosting in scikit-learn.
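For anyone landing here later, in current scikit-learn those knobs look roughly like this:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# A fraction of the training data is held out internally; training stops
# once the validation score has not improved by tol for n_iter_no_change rounds.
mlp = MLPClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=10, tol=1e-4)
gbc = GradientBoostingClassifier(n_estimators=1000, validation_fraction=0.1,
                                 n_iter_no_change=10, tol=1e-4)
```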
@amueller Yes, thanks, I already know this; I'm wondering if there are comparisons available of the quality of solutions found with and without early stopping in sklearn :-)
There is no general answer; this is both model- and problem-specific.
Guys, is my understanding correct? The winning set of parameters is declared based on the maximal average test score, right? Now the question: obviously each test score is the score of an estimator which has overfit the training set. So what we are getting as a mean test score for a parameter set is the mean performance of an OVERFIT estimator, which was …
Later suggested production implementation steps include taking the classifier/parameter set having the best … Ok, you have added early stopping for neural nets and gradient boosting, but …
This part of the story all of the scikit-learn manuals and books (including Andreas' and Sebastian's books) prefer to omit. But I think it's a crucial point.
Does this make sense?
Is this still an active discussion? For me, the early stopping option really makes sense, especially in case of …
@Borda you can also just set …
It depends if I set …
A bunch of the ideas mentioned here have moved forward and are being worked on in more recent issues/PRs. I think we can close this one.
This is a proposal to resolve two API issues in sklearn: generalized cross-validation and early stopping.
Why should we care about that?
With generalized cross-validation I mean finding the best setting of some parameter without refitting the entire model. This is currently implemented for RFE and some linear models via an EstimatorCV. These don't work well together with GridSearchCV, as might be required in a Pipeline or when more than one parameter needs to be found.
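To make the contrast concrete, here is the EstimatorCV pattern next to the equivalent grid search, written against the present-day module layout; a sketch, not a benchmark:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=20, random_state=0)
alphas = np.logspace(-3, 1, 20)

# EstimatorCV pattern: one path computation per CV split covers all alphas.
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

# GridSearchCV pattern: a full refit per (alpha, split) pair -- more work,
# but it composes with Pipeline and with grids over other parameters.
grid = GridSearchCV(Lasso(), {"alpha": alphas}, cv=5).fit(X, y)
```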
Also, similar functionality would be great for other models like GradientBoosting (for n_estimators) and all tree-based methods (for max_depth).
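GradientBoosting is a good illustration because its staged_predict already provides exactly this "fit once, evaluate every n_estimators" behaviour:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# One fit at the largest size; staged_predict then yields validation
# predictions after every boosting stage, i.e. for every n_estimators value.
gbc = GradientBoostingClassifier(n_estimators=200).fit(X_train, y_train)
scores = [accuracy_score(y_val, pred) for pred in gbc.staged_predict(X_val)]
best_n_estimators = 1 + max(range(len(scores)), key=scores.__getitem__)
```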
With early stopping I mean saving computation when more computation doesn't improve the result. We don't have that yet, but it would be a great (maybe even necessary) feature for SGD-based methods and bagging methods (random forests and extra trees). Note that early stopping requires a validation set to evaluate the model.
How can we solve that?
Let's start with generalized cross-validation.
We need it to work together with GridSearchCV.
This will definitely require changes in both GridSearchCV and the estimators.
My idea:
The user can set a parameter to a list of values, for example max_depth=range(1, 10). During fit the estimator will fit in a way that it can produce predictions for all of these values. When predict is called, the estimator will return a dict with the parameter values as keys and the corresponding predictions as values. (We could also add a new predict_all function, but I'm not sure about that.) GridSearchCV could then simply incorporate these values into the grid-search result. For that, GridSearchCV needs to be able to ask the estimator for which parameters it can do generalized CV, and just pass on the list of values it got there.
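A purely hypothetical sketch of this contract (the class name and its dict-returning predict are illustrative, not an existing API), using gradient boosting because its stages make the multi-parameter predictions cheap:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

class MultiNEstimatorsGBC:
    """Hypothetical: fit once, predict for every n_estimators in a list."""

    def __init__(self, n_estimators=range(10, 110, 10)):
        self.n_estimators = list(n_estimators)

    def fit(self, X, y):
        # Fit a single model at the largest value; smaller ensembles are
        # prefixes of the boosting sequence.
        self.gbc_ = GradientBoostingClassifier(
            n_estimators=max(self.n_estimators)).fit(X, y)
        return self

    def predict(self, X):
        # Return a dict {parameter value: predictions}, as proposed above.
        staged = list(self.gbc_.staged_predict(X))  # one array per stage
        return {n: staged[n - 1] for n in self.n_estimators}

X, y = make_classification(random_state=0)
preds = MultiNEstimatorsGBC().fit(X, y).predict(X)  # {10: ..., 20: ..., ...}
```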
So now to early stopping. The reason I want to treat the two problems as one is that early stopping is basically a lazy form of generalized cross-validation.
So the user could, for example, set n_iter=range(1, 100, 10). I would provide the validation set that is used either as a parameter to __init__ or to fit (not sure). So it would be enough to add two parameters to the estimator: early_stopping_tolerance and early_stopping_set=None (if it is None, no early stopping). There are two choices that I made here:
This is again so that it can be used inside GridSearchCV, and it doesn't really add that much overhead if the user doesn't use GridSearchCV (why would they do that anyway?).
It is also very explicit and gives the user a lot of control.
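A hypothetical sketch of how the two parameters could behave in a fit loop, built on SGDClassifier.partial_fit; the early_stopping_* names come from this proposal, not from an existing estimator:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_with_early_stopping(X, y, n_iter=100,
                            early_stopping_tolerance=1e-4,
                            early_stopping_set=None):
    clf = SGDClassifier()
    classes = np.unique(y)
    best = -np.inf
    for _ in range(n_iter):
        clf.partial_fit(X, y, classes=classes)  # one pass over the data
        if early_stopping_set is None:          # None => no early stopping
            continue
        X_val, y_val = early_stopping_set       # user-provided validation set
        score = clf.score(X_val, y_val)
        if score - best < early_stopping_tolerance:
            break                               # validation score plateaued
        best = score
    return clf
```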
Restrictions
In pipelines, this will only work for the last estimator, so I'm not sure how to do this with RFE, for example.
Do we really need this?
The changes I proposed above are quite big in some sense, but I think the two issues need to be resolved. If you have any better idea, feel free to explain it ;)