
API Proposal: Generalized Cross-Validation and Early Stopping #1626

Open
amueller opened this issue Jan 25, 2013 · 21 comments


@amueller amueller commented Jan 25, 2013

This is a proposal to resolve two API issues in sklearn:

  • Generalized Cross Validation
  • Early stopping

Why should we care about that?

By generalized cross-validation I mean finding the best setting of some parameter without refitting the entire model. This is currently implemented for RFE and some linear models via EstimatorCV classes. These don't work well together with GridSearchCV, which might be required in a Pipeline or when more than one parameter needs to be found.
Similar functionality would also be great for other models like GradientBoosting (for n_estimators) and all tree-based methods (for max_depth).

By early stopping I mean saving computation when more computation doesn't improve the result. We don't have that yet, but it would be a great (maybe even necessary) feature for SGD-based methods and bagging methods (random forests and extra trees). Note that early stopping requires a validation set to evaluate the model.

How can we solve that?

Let's start with the generalized cross validation.
We need it to work together with GridSearchCV.
This will definitely require changes in both GridSearchCV and the estimators.
My idea:

  1. Give the estimator an iterable of the values you want to try, e.g. max_depth=range(1, 10). During fit, the estimator fits in a way that lets it produce predictions for all of these values.
  2. When predict is called, the estimator returns a dict mapping each parameter value to the predictions for that value. (We could also add a new predict_all function, but I'm not sure about that.)

GridSearchCV could then simply incorporate these values into the grid-search result.
For that, GridSearchCV needs to be able to ask the estimator which parameters it can do generalized CV for, and just pass on the list of values it got for them.
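As a toy illustration of those two steps, here is a hypothetical sketch; ToyEstimator, its n_rounds parameter, and the dict-returning predict are invented for this example and are not real scikit-learn API:

```python
# Hypothetical sketch of the proposed interface, not real sklearn API.
# The "model" is a trivial majority-vote classifier whose parameter
# n_rounds stands in for max_depth / n_estimators.
from collections import Counter

class ToyEstimator:
    def __init__(self, n_rounds=(1, 2, 3)):
        # n_rounds: an iterable of parameter values to cover in one fit.
        self.n_rounds = n_rounds

    def fit(self, X, y):
        # One pass over the data covers every requested value; a real
        # estimator would store whatever state each setting needs.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Proposed behaviour: a dict mapping parameter value -> predictions.
        return {n: [self.majority_] * len(X) for n in self.n_rounds}

est = ToyEstimator(n_rounds=range(1, 4)).fit([[0], [1], [2]], [0, 1, 1])
print(sorted(est.predict([[5], [6]])))  # [1, 2, 3]
```

A grid search could then read one fit's predictions for every value of n_rounds out of that dict instead of refitting per value.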

So now to early stopping. The reason I want to treat the two problems as one is that early stopping is basically a lazy form of generalized cross-validation.
So

  1. You would provide the estimator with an iterable, e.g. n_iter=range(1, 100, 10).
  2. The estimator fits for all these values (implemented as above). But for each setting it also evaluates on a validation set, and if there is only a small improvement, training stops.

I would provide the validation set either as a parameter to __init__ or to fit (not sure which). So it would be enough to add two parameters to the estimator: early_stopping_tolerance and early_stopping_set=None (if it is None, no early stopping).
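A minimal sketch of that stopping rule; early_stopping_tolerance is the hypothetical parameter from this proposal, and the (setting, score) pairs here are fabricated stand-ins for evaluations on the early_stopping_set that a real estimator would compute during fit:

```python
# Sketch of the proposed early-stopping rule over an iterable of settings.
def fit_with_early_stopping(scores_per_step, early_stopping_tolerance=1e-3):
    """Walk the iterable of settings; stop once the validation score
    improves by less than the tolerance."""
    best, chosen = float('-inf'), None
    for step, score in scores_per_step:
        if chosen is not None and score - best < early_stopping_tolerance:
            break  # only a small improvement: stop training early
        if score > best:
            best, chosen = score, step
    return chosen, best

# Fabricated validation scores that plateau after n_iter=31.
history = [(1, 0.60), (11, 0.75), (21, 0.82), (31, 0.85), (41, 0.8501)]
print(fit_with_early_stopping(history))  # (31, 0.85)
```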

There are two choices that I made here:

  1. Provide a separate validation set, not one generated from the training set on the fly.
    This is again so that it can be used inside GridSearchCV, and it doesn't really add much overhead if the user doesn't use GridSearchCV (why would they do that anyway?).
    It is also very explicit and gives the user a lot of control.
  2. Provide the parameter settings as an iterable, not a maximum. The reason is that you probably don't want to evaluate on the validation set every iteration, but maybe every k iterations. What's k? Another parameter? I feel an iterable is a good way to specify this. It also allows a unified interface with the generalized cross-validation case.

Restrictions

In pipelines, this will only work for the last estimator. So I'm not sure how to do this with RFE, for example.

Do we really need this?

The changes I proposed above are quite big in some sense, but I think the two issues need to be resolved. If you have any better idea, feel free to explain it ;)


@larsmans larsmans commented Jan 26, 2013

As for early stopping, that would not be too hard when there's a partial_fit, but there needs to be a way to tell when it's appropriate. We could special-case on the parameter n_iter -- is that the only one where early stopping applies?
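One way such a partial_fit-driven loop could look — a sketch where only the partial_fit/score method names mirror sklearn's conventions, and the incremental learner is a fake whose accuracy plateaus so the example is self-contained:

```python
# Sketch: early stopping on top of partial_fit, with a fake learner.
class ToySGD:
    """Fake incremental learner: score rises by 0.1 per call, capped at 0.9."""
    def __init__(self):
        self.calls = 0
    def partial_fit(self, X, y):
        self.calls += 1
        return self
    def score(self, X, y):
        return min(0.9, 0.5 + 0.1 * self.calls)

def early_stopping_fit(est, X, y, X_val, y_val, tol=1e-3, max_iter=100):
    best = float('-inf')
    for _ in range(max_iter):
        est.partial_fit(X, y)
        score = est.score(X_val, y_val)
        if score - best < tol:
            break  # validation score stopped improving
        best = score
    return est

est = early_stopping_fit(ToySGD(), None, None, None, None)
print(est.calls)  # 5: one extra call confirms the plateau, then stop
```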


@amueller amueller commented Jan 26, 2013

It is not only n_iter, in the bagging methods it's n_estimators.


@glouppe glouppe commented Jan 28, 2013

Regarding early stopping, this is definitely a feature to have. However, I think it should be designed to also allow for resuming the fitting process.

Let's say I am building a random forest. Let's assume that early_stopping_tolerance is set too high, such that the fitting process is stopped too early. What can I do if I change my mind and actually want more trees? Set early_stopping_tolerance to a lower value and rebuild the whole thing? 👎 The same problem also happens when the iterable you provide (n_iter or n_estimators) does not actually cover a wide enough range of values. What can you do then if you want to test more values? In other words, I don't think early stopping as you propose can solve all of our problems (e.g., it cannot solve #1585 in my opinion).

This whole thing makes me also think that we should design this fitting/stopping/resuming process in parallel with a monitor API. They should clearly go hand in hand.


@amueller amueller commented Jan 28, 2013

You are right, my proposal does not address this very interactive setting where you have a model and want to improve it. Isn't that exactly the setting for which we have partial_fit? Using partial fit together with early stopping as I proposed above might work.

So what is your usage scenario and what kind of behavior would you like to have? My foremost motivation was cheap parameter selection via grid search.

How do you want to control the fitting? Via a stopping criterion that you can change between calls?


@jwkvam jwkvam commented Jan 29, 2013

Regarding early stopping, I would additionally like to see a timeout parameter. Would that fit in well?

To clarify what I mean: every time the validation set is checked, the elapsed wall time could be evaluated.


@amueller amueller commented Jan 30, 2013

Not sure how well that works with multiprocessing....


@jwkvam jwkvam commented Feb 12, 2013

Not sure how well that works with multiprocessing....

I guess I don't understand why it wouldn't work. Any time the early_stopping_tolerance is checked, the elapsed wall time can be computed. I didn't mean to use a timer, just grab the current time and see how much has elapsed. Does that make sense, or am I missing something obvious?


@amueller amueller commented Feb 15, 2013

I guess you could do that somehow.


@mblondel mblondel commented Feb 22, 2013

See #930 for a related issue that could possibly be tackled as well.


@jnothman jnothman commented Mar 14, 2013

I think by generalised CV you mean something like memoised prediction with respect to parameter variation: one fit means predict can be called for multiple parameter values.

For a trivial case, consider k in feature_selection.SelectKBest: fit's operation is independent of k. A more complicated case involves fit needing to explicitly store models for multiple parameters that are generated in the same process: the estimator needs to be prepared to store all those parameter cases.

And, considering the interaction of such a feature with Pipeline, we realise a changing transformer parameter upstream means a downstream model needs to be refit, but not vice-versa.

For this reason, I can as yet only see a solution along the following lines:

  • Each estimator implements a generate_param_grid_clones() method which returns an iterator of pairs (params, estimator_clone). But the clones should be the same object if they can share the same fit() process. For instance, SelectKBest().generate_param_grid_clones(k=[10, 20]) would generate ({'k': 10}, est) and ({'k': 20}, est) where est is the same clone of self.
    • The default implementation of generate_param_grid_clones() presumes all parameter values need to be fit independently; a standard variation could be written to handle fit-independent parameters like k. (More complicated variants might need to prime the clone to store models for each parameter value.)
    • The delegation of interpreting the parameter grid to the estimator is essential to get Pipeline working under this use-case, as far as I can see.
  • GridSearchCV calls generate_param_grid_clones() once for each fold. In parallel, each clone is fit exactly once (by keeping a flag somewhere), while predict (etc.) is called for each grid point.
  • To avoid many clone-models being kept in memory, generate_param_grid_clones() should generate all identical clones contiguously, to be garbage collected ASAP.
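The scheme above might look roughly like this; generate_param_grid_clones() is the hypothetical method from this comment, and the SelectKBest stand-in is a toy (the real SelectKBest has no such method):

```python
# Sketch of the hypothetical generate_param_grid_clones() for a
# fit-independent parameter like k: every value shares one clone.
import copy

class ToySelectKBest:
    def __init__(self, k=1):
        self.k = k

    def generate_param_grid_clones(self, k):
        # fit() does not depend on k, so all values of k can share a
        # single clone: yield the *same* object for each setting.
        clone = copy.deepcopy(self)
        for value in k:
            yield {'k': value}, clone

pairs = list(ToySelectKBest().generate_param_grid_clones(k=[10, 20]))
params = [p for p, _ in pairs]
print(params)                      # [{'k': 10}, {'k': 20}]
print(pairs[0][1] is pairs[1][1])  # True: one shared clone, one fit
```

A grid search consuming this iterator would fit each distinct clone once and call predict per grid point, as described above.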

@amueller amueller commented Mar 24, 2015

@jnothman Somewhat slow reply here, sorry ;) I am not sure I totally agree with your view of "one fit, multiple predict". That doesn't seem to cover path algorithms such as LassoCV. You could store the coef_ for each regularization parameter, but I don't think this is how I would do it.


@fingoldo fingoldo commented Jul 13, 2017

Hi guys, can you please comment on the current status of this feature?
When I started reading scikit-learn books (btw, thank you Andreas for your nice book) and watching online courses, one of my first questions after coming from the Matlab world (where I used to train ANN models) was "where the hell is early stopping here?"
One literally can't find any book on the subject mentioning it, and it seems like most of the scikit community is not familiar with this concept.
From what I understood, the fit method of an estimator in most cases looks for a global minimum of an error function over the training set. And the recommended approach is to try many iterations with different hyperparameter sets (using GridSearchCV or RandomizedSearchCV), where at each iteration the internal model parameters are first chosen to minimize the error function over the training data, irrespective of the current test data; at a later stage the trained model makes its prediction on the current test set; and finally the "best" hyperparameter subset is defined as the argmax of those recorded scores on the test sets.
However, one can easily spot a potential problem with such an approach: if we believe that early stopping really can prevent the model from learning the noise of a particular training set, then it means our cross-validated scores over the test sets are suboptimal. And who can guarantee that, if we employed early stopping at the training stage, we wouldn't get a totally different GridSearchCV judgement on the best hyperparameter set as well? Just because the computed scores over the test sets would be hugely different.
I think a good motivation to add this feature to a production release would be taking several baseline classification/regression problems (like iris and digits) and solving them with and without early stopping, because currently I can't get rid of the feeling that the actual recommended approach creates a bunch of overfitted models in many cases. Maybe it's not that bad, as it only allows relatively simple models to pass through the filter, but we could really be missing something good with powerful models as well, which in conjunction with early stopping could potentially produce results of better quality.
What do you think?


@amueller amueller commented Dec 15, 2017

@fingoldo there's early stopping in the neural nets and gradient boosting in scikit-learn.
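For reference, a minimal sketch of one of those existing knobs (the parameter values here are arbitrary, and n_iter_no_change needs a reasonably recent scikit-learn version):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 10% of the training data internally and stop once the
# validation score has not improved by at least tol for
# n_iter_no_change consecutive epochs.
clf = MLPClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=10, tol=1e-4, max_iter=500,
                    random_state=0).fit(X, y)
print(clf.n_iter_)  # number of epochs actually run
```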


@fingoldo fingoldo commented Dec 15, 2017

@amueller Yes, thanks, I already know this; I'm wondering if there are some comparisons available of the quality of solutions found with and without early stopping in sklearn :-)


@ogrisel ogrisel commented Dec 15, 2017

There is no general answer, this is both model and problem specific.


@fingoldo fingoldo commented Jan 23, 2018

Guys, is my understanding correct?
Let's take GridSearchCV or RandomizedSearchCV.
Either proceeds to the next training set.
That set contains signal + noise.
Either calls the fit method of the required estimator.
The fit method does not have an explicit objective function as a parameter, so I'm assuming it always uses Metric1, which is easily differentiable and has easily producible gradients (MSE, for example). Is that right?
Next, the fit method optimizes Metric1 on (signal + training-set noise) till convergence.
Next, the meta-estimator computes Metric2, which is defined by the user and is business-related (for example roc_auc_weighted), on the next test set (which again consists of signal + the test set's own noise).

Winning set of parameters is declared based on maximal average test score, right?

Now the question: obviously each test score is the score of an estimator which has overfit the training set. So what we get as a mean test score for a parameter set is the mean performance of an OVERFIT estimator, which was

  1. trained using a totally different metric, just because it's "faster to converge" (but can possibly converge to a totally wrong goal)
  2. trained till convergence on NOISE

The suggested production steps then include taking the classifier/parameter set with the best average cross-validated score and fitting it to the entire dataset (which in turn fits it to NOISE as well). We all know it will be overfit, and the best the sklearn framework can do for us is to provide estimates of the performance achievable in reality, based on the averaged cross-validated scores.
So basically you leave the user with the "best of all overfitting models" and the knowledge of "what real-world performance to expect from such an overfit model". But what if that's not what users are after?

Ok, you have added early stopping for neural nets and gradient boosting, but

  1. what about the tens of other classifiers/regressors, which make up the vast majority of sklearn?
  2. I tried MLPClassifier with early stopping; maybe I was doing something wrong, but even with early stopping it's overfitting TERRIBLY.

This part of the story all of the scikit manuals and books (including Andreas' and Sebastian's books) prefer to omit. But I think it's a crucial point.
Are you guys sure that the chosen approach is the best one?

  1. What I am really interested in is an estimator which has learnt ONLY SIGNAL and not the particularities of a training dataset. And I also want to know what to expect from such an estimator, not from one which is hugely overfit. MY SUGGESTION: why can't we have a parameter which tells the estimator to reserve a random part of the training set and use it for early stopping? Really that simple.
  2. Why don't sklearn users have the freedom of choosing optimization algorithms other than gradient-based ones? Computing the user's cost function (Metric2 instead of Metric1) directly over sampled points in parameter space, or simulated annealing, or GAs (genetic algorithms), heuristics, etc.? MY SUGGESTION: allow users to choose the optimizer and the training-set cost function.

Does this make sense?
Forgive me if I'm missing something; kindly point out where I'm wrong ;-)


@Borda Borda commented Jun 26, 2019

Is this still an active discussion? For me, the early stopping option really makes sense, especially in the case of RandomizedSearchCV, where you may hit the best configuration at almost any time and don't want to waste your resources... @amueller

@amueller amueller commented Jul 23, 2019

@Borda you can also just set n_iter to 1 and do a for-loop around it. That's equivalent to early stopping. The only issue is that it's going to be serial, which is the general issue with early stopping: really you want an asynchronous architecture that stops as soon as one of the jobs is good enough. If the jobs take reasonably similar amounts of time, you can just set n_iter=n_cpus and run a for-loop around it.
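A sketch of that workaround, assuming an arbitrary good_enough threshold and parameter grid:

```python
# Sketch: run RandomizedSearchCV one candidate at a time and stop as
# soon as a score clears a (here arbitrary) good_enough threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)
param_distributions = {'C': [0.01, 0.1, 1.0, 10.0]}
good_enough = 0.85

best_score, best_params = float('-inf'), None
for seed in range(20):
    search = RandomizedSearchCV(LogisticRegression(), param_distributions,
                                n_iter=1, random_state=seed).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_params = search.best_score_, search.best_params_
    if best_score >= good_enough:
        break  # early stopping: skip the remaining candidates
print(best_params)
```

With n_iter=n_cpus each loop iteration evaluates one batch of candidates in parallel instead of one candidate.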


@Borda Borda commented Jul 23, 2019

It depends: if I set n_iter=n_cpus, in how many jobs? I meant that if one search is good enough, you just don't start any others, finish the ones actually running, and then take the best from the results... So not necessarily something very complicated with async. Yes, I can write my own top-level for-loop for train-test cross-validation and evaluate quality... I just wanted to raise the question in case someone else would find it interesting.
