Averaging fold criteria not usually advisable #106
Sorry to be blunt, but this is absolutely not the case and never has been. Averaging the individual resamples for any resampling method has always been the method for estimating performance. There are a multitude of papers that show this but, at this time, the most authoritative reference is ESL:
This is true, as it should be. There are cases where we cannot move the integrals around without affecting the results (so they should not be equivalent).
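To spell out that point in symbols (notation here is purely illustrative, not from either book): write $F_v$ for the assessment cases of fold $v$, $n_v$ for its size, and $L_i$ for a per-case loss. The two orders of averaging agree for equal-sized folds, but a nonlinear criterion $m$ computed per fold need not match the same criterion on the pooled cases:

```math
\frac{1}{V}\sum_{v=1}^{V}\frac{1}{n_v}\sum_{i \in F_v} L_i
  \;=\; \frac{1}{N}\sum_{i=1}^{N} L_i
  \quad\text{when } n_v = N/V,
\qquad\text{but in general}\qquad
\frac{1}{V}\sum_{v=1}^{V} m(F_v) \;\neq\; m\!\Big(\bigcup_{v=1}^{V} F_v\Big).
```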
Um, please take a close look at the text you quoted, and compare it to what you said in your book and what I said. With one caveat that I'll get to in a moment, the book you quoted is saying what I said. In your book you say that you should compute the performance separately for each of the V folds and average these V performance metrics. Now look at the book you quoted. Observe that they are scoring each of the N cases individually, pooling them into a single set of N cases, and averaging these N error figures to compute a single performance metric from the pooled results. In other words, the book you quoted is doing exactly what I said is the usual method! The formula in the book you quoted is not averaging V separately computed criteria.

There is a caveat to what I just stated. The method shown in the book that you quoted will only work on error criteria that can be computed on individual cases and that are linear in the sense of averaging being valid. AUC, which is the context of your argument, fails at least the first of these requirements. So in that situation the formula in the book is incorrect for AUC and many other criteria.

Also note that for criteria that are linear in the sense of allowing averaging, your method of averaging V separate fold criteria and the more usual method of pooling for a single criterion are equivalent; the average of V such fold criteria will be mathematically identical to the average of N pooled case criteria, as shown in the book that you quoted. So for such situations, such as Mean Squared Error, there is no argument. They are identical.

The problem arises for criteria that fail the linearity restriction, in which case averaging V criteria can be disastrous. I worked in the financial industry for several decades, and many financial criteria are strongly nonlinear. Worse, many financial criteria become extremely unstable and can have very heavy tails for small evaluation sets. Computing them for relatively small folds invites disaster, because a single wild criterion can pull the average far out of the ballpark. In such cases it is absolutely mandatory to pool all test cases and compute a single performance criterion.

AUC is a rare exception to the importance of pooling, and this is only because each fold will likely have a different distribution of predicted values, making it invalid to sort the pooled predictions. Thus for AUC, the formula given in the book you quoted cannot be used!

I really don't want to get into any kind of flaming war; I really am enjoying your book, and I consider it a valuable addition to my bookshelf. In fact, I plan to give it a very nice review on Amazon as soon as I finish reading it. But when I encounter a serious error or misleading statement, I feel compelled to comment. I intended to write to you privately, but the book had no contact information, and a Google search for your names brought me to this GitHub Issues page.
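A minimal sketch of both halves of this claim on hypothetical data (the per-fold score shifts below stand in for different models producing scores on different scales; the scikit-learn names are just for convenience): with equal-sized folds, fold-averaged MSE is algebraically identical to pooled MSE, while fold-averaged and pooled AUC can disagree badly.

```python
# Sketch, hypothetical data: MSE is linear in the cases, AUC is not.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)           # binary labels
y_pred = np.empty(100)
folds = np.array_split(np.arange(100), 5)       # 5 equal folds of 20 cases
for k, f in enumerate(folds):                   # fold k's "model" scores on
    y_pred[f] = y_true[f] + rng.normal(0, 1, f.size) + 3 * k  # a shifted scale

mse_folds = [mean_squared_error(y_true[f], y_pred[f]) for f in folds]
auc_folds = [roc_auc_score(y_true[f], y_pred[f]) for f in folds]

# Fold average equals pooled value (up to float rounding) for MSE...
print(np.mean(mse_folds), mean_squared_error(y_true, y_pred))
# ...but the fold-averaged AUC is informative while the pooled AUC is
# wrecked by the incomparable score scales.
print(np.mean(auc_folds), roc_auc_score(y_true, y_pred))
```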
The only occasion where you would pool and compute is LOO.
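A sketch of what pooling under leave-one-out looks like, again with hypothetical data and model: each held-out set is a single case, so no per-fold criterion exists, and the N held-out predictions are scored once.

```python
# Sketch: under LOO each "fold" holds out one case, so per-fold AUC is
# undefined; pool the N held-out predictions and compute one criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=60, random_state=1)
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut(), method="predict_proba")[:, 1]
print(roc_auc_score(y, pred))   # single criterion from the pooled predictions
```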
Topepo - I don't know if this is the proper forum for an academic discussion, but this cross-validation topic is important and interesting. The great danger of an online discussion is that it can easily degenerate into a shouting session, when what I really want is the sort of conversation a bunch of stuffy 18th-century gentlemen (and ladies, I would hope) would have around a fireplace, smoking their pipes. Anyhow, in view of your statement that pooling would be done only for leave-one-out CV, I propose two questions for you, if I may.
HINT: Think about the denominator in the criterion.

ANOTHER HINT: Here's another performance criterion that illustrates the situation. Our application needs to maximize the mean outcome of positive-prediction cases relative to the standard deviation of those outcomes, so our criterion is the mean outcome of the positive-prediction cases divided by the standard deviation of these cases.

Tim
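A sketch of this hinted criterion on hypothetical data: the denominator is a standard deviation estimated from only the fold's positive-prediction cases, so tiny folds make each fold criterion erratic, while pooling estimates it from all positive calls at once.

```python
# Sketch, hypothetical outcomes and positive calls: mean/std of the
# positive-prediction cases, computed per fold versus pooled.
import numpy as np

rng = np.random.default_rng(2)
outcome = rng.standard_t(df=3, size=200)      # heavy-tailed outcomes
positive = rng.random(200) < 0.5              # hypothetical positive predictions

def criterion(x):
    return x.mean() / x.std(ddof=1)           # mean/std of positive cases

folds = np.array_split(np.arange(200), 10)    # 10 folds of 20 cases
per_fold = [criterion(outcome[f][positive[f]]) for f in folds]

print(np.mean(per_fold), np.std(per_fold))    # fold average, with a large spread
print(criterion(outcome[positive]))           # single pooled criterion
```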
In section 3.4.1, the suggestion is made that in order to compute a performance criterion using V-fold cross validation, the criterion should be computed separately for each fold, and then these V numbers averaged. This method is usually required for the AUC criterion, because different folds will generally have different distributions of predicted values, since different models are employed (though this is not always the case). However, for most practical criteria this method ranges from ill-advised to completely invalid. The usual method is to pool all assessment cases into a single set that represents the entire training set, and then compute a single performance criterion for this pooled set.

As an extreme example, consider a financial application in which the (usually excellent) profit-factor criterion is used. The smaller the assessment set (from a greater number of folds), the more likely we are to get an infinite profit factor in one or more folds, which of course blows up the whole process. Even seemingly innocuous criteria such as RMS error will give different results from averaging folds versus pooling folds, and I believe an argument can be made that the latter is superior.
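A sketch of the profit-factor failure mode with hypothetical trade returns (profit factor taken here as gross gains divided by gross losses): a small fold with no losing trades yields an infinite fold criterion, while the pooled criterion stays finite.

```python
# Sketch, hypothetical trade returns: profit factor per tiny fold vs pooled.
import numpy as np

rng = np.random.default_rng(3)
returns = rng.normal(loc=0.2, scale=1.0, size=100)   # 100 trade returns

def profit_factor(r):
    losses = -r[r < 0].sum()                         # gross loss (positive)
    return r[r > 0].sum() / losses if losses > 0 else float("inf")

folds = np.array_split(returns, 25)                  # tiny folds of 4 trades
per_fold = [profit_factor(f) for f in folds]

print(np.mean(per_fold))      # often inf: one all-gain fold ruins the average
print(profit_factor(returns)) # pooled criterion remains finite
```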