Averaging fold criteria not usually advisable #106
Sorry to be blunt, but this is absolutely not the case and never has been. Averaging the individual resamples for any resampling method has always been the method for estimating performance. There are a multitude of papers that show this but, at this time, the most authoritative reference is ESL:
This is true, as it should be. There are cases where we cannot move the integrals around without affecting the results (so they should not be equivalent).
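To spell out that point in symbols (notation here is purely illustrative, not from either book): write $F_v$ for the assessment cases of fold $v$, $n_v$ for its size, and $L_i$ for a per-case loss. The two orders of averaging agree for equal-sized folds, but a nonlinear criterion $m$ computed per fold need not match the same criterion on the pooled cases:

```math
\frac{1}{V}\sum_{v=1}^{V}\frac{1}{n_v}\sum_{i \in F_v} L_i
  \;=\; \frac{1}{N}\sum_{i=1}^{N} L_i
  \quad\text{when } n_v = N/V,
\qquad\text{but in general}\qquad
\frac{1}{V}\sum_{v=1}^{V} m(F_v) \;\neq\; m\!\Big(\bigcup_{v=1}^{V} F_v\Big).
```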
Um, please take a close look at the text you quoted, and compare it to what you said in your book and what I said. With one caveat that I'll get to in a moment, the book you quoted is saying what I said. In your book you say that you should compute the performance separately for each of the V folds and average these V performance metrics. Now look at the book you quoted. Observe that they are scoring each of the N cases individually, pooling them into a single set of N cases, and averaging these N error figures to compute a single performance metric from the pooled results. In other words, the book you quoted is doing exactly what I said is the usual method! The formula in the book you quoted is not averaging V separately computed criteria.

There is a caveat to what I just stated. The method shown in the book that you quoted will only work on error criteria that can be computed on individual cases and that are linear in the sense of averaging being valid. AUC, which is the context of your argument, fails at least the first of these requirements. So in that situation the formula in the book is incorrect for AUC and many other criteria.

Also note that for criteria that are linear in the sense of allowing averaging, your method of averaging V separate fold criteria and the more usual method of pooling for a single criterion are equivalent; the average of V such fold criteria will be mathematically identical to the average of N pooled case criteria, as shown in the book that you quoted. So for such situations, such as Mean Squared Error, there is no argument. They are identical.

The problem arises for criteria that fail the linearity restriction, in which case averaging V criteria can be disastrous. I worked in the financial industry for several decades, and many financial criteria are strongly nonlinear. Worse, many financial criteria become extremely unstable and can have very heavy tails for small evaluation sets. Computing them for relatively small folds invites disaster, because a single wild criterion can pull the average far out of the ballpark. In such cases it is absolutely mandatory to pool all test cases and compute a single performance criterion.

AUC is a rare exception to the importance of pooling, and this is only because each fold will likely have a different distribution of predicted values, making it invalid to sort the pooled predictions. Thus for AUC, the formula given in the book you quoted cannot be used!

I really don't want to get into any kind of flaming war; I really am enjoying your book, and I consider it a valuable addition to my bookshelf. In fact, I plan to give it a very nice review on Amazon as soon as I finish reading it. But when I encounter a serious error or misleading statement, I feel compelled to comment. I intended to write to you privately, but the book had no contact information, and a Google search for your names brought me to this GitHub Issues page.
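A minimal sketch of both halves of this claim on hypothetical data (the per-fold score shifts below stand in for different models producing scores on different scales; the scikit-learn names are just for convenience): with equal-sized folds, fold-averaged MSE is algebraically identical to pooled MSE, while fold-averaged and pooled AUC can disagree badly.

```python
# Sketch, hypothetical data: MSE is linear in the cases, AUC is not.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)           # binary labels
y_pred = np.empty(100)
folds = np.array_split(np.arange(100), 5)       # 5 equal folds of 20 cases
for k, f in enumerate(folds):                   # fold k's "model" scores on
    y_pred[f] = y_true[f] + rng.normal(0, 1, f.size) + 3 * k  # a shifted scale

mse_folds = [mean_squared_error(y_true[f], y_pred[f]) for f in folds]
auc_folds = [roc_auc_score(y_true[f], y_pred[f]) for f in folds]

# Fold average equals pooled value (up to float rounding) for MSE...
print(np.mean(mse_folds), mean_squared_error(y_true, y_pred))
# ...but the fold-averaged AUC is informative while the pooled AUC is
# wrecked by the incomparable score scales.
print(np.mean(auc_folds), roc_auc_score(y_true, y_pred))
```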
The only occasion where you would pool and compute is LOO.
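A sketch of what pooling under leave-one-out looks like, again with hypothetical data and model: each held-out set is a single case, so no per-fold criterion exists, and the N held-out predictions are scored once.

```python
# Sketch: under LOO each "fold" holds out one case, so per-fold AUC is
# undefined; pool the N held-out predictions and compute one criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=60, random_state=1)
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut(), method="predict_proba")[:, 1]
print(roc_auc_score(y, pred))   # single criterion from the pooled predictions
```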
Topepo - I don't know if this is the proper forum for an academic discussion, but this cross-validation topic is important and interesting. The great danger of an online discussion is that it can easily degenerate into a shouting session, when what I really want is the sort of conversation a bunch of stuffy 18th-century gentlemen (and ladies, I would hope) would have around a fireplace, smoking their pipes. Anyhow, in view of your statement that pooling would be done only for leave-one-out CV, I propose two questions for you, if I may.
HINT: Think about the denominator in the criterion.

ANOTHER HINT: Here's another performance criterion that illustrates the situation. Our application needs to maximize the mean outcome of positive-prediction cases relative to the standard deviation of those outcomes, so our criterion is the mean outcome of the positive-prediction cases divided by the standard deviation of these cases.

Tim
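A sketch of this hinted criterion on hypothetical data: the denominator is a standard deviation estimated from only the fold's positive-prediction cases, so tiny folds make each fold criterion erratic, while pooling estimates it from all positive calls at once.

```python
# Sketch, hypothetical outcomes and positive calls: mean/std of the
# positive-prediction cases, computed per fold versus pooled.
import numpy as np

rng = np.random.default_rng(2)
outcome = rng.standard_t(df=3, size=200)      # heavy-tailed outcomes
positive = rng.random(200) < 0.5              # hypothetical positive predictions

def criterion(x):
    return x.mean() / x.std(ddof=1)           # mean/std of positive cases

folds = np.array_split(np.arange(200), 10)    # 10 folds of 20 cases
per_fold = [criterion(outcome[f][positive[f]]) for f in folds]

print(np.mean(per_fold), np.std(per_fold))    # fold average, with a large spread
print(criterion(outcome[positive]))           # single pooled criterion
```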
In section 3.4.1, the suggestion is made that in order to compute a performance criterion using V-fold cross validation, the criterion should be computed separately for each fold, and then these V numbers averaged. This method is usually required for the AUC criterion, because different folds will generally have different distributions of predicted values, since different models are employed (though this is not always the case). However, for most practical criteria this method ranges from ill-advised to completely invalid. The usual method is to pool all assessment cases into a single set that represents the entire training set, and then compute a single performance criterion for this pooled set.

As an extreme example, consider a financial application in which the (usually excellent) profit-factor criterion is used. The smaller the assessment set (from a greater number of folds), the more likely we are to get an infinite profit factor in one or more folds, which of course blows up the whole process. Even seemingly innocuous criteria such as RMS error will give different results from averaging folds versus pooling folds, and I believe an argument can be made that the latter is superior.
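A sketch of the profit-factor failure mode with hypothetical trade returns (profit factor taken here as gross gains divided by gross losses): a small fold with no losing trades yields an infinite fold criterion, while the pooled criterion stays finite.

```python
# Sketch, hypothetical trade returns: profit factor per tiny fold vs pooled.
import numpy as np

rng = np.random.default_rng(3)
returns = rng.normal(loc=0.2, scale=1.0, size=100)   # 100 trade returns

def profit_factor(r):
    losses = -r[r < 0].sum()                         # gross loss (positive)
    return r[r > 0].sum() / losses if losses > 0 else float("inf")

folds = np.array_split(returns, 25)                  # tiny folds of 4 trades
per_fold = [profit_factor(f) for f in folds]

print(np.mean(per_fold))      # often inf: one all-gain fold ruins the average
print(profit_factor(returns)) # pooled criterion remains finite
```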