##  Nonlinear Supervised Learning Series

# Cross-validation

In this post we describe cross-validation, an effective framework for automatically and intelligently choosing the proper number of basis functions in building regressors and classifiers. Cross-validation is necessary in order to prevent underfitting and overfitting, two undesired phenomena that, as we will see, can arise in solving any regression/classification problem. Our discussion will culminate in the description of a specific procedure known as *k-fold cross-validation* which is commonly used in practice.

## 1. More can be less with realistic data!

In the *ideal* instance of regression, where we look to approximate a continuous function using a linear combination of simpler basis functions, we saw previously that using more elements of a basis results in a better approximation. The same is true in the *ideal* case of classification: adding more basis elements always improves approximation of step functions.   
In short, in the context of continuous and step function approximation more (basis elements) is always better. But does the same principle apply in the real instances of regression and classification, where we only have access to noisy samples of the underlying data-generating functions? Unfortunately, the answer is *no*.

<hr>

#### <span style="color:#a50e3e;">Example 1: </span> More can be less with realistic regression data

In Figure 1 we show polynomial fits of degrees three (in blue) and ten (in purple) to two datasets: an ideal regression dataset on the left and a realistic one on the right. In the ideal case, by increasing the number of basis functions $M$ from $3$ to $10$ the corresponding polynomial model fits the data and the underlying sinusoidal function better. Conversely, in the right panel while the model fits the data better as we increase the number of polynomial features from $M = 3$ to $10$, the representation of the underlying data-generating function gets worse! Since the underlying function is the object we truly wish to understand, this is a problem.

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_13.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 1:</strong> <em> Plots of (left panel) finely discretized and (right panel) noisy samples of the data-generating function $y\left(x\right) = sin\left(2\pi x\right)$, along with its degree three and degree ten polynomial approximations in blue and purple, respectively. While the higher degree polynomial does a better job at modeling both the finely discretized data and underlying continuous function, it only fits the noisy sampled data better, providing a worse approximation of the underlying data-generating function than the lower degree polynomial on the right. </em>  </figcaption> 
</figure>

<hr>

#### <span style="color:#a50e3e;">Example 2: </span> More can be less with realistic classification data

In Figure 2 we illustrate the same issue as in Example 1, this time with classification, using the boundary view of a particular discretized step function (left panel) along with a noisy sampled version of it (right panel). For each dataset we show the resulting fit provided by both a degree $2$ and a degree $5$ polynomial (shown in black and green respectively). While the degree $2$ approximation produces a boundary in each case that closely matches the true boundary, the degree $5$ polynomial fit to the noisy data creates an overly complicated classifier which encapsulates noisy points outside of the circular boundary of the true function, leading to a poorer representation.

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_11.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 2:</strong> <em> Discretized (left panel) and noisy samples (right panel) from a data-generating step function with class boundary shown in dashed black (the circle of radius one-half), along with fits of degree $2$ (in solid black) and degree $5$ (in solid green) polynomials. Like regression, while increasing the number of basis elements produces a better fit in the ideal case, for more realistic cases like the dataset on the right this can lead to serious representation problems. </em>  </figcaption> 
</figure>

<hr>

The phenomenon illustrated in the examples above holds more generally: by increasing the number $M$ of basis elements of any type (polynomial, Fourier, and so on) we can indeed produce better-fitting models of a dataset, but at the potential cost of creating poorer representations of the data-generating function we care about foremost.

Stated formally, given any regression dataset we can drive the value of the Least Squares cost 

\begin{equation}
\underset{p=1}{\overset{P}{\sum}}\left[y_p-\left(w_0+\underset{m=1}{\overset{M}{\sum}} f_{m}\left(\mathbf{x}_p\right)w_m\right)\right]^{2}
\end{equation}

to zero by choosing $M$ large enough.

Similarly, given any classification dataset we can drive the value of a chosen classification cost, e.g., softmax

\begin{equation}
\underset{p=1}{\overset{P}{\sum}}\text{log}\left(1+e^{-y_p\left(w_0+\sum_{m=1}^{M} f_{m}\left(\mathbf{x}_p\right)w_m\right)}\right)
\end{equation}

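arbitrarily close to zero by increasing $M$ (each summand of the softmax cost is strictly positive, so zero is approached but never attained).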

Therefore, choosing the value of $M$ correctly is extremely important. Note in the language of machine learning, a model corresponding to too large a choice of $M$ is said to *overfit* the data. Likewise, when choosing $M$ too small the model is said to *underfit* the data. For instance, using polynomial basis functions and with $M$ set to $1$ we can only find the best linear fit to the data in Figure 1, which would be not only a poor fit to the observed data but also a poor representation of the underlying sinusoidal pattern.
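The claim that the Least Squares cost can be driven to zero is easy to check numerically. Below is a minimal sketch, assuming a polynomial basis and a small synthetic dataset of noisy sinusoid samples (the dataset, the noise level, and the helper names `poly_features` and `least_squares_cost` are illustrative assumptions, not the data or code behind Figure 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of y(x) = sin(2*pi*x); not the data used in Figure 1
P = 10
x = np.linspace(0, 1, P)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(P)

def poly_features(x, M):
    """Columns 1, x, x^2, ..., x^M: the bias plus M polynomial basis elements."""
    return np.vander(x, M + 1, increasing=True)

def least_squares_cost(x, y, M):
    """Minimum value of the Least Squares cost over the weights w_0, ..., w_M."""
    F = poly_features(x, M)
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return np.sum((y - F @ w) ** 2)

# The training cost shrinks as M grows and is numerically zero once M = P - 1
costs = {M: least_squares_cost(x, y, M) for M in range(1, P)}
for M, c in costs.items():
    print(f"M = {M}: training cost = {c:.2e}")
```

With $P$ distinct inputs, $M = P - 1$ polynomial elements are enough to interpolate every point, noise included, which is exactly the overfitting behavior described above.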

<hr>

#### <span style="color:#a50e3e;">Example 3: </span> Overfitting and underfitting Galileo’s ramp data

In the left panel of Figure 3 we show the data from Galileo's classic ramp experiment, initially described in BLAH, performed in order to understand the relationship between time and the distance an object falls due to (the force we today know as) gravity. Also shown in the left panel is the kind of quadratic fit Galileo used to describe the underlying relationship traced out by the data, and in the right panel two other possible model choices: a linear fit in green, as well as a degree $12$ polynomial fit in magenta. Of course the linear model is inappropriate, as with this data any line would have large squared error and would thus be a poor representation of the data. On the other hand, while the degree $12$ polynomial fits the data perfectly, with corresponding squared error value of zero, the model itself just "looks wrong."

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_14.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 3:</strong> <em> Data from Galileo's simple ramp experiment, exploring the relationship between time and the distance an object falls due to gravity. (left panel) Galileo fit a simple quadratic to the data. (right panel) A linear model (shown in green) is not flexible enough and as a result, underfits the data. A degree $12$ polynomial (shown in magenta) overfits the data, being too complicated and unnatural (between the start and 0.25 of the way down the ramp the ball travels a negative distance!) to be a model of a simple natural phenomenon. </em>  </figcaption> 
</figure>

Examining the right panel of Figure 3, why, for example, does the distance the ball travels become negative between the beginning and a quarter of the way down the ramp? This kind of behavior does not at all match our intuition or expectation about how gravity should operate on an object. This is why Galileo chose a quadratic, rather than a higher degree polynomial, to fit such a dataset: because he *expected* that the rules which govern our universe are explanatory yet simple.

<hr>

This principle, that the rules we use to describe our universe should be flexible yet simple, is often called *Occam's Razor* and lies at the heart of essentially all scientific inquiry past and present. Since machine learning can be thought of as a set of tools for making sense of arbitrary kinds of data, i.e., not only data relating to a physical system or law, we want the relationship learned in solving a regression (or classification) problem to also satisfy this basic Occam's Razor principle. In the context of machine learning, Occam's Razor manifests itself geometrically, i.e., we expect the model (or function) underlying our data to be simple yet flexible enough to explain the data we have. The linear model in Figure 3, being too rigid and inflexible to establish the relationship between time and the distance an object falls due to gravity, fits very poorly. As previously mentioned, in machine learning such a model is said to underfit the data we have. On the other hand, the degree $12$ polynomial model is needlessly complicated, resulting in a very close fit to the data we have, but is far too oscillatory to be representative of the underlying phenomenon and is said to overfit the data.

## 2. Diagnosing the problem of overfitting/underfitting

A reasonable diagnosis of the overfitting/underfitting problems is that both fail at representing new data, generated via the same process by which the current data was made, that we can potentially receive in the future. For example, the overfitting degree ten polynomial shown in the right panel of Figure 1 would poorly model any future data generated by the same process since it poorly represents the underlying data-generating function (a sinusoid). This data-centric perspective suggests a practical criterion for determining an ideal choice of $M$ for a given dataset: the number $M$ of basis functions used should be such that the corresponding model fits well both the current dataset and new data we will receive in the future.

## 3. Hold-out cross-validation

While we of course do not have access to any "new data we will receive in the future," we can simulate such a scenario by splitting our data into two subsets: a larger *training set* of data we already have, and a smaller *testing set* of data that we "will receive in the future." Then, we can try a range of values for $M$ by fitting each to the training set of known data, and pick the one that performs the best on our testing set of unknown data. By keeping a larger portion of the original data as the training set we can safely assume that the learned model which best represents the testing data will also fit the training set fairly well. In short, by employing this sort of procedure for comparing a set of models, referred to as hold-out cross-validation, we can determine a candidate that approximately satisfies our criterion for an ideal well-fitting model.

What portion of our dataset should we save for testing? There is no hard rule, and in practice typically between $\frac{1}{10}$ and $\frac{1}{3}$ of the data is assigned to the testing set. One general rule of thumb is that the larger the dataset (given that it is relatively clean and well distributed) the bigger the portion of the original data that may be assigned to the testing set (e.g., $\frac{1}{3}$ may be placed in the testing set) since the data is plentiful enough for the training data to still accurately represent the underlying phenomenon. Conversely, with smaller or less rich (i.e., noisier or more poorly distributed) datasets we should in general assign a smaller portion to the testing set (e.g., $\frac{1}{10}$ may be placed in the testing set) so that the relatively larger training set retains what little information about the underlying phenomenon was captured by the original data.

> In general the larger/smaller the original dataset the larger/smaller the portion of the original data that should be assigned to the testing set.

As illustrated in Figure 4, to form the training and testing sets we split the original data randomly into $k$ non-overlapping parts and assign $1$ portion to the testing set ($\frac{1}{k}$ of the original data) and $k-1$ portions to the training set ($\frac{k-1}{k}$ of the original data).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_15.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 4:</strong> <em> Hold-out cross-validation. The original data (left panel) shown here as the entire circular mass is split randomly (middle panel) into $k$ non-overlapping sets (here $k=3$). (right panel) One piece, or $\frac{1}{k}$ of the original dataset, is then taken randomly as the testing set with the remaining pieces, or $\frac{k-1}{k}$ of the original data, taken as the training set. </em>  </figcaption> 
</figure>

Regardless of the value we choose for $k$, we train our model on the training set using a range of different values of $M$. We then evaluate how well each model (or in other words, each value of $M$) fits to both the training and testing sets, via measuring the model's training error and testing error, respectively. The best-fitting model is chosen as the one providing the lowest testing error or the best fit to the "unseen" testing data. Finally, in order to leverage the full power of our data we use the optimal number of basis functions $M$ to train our model, this time using the entire data (that is, both training and testing sets).
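The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the data is synthetic (noisy sinusoid samples), the basis is polynomial, the errors are mean squared errors, and the helper names `poly_features`, `fit`, and `mse` are hypothetical rather than taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: noisy samples of a sinusoid (not the data in the figures)
P = 30
x = rng.uniform(0, 1, P)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(P)

def poly_features(x, M):
    """Bias column plus M polynomial basis elements x, x^2, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit(x, y, M):
    """Least Squares weights w_0, ..., w_M for the given points."""
    w, *_ = np.linalg.lstsq(poly_features(x, M), y, rcond=None)
    return w

def mse(x, y, w):
    """Mean squared error of the model with weights w on the given points."""
    return np.mean((y - poly_features(x, len(w) - 1) @ w) ** 2)

# Randomly split the indices into k parts: one for testing, k - 1 for training
k = 3
parts = np.array_split(rng.permutation(P), k)
test_idx = parts[0]
train_idx = np.concatenate(parts[1:])

# Fit every candidate M on the training set and record both errors
M_range = range(1, 9)
train_err, test_err = [], []
for M in M_range:
    w = fit(x[train_idx], y[train_idx], M)
    train_err.append(mse(x[train_idx], y[train_idx], w))
    test_err.append(mse(x[test_idx], y[test_idx], w))

# Keep the M with the lowest testing error, then retrain on the entire dataset
M_star = M_range[int(np.argmin(test_err))]
w_final = fit(x, y, M_star)
print("chosen M:", M_star)
```

Note that the selected value depends on the random split, which is the weakness of the hold-out method that k-fold cross-validation is designed to reduce.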

<hr>

#### <span style="color:#a50e3e;">Example 4: </span> Hold-out for regression using Fourier kernel basis

To solidify these details, in Figure 5 we show an example of applying hold-out cross-validation using a dataset of $P=30$ points generated via the function $y(x)=e^{3x}\frac{\text{sin}\left(3\pi^2\left(x-0.5\right)\right)}{3\pi^2\left(x-0.5\right)}$. To perform hold-out cross-validation on this dataset we randomly partition it into $k=3$ equal-sized (ten points each) non-overlapping subsets, using two parts together as the training set and the final part as the testing set, as illustrated in the left panel of Figure 5. The points in this panel are colored blue and yellow indicating that they belong to the training and testing sets respectively. We then train our model on the training set (blue points) by solving several instances of the Least Squares problem. In particular we use a range of even values for the number $M$ of Fourier basis functions, $M=2, 4, 6, \ldots, 16$ (since Fourier elements naturally come in pairs), which corresponds to the range of degrees $D=1, 2, 3, \ldots, 8$ (note that for clarity panels in the figure are indexed by $D$).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_16.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5:</strong> <em> An example of hold-out cross-validation applied to a simple dataset using Fourier kernel basis functions. (left panel) The original data split into training and testing sets, with the points belonging to each set colored blue and yellow respectively. (middle eight panels) The fit resulting from each set of degree $D$ Fourier features in the range $D = 1, 2, \ldots, 8$ is shown in blue in each panel. Note how the lower degree fits underfit the data, while the higher degree fits overfit the data. (second from right panel) The training and testing errors, in blue and yellow respectively, of each fit over the range of degrees tested. From this we see that $D^{\star}=5$ (or $M^{\star}=10$) provides the best fit. Also note how the training error always decreases as we increase the degree/number of basis elements, which will always occur regardless of the dataset/basis type used. (right panel) The final model using $M^{\star}=10$ trained on the entire dataset (shown in red) fits the data well and closely matches the underlying data generating function (shown in dashed black).</em>  </figcaption> 
</figure>

Based on the models learned for each value of $M$ (see the middle set of eight panels of the figure) we plot training and testing errors (in the panel second from the right), measuring how well each model fits the training and testing data respectively, over the entire range of values. Note that unlike the testing error, the training error always decreases as we increase $M$ (which occurs more generally regardless of the dataset/basis used). The model that provides the smallest testing error ($M^{\star}=10$ or equivalently $D^{\star}=5$) is then trained again on the entire dataset, giving the final regression model shown in red in the rightmost panel of Figure 5.
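To make the relationship $M = 2D$ in this example concrete, here is a small sketch of a Fourier feature map together with the data-generating function quoted above. The specific form of the basis (one cosine and one sine at each frequency $m = 1, \ldots, D$) is the standard choice and is assumed here, since the text does not spell it out; the function names are hypothetical.

```python
import numpy as np

def fourier_features(x, D):
    """Bias column plus M = 2*D elements: cos(2*pi*m*x), sin(2*pi*m*x) for m = 1..D."""
    cols = [np.ones_like(x)]
    for m in range(1, D + 1):
        cols.append(np.cos(2 * np.pi * m * x))
        cols.append(np.sin(2 * np.pi * m * x))
    return np.stack(cols, axis=1)

def y_true(x):
    """e^{3x} * sin(3 pi^2 (x - 0.5)) / (3 pi^2 (x - 0.5)).

    np.sinc(z) computes sin(pi z) / (pi z), so z = 3 pi (x - 0.5) gives the
    ratio above and also handles the removable singularity at x = 0.5.
    """
    return np.exp(3 * x) * np.sinc(3 * np.pi * (x - 0.5))

x = np.linspace(0, 1, 30)
for D in range(1, 9):
    F = fourier_features(x, D)
    print(f"D = {D}: M = {F.shape[1] - 1} basis elements")
```

Each increase of the degree $D$ by one adds a cosine and sine pair, which is why only even values of $M$ appear in the search range.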

<hr>

#### <span style="color:#a50e3e;">Example 5: </span> Hold-out for classification using polynomial kernel basis

In Figure 6 we show the result of applying hold-out cross-validation to the dataset first shown in BLAH. Here we set $k=3$, use the softmax cost, and try $M$ in the range $M=2, 5, 9, 14, 20, 27, 35, 44$ which corresponds to polynomial degrees $D=1, 2, \ldots, 8$ (note that for clarity panels in the figure are indexed by $D$).
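The listed values of $M$ are consistent with counting every non-constant monomial of degree at most $D$ in two input variables, with the bias $w_0$ counted separately. The short check below assumes exactly that (two-dimensional input, constant term excluded); the function name is hypothetical.

```python
from math import comb

def num_poly_basis(D, N=2):
    """Number of non-constant monomials of degree at most D in N variables."""
    return comb(N + D, D) - 1

print([num_poly_basis(D) for D in range(1, 9)])
# -> [2, 5, 9, 14, 20, 27, 35, 44]
```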

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_12.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 6:</strong> <em> An example of hold-out cross-validation for classification using polynomial basis. (left panel) The original data split into training and testing sets, with the points belonging to each set drawn as smaller thick and larger thin points respectively. (middle eight panels) The fit resulting from each set of degree $D$ polynomial features in the range $D = 1, 2, \ldots, 8$ shown in black in each panel. Note how the lower degree fits underfit the data, while the higher degree fits overfit the data. (second from right panel) The training and testing errors, in blue and yellow respectively, of each fit over the range of degrees tested. From this we see that $D^{\star}=4$ (or $M^{\star}=14$) provides the best fit. Also note how the training error always decreases as we increase the degree/number of basis elements, which will always occur regardless of the dataset/basis type used. (right panel) The final model using $M^{\star}=14$, trained on the entire dataset (shown in black), fits the data well and closely matches the boundary of the underlying data generating function (shown in dashed black).</em>  </figcaption> 
</figure>

Based on the models learned for each value of $M$ (see the middle set of eight panels of the figure) we plot training and testing errors (in the panel second from the right), measuring how well each model fits the training and testing data respectively, over the entire range of values. Again, note that unlike the testing error, the training error always decreases as we increase $M$ (which occurs more generally regardless of the dataset/basis used). The model that provides the smallest testing error ($M^{\star}=14$ or equivalently $D^{\star}=4$) is then trained again on the entire dataset, giving the final classification model shown in black in the rightmost panel of the figure.

<hr>

## 4. Hold-out calculations

Here we give a complete set of hold-out cross-validation calculations in a general setting. We denote the collection of points belonging to the training and testing sets respectively by their indices as

\begin{equation}
\begin{array}{c}
\Omega_{\textrm{train}}=\left\{ p\,\vert\,\left(\mathbf{x}_{p},\,y_{p}\right)\,\mbox{belongs to the training set}\right\} \\
\Omega_{\textrm{test}}=\left\{ p\,\vert\,\left(\mathbf{x}_{p},\,y_{p}\right)\,\mbox{belongs to the testing set}\right\} 
\end{array}
\end{equation}

We then choose a basis type (e.g., kernel, neural network, or trees) and choose a range for the number of basis functions over which we search for an ideal value for $M$. To determine the training and testing error of each value of $M$ tested we fit a corresponding model to the training set.

In the case of regression, defining the Least Squares cost over an index set $\Omega$ as 

\begin{equation}
g_{\Omega}\left(w_0,\,\ldots,\,w_M\right)=\underset{p\in\Omega}{\sum}\left[y_p-\left(w_0+\underset{m=1}{\overset{M}{\sum}} f_{m}\left(\mathbf{x}_p\right)w_m\right)\right]^{2}
\end{equation}


we solve the problem 

\begin{equation}
\underset{w_0,\,\ldots,\,w_M}{\mbox{minimize}}\,\,\,\,g_{\Omega_{\text{train}}}\left(w_0,\,\ldots,\,w_M\right)
\end{equation}

Denoting a solution to the problem above as $w_0^{\star},\,\ldots,w_{M}^{\star}$ we find the training and testing errors for the current value of $M$ by simply computing the mean squared error using these parameters over the training and testing sets, respectively

\begin{equation}
\begin{array}{c}
\begin{array}{c}
\mbox{training error}=\frac{g_{\Omega_{\text{train}}}\left(w_0^{\star},\,\ldots,\,w_M^{\star}\right)}{\left|\Omega_{\textrm{train}}\right|}\\
\mbox{testing error}=\frac{g_{\Omega_{\text{test}}}\left(w_0^{\star},\,\ldots,\,w_M^{\star}\right)}{\left|\Omega_{\textrm{test}}\right|}
\end{array}\end{array}
\end{equation}


where the notation $\left|\Omega_{\textrm{train}}\right|$ and $\left|\Omega_{\textrm{test}}\right|$ denotes the cardinality or number of points in the training and testing sets, respectively. Once we have performed these calculations for all values of $M$ we wish to test, we choose the one that provides the lowest testing error, denoted by $M^{\star}$. 

The hold-out calculations for classification closely mirror the regression versions, with a few differences: instead of the Least Squares cost, here we minimize one of many cost functions built exclusively for classification, e.g., the softmax cost, defined over an index set $\Omega$ as   

\begin{equation}
g_{\Omega}\left(w_0,\,\ldots,\,w_M\right)=\underset{p\in\Omega}{\sum}\text{log}\left(1+e^{-y_p\left(w_0+\sum_{m=1}^{M} f_{m}\left(\mathbf{x}_p\right)w_m\right)}\right)
\end{equation}

Again, denoting a minimizer of the above as $w_0^{\star},\,\ldots,w_{M}^{\star}$, we can calculate training and testing errors as described in (6) for each value of $M$. However with classification, it is commonplace to evaluate training and testing errors in (6), not using the actual classification cost in (7), but more directly via the counting cost defined over an index set $\Omega$ as

\begin{equation}
g_{\Omega}\left(w_0,\,\ldots,\,w_M\right) =\frac{1}{4} \underset{p\in\Omega}{\sum}\left[y_p-\text{sign}\left(w_0+\underset{m=1}{\overset{M}{\sum}} f_{m}\left(\mathbf{x}_p\right)w_m\right)\right]^{2}
\end{equation}

The optimal value for $M$, denoted by $M^{\star}$, is then found similarly as the one in the selected range providing the lowest testing error.
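The factor of $\frac{1}{4}$ in the counting cost makes it equal to the number of misclassified points: with labels $y_p \in \{-1, +1\}$, each squared difference is $0$ for a correct prediction and $4$ for an incorrect one. A minimal sketch, using made-up labels and model outputs purely for illustration:

```python
import numpy as np

def counting_cost(y, scores):
    """(1/4) * sum_p (y_p - sign(score_p))^2 for labels y_p in {-1, +1}."""
    return 0.25 * np.sum((y - np.sign(scores)) ** 2)

# Hypothetical labels and model outputs w_0 + sum_m f_m(x_p) w_m
y = np.array([1, -1, 1, 1, -1])
scores = np.array([0.8, 0.3, -1.2, 2.0, -0.1])

n_wrong = counting_cost(y, scores)
print(n_wrong, n_wrong / len(y))   # number of errors, and error rate
# -> 2.0 0.4
```

Dividing by the number of points in the index set, as in (6), turns this count into a misclassification rate. One caveat of this sketch: a score of exactly zero has sign $0$ and contributes $\frac{1}{4}$ rather than $0$ or $1$.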

Once $M^{\star}$ is found, with both regression and classification, the model should be re-trained one last time using $M^{\star}$ over the entire dataset, i.e., $\Omega=\{1,2,\ldots, P\}$.    

## 5. k-fold cross-validation

While the hold-out method previously described is an intuitive approach to determining proper fitting models, it suffers from an obvious flaw: having been chosen at random, the points assigned to the training set may not adequately describe the original data. However, we can easily extend and robustify the hold-out method as we now describe.

As illustrated in Figure 7 for $k=3$, with k-fold cross-validation we once again randomly split our data into $k$ non-overlapping parts. By combining $k-1$ parts we can, as with the hold-out method, create a large training set and use the remaining single part as a testing set. With k-fold cross-validation we repeat this procedure $k$ times (each instance being referred to as a fold), in each instance using a different single portion of the split as the testing set and the remaining $k-1$ parts as the corresponding training set, and computing the training and testing errors of all values of $M$ as described in the previous section. We then choose the value of $M$ that has the lowest average testing error. This is a more robust choice than the one the hold-out method provides, since it can average out a scenario where one particular choice of training set inadequately describes the original data.

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_17.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 7:</strong> <em> k-fold cross-validation for $k=3$. The original data shown here as the entire circular mass (top left) is split into $k$ non-overlapping sets (top right) just as with the hold-out method. However with k-fold cross-validation we repeat the hold-out calculations $k$ times (bottom), once per "fold," in each instance keeping a different portion of the split data as the testing set while merging the remaining $k-1$ pieces as the training set.</em>  </figcaption> 
</figure>

Note, however, that this advantage comes at a cost: k-fold cross-validation is (approximately) $k$ times more computationally costly than its hold-out counterpart. In fact performing k-fold cross-validation is often the most computationally expensive process performed to solve a regression problem.

> Performing k-fold cross-validation is often the most computationally expensive component in solving a general regression problem.


There is again no universal rule for the number $k$ of non-overlapping partitions (or the number of folds) to break the original data into. However, the same intuition previously described for choosing $k$ with the hold-out method also applies here, as well as the same convention with popular values of $k$ ranging from $k=3, \ldots,10$ in practice.
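The k-fold procedure amounts to running the hold-out calculations once per fold and averaging the testing errors. The sketch below assumes synthetic data, a polynomial basis, and mean squared error; the helper names are hypothetical and the setup is not the one used to produce the figures.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dataset: noisy samples of a sinusoid
P = 30
x = rng.uniform(0, 1, P)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(P)

def poly_features(x, M):
    """Bias column plus M polynomial basis elements."""
    return np.vander(x, M + 1, increasing=True)

def holdout_errors(x, y, train_idx, test_idx, M):
    """Fit on the training indices; return (training MSE, testing MSE)."""
    F_train = poly_features(x[train_idx], M)
    w, *_ = np.linalg.lstsq(F_train, y[train_idx], rcond=None)
    train = np.mean((y[train_idx] - F_train @ w) ** 2)
    test = np.mean((y[test_idx] - poly_features(x[test_idx], M) @ w) ** 2)
    return train, test

k = 3
M_range = list(range(1, 9))
parts = np.array_split(rng.permutation(P), k)

# test_err[fold, i] is the testing error of M_range[i] on that fold
test_err = np.zeros((k, len(M_range)))
for fold in range(k):
    test_idx = parts[fold]
    train_idx = np.concatenate([parts[j] for j in range(k) if j != fold])
    for i, M in enumerate(M_range):
        _, test_err[fold, i] = holdout_errors(x, y, train_idx, test_idx, M)

# Average over folds and pick the M with the lowest average testing error
avg_test_err = test_err.mean(axis=0)
M_star = M_range[int(np.argmin(avg_test_err))]
print("chosen M:", M_star)
```

The nested loop makes the cost visible: every candidate $M$ is fit $k$ times instead of once, which is the roughly $k$-fold increase in computation noted above.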

<hr>

#### <span style="color:#a50e3e;">Example 6: </span> k-fold cross-validation for regression using Fourier kernel basis

In Figure 8 we illustrate the result of applying k-fold cross-validation to choose the ideal number $M$ of Fourier basis functions for the dataset shown in Example 4, where it was originally used to illustrate the hold-out method for regression. As in Example 4, here we set $k=3$ and try $M$ in the range $M=2,4,6,\dots,16$, which corresponds to the range of degrees $D=1,2,3,\dots,8$ (note that for clarity, panels in the figure are indexed by $D$).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_18.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 8:</strong> <em> Result of performing k-fold cross-validation with $k=3$ (see text for further details). The top three rows display the result of performing the hold-out method on each fold. The left, middle, and right columns show each fold's training/testing sets (colored blue and yellow respectively), training and testing errors over the range of $M$ tried, and the final model (fit to the entire dataset) chosen by picking the value of $M$ providing the lowest testing error. Due to the split of the data, performing hold-out on the first fold (top row) results in a poor underfitting model for the data. However, as illustrated in the final row, by averaging the testing errors (bottom middle panel) and choosing the model with minimum associated average test error, we average out this problem (finding that $D^{\star}=5$ or $M^{\star}=10$) and determine an excellent model for the phenomenon (as shown in the bottom right panel).</em>  </figcaption> 
</figure>


In the top three rows of Figure 8 we show the result of applying hold-out on each fold. In each row we show a fold's training and testing data colored blue and yellow respectively in the left panel, the training/testing errors for each $M$ on the fold (as computed in Equation (6)) in the middle panel, and in the right panel the final model (fit to the entire dataset) provided by the choice of $M$ with lowest testing error. As can be seen in the top row, the particular split of the first fold leads to too low a value of $M$ being chosen, and thus an underfitting model. In the middle panel of the final row we show the result of averaging the training/testing errors over all $k=3$ folds, and in the right panel the result of choosing the overall best $M^{\star}=10$ (or equivalently $D^{\star}=5$) providing the lowest average testing error. By taking this value we average out the poor choice determined on the first fold, and end up with a model that fits both the data and underlying function quite well.

<hr>

#### <span style="color:#a50e3e;">Example 7: </span> k-fold cross-validation for classification using polynomial kernel basis

In Figure 9 we illustrate the result of applying k-fold cross-validation to choose the ideal number $M$ of polynomial basis functions for the dataset shown in Example 5, where it was originally used to illustrate the hold-out method for classification. As in Example 5, here we set $k=3$, use the softmax cost, and try $M$ in the range $M = 2,5,9,14,20,27,35,44$ which corresponds to polynomial degrees $D=1,2,\ldots,8$ (note that for clarity panels in the figure are indexed by $D$).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_13.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 9:</strong> <em> Result of performing k-fold cross-validation with $k=3$ (see text for further details). The top three rows display the result of performing the hold out method on each fold. The left, middle, and right columns show each fold's training/testing sets (drawn as thick and thin points respectively), training and testing errors over the range of $M$ tried, and the final model (fit to the entire dataset) chosen by picking the value of $M$ providing the lowest testing error. Due to the split of the data, performing hold-out on each fold results in a poor overfitting (first two folds) or underfitting (final fold) model for the data. However, as illustrated in the final row, by averaging the testing errors (bottom middle panel) and choosing the model with minimum associated average test error we average out these problems (finding that $D^{\star}=4$ or $M^{\star}=14$) and determine an excellent model for the phenomenon (as shown in the bottom right panel). </em>  </figcaption> 
</figure>

In the top three rows of Figure 9 we show the result of applying hold-out on each fold. In each row we show a fold's training and testing data in the left panel, the training/testing errors for each $M$ on the fold in the middle panel, and in the right panel the final model (fit to the entire dataset) provided by the choice of $M$ with lowest testing error. As can be seen, the particular split leads to an overfitting result on the first two folds and an underfitting result on the third fold. In the middle panel of the final row we show the result of averaging the training/testing errors over all $k=3$ folds, and in the right panel the result of choosing the overall best $M^{\star}=14$ (or equivalently $D^{\star}=4$) providing the lowest average testing error. By taking this value we average out the poor choices determined on each fold, and end up with a model that fits both the data and underlying function quite well.

<hr>

#### <span style="color:#a50e3e;">Example 8: </span> Leave-one-out cross-validation

In Figure 10 we show how $k=P$ fold cross-validation, sometimes referred to as *leave-one-out cross-validation*, applied to Galileo's ramp data allows us to recover precisely the quadratic fit Galileo made by eye (since we have only $P=6$ data points, intuition suggests that we use a large value for $k$). Note that choosing $k=P$ means that every data point takes a turn being the testing set. Here we search over polynomial basis functions in the range $M=1,2,\ldots,6$. While not all of the hold-out models over the six folds fit the data well, the average k-fold result is indeed the $M^{\star}=2$ quadratic polynomial fit originally proposed by Galileo!
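Leave-one-out cross-validation is simply k-fold cross-validation with one point per fold. The sketch below illustrates the mechanics on a synthetic stand-in for the ramp data (six points from a quadratic plus small noise); these are not Galileo's measurements, and the reduced search range $M = 1, \ldots, 4$ is an assumption chosen so that every fit on five training points is fully determined.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the ramp data (NOT Galileo's measurements):
# distance grows quadratically with time, plus a little measurement noise
P = 6
t = np.linspace(0, 1, P)
d = t ** 2 + 0.005 * rng.standard_normal(P)

def poly_features(t, M):
    """Bias column plus M polynomial basis elements."""
    return np.vander(t, M + 1, increasing=True)

M_range = [1, 2, 3, 4]
loo_err = []
for M in M_range:
    sq_errs = []
    for p in range(P):                       # each point takes a turn as the testing set
        train = np.delete(np.arange(P), p)   # the other P - 1 points form the training set
        w, *_ = np.linalg.lstsq(poly_features(t[train], M), d[train], rcond=None)
        pred = poly_features(t[p:p + 1], M) @ w
        sq_errs.append((d[p] - pred[0]) ** 2)
    loo_err.append(np.mean(sq_errs))         # average testing error over the P folds

M_star = M_range[int(np.argmin(loo_err))]
for M, e in zip(M_range, loo_err):
    print(f"M = {M}: average testing error = {e:.2e}")
print("chosen M:", M_star)
```

On data like this the linear model's average testing error is far larger than the quadratic's, reflecting the underfitting discussed in Example 3; which of the higher degrees wins depends on the particular noise draw.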

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_5_19.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 10:</strong> <em> (six panels on the left) Each fold of training/testing sets, shown in blue/yellow respectively, of a k-fold run on Galileo's ramp data, along with its individual hold-out model (shown in blue). Only the model learned on the fourth fold overfits the data. By choosing the model with minimum average testing error over the $k=6$ folds we recover the desired quadratic $M^{\star}=2$ fit originally proposed by Galileo (shown in magenta in the right panel).</em>  </figcaption> 
</figure>
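Leave-one-out cross-validation is short enough to sketch directly. The snippet below is an illustration only: it uses six hypothetical noisy samples of a quadratic as a stand-in for Galileo's measurements (which are not reproduced here), the helper name `leave_one_out_error` is our own, and the search is restricted to degrees $1$ through $4$ so that each fit is uniquely determined by its five training points.

```python
import numpy as np

def leave_one_out_error(x, y, D):
    """Mean squared testing error when each point is held out exactly once."""
    P = len(x)
    errs = []
    for p in range(P):
        keep = np.arange(P) != p          # train on all points except the p-th
        coeffs = np.polyfit(x[keep], y[keep], D)
        errs.append((np.polyval(coeffs, x[p]) - y[p]) ** 2)
    return float(np.mean(errs))

# hypothetical stand-in data: six noisy samples of a quadratic
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
y = 0.5 * x ** 2 + 0.01 * rng.standard_normal(6)

# degrees limited to those the five training points determine uniquely
degrees = [1, 2, 3, 4]
loo = [leave_one_out_error(x, y, D) for D in degrees]
best = degrees[int(np.argmin(loo))]
```

With $k=P$ there is no randomness in how the folds are formed, so no shuffling is needed: the only possible split is one point per fold.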

<hr>

#### <span style="color:#a50e3e;">Example 9: </span> Pitfalls of cross-validation

When a k-fold determined set of basis functions performs poorly this is almost always indicative of a poorly structured dataset (i.e., there is little relationship between the input/output data), like the one shown in the left panel of Figure 11. However, a high performing k-fold model can also be misleading as to how well we understand a phenomenon. In the middle and right panels of the figure we show two such instances that the reader should keep in mind when using k-folds: one where we have too little data (middle panel) and one where the data is poorly distributed (right panel).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_14.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 11:</strong> <em> (left panel) A low accuracy k-fold fit to a dataset indicates that it has little structure (i.e., that there is little to no relationship between the input and output). It is possible that a high accuracy k-fold fit fails to capture the true nature of an underlying function, as when (middle panel) we have too little data (the k-fold linear separator is shown in black, and the true nonlinear separator is shown dashed) and (right panel) when we have poorly distributed data (again the k-fold separator is shown in black, the original separator dashed). See text for further details. </em>  </figcaption> 
</figure>

In the first instance, shown in the middle panel, we have generated a small sample of points based on the curvy boundary shown in dashed black. The sample is so small that it is perfectly linearly separable, and thus applying e.g., k-fold cross-validation with a polynomial basis will properly (given the small sample) recover a line to distinguish between the two classes. Clearly, however, data generated via the same underlying process in the future will violate this linear boundary, and thus our model will perform poorly. This sort of problem arises in applications such as automatic medical diagnosis where access to data is typically limited. Unless we can gather additional data to fill out the space (making the nonlinear boundary more visible) this problem is unavoidable.

In the second instance, shown in the right panel of the figure, we have plenty of data (generated using a circular boundary shown in dashed black) but it is poorly distributed. In particular, we have no samples from the blue class in the lower half of the space. In this case the k-fold method (again using a polynomial basis) properly determines a separating boundary that perfectly distinguishes the two classes. However, many of the blue class points we would receive in the future in the lower half of the space will be misclassified by the learned k-fold model. This sort of issue can arise in practice, e.g., when performing face detection, if we do not collect a thorough dataset of blue (e.g., "non-face") examples. Again, unless we can gather further data to fill out the space this problem is unavoidable.

<hr>

#### <span style="color:#a50e3e;">Example 10: </span> k-fold cross-validation for multi-class classification

Employing the one-versus-all (OvA) framework for multi-class classification, we can immediately apply the k-fold method described previously. For a $C$ class problem we simply apply the k-fold method to each of the $C$ two-class classification problems, and combine the resulting classifiers using the fusion rule. We show the result of applying $k=3$ fold cross-validation with OvA on two datasets with $C=3$ and $C=5$ classes respectively in Figures 12 and 13, where we have used polynomial basis functions with $M=2,5,9,14,20,27,35,44$, or equivalently of degree $D =1,2,\ldots,8$, for each two-class subproblem. Displayed in each figure are the nonlinear boundaries determined for each fold, as well as the combined result in the right panel of each figure. In both instances the combined boundaries separate the different classes of data very well.
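The correspondence between $M$ and $D$ quoted above can be checked quickly. Assuming two-dimensional input (which is consistent with the listed values), the number of non-constant monomials of degree at most $D$ is $\binom{D+2}{2}-1$. The helper name `num_poly_features` below is our own.

```python
from math import comb

def num_poly_features(D, N=2):
    """Number of non-constant monomials of degree at most D in N input variables."""
    return comb(N + D, D) - 1

print([num_poly_features(D) for D in range(1, 9)])
# → [2, 5, 9, 14, 20, 27, 35, 44]
```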

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_15.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 12:</strong> <em> Result of performing $k=3$ fold cross-validation on the $C=3$ class dataset first shown in BLAH using OvA (see text for further details). The left three panels show the result for the red class versus all, blue class versus all, and green class versus all subproblems. For the red/green versus all problems the optimal degree found was $D^{\star}=2$, while for the blue versus all $D^{\star}=4$. The right panel shows the combined boundary determined by the fusion rule, which perfectly separates the three classes. </em>  </figcaption> 
</figure>

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_16.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 13:</strong> <em> Result of performing $k=3$ fold cross-validation on an overlapping $C=5$ class classification dataset (top panel) using OvA. The middle four panels show the result for the red, blue, green, and yellow class versus all subproblems respectively. The bottom two panels show the (left) purple class versus all and (right) the final combined boundary. For the red/purple versus all problems the degree found by k-folds was $D^{\star}=1$, while for the remaining subproblems $D^{\star}=2$. </em>  </figcaption> 
</figure>
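The OvA procedure described in this example can be sketched as follows. This is an illustration under several assumptions rather than the code behind Figures 12 and 13: the dataset is a hypothetical set of three Gaussian blobs, each two-class subproblem is solved by a least squares fit to $\pm 1$ targets (a simple stand-in for whichever two-class cost is preferred), the fusion rule is taken to be "assign each point to the class whose classifier gives the largest output", and all function names are our own.

```python
import numpy as np

def poly_features(X, D):
    """All monomials x1^a * x2^b with a + b <= D (constant term included)."""
    cols = [X[:, 0] ** a * X[:, 1] ** b
            for a in range(D + 1) for b in range(D + 1 - a)]
    return np.stack(cols, axis=1)

def fit(X, t, D):
    """Least squares fit to +1/-1 targets (a stand-in two-class learner)."""
    w, *_ = np.linalg.lstsq(poly_features(X, D), t, rcond=None)
    return w

def kfold_degree(X, t, degrees, k, seed=0):
    """Degree with the lowest average testing misclassification rate."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    avg_err = []
    for D in degrees:
        err = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(len(X)), test)
            w = fit(X[train], t[train], D)
            err += np.mean(np.sign(poly_features(X[test], D) @ w) != t[test])
        avg_err.append(err / k)
    return degrees[int(np.argmin(avg_err))]

def ova_kfold(X, labels, degrees, k=3):
    """One k-fold-selected two-class model per class."""
    models = []
    for c in np.unique(labels):
        t = np.where(labels == c, 1.0, -1.0)   # class c versus all
        D = kfold_degree(X, t, degrees, k)
        models.append((c, D, fit(X, t, D)))    # refit on the entire dataset
    return models

def predict(models, X):
    """Fusion rule: pick the class whose classifier output is largest."""
    scores = np.stack([poly_features(X, D) @ w for _, D, w in models], axis=1)
    classes = np.array([c for c, _, _ in models])
    return classes[np.argmax(scores, axis=1)]

# hypothetical C = 3 dataset: three Gaussian blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
X = np.concatenate([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])
labels = np.repeat(np.arange(3), 40)

models = ova_kfold(X, labels, degrees=[1, 2, 3, 4])
accuracy = np.mean(predict(models, X) == labels)
```

As in the figures, each subproblem is free to select its own degree, so the combined boundary can mix models of different complexity.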