## Naive moment matching for crossvalidation

Let us consider a typical $k$-fold crossvalidation scheme for an $n$-element dataset. Then for the $i$-th split we can measure:

* training error (*downward-biased risk estimate*) $S_i$,
* test error (*empirical risk*) $E_i$,
* optimism (*generalisation gap*) $\Delta_i = E_i-S_i$.

All of these values fluctuate (*are random variables*)  due to two sources of randomness:
* the choice of splits,
* the choice of data sample.

Our goal here is to establish the expected values and variances of their arithmetic means
\begin{align*}
\overline{S}&=\frac{S_1+\ldots+S_k}{k}\\
\overline{E}&=\frac{E_1+\ldots+E_k}{k}\\
\overline{\Delta}&=\frac{\Delta_1+\ldots+\Delta_k}{k}\\
\end{align*}
The central limit theorem **suggests but does not prove** that these averages should converge to normal distributions for two reasons:

* as sums of *near-independent* variables, they should converge when $k$ approaches infinity;
* each summand, being itself a sum of *near-independent* variables, should converge when the dataset size $n$ approaches infinity.

This convergence allows us to approximate the distributions of $\overline{S}$, $\overline{E}$ and $\overline{\Delta}$ with a normal distribution $\mathcal{N}(\mu,\sigma^2)$ as soon as we can approximate their mean $\mu$ and variance $\sigma^2$. This in turn allows us to estimate how probable it is that the observed mean deviates from the mean $\mu$ and to define confidence intervals.

In [1]:
import numpy as np
import pandas as pd

from pandas import Series
from pandas import DataFrame

from numpy.random import random

from plotnine import *

# Local imports
from convenience import *

In [2]:
np.random.seed(1351430195)

## I. Truly naive view on the crossvalidation

Let us replace the crossvalidation scheme with an experiment where we draw $k$ independent datasets of size $n$ and use each of them to define one split.
Let $A_1,\ldots, A_k$ be the corresponding training sets, $B_1,\ldots, B_k$ be the corresponding test sets, and let $f_1,\ldots, f_k$ be the corresponding predictors trained on respective training sets.

### Means

As all elements are independently and identically sampled, we can easily establish
\begin{align*}
\mathbf{E}(\overline{S})&=
\mathbf{E}\left(\frac{S_1+\ldots+S_k}{k}\right) =\frac{\mathbf{E}(S_1)+\ldots+\mathbf{E}(S_k)}{k}=\mathbf{E}(S_1)\\ 
\mathbf{E}(\overline{E})&
=\mathbf{E}\left(\frac{E_1+\ldots+E_k}{k}\right) =\frac{\mathbf{E}(E_1)+\ldots+\mathbf{E}(E_k)}{k} =\mathbf{E}(E_1)\\ 
\mathbf{E}(\overline{\Delta})&
=\mathbf{E}\left(\frac{\Delta_1+\ldots+\Delta_k}{k}\right) =\frac{\mathbf{E}(\Delta_1)+\ldots+\mathbf{E}(\Delta_k)}{k} =\mathbf{E}(\Delta_1)
\end{align*}
where the right-hand side expressions are expected values over the standard holdout experiment. Note that the linearity of expectation means that the result holds even if we sample the dataset only once and define the sets $A_i, B_i$ through a crossvalidation scheme. We only need to show that all splits are equivalent, which is obvious when $n$ is a multiple of $k$, as the training and test sets in each split then have the same sizes.

Now we need to understand what are the corresponding expectations:

* $\mathbf{E}(S_1)$ is the expected training error for the holdout scheme;
* $\mathbf{E}(E_1)$ is the expected test error for the holdout scheme;
* $\mathbf{E}(\Delta_1)$ is the expected optimism for the holdout scheme

where the expectation is taken over the choice of $n$-element datasets. These can be viewed as averages over an infinite series of experiments in which we first sample a dataset and then split it into a training set and a holdout set such that the test set is of size $n/k$.
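The infinite experiment series can be approximated by a finite Monte Carlo simulation. The sketch below is only an illustration under assumptions that are not part of the text: the data source ($y = 2x$ plus Gaussian noise), the learner (a least-squares line), the squared loss, and the names `holdout_experiment`, `runs`, `mean_S`, `mean_E` and `mean_Delta` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_experiment(n=50, k=5):
    """One holdout run on a fresh dataset: returns (S_1, E_1, Delta_1)."""
    # Hypothetical data source: y = 2x + standard Gaussian noise
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    n_test = n // k                       # the test set has size n/k
    x_test, y_test = x[:n_test], y[:n_test]
    x_train, y_train = x[n_test:], y[n_test:]
    # Hypothetical learner: least-squares line fitted on the training set
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    S = np.mean((y_train - (slope * x_train + intercept)) ** 2)  # training error
    E = np.mean((y_test - (slope * x_test + intercept)) ** 2)    # test error
    return S, E, E - S

# Averages over many independent runs approximate the three expectations
runs = np.array([holdout_experiment() for _ in range(2000)])
mean_S, mean_E, mean_Delta = runs.mean(axis=0)
print(f"E(S_1) ~ {mean_S:.3f}, E(E_1) ~ {mean_E:.3f}, E(Delta_1) ~ {mean_Delta:.3f}")
```

For this toy setting the average training error should come out below the average test error, so the estimated optimism is positive.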

### Variances

Let us now consider the variance of these sums under our naive assumption of $k$ independent holdout experiments. Under this assumption the variance of a sum is the sum of variances and we can easily establish
\begin{align*}
\mathbf{D}(\overline{S})&=
\mathbf{D}\left(\frac{S_1+\ldots+S_k}{k}\right) =\frac{\mathbf{D}(S_1)+\ldots+\mathbf{D}(S_k)}{k^2}=\frac{\mathbf{D}(S_1)}{k}\\ 
\mathbf{D}(\overline{E})&
=\mathbf{D}\left(\frac{E_1+\ldots+E_k}{k}\right) =\frac{\mathbf{D}(E_1)+\ldots+\mathbf{D}(E_k)}{k^2} =\frac{\mathbf{D}(E_1)}{k}\\ 
\mathbf{D}(\overline{\Delta})&
=\mathbf{D}\left(\frac{\Delta_1+\ldots+\Delta_k}{k}\right) =\frac{\mathbf{D}(\Delta_1)+\ldots+\mathbf{D}(\Delta_k)}{k^2} =\frac{\mathbf{D}(\Delta_1)}{k}
\end{align*}
where the right-hand side expressions are scaled variances over the standard holdout experiment. A more refined analysis would prove that the right-hand variances of test error and optimism match the variances of the experiment where the holdout set has size $n$.

A priori, the first simplification does not hold for the true crossvalidation experiment, as the summands are not independent and thus their correlations cannot be ignored during the simplification. The result can be re-established for test error under the assumption that the dataset is so large that all predictors $f_1,\ldots,f_k$ coincide on the dataset points. For large enough $n$, this assumption will be satisfied with negligible failure probability, but in this case the crossvalidation is pointless. The second simplification holds as long as $n$ is a multiple of $k$, as all splits are equivalent.
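The $1/k$ scaling under the naive independence assumption can be checked numerically. The following sketch simulates $k$ truly independent holdout runs many times and compares the variance of their mean with $\mathbf{D}(E_1)/k$. The data source, the least-squares learner and the names `holdout_test_error`, `var_single`, `var_of_mean` and `ratio` are assumptions made for illustration only. It says nothing about real crossvalidation, where the folds share data.

```python
import numpy as np

rng = np.random.default_rng(1)

def holdout_test_error(n=50, k=5):
    """Test error E_1 of one holdout run on a freshly sampled dataset."""
    # Hypothetical data source and learner: noisy line, least-squares fit
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    n_test = n // k
    slope, intercept = np.polyfit(x[n_test:], y[n_test:], deg=1)
    return np.mean((y[:n_test] - (slope * x[:n_test] + intercept)) ** 2)

k = 5
# Each row is one repetition of the naive experiment: k independent holdout runs
E = np.array([[holdout_test_error(k=k) for _ in range(k)] for _ in range(2000)])

var_single = E.var(ddof=1)                 # estimate of D(E_1), pooled over all runs
var_of_mean = E.mean(axis=1).var(ddof=1)   # estimate of the variance of the mean
ratio = var_of_mean / (var_single / k)
print(f"D(mean) / (D(E_1)/k) ~ {ratio:.2f}")
```

With independent runs the ratio should be close to one. Repeating the comparison with folds taken from a single dataset would show how much the correlations matter.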

### Consequences

The results established above show that expected values of $\overline{S}$, $\overline{E}$ and $\overline{\Delta}$ match with the values we want to measure:

* expected training error 
* expected test error
* expected optimism

but all of them are measured in the experiment where the training set is of size $(1-1/k)n$. Moreover, **under our fake assumption** all these estimates can be approximated with a normal distribution provided that we can find the variances on the right-hand side. This variance term determines the length of the confidence interval.

The result also indicates that estimates $\overline{S}$, $\overline{E}$ and $\overline{\Delta}$ are biased:
* $\overline{S}$ **underestimates** the training error as it **increases** with the size;
* $\overline{E}$ **overestimates** the test error as it **decreases** with the size;
* $\overline{\Delta}$ **overestimates** the optimism as it **decreases** with the size.

In practice the bias is small when $n$ is large enough so that adding more data yields only marginal gains.


## II. Missing links for computing confidence intervals

### Traditional method

We now need variance estimates for $\overline{S}$, $\overline{E}$ and $\overline{\Delta}$. Traditionally, these are computed from the observed terms in the corresponding arithmetic means. The corresponding formula for observations $x_1,\ldots,x_k$ is as follows:
\begin{align*}
\hat{\sigma}^2 =\frac{1}{k-1}\cdot\sum_{i=1}^k (x_i-\overline{x})^2\quad\text{where}\quad \overline{x} = \frac{1}{k}\cdot\sum_{i=1}^k x_i\enspace.
\end{align*}
Mechanical substitution gives us estimates for the variances of individual terms, and dividing them by $k$ yields the desired variance estimates for the means. In practice, these can be computed with `np.var` and the argument `ddof=1`, as the default `ddof=0` divides by $k$ instead of $k-1$.
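As a small worked example, the formula can be evaluated by hand and compared with `np.var`. The observations below are the ten fold-wise test errors reported in Section III. The variable names `x`, `sigma2_hat`, `var_of_mean` and `half_width` are chosen here for illustration.

```python
import numpy as np

# Fold-wise test errors from the simulated 10-fold experiment in Section III
x = np.array([0.5, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 0.5])
k = len(x)

# Unbiased sample variance written out term by term
x_bar = x.sum() / k
sigma2_hat = ((x - x_bar) ** 2).sum() / (k - 1)

# The same value via numpy; ddof=1 selects the 1/(k-1) normalisation
assert np.isclose(sigma2_hat, np.var(x, ddof=1))

# Variance of the mean under the naive independence assumption
var_of_mean = sigma2_hat / k
half_width = 2 * np.sqrt(var_of_mean)   # rough 95% half-width
print(f"{x_bar:.3f} ± {half_width:.3f}")  # prints "0.650 ± 0.153"
```

Omitting `ddof=1` would silently shrink the variance estimate by the factor $(k-1)/k$, which is noticeable for small $k$.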


### Alternative view on crossvalidation error

However, there are some hidden correspondences to explore.
For a predictor $g$, let $L_i(g)=L(g(\mathbf{x}_i), y_i)$ denote the loss on the $i$-th dataset element. Then it is easy to see that the average test error can be expressed as an average of individual losses:
\begin{align*}
\overline{E}=\frac{1}{k}\cdot\sum_{i=1}^k\frac{k}{n}\cdot\sum_{j=1}^{n/k} L_{(i-1)n/k+j}(f_{k-i+1})
=\frac{1}{n}\cdot\sum_{i=1}^n L_i(f_{k-\lfloor (i-1)k/n\rfloor})
\end{align*}
where the horrible indexing just shows that we predict each value using the predictor trained outside of the test split that contains the point.

This allows us to view crossvalidation as a way to assign an unrelated prediction to each data point $\mathbf{x}_i$ by training the model outside the fold that contains the point. After that we can compute the test error in the ordinary way by averaging the individual losses. Our independent split sampling assumption has an equivalent restatement: for each test fold $B_i$, we randomly sample the training set $A_i$ and use it to compute the predictor $f_i$. If we further strengthen the independence assumption and assume that we sample an individual training set for each point $\mathbf{x}_i$, then we have a set of independent risk estimates and we can just compute the variance over the set of observations
\begin{align*}
L_1(f_{k}),\ldots, L_n(f_{1})\enspace.
\end{align*}
This estimate is less precise, as it also ignores correlations of losses inside a block: a particular training split can be particularly good or bad for training.
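The point-wise view can be sketched in a few lines. The block below assumes the same layout as the simulated experiment in Section III (random 0/1 predictions, 20 points, 10 folds, the first predictor tested on the last block). The seed and the names `test_predictor`, `loss`, `se_pointwise` and `se_foldwise` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 10
fold_size = n // k
y = np.tile([0, 1], n // 2)                    # true labels
predictions = rng.integers(0, 2, size=(k, n))  # row i holds predictions of f_{i+1}

# 0-based row of the predictor whose test fold contains each point:
# the first block is tested by the last predictor and vice versa
points = np.arange(n)
test_predictor = k - 1 - points // fold_size

# One out-of-fold 0/1 loss per data point
loss = (y != predictions[test_predictor, points]).astype(float)

E_bar = loss.mean()
se_pointwise = np.sqrt(loss.var(ddof=1) / n)   # treats all n losses as independent

# Fold-wise estimate for comparison: variance over the k fold averages
E_folds = np.array([loss[test_predictor == i].mean() for i in range(k)])
se_foldwise = np.sqrt(E_folds.var(ddof=1) / k)

print(f"test error {E_bar:.3f}, pointwise se {se_pointwise:.3f}, foldwise se {se_foldwise:.3f}")
```

Both approaches average the same losses, so the point estimates coincide when all folds have equal size. Only the standard errors differ, because the point-wise version ignores the correlation inside each block.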


### Alternative view on the training error

Similarly, the average of training losses can be expressed as follows:
\begin{align*}
\overline{S}=\frac{1}{k}\cdot\sum_{i=1}^k\frac{k}{kn-n}\cdot\sum_{j\leq n(i-1)/k\ \text{or}\ j>ni/k} L_{j}(f_{k-i+1})
=\frac{1}{n}\cdot\sum_{i=1}^n \frac{1}{k-1}\cdot\sum_{j\neq k-\lfloor (i-1)k/n\rfloor}L_i(f_{j})
\end{align*}
where the horrible indexing just shows that for each point we compute the average loss over the $k-1$ predictors whose training sets contain that point.

This allows us to view crossvalidation as a way to assign $k-1$ related predictions to a data point $\mathbf{x}_i$ by training the model on the splits whose training sets contain the current fold $B_i$. After that we can compute the training error as a double average. Our independent split sampling assumption again has an equivalent restatement: for each fold $B_i$, we randomly augment the fold with an appropriate number of samples to get a training set $A_i$ and use it to compute the predictor $f_i$. We repeat this process $k-1$ times. If we further strengthen the independence assumption, then for each point we generate $k-1$ training sets that contain it and compute the corresponding training errors. Then we have a set of independent averages, one for each training point, and can compute the variance over the set of $n$ observations. This estimate is less precise, as it also ignores correlations of losses due to the fact that different points share the same training sets: a particular training split can be particularly good or bad for training.
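The double average can be sketched in the same way. The block below again assumes the layout of the simulated experiment in Section III (random 0/1 predictions, the first predictor tested on the last block). The seed and the names `all_losses`, `is_training`, `per_point` and `S_folds` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 10
fold_size = n // k
y = np.tile([0, 1], n // 2)                    # true labels
predictions = rng.integers(0, 2, size=(k, n))  # row i holds predictions of f_{i+1}

# 0-based row of the predictor that uses each point for testing
points = np.arange(n)
test_predictor = k - 1 - points // fold_size

# Loss of every predictor on every point, shape (k, n)
all_losses = (predictions != y).astype(float)

# is_training[i, m] is True when point m belongs to the training set of row i
is_training = np.arange(k)[:, None] != test_predictor[None, :]

# Inner average: mean loss of each point over the k-1 predictors trained on it
per_point = (all_losses * is_training).sum(axis=0) / (k - 1)

S_bar = per_point.mean()                            # outer average over points
se_pointwise = np.sqrt(per_point.var(ddof=1) / n)   # treats the n averages as independent

# Cross-check against the usual fold-wise training errors
S_folds = np.array([all_losses[i, is_training[i]].mean() for i in range(k)])
print(f"training error {S_bar:.3f} (fold-wise mean {S_folds.mean():.3f}), se {se_pointwise:.3f}")
```

As with the test error, the reordering of the sums leaves the point estimate unchanged when folds have equal size, and only the variance estimate depends on which view is taken.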

## III. Concrete computational recipes

For simplicity, let us simulate the predictions of $10$-fold crossvalidation by random sampling on a dataset of size 20. Let $y=(0,1,0,1,\ldots,0,1)$ be the true labels.


In [53]:
y = DataFrame([np.tile([0, 1], 10)], index=['target'])
predictions = DataFrame(np.random.randint(0, 2, size=(10, 20)), index=[f'fold {i+1}' for i in range(10)])

display(y)
display(predictions)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
target,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
fold 1,0,1,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,1,1
fold 2,1,1,1,1,0,1,1,1,0,0,0,1,0,1,0,1,1,0,1,1
fold 3,1,1,0,0,0,1,1,1,0,1,1,1,1,1,1,1,0,0,1,0
fold 4,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0
fold 5,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,0,0,1,0,1
fold 6,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,0,0,0,0,1
fold 7,1,0,0,0,1,0,1,1,0,0,1,1,0,0,1,0,0,0,1,0
fold 8,1,0,1,0,1,0,0,0,1,1,1,0,1,1,1,0,1,0,0,1
fold 9,1,0,1,0,0,0,0,0,1,1,0,1,0,1,1,1,0,1,0,0
fold 10,0,0,1,0,0,0,1,0,1,1,0,1,1,1,1,1,0,1,0,1


In [62]:
# Fold-wise training error, test error and optimism
S = np.full(10, np.nan)
E = np.full(10, np.nan)
Delta = np.full(10, np.nan)
for i in range(10):
    # Fold i+1 is tested on a block of two points counted from the end of the dataset
    test_fold = [20 - 2*i - 2, 20 - 2*i - 1]
    training_fold = [j for j in range(20) if j not in test_fold]
    E[i] = np.mean(y.iloc[0, test_fold] != predictions.iloc[i, test_fold])
    S[i] = np.mean(y.iloc[0, training_fold] != predictions.iloc[i, training_fold])
    Delta[i] = E[i] - S[i]

measurements = DataFrame({'training_error': S, 'test_error': E, 'optimism': Delta}, index=predictions.index)
mdisplay([measurements.reset_index()],['Crossvalidation telemetry'])

index,training_error,test_error,optimism
fold 1,0.555556,0.5,-0.055556
fold 2,0.277778,1.0,0.722222
fold 3,0.444444,0.5,0.055556
fold 4,0.611111,0.5,-0.111111
fold 5,0.444444,0.5,0.055556
fold 6,0.333333,0.5,0.166667
fold 7,0.722222,0.5,-0.222222
fold 8,0.722222,1.0,0.277778
fold 9,0.388889,1.0,0.611111
fold 10,0.444444,0.5,0.055556


In [92]:
print('Naive 95% confidence intervals')
print(f"Training error: {measurements['training_error'].mean():.3f} ± {2*np.std(measurements['training_error'], ddof=1)/np.sqrt(10):.3f}")
print(f"Test error:     {measurements['test_error'].mean():.3f} ± {2*np.std(measurements['test_error'], ddof=1)/np.sqrt(10):.3f}")
print(f"Optimism:       {measurements['optimism'].mean():.3f} ± {2*np.std(measurements['optimism'], ddof=1)/np.sqrt(10):.3f}")

Naive 95% confidence intervals
Training error: 0.494 ± 0.097
Test error:     0.650 ± 0.153
Optimism:       0.156 ± 0.192


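You can also use `scipy.stats.norm` and its `ppf` method. Note that the call below uses `np.std` with its default `ddof=0`, so the resulting interval is slightly narrower than the one obtained with `ddof=1` above.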

In [87]:
from scipy.stats import norm

In [93]:
norm.ppf(q=[0.025, 0.975], loc = measurements['training_error'].mean(), scale=np.std(measurements['training_error'])/np.sqrt(10))

array([0.40406177, 0.58482712])

In [90]:
?norm.ppf

Signature: norm.ppf(q, *args, **kwds)
Docstring:
Percent point function (inverse of `cdf`) at q of the given RV.

Parameters
----------
q : array_like
    lower tail probability
arg1, arg2, arg3,... : array_like
    The shape parameter(s) for the distribution (see docstring of the
    instance object for more information)
loc : array_like, optional
    location parameter (default=0)
scale : array_like, optional
    scale parameter (default=1)

Returns
-------
x : array_like
    quantile corresponding to the lower tail probability q.
File:      ~/Library/miniforge3/envs/huggingface/lib/python3.10/site-packages/scipy/stats/_distn_infrastructure.py
Type:      method