Time series cross validation (CV) is a little different from classical CV: folds are nested sets that grow forward in time, defining the training and validation sets used to evaluate the accuracy of trained ML models. As demonstrated by [Rob Hyndman](https://robjhyndman.com/hyndsight/tscv/):

![ncv image](2021-04-10-time-series-cross-validation-variance/ncv-img.png)

Typically though, nested CV (NCV) won’t check every possible fold, i.e., it won’t advance the cutoff by only a single point at a time. A practical variant is $\mathrm{NCV}(k)$, which after some base training period uses only $k$ evenly spread out indices to define folds which overlap less. In the [Prophet](https://facebook.github.io/prophet/) paper, NCV is used for a couple of folds, rather than all possible ones from time $0$ to the end-of-data $T$, on the basis of (1) computational savings and (2) less correlated errors.
 
> The main advantage of using fewer simulated dates \[...\] is that it economizes on computation while providing less correlated accuracy measurements.
 
The last statement got me thinking. Sure, the computational angle makes sense, but are the correlated accuracy measurements really a problem? If they are, it’d mean you’d want $\mathrm{NCV}(k)$ to be in some Goldilocks zone: not $\mathrm{NCV}(1)$, because of high variance from a single validation estimate, nor $\mathrm{NCV}(T)$, due to correlation possibly driving variance, but somewhere in between, where the best $k$ is dataset-dependent.
 
You might be thinking, of course we want $\mathrm{NCV}(T)$, "the more estimates of error, the better", right? Turns out this kind of thinking is only relevant for independent estimates of error. For regular cross validation, you can come up with examples where $\mathrm{CV}(k)$ decreases, hits an optimum, then increases as $k$ increases. The intuition there is that your ML estimator might have some algorithmic instability exposed only by the absence of groups of points, whereas something as extreme as $\mathrm{CV}(n)$ for $n$ data points (also known as leave-one-out cross validation, or LOOCV) only ever evaluates on validation sets of size $1$, so our evaluation is blind to multi-point instability.
 
What muddies the water a little is that $\mathrm{CV}(2)$ has the highest positive bias among CV estimates, because you’re halving your dataset, whereas $\mathrm{NCV}(1)$ likely has the lowest positive bias among NCV estimates, since it’s closest to the true forecast horizon (if the process we’re predicting is non-stationary). Suppose for this conversation we can set aside these bias issues by assuming our dataset is large enough or process is stationary enough that the error variance is our largest concern.
 
Before we jump in, we need to make sure that we’re clear about what we’re doing for model evaluation and why it makes sense. We have $Y_t$, the process we care about predicting, which is adapted to $\mathcal{F}_t$, some ambient filtration. We define on top of this the error process $\varepsilon_t$, also adapted, which is obtained by training a time series forecaster on the segment $-P:(t-H)$ and evaluating it on the prediction horizon $(t-H+1):t$, where non-positive indices $-P:0$ are some minimum training set prefix.
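
To make the indexing concrete, here is a minimal sketch of computing one value of $\varepsilon_t$. Everything named here is an assumption for illustration: `backtest_error` and `last_value_forecaster` are hypothetical helpers, the series is stored in a plain list with time index $s$ at position $s + P$, and mean absolute error stands in for whatever accuracy metric you actually use.

```python
from statistics import fmean


def last_value_forecaster(train, horizon):
    """Hypothetical stand-in model: repeat the last training value."""
    return [train[-1]] * horizon


def backtest_error(y, t, H, P, forecaster=last_value_forecaster):
    """Return eps_t: train on -P:(t-H), evaluate on (t-H+1):t.

    `y` stores time index s at list position s + P, so the first
    P + 1 positions hold the minimum training prefix -P:0.
    Mean absolute error is an assumed choice of metric.
    """
    split = t - H + P + 1  # list position of time index t-H+1
    if split < 1 or t + P >= len(y):
        raise ValueError("t, H, P do not fit inside the series")
    train, valid = y[:split], y[split:t + P + 1]
    preds = forecaster(train, H)
    return fmean([abs(p - v) for p, v in zip(preds, valid)])


# Toy series covering time indices -2..5 (P = 2).
series = [0, 1, 2, 3, 4, 5, 6, 7]
print(backtest_error(series, t=5, H=2, P=2))
```

Any forecaster with the same `(train, horizon)` signature can be swapped in; the point is only that the training window always ends $H$ steps before $t$.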
 
Given this setup, what do we hope to achieve with backtesting? The idea is that we can take our horizons in history and use that as an upper bound for what we expect our future performance to be. Without diving into too much time series statistical learning theory ([overview](https://danmackinlay.name/notebook/learning_theory_time_series.html)), as far as this blog post is concerned, we'll make that into a supermartingale assumption on the error:
 
$$\mathbb{E}\left[\varepsilon_s|\mathcal{F}_t\right] \le \varepsilon_t\,\,\forall s>t$$
 
This is of course fairly strong but, up to some data-dependent fudge factors based on $Y_t$’s mixing and concept class complexity, which I’ll not deal with for simplicity (since I want this discussion to focus on the variance term of NCV), it’s the basis for reasonable backtesting. So, given this setup, what can we say about correlated errors from NCV? Formally:
 
$$\mathrm{NCV}(k) = \frac{1}{k}\sum_{i\in[k]}\varepsilon_{t_i^{(k)}}\,\,,$$

where $t_i^{(k)}$ are from $\mathrm{linspace}(0, T, k)$ and $t_1^{(1)}=T$.
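
As a rough sketch of this definition, the snippet below averages a precomputed error process over the $k$ cutoffs. The helper names `linspace_cutoffs` and `ncv` are hypothetical, and rounding the evenly spaced points to integer time indices is my assumption, since the formula above does not say how to discretize them.

```python
def linspace_cutoffs(T, k):
    """k evenly spaced integer cutoffs in [0, T]; just [T] when k == 1.

    Rounding to the nearest integer index is an assumption.
    """
    if not 1 <= k <= T + 1:
        raise ValueError("need 1 <= k <= T + 1 to avoid repeated cutoffs")
    if k == 1:
        return [T]
    return [round(i * T / (k - 1)) for i in range(k)]


def ncv(errors, T, k):
    """NCV(k): average of eps_t over the k cutoffs.

    `errors[t]` is assumed to hold eps_t for t = 0..T.
    """
    cutoffs = linspace_cutoffs(T, k)
    return sum(errors[t] for t in cutoffs) / k


# Toy error process eps_t = t for t = 0..10.
toy_errors = list(range(11))
print(linspace_cutoffs(10, 3), ncv(toy_errors, 10, 3))
```

Note that $\mathrm{NCV}(1)$ only looks at the final horizon, while larger $k$ reaches further back toward the base training period.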
 
From the nestedness structure, along with a strong supermartingale assumption, a natural Azuma-like bound falls out ([Theorem 27](http://www.math.ucsd.edu/~fan/wp/concen.pdf), note they flip the meaning of super/submartingale), which for now I state without proof: (TODO something's messed up here w/o the averaging)
 
$$\mathbb{P}\left\{\mathbb{E}[\varepsilon_{*}] \ge \mathrm{NCV}(k) + \lambda\right\} \le \exp\left(\frac{-k\lambda^2}{\Theta\left(\sigma_{(k)}^2+\lambda\right)}\right)\,\,,$$
 
where $\varepsilon_{*}=\varepsilon_{T+H}$ is the true forecast error and $\sigma_{(k)}^2\ge \mathrm{var}\left(\varepsilon_{t_{i+1}^{(k)}}|\mathcal{F}_{t_{i}^{(k)}}\right)$ is some incremental variance bound uniform across time but likely variable in $k$. This bound gives us precisely what we're after: the probability that we're underestimating the error. Given boundedness assumptions on $\varepsilon_t$ (which we'd need for learning anyway), such uniform bounds must exist.
 
On an intuitive level, we should expect $\sigma_{(k)}$ to decrease in $k$. For instance, consider how variance grows as increments lengthen for a Wiener process $W_t$: $\mathrm{var}\left(W_{t_{i+1}^{(k)}}|\mathcal{F}_{t_{i}^{(k)}}\right) = t_{i+1}^{(k)}-t_i^{(k)}=\Theta\left(\frac{1}{k}\right)$. Growing uncertainty over time is a natural assumption to make, and in such cases we should make $k$ as large as possible.
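
A quick Monte Carlo sanity check of the Wiener example, as a sketch only: `increment_variance` is a hypothetical helper, the grid resolution, path count, and seed are arbitrary choices, and because Wiener increments are independent of the past, the unconditional variance of an increment stands in for the conditional one.

```python
import random
from statistics import pvariance


def increment_variance(k, T=1.0, steps_per_gap=20, n_paths=4000, seed=0):
    """Estimate var(W_{t_{i+1}} - W_{t_i}) for k evenly spaced cutoffs on [0, T].

    Each increment is built from `steps_per_gap` small Gaussian steps,
    mimicking a discretized Wiener path between two adjacent cutoffs.
    """
    if k < 2:
        raise ValueError("need at least two cutoffs to form an increment")
    rng = random.Random(seed)
    gap = T / (k - 1)           # spacing between adjacent cutoffs
    dt = gap / steps_per_gap    # fine-grid step inside one gap
    increments = [
        sum(rng.gauss(0.0, dt ** 0.5) for _ in range(steps_per_gap))
        for _ in range(n_paths)
    ]
    return pvariance(increments)


for k in (3, 5, 11):
    print(k, round(increment_variance(k), 3), "expected", 1.0 / (k - 1))
```

The estimates should track the gap length $T/(k-1)$, shrinking as $k$ grows.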
 
So, under admittedly a long list of assumptions, we get an interesting outcome that NCV is in a strong sense different from CV, in that model estimation variance strictly benefits from having more evaluation horizons, regardless of evaluation set overlap! And somewhat neatly the previous worry about correlation goes away because the nested structure admits a concentration bound for averages.
 
But enough talk. Let’s see if this is actually true in practice.

First, we evaluate on a pristine example: $Y_t = f(t) + W_t$ for a Wiener process $W_t$ and deterministic $f$ (say, $\sin$). We can inspect the error process $\varepsilon_t$ for an oracle forecaster which predicts $f(t) + Y_{T} - f(T)$ based on the most recent point $T$ given knowledge of $f$, as well as ARIMA (lookback = $x$) and an LSTM (hidden units = $y$).
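
A minimal sketch of the oracle part of this setup, under assumptions of my own: the helpers `simulate`, `oracle_forecast`, and `oracle_abs_errors` are hypothetical, the time step `dt` and noise scale `sigma` are arbitrary, and absolute error is an assumed metric. The ARIMA and LSTM forecasters are not covered here.

```python
import math
import random


def simulate(n, dt=0.1, sigma=1.0, f=math.sin, seed=0):
    """Return (ys, ws): Y_t = f(t) + W_t on a grid of n points spaced dt apart."""
    rng = random.Random(seed)
    w, ys, ws = 0.0, [], []
    for i in range(n):
        if i > 0:
            w += rng.gauss(0.0, sigma * math.sqrt(dt))  # Wiener increment
        ws.append(w)
        ys.append(f(i * dt) + w)
    return ys, ws


def oracle_forecast(ys, origin, horizon, dt=0.1, f=math.sin):
    """Predict f(t) + Y_T - f(T) for the `horizon` steps after grid index `origin`."""
    anchor = ys[origin] - f(origin * dt)  # equals W_T, known to the oracle
    return [f((origin + h) * dt) + anchor for h in range(1, horizon + 1)]


def oracle_abs_errors(ys, origin, horizon, dt=0.1, f=math.sin):
    """Absolute forecast errors over the horizon; each equals |W_t - W_T|."""
    preds = oracle_forecast(ys, origin, horizon, dt, f)
    return [abs(p - ys[origin + h]) for h, p in enumerate(preds, start=1)]


ys, ws = simulate(200)
print(oracle_abs_errors(ys, origin=100, horizon=5))
```

Because the oracle knows $f$, its error at time $t$ is exactly the Wiener increment since $T$, which makes it a useful noise floor for the other forecasters.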
 
TODO show error process for fixed x, y over time just examples 

TODO show something about P{ overestimate } across W_t realizations vs k
for oracle, arima, lstm

Then, we can really put the quality of NCV to a practical test. When it comes to predicting stocks (which I’m not saying one should do in this manner, but they’re very natural real adapted processes to use), we can inspect P{overestimate} for tuned realizations (todo: describe eval tuned on each stock, evaluate P{overestimate}).
 
P{overestimate} vs k plot for arima, lstm

TODO just ending model MAPE average across stocks?


Note the theory above doesn’t quite account for the robustness of NCV to the adaptivity of tuning hyperparameters, but I felt that the above experiment is more representative of how NCV might be used in practice. Given a high-probability error bound though, a small fixed number of hyperparameter tunings can be permissible with a simple union bound.
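
To spell out that union bound, here is a sketch under two assumptions that are mine rather than part of the theory above: the $m$ hyperparameter configurations are fixed in advance, and the same $\sigma_{(k)}$ bounds the incremental variance of every configuration's error process. Writing $\varepsilon_{*,j}$ and $\mathrm{NCV}_j(k)$ (notation introduced here) for the true error and the NCV estimate of configuration $j$:

$$\mathbb{P}\left\{\exists j\in[m]:\ \mathbb{E}[\varepsilon_{*,j}] \ge \mathrm{NCV}_j(k) + \lambda\right\} \le \sum_{j\in[m]}\mathbb{P}\left\{\mathbb{E}[\varepsilon_{*,j}] \ge \mathrm{NCV}_j(k) + \lambda\right\} \le m\exp\left(\frac{-k\lambda^2}{\Theta\left(\sigma_{(k)}^2+\lambda\right)}\right)\,\,.$$

So keeping the overall failure probability below some $\delta$ only requires $\frac{k\lambda^2}{\Theta\left(\sigma_{(k)}^2+\lambda\right)} \ge \log\frac{m}{\delta}$, i.e., the price of tuning is logarithmic in $m$.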

todo conclusion - reiterate “So, under admittedly strong assumptions, we get an interesting outcome that NCV is in a strong sense different from CV, in that model estimation variance strictly benefits from having more horizons, regardless of evaluation set overlap!” + practical