# Millions of Ways to Avoid Backtest Overfitting when Building a Strategy
<br>
<br>
$$\text{Chia-Yi Yen}$$

## I hope I can bring you something...

1. A reminder to keep the overfitting issue in mind when building a strategy
2. A new approach to quantifying backtest overfitting
3. A Python toolkit for financial investment analysis


### There are millions of ways to avoid backtest overfitting, but today I will focus on one new approach and compare it with the others.

## A New Way to Avoid Backtest Overfitting when building a strategy


## Recall overfitting
Overfitting may look like this:
<img src="fig/overfitting_looklike.jpg">
An overfitted model has little or no predictive power for the future.

# Let's look at the problem when building a strategy

## Does your strategy capture signal or noise?
* Natural phenomenon vs. random pattern
* Sometimes we fit only the noise, and that is when overfitting happens.

## Danger! Overfitting in backtest-based strategy selection

## What does backtest overfitting look like?
The in-sample performance looks perfect, while the out-of-sample performance is disappointing.
<img src="fig/is_backtest_overfitting_looklike.jpg">

<img src="fig/oos_backtest_overfitting_looklike.jpg">
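
The gap between the two pictures can be reproduced with pure noise. The sketch below is an illustration only: the sizes `T` and `N`, the seed, and the per-period Sharpe ratio used as the performance measure are assumptions, not taken from the slides.

```python
import numpy as np

def sharpe(returns):
    """Per-period Sharpe ratio (mean / std) of each column; not annualized."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
T, N = 500, 1000                          # hypothetical: 500 days, 1000 strategies
pnl = rng.normal(0.0, 0.01, size=(T, N))  # pure noise: no strategy has any edge

is_part, oos_part = pnl[: T // 2], pnl[T // 2:]
best = int(np.argmax(sharpe(is_part)))    # the strategy a backtest would select

is_sharpe = sharpe(is_part)[best]
oos_sharpe = sharpe(oos_part)[best]
print(f"best in-sample Sharpe:       {is_sharpe:.3f}")
print(f"same strategy out-of-sample: {oos_sharpe:.3f}")
```

The selected strategy looks good in-sample only because it is the luckiest of many noise series, so there is no reason for that luck to carry over.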

# Quantify the degree of Backtest Overfitting
## Reference
Bailey, David H. and Borwein, Jonathan M. and Lopez de Prado, Marcos and Zhu, Qiji Jim, The Probability of Backtest Overfitting (February 27, 2015). Journal of Computational Finance (Risk Journals), 2015, Forthcoming. Available at SSRN: http://ssrn.com/abstract=2326253 or http://dx.doi.org/10.2139/ssrn.2326253

* One of the talks at the R Finance conference 2014
* Applications released in Python
* Intuitive and simple to understand
* Easy to apply to any strategy selection process

## Definition of backtest overfitting
<img src="fig/definition_of_backtest_overfit.png">
<img src="fig/definition_of_probability_of_overfitting.png">


## Intuition of backtest overfitting
* Not overfitting: the optimal in-sample strategy should also outperform most strategies out-of-sample.
* Overfitting: the optimal in-sample strategy is more likely to underperform most strategies out-of-sample.
* Here we define "most strategies" as 50% of the strategies.

### Definition of Backtest Overfitting in short
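The optimal in-sample (IS) strategy performs worse than 50% of the strategies out-of-sample (OOS), where rank 1 is the worst strategy and rank $N$ is the best:
$$ \text{rank}(\text{optimal in-sample strategy}) < \frac{N}{2} \quad \text{when evaluated out-of-sample} $$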
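
A minimal sketch of this check in Python, assuming one performance number per strategy (for example a Sharpe ratio) for the IS and OOS periods. The function name `is_overfit` and the sample numbers are made up for illustration, and rank 1 is taken to be the worst OOS strategy.

```python
import numpy as np

def is_overfit(perf_is, perf_oos):
    """True if the best in-sample strategy ranks below N/2 out-of-sample.

    Rank 1 is the worst out-of-sample strategy and rank N the best
    (ties are not treated specially in this sketch).
    """
    perf_is = np.asarray(perf_is, dtype=float)
    perf_oos = np.asarray(perf_oos, dtype=float)
    n = perf_is.size
    best = int(np.argmax(perf_is))                          # optimal IS strategy
    rank_oos = 1 + int(np.sum(perf_oos < perf_oos[best]))   # its OOS rank
    return bool(rank_oos < n / 2)

# Hypothetical performance of 4 strategies; strategy 1 is the IS winner.
perf_is = [0.5, 1.8, 0.9, 1.1]
print(is_overfit(perf_is, [0.4, -0.2, 0.7, 0.3]))  # IS winner is the OOS loser
print(is_overfit(perf_is, [0.4, 0.9, 0.7, 0.3]))   # IS winner also wins OOS
```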

### Note
This is overfitting of the "strategy selection process", not of the "strategy model calibration".

## Implementation: CSCV
### The Combinatorially Symmetric Cross-Validation procedure
1. Collect a matrix M of profit & loss time series, one column per strategy
2. Split M by rows into S disjoint submatrices of equal size (S even)
    * for example, $M_s$ in [$M_1$, $M_2$, $M_3$, $M_4$] if S=4
3. Generate all combinations $C_s$ of the $M_s$, taking S/2 submatrices at a time
    * for example
        * training sets = [$M_1$ + $M_2$, $M_1$ + $M_3$, $M_1$ + $M_4$, ...] (in-sample)
        * testing sets = [$M_3$ + $M_4$, $M_2$ + $M_4$, $M_2$ + $M_3$, ...] (out-of-sample)
4. For each c in $C_s$,
    1. compute the in-sample and out-of-sample performance of every strategy
    2. find the optimal in-sample strategy and its corresponding out-of-sample performance
    3. get its relative rank $r_{oos}$ (between 0 and 1) among all the out-of-sample performances
    4. compute the logit $\lambda_c = \ln\left(\frac{r_{oos}}{1 - r_{oos}}\right)$
5. Compute the probability of backtest overfitting
    * PBO = $\frac{\#(\lambda_c < 0)}{\#(\lambda_c)}$
6. Hypothesis test
    * $H_0$: the strategy selection process is overfitting
    * $\lambda_c$ ~ asymptotically Normal distribution
    * If PBO < 0.05, reject $H_0$
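
Steps 1 to 5 can be sketched in a few lines of NumPy. This is a hedged illustration, not the linked notebook's code: the function names `cscv_logits` and `pbo`, the Sharpe ratio as performance measure, and the noise data are assumptions. The relative rank is computed as rank / (N + 1) so that it stays strictly between 0 and 1 and the logit is finite.

```python
from itertools import combinations

import numpy as np

def sharpe(returns):
    """Per-period Sharpe ratio of each column (assumed performance measure)."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)

def cscv_logits(M, S=8, metric=sharpe):
    """Return one logit per train/test split of the (T x N) P&L matrix M."""
    T, N = M.shape
    if S % 2 != 0:
        raise ValueError("S must be even")
    # Step 2: S disjoint row blocks of equal size (trailing rows are dropped).
    blocks = np.array_split(M[: T - T % S], S)
    logits = []
    # Step 3: every choice of S/2 blocks is a training set, the rest is testing.
    for train_idx in combinations(range(S), S // 2):
        test_idx = [s for s in range(S) if s not in train_idx]
        train = np.vstack([blocks[s] for s in train_idx])   # original order kept
        test = np.vstack([blocks[s] for s in test_idx])
        best = int(np.argmax(metric(train)))                 # step 4.2
        perf_oos = metric(test)
        rank = 1 + np.sum(perf_oos < perf_oos[best])         # 1 = worst, N = best
        r_oos = rank / (N + 1)                               # step 4.3
        logits.append(np.log(r_oos / (1 - r_oos)))           # step 4.4
    return np.array(logits)

def pbo(logits):
    """Step 5: share of splits whose logit is negative."""
    return float(np.mean(logits < 0))

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(1000, 50))   # hypothetical pure-noise strategies
logits = cscv_logits(noise, S=8)
print(len(logits), "splits, PBO =", round(pbo(logits), 3))
```

With S = 8 there are 70 splits. Because no random numbers are drawn inside `cscv_logits`, running it twice on the same matrix gives the same logits.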



## Thanks to the powerful scientific computing packages in Python
We can easily implement this procedure. See <a href="http://nbviewer.ipython.org/github/exilespacer/ProbabilityOfBacktestOverfitting/blob/master/CSCV.ipynb">CSCV.ipynb</a>.

## Why do we use the CSCV procedure? Why not others?
* Why not the "hold-out" approach?
    1. It does not work for small data sets.
        <p>It wastes a lot of precious data on the testing set, and there may not be enough data to apply the performance measure.</p>
    2. High variance across different hold-out sets
    3. Hold out the last period, or the earliest period?
    <p>If you hold out the last period, you lose the training data that you are most interested in; if you hold out the earliest period, you test on a period that you don't really care about.</p>
    4. It cannot prevent overfitting if you try enough times
    <p>You might hit a testing set that shows a similar pattern just by luck.</p>
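
The "high variance" point about hold-out can be seen on a single simulated series. In this sketch the return parameters, the seed, and the 100-day window are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical daily returns of ONE strategy with a true per-period Sharpe of 0.05.
returns = rng.normal(0.0005, 0.01, size=1000)

window = 100  # length of each candidate hold-out period
holdout_sharpes = np.array([
    returns[start:start + window].mean() / returns[start:start + window].std(ddof=1)
    for start in range(0, len(returns), window)
])
spread = float(holdout_sharpes.max() - holdout_sharpes.min())

print(holdout_sharpes.round(3))
print("spread between best and worst hold-out:", round(spread, 3))
```

The strategy is the same in every window, yet the measured Sharpe ratio depends heavily on which window happens to be held out.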
 

## Why do we use the CSCV procedure? Why not others? (cont.)
* Why not "K-fold cross validation"?
    1. K cannot be too large
    <p>Each partition must have enough data to compute a reliable performance measure.</p>
    2. K cannot be too small
    <p>Otherwise it behaves like the "hold-out" method.</p>
* Why not "leave-one-out cross validation"?
    1. A single testing data point might not be enough to compute the performance measure
    2. Training and testing sets of different sizes make it unfair to compare the performance in the training set with the performance in the testing set.
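
A small sketch of why the size of the testing set matters, assuming the Sharpe ratio is the performance measure and a hypothetical sample of T = 250 observations.

```python
import numpy as np

def sharpe(returns):
    """Sharpe ratio of one return series; NaN when it cannot be computed."""
    returns = np.asarray(returns, dtype=float)
    if returns.size < 2:          # a standard deviation needs at least 2 points
        return float("nan")
    return float(returns.mean() / returns.std(ddof=1))

print(sharpe([0.01]))             # leave-one-out test set: undefined
print(sharpe([0.01, 0.03]))       # two points: defined, but very unreliable

T = 250                           # hypothetical number of observations
for K in (2, 5, 50, 250):
    print(f"K={K:3d} -> {T // K:3d} observations per testing fold")
```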

## Why do we use the CSCV procedure? The advantages
* Training and testing sets of the same size
<p>Some performance measures are sensitive to the sample size. With CSCV, the training and testing sets are the same size, so their performance is compared on the same basis.</p>
<br>

* Symmetric
<p>Every training set is re-used as a testing set, so a bad out-of-sample performance does not come from how you split your data.</p>
<br>

* Preserves the time dependence that is characteristic of financial time series
<p>Unlike a randomly shuffled K-fold split, the CSCV procedure does not shuffle the performance series, so significant time patterns are preserved.</p>
<br>

## Why do we use the CSCV procedure? The advantages (cont.)
* Non-random logit distribution for the hypothesis test
<p>Unlike bootstrapping, you get the same result if you run the CSCV procedure twice, because each logit comes from a deterministic data partition.</p>
<br>

* Dispersion of the logits indicates the robustness of your strategy selection process
<p>If the distribution of logits has a small dispersion, your process is robust: you get very similar results across trials.</p>
<br>

* Model-free, non-parametric procedure for estimating the probability of backtest overfitting
<p>We don't need to know the trading rule or the distribution of the underlying. We only require the time series of backtest performance.</p>
<br>

## Limitations
1. The performance measure cannot be too sensitive to the order of the data.
    <p>This is because the data partitions are recombined.</p>
2. It assumes that all sample strategies have equal probability, so the logits are equally weighted.
    <p>You can build a different weighting scheme into the procedure.</p>
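
To illustrate the first limitation, the sketch below compares an order-insensitive measure (Sharpe ratio) with an order-sensitive one (maximum drawdown) on the same returns joined in two different orders. The two blocks and the additive drawdown definition are assumptions made up for this example.

```python
import numpy as np

def max_drawdown(returns):
    """Largest peak-to-trough drop of the cumulative (additive) P&L curve."""
    curve = np.concatenate([[0.0], np.cumsum(returns)])
    return float(np.max(np.maximum.accumulate(curve) - curve))

def sharpe(returns):
    return float(np.mean(returns) / np.std(returns, ddof=1))

block_a = np.array([0.03, -0.02])   # hypothetical data partition A
block_b = np.array([-0.02, 0.01])   # hypothetical data partition B

ab = np.concatenate([block_a, block_b])
ba = np.concatenate([block_b, block_a])

print("Sharpe:      ", round(sharpe(ab), 4), round(sharpe(ba), 4))
print("Max drawdown:", round(max_drawdown(ab), 4), round(max_drawdown(ba), 4))
```

The Sharpe ratio is identical for both orderings, while the maximum drawdown changes, so a drawdown-style measure computed on recombined partitions may not describe any path the strategy actually followed.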

## Some practical applications

### Think of a strategy like this:
Build a strategy to profit from a seasonal effect. For example, take Side=Buy (or Sell) on a stock at EntryDate = the 4th day of every month, and hold it for N_holds=3 days. Stop out when the loss reaches StopLoss=-3.

### The parameters
1. Side = Buy or Sell
2. EntryDate = 1 to 22
3. N_holds = 1 to 20
4. StopLoss = 0 to -10
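
To get a feeling for how many candidate strategies this parameter space contains, the grid can be enumerated with `itertools.product`. The integer step for StopLoss is an assumption, since the slides only give the range.

```python
from itertools import product

sides = ("Buy", "Sell")
entry_dates = range(1, 23)        # 1..22
holding_days = range(1, 21)       # 1..20
stop_losses = range(0, -11, -1)   # 0, -1, ..., -10 (integer steps are an assumption)

grid = [
    {"Side": side, "EntryDate": entry, "N_holds": hold, "StopLoss": stop}
    for side, entry, hold, stop in product(sides, entry_dates, holding_days, stop_losses)
]

print(len(grid))   # 2 * 22 * 20 * 11 = 9680 candidate strategies
print(grid[0])
```

Each dictionary is one candidate strategy; backtesting all of them and keeping the best one is exactly the kind of selection process whose overfitting CSCV tries to quantify.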
    
### You can find the code <a href="http://www.quantresearch.info/Software.htm">here</a>

### Pure noise: returns from a random walk
<img src="fig/overfit_backtest.png">

### Pure noise: returns from a random walk
<img src="fig/overfit_pbo.png">

### Signal: returns from a random walk, with the first 5 days of each month having the same variance
<img src="fig/nonoverfit_backtest.png">

### Signal: returns from a random walk, with the first 5 days of each month having the same variance
<img src="fig/nonoverfit_pbo.png">

# Thank you for your attention :)