# Chapter 8 - Evaluating Excess Returns

## Backtesting Best Practices

### Data Sourcing

- Definition and Interpretation
- Provenance
- Completeness
- Quality Assurance
- Point-in-time (PIT) vs. Restated Data
- Transformations
- Exploring Alternatives and Complements

### Research Process

- Data leakage:
  - Survivorship bias: some stocks that were in the universe at the beginning of the sample have since been removed and are missing from the present-day universe. Use a stable universe such as the Russell 1000/3000, MSCI benchmarks, or the investment universes of commercial factor models.
  - Financial statements: data becomes available only when the statements are released, not during the period they refer to.
  - Point-in-time data
  - Price adjustments: use raw prices as much as possible, and use adjusted prices only for return calculations. If prices are adjusted for stock splits, stocks with multiple splits will show very low prices in the distant past, which might indicate that the stock will do well in the future. This is a form of data leakage.
- Strategy Development
  - Have a theory
  - Reproducibility
  - Same settings in backtesting and in production
  - Calibrate the market impact model: use a market impact model such that backtested results match production
  - Include borrow costs: the costs of short-selling equities
  - Define the backtesting protocol and dataset beforehand

## The Backtesting Protocol

The well-known k-fold cross-validation has drawbacks: because of serial dependence, we cannot shuffle the order of the subsets, and data leakage is easy to introduce. Walk-forward validation solves these problems but has its own: little data is available at the beginning, and there is an opportunity cost between the end of training and production (the buffer is the validation set).
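A minimal sketch of walk-forward splitting with an expanding training window. The function name `walk_forward_splits` and its parameters are assumptions for illustration; real protocols may use rolling windows or different buffer conventions.

```python
def walk_forward_splits(n_obs, min_train, test_size):
    """Yield (train_range, test_range) pairs; training always precedes testing."""
    start = min_train
    while start + test_size <= n_obs:
        # Train on everything before `start`, evaluate on the next block.
        yield range(0, start), range(start, start + test_size)
        start += test_size

for train, test in walk_forward_splits(n_obs=10, min_train=4, test_size=3):
    print(list(train), list(test))
```

The first split trains on only `min_train` observations, which illustrates the low data volume at the beginning.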

## The Rademacher Anti-Serum (RAS)

### Setup

- Strategies: simulated strategy returns z-scored by their predicted volatility.
  $$x_{t,n} = \frac{w^T_{t,n}r_t}{\sqrt{w^T_{t,n}\Omega_tw_{t,n}}}$$
- Signals: information coefficient (IC), cosine of the angle between alpha vector and idiosyncratic returns.
  $$x_{t,n} = \frac{\alpha^T_{t,n}\epsilon_t}{||\alpha_{t,n}|| ||\epsilon_t||}$$
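A minimal NumPy sketch of the two definitions above for a single period $t$ and a single strategy/signal $n$. The function names and the toy inputs are assumptions for illustration.

```python
import numpy as np

def strategy_zscore(w, r, omega):
    """Portfolio return divided by its predicted volatility: w'r / sqrt(w' Omega w)."""
    return float(w @ r) / float(np.sqrt(w @ omega @ w))

def information_coefficient(alpha, eps):
    """Cosine of the angle between the alpha vector and idiosyncratic returns."""
    return float(alpha @ eps) / float(np.linalg.norm(alpha) * np.linalg.norm(eps))

# Toy inputs (invented): two assets, diagonal predicted covariance.
w = np.array([0.6, 0.4])
r = np.array([0.02, -0.01])
omega = np.diag([0.04, 0.01])
print(strategy_zscore(w, r, omega))
print(information_coefficient(np.array([1.0, -1.0]), r))
```

Stacking these values over $t$ and $n$ gives the $T \times N$ matrix that the rest of the section works with.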
The key assumption is that $x_t$ is iid; however, always check the autocorrelation. If the first few lags of the ACF are significant, group the observations into non-overlapping bins. Define
$$\hat{\theta}(X) = \frac{1}{T} \sum^{T}_{t=1}x_t$$
Rademacher complexity is defined as
$$\hat{R} = E_{\epsilon}\left(\sup_n\frac{\epsilon^Tx^n}{T}\right)$$
where $\epsilon$ is a $T$-dimensional random vector whose elements are iid and take the values $+1$ or $-1$ with probability 0.5 each.
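The expectation over $\epsilon$ has no closed form in general, so it is usually approximated by simulation. A minimal Monte Carlo sketch, assuming `X` is the $T \times N$ matrix of $x_{t,n}$; the function name, the number of draws and the seed are arbitrary choices for illustration.

```python
import numpy as np

def rademacher_complexity(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ max_n (eps' x^n) / T ] for a T x N matrix X."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    # Each row is one draw of the sign vector: iid +1/-1 with probability 0.5.
    eps = rng.choice([-1.0, 1.0], size=(n_draws, T))
    # (eps @ X) / T has shape (n_draws, N); take the best strategy per draw, then average.
    return float(((eps @ X) / T).max(axis=1).mean())

# Toy example: 20 pure-noise strategies over 500 periods.
X = np.random.default_rng(1).standard_normal((500, 20))
theta_hat = X.mean(axis=0)        # empirical performance of each strategy
print(rademacher_complexity(X))
```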
Interpretation:
- As the covariance with random noise: we measure the average, over random vectors, of the maximum covariance between the random vector and each strategy/signal. A high value means that each random noise vector can be represented well by some strategy/signal. In other words, a high value indicates randomness in the strategies/signals.
- As generalized two-way cross-validation: for each strategy/signal, the series can be randomly divided into two parts (one where the random value is $+1$ and the other where it is $-1$). Seen through this lens, $\hat{R}$ measures the average of the maximum distance between the two randomly sampled subsets. A high value suggests high inconsistency between the two samples, and thus instability and overfitting.
- As a measure of span over possible performances: similar to the first interpretation, a high value suggests that each random vector can be represented well by at least one strategy/signal. Since random vectors span the entire space, a high value suggests the strategies are all over the place.

Note that if strategies are highly correlated, $\hat{R}$ automatically discounts the effective number of strategies.
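A small simulation sketch of this discounting, using synthetic Gaussian data. It compares $N$ perfectly correlated copies of one strategy with $N$ independent strategies under the same sign draws; all names and sizes are illustrative assumptions.

```python
import numpy as np

def rademacher_hat(X, eps):
    """Average over sign draws (rows of eps) of max_n (eps' x^n) / T."""
    return float(((eps @ X) / X.shape[0]).max(axis=1).mean())

rng = np.random.default_rng(42)
T, N = 250, 50
eps = rng.choice([-1.0, 1.0], size=(2000, T))  # shared sign draws

base = rng.standard_normal((T, 1))
clones = np.repeat(base, N, axis=1)            # N perfectly correlated strategies
independent = rng.standard_normal((T, N))      # N unrelated strategies

r_single = rademacher_hat(base, eps)
r_clones = rademacher_hat(clones, eps)
r_indep = rademacher_hat(independent, eps)
print(r_single, r_clones, r_indep)
```

The clones add nothing to the maximum, so they cost the same as a single strategy, while the independent set is penalized much more heavily.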

### Lower bounds

The purpose of $\hat{R}$ is to find a lower bound on performance that holds with probability greater than $1-\delta$:
- For signals (information coefficient)
  $$\theta_n \geq \hat{\theta}_n - 2\hat{R} - 2\sqrt{\frac{\log(2/\delta)}{T}}$$
- For strategies (Sharpe ratio)
  $$\theta_n \geq \hat{\theta}_n - 2\hat{R} - 3\sqrt{\frac{2\log(2/\delta)}{T}} - \sqrt{\frac{2\log(2N/\delta)}{T}}$$
Interpretations:
- $2\hat{R}$: grows as the number of strategies/signals increases and as the correlation among them decreases
- The estimation error term: decreases as $T$ increases
  - An additional error term proportional to $\sqrt{\log N}$ is added to account for the tails in the unbounded case (Sharpe ratio).
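The two bounds can be evaluated directly. A minimal sketch, assuming `theta_hat` and `r_hat` are per-period quantities computed from the same $T \times N$ matrix; the function names and the default `delta` are illustrative choices.

```python
import math

def bounded_lower_bound(theta_hat, r_hat, T, delta=0.05):
    """Lower bound for bounded performance measures such as the IC."""
    return theta_hat - 2.0 * r_hat - 2.0 * math.sqrt(math.log(2.0 / delta) / T)

def sharpe_lower_bound(theta_hat, r_hat, T, N, delta=0.05):
    """Lower bound for the unbounded case (Sharpe ratio), with the extra log(N) term."""
    estimation = 3.0 * math.sqrt(2.0 * math.log(2.0 / delta) / T)
    tail = math.sqrt(2.0 * math.log(2.0 * N / delta) / T)
    return theta_hat - 2.0 * r_hat - estimation - tail

# Toy numbers (invented): the haircut shrinks as T grows.
for T in (500, 5000):
    print(T, bounded_lower_bound(0.05, 0.01, T), sharpe_lower_bound(0.05, 0.01, T, N=100))
```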
In practice, since the IC rarely exceeds 0.1, we need to shrink the estimation error term; otherwise the lower bound would be too loose:
$$\theta_n - \hat{\theta}_n \geq -a\hat{R} - b\sqrt{\frac{2\log(2/\delta)}{T}}$$
A good starting point would be: $b=0.04$
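A minimal sketch of the shrunk bound. The default `b=0.04` follows the starting point above; `a` is left as a required argument because these notes give no value for it, and the function name and default `delta` are illustrative assumptions.

```python
import math

def practical_lower_bound(theta_hat, r_hat, T, a, b=0.04, delta=0.05):
    """Heuristic bound: theta_hat - a * R_hat - b * sqrt(2 * log(2 / delta) / T)."""
    return theta_hat - a * r_hat - b * math.sqrt(2.0 * math.log(2.0 / delta) / T)

# Toy comparison (invented numbers): shrinking b tightens the bound.
print(practical_lower_bound(0.05, 0.01, T=1000, a=2.0, b=1.0))
print(practical_lower_bound(0.05, 0.01, T=1000, a=2.0))
```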