# Backtesting on Synthetic Data
## Generate Stochastic Process
* Characterize stochastic process using historical data
* Require initial parameters: $\{ P_{i, 0}, E_0 [P_{i, T_i}] \}$

# Backtesting through Cross-Validation
## Two Ways
* In a narrow sense: simulate the historical performance of an investment strategy
    * Also known as walk-forward
* In a broader sense: simulate scenarios that did not happen in the past
    * Less well known

## The Walk-Forward (WF) Method
* Advantages
    * Clear historical interpretation
    * History is a filtration
    * Embargo is not needed
* Disadvantages
    * Only a single scenario is tested, which can lead to overfitting
    * WF is not necessarily representative of future performance, which can introduce bias
    * The initial decisions are made on a smaller portion of the total sample
    
## Cross Validation (CV) Method
* Advantages
    * Tests k alternative scenarios
    * Decision is made on sets of equal size
    * No warm-up subset
* Disadvantages
    * Like WF, one and only one forecast is generated per observation
    * No clear historical interpretation
    * Leakage is possible
    
* The combinatorial purged cross-validation (CPCV) method
    * Split the dataset into N folds, then use every combination of k folds as the test set
    * Able to generate multiple backtest paths
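
As a rough illustration of the combinatorial step, the sketch below enumerates the train/test splits for N folds with k test folds and counts how many complete backtest paths they can form. The helper names `cpcv_splits` and `n_backtest_paths` are hypothetical, and purging and embargo are left out to keep the example short.

```python
from itertools import combinations
from math import comb


def cpcv_splits(n_groups, n_test_groups):
    """Yield (train_groups, test_groups) for every choice of test groups."""
    groups = range(n_groups)
    for test in combinations(groups, n_test_groups):
        train = tuple(g for g in groups if g not in test)
        yield train, test


def n_backtest_paths(n_groups, n_test_groups):
    # Every group is a test group in comb(N - 1, k - 1) splits,
    # so that many complete paths can be assembled.
    return comb(n_groups - 1, n_test_groups - 1)


splits = list(cpcv_splits(6, 2))
print(len(splits))             # 15 splits
print(n_backtest_paths(6, 2))  # 5 paths
```

With N = 6 and k = 2 every fold is tested 5 times, which is why a single dataset yields 5 paths instead of the single path of WF or plain CV.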
    
## Overfitting
* WF has high variance because a large portion of the decisions is based on a small portion of the dataset
* High variance leads to false discovery
* CV leads to lower variance due to multiple paths

# The Danger of Backtesting
## Typical Flaws
* Survivorship bias
    * Ignoring bankrupt and delisted companies
* Look-ahead bias
    * Using information that was not published at that moment
* Storytelling
    * Making up a story to explain random patterns
* Data mining and data snooping
    * Training the model on the testing set
* Transaction costs
    * The only way to be certain about transaction costs is to interact with the trading book
* Outliers
* Shorting
    * Requires finding a lender
    * The cost of lending and the amount available are generally unknown

## Overfitting
* Running multiple backtests leads to selection bias
* Do not tweak a model based on backtest results
* General recommendations to avoid overfitting
    * Develop models for entire asset classes or investment universe
    * Apply Bagging
    * Do not backtest until all research is complete
    * Record every backtest conducted on a dataset so that the probability of backtest overfitting may be estimated
    * Simulate scenarios rather than history
    * If the backtest fails to identify a profitable strategy, start from scratch
    
## Probability of Overfitting
1. Split data into S pieces
2. Construct combinations of the S pieces, using half of them as the in-sample set and the rest as the test set
3. Choose the optimal strategy in sample
4. Try the optimal strategy on the test data and estimate its relative rank
5. Repeat 3. and 4. for every combination to get a sample of relative ranks
6. Calculate the probability that the in-sample optimal strategy performs worse than the median out of sample
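
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, mean return stands in for whatever performance measure is actually used, and the pieces are contiguous blocks of equal length (trailing rows that do not fill a block are ignored).

```python
from itertools import combinations
from math import log
import random


def mean(xs):
    return sum(xs) / len(xs)


def probability_of_backtest_overfitting(returns, n_pieces):
    """returns[t][n] is the return of strategy n in period t; n_pieces must be even."""
    n_periods, n_strategies = len(returns), len(returns[0])
    size = n_periods // n_pieces
    pieces = [list(range(s * size, (s + 1) * size)) for s in range(n_pieces)]
    logits = []
    for in_sample in combinations(range(n_pieces), n_pieces // 2):
        is_rows = [t for s in in_sample for t in pieces[s]]
        oos_rows = [t for s in range(n_pieces) if s not in in_sample for t in pieces[s]]
        is_perf = [mean([returns[t][n] for t in is_rows]) for n in range(n_strategies)]
        oos_perf = [mean([returns[t][n] for t in oos_rows]) for n in range(n_strategies)]
        # Step 3: pick the strategy that looks best in sample.
        best = max(range(n_strategies), key=lambda n: is_perf[n])
        # Step 4: relative out-of-sample rank in (0, 1); 1/(N+1) is worst.
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_strategies + 1)
        logits.append(log(omega / (1 - omega)))
    # Step 6: share of splits where the in-sample winner is below the median.
    return sum(x < 0 for x in logits) / len(logits), logits


random.seed(0)
noise = [[random.gauss(0, 1) for _ in range(10)] for _ in range(120)]
pbo, logits = probability_of_backtest_overfitting(noise, n_pieces=6)
print(len(logits), pbo)  # 20 splits; pbo is a number between 0 and 1
```

A negative logit means the strategy selected in sample ranked below the median out of sample, so the returned probability is simply the fraction of negative logits.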

# Bet Sizing
* Need to learn test statistics

## Strategy Independent Approaches
* Use concurrency of bets and fit with mixture of two Gaussians
* Predict through ML algorithms

## Approach from Predicted Probabilities
* Normalize the predicted probability into a test statistic and map it through the Gaussian CDF
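
One common way to do this, sketched here as an assumption rather than as the only option, is to form $z = (p - 1/K) / \sqrt{p (1 - p)}$ for the probability $p$ of the predicted label among $K$ classes and set the bet size to $m = x (2 \Phi(z) - 1)$, where $x \in \{-1, 1\}$ is the side. The function name `bet_size` is hypothetical.

```python
from statistics import NormalDist


def bet_size(prob, side=1, n_classes=2):
    """Map the probability of the predicted label (0 < prob < 1) to a size in (-1, 1)."""
    z = (prob - 1 / n_classes) / (prob * (1 - prob)) ** 0.5
    return side * (2 * NormalDist().cdf(z) - 1)


for p in (0.5, 0.6, 0.75, 0.9):
    print(p, round(bet_size(p), 3))
```

A probability of 0.5 in the binary case gives a zero bet, and the size grows towards 1 as the prediction becomes more confident.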

## Dynamic Bet Sizes and Limit Prices
* Adjust bet size depending on changes of current prices and predictions
* The larger the gap between the current price and the prediction, the larger the bet size
* Set the bet size limit in advance

## Regularization
* Averaging active bets between t0 and t1
* Discretize bets

# Hyperparameter Tuning
## Search Method
* Grid search
* Random search: effective for searching in high dimensional space

## Loss Function
* F1: When using meta labeling
* Negative log loss: Better than accuracy because it is based on the predicted probabilities rather than the predicted labels

# Feature Importance
* Has to be dealt with before backtest
* Figure out what features are important instead of cycling backtests
* Substitution effects: importance is reduced when there are some other related features

### MDI (Mean Decrease Impurity)
* Tree based methods
* Measured by the decrease in impurity
* In-Sample method
* Every feature receives some importance

### MDA (Mean Decrease Accuracy)
* Out-Of-Sample method
* Works with any performance measure
* Works with any kind of estimator
* Measure the change in performance after shuffling a selected feature
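
A minimal sketch of the shuffling idea, assuming a model that has already been fitted and a single held-out set (a fuller version would repeat this over cross-validation folds). The names `mean_decrease_accuracy` and `fitted_predict` are hypothetical, and accuracy is used as the score only for simplicity.

```python
import random


def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Replace column j with its shuffled copy, leave other columns intact.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy held-out data: the label depends on feature 0 only; feature 1 is noise.
rng = random.Random(1)
X_test = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y_test = [int(row[0] > 0) for row in X_test]
fitted_predict = lambda row: int(row[0] > 0)  # stands in for a fitted model
imp = mean_decrease_accuracy(fitted_predict, X_test, y_test)
print(imp)  # large drop for feature 0, no drop for feature 1
```

Because the toy model ignores feature 1, shuffling it cannot change any prediction, so its importance is exactly zero.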

### SFI (Single Feature Importance)
* Free of substitution effects
    * Substitution effects can lead to wrong conclusions about importance, even though they are not a problem for prediction
* Works with any estimator and performance measure
* Fit on a single feature in each trial and evaluate the chosen score
* Out-Of-Sample method
* Fails to capture joint (combination) effects among features

### Orthogonal Features
* Alleviate the impact of linear substitution effects
* Require centering and rescaling before applying
* Does not contribute to overfitting because the labels are not used

# Cross Validation in Finance

## Leakage
* Occurs when data points overlap
* Using the same information multiple times leads to overfitting
* Two ways to avoid inflating the validation score due to leakage
    * Drop overlapping instances from the training data
    * Apply early stopping to base estimators and use bagging with sequential bootstrap
* Shuffling inflates the performance due to similarity among data instances

## Purged K-Fold CV
* Purging: Remove training instances whose labels overlap with the test set
* Embargo:
    * Remove training data immediately following the test data
    * Applied before purging

# Ensemble
## Three Sources of Errors
* Bias
    * Caused by unrealistic assumptions
    * Underfitting
* Variance
    * Caused by sensitivity to small changes in the training set
    * Overfitting
* Noise
    * Caused by the variance of observed values
    * Irreducible errors


## Boosting
* Reduce bias
* Resilient to overfitting
* Iterate the following steps:
    * Estimate the error of the current estimator (observations are weighted according to the errors of previous models)
    * Fit a weak learner to minimize the error
    * Add the newly learned model to the estimator with an optimal coefficient
* Gradient Boosting
    * Generic version of boosting, applicable to any differentiable loss function
    * Requires the analytical derivative of the loss function with respect to the estimator
    * Approximate the gradient with weak learners => each new weak learner is fit to predict the gradient
    
## Bootstrap
* Reduces variance
* Improves accuracy for classifiers that are better than chance, given large samples
* Unable to improve poorly performing classifiers
* The less correlated the estimators, the more the variance can be reduced
* In financial applications, in-bag samples are similar to out-of-bag samples => strongly correlated estimators
* Sequential bootstrap is an alternative solution
* Random Forest: Adds a second level of randomness by using a sampled subset of features

## Boosting with Samples
1. Subsample the training data according to certain weights
2. Train an estimator
3. If the learned estimator achieves accuracy better than the threshold, accept the estimator. Otherwise, go back to 1
4. Update the weights by giving more weight to misclassified observations
5. Repeat 1-4 until N estimators are obtained
6. Aggregate the estimators

## Bagging vs Boosting
* Boosting carries a risk of overfitting, which is the main concern in finance => Bagging is preferred in financial contexts
* Boosting is sequential while bagging can be parallelized
* Bagging can make a non-scalable algorithm scalable by using early stopping on the base estimators

## Stationarity
### Definition
* Strictly Stationary: The joint distribution of $X_{t_1}, \dots, X_{t_k}$ is the same as that of $X_{t_1 + \tau}, \dots, X_{t_k + \tau}$ for all $t_1, \dots, t_k, \tau$
* A random walk is not stationary, but its first difference is.

### MA Process
* $X_t = \beta_0 Z_t + \beta_1 Z_{t-1} + \cdots + \beta_q Z_{t - q}$
* Invertible: $Z_t = \sum_{j=0}^\infty \pi_j X_{t - j}$
* Let $B^j X_t = X_{t - j}$,
$$X_t = (\beta_0 + \beta_1 B + \cdots + \beta_q B^q) Z_t = \theta (B) Z_t$$
* If the roots of $\theta(B) = 0$ lie outside the unit circle, the process is invertible.
This is because, for each root $\lambda$,
$$(B - \lambda)^{-1} = -\frac{1}{\lambda} \left(1 - \frac{B}{\lambda}\right)^{-1}$$
and its expansion in powers of $B$ converges if $|\lambda| > 1$.

###  AR Process
* $X_t = \alpha_1 X_{t - 1} + \cdots + \alpha_p X_{t - p} + Z_t$
* $Z_t = (1 - \alpha_1 B - \cdots - \alpha_p B^p) X_t = \phi(B) X_t$
* In the same way as for MA, if the roots of $\phi(B) = 0$ lie outside the unit circle, $\phi(B)$ can be inverted to write $X_t$ as an MA($\infty$) process, which implies that the process is stationary.
* All roots lying outside the unit circle is a sufficient condition for stationarity
* The absence of a unit root is a necessary condition for stationarity

### Dickey-Fuller Test
* Reference, https://en.wikipedia.org/wiki/Dickey%E2%80%93Fuller_test
* Test if the AR process is stationary <=> test if a unit root is present
* For AR(1): $y_t = \rho y_{t-1} + u_t$ => $\Delta y_t = (\rho - 1) y_{t-1} + u_t = \delta y_{t-1} + u_t$
* Test whether $\delta = 0$, i.e., whether a unit root is present
* Tests:
    * $\Delta y_t = \delta y_{t-1} + u_t$
    * With drift: $\Delta y_t = \alpha_0 + \delta y_{t-1} + u_t$
    * With drift and deterministic time trend: $\Delta y_t = \alpha_0 + \alpha_1 t + \delta y_{t-1} + u_t$
* Compare the test statistic against the Dickey-Fuller table of critical values
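
For the simplest case (no drift, no trend) the test statistic can be computed by hand with a one-regressor OLS, as in the sketch below. The helper names are hypothetical and the simulated series are only illustrations; in practice a library routine with proper critical values would normally be used.

```python
import random


def dickey_fuller_stat(y):
    """t-statistic of delta in: diff(y)_t = delta * y_{t-1} + u_t."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(x * x for x in lagged)
    delta = sum(x * d for x, d in zip(lagged, diffs)) / sxx
    residuals = [d - delta * x for x, d in zip(lagged, diffs)]
    s2 = sum(e * e for e in residuals) / (len(diffs) - 1)  # residual variance
    return delta / (s2 / sxx) ** 0.5


def simulate_ar1(rho, n, seed):
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(rho * y[-1] + rng.gauss(0, 1))
    return y


stat_rw = dickey_fuller_stat(simulate_ar1(1.0, 500, seed=0))  # unit root
stat_ar = dickey_fuller_stat(simulate_ar1(0.5, 500, seed=0))  # stationary
print(stat_rw, stat_ar)
```

The statistic is compared with Dickey-Fuller critical values rather than the usual t table (roughly -1.95 at the 5% level for this no-constant case); the stationary series should produce a far more negative value than the random walk.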

### Augmented Dickey-Fuller Test
* Reference, https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test
* Test if a unit root is present
* $\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t- p + 1} + u_t$
* Null Hypothesis: $\gamma = 0$ => Non stationary
* Alternative Hypothesis: $\gamma < 0$ => Stationary
* Test statistic: $DF = \frac{\hat{\gamma}}{SE(\hat{\gamma})}$

### Integration of order d
* If $(1 - B)^d X_t$ is a stationary process, $X_t$ is integrated of order $d$
* Co-integration: A linear combination of the series has a lower order of integration
* Engle–Granger co-integration test:
    * Estimate the coefficient by OLS: $y_t - \hat{\beta} x_t = \hat{u}_t$
    * Check whether $\hat{u}_t$ is stationary with a Dickey-Fuller test


# Multiprocessors
## Parallelism
* Multithreading: Run more than one thread on the same core
* Multiprocessing: Run on more than one core
* The GIL (Global Interpreter Lock) lets only one thread execute Python bytecode at a time => Python parallelizes CPU-bound work through multiprocessing
* Atom: Single task
* Molecule: Subset of tasks
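
A small sketch of the atom/molecule idea: atoms are grouped into contiguous molecules and a function is applied to each molecule. The helper names are hypothetical. The mapping function is a parameter, so the built-in `map` runs everything sequentially while the `map` method of a `multiprocessing.Pool` would spread molecules over processes.

```python
def partition(atoms, n_molecules):
    """Split atoms into at most n_molecules contiguous, near-equal molecules."""
    if not atoms:
        return []
    n_molecules = min(n_molecules, len(atoms))
    bounds = [round(i * len(atoms) / n_molecules) for i in range(n_molecules + 1)]
    return [atoms[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def run_in_molecules(func, atoms, n_molecules, map_func=map):
    """Apply func to every molecule and flatten the per-molecule results."""
    results = map_func(func, partition(atoms, n_molecules))
    return [item for molecule_result in results for item in molecule_result]


def square_all(molecule):
    return [x * x for x in molecule]


print(partition(list(range(10)), 3))                  # [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
print(run_in_molecules(square_all, list(range(5)), 2))  # [0, 1, 4, 9, 16]
```

When a process pool is used, `func` has to be defined at module level so that it can be pickled and sent to the worker processes.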


# Sample Weights
## Overlapping Outcomes
* The spans of different labels can overlap each other => Not IID
* Consider the following properties:
    * Number of concurrent labels: how many labels use the data at a given time, i.e., $c_t = \sum_{i=1}^I 1_{t, i}$
    * Uniqueness: $u_{t, i} = 1_{t, i} / c_t$
    * Average uniqueness of a label: the uniqueness averaged over the label's lifespan, i.e., $\bar{u}_i = (\sum_t u_{t, i}) / (\sum_t 1_{t, i})$
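
These quantities are straightforward to compute once every label is described by the bars it spans. The sketch below assumes each label `i` uses bars `t0..t1` inclusive; the function name is hypothetical.

```python
def average_uniqueness(spans, n_bars):
    """spans[i] = (t0, t1): label i uses bars t0..t1 inclusive."""
    # c_t: number of labels that use bar t.
    concurrency = [0] * n_bars
    for t0, t1 in spans:
        for t in range(t0, t1 + 1):
            concurrency[t] += 1
    # u_bar_i: mean of 1 / c_t over the bars that label i uses.
    avg_u = []
    for t0, t1 in spans:
        u = [1 / concurrency[t] for t in range(t0, t1 + 1)]
        avg_u.append(sum(u) / len(u))
    return concurrency, avg_u


spans = [(0, 2), (2, 3), (4, 5)]
concurrency, avg_u = average_uniqueness(spans, n_bars=6)
print(concurrency)  # [1, 1, 2, 1, 1, 1]
print(avg_u)        # 5/6, 0.75 and 1.0
```

Only bar 2 is shared, so the first two labels lose some uniqueness while the third, which overlaps nothing, keeps an average uniqueness of 1.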
 
## Bagging Classifiers and Uniqueness
* Assuming IID leads to oversampling
* Let $\bar{u}_i$ be the average uniqueness of label $i$
* If $I^{-1} \sum_{i=1}^I \bar{u}_i \ll 1$,
    * In-bag observations are redundant to each other
    * In-bag observations are very similar to out-of-bag ones
* Solutions:
    * Drop overlapping outcomes => extreme loss of information
    * Lower maximum number of samples
* [Sequential Bootstrap](https://ac.els-cdn.com/S0378375897000414/1-s2.0-S0378375897000414-main.pdf?_tid=c9bde38a-9f30-43ca-abcf-98033c135788&acdnat=1529805932_882bdfdde2acf41470f32510b3a5a03c)
* Variation of Sequential Bootstrap
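    * Uniqueness of candidate $j$ given the draws so far $\phi^{(i)}$: $u_{t, j}^{(i+1)} = 1_{t, j} (1 + \sum_{k \in \phi^{(i)}} 1_{t, k})^{-1}$, averaged over the label's lifespan to give $\bar{u}_j^{(i+1)}$
    * Resample with adapted weights: $\delta_j^{(i)} = \bar{u}_j^{(i)} (\sum_{k=1}^I \bar{u}_k^{(i)})^{-1}$
    * After every draw, update the weights
    * The process is repeated until $I$ draws have been made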
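
A compact sketch of this resampling loop, assuming each label is given by the inclusive span of bars it uses. The function names are hypothetical and the implementation favors readability over speed.

```python
import random


def draw_probabilities(spans, concurrency):
    """Probability of drawing each label given the concurrency of past draws."""
    avg_u = []
    for t0, t1 in spans:
        # Uniqueness the label would have per bar if it were drawn next.
        u = [1 / (1 + concurrency[t]) for t in range(t0, t1 + 1)]
        avg_u.append(sum(u) / len(u))
    total = sum(avg_u)
    return [u / total for u in avg_u]


def sequential_bootstrap(spans, n_bars, n_draws=None, seed=0):
    rng = random.Random(seed)
    n_draws = len(spans) if n_draws is None else n_draws
    concurrency = [0] * n_bars  # how often each bar is used by drawn labels
    drawn = []
    for _ in range(n_draws):
        probs = draw_probabilities(spans, concurrency)
        j = rng.choices(range(len(spans)), weights=probs)[0]
        drawn.append(j)
        t0, t1 = spans[j]
        for t in range(t0, t1 + 1):
            concurrency[t] += 1  # update weights after every draw
    return drawn


spans = [(0, 2), (2, 3), (4, 5)]
# Probabilities after label 1 has been drawn once: 5/14, 3/14, 6/14.
print(draw_probabilities(spans, [0, 0, 1, 1, 0, 0]))
print(sequential_bootstrap(spans, n_bars=6))
```

After label 1 is drawn, it becomes the least likely candidate for the next draw, while label 2, which shares no bars with it, becomes the most likely.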
    
 
## Prioritized Sampling
* Return Attribution: Assign training weights according to the size of the returns
* Time Decay: Weights decay linearly as the data becomes older
* Class Weights: Prioritize classes differently

# Labeling

## Fixed-Time Horizon Method
* Fixed threshold
* Time bar
* Does not take changes of scale (volatility) into account
* To improve:
    * Label with a varying threshold that depends on the estimated $\sigma_t$
    * Use dollar or volume based bars

## The Triple-Barrier Method
* Three thresholds
    * Touching Upper: label 1
    * Touching Lower: label -1
    * Touching Vertical: 0 or sign of return
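
A minimal sketch of the labeling rule for a single long position, assuming the horizontal barriers are expressed as return thresholds relative to the entry price and the vertical barrier as a maximum holding period in bars. The function name and arguments are hypothetical.

```python
def triple_barrier_label(prices, entry, upper, lower, max_holding, zero_on_timeout=True):
    """Label the position opened at index `entry`.

    upper / lower are return thresholds such as 0.02 and -0.02.
    max_holding is the vertical barrier, in bars after entry.
    """
    p0 = prices[entry]
    last = min(entry + max_holding, len(prices) - 1)
    for t in range(entry + 1, last + 1):
        ret = prices[t] / p0 - 1
        if ret >= upper:
            return 1   # upper barrier touched first
        if ret <= lower:
            return -1  # lower barrier touched first
    if zero_on_timeout:
        return 0       # vertical barrier touched first
    ret = prices[last] / p0 - 1
    return (ret > 0) - (ret < 0)  # sign of the return at the vertical barrier


prices = [100, 101, 103, 99, 97]
print(triple_barrier_label(prices, 0, 0.02, -0.02, 4))  # 1
print(triple_barrier_label(prices, 2, 0.02, -0.02, 2))  # -1
```

In practice the thresholds would usually scale with an estimate of volatility at the entry time instead of being fixed numbers.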
    
## Side and Size Label
* Need to define the side to determine which barrier is profit taking and which is stop loss
* Need an algorithm to produce the side of transactions
* We do not want to learn both the side and the size with a single ML model
    * Primary model: Decide the side of your bets
    * Secondary model: Decide the size of your bets (Meta Labeling)
    
## Meta Labeling
* Similar to model stacking
* Helpful to achieve higher F1-scores
    * Primary Model: Determine the side with high recall
    * Secondary Model: Determine if you act or pass, focus on improving precision
* Powerful for four reasons
    1. White box
        * Allows you to build a model on top of a white box, such as a fundamental model
        * Helpful for quantamental firms
    2. Avoids overfitting
    3. More sophisticated model
        * E.g., allows you to build a model focusing on long or short positions 
    4. Able to divide decisions depending on the bet size
        * High accuracy on small bets and low accuracy on large bets ruins you
* You can add a meta-labeling layer to any primary model
* Drop under-populated labels
    * ML algorithms do not perform well on highly imbalanced classes
    * scikit-learn bug

# Financial Data Structures
## Types of Data
* Fundamental Data
* Market Data
* Analytics
* Alternative Data

## Bars
### Standard Bars
* Time Bars: Sampled at fixed time intervals
* Tick Bars: Sampled every fixed number of ticks
* Volume Bars: Sampled every fixed amount of volume
* Dollar Bars: Sampled every fixed amount of traded value

### Information-Driven Bars
* Use the following to estimate the amount of information
    * $b_t = \begin{cases}
        b_{t-1} & \text{if } \Delta p_t = 0\\
        \frac{\Delta p_t}{| \Delta p_t| } & \text{if } \Delta p_t \neq 0
      \end{cases} $
    * $T^* = \underset{T}{\arg\min} \{|\theta_T| \geq E[\theta_T]\}$ for a $\theta_T$ defined per bar type
* Tick Imbalanced Bars (TIB)
    * Takes into account how many times the price changes
    * $\theta_T = \sum_{t=1}^T b_t$
    * Looks at flow imbalance: if the imbalance is larger than expected, a new bar is formed
* Volume/Dollar Imbalanced Bars (VIB and DIB)
    * $\theta_T = \sum_{t=1}^T b_t v_t$
* Tick Runs Bars
    * Monitor the sequence of buys
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t, - \sum_{t|b_t=-1}^T b_t\}$
* Volume/Dollar Runs Bars
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t v_t, - \sum_{t|b_t=-1}^T b_t v_t\}$
    
## Dealing with Multi-Product Series
* Example cases:
    * Model spreads with changing weights
    * Basket of securities where dividends/coupons must be reinvested
    * Basket that must be rebalanced
    * Index whose constituents changed
    * Replace an expired/matured contract/security
* Goal is to transform any complex multi-product dataset into a single dataset that resembles a total-return ETF

### ETF Trick
* Problems when trading a spread of futures
    * The spread is characterized by a vector of weights that changes over time, and the spread itself may converge
    * Spreads can take negative values
    * Trading times will not align exactly for all constituents
* The goal is to model a basket of futures as if it were a single non-expiring cash product
    * Changes in the series reflect PnL
    * Strictly positive
    * Shortfall is taken into consideration
    
##### Method
For instrument $i = 1, \dots, I$ at bar $t = 1, \dots, T$
* $o_{i, t}$: Raw open price
* $p_{i, t}$: Raw close price
* $\phi_{i, t}$: Exchange rate to USD
* $v_{i, t}$: Volume
* $d_{i, t}$: Dividend or coupon
    
For allocation vector $\omega_t$ rebalanced on bars $B \subseteq \{1, \dots, T\}$,
* $h_{i, t} = \begin{cases}
        \frac{\omega_{i, t} K_t}{o_{i, t + 1} \phi_{i, t} \sum_i |\omega_{i, t}|} & \text{if } t \in B\\
        h_{i, t - 1} & \text{otherwise}
      \end{cases} $
      

## Sampling Features
* Not all ML algorithms are scalable, e.g., SVM
* ML works well when trained on relevant features
* Event-Based Sampling: Sample features at relevant events, e.g., a spike in volatility
    * CUSUM (Cumulative Sum) Filter: Sample when the cumulative deviation of the target value exceeds a defined threshold
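
A symmetric CUSUM filter can be sketched as follows, assuming the monitored quantity is the cumulative change of a series and the threshold is a fixed number. The function name is hypothetical.

```python
def cusum_events(values, threshold):
    """Indices where the cumulative up or down move exceeds the threshold."""
    events, s_pos, s_neg = [], 0.0, 0.0
    for t in range(1, len(values)):
        diff = values[t] - values[t - 1]
        s_pos = max(0.0, s_pos + diff)  # accumulated upward drift
        s_neg = min(0.0, s_neg + diff)  # accumulated downward drift
        if s_pos > threshold:
            s_pos = 0.0
            events.append(t)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(t)
    return events


values = [100, 101, 102, 103, 101, 99, 98, 100]
print(cusum_events(values, 2.5))  # [3, 5]
```

Because the accumulator resets after each event, a series hovering around a level does not trigger repeated samples; a new event needs a fresh run of the full threshold size.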