# Part 2 - Modelling
## Chapter 7 - Cross Validation in Finance

### 7.1 Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?

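Shuffling is a bad idea because financial samples are not IID and shuffling causes data leakage. The purpose of shuffling is to make every fold representative of the whole dataset, which only helps when the observations are independent. In finance, once the data is shuffled we can no longer control whether information leaks from the training set into the test set. This makes k-fold CV irrelevant: each test set has near-identical samples in the training set, which inflates the score and may lead to false feature discoveries.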
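As a rough illustration of the point above, the sketch below counts how many test samples have an immediately adjacent observation in the training set, with and without shuffling. The sample count and the helper `neighbours_in_train` are hypothetical and only serve the demonstration; adjacency is used here as a simple stand-in for serial dependence.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100  # hypothetical number of time-ordered observations
positions = np.arange(n_samples)


def neighbours_in_train(train_idx, test_idx):
    """Fraction of test samples whose previous or next observation is in the training set."""
    train = set(train_idx.tolist())
    hits = [(i - 1 in train) or (i + 1 in train) for i in test_idx.tolist()]
    return float(np.mean(hits))


adjacent = {}
for shuffle in (False, True):
    # random_state may only be set when shuffle=True
    kf = KFold(n_splits=10, shuffle=shuffle, random_state=0 if shuffle else None)
    fractions = [neighbours_in_train(tr, te) for tr, te in kf.split(positions)]
    adjacent[shuffle] = float(np.mean(fractions))
    print(f"shuffle={shuffle}: {adjacent[shuffle]:.2f} of test samples touch a training sample")
```

Without shuffling only the samples at the edges of each fold touch the training set; with shuffling almost every test sample does.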

### 7.2 Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.

In [1]:
import numpy as np
import pandas as pd

from afml.modelling.ensamble_methods import RandomForestClassifier
from afml.data_analyst.financial_data_structures import get_t_events
from afml.data_analyst.labels import get_vertical_next_day, get_events_triple_barrier, get_bins, get_daily_vol
from afml.data_analyst.fractionally_differentiated_features import frac_diff_ffd

btc_dollar = pd.read_parquet('../data/dollar-bars-100000000-True.parquet')
btc_dollar = btc_dollar[btc_dollar.index > pd.Timestamp('2023-01-01').timestamp() * 1000]

cum_log_prices = np.log(btc_dollar['close']).cumsum()
fracdiff_series = frac_diff_ffd(pd.DataFrame(cum_log_prices), 2, tau=1e-5)
t0 = get_t_events(fracdiff_series.index.values, fracdiff_series['close'].values,
                  float(4 * np.std(fracdiff_series.values)))

daily_vol = get_daily_vol(btc_dollar)
ms_a_day = 24 * 60 * 60 * 1000
num_days = 5
t1 = get_vertical_next_day(btc_dollar, t0, num_days * ms_a_day)

events = get_events_triple_barrier(
    close=btc_dollar.loc[fracdiff_series.index, 'close'],
    t0=t0,
    tp_scale=2,
    sl_scale=2,
    target=daily_vol,
    min_return=0,
    t1=t1,
)
labels = get_bins(events, btc_dollar['close'], t1)

X = pd.DataFrame({
    'close': fracdiff_series['close'].loc[labels.index],
    'close_lag_1': fracdiff_series['close'].shift(1).loc[labels.index],
    'close_lag_2': fracdiff_series['close'].shift(2).loc[labels.index],
},
    index=labels.index
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

#### 7.2 (a) Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.

In [2]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

scores = []
kf = KFold(n_splits=10, shuffle=False)  # 10-fold CV without shuffling

for train_index, test_index in kf.split(X.index):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = labels['bin'].iloc[train_index], labels['bin'].iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = model.predict(X_test)
    
    # Calculate and store the performance metric
    score = accuracy_score(y_test, y_pred)  # Replace with your metric
    scores.append(score)

# Average performance across the 10 folds
mean_performance = np.mean(scores)
mean_performance

np.float64(0.4878676470588236)

#### 7.2 (b) Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.

In [3]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

scores = []
kf = KFold(n_splits=10, shuffle=True)  # 10-fold CV with shuffling

for train_index, test_index in kf.split(X.index):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = labels['bin'].iloc[train_index], labels['bin'].iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = model.predict(X_test)
    
    # Calculate and store the performance metric
    score = accuracy_score(y_test, y_pred)  # Replace with your metric
    scores.append(score)

# Average performance across the 10 folds
mean_performance = np.mean(scores)
mean_performance

np.float64(0.4885152345251826)

#### 7.2 (c) Why are both results so different?
In this run the results are quite similar (0.4879 without shuffling, 0.4885 with shuffling) because I barely use any features. In general, the results with shuffled data are better because of high leakage: observations that overlap with the test data also "exist" in the training data.

#### 7.2 (d) How does shuffling leak information?
Because the samples are not IID. Each label takes T ticks to resolve, from the moment the observation is sampled until the label is decided, so the labels of neighbouring observations overlap in time. Shuffling scatters these overlapping observations across the training set and the test set.
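The overlap can be measured directly. The sketch below uses synthetic events (one observation per day, each label resolved five days later, both assumptions) and a hypothetical helper `overlapping_train_fraction` to compute the share of training labels whose time span intersects a test label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical events: index = label start time, value = label end time.
start = pd.date_range("2023-01-01", periods=200, freq="D")
t1_demo = pd.Series(start + pd.Timedelta(days=5), index=start)


def overlapping_train_fraction(t1, train_idx, test_idx):
    """Share of training labels whose [start, end] span intersects any test label span."""
    starts, ends = t1.index.values, t1.values
    test_starts, test_ends = starts[test_idx], ends[test_idx]
    overlaps = [
        np.any((starts[i] <= test_ends) & (ends[i] >= test_starts))
        for i in train_idx
    ]
    return float(np.mean(overlaps))


overlap = {}
for shuffle in (False, True):
    kf = KFold(n_splits=10, shuffle=shuffle, random_state=0 if shuffle else None)
    per_fold = [overlapping_train_fraction(t1_demo, tr, te) for tr, te in kf.split(t1_demo)]
    overlap[shuffle] = float(np.mean(per_fold))
    print(f"shuffle={shuffle}: {overlap[shuffle]:.1%} of training labels overlap a test label")
```

With contiguous folds only the labels next to the fold boundaries overlap, while shuffling makes a large share of the training labels overlap the test set.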

### 7.3 Take the same pair of matrices (X, y) you used in exercise 2.
#### 7.3 (a) Derive the performance from a 10-fold purged CV of an RF on (X, y), with 1% embargo.

In [16]:
import importlib
from afml.modelling import cross_validation

importlib.reload(cross_validation)
from afml.modelling.cross_validation import PurgedKFold

scores = []
kf = PurgedKFold(n_splits=10, t1=t1, pct_embargo=0.01)

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = labels['bin'].iloc[train_index], labels['bin'].iloc[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Predict on the validation fold
    y_pred = model.predict(X_test)

    # Calculate and store the performance metric
    score = accuracy_score(y_test, y_pred)  # Replace with your metric
    scores.append(score)

# Average performance across the 10 folds
mean_performance = np.mean(scores)
mean_performance

np.float64(0.4885416666666666)

#### 7.3 (b) Why is the performance lower?
Again, there is too little information in this dataset to get a meaningful result (the purged score of 0.4885 is not actually lower than the scores in 7.2). In general, though, this procedure should give a lower result because there is less data leakage: thanks to purging and the embargo, samples do not leak between the training set and the test set, and neighbouring samples on either side of the split are less correlated.
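A minimal sketch of what purging and an embargo do for one contiguous test fold. This is not the `afml` `PurgedKFold` implementation used above; the function `purged_train_positions` and the synthetic series `t1_demo` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def purged_train_positions(t1, test_pos, pct_embargo=0.01):
    """Training positions left after purging and embargoing one contiguous test fold.

    t1: Series sorted by index; index = label start time, value = label end time.
    test_pos: contiguous integer positions of the test fold.
    """
    n = len(t1)
    starts, ends = t1.index.values, t1.values
    test_start = starts[test_pos[0]]
    test_end = ends[test_pos].max()
    embargo = int(n * pct_embargo)

    keep = np.ones(n, dtype=bool)
    keep[test_pos] = False
    # Purge: drop training labels whose span overlaps the test period.
    keep &= ~((starts <= test_end) & (ends >= test_start))
    # Embargo: also drop the first `embargo` observations starting after the test period.
    after = np.flatnonzero(starts > test_end)
    keep[after[:embargo]] = False
    return np.flatnonzero(keep)


# Assumed data: daily observations, each label resolved five days later.
start = pd.date_range("2023-01-01", periods=200, freq="D")
t1_demo = pd.Series(start + pd.Timedelta(days=5), index=start)

train_pos = purged_train_positions(t1_demo, np.arange(100, 120), pct_embargo=0.01)
print(len(train_pos), "training samples remain out of", len(t1_demo) - 20)
```

The samples removed on both sides of the test fold are exactly the ones that would otherwise share information with it.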

#### 7.3 (c) Why is this result more realistic?
The test sets no longer contain samples that are more likely to be predicted correctly only because they are correlated with samples in the training set.

### 7.4 In this chapter we have focused on one reason why k-fold CV fails in financial applications, namely the fact that some information from the testing set leaks into the training set. Can you think of a second reason for CV’s failure?

Because financial data is non-stationary and highly correlated in time, even with purging and an embargo we may see a good k-fold CV result only because the training folds $S_{k_i}$ and $S_{k_{i+2}}$ share similar market conditions with the test fold $S_{k_{i+1}}$ that sits between them.

### 7.5 Suppose you try one thousand configurations of the same investment strategy, and perform a CV on each of them. Some results are guaranteed to look good, just by sheer luck. If you only publish those positive results, and hide the rest, your audience will not be able to deduce that these results are false positives, a statistical fluke. This phenomenon is called “selection bias.”
#### 7.5 (a) Can you imagine one procedure to prevent this?
1. Divide the data into a training set and a test set. Run the cross-validation only on the training set and use the test set only for the final test.
2. Run cross-validation with different values of k to see whether the results are similar.
3. Overfitting often shows up when a small change in the parameters changes the results dramatically. Publishing results for neighbouring parameter values helps show whether the results are stable.
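Items 1 and 3 of the list above can be sketched as follows. The data is synthetic noise, the classifier is scikit-learn's `RandomForestClassifier` rather than the `afml` one used earlier, and the `max_depth` grid is an arbitrary assumption chosen only to represent neighbouring configurations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))      # hypothetical features
y_demo = rng.integers(0, 2, size=500)   # hypothetical labels with no real signal

# Item 1: hold out the last 20% (time-ordered) and never touch it during CV.
split = int(len(X_demo) * 0.8)
X_dev, y_dev = X_demo[:split], y_demo[:split]
X_hold, y_hold = X_demo[split:], y_demo[split:]

# Item 3: score neighbouring configurations to see whether results are stable.
cv_by_depth = {}
for depth in (2, 3, 4):
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    cv = KFold(n_splits=5, shuffle=False)
    cv_by_depth[depth] = cross_val_score(clf, X_dev, y_dev, cv=cv).mean()

# Final step: evaluate the single chosen configuration once on the hold-out set.
best_depth = max(cv_by_depth, key=cv_by_depth.get)
final = RandomForestClassifier(n_estimators=50, max_depth=best_depth, random_state=0)
holdout_accuracy = final.fit(X_dev, y_dev).score(X_hold, y_hold)

print(cv_by_depth)
print(f"hold-out accuracy for max_depth={best_depth}: {holdout_accuracy:.3f}")
```

On real financial data the plain `KFold` would be replaced by a purged splitter; it is used here only to keep the sketch self-contained.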


#### 7.5 (b) What if we split the dataset in three sets: training, validation, and testing? The validation set is used to evaluate the trained parameters, and the testing is run only on the one configuration chosen in the validation phase. In what case does this procedure still fail?
If failing means that the CV shows a good result but the testing shows a bad one, it probably means the model is overfit.
If failing means that the results look good but the model may still fail in the future, the likely cause is a non-stationary process: after a regime change the model no longer predicts correctly.

#### 7.5 (c) What is the key to avoiding selection bias?
* Cross-validation with small K
* Show results with different configurations
* Show results with similar configurations
* Check the results on a held-out test set
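The effect that these points guard against can be reproduced with a small simulation. The sketch below assumes 1000 configurations whose daily returns are pure noise (zero true edge, 1% daily volatility, one year of data, all hypothetical), selects the best in-sample Sharpe ratio, and then looks at the same configuration on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_days = 1000, 252

# Hypothetical daily returns of 1000 configurations with zero true edge.
in_sample = rng.normal(0.0, 0.01, size=(n_configs, n_days))
out_of_sample = rng.normal(0.0, 0.01, size=(n_configs, n_days))


def annualised_sharpe(returns):
    """Annualised Sharpe ratio per row, assuming 252 trading days and a zero risk-free rate."""
    return returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)


sr_in = annualised_sharpe(in_sample)
sr_out = annualised_sharpe(out_of_sample)
best = int(np.argmax(sr_in))

print(f"configurations tried: {n_configs}")
print(f"best in-sample Sharpe: {sr_in[best]:.2f}")
print(f"average in-sample Sharpe: {sr_in.mean():.2f}")
print(f"same configuration out of sample: {sr_out[best]:.2f}")
```

The best in-sample Sharpe ratio looks attractive even though no configuration has any edge, which is why the number of configurations tried should be reported together with the selected result.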