# Part 2 - Modelling
## Chapter 7 - Cross Validation in Finance

### 7.1 Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?

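Shuffling is generally a bad idea in finance because the samples are not IID: observations are serially correlated and labels often span overlapping time intervals. The purpose of shuffling is to randomise which observations land in each fold, which is only harmless when the samples are independent. On financial data, shuffling scatters neighbouring observations across the training and test sets, so every test set has very similar samples in the training set. This is data leakage, and once the data is shuffled we can no longer monitor whether information leaks from the training set into the test set. It makes k-fold CV irrelevant, because the score no longer measures out-of-sample performance, and it may lead to false feature discoveries.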
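
A small sketch to illustrate the point, on positions only (no real data). It measures how often a test sample has its immediate time neighbour in the training set. The helper `neighbour_leak_rate` and the sample size of 1000 are assumptions made for this illustration, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold


def neighbour_leak_rate(n_samples, shuffle, n_splits=10, seed=0):
    """Fraction of test samples whose neighbour (t-1 or t+1) is in the training set."""
    # random_state may only be set when shuffle=True
    kf = KFold(n_splits=n_splits, shuffle=shuffle, random_state=seed if shuffle else None)
    positions = np.arange(n_samples)
    leaked, total = 0, 0
    for train_idx, test_idx in kf.split(positions):
        in_train = np.zeros(n_samples, dtype=bool)
        in_train[train_idx] = True
        # Clipping maps an out-of-range neighbour onto the test sample itself,
        # which is never in the training set, so edges are handled safely.
        prev_in_train = in_train[np.clip(test_idx - 1, 0, n_samples - 1)]
        next_in_train = in_train[np.clip(test_idx + 1, 0, n_samples - 1)]
        leaked += np.sum(prev_in_train | next_in_train)
        total += len(test_idx)
    return leaked / total


contiguous = neighbour_leak_rate(1000, shuffle=False)
shuffled = neighbour_leak_rate(1000, shuffle=True)
print(f"no shuffle: {contiguous:.3f}, shuffle: {shuffled:.3f}")
```

Without shuffling, only the samples at the edges of each test block touch the training set. With shuffling, almost every test sample sits next to a training sample, which is where serially correlated features leak.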

### 7.2 Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.

In [None]:
from sklearn.model_selection import train_test_split
from afml.modelling.ensamble_methods import RandomForestClassifier
from afml.data_analyst.financial_data_structures import get_t_events
from afml.data_analyst.labels import get_vertical_next_day, get_events_triple_barrier, get_bins, get_daily_vol
import numpy as np
import pandas as pd
from afml.data_analyst.sample_weights import sample_w_by_return, num_co_events
from afml.data_analyst.fractionally_differentiated_features import frac_diff_ffd

btc_dollar = pd.read_parquet('../data/dollar-bars-100000000-True.parquet')
btc_dollar = btc_dollar[btc_dollar.index > pd.Timestamp('2023-01-01').timestamp() * 1000]
btc_dollar = btc_dollar.iloc[:-2]

cum_log_prices = np.log(btc_dollar['close']).cumsum()
fracdiff_series = frac_diff_ffd(pd.DataFrame(cum_log_prices), 2, tau=1e-5)
t0 = get_t_events(fracdiff_series.index.values, fracdiff_series['close'].values,
                  float(4 * np.std(fracdiff_series.values)))

daily_vol = get_daily_vol(btc_dollar)
ms_a_day = 24 * 60 * 60 * 1000
num_days = 5
t1 = get_vertical_next_day(btc_dollar, t0, num_days * ms_a_day)

events = get_events_triple_barrier(
    close=btc_dollar.loc[fracdiff_series.index, 'close'],
    t0=t0,
    tp_scale=2,
    sl_scale=2,
    target=daily_vol,
    min_return=0,
    t1=t1,
)
labels = get_bins(events, btc_dollar['close'], t1)

X = pd.DataFrame({
    'close': fracdiff_series['close'].loc[labels.index],
    'close_lag_1': fracdiff_series['close'].shift(1).loc[labels.index],
    'close_lag_2': fracdiff_series['close'].shift(2).loc[labels.index],
},
    index=labels.index
)


model = RandomForestClassifier(n_estimators=100, random_state=42)

# to avoid leakage
# t1_train = t1[t1.index < X_test.index[0]]
# 
# w_return = sample_w_by_return(
#         t1=t1_train,
#         co_events=num_co_events(X_train.index, t1_train),
#         close=btc_dollar['close']
#     )

#### 7.2 (a) Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.

In [1]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

scores = []
kf = KFold(n_splits=10, shuffle=False)  # 10-fold CV without shuffling

for train_index, test_index in kf.split(X):
    # KFold yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = labels['bin'].iloc[train_index], labels['bin'].iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = model.predict(X_test)
    
    # Calculate and store the performance metric
    score = accuracy_score(y_test, y_pred)  # Replace with your metric
    scores.append(score)

# Average performance across the 10 folds
mean_performance = np.mean(scores)
mean_performance

#### 7.2 (b) Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

scores = []
kf = KFold(n_splits=10, shuffle=True)  # 10-fold CV with shuffling

for train_index, test_index in kf.split(X):
    # KFold yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = labels['bin'].iloc[train_index], labels['bin'].iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = model.predict(X_test)
    
    # Calculate and store the performance metric
    score = accuracy_score(y_test, y_pred)  # Replace with your metric
    scores.append(score)

# Average performance across the 10 folds
mean_performance = np.mean(scores)
mean_performance

#### 7.2 (c) Why are both results so different?
The results are quite similar because I barely use any features! Still, the result with shuffled data is better because of heavy leakage: samples in the test set effectively also "exist" in the training set.

#### 7.2 (d) How does shuffling leak information?
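Because the samples are not IID. Each label is determined over a window: from the moment we sample an event until we decide its label, up to T ticks pass. These windows often overlap, so after shuffling it is easy for a training sample's window to overlap the window of a test sample, and the training set then contains information that was used to build the test labels.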
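
A rough sketch of the overlap argument on synthetic label windows. It assumes a series `t1` whose index is the label start time and whose value is the label end time; the helper `count_overlapping_train` and the window length of 5 are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def count_overlapping_train(t1, test_pos):
    """Count training samples whose window [start, end] overlaps any test window."""
    test_mask = np.zeros(len(t1), dtype=bool)
    test_mask[test_pos] = True
    train, test = t1[~test_mask], t1[test_mask]
    overlaps = np.zeros(len(train), dtype=bool)
    for start, end in test.items():
        # Two closed intervals overlap when each one starts before the other ends
        overlaps |= (train.index <= end) & (train.values >= start)
    return int(overlaps.sum())


# Synthetic data (assumption): 200 events, each label takes 5 time steps to resolve
t1 = pd.Series(np.arange(200) + 5, index=np.arange(200))

contiguous_test = np.arange(100, 120)  # one unshuffled test fold
rng = np.random.default_rng(0)
shuffled_test = rng.choice(200, size=20, replace=False)  # same size, shuffled

n_contiguous = count_overlapping_train(t1, contiguous_test)
n_shuffled = count_overlapping_train(t1, shuffled_test)
print(n_contiguous, n_shuffled)
```

With a contiguous test fold, only the training samples next to the two fold boundaries overlap the test windows. With a shuffled fold of the same size, every scattered test sample creates its own boundaries, so many more training samples overlap.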

### 7.3 Take the same pair of matrices (X, y) you used in exercise 2.
#### 7.3 (a) Derive the performance from a 10-fold purged CV of an RF on (X, y), with 1% embargo.
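
A minimal sketch of how purged k-fold splits with an embargo could be generated. This is not the book's `PurgedKFold` class or the `afml` implementation. `purged_kfold_indices` is a hypothetical helper, and it assumes `t1` is sorted by its index, with the index holding each label's start time and the value its end time. It is run here on synthetic windows only, so no performance figure is reported.

```python
import numpy as np
import pandas as pd


def purged_kfold_indices(t1, n_splits=10, pct_embargo=0.01):
    """Yield (train_pos, test_pos) positional arrays for purged k-fold CV."""
    n = len(t1)
    positions = np.arange(n)
    embargo = int(n * pct_embargo)  # number of samples dropped after each test block
    starts, ends = t1.index.values, t1.values
    for test_pos in np.array_split(positions, n_splits):
        first, last = test_pos[0], test_pos[-1]
        test_start = starts[first]
        test_end = ends[test_pos].max()
        # Purge on the left: keep only samples whose label ended before the test block starts
        left = positions[:first][ends[:first] < test_start]
        # Purge on the right: resume after the last test label ends, then skip the embargo
        right_begin = np.searchsorted(starts, test_end, side='right') + embargo
        right = positions[max(right_begin, last + 1):]
        yield np.concatenate([left, right]), test_pos


# Synthetic data (assumption): 200 events, each label takes 5 time steps to resolve
t1 = pd.Series(np.arange(200) + 5, index=np.arange(200))
splits = list(purged_kfold_indices(t1, n_splits=10, pct_embargo=0.01))
train_pos, test_pos = splits[5]
print(len(train_pos), len(test_pos))
```

In the notebook, the same generator could replace `kf.split(X)` in the loop from 7.2, assuming the label end times are first aligned to `X.index` and the rows are then selected with `.iloc` as before.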

#### 7.3 (b) Why is the performance lower?

#### 7.3 (c) Why is this result more realistic?

### 7.4 In this chapter we have focused on one reason why k-fold CV fails in financial applications, namely the fact that some information from the testing set leaks into the training set. Can you think of a second reason for CV’s failure?

Because financial data is non-stationary and highly correlated in time. Even when we use purging and an embargo, we may see a good result with k-fold CV only because the folds $S_{k_i}$ and $S_{k_{i+2}}$ in the training set are correlated with, and share similar market conditions with, the test fold $S_{k_{i+1}}$ that lies between them.