# Part 2 - Modelling
## Chapter 7 - Cross Validation in Finance

### 7.1 Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?

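Shuffling is generally a bad idea because financial observations are not IID: features are serially correlated and labels are often built from overlapping windows. Once the rows are shuffled, each test fold contains samples that are nearly identical to samples in the training folds, so information leaks from the training set into the test set and we can no longer monitor or control that leakage. The purpose of shuffling is to make every fold representative of the whole dataset, which only helps when the samples are IID. In financial datasets it defeats the purpose of k-fold CV: the test score no longer measures out-of-sample performance, and the inflated score may lead to false feature discoveries.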
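
To make the argument concrete, the sketch below counts how often a test observation sits directly next to a training observation in time, with and without shuffling. It is only an illustration on a plain integer index: the helper name `share_of_test_with_train_neighbor` and the sample size are assumptions made for this example, not part of the notebook or of the `afml` package.

```python
import numpy as np
from sklearn.model_selection import KFold


def share_of_test_with_train_neighbor(n_samples, shuffle, n_splits=10, seed=0):
    """Average share of test observations whose neighbor at t-1 or t+1 is in the training set.

    Hypothetical helper written for this illustration only.
    """
    # KFold rejects a random_state when shuffle=False, so pass it only when shuffling.
    kf = KFold(n_splits=n_splits, shuffle=shuffle, random_state=seed if shuffle else None)
    shares = []
    for train_idx, test_idx in kf.split(np.arange(n_samples)):
        in_train = np.zeros(n_samples, dtype=bool)
        in_train[train_idx] = True
        # Clipping maps an out-of-range neighbor back onto the test point itself,
        # which is never in the training set, so the edges are handled safely.
        prev_in_train = in_train[np.clip(test_idx - 1, 0, n_samples - 1)]
        next_in_train = in_train[np.clip(test_idx + 1, 0, n_samples - 1)]
        shares.append(np.mean(prev_in_train | next_in_train))
    return float(np.mean(shares))


share_unshuffled = share_of_test_with_train_neighbor(1000, shuffle=False)
share_shuffled = share_of_test_with_train_neighbor(1000, shuffle=True)
print(f"unshuffled: {share_unshuffled:.3f}, shuffled: {share_shuffled:.3f}")
```

Without shuffling, only the edges of each contiguous test block touch the training set, so the share stays at a few percent. With shuffling, almost every test observation has a time-adjacent training observation, which is exactly where serially correlated features and overlapping labels leak.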

### 7.2 Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from afml.data_analyst.financial_data_structures import get_t_events
from afml.data_analyst.labels import get_vertical_next_day, get_events_triple_barrier, get_bins, get_daily_vol
import numpy as np
import pandas as pd
from afml.data_analyst.sample_weights import sample_w_by_return, num_co_events
from afml.data_analyst.fractionally_differentiated_features import frac_diff_ffd

btc_dollar = pd.read_parquet('../data/dollar-bars-100000000-True.parquet')
# Keep bars from 2023-01-01 onward; the comparison assumes the index holds epoch milliseconds.
btc_dollar = btc_dollar[btc_dollar.index > pd.Timestamp('2023-01-01').timestamp() * 1000]
btc_dollar = btc_dollar.iloc[:-2]

# Fractionally differentiate the cumulative sum of log prices.
cum_log_prices = np.log(btc_dollar['close']).cumsum()
fracdiff_series = frac_diff_ffd(pd.DataFrame(cum_log_prices), 2, tau=1e-5)
t0 = get_t_events(fracdiff_series.index.values, fracdiff_series['close'].values,
                  float(4 * np.std(fracdiff_series.values)))

daily_vol = get_daily_vol(btc_dollar)
ms_a_day = 24 * 60 * 60 * 1000
num_days = 5
t1 = get_vertical_next_day(btc_dollar, t0, num_days * ms_a_day)

events = get_events_triple_barrier(
    close=btc_dollar.loc[fracdiff_series.index, 'close'],
    t0=t0,
    tp_scale=2,
    sl_scale=2,
    target=daily_vol,
    min_return=0,
    t1=t1,
)
labels = get_bins(events, btc_dollar['close'], t1)

X = pd.DataFrame({
    'close': fracdiff_series['close'].loc[labels.index],
    'close_lag_1': fracdiff_series['close'].shift(1).loc[labels.index],
    'close_lag_2': fracdiff_series['close'].shift(2).loc[labels.index],
},
    index=labels.index
)


model = RandomForestClassifier(n_estimators=100, random_state=42)
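# Chronological 70/30 split: with shuffle=False the test set lies strictly after the training set
# (random_state has no effect without shuffling, so it is omitted).
X_train, X_test, y_train, y_test = train_test_split(X, labels['bin'], test_size=0.3,
                                                    shuffle=False)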

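# To limit leakage, keep only the events that start before the first test observation.
# Note: this filters on the event start (the index of t1); events that start earlier
# but end inside the test period still overlap it and would need purging as well.
t1_train = t1[t1.index < X_test.index[0]]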

w_return = sample_w_by_return(
    t1=t1_train,
    co_events=num_co_events(X_train.index, t1_train),
    close=btc_dollar['close'],
)

#### 7.2 (a) Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.


#### 7.2 (b) Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.
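
Parts (a) and (b) differ only in the `shuffle` flag passed to `KFold`. The sketch below shows one way to run both, but on a synthetic dataset so that it is self-contained: `make_overlapping_dataset` and `ten_fold_accuracy` are hypothetical helpers written for this example, and the random-walk features and forward-window labels are assumptions standing in for the notebook's `X` and `labels['bin']`. The synthetic features are generated independently of the labels, so any accuracy well above 50% would have to come from leakage rather than predictive skill.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def make_overlapping_dataset(n_samples=2000, horizon=50, seed=0):
    """Synthetic stand-in for (X, y): persistent features, overlapping labels."""
    rng = np.random.default_rng(seed)
    # Three random-walk features, generated independently of the labels.
    X = rng.standard_normal((n_samples, 3)).cumsum(axis=0)
    # Each label is the sign of a forward-looking sum over `horizon` steps, so
    # consecutive labels share almost all of their underlying observations.
    noise = rng.standard_normal(n_samples + horizon - 1)
    forward_sum = np.convolve(noise, np.ones(horizon), mode="valid")
    y = (forward_sum > 0).astype(int)
    return X, y


def ten_fold_accuracy(X, y, shuffle, seed=42):
    """Mean accuracy of a random forest over a 10-fold CV."""
    # KFold rejects a random_state when shuffle=False, so pass it only when shuffling.
    cv = KFold(n_splits=10, shuffle=shuffle, random_state=seed if shuffle else None)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


X_demo, y_demo = make_overlapping_dataset()
acc_unshuffled = ten_fold_accuracy(X_demo, y_demo, shuffle=False)  # part (a)
acc_shuffled = ten_fold_accuracy(X_demo, y_demo, shuffle=True)     # part (b)
print(f"10-fold accuracy without shuffling: {acc_unshuffled:.3f}")
print(f"10-fold accuracy with shuffling:    {acc_shuffled:.3f}")
```

To apply the same comparison to the notebook's data, `ten_fold_accuracy(X, labels['bin'], shuffle=...)` could be called instead of the synthetic pair. This sketch does not use sample weights and does not purge overlapping events.
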
#### 7.2 (c) Why are both results so different?
#### 7.2 (d) How does shuffling leak information?