# Exercise 3.3: Cross Validation

Exercise based on Chapter 7 of the book Advances in Financial Machine Learning by Marcos Lopez de Prado.

In this exercise we will familiarize ourselves with some of the pitfalls of employing standard cross validation techniques to financial data. We will then take a look at some remedies, as well as explore when these have the greatest impact on measured performance.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

## K-fold Cross Validation

In order to play around with cross validation we first need to choose a model to test. For the purposes of this exercise we will utilize a simple random forest classifier to predict the labels generated using the triple-barrier method in exercise 2. As input features we will use the five previous prices known at the time the position was taken. We will then evaluate the performance of the classifier using variations of K-fold cross validation.

In [16]:
import pandas as pd

# Load labeled positions and parse datetime strings during file read
df = pd.read_csv('./data/labeled_positions.csv', index_col=0, parse_dates=['position_start', 'position_end'], infer_datetime_format=True)

df.drop(['profit_return_lim', 'loss_return_lim', 'touch_time'], axis=1, inplace=True)

# Load entire dollar bar dataset
df2 = pd.read_csv('./data/dollar_bars.csv', parse_dates=['datetime'], infer_datetime_format=True)

df.head()

Unnamed: 0,position_start,position_end,label
0,2010-01-04 18:26:23,2010-01-05 18:59:52,1
1,2010-01-04 18:52:10,2010-01-05 18:59:52,1
2,2010-01-04 21:21:08,2010-01-05 22:01:33,-1
3,2010-01-04 22:43:50,2010-01-06 00:08:46,-1
4,2010-01-05 00:14:59,2010-01-06 01:14:27,-1


Having loaded both the labeled positions and the entire dollar bar dataset we will for each position find the five previos prices in the dollar bar data and add them as features labeled `p0` to `p4`.

In [17]:
def find_prices(dfr, position):
    start = position['position_start']
    
    # Find row index of the position start in the whole dollar bar dataset
    i = dfr[dfr['datetime'] == start].index
    
    # Convert index to a numpy array and access elements by integer indexing
    i5 = i.to_numpy()[0] - np.arange(4, -1, -1)
    
    # Return five previous prices
    return dfr['price'].iloc[i5].values

In [18]:
df[[f'p{i}' for i in range(5)]] = df.apply(lambda x: find_prices(df2, x), axis=1, result_type='expand')
df.head()

Unnamed: 0,position_start,position_end,label,p0,p1,p2,p3,p4
0,2010-01-04 18:26:23,2010-01-05 18:59:52,1,1128.2,1128.2,1128.1,1127.6,1127.8
1,2010-01-04 18:52:10,2010-01-05 18:59:52,1,1128.2,1128.1,1127.6,1127.8,1127.5
2,2010-01-04 21:21:08,2010-01-05 22:01:33,-1,1127.6,1127.8,1127.5,1127.5,1128.6
3,2010-01-04 22:43:50,2010-01-06 00:08:46,-1,1127.8,1127.5,1127.5,1128.6,1128.4
4,2010-01-05 00:14:59,2010-01-06 01:14:27,-1,1127.5,1127.5,1128.6,1128.4,1128.1


Now that we have both our features and our label we can train a model and evaluate it using [K-Fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html). The function defined below trains a random forest classifier, evaluates it and prints out the average accuracy across folds.

An important parameter for the cross validation process is whether or not to shuffle the data before splitting into batches is done. For time series data such as ours this affects whether the testing data forms a contiguous block or not.

In [25]:
def kfold(df, shuffle=False, folds=8):
    X = df[[f'p{i}' for i in range(5)]].values
    y = df['label'].values

    scores = cross_val_score(RandomForestClassifier(), X, y, cv=folds)
    print(f'Average accuracy: {np.mean(scores)}')

In [26]:
df3 = df
#df3 = df.iloc[:10000]
kfold(df3, shuffle=False)

Average accuracy: 0.22155252845339857


In [22]:
df4 = df
#df4 = df.iloc[:10000]
kfold(df4, shuffle=True)

Average accuracy: 0.22165670689367079


### <u>Task 1</u>

Compare the reported accuracies with and without shuffling and with varying amounts of the data used. Analyze the results.

Which one gives a beter accuracy reading and does it depend on the amount of data used? Why is this the case, in particular with our financial timeseries dataset?

## Purged K-fold Cross validation

In order the improve the reliability of our cross validation results we can utilize a slight variant called purged K-fold cross validation. The difference to ordinary K-fold CV is that before training our model we will purge the training dataset of those positions that in any way overlap with a position in the testing dataset. This is to prevent information leakage between the two datasets.

By overlapping positions we mean any two positions that were open at the same time and thus depend on common parts of the price history.

The function below implements the purged variant of K-fold CV.

In [31]:
def purged_kfold(df, folds=8):
    
    kf = KFold(n_splits=folds, shuffle=False)

    scores = np.zeros(folds)
    for i, (train_index, test_index) in enumerate(kf.split(df)):

        df_train = df.iloc[train_index]
        df_test = df.iloc[test_index]

        test_first_start = df_test['position_start'].min()
        test_last_end = df_test['position_end'].max()

        # In order to prevent overlap we only keep training positions that ended before the first testing position
        # was opened or that started after the last testing position was closed
        keep = (df_train['position_end'] < test_first_start) | (df_train['position_start'] > test_last_end)
        print(f'Keeping {np.round(np.mean(keep.values)*100, 1)}% of train data in fold {i}')
        df_train = df_train[keep]

        X_train = df_train[[f'p{i}' for i in range(5)]].values
        y_train = df_train['label'].values

        X_test = df_test[[f'p{i}' for i in range(5)]].values
        y_test = df_test['label'].values

        clf = RandomForestClassifier()
        clf.fit(X_train, y_train)

        scores[i] = clf.score(X_test, y_test)
    
    print(f'\nAverage accuracy: {np.mean(scores)}')

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from mlfinlab.cross_validation import PurgedKFold

def purged_kfold(df, folds=8):


    scores = np.zeros(folds)
 
    previous_prices = []
    
    for start in df['position_start']:
        # Find the row index of the position start in the reference dataframe
        index = datetime_to_index.get(start, np.nan)
        
        # If the start datetime is not found, use NaN for previous prices
        if pd.isna(index) or index < 4:
            previous_prices.append([np.nan]*5)
        else:
            # Get the indices for the previous 5 prices
            indices = np.arange(index - 4, index + 1)
            # Append the previous 5 prices
            previous_prices.append(reference_df.loc[indices, 'price'].values)
    
    previous_prices = np.array(previous_prices)
    
    for i in range(5):
        df[f'p{i}'] = previous_prices[:, i]

    # Prepare the data for cross-validation
    X = df[[f'p{i}' for i in range(5)]]
    y = df['label']
    
    # Perform PurgedKFold cross-validation
    cv = PurgedKFold(n_splits=folds, t1=df['position_end'])
    
    scores = []

    for train_index, test_index in cv.split(X, y, df['position_start']):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        clf = RandomForestClassifier()
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        score = accuracy_score(y_test, y_pred)
        scores.append(score)
        print(f"Fold Score: {score}")

    mean_score = np.mean(scores)
    print(f"Mean Cross-Validation Score: {mean_score}")
    return mean_score



 


In [32]:
df3 = df
#df3 = df.iloc[:2000]
purged_kfold(df3)

Keeping 99.8% of train data in fold 0
Keeping 99.8% of train data in fold 1
Keeping 99.5% of train data in fold 2
Keeping 99.7% of train data in fold 3
Keeping 99.8% of train data in fold 4
Keeping 99.9% of train data in fold 5
Keeping 99.9% of train data in fold 6
Keeping 100.0% of train data in fold 7

Average accuracy: 0.36430351039892045


In [None]:
df3 = df
#df3 = df.iloc[:2000]
kfold(df3)

Average accuracy: 0.22164513818640602


: 

### <u>Task 2</u>

Compare the reported accuracies of normal and purged K-fold CV with varying amounts of the data used. Perform a similar analysis as in the previous task, with particular emphasis on the effect the amount of data used has. What would happen if we were to combine shuffling with purged K-fold CV?

**BONUS:** Add shuffling into the purged variant and modify the purging process to account for this.