This notebook shows an efficient way to create a grid search cross-validation training set for time series data.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [31]:
import yfinance as yf
import pandas as pd
import numpy as np

## Data preparation

Pick FAANG tech company as my dataset

In [26]:
tickers = ['META', 'AMZN', 'AAPL', 'NFLX', 'GOOG']
temp_list = []

for ticker in tickers:
    data = yf.download(ticker, '2019-01-01', '2023-12-31')
    data['Ticker'] = ticker
    temp_list.append(data)
    
temp = pd.concat(temp_list, keys = tickers, names=['Ticker', 'Date'])
temp = temp.drop(['Ticker'], axis = 1)
temp

[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed


Unnamed: 0_level_0,Unnamed: 1_level_0,Open,High,Low,Close,Adj Close,Volume
Ticker,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
META,2019-01-02,128.990005,137.509995,128.559998,135.679993,135.536194,28146200
META,2019-01-03,134.690002,137.169998,131.119995,131.740005,131.600372,22717900
META,2019-01-04,134.009995,138.000000,133.750000,137.949997,137.803787,29002100
META,2019-01-07,137.559998,138.869995,135.910004,138.050003,137.903687,20089300
META,2019-01-08,139.889999,143.139999,139.539993,142.529999,142.378937,26263800
...,...,...,...,...,...,...,...
GOOG,2023-12-22,142.130005,143.250000,142.054993,142.720001,142.720001,18494700
GOOG,2023-12-26,142.979996,143.945007,142.500000,142.820007,142.820007,11170100
GOOG,2023-12-27,142.830002,143.320007,141.050995,141.440002,141.440002,17288400
GOOG,2023-12-28,141.850006,142.270004,140.828003,141.279999,141.279999,12192500


In [28]:
temp.groupby('Ticker').size()

Ticker
AAPL    1258
AMZN    1258
GOOG    1258
META    1258
NFLX    1258
dtype: int64

To construct a generator for use with an sklearn model, you'll need to define a class for the Generator. The approach involves dividing the dataset into $n$ segments, where each test set follows sequentially. For instance, if you divide your dataset into 10 segments with a testing period of 20 days, the most recent test set will encompass the final 20 days of your dataset. The test set before that will cover the period from the 40th to the 20th last day, and so on.

In [161]:
class TimeSeriesCV:
    def __init__(self,test_size = 10,n_split = 10,train_size = 50, lag = 1):
        self.test_size = test_size
        self.n_split = n_split
        self.lag = lag
        self.train_size = train_size
        
        
    def split(self, X: pd.DataFrame, y: np.ndarray = None, groups: np.ndarray = None):
        date = X.index.get_level_values('Date').unique()  #Get the unique date listed from the dateset
        date = date.sort_values(ascending = False) #Sort the date list in reverse order
        sto = []
        for i in range(self.n_split):
            test_end_idx = i * self.n_split #calculate the end date from the test set
            test_start_idx = test_end_idx + self.test_size
            train_end_idx = test_start_idx + self.lag - 1
            train_start_idx = train_end_idx + self.train_size
            sto.append([test_end_idx, test_start_idx, train_end_idx, train_start_idx])

            
#Use the beginning date and ending date 
        dates_col = X.reset_index()[['Date']]
        for i in sto:
            train_idx = dates_col[(dates_col['Date'] > date[i[3]]) & (dates_col['Date'] <= date[i[2]])].index
            test_idx = dates_col[(dates_col['Date'] > date[i[1]]) & (dates_col['Date'] <= date[i[0]])].index
            
            
            yield train_idx, test_idx
            

A quick test for the TimeSeriesCV

In [165]:
aa = TimeSeriesCV(test_size =20, n_split = 10, train_size = 50, lag = 2)
dates_col = temp.reset_index()[['Date']]
dates_col

for i in aa.split(temp):
    print(dates_col.iloc[i[0]].tail(5))
    print(dates_col.iloc[i[1]].head(5))

           Date
6264 2023-11-22
6265 2023-11-24
6266 2023-11-27
6267 2023-11-28
6268 2023-11-29
           Date
1238 2023-12-01
1239 2023-12-04
1240 2023-12-05
1241 2023-12-06
1242 2023-12-07
           Date
6254 2023-11-08
6255 2023-11-09
6256 2023-11-10
6257 2023-11-13
6258 2023-11-14
           Date
1228 2023-11-16
1229 2023-11-17
1230 2023-11-20
1231 2023-11-21
1232 2023-11-22
           Date
6244 2023-10-25
6245 2023-10-26
6246 2023-10-27
6247 2023-10-30
6248 2023-10-31
           Date
1218 2023-11-02
1219 2023-11-03
1220 2023-11-06
1221 2023-11-07
1222 2023-11-08
           Date
6234 2023-10-11
6235 2023-10-12
6236 2023-10-13
6237 2023-10-16
6238 2023-10-17
           Date
1208 2023-10-19
1209 2023-10-20
1210 2023-10-23
1211 2023-10-24
1212 2023-10-25
           Date
6224 2023-09-27
6225 2023-09-28
6226 2023-09-29
6227 2023-10-02
6228 2023-10-03
           Date
1198 2023-10-05
1199 2023-10-06
1200 2023-10-09
1201 2023-10-10
1202 2023-10-11
           Date
6214 2023-09-13
6215 202

Test passed. As you can see, there is a lag 2 between the training set and testing set.