# Time Series Cross Validation

Version: 2020-11-26a


## A. Basic Setup

Let us begin by importing the data we need using `pandas`.

In [6]:
import numpy as np
import pandas as pd

# Import data
gdp = pd.read_excel("../Data/hk-gdp.xlsx")
unemployment = pd.read_excel("../Data/unemployment.xlsx")

# Create end-month: quarter q ends in month 3*q (Q1 -> 3, Q2 -> 6, ...)
gdp['end-month'] = gdp['quarter'] * 3

# Create gdp_growth: quarter-on-quarter growth rate of GDP
gdp['gdp_growth'] = gdp['gdp']/gdp['gdp'].shift(1) - 1

# Merge data
merged_data = unemployment.merge(gdp, how='left', on=['year','end-month'])
merged_data.head(12)

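|    | year | end-month | unemployment-rate | quarter | gdp      | gdp_growth |
|---:|-----:|----------:|------------------:|--------:|---------:|-----------:|
|  0 | 2010 |         1 |               4.6 |     NaN |      NaN |        NaN |
|  1 | 2010 |         2 |               4.4 |     NaN |      NaN |        NaN |
|  2 | 2010 |         3 |               4.4 |     1.0 | 422783.0 |        NaN |
|  3 | 2010 |         4 |               4.6 |     NaN |      NaN |        NaN |
|  4 | 2010 |         5 |               4.8 |     NaN |      NaN |        NaN |
|  5 | 2010 |         6 |               4.8 |     2.0 | 412768.0 |  -0.023688 |
|  6 | 2010 |         7 |               4.6 |     NaN |      NaN |        NaN |
|  7 | 2010 |         8 |               4.6 |     NaN |      NaN |        NaN |
|  8 | 2010 |         9 |               4.4 |     3.0 | 456830.0 |   0.106748 |
|  9 | 2010 |        10 |               4.2 |     NaN |      NaN |        NaN |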
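To see why most rows have no GDP figures, here is a minimal sketch of the same left merge on made-up numbers. The two toy frames and their values are assumptions for illustration only, not the Hong Kong data. The quarterly rows attach only to the month that ends each quarter, so every other month is left as `NaN`.

```python
import pandas as pd

# Toy monthly series (hypothetical values)
monthly = pd.DataFrame({
    "year": [2010] * 6,
    "end-month": [1, 2, 3, 4, 5, 6],
    "unemployment-rate": [4.6, 4.4, 4.4, 4.6, 4.8, 4.8],
})

# Toy quarterly series (hypothetical values); quarter q ends in month 3*q
quarterly = pd.DataFrame({
    "year": [2010, 2010],
    "quarter": [1, 2],
    "gdp": [100.0, 110.0],
})
quarterly["end-month"] = quarterly["quarter"] * 3
quarterly["gdp_growth"] = quarterly["gdp"].pct_change()

# Left merge keeps every monthly row; GDP columns are filled only in months 3 and 6
toy = monthly.merge(quarterly, how="left", on=["year", "end-month"])
print(toy)
```

Because the merge is a left join on the monthly frame, the number of rows stays equal to the number of months.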


## B. Walk Forward Split

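When working with time series data, we need to ensure the training data comes before the validation and test data. A random split would let the model train on observations from the future of the period it is tested on. Instead of randomly splitting the data, what we want is this: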

![walk-forward-split](https://i.stack.imgur.com/padg4.gif)

Scikit-learn's `TimeSeriesSplit` can produce such splits.

Syntax:
```python
tscv = TimeSeriesSplit(n_splits=5, max_train_size=None)
for train_index, test_index in tscv.split(merged_data):
    ...  # do something with the two index arrays
```
Options:
- `n_splits` controls the number of splits returned. The default is 5 splits. You probably want more if you have a very long time series.
- `max_train_size` specifies the maximum number of training samples in a split. The default is `None`, which means there is no limit. With no limit, every training set starts at the first observation, so each subsequent training set is longer than the one before. Specify this number if you want a rolling training window of fixed length. Note that it is only a cap: early splits can still be shorter if not enough data precedes the test set.
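The effect of `max_train_size` is easiest to see on a tiny example. The sketch below uses 12 dummy observations (an assumption for illustration, not our data) and compares the training-set lengths with and without the cap.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 dummy observations

expanding = TimeSeriesSplit(n_splits=3)                 # no cap on the training set
capped = TimeSeriesSplit(n_splits=3, max_train_size=4)  # at most 4 training samples

expanding_sizes = [len(train) for train, _ in expanding.split(X)]
capped_sizes = [len(train) for train, _ in capped.split(X)]

print("Expanding window:", expanding_sizes)  # [3, 6, 9]
print("Capped window   :", capped_sizes)     # [3, 4, 4]
```

The first capped split has only 3 training samples because only 3 observations come before its test set.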

Note that `tscv.split()` returns *indexes* (arrays of integer row positions), not the data itself. You are responsible for fetching the data according to the indexes, for example with `.iloc` on a DataFrame.

In [2]:
from sklearn.model_selection import TimeSeriesSplit

# 5 splits (the default), each trained on at most the 12 most recent months
tscv = TimeSeriesSplit(max_train_size=12)
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    print("Train:",train_index)
    print("Test :",test_index)

Split 0
Train: [13 14 15 16 17 18 19 20 21 22 23 24]
Test : [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45]
Split 1
Train: [34 35 36 37 38 39 40 41 42 43 44 45]
Test : [46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66]
Split 2
Train: [55 56 57 58 59 60 61 62 63 64 65 66]
Test : [67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87]
Split 3
Train: [76 77 78 79 80 81 82 83 84 85 86 87]
Test : [ 88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105
 106 107 108]
Split 4
Train: [ 97  98  99 100 101 102 103 104 105 106 107 108]
Test : [109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129]
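The test sets above all have 21 rows, and the first one starts at row 25. As far as I understand the default behaviour, the test size is the number of samples divided by `n_splits + 1` (rounded down), and the test sets are packed against the end of the series. The sketch below checks that arithmetic; the figure of 130 rows is inferred from the printed indices, and the dummy array stands in for `merged_data`. Check the documentation for your scikit-learn version if your numbers differ.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 130, 5  # 130 rows is inferred from the indices printed above

# Hand calculation of the default layout
test_size = n_samples // (n_splits + 1)              # 130 // 6 = 21
first_test_start = n_samples - n_splits * test_size  # 130 - 105 = 25

# What the splitter actually produces on a dummy array of the same length
tscv = TimeSeriesSplit(n_splits=n_splits, max_train_size=12)
splits = list(tscv.split(np.zeros((n_samples, 1))))
observed_test_sizes = {len(test) for _, test in splits}
observed_first_start = int(splits[0][1][0])

print(test_size, observed_test_sizes)
print(first_test_start, observed_first_start)
```

This also explains why asking for more splits gives shorter test sets: the same series is cut into more pieces.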


In [3]:
# 20 splits, each trained on at most the 12 most recent months
tscv = TimeSeriesSplit(n_splits=20, max_train_size=12)
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    print("Train:",train_index)
    print("Test :",test_index)

Split 0
Train: [0 1 2 3 4 5 6 7 8 9]
Test : [10 11 12 13 14 15]
Split 1
Train: [ 4  5  6  7  8  9 10 11 12 13 14 15]
Test : [16 17 18 19 20 21]
Split 2
Train: [10 11 12 13 14 15 16 17 18 19 20 21]
Test : [22 23 24 25 26 27]
Split 3
Train: [16 17 18 19 20 21 22 23 24 25 26 27]
Test : [28 29 30 31 32 33]
Split 4
Train: [22 23 24 25 26 27 28 29 30 31 32 33]
Test : [34 35 36 37 38 39]
Split 5
Train: [28 29 30 31 32 33 34 35 36 37 38 39]
Test : [40 41 42 43 44 45]
Split 6
Train: [34 35 36 37 38 39 40 41 42 43 44 45]
Test : [46 47 48 49 50 51]
Split 7
Train: [40 41 42 43 44 45 46 47 48 49 50 51]
Test : [52 53 54 55 56 57]
Split 8
Train: [46 47 48 49 50 51 52 53 54 55 56 57]
Test : [58 59 60 61 62 63]
Split 9
Train: [52 53 54 55 56 57 58 59 60 61 62 63]
Test : [64 65 66 67 68 69]
Split 10
Train: [58 59 60 61 62 63 64 65 66 67 68 69]
Test : [70 71 72 73 74 75]
Split 11
Train: [64 65 66 67 68 69 70 71 72 73 74 75]
Test : [76 77 78 79 80 81]
Split 12
Train: [70 71 72 73 74 75 76 77 78 79 80 81]


In [9]:
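# Fetching the actual data
# The output below comes from a splitter with 5 splits and no cap on the training size
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    # .iloc selects rows by integer position, which is what split() returns
    print("Train:",merged_data[["unemployment-rate","gdp_growth"]].iloc[train_index])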

Split 0
Train:     unemployment-rate  gdp_growth
0                 4.6         NaN
1                 4.4         NaN
2                 4.4         NaN
3                 4.6         NaN
4                 4.8         NaN
5                 4.8   -0.023688
6                 4.6         NaN
7                 4.6         NaN
8                 4.4    0.106748
9                 4.2         NaN
10                3.9         NaN
11                3.7    0.059368
12                3.5         NaN
13                3.4         NaN
14                3.4   -0.042327
15                3.6         NaN
16                3.7         NaN
17                3.7   -0.014801
18                3.7         NaN
19                3.5         NaN
20                3.4    0.085178
21                3.4         NaN
22                3.2         NaN
23                3.1    0.047136
24                3.0         NaN
Split 1
Train:     unemployment-rate  gdp_growth
0                 4.6         NaN
1                 

In [10]:
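# Predict GDP growth from the unemployment rate
from sklearn.linear_model import Ridge

n_splits = 5

# Keep only the quarter-end months, where gdp_growth is observed
merged_data_2 = merged_data.dropna(subset=['gdp_growth'])
ridge = Ridge(alpha=50)  # alpha sets the strength of the L2 penalty
tscv = TimeSeriesSplit(n_splits=n_splits)
oos_score_list = []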

print("Split  In-sample R^2  Out-of-Sample R^2")
print("-"*40)

# Loop through the splits. Run a Ridge Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(merged_data_2)):
    X_train = merged_data_2[["unemployment-rate"]].iloc[train_index]
    y_train = merged_data_2[["gdp_growth"]].iloc[train_index]
    X_test = merged_data_2[["unemployment-rate"]].iloc[test_index]
    y_test = merged_data_2[["gdp_growth"]].iloc[test_index]
    ridge.fit(X_train,y_train)
    oos_score = ridge.score(X_test,y_test)
    print(i,
          " "*4,
          round(ridge.score(X_train,y_train),2),
          " "*10, 
          round(oos_score,2))
    oos_score_list.append(oos_score)

print("-"*40)
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))

Split  In-sample R^2  Out-of-Sample R^2
----------------------------------------
0      0.0            -0.07
1      0.0            -0.05
2      0.0            0.0
3      0.0            -0.05
4      0.0            -0.18
----------------------------------------
Average out-of-sample score: -0.07


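This is obviously a pretty bad model: an out-of-sample R^2 below zero means it predicts worse than simply guessing the mean of the test data. But you get the idea.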
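If you only need the out-of-sample scores, the loop above can be shortened by passing the splitter to `cross_val_score` through its `cv` argument. The sketch below is a minimal illustration on synthetic data (the random series, the slope of 0.5 and `alpha=1.0` are all assumptions, since the Excel files are not needed here). For a regressor such as `Ridge`, the default score is R^2.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic data standing in for the real series (hypothetical, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=40)

# One out-of-sample R^2 per walk-forward split
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=TimeSeriesSplit(n_splits=5))

print("Out-of-sample R^2 per split:", scores.round(2))
print("Average out-of-sample score:", round(scores.mean(), 2))
```

The explicit loop is still the better choice when you also want the in-sample score or the fitted model for each split.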