# Imputation and Time Series Cross Validation


## A. Basic Setup

Let us begin by importing the data we need using `pandas`.

In [10]:
import pandas as pd

# Import data
hsi = pd.read_csv("../Data/hsi.csv")
gdp = pd.read_excel("../Data/hk-gdp.xlsx")
unemployment = pd.read_excel("../Data/unemployment.xlsx")

In [15]:
# gdp data
gdp.head()

Unnamed: 0,year,quarter,gdp
0,2010,1,422783
1,2010,2,412768
2,2010,3,456830
3,2010,4,483951
4,2011,1,463467


In [16]:
# Unemployment rate data
unemployment.head()

Unnamed: 0,year,end-month,unemployment-rate
0,2010,1,4.6
1,2010,2,4.4
2,2010,3,4.4
3,2010,4,4.6
4,2010,5,4.8


In order to merge the two sets of data, we need to generate `end-month` for `gdp`. We will also compute quarter-to-quarter GDP growth.

In [108]:
# Create end-month
gdp['end-month'] = gdp['quarter'] * 3

# Create gdp_growth
gdp['gdp_growth'] = gdp['gdp']/gdp['gdp'].shift(1) - 1

gdp.head()

Unnamed: 0,year,quarter,gdp,end-month,gdp_growth
0,2010,1,422783,3,
1,2010,2,412768,6,-0.023688
2,2010,3,456830,9,0.106748
3,2010,4,483951,12,0.059368
4,2011,1,463467,3,-0.042327


In [109]:
merged_data = unemployment.merge(gdp, how='left', on=['year','end-month'])
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_growth
0,2010,1,4.6,,,
1,2010,2,4.4,,,
2,2010,3,4.4,1.0,422783.0,
3,2010,4,4.6,,,
4,2010,5,4.8,,,
5,2010,6,4.8,2.0,412768.0,-0.023688
6,2010,7,4.6,,,
7,2010,8,4.6,,,
8,2010,9,4.4,3.0,456830.0,0.106748
9,2010,10,4.2,,,


## B. Pandas: Replace Missing Values with a Single Value

```python
DataFrame['new_column'] = DataFrame['existing_col'].fillna(DataFrame['existing_col'].ops())
```

For example, if we would like to replace missing GDP values with the mean of the same series:

In [80]:
merged_data['gdp_imputed'] = merged_data['gdp'].fillna(merged_data['gdp'].mean())
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
0,2010,1,4.6,,,590507.906977
1,2010,2,4.4,,,590507.906977
2,2010,3,4.4,1.0,422783.0,422783.0
3,2010,4,4.6,,,590507.906977
4,2010,5,4.8,,,590507.906977
5,2010,6,4.8,2.0,412768.0,412768.0
6,2010,7,4.6,,,590507.906977
7,2010,8,4.6,,,590507.906977
8,2010,9,4.4,3.0,456830.0,456830.0
9,2010,10,4.2,,,590507.906977


If you prefer to replace the original column instead of generating a new one, you can add the option `inplace=True`:

```python
DataFrame['existing_col'].fillna(DataFrame['existing_col'].ops(), inplace=True)
```

## C. Pandas: Index and Interpolation

If you want to fill missing values using interpolation instead of a single value, you will have to make a decision on the format of the index, because this affects the types of interpolation pandas allows you to use.

First let us try using more than one column as the index. This is called `MultiIndex` in pandas:

In [71]:
merged_data.index = [merged_data['year'],merged_data['end-month']]
merged_data.head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
year,end-month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2010,1,2010,1,4.6,,,
2010,2,2010,2,4.4,,,
2010,3,2010,3,4.4,1.0,422783.0,422783.0
2010,4,2010,4,4.6,,,419444.666667
2010,5,2010,5,4.8,,,416106.333333
2010,6,2010,6,4.8,2.0,412768.0,412768.0
2010,7,2010,7,4.6,,,427455.333333
2010,8,2010,8,4.6,,,442142.666667
2010,9,2010,9,4.4,3.0,456830.0,456830.0
2010,10,2010,10,4.2,,,465870.333333


The syntax for interpolating a column is: 

```python
DataFrame['new_column'] = DataFrame['existing_column'].interpolate(method='some_method')
```

`MultiIndex` only supports linear interpolation, which treats all observations as equally spaced:

In [72]:
merged_data['gdp_imputed'] = merged_data['gdp'].interpolate()
merged_data.head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
year,end-month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2010,1,2010,1,4.6,,,
2010,2,2010,2,4.4,,,
2010,3,2010,3,4.4,1.0,422783.0,422783.0
2010,4,2010,4,4.6,,,419444.666667
2010,5,2010,5,4.8,,,416106.333333
2010,6,2010,6,4.8,2.0,412768.0,412768.0
2010,7,2010,7,4.6,,,427455.333333
2010,8,2010,8,4.6,,,442142.666667
2010,9,2010,9,4.4,3.0,456830.0,456830.0
2010,10,2010,10,4.2,,,465870.333333


Next we will try a single index. We will need to combine year and month into a single number:

In [73]:
merged_data.index = merged_data['year'] - 2010 + merged_data['end-month']
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
1,2010,1,4.6,,,
2,2010,2,4.4,,,
3,2010,3,4.4,1.0,422783.0,422783.0
4,2010,4,4.6,,,419444.666667
5,2010,5,4.8,,,416106.333333
6,2010,6,4.8,2.0,412768.0,412768.0
7,2010,7,4.6,,,427455.333333
8,2010,8,4.6,,,442142.666667
9,2010,9,4.4,3.0,456830.0,456830.0
10,2010,10,4.2,,,465870.333333


A single index allows for many more [interpolations methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate). The default interoplation method is `linear`, giving the same result as before:

In [74]:
merged_data['gdp_imputed'] = merged_data['gdp'].interpolate()
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
1,2010,1,4.6,,,
2,2010,2,4.4,,,
3,2010,3,4.4,1.0,422783.0,422783.0
4,2010,4,4.6,,,419444.666667
5,2010,5,4.8,,,416106.333333
6,2010,6,4.8,2.0,412768.0,412768.0
7,2010,7,4.6,,,427455.333333
8,2010,8,4.6,,,442142.666667
9,2010,9,4.4,3.0,456830.0,456830.0
10,2010,10,4.2,,,465870.333333


Another possibly is `pad`, which simply uses the previous non-missing value:

In [75]:
merged_data['gdp_imputed'] = merged_data['gdp'].interpolate(method='pad')
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
1,2010,1,4.6,,,
2,2010,2,4.4,,,
3,2010,3,4.4,1.0,422783.0,422783.0
4,2010,4,4.6,,,422783.0
5,2010,5,4.8,,,422783.0
6,2010,6,4.8,2.0,412768.0,412768.0
7,2010,7,4.6,,,412768.0
8,2010,8,4.6,,,412768.0
9,2010,9,4.4,3.0,456830.0,456830.0
10,2010,10,4.2,,,456830.0


Some interpolation methods do not make sense in our case. For example, `values` use the actual numerical values of the index:

In [76]:
merged_data['gdp_imputed'] = merged_data['gdp'].interpolate(method='values')
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
1,2010,1,4.6,,,
2,2010,2,4.4,,,
3,2010,3,4.4,1.0,422783.0,422783.0
4,2010,4,4.6,,,463467.0
5,2010,5,4.8,,,483654.0
6,2010,6,4.8,2.0,412768.0,412768.0
7,2010,7,4.6,,,535907.0
8,2010,8,4.6,,,572160.0
9,2010,9,4.4,3.0,456830.0,456830.0
10,2010,10,4.2,,,526194.0


## D. Scikit-learn Imputers

You can also use scikit-learn's imputation classes. The `SimpleImputer` class replaces missing values with a single value, while the `IterativeImputer` replaces missing values by the prediction of a model fitted on non-missing values.

Let us first try the `SimpleImputer`:

In [94]:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values with the mean of the series
imp = SimpleImputer(strategy='mean')
X = imp.fit_transform(merged_data[['gdp']])
merged_data['gdp_imputed'] = X
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
0,2010,1,4.6,,590507.906893,590507.906893
1,2010,2,4.4,,590507.906908,590507.906908
2,2010,3,4.4,1.0,422783.0,422783.0
3,2010,4,4.6,,590507.906893,590507.906893
4,2010,5,4.8,,590507.906878,590507.906878
5,2010,6,4.8,2.0,412768.0,412768.0
6,2010,7,4.6,,590507.906893,590507.906893
7,2010,8,4.6,,590507.906893,590507.906893
8,2010,9,4.4,3.0,456830.0,456830.0
9,2010,10,4.2,,590507.906923,590507.906923


In [93]:
# Replace missing values with the most frequent value of the series
imp = SimpleImputer(strategy='most_frequent')
merged_data['gdp_imputed'] = imp.fit_transform(merged_data[['gdp']])
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_imputed
0,2010,1,4.6,,590507.906893,590507.906893
1,2010,2,4.4,,590507.906908,590507.906908
2,2010,3,4.4,1.0,422783.0,422783.0
3,2010,4,4.6,,590507.906893,590507.906893
4,2010,5,4.8,,590507.906878,590507.906878
5,2010,6,4.8,2.0,412768.0,412768.0
6,2010,7,4.6,,590507.906893,590507.906893
7,2010,8,4.6,,590507.906893,590507.906893
8,2010,9,4.4,3.0,456830.0,456830.0
9,2010,10,4.2,,590507.906923,590507.906923


With `IterativeImputer`, you can choose a model to predict the missing values. The default is a Bayesian Ridge Regression, which is similar to the usual Ridge Regression but with the strength of regularization estimated from data. To predict the missing value of a variable, the model will use all other variables you provide. 

Since it does not make sense to predict the absolute level of GDP with unemployment rate, we will predict GDP growth instead.

In [110]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer()
X = imp.fit_transform(merged_data[['unemployment-rate','gdp_growth']])
merged_data['ur_imputed'] = X[:,0]
merged_data['gdpg_imputed'] = X[:,1]
merged_data.head(12)

Unnamed: 0,year,end-month,unemployment-rate,quarter,gdp,gdp_growth,ur_imputed,gdpg_imputed
0,2010,1,4.6,,,,4.6,0.014928
1,2010,2,4.4,,,,4.4,0.014782
2,2010,3,4.4,1.0,422783.0,,4.4,0.014782
3,2010,4,4.6,,,,4.6,0.014928
4,2010,5,4.8,,,,4.8,0.015074
5,2010,6,4.8,2.0,412768.0,-0.023688,4.8,-0.023688
6,2010,7,4.6,,,,4.6,0.014928
7,2010,8,4.6,,,,4.6,0.014928
8,2010,9,4.4,3.0,456830.0,0.106748,4.4,0.106748
9,2010,10,4.2,,,,4.2,0.014636


## E. Walk Forward Split

When working with time series data we need to ensure the training data comes before the validation and test data. Instead of randomly splitting the data, what we want is this:

![walk-forward-split](https://i.stack.imgur.com/padg4.gif)

Scikit-learn's `TimeSeriesSplit` can produce such splits.

Syntax:
```python
tscv = TimeSeriesSplit(n_splits, max_train_size)
for train_index, test_index in tscv.split(merged_data):
    # do something
```
Options:
- `n_splits` controls the number of splits returned. The default is 5 splits. You probably want more if you have very long time series.
- `max_train_size` specifies the maximum number of training samples in a split. The default is `None`, which means there is no limit. This also means by default each subsequent split will be longer than before, so specify this number if you want the splits to have equal size.  

Note that `tscv.split()` returns *indexes*. You are responsible for fetching the data according to the indexes.

In [125]:
from sklearn.model_selection import TimeSeriesSplit

# 5 splits with 12 months of data in each split
tscv = TimeSeriesSplit(max_train_size=12)
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    print("Train:",train_index)
    print("Test :",test_index)

Split 0
Train: [13 14 15 16 17 18 19 20 21 22 23 24]
Test : [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45]
Split 1
Train: [34 35 36 37 38 39 40 41 42 43 44 45]
Test : [46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66]
Split 2
Train: [55 56 57 58 59 60 61 62 63 64 65 66]
Test : [67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87]
Split 3
Train: [76 77 78 79 80 81 82 83 84 85 86 87]
Test : [ 88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105
 106 107 108]
Split 4
Train: [ 97  98  99 100 101 102 103 104 105 106 107 108]
Test : [109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129]


In [129]:
# 20 splits with 12 months of data in each split
tscv = TimeSeriesSplit(n_splits=20,max_train_size=12)
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    print("Train:",train_index)
    print("Test :",test_index)

Split 0
Train: [0 1 2 3 4 5 6 7 8 9]
Test : [10 11 12 13 14 15]
Split 1
Train: [ 4  5  6  7  8  9 10 11 12 13 14 15]
Test : [16 17 18 19 20 21]
Split 2
Train: [10 11 12 13 14 15 16 17 18 19 20 21]
Test : [22 23 24 25 26 27]
Split 3
Train: [16 17 18 19 20 21 22 23 24 25 26 27]
Test : [28 29 30 31 32 33]
Split 4
Train: [22 23 24 25 26 27 28 29 30 31 32 33]
Test : [34 35 36 37 38 39]
Split 5
Train: [28 29 30 31 32 33 34 35 36 37 38 39]
Test : [40 41 42 43 44 45]
Split 6
Train: [34 35 36 37 38 39 40 41 42 43 44 45]
Test : [46 47 48 49 50 51]
Split 7
Train: [40 41 42 43 44 45 46 47 48 49 50 51]
Test : [52 53 54 55 56 57]
Split 8
Train: [46 47 48 49 50 51 52 53 54 55 56 57]
Test : [58 59 60 61 62 63]
Split 9
Train: [52 53 54 55 56 57 58 59 60 61 62 63]
Test : [64 65 66 67 68 69]
Split 10
Train: [58 59 60 61 62 63 64 65 66 67 68 69]
Test : [70 71 72 73 74 75]
Split 11
Train: [64 65 66 67 68 69 70 71 72 73 74 75]
Test : [76 77 78 79 80 81]
Split 12
Train: [70 71 72 73 74 75 76 77 78 79 80 81]


In [138]:
# Fetching the actual data
for i, (train_index, test_index) in enumerate(tscv.split(merged_data)):
    print("Split",i)
    print("Train:",merged_data[["ur_imputed","gdpg_imputed"]].iloc[train_index])

Split 0
Train:    ur_imputed  gdpg_imputed
0         4.6      0.014928
1         4.4      0.014782
2         4.4      0.014782
3         4.6      0.014928
4         4.8      0.015074
5         4.8     -0.023688
6         4.6      0.014928
7         4.6      0.014928
8         4.4      0.106748
9         4.2      0.014636
Split 1
Train:     ur_imputed  gdpg_imputed
4          4.8      0.015074
5          4.8     -0.023688
6          4.6      0.014928
7          4.6      0.014928
8          4.4      0.106748
9          4.2      0.014636
10         3.9      0.014417
11         3.7      0.059368
12         3.5      0.014125
13         3.4      0.014052
14         3.4     -0.042327
15         3.6      0.014198
Split 2
Train:     ur_imputed  gdpg_imputed
10         3.9      0.014417
11         3.7      0.059368
12         3.5      0.014125
13         3.4      0.014052
14         3.4     -0.042327
15         3.6      0.014198
16         3.7      0.014271
17         3.7     -0.014801
18       

In [170]:
# Predict GDP with unemployment rate
n_splits = 5

from sklearn.linear_model import Ridge

merged_data_2 = merged_data.dropna(subset=['gdp_growth'])
ridge = Ridge(alpha=50)
tscv = TimeSeriesSplit(n_splits=n_splits)
oos_score_list = []

print("Split  In-sample R^2  Out-of-Sample R^2")
print("-"*40)

# Loop through the splits. Run a Ridge Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(merged_data_2)):
    X_train = merged_data_2[["unemployment-rate"]].iloc[train_index]
    y_train = merged_data_2[["gdp_growth"]].iloc[train_index]
    X_test = merged_data_2[["unemployment-rate"]].iloc[test_index]
    y_test = merged_data_2[["gdp_growth"]].iloc[test_index]
    ridge.fit(X_train,y_train)
    oos_score = ridge.score(X_test,y_test)
    print(i,
          " "*4,
          round(ridge.score(X_train,y_train),2),
          " "*10, 
          round(oos_score,2))
    oos_score_list.append(oos_score)

print("-"*40)
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))

Split  In-sample R^2  Out-of-Sample R^2
----------------------------------------
0      0.0            -0.07
1      0.0            -0.05
2      0.0            0.0
3      0.0            -0.05
4      0.0            -0.18
----------------------------------------
Average out-of-sample score: -0.07


This is obviously a pretty bad model, but you get the idea.