# Imputation and Time Series Cross Validation


## A. Basic Setup

Let us begin by importing the data we need using `pandas`.

In [None]:
import pandas as pd

# Import data
gdp = pd.read_excel("../Data/hk-gdp.xlsx")
unemployment = pd.read_excel("../Data/unemployment.xlsx")

In [None]:
# gdp data
gdp.head()

In [None]:
# Unemployment rate data
unemployment.head()

In order to merge the two sets of data, we need to generate `end-month` for `gdp`. We will also compute quarter-to-quarter GDP growth.

In [None]:
# Create end-month


# Create gdp_growth


gdp.head()
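
The two steps above can be sketched on toy data as follows. This is only an illustration: the column names `year`, `quarter`, `gdp`, `end_month` and `gdp_growth` are assumptions and may not match the columns in `hk-gdp.xlsx`.

```python
import pandas as pd

# Toy quarterly GDP data; the column names are assumptions,
# not necessarily those found in hk-gdp.xlsx
gdp_demo = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019],
    "quarter": [1, 2, 3, 4],
    "gdp": [100.0, 102.0, 101.0, 104.0],
})

# The last month of quarter q is month 3 * q
gdp_demo["end_month"] = gdp_demo["quarter"] * 3

# Quarter-to-quarter growth: (gdp_t - gdp_{t-1}) / gdp_{t-1}
gdp_demo["gdp_growth"] = gdp_demo["gdp"].pct_change()
```

The first growth value is missing because there is no earlier quarter to compare with.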

Now let us merge the two datasets:

In [None]:
# Merge data
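
A minimal sketch of such a merge, assuming monthly unemployment data and quarterly GDP data keyed by year and month. The frames and column names here are hypothetical stand-ins for the real datasets.

```python
import pandas as pd

# Hypothetical quarterly GDP data, already carrying the end month of each quarter
gdp_demo = pd.DataFrame({
    "year": [2019, 2019],
    "end_month": [3, 6],
    "gdp": [100.0, 102.0],
})

# Hypothetical monthly unemployment data
unemp_demo = pd.DataFrame({
    "year": [2019] * 6,
    "month": [1, 2, 3, 4, 5, 6],
    "unemployment_rate": [2.8, 2.8, 2.8, 2.9, 2.8, 2.8],
})

# Give both frames the same key names, then keep every monthly row (left join)
merged_demo = unemp_demo.merge(
    gdp_demo.rename(columns={"end_month": "month"}),
    how="left",
    on=["year", "month"],
)
```

With a left join on the monthly data, months that are not the end of a quarter have a missing `gdp` value, which is what the rest of this notebook fills in.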


## B. Pandas: Replace Missing Values with a Single Value

The syntax for filling missing values with a single value is:

```python
DataFrame['new_column'] = DataFrame['existing_col'].fillna(DataFrame['existing_col'].ops())
```

Here `ops` stands for an aggregation method such as `mean` or `median`. For example, to replace missing GDP values with the mean of the same series:

In [None]:

merged_data.head(12)
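
As an illustration, mean filling might look like this on a toy column. The frame `demo` and the column names are assumptions, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy column with gaps, mimicking quarterly GDP observed in a monthly frame
demo = pd.DataFrame({"gdp": [np.nan, np.nan, 100.0, np.nan, np.nan, 104.0]})

# mean() skips missing values by default, so this is the mean of 100 and 104
demo["gdp_mean_filled"] = demo["gdp"].fillna(demo["gdp"].mean())
```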

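If you prefer to replace the original column instead of generating a new one, assign the result back to the same column. Calling `fillna(..., inplace=True)` on a selected column is a chained operation that recent pandas versions warn about and that may leave the DataFrame unchanged:

```python
DataFrame['existing_col'] = DataFrame['existing_col'].fillna(DataFrame['existing_col'].ops())
```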

## C. Pandas: Index and Interpolation

If you want to fill missing values using interpolation instead of a single value, you will have to make a decision on the format of the index, because this affects the types of interpolation pandas allows you to use.

First let us try using more than one column as the index. This is called `MultiIndex` in pandas:

In [None]:

merged_data.head(12)
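
A short sketch of setting a `MultiIndex`, assuming the frame has `year` and `month` columns (hypothetical names used here for illustration):

```python
import pandas as pd

# Toy frame; "year" and "month" are assumed column names
demo = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "month": [1, 2, 3],
    "gdp": [None, None, 100.0],
})

# Passing a list of columns to set_index produces a MultiIndex
demo = demo.set_index(["year", "month"])

# Rows are then addressed by a (year, month) tuple
march_gdp = demo.loc[(2019, 3), "gdp"]
```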

The syntax for interpolating a column is: 

```python
DataFrame['new_column'] = DataFrame['existing_column'].interpolate(method='some_method')
```

`MultiIndex` only supports linear interpolation, which treats all observations as equally spaced:

In [None]:

merged_data.head(12)
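
Linear interpolation on a `MultiIndex` can be sketched as below, again with hypothetical column names. Note that with the default settings, missing values before the first observation are left unfilled.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by (year, month); names are assumptions
demo = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019],
    "month": [3, 4, 5, 6],
    "gdp": [100.0, np.nan, np.nan, 103.0],
}).set_index(["year", "month"])

# Linear interpolation spaces the filled values evenly between neighbours
demo["gdp_linear"] = demo["gdp"].interpolate(method="linear")
```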

Next we will try a single index. We will need to combine year and month into a single number:

In [None]:

merged_data.head(12)
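
One possible way to combine year and month into a single number is `year * 12 + month`, which yields consecutive integers even across a year boundary. This encoding and the column name `period` are assumptions made for this sketch, and other encodings would work too.

```python
import pandas as pd

# Toy frame spanning a year boundary; column names are assumptions
demo = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "month": [10, 11, 12, 1],
    "gdp": [100.0, None, None, 103.0],
})

# Count months so that December 2019 and January 2020 are one unit apart
demo["period"] = demo["year"] * 12 + demo["month"]
demo = demo.set_index("period")
```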

A single index allows for many more [interpolation methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate). The default interpolation method is `linear`, giving the same result as before:

In [None]:

merged_data.head(12)
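
To see what a single numeric index adds, the sketch below contrasts the default `linear` method with `method="index"`, which uses the index values as positions. The toy series deliberately skips one index value. It is an illustration on made-up numbers, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose numeric index skips the value 2
demo = pd.Series([100.0, np.nan, 103.0], index=[0, 1, 3], name="gdp")

# Default method="linear": ignores the index and treats rows as equally spaced
linear_fill = demo.interpolate()

# method="index": weights the neighbours by distance along the index
index_fill = demo.interpolate(method="index")
```

Here `linear` places the missing value halfway between its neighbours, while `index` places it one third of the way, because index 1 is one unit from 0 and two units from 3.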

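Another possibility is `pad`, which simply carries forward the previous non-missing value. Recent pandas versions deprecate `method='pad'` in `interpolate` in favor of calling `ffill()` directly: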

In [None]:

merged_data.head(12)
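
Forward filling can be sketched with `ffill()`, which carries the last observed value forward in the same way as the `pad` method. The toy column below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with gaps after each observation
demo = pd.DataFrame({"gdp": [100.0, np.nan, np.nan, 103.0, np.nan]})

# Each missing value takes the most recent non-missing value above it
demo["gdp_ffill"] = demo["gdp"].ffill()
```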

## D. Scikit-learn Imputers

You can also use scikit-learn's imputation classes. The `SimpleImputer` class replaces missing values with a single value, while the `IterativeImputer` replaces missing values by the prediction of a model fitted on non-missing values.

Let us first try the `SimpleImputer`:

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values with the mean of the series



merged_data.head(12)
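
A minimal sketch of mean imputation with `SimpleImputer`, using a toy frame with a hypothetical `gdp` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing values
demo = pd.DataFrame({"gdp": [100.0, np.nan, 104.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform expects 2-D input (hence the double brackets) and returns
# a 2-D array, which ravel() flattens back into a single column
demo["gdp_imputed"] = imputer.fit_transform(demo[["gdp"]]).ravel()
```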

In [None]:
# Replace missing values with the most frequent value of the series


merged_data.head(12)
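
The same pattern works with `strategy="most_frequent"`. The sketch below uses a hypothetical `unemployment_rate` column with made-up values, since repeated values are more natural there than in a GDP series.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column in which 2.8 appears most often
demo = pd.DataFrame({"unemployment_rate": [2.8, np.nan, 2.9, 2.8, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Missing entries are replaced by the mode of the observed values
demo["rate_imputed"] = imputer.fit_transform(demo[["unemployment_rate"]]).ravel()
```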

With `IterativeImputer`, you can choose a model to predict the missing values. The default is a Bayesian Ridge Regression, which is similar to the usual Ridge Regression but with the strength of regularization estimated from data. To predict the missing value of a variable, the model will use all other variables you provide. 

Since it does not make sense to predict the absolute level of GDP with unemployment rate, we will predict GDP growth instead.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer



merged_data.head(12)
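
A sketch of how `IterativeImputer` might be applied here, using made-up numbers and hypothetical column names. With the default estimator, the missing `gdp_growth` values are predicted from `unemployment_rate`, and the observed values are left unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
from sklearn.impute import IterativeImputer

# Toy data; values and column names are assumptions for illustration
demo = pd.DataFrame({
    "unemployment_rate": [2.8, 2.9, 3.1, 3.4, 3.3, 3.0],
    "gdp_growth": [0.010, np.nan, 0.002, np.nan, -0.004, 0.006],
})

# The default estimator is BayesianRidge; random_state fixes the result
imputer = IterativeImputer(random_state=0)
imputed = imputer.fit_transform(demo[["unemployment_rate", "gdp_growth"]])

# Columns come back in the order they were passed in
demo["gdp_growth_imputed"] = imputed[:, 1]
```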