## Prediction of Volatility with Econometrics-Deep Learning Integrated Model   
### : Predicting VKOSPI with GARCH-RNN Model

In this era of uncertainty, market volatility has never been higher. The VIX, also known as the 'fear index', hit a record high in the midst of COVID-19. Predicting tomorrow's volatility is key to future investment.  

Meanwhile, the machine learning boom is overtaking every industry, from automobiles to retail. Machine learning solutions are revolutionary; in other words, they pay little respect to conventional ways of doing things. This disregard can be a shortcoming in some cases.   

This notebook aims to predict market volatility. Instead of relying on machine learning alone, it integrates machine learning with a good ol' econometric (statistical) model. Data from the Korean stock market from 2009 to 2019 is used. The target is VKOSPI, which represents the volatility of the Korean stock market.


[Link to dataset](https://www.kaggle.com/ninetyninenewton/vkospi)

### Volatility
The volatility $\sigma$ of a stock is a measure of our uncertainty about the returns provided by the stock. It can be defined as the standard deviation of the return provided by the stock. Volatility can be measured in several ways.
1. Historical Volatility  
> Historical volatility is calculated from historical data of a stock price.  

1. Implied Volatility  
> Implied volatility is calculated from options prices observed in the market. VKOSPI indicates implied volatility.

**Therefore, we**  
1. **Estimate historical volatility using econometric model.**  
1. **Close gap between historical volatility and implied volatility with option-related variables using deep learning.**
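
To make the definition of historical volatility concrete, here is a minimal sketch that computes it as the standard deviation of daily returns and annualizes it. The price series is hypothetical, and the 252-trading-day convention and square-root-of-time scaling are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only (not from the dataset)
prices = pd.Series([100.0, 101.0, 99.5, 100.5, 102.0, 101.0])

# Daily simple returns: (S_n - S_{n-1}) / S_{n-1}
returns = prices.pct_change().dropna()

# Daily volatility = sample standard deviation of the returns
daily_vol = returns.std()

# Annualize with the square-root-of-time rule (252 trading days assumed), in percent
annual_vol_pct = daily_vol * np.sqrt(252) * 100

print(f"daily volatility: {daily_vol:.4f}")
print(f"annualized volatility: {annual_vol_pct:.2f}%")
```

The annualized, percentage form is the scale on which an index such as VKOSPI is quoted, which is why the GARCH output is later rescaled the same way.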

## 0. Setup

In [None]:
# Import packages
import os
import pickle

import numpy as np 
import pandas as pd 

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import tensorflow as tf
from tensorflow import keras

from scipy import stats
import scipy.optimize as optimize 

In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## 1. Preprocess Data

### 1-1. Import data

In [None]:
# Import data - Korean Market Data
org_data = pd.read_csv("../input/vkospi/options_KR.csv")


# Import data - Volatility index around the world

CBOE_Volatility = pd.read_csv("../input/volatility-index-around-the-world/CBOE Volatility Index Historical Data.csv")
HSI_Volatility  = pd.read_csv("../input/volatility-index-around-the-world/HSI Volatility Historical Data.csv")
Nikkei_Volatility = pd.read_csv("../input/volatility-index-around-the-world/Nikkei Volatility Historical Data.csv")

CBOE_Volatility = CBOE_Volatility.loc[:,['Date','Price']]
HSI_Volatility = HSI_Volatility.loc[:,['Date','Price']]
Nikkei_Volatility = Nikkei_Volatility.loc[:,['Date','Price']]

In [None]:
org_data.head()

In [None]:
CBOE_Volatility.head()

### 1-2. Convert date data

Convert date data into pd.datetime


In [None]:
# Convert date data

org_data['Date'] = pd.to_datetime(org_data['Date'])
CBOE_Volatility['Date'] = pd.to_datetime(CBOE_Volatility['Date'])
HSI_Volatility['Date'] = pd.to_datetime(HSI_Volatility['Date'])
Nikkei_Volatility['Date'] = pd.to_datetime(Nikkei_Volatility['Date'])

### 1-3. Preprocess time series data

Target data of Day n is predicted from the variables of Day n-1, except for 'Day_till_expiration' and 'Day_of_a_week' since they can be known in advance.  

It would be convenient to preprocess the dataframe so the input variables and the corresponding target variables are in the same row.

In [None]:
pivot = ['Date','VKOSPI','Day_till_expiration','Day_of_a_week']

# Save the original data
full_data = org_data.copy()

# Shift the pivot data
full_data[pivot] = full_data[pivot].shift(periods=-1)

# Drop the last row
full_data = full_data.drop(full_data.index[-1])

In [None]:
full_data.head()

### 1-4. Format foreign volatility indices as input data

Like most other markets, the Korean market is largely affected by global markets. Thus, volatility indices of major foreign markets can help predict the volatility of the Korean market. Volatility indices of the US (VIX), Hong Kong (VHSI), and Japan (Nikkei) markets are used here.

Because Seoul's time zone is ahead of or the same as those of the US, Hong Kong, and Japan, 
none of the same-day foreign volatility indices can be known in advance.   

Therefore, when predicting Day n, foreign volatility indices of Day n-1 are used.   
If Day n-1 is not available, Day n-2 is used.     
If Day n-2 is not available, Day n-3, and so on.    


In [None]:
# Format foreign volatility indices as input data

def correspond_foreign_vol(date,data):
    """
    find the 'Price' of the date that is most recent to the given 'Date' in the 'data'
    """
    
    while True:
        date = date - pd.Timedelta('1 Day')  # go back one day 
        result_series = data['Price'].loc[data['Date']==date]  # find the 'Price' in the data that matches with the 'date'
        
        if not result_series.empty:  # if not empty (which means there is a row that matches with the 'date')
            result_series = result_series.reset_index()  # reset index
            result_value = result_series['Price'][0]  # and get the value
            return result_value

# Apply function

full_data['CBOE'] = full_data['Date'].apply(correspond_foreign_vol,data=CBOE_Volatility)
full_data['HSI'] = full_data['Date'].apply(correspond_foreign_vol,data=HSI_Volatility)
full_data['Nikkei'] = full_data['Date'].apply(correspond_foreign_vol,data=Nikkei_Volatility)
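
The same "most recent strictly earlier date" lookup can also be sketched with `pd.merge_asof`, which avoids the day-by-day loop. This is an alternative under assumptions, not the notebook's method: the toy frames and their values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
kr = pd.DataFrame({'Date': pd.to_datetime(['2020-01-06', '2020-01-07', '2020-01-08'])})
vix = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-03', '2020-01-06', '2020-01-08']),
    'Price': [14.0, 13.9, 13.5],
})

# merge_asof needs both frames sorted by the key.
# direction='backward' looks for the latest foreign date up to each Korean date;
# allow_exact_matches=False excludes the same day, so only strictly earlier dates match.
merged = pd.merge_asof(
    kr.sort_values('Date'),
    vix.sort_values('Date'),
    on='Date',
    direction='backward',
    allow_exact_matches=False,
)
print(merged)
```

One behavioral difference is worth noting: a date with no earlier foreign observation gets `NaN` here, whereas the `while True` loop in `correspond_foreign_vol` would keep stepping back without terminating.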

In [None]:
full_data[['Date','CBOE','HSI','Nikkei']].head()

## 2. Estimate historical volatility with GARCH model
GARCH, or GARCH(1,1) to be precise, is a widely used econometric model for estimating historical volatility.

GARCH estimates the historical volatility on Day n from: 
* **Rate of return** (of an underlying asset) on Day n-1 (denoted by $ u $)
* **Volatility** on Day n-1 (denoted by $\sigma$)
* Note that $\sigma^{2}$ is often referred to as the **variance**

There are three coefficients in the model:  
* **alpha**, coefficient of $ u^{2} $
* **beta**, coefficient of $ \sigma^{2} $
* **omega**, constant (It is actually not just a constant, but we keep it simple here.)

The formula for estimating (historical) volatility of Day n is:  
  
$$
\sigma^{2}_{n}= \omega + \alpha u^{2}_{n-1} + \beta \sigma^{2}_{n-1}
$$
  
  
    
All formulas are referenced from John C. Hull's "Options, Futures, and Other Derivatives" (8th edition).    

For more info,  
[GARCH Model](https://vlab.stern.nyu.edu/docs/volatility/GARCH)
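
As a worked illustration of the recursion, the sketch below applies the GARCH(1,1) update to three returns. The coefficients and returns are hypothetical values chosen for readability, not estimates from this notebook. It also computes the long-run variance $\omega / (1 - \alpha - \beta)$, which is defined only when $\alpha + \beta < 1$.

```python
# Hypothetical coefficients and returns, for illustration only
alpha, beta, omega = 0.10, 0.85, 0.000002

# Long-run variance implied by the coefficients (requires alpha + beta < 1)
long_run_variance = omega / (1 - alpha - beta)

returns = [0.010, -0.020, 0.005]   # u_{n-1} for three consecutive days
variance = long_run_variance       # start the recursion at the long-run level

path = []
for u in returns:
    # sigma^2_n = omega + alpha * u^2_{n-1} + beta * sigma^2_{n-1}
    variance = omega + alpha * u**2 + beta * variance
    path.append(variance)

print('long-run daily volatility:', long_run_variance ** 0.5)
print('daily volatility path:', [v ** 0.5 for v in path])
```

The large return on the second day pushes the variance up, and the smaller return on the third day lets it start decaying back toward the long-run level.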

### 2-1. Calculate return 
  
Calculate 'rate of return', or 'return' in short.  

The formula for return is the following, where $u$ denotes the return and $S$ denotes the price of the underlying asset (e.g. a stock). There is another formula based on logarithms, but this one is used here.  
  
    
$$
u_{n} = \frac{S_{n} - S_{n-1}}{S_{n-1}}
$$


In [None]:
# Calculate return

KOSPI200_yesterday = org_data['KOSPI200'].shift(periods=1)  # Series made up of shifted KOSPI200
return_array = (org_data['KOSPI200']-KOSPI200_yesterday) / KOSPI200_yesterday  # Calculate return
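
For comparison with the logarithm-based formula mentioned in this section, here is a minimal sketch computing both kinds of return on a hypothetical price series. For small daily moves the two are nearly identical.

```python
import numpy as np
import pandas as pd

# Hypothetical index levels, for illustration only
prices = pd.Series([250.0, 252.5, 249.0, 251.0])

simple_return = prices.pct_change()              # (S_n - S_{n-1}) / S_{n-1}
log_return = np.log(prices / prices.shift(1))    # ln(S_n / S_{n-1})

comparison = pd.DataFrame({'simple': simple_return, 'log': log_return})
print(comparison)
```

The first row is `NaN` in both columns because there is no previous price, which is why the optimization loop later skips the first element of the return array.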

### 2-2. Build GARCH model

Build a function which estimates the volatility of Day n with GARCH model.  
Following is a simple reminder of the formula.  

$$
\sigma^{2}_{n}= \omega + \alpha u^{2}_{n-1} + \beta \sigma^{2}_{n-1}
$$

In [None]:
# GARCH model
def garch_forward(return_rate,variance,coefficients):
    ''' data type: float, float, 1d array(length=3)'''
    
    # Coefficients
    alpha,beta,omega = coefficients
    # Calculate
    return omega + alpha*return_rate*return_rate + beta*variance

### 2-3. Choose the values for coefficients in the model - Build function

#### 2-3-1. Method

To estimate the volatility with the model, the values of the coefficients must be chosen. This is done from historical data using the maximum likelihood method, which chooses the coefficients that maximize the probability of observing the historical data.   
  
It is assumed that the return on any Day n ($u_{n}$) follows a normal distribution with zero mean and variance $v_{n}$. The probability density of observing $u_{n}$ is therefore  
$$\frac{1}{\sqrt{2\pi v_{n}}} \exp\left(\frac{-u_{n}^{2}}{2v_{n}}\right)$$
  
Taking logarithms and ignoring constant multiplicative factors, the expression we wish to maximize for each day is the following.
$$-\ln(v_{n})-\frac{u_{n}^{2}}{v_{n}}$$
  
The sum of this expression over the data is what we finally want to maximize. 
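
The relationship between the simplified objective and the full Gaussian log-likelihood can be checked numerically. The sketch below uses hypothetical returns and variances; it shows that the two differ only by a factor of 1/2 and an additive constant, so they are maximized by the same coefficients.

```python
import numpy as np
from scipy import stats

# Hypothetical returns and variances, for illustration only
u = np.array([0.010, -0.020, 0.005])
v = np.array([1.0e-4, 1.2e-4, 1.5e-4])

# Full Gaussian log-likelihood with zero mean and variance v_n
full_loglik = stats.norm.logpdf(u, loc=0.0, scale=np.sqrt(v)).sum()

# Simplified objective from the text: sum of -ln(v_n) - u_n^2 / v_n
objective = np.sum(-np.log(v) - u**2 / v)

# Rebuild the full log-likelihood from the simplified objective
reconstructed = 0.5 * objective - 0.5 * len(u) * np.log(2 * np.pi)
print(full_loglik, reconstructed)
```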
  
  
#### 2-3-2. Values
* Return($u_{n}$) is calculated directly from the data. (This was done in 2-1.)  
* Variance($v_{n}$) is estimated by GARCH. Initial variance is set from the real VKOSPI value at that date.  
  
  The GARCH **variance ($\sigma^{2}$)** is a daily quantity.  
**VKOSPI** is an index of annualized **volatility ($\sigma$)**, expressed in percent (%).  
Assuming there are 252 trading days per year, **VKOSPI** can be converted into a daily **variance** as follows.  
  
$$
Variance = (VKOSPI)^{2} \: / \: (252 \times 100^{2})
$$
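
The conversion between VKOSPI and daily variance can be sketched as a pair of helper functions. The function names are hypothetical; the arithmetic matches the divisor `2520000` (that is, 252 × 100²) used in the notebook code.

```python
import numpy as np

TRADING_DAYS = 252  # assumption stated in the text

def vkospi_to_daily_variance(vkospi):
    """Annualized volatility in percent -> daily variance."""
    annual_vol = vkospi / 100.0            # percent -> decimal
    return annual_vol ** 2 / TRADING_DAYS  # annual variance -> daily variance

def daily_variance_to_vkospi(variance):
    """Daily variance -> annualized volatility in percent."""
    return np.sqrt(variance * TRADING_DAYS) * 100.0

daily_var = vkospi_to_daily_variance(20.0)   # hypothetical VKOSPI level of 20
round_trip = daily_variance_to_vkospi(daily_var)
print(daily_var)
print(round_trip)
```

The second function is the inverse used when GARCH variances are rescaled for comparison with VKOSPI.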

In [None]:
# Set VKOSPI as an initial variance in the model

initial_vkospi = full_data['VKOSPI'][1] 
initial_variance = initial_vkospi*initial_vkospi/2520000 

In [None]:
# Function for optimization

def garch_for_optimization(array): 
    ''' data type: 1d array(length=3)'''
    
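    # Coefficients
    alpha,beta,omega = array 
    
    # Variables
    sum_probability = 0 # to maximize
    variance = initial_variance
    
    # NOTE: `return_array_train` is not defined in the cells shown in this notebook.
    # It is presumably the training-period slice of `return_array` (from 2-1) and must be
    # defined before this function runs (only needed when `execute = True` in 2-4).
    for i in range(1,return_array_train.shape[0]):  # exclude the first value because it's nan.
        return_rate = return_array_train[i]  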
        
        # in case something goes wrong
        if variance<=0:
            print("Negative variance")
            break
        
        # calculate probability in a single day
        probability = -np.log(variance) - return_rate * return_rate / variance
        
        # add to the sum
        sum_probability += probability
        
        # calculate next day's variance by GARCH
        variance = garch_forward(return_rate,variance,array)
   
    return -sum_probability # note the sign(-); because scipy.optimize requires a function to be minimized

### 2-4. Choose the values for coefficients in the model - Optimize
  
Using the scipy.optimize library, we optimize the function built in the previous step. 

NYU V-Lab's estimate (as of Feb 15, 2020) is used as the initial guess. V-Lab's estimate is not adopted as-is because the ideal coefficients can differ by time period. The bounds follow from the underlying concept of GARCH, which is not explained further here.

Since the optimization takes quite some time, the result is already pickled and can be loaded.


In [None]:
# Optimize
# if execute is set to True, the optimization (which takes some time) will run
execute = False    

if execute:
    bounds = optimize.Bounds([0,0,0],[1,1,np.inf])
    initial_guess = [0.14,0.76,2.97]  # V-Lab's estimate

    # Trust-constr performed the best 
    optimize_res_trust = optimize.minimize(garch_for_optimization,initial_guess,method='trust-constr',bounds=bounds)

    # Pickle the result since it takes too much time
    with open('../input/optimize_res_trust.pkl','wb') as file:
        pickle.dump(optimize_res_trust,file)

In [None]:
# Load the pickle: optimize_res_trust
with open('/kaggle/input/optimize-res-trust/optimize_res_trust.pkl','rb') as file:
    optimize_res_trust = pickle.load(file)

# Optimization result
print('optimization result:')
print('alpha =',optimize_res_trust.x[0])
print('beta =',optimize_res_trust.x[1])
print('omega =',optimize_res_trust.x[2])

### 2-5. Estimate historical volatility

Now that the coefficients are chosen, we can estimate the historical volatility of each day using the GARCH model.  

In [None]:
# Estimate historical volatility with estimated coefficients

variance_array = np.zeros(full_data.shape[0],)  # Create array to store variance

# Values to be pre-assigned
variance_array[0] = np.nan  # Historical volatility cannot be estimated for the first one in full_data (06/03/2009)
variance_array[1] = initial_variance 

# Calculate historical volatilities using GARCH
for i in range(2,full_data.shape[0]):
    variance_array[i]=garch_forward(return_array[i-1],variance_array[i-1],optimize_res_trust.x)
    
# Adjust value to compare with VKOSPI (elaborated in 2-3-2)  
historical_volatility = np.sqrt(variance_array * 252) * 100

# Add to the dataset
full_data['Historical Volatility'] = historical_volatility
full_data = full_data.dropna()  # Drop NA (because of nan in historical volatility)

### 2-6. Result

Though the prediction is not finished, it's worth visualizing the result and calculating the error.  

In [None]:
# Before visualizing, few matplotlib settings
mpl.rcParams['axes.labelsize'] = 'x-large'
mpl.rcParams['axes.labelpad'] = 5.5  # space between axis and label
mpl.rcParams['axes.titlesize'] = 'large'
mpl.rcParams['axes.titlepad'] = '20.0'

mpl.rcParams['legend.fontsize'] = 'x-large'

# mpl color settings
default_clrs = plt.rcParams['axes.prop_cycle'].by_key()['color']

# and seaborn settings
sns.set_style('darkgrid')

In [None]:
# Compare with VKOSPI (Visualization)
pd.plotting.register_matplotlib_converters()  # because of a compatibility issue between pd.Timestamp and matplotlib

fig, ax = plt.figure(figsize=(20,7)), plt.axes()
sns.lineplot(data=full_data,x='Date',y='VKOSPI',label='VKOSPI (target)',ax=ax)
sns.lineplot(data=full_data,x='Date',y='Historical Volatility',label='Historical Volatility (prediction)',ax=ax)
plt.ylabel('Volatility')
plt.show()

In [None]:
# Calculate error
abs_garch = abs(full_data['VKOSPI']-full_data['Historical Volatility'])
square_garch = (full_data['VKOSPI']-full_data['Historical Volatility'])**2

print('Total set')
print('MAE:',abs_garch.mean())
print('RMSE:',square_garch.mean()**0.5)

print('\nTest set ([-516:])')
print('MAE:',abs_garch[-516:].mean())
print('RMSE:',square_garch[-516:].mean()**0.5)

## 3. Exploratory Data Analysis
  
In this section, we examine and visualize various market variables to see whether they qualify as input variables of the neural network. 


1. Day of a week 
> `'Day_of_a_week'`
1. Foreign volatility indices  
> `'CBOE'` `'HSI'` `'Nikkei'`
1. Days left until expiration date 
>`'Day_till_expiration'`
1. Other market variables
>`'KOSPI200','Open_interest','For_KOSPI_Netbuying_Amount','For_Future_Netbuying_Quantity',
 'For_Call_Netbuying_Quantity','For_Put_Netbuying_Quantity','Indiv_Future_Netbuying_Quantity',
 'Indiv_Call_Netbuying_Quantity','Indiv_Put_Netbuying_Quantity','PCRatio'`

In [None]:
# Split data into train data(including validation data) and test data
split_ratio = 0.8

data = full_data.set_index('Date')
len_data = full_data.shape[0]

train_org = full_data.iloc[:int(len_data*split_ratio),]
test_org = full_data.iloc[int(len_data*split_ratio):,]

print('train set',train_org.shape)
print('test set',test_org.shape)

### 3-1. Day of a week ('Day_of_a_week')


The correlation seems to be statistically insignificant. 

In [None]:
# 1. Day of a week
# confidence interval = 95%
barplot = sns.barplot(x='Day_of_a_week',y='VKOSPI',data=train_org,ci=95,order=['Mon','Tue','Wed','Thu','Fri']) 
barplot.set_ylim((16,18))
plt.show()
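
One way to back the visual impression with a number is a one-way ANOVA across weekdays. The sketch below is an assumption-laden illustration rather than part of the original analysis: it runs on synthetic data standing in for `train_org`, and it treats observations as independent, which daily volatility data generally is not.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for train_org, for illustration only
rng = np.random.default_rng(0)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
toy = pd.DataFrame({
    'Day_of_a_week': np.repeat(days, 50),
    'VKOSPI': rng.normal(loc=17.0, scale=4.0, size=250),
})

# One sample of VKOSPI values per weekday
groups = [toy.loc[toy['Day_of_a_week'] == d, 'VKOSPI'] for d in days]

# Null hypothesis: all weekdays share the same mean VKOSPI
f_stat, p_value = stats.f_oneway(*groups)
print('F statistic:', f_stat)
print('p-value:', p_value)
```

A large p-value would be consistent with dropping `'Day_of_a_week'` from the inputs. Given the autocorrelation in VKOSPI, the result should be read as a rough indication only.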

### 3-2. Foreign volatility indices ('CBOE', 'HSI',  'Nikkei')
1. ```VKOSPI(KOSPI200,South Korea)``` has the highest correlation with ```VIX(S&P500,US)```, just like the rest of the world.  
(VIX is denoted by `'CBOE'` in the code, after the exchange that calculates and disseminates it.) 
2. ```VHSI(HSI,Hong Kong)``` follows, as the Korean economy is highly dependent on the Chinese economy.  
3. ```JNVI(Nikkei,Japan)``` then follows.  

Considering the high correlations with each other, we're only including `'CBOE'(VIX)`.  



In [None]:
# 2. Foreign volatility indices
corr = train_org.loc[:,['VKOSPI','CBOE','HSI','Nikkei']].corr()

# heatmap
mask = np.zeros_like(corr)
for i in range(mask.shape[0]):
    mask[i,i]=True
sns.heatmap(corr,annot=True,mask=mask,cmap='coolwarm',vmin=-1,vmax=1)
plt.show()

### 3-3. Days left until expiration date ('Day_till_expiration')

We will include `Day_till_expiration`.

In [None]:
# start with simple scatterplot
sns.scatterplot(x='Day_till_expiration',y='VKOSPI',data=train_org)
plt.show()

# too bizarre when VKOSPI is small.
print('Too bizarre when VKOSPI is small.')
print('Instead of plotting every data, plot mean of VKOSPI within each Day_till_expiration.')
      
mean_by_dtexp = train_org.groupby('Day_till_expiration')['VKOSPI'].mean()  # select the column first so non-numeric columns are not averaged
plt.figure(figsize=(15,5))
plt.ylabel('Mean of VKOSPI')
mean_by_dtexp_plot = mean_by_dtexp - 10  # for plotting purpose
sns.barplot(x=mean_by_dtexp.index,y=mean_by_dtexp_plot.values,bottom=10)
plt.show()

# reasonable to suspect that there isn't much data with very high day_till_expiration
print('It is reasonable to suspect that there isn\'t much data when day_till_expiration is high')
len_by_dtexp = train_org['Day_till_expiration'].value_counts()
plt.figure(figsize=(15,5))
plt.xlabel('Day_till_expiration')
plt.ylabel('Frequency of Data')
sns.barplot(x=len_by_dtexp.index,y=len_by_dtexp.values)
plt.show()

print('The amount of data seems to be insufficient when Day_till_expiration is high')

In [None]:
# Exclude the cases with insufficient amount of data

def slice_and_anaylze(slice_at):
    # slice dataframe till the given parameter(slice_at)
    
    print(f'Plot from 0 to {slice_at}.')
    mean_by_dtexp = full_data.groupby('Day_till_expiration')['VKOSPI'].mean()
    mean_by_dtexp = mean_by_dtexp.loc[:slice_at]  # label-based slice, inclusive of slice_at
    mean_by_dtexp = pd.DataFrame({'Day_till_expiration':mean_by_dtexp.index,
                                  'Mean_of_VKOSPI':mean_by_dtexp.values})  # Convert series to dataframe (for sns plot)
    
    # lmplot
    sns.lmplot(x='Day_till_expiration',y='Mean_of_VKOSPI',data=mean_by_dtexp)
    plt.show()

    # correlation test (pearson)
    corr_coef = stats.pearsonr(mean_by_dtexp['Day_till_expiration'],mean_by_dtexp['Mean_of_VKOSPI'])
    print('Pearson correlation test')
    print('='*50)
    print('correlation coefficient:',corr_coef[0])
    print('2-tailed p-value:', corr_coef[1])

It isn't clear how much data is sufficient, so we test a couple of cases.  

1. 0~24  
> This case excludes only 25 and 26, for which there is almost no data at all. 

1. 0~19  
> This case includes only the values with more than 50 observations, which is half of the maximum.

In [None]:
#~24
slice_and_anaylze(24)

In [None]:
#~19
slice_and_anaylze(19)

> It's hard to determine how much data is needed to consider mean of VKOSPI valid.     
  
> However, the first analysis(\~24) definitely shows the correlation between **the days left until expiration date** and **VKOSPI**.   
Even the second analysis(\~19) wasn't so bad. 

Therefore, it's reasonable to think that correlation is statistically significant, or at least **this variable can be fed into a neural network later on**.  

### 3-4. Other market variables
`'KOSPI200','Open_interest','For_KOSPI_Netbuying_Amount','For_Future_Netbuying_Quantity',
 'For_Call_Netbuying_Quantity','For_Put_Netbuying_Quantity','Indiv_Future_Netbuying_Quantity',
 'Indiv_Call_Netbuying_Quantity','Indiv_Put_Netbuying_Quantity','PCRatio'`
 
 
> Only `KOSPI200` seems to have a linear correlation with `VKOSPI`.
  
> Among the variables, only `Indiv_Future_Netbuying_Quantity` and `For_Future_Netbuying_Quantity` seem to be correlated with each other. However, it seems unnecessary to exclude either of them from the potential neural net input variables.

Since visualization doesn't reveal everything, we will not entirely exclude these variables.

In [None]:
# Plot each market variable with VKOSPI

# Columns to plot 
plot_columns = ['KOSPI200','Open_interest','For_KOSPI_Netbuying_Amount','For_Future_Netbuying_Quantity',
                'For_Call_Netbuying_Quantity','For_Put_Netbuying_Quantity','Indiv_Future_Netbuying_Quantity',
                'Indiv_Call_Netbuying_Quantity','Indiv_Put_Netbuying_Quantity','PCRatio']

# plot
fig, axes = plt.subplots(10,2,figsize=(25,75))
for i,col in enumerate(plot_columns):
    # keyword arguments (x=, y=, fill=) as expected by seaborn >= 0.11
    sns.regplot(x=col,y='VKOSPI',data=train_org, ax=axes[i,0]) # regression plot
    sns.kdeplot(x=train_org[col],y=train_org['VKOSPI'],fill=True, ax=axes[i,1]) # kernel density estimate plot
  

In [None]:
# Correlation matrix
corr = train_org[plot_columns].corr()

mask = np.zeros_like(corr)
for i in range(mask.shape[0]):
    mask[i,i]=True
plt.figure(figsize=(12,5))
sns.heatmap(corr,annot=True,mask=mask,cmap='coolwarm',vmin=-1,vmax=1)
plt.show()

In [None]:
# Plot variables with high correlation coefficient

def cor_reg_kde(x,y,data):
    # Correlation coefficient
    print('correlation coefficient:',corr.loc[x,y])
    
    # Plot (regression and kernel density)
    fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
    fig.subplots_adjust(wspace=0.5)  # adjust spacing on the figure just created
    sns.regplot(x=x,y=y,data=data,ax=ax1)
    sns.kdeplot(x=data[x],y=data[y],fill=True,ax=ax2)
    
cor_reg_kde('Indiv_Future_Netbuying_Quantity','For_Future_Netbuying_Quantity',train_org)
cor_reg_kde('Indiv_Put_Netbuying_Quantity','Open_interest',train_org)
cor_reg_kde('Indiv_Put_Netbuying_Quantity','Indiv_Call_Netbuying_Quantity',train_org)
cor_reg_kde('Indiv_Call_Netbuying_Quantity','Open_interest',train_org)

## 4. Data Preprocessing for Neural Net
In this section, we preprocess data before feeding into neural network.

### 4-1. Normalize data

In [None]:
# Normalize data

# Keep the original
train = train_org.copy()
test = test_org.copy()

# Variables that have to be normalized
to_normalize = ['VKOSPI','KOSPI200', 'Open_interest',
                  'For_KOSPI_Netbuying_Amount', 'For_Future_Netbuying_Quantity',
                  'For_Call_Netbuying_Quantity', 'For_Put_Netbuying_Quantity',
                  'Indiv_Future_Netbuying_Quantity', 'Indiv_Call_Netbuying_Quantity',
                  'Indiv_Put_Netbuying_Quantity', 'PCRatio', 'Day_till_expiration',
                  'CBOE','Historical Volatility'
                 ]
# Normalize
mean_train = train[to_normalize].mean()
std_train = train[to_normalize].std()

train[to_normalize] = (train[to_normalize]-mean_train)/std_train
test[to_normalize] = (test[to_normalize]-mean_train)/std_train
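
The test set is scaled with the training mean and standard deviation so that no information from the test period leaks into preprocessing. A minimal sketch of this, with hypothetical values and names, also shows how the transform is inverted to get back to the original units:

```python
import pandas as pd

# Hypothetical values, for illustration only
toy_train = pd.DataFrame({'VKOSPI': [15.0, 17.0, 19.0, 21.0]})
toy_test = pd.DataFrame({'VKOSPI': [18.0, 25.0]})

# Statistics come from the training rows only
mean_toy = toy_train.mean()
std_toy = toy_train.std()

toy_test_scaled = (toy_test - mean_toy) / std_toy

# Inverting the transform recovers the original units
toy_test_restored = toy_test_scaled * std_toy + mean_toy
print(toy_test_scaled)
print(toy_test_restored)
```

The same inversion is what `scale_back` applies to the model predictions in section 6.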

### 4-2. Select input variables

In [None]:
# Select input variables
input_var = ['KOSPI200', 'Open_interest','Day_till_expiration', 'CBOE', 'Historical Volatility',
             'For_KOSPI_Netbuying_Amount','For_Future_Netbuying_Quantity',
             'For_Call_Netbuying_Quantity','For_Put_Netbuying_Quantity',
             'Indiv_Future_Netbuying_Quantity', 'Indiv_Call_Netbuying_Quantity',
            'Indiv_Put_Netbuying_Quantity'
            ] 
print(f'total {len(input_var)} variables')

# Subsample data
x_train_org = train[input_var]
x_test_org = test[input_var]
y_train_org = train['VKOSPI']
y_test_org = test['VKOSPI']

print('x_train_org', x_train_org.shape)
print('y_train_org', y_train_org.shape)
print('x_test_org', x_test_org.shape)
print('y_test_org', y_test_org.shape)

### 4-3. Reshape data
* We set our **time steps to be 4**.   
* This means that we predict the VKOSPI with previous data from **last 5 days**.   
* It's 5, not 4, because we already shifted 1 day before EDA (in 1-3). 
* 5 days usually represent one week. (excluding weekend)
  
Shape of reshaped data is ***(batch_size, time_steps, number_of_features)***.   
  
  
`number of features` will remain the same.    
`time steps` will be set to 4.

In [None]:
TIME_STEPS = 4  #predict with 5 days (+1 because we already shifted 1 day in 1-3)

# function
def reshape(data, time_steps = TIME_STEPS):
    stack = [data.iloc[i:i+time_steps].to_numpy() for i in range(len(data)-time_steps)]
    reshaped = np.stack(stack, axis=0) 
    return reshaped

# reshape
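x_train_val = reshape(x_train_org) # train data and validation data
x_test = reshape(x_test_org)

# Targets: one value per window, taken from the row right after it
# (same alignment as target_vkospi in section 6). Windowing y with reshape()
# would give 4 targets per sample for a model that has a single output.
y_train_val = y_train_org.iloc[TIME_STEPS:].to_numpy()
y_test = y_test_org.iloc[TIME_STEPS:].to_numpy()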

print('x_train + x_val', x_train_val.shape)
print('y_train + y_val', y_train_val.shape)
print('x_test', x_test.shape)
print('y_test', y_test.shape)
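
To see what the windowing produces, here is a toy sketch with 10 rows and 2 features. The column names and values are hypothetical. Targets are paired with windows the way section 6 pairs predictions with `target_vkospi`, that is, with the row right after each window.

```python
import numpy as np
import pandas as pd

# Toy frame: 10 rows, 2 features (hypothetical names, for illustration only)
toy_x = pd.DataFrame({'f1': np.arange(10.0), 'f2': np.arange(10.0) * 10})
toy_y = pd.Series(np.arange(10.0) + 100, name='target')

time_steps = 4

# Same windowing idea as reshape(): one window per start index
windows = np.stack(
    [toy_x.iloc[i:i + time_steps].to_numpy() for i in range(len(toy_x) - time_steps)],
    axis=0,
)

# One target per window: the row right after the window ends
targets = toy_y.iloc[time_steps:].to_numpy()

print(windows.shape)   # (6, 4, 2)
print(targets.shape)   # (6,)
print(windows[0][:, 0], '->', targets[0])
```

With 10 rows and 4 time steps there are 6 windows, so 4 rows at the start of each split have no prediction.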

### 4-4. Separate validation data from train data

In [None]:
# Separate validation data from train data
i = int(len(x_train_val) * 0.8)

x_train = x_train_val[:i]
x_val = x_train_val[i:]
y_train = y_train_val[:i]
y_val = y_train_val[i:]

print('train data:', x_train.shape[0])
print('validation data:', x_val.shape[0])
print('test data:', x_test.shape[0])

## 5. Neural Network
In this section, we build a neural network which includes historical volatility as an input variable.  
Other option-related variables are also included in order to predict implied volatility.
  
  
We will use three different neural networks.
1. Vanilla NN
2. LSTM
3. GRU

### 5-1. Hyperparameter tuning
Bayesian optimization will be applied using `kerastuner`.

In [None]:
from kerastuner import BayesianOptimization

# Each builder function below builds and compiles a neural network with the given hyperparameters

# vanilla nn
def nn_builder(hp):
    # hp
    UNITS = hp.Int('UNITS', min_value = 16, max_value = 128, step = 16)
    # distinct names, so the two activations are tuned as separate hyperparameters
    ACTIVATION_1 = hp.Choice('ACTIVATION_1',values = ['relu','linear','tanh'])
    ACTIVATION_2 = hp.Choice('ACTIVATION_2',values = ['relu','linear','tanh'])
    
    # model instance
    model = keras.Sequential([
        keras.layers.Flatten(),
        keras.layers.Dense(units = UNITS, activation = ACTIVATION_1),
        keras.layers.Dropout(rate = 0.2),
        keras.layers.Dense(1, activation = ACTIVATION_2)
    ])
    
    # compile
    model.compile(optimizer = keras.optimizers.Adam(),
                  loss = 'mse', metrics = ['mae'])    
    
    return model

# lstm
def lstm_builder(hp):
    # hp
    UNITS_1 = hp.Int('UNITS_1', min_value = 16, max_value = 128, step = 16)
    UNITS_2 = hp.Int('UNITS_2', min_value = 8, max_value = 32, step = 8)
    ACTIVATION = hp.Choice('ACTIVATION',values = ['relu','linear','tanh'])
    
    # model instance
    model = keras.Sequential([
        # input_shape = (time_steps, # of features)
        keras.layers.LSTM(units = UNITS_1, input_shape = (None, x_train.shape[2]), return_sequences = False), 
        keras.layers.Dense(UNITS_2, activation = ACTIVATION),
        keras.layers.Dropout(rate = 0.2),
        keras.layers.Dense(1, activation = 'linear')
    ])
    
    # compile
    model.compile(optimizer = keras.optimizers.Adam(),
                  loss = 'mse', metrics = ['mae'])
    
    return model

# gru
def gru_builder(hp):
    # hp
    UNITS_1 = hp.Int('UNITS_1', min_value = 16, max_value = 128, step = 16)
    UNITS_2 = hp.Int('UNITS_2', min_value = 8, max_value = 32, step = 8)
    ACTIVATION = hp.Choice('ACTIVATION',values = ['relu','softmax','linear'])
    
    # model instance
    model = keras.Sequential([
        # input_shape = (time_steps, # of features)
        keras.layers.GRU(units = UNITS_1, input_shape = (None, x_train.shape[2]), return_sequences = False),
        keras.layers.Dense(UNITS_2, activation = ACTIVATION),
        keras.layers.Dropout(rate = 0.2),
        keras.layers.Dense(1, activation = 'linear')
    ])
    
    # compile
    model.compile(optimizer = keras.optimizers.Adam(),
                  loss = 'mse', metrics = ['mae'])
    
    return model

In [None]:
# Implement earlystopping with keras callback 

PATIENCE = 5 # number of epochs with no improvement after which training will be stopped.

Earlystopping = keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta = 0.001,
                                              patience=PATIENCE, 
                                              mode='min', 
                                              restore_best_weights=True)

In [None]:
# Search hyperparameters
SEED = 121

# NN
tuner_nn = BayesianOptimization(nn_builder,
                                objective = 'val_loss',
                                max_trials = 20,
                                seed = SEED,
                                directory = 'kerastuner',
                                project_name = 'nn',  # separate sub-directory per model
                                overwrite = True
                                )

tuner_nn.search(x_train, y_train, epochs=50, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

## Build model based on the optimized hyperparameters
besthp_nn = tuner_nn.get_best_hyperparameters()[0]
model_nn = tuner_nn.hypermodel.build(besthp_nn)


# lstm
tuner_lstm = BayesianOptimization(lstm_builder,
                            objective = 'val_loss',
                            max_trials = 20,
                            seed = SEED,
                            directory = 'kerastuner',
                            project_name = 'lstm',  # avoid reloading trials saved by another tuner
                            overwrite = True)

tuner_lstm.search(x_train, y_train, epochs=50, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

## Build model based on the optimized hyperparameters
besthp_lstm = tuner_lstm.get_best_hyperparameters()[0]
model_lstm = tuner_lstm.hypermodel.build(besthp_lstm)


# gru
tuner_gru = BayesianOptimization(gru_builder,
                            objective = 'val_loss',
                            max_trials = 20,
                            seed = SEED,
                            directory = 'kerastuner',
                            project_name = 'gru',  # avoid reloading trials saved by another tuner
                            overwrite = True)

tuner_gru.search(x_train, y_train, epochs=50, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

## Build model based on the optimized hyperparameters
besthp_gru = tuner_gru.get_best_hyperparameters()[0]
model_gru = tuner_gru.hypermodel.build(besthp_gru)




In [None]:
besthp_nn

### 5-2. Train model

In [None]:
# Colors designated to each model
palette = default_clrs[:4]
palette.append(default_clrs[6])

**PALETTE**
* 0: target data (VKOSPI)
* 1: GARCH (benchmark)
* 2: GARCH-NN
* 3: GARCH-LSTM
* 4: GARCH-GRU

In [None]:
# Ensure reproducible result
print(SEED)
tf.random.set_seed(SEED)

In [None]:
# Train model - NN
history = model_nn.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

# Plot loss
plt.plot(history.history['loss'], label = 'loss')
plt.plot(history.history['val_loss'], label = 'val_loss', color=palette[2])
plt.legend()

# Loss
print('final loss:',history.history['loss'][-1])
print('final val_loss:',history.history['val_loss'][-1])

In [None]:
model_nn.summary()

In [None]:
# Train model - LSTM
history = model_lstm.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

# Plot loss
plt.plot(history.history['loss'], label = 'loss')
plt.plot(history.history['val_loss'], label = 'val_loss',color=palette[3])
plt.legend()

# Loss
print('final loss:',history.history['loss'][-1])
print('final val_loss:',history.history['val_loss'][-1])

In [None]:
model_lstm.summary()

In [None]:
# Train model - GRU
history = model_gru.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val), verbose=0, callbacks=[Earlystopping])

# Plot loss
plt.plot(history.history['loss'], label = 'loss')
plt.plot(history.history['val_loss'], label = 'val_loss',color=palette[4])
plt.legend()

# Loss
print('final loss:',history.history['loss'][-1])
print('final val_loss:',history.history['val_loss'][-1])

In [None]:
model_gru.summary()

## 6. Result

In [None]:
# Data for comparison
target_date = test_org['Date'].iloc[TIME_STEPS:] 
target_vkospi = test_org['VKOSPI'].iloc[TIME_STEPS:]
predictions_hv = test_org['Historical Volatility'].iloc[TIME_STEPS:]

# Function to scale back predictions
def scale_back(predictions):
    predictions = [prediction[0] for prediction in predictions] # unpack array

    # scale back
    mean_vol = train_org['VKOSPI'].mean()
    std_vol = train_org['VKOSPI'].std() 
    predictions = [prediction * std_vol + mean_vol for prediction in predictions]
    
    return predictions

# Predict
predictions_nn = scale_back(model_nn.predict(x_test))
predictions_lstm = scale_back(model_lstm.predict(x_test))
predictions_gru = scale_back(model_gru.predict(x_test))

# Convert predictions to pd.Series aligned with the target index
predictions_nn = pd.Series(predictions_nn, index=target_vkospi.index, name='predictions_nn')
predictions_lstm = pd.Series(predictions_lstm, index=target_vkospi.index, name='predictions_lstm')
predictions_gru = pd.Series(predictions_gru, index=target_vkospi.index, name='predictions_gru')
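
`scale_back` inverts a z-score transform using the training mean and standard deviation. The sketch below shows the round trip on toy numbers, assuming the target was standardized with those same training statistics earlier in the notebook. The series values and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy training series standing in for train_org['VKOSPI'] (illustrative values)
train_vol = pd.Series([12.0, 15.0, 18.0, 21.0, 14.0])
mean_vol, std_vol = train_vol.mean(), train_vol.std()

# Standardize, then invert with the same training statistics
scaled = (train_vol - mean_vol) / std_vol
restored = scaled * std_vol + mean_vol

# Keras predictions have shape (n_samples, 1); ravel() flattens them
# before rescaling, which replaces the list comprehension unpacking
fake_predictions = scaled.to_numpy().reshape(-1, 1)
restored_from_model = fake_predictions.ravel() * std_vol + mean_vol

print(np.allclose(restored, train_vol))
print(np.allclose(restored_from_model, train_vol))
```

Using the training statistics (rather than the test set's) for the inverse keeps the test data from leaking into the scaling.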

In [None]:
# Plot
fig1, ax1 = plt.subplots(figsize=(20,7))
fig2, ax2 = plt.subplots(figsize=(20,7))
fig3, ax3 = plt.subplots(figsize=(20,7))
fig4, ax4 = plt.subplots(figsize=(20,7))

# target data
ax1.plot(target_date, target_vkospi, label='VKOSPI (target data)')
ax2.plot(target_date, target_vkospi, label='VKOSPI (target data)')
ax3.plot(target_date, target_vkospi, label='VKOSPI (target data)')
ax4.plot(target_date, target_vkospi, label='VKOSPI (target data)')

# predictions
ax1.plot(target_date, predictions_hv, label='GARCH',color=palette[1])
ax2.plot(target_date, predictions_nn, label='GARCH-NN',color=palette[2])
ax3.plot(target_date, predictions_lstm, label='GARCH-LSTM',color=palette[3])
ax4.plot(target_date, predictions_gru, label='GARCH-GRU',color=palette[4])

# show
ax1.legend()
ax2.legend()
ax3.legend()
ax4.legend()
plt.show()

### MSE and MAE
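
For reference, with $y_t$ the observed VKOSPI on test day $t$, $\hat{y}_t$ a model's prediction, and $n$ the number of test days (symbols introduced here for illustration), the two error measures computed in the next cell are:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} \left(\hat{y}_t - y_t\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left|\hat{y}_t - y_t\right|
$$

MSE penalizes large misses more heavily, while MAE is expressed in the same units as VKOSPI.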

In [None]:
# Calculate error
def cal_MSE(predictions):
    errors = (predictions - target_vkospi)**2
    return errors.mean()

def cal_MAE(predictions):
    errors = abs(predictions - target_vkospi)
    return errors.mean()


# Print error
print('GARCH')
print('MSE:',cal_MSE(predictions_hv))
print('MAE:',cal_MAE(predictions_hv))

print('GARCH-NN')
print('MSE:',cal_MSE(predictions_nn))
print('MAE:',cal_MAE(predictions_nn))

print('GARCH-LSTM')
print('MSE:',cal_MSE(predictions_lstm))
print('MAE:',cal_MAE(predictions_lstm))

print('GARCH-GRU')
print('MSE:',cal_MSE(predictions_gru))
print('MAE:',cal_MAE(predictions_gru))

### Number of best predictions

In [None]:
# Compare day by day

# error list
errors_hv = abs(target_vkospi - predictions_hv)
errors_nn = abs(target_vkospi - predictions_nn)
errors_lstm = abs(target_vkospi - predictions_lstm)
errors_gru = abs(target_vkospi - predictions_gru)

# count the best models
best_models = []
for i in range(len(errors_hv)):
    each_predictions = [errors_hv.iloc[i], errors_nn.iloc[i], errors_lstm.iloc[i], errors_gru.iloc[i]]
    best_models.append(each_predictions.index(min(each_predictions)))

best_counts = [best_models.count(i) for i in range(4)]

    
# plot
fig, ax = plt.figure(), plt.axes()
ax.pie(best_counts, labels=['GARCH','GARCH-NN','GARCH-LSTM','GARCH-GRU'], autopct='%.0f%%', startangle=90, colors=palette[1:])
plt.title('Share of days on which each model had the smallest error')
plt.show()
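
The day-by-day loop above can also be written in vectorized form. This is a sketch on toy numbers, not the notebook's data. The `errors` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy absolute errors per day for four models (illustrative values only)
errors = pd.DataFrame({
    'GARCH':      [3.0, 2.0, 4.0, 1.0],
    'GARCH-NN':   [1.0, 2.5, 0.5, 2.0],
    'GARCH-LSTM': [2.0, 1.0, 1.5, 3.0],
    'GARCH-GRU':  [2.5, 3.0, 2.0, 1.5],
})

# Column position of the smallest error in each row
# (ties go to the first column, like list.index(min(...)))
best_idx = errors.to_numpy().argmin(axis=1)

# Count wins per model; minlength keeps models with zero wins
best_counts = np.bincount(best_idx, minlength=errors.shape[1])

print(dict(zip(errors.columns, best_counts.tolist())))
```

`np.bincount` with `minlength` returns a count for every model even if one never wins, so the result can be passed to `ax.pie` with the same four labels.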

### Number of days with the correct predicted direction (up or down)

In [None]:
# direction arrays
directions_vkospi = target_vkospi - target_vkospi.shift()
directions_hv = predictions_hv - predictions_hv.shift()
directions_nn = predictions_nn - predictions_nn.shift()
directions_lstm = predictions_lstm - predictions_lstm.shift()
directions_gru = predictions_gru - predictions_gru.shift()

# multiply element-wise and count the positive elements
def count_pos(series):
    count = 0
    for x in series:
        if x>0:
            count += 1
    return count

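# first entry: number of days with a valid day-over-day change (the first diff is NaN)
direction_counts = [int(directions_vkospi.notna().sum()),0,0,0,0]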
direction_counts[1] = count_pos(directions_hv * directions_vkospi)
direction_counts[2] = count_pos(directions_nn * directions_vkospi)
direction_counts[3] = count_pos(directions_lstm * directions_vkospi)
direction_counts[4] = count_pos(directions_gru * directions_vkospi)

# plot
fig, ax = plt.figure(), plt.axes()
ax.bar(['Total Days','GARCH','GARCH-NN','GARCH-LSTM','GARCH-GRU'],direction_counts, color=palette)
plt.title('Number of days on which each model predicted the right direction (up or down)')
plt.show()

# direction prediction accuracy
print('GARCH')
print(direction_counts[1] / direction_counts[0]*100,'%')
print('GARCH-NN')
print(direction_counts[2] / direction_counts[0]*100,'%')
print('GARCH-LSTM')
print(direction_counts[3] / direction_counts[0]*100,'%')
print('GARCH-GRU')
print(direction_counts[4] / direction_counts[0]*100,'%')
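
The direction count can be sketched compactly with `diff` and `np.sign`. The series below are toy values, not the notebook's data. A day counts as a hit only when the actual and predicted changes share the same non-zero sign, which matches the `product > 0` rule used above.

```python
import numpy as np
import pandas as pd

# Toy series (illustrative values only)
actual = pd.Series([20.0, 21.0, 19.5, 19.5, 22.0])
predicted = pd.Series([20.5, 20.8, 20.0, 19.0, 21.0])

# Day-over-day changes; the first element is NaN and is dropped
actual_change = actual.diff().dropna()
predicted_change = predicted.diff().dropna()

# A hit is a day where both changes have the same non-zero sign;
# flat days (zero change) count as misses
hits = (np.sign(actual_change) * np.sign(predicted_change)) > 0
n_days = len(actual_change)

print(int(hits.sum()), n_days)
print(hits.mean() * 100, '%')
```

Dividing by the number of valid day-over-day changes, rather than a fixed total, keeps the accuracy correct if the test window changes length.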