# **Optiver Realized Volatility Prediction**

This notebook reproduces the naive solution from the [notebook](https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data) provided by the contest organizers, but changes the way the WAP (weighted average price) is calculated.

I also used the notebook shared by Slawek Biel as a template to organize the information.

In [None]:
import numpy as np 
import pandas as pd
import glob
import warnings
from sklearn.metrics import r2_score
warnings.filterwarnings('ignore')

### **Useful functions**

`log_return` and `realized_volatility` are the same functions that [Jiashen Liu](https://www.kaggle.com/jiashenliu) from Optiver provides.
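
For reference, the three quantities used in this notebook can be written as below. The symbols are notation picked here for illustration, not taken from the organizers' notebook: $S_t$ is the price (here the WAP) at row $t$ within one `time_id`, $y_i$ is the target, and $\hat{y}_i$ is the prediction for row $i$ out of $n$.

```latex
r_t = \log\left(\frac{S_t}{S_{t-1}}\right), \qquad
\sigma = \sqrt{\sum_{t} r_t^{2}}, \qquad
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^{2}}
```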



In [None]:
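# log return: difference of consecutive log prices (the first element is NaN)
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()

# realized volatility: square root of the sum of squared log returns
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

# rmspe: root mean square percentage error, the competition metric
def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))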
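
To see what these helpers return, here is a small self-contained sketch that repeats the three definitions and applies them to made-up numbers. The prices and the `y_true` / `y_pred` arrays are hypothetical values chosen only for illustration.

```python
import numpy as np
import pandas as pd

def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()

def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Hypothetical price path inside a single time_id
prices = pd.Series([1.000, 1.001, 0.999, 1.002])

returns = log_return(prices)                  # first element is NaN (no previous price)
vol = realized_volatility(returns.dropna())   # one number for the whole path

# Hypothetical targets and predictions
y_true = np.array([0.004, 0.002])
y_pred = np.array([0.003, 0.002])
score = rmspe(y_true, y_pred)

print(returns.tolist(), vol, score)
```

Because the error is divided by `y_true`, a miss on a small target costs more than the same absolute miss on a large one.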

#### **Load the train set**

Note that I create a new column called `row_id`. It will be used to merge this table with the predictions.

In [None]:
df_train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
df_train['row_id'] = df_train['stock_id'].astype(str) + '-' + df_train['time_id'].astype(str)
df_train = df_train[['row_id','target']]
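
As a quick illustration of the key format, the sketch below builds `row_id` on a tiny hand-made frame. The rows are hypothetical; only the column names match `train.csv`.

```python
import pandas as pd

# Hypothetical rows with the same columns as train.csv
df = pd.DataFrame({
    'stock_id': [0, 0, 1],
    'time_id': [5, 11, 5],
    'target': [0.004, 0.001, 0.002],
})

# Same construction as above: "<stock_id>-<time_id>"
df['row_id'] = df['stock_id'].astype(str) + '-' + df['time_id'].astype(str)
row_ids = df['row_id'].tolist()
print(row_ids)
```

Each (stock, time bucket) pair gets its own string, so the column can serve as a join key.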

In the next cell I calculate the WAP for each stock at each `time_id`, blending the two order book levels with an 80-20 weighting between the best (level 1) and the second-best (level 2) prices.
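
A worked example may help make the weighting concrete. The sketch below uses a hypothetical one-row order book (the numbers are invented) and a small helper named `wap`, which is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical one-row order book snapshot; values are illustrative only
df_book = pd.DataFrame({
    'bid_price1': [0.999], 'ask_price1': [1.001],
    'bid_size1': [100], 'ask_size1': [300],
    'bid_price2': [0.998], 'ask_price2': [1.002],
    'bid_size2': [200], 'ask_size2': [200],
})

def wap(df, level):
    """WAP of one book level: each price is weighted by the size on the opposite side."""
    bid_p, ask_p = df[f'bid_price{level}'], df[f'ask_price{level}']
    bid_s, ask_s = df[f'bid_size{level}'], df[f'ask_size{level}']
    return (bid_p * ask_s + ask_p * bid_s) / (bid_s + ask_s)

wap1 = wap(df_book, 1)            # (0.999*300 + 1.001*100) / 400
wap2 = wap(df_book, 2)            # (0.998*200 + 1.002*200) / 400
blended = 0.8 * wap1 + 0.2 * wap2

print(wap1.iloc[0], wap2.iloc[0], blended.iloc[0])
```

In this example the level-1 ask size is larger than the bid size, so `wap1` sits below the mid price of 1.000, and the blend pulls it slightly back toward the level-2 value.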

In [None]:
def predictions(list_order_book_file):
    list_pred = []

    for file in list_order_book_file:
        stock_id = file.split('=')[1]

        df_book = pd.read_parquet(file)


        # Both price levels matter, the best and the second best, so I weight the level-1 WAP at 0.8 and the level-2 WAP at 0.2 (a judgment call)
        df_book['wap1'] = (df_book['bid_price1'] * df_book['ask_size1']\
                              + df_book['ask_price1'] * df_book['bid_size1'])\
                                /(df_book['bid_size1']+ df_book['ask_size1'])

        df_book['wap2'] = (df_book['bid_price2'] * df_book['ask_size2']\
                              + df_book['ask_price2'] * df_book['bid_size2'])\
                                /(df_book['bid_size2']+ df_book['ask_size2'])

        df_book['wap']= df_book['wap1']*0.8 + df_book['wap2']*0.2

        # unique time_id values of this stock, used to loop over each time bucket
        list_times = list(pd.unique(df_book['time_id']))


        for t in list_times:
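            # copy the slice so that adding a column does not write into a view of df_book
            df_pre = df_book[df_book['time_id'] == t].copy()

            df_pre['log_return'] = log_return(df_pre['wap'])
            # the first row of each time_id has no previous price, so its log return is NaN
            df_pre = df_pre[~df_pre['log_return'].isnull()]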

            realized_vol = realized_volatility(df_pre['log_return'])

            list_pred.append({
                'row_id': str(stock_id)+'-'+str(t),
                'pred':realized_vol
            })
            
    return pd.DataFrame(list_pred)
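
The per-`time_id` loop above can be slow on large books. One possible alternative, shown here only as a sketch, is to let pandas group by `time_id`. The function name `realized_vol_by_time_id` and the toy data are assumptions made for this example; it expects a frame that already has a `wap` column.

```python
import numpy as np
import pandas as pd

def realized_vol_by_time_id(df_book, stock_id):
    """Realized volatility per time_id without an explicit Python loop."""
    log_wap = np.log(df_book['wap'])
    # diff within each time_id, so returns never span two time buckets
    log_ret = log_wap.groupby(df_book['time_id']).diff()
    # groupby sum skips the NaN that starts each bucket
    vol = (log_ret ** 2).groupby(df_book['time_id']).sum().pow(0.5)
    return pd.DataFrame({
        'row_id': [f'{stock_id}-{t}' for t in vol.index],
        'pred': vol.to_numpy(),
    })

# Hypothetical book with two time buckets
toy_book = pd.DataFrame({
    'time_id': [5, 5, 5, 11, 11],
    'wap': [1.000, 1.001, 0.999, 2.000, 2.002],
})
df_out = realized_vol_by_time_id(toy_book, stock_id=0)
print(df_out)
```

One difference to keep in mind: `groupby` returns the buckets sorted by `time_id`, whereas the loop follows the order in which they appear in the file. Since results are later merged on `row_id`, the row order should not matter.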

Show the results

In [None]:
list_order_book_file_train = glob.glob('../input/optiver-realized-volatility-prediction/book_train.parquet/*')
df_pred_train = predictions(list_order_book_file_train)
df_pred_train.head(5)

Join both dataframes, the predictions and the train set, using `row_id` as the key.

In [None]:
df_joined = df_train.merge(df_pred_train[['row_id','pred']], on = ['row_id'], how = 'left')\
.dropna().reset_index(drop=True)
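
The effect of the left join followed by `dropna()` can be seen on a toy example. The two frames below are hypothetical and only mimic the column layout used in this notebook.

```python
import pandas as pd

# Hypothetical targets and predictions; '0-11' has no prediction
toy_train = pd.DataFrame({'row_id': ['0-5', '0-11', '1-5'],
                          'target': [0.004, 0.001, 0.002]})
toy_pred = pd.DataFrame({'row_id': ['0-5', '1-5'],
                         'pred': [0.003, 0.002]})

# The left join keeps every train row; rows without a prediction get NaN in 'pred'
merged = toy_train.merge(toy_pred, on='row_id', how='left')

# dropna removes those unmatched rows, reset_index renumbers what is left
toy_joined = merged.dropna().reset_index(drop=True)
print(toy_joined)
```

Rows dropped this way are silently excluded from the score, so it can be worth comparing `len(merged)` with `len(toy_joined)` before evaluating.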

In [None]:
R2 = round(r2_score(y_true = df_joined['target'], y_pred = df_joined['pred']),3)
RMSPE = round(rmspe(y_true = df_joined['target'], y_pred = df_joined['pred']),3)
print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}')

### Submission
As a last step, we make a submission the same way the tutorial notebook does: by writing a file to the output folder.
The naive submission scored an RMSPE of 0.308 on the public LB, so there is certainly a lot of room for improvement!

In [None]:
list_order_book_file_test = glob.glob('../input/optiver-realized-volatility-prediction/book_test.parquet/*')

In [None]:
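df_pred_test = predictions(list_order_book_file_test)
# name the prediction column 'target', the header used by the competition's sample submission
df_pred_test.rename(columns={'pred': 'target'}).to_csv('submission.csv', index=False)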