A simplified reproduction of the [tutorial notebook](https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data) provided by the competition organizers. By "simplified" I mean that it relies as much as possible on very familiar tools. I hope it eases the onboarding of newcomers to this competition. Heavily inspired by this [notebook](https://www.kaggle.com/lucasmorin/realised-vol-weighted-regression-baseline?select=sample_submission.csv).

### Importing all necessary libraries

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score
import glob

In [None]:
# For quickly switching between training and test data
def train_test(mode):
    # mode = "train"/"test"
    file_name = '../input/optiver-realized-volatility-prediction/' + mode + '.csv'
    return pd.read_csv(file_name)



In [None]:
train = train_test("train")
train.head()

The *order_book* data are partitioned by *stock_id*. The following command lists all the parquet file names, which will let us iterate over all stocks in a later section.

In [None]:
order_book_training = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/*')

Each member of *order_book_training* corresponds to a single stock. Each stock contains several *time_id* values. The goal is to predict the volatility for each (*stock_id, time_id*) tuple.

Here we rely on the fact that the pandas groupby operation preserves the order of the rows within each group. We write a custom aggregate function to calculate WAP -> tick-to-tick returns -> realized volatility.

**Warning**: Custom aggregate functions are not Cythonized, so they are slow. Essentially, custom aggregate functions are syntactic sugar that buys some readability.
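The chain WAP -> returns -> realized volatility can be checked by hand on a tiny table. The sketch below is illustrative only: `toy_book` and its numbers are made up, and it uses built-in groupby methods (`diff`, `sum`) instead of a custom aggregate, which is one possible way to stay off the slow path mentioned in the warning.

```python
import numpy as np
import pandas as pd

# Hypothetical order book with two time buckets; the numbers are made up.
toy_book = pd.DataFrame({
    "time_id":    [5, 5, 5, 11, 11, 11],
    "bid_price1": [0.999, 1.000, 1.001, 1.000, 1.000, 1.000],
    "ask_price1": [1.001, 1.002, 1.003, 1.002, 1.002, 1.002],
    "bid_size1":  [100, 100, 100, 50, 50, 50],
    "ask_size1":  [100, 100, 100, 50, 50, 50],
})

# WAP: each price is weighted by the size on the opposite side of the book.
toy_book["WAP"] = (
    toy_book["bid_price1"] * toy_book["ask_size1"]
    + toy_book["ask_price1"] * toy_book["bid_size1"]
) / (toy_book["bid_size1"] + toy_book["ask_size1"])

# Log returns between consecutive rows, computed inside each time_id
# so that no return spans two different buckets.
toy_book["log_return"] = np.log(toy_book["WAP"]).groupby(toy_book["time_id"]).diff()

# Realized volatility: square root of the sum of squared log returns.
toy_vol = (toy_book["log_return"] ** 2).groupby(toy_book["time_id"]).sum().pow(0.5)
print(toy_vol)
```

In this toy table the WAP never moves inside `time_id` 11, so its realized volatility is zero, while `time_id` 5 has two small non-zero returns.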

In [None]:
# custom aggregate function
def wap2vol(df):
    # wap2vol stands for WAP to Realized Volatility
    temp = np.log(df).diff() # calculating tick-to-tick log returns
    # returning realized volatility
    return np.sqrt(np.sum(temp**2)) 

In [None]:
# function for calculating realized volatility per time id for a given stock
def rel_vol_time_id(path):
    # book: book is an order book
    book = pd.read_parquet(path) # order book for a stock id loaded
    # calculating WAP
    p1 = book["bid_price1"]
    p2 = book["ask_price1"]
    s1 = book["bid_size1"]
    s2 = book["ask_size1"]
    
    book["WAP"] = (p1*s2 + p2*s1) / (s1 + s2)
    # calculating realized volatility for each time_id
    transbook = book.groupby("time_id")["WAP"].agg(wap2vol)
    return transbook

Now we iterate over all order books and compute the realized volatility for each (*stock_id, time_id*) tuple.

Instead of concatenating the dataframes corresponding to each stock, I take a unified approach: collect every *stock_id*, *time_id* and realized volatility in plain lists, then convert these lists to a dataframe at the end. This approach is recommended in this [Stack Overflow discussion](https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-then-filling-it).

In [None]:
%%time 
stock_id = []
time_id = []
relvol = []
for i in order_book_training:
    # finding the stock_id
    temp_stock = int(i.split("=")[1])
    # find the realized volatility for all time_id of temp_stock
    temp_relvol = rel_vol_time_id(i)
    stock_id += [temp_stock]*temp_relvol.shape[0]
    time_id += list(temp_relvol.index)
    relvol += list(temp_relvol)

Now we create the dataframe containing realized volatilities for all _(stock_id, time_id)_ tuples.

In [None]:
past_volatility = pd.DataFrame({"stock_id": stock_id, "time_id": time_id, "volatility": relvol})

Now we join *past_volatility* with *train* to calculate the error metrics, mainly as a sanity check that this is a correct reproduction. Here we naively assume that **past level of volatility = future level of volatility**.
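A left join keeps every row of the left table, so any (*stock_id, time_id*) pair without an order-book match ends up with a missing volatility that would propagate into the metrics. A minimal sketch of how one might check for this, using two made-up frames (`toy_train` and `toy_past` are hypothetical) and the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical stand-ins for the training targets and the past volatilities.
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 1],
    "time_id":  [5, 11, 5],
    "target":   [0.004, 0.001, 0.002],
})
toy_past = pd.DataFrame({
    "stock_id":   [0, 0],
    "time_id":    [5, 11],
    "volatility": [0.0045, 0.0012],
})

# validate raises if a key pair is duplicated on either side;
# indicator adds a "_merge" column saying where each row was found.
toy_joined = toy_train.merge(
    toy_past,
    on=["stock_id", "time_id"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows present in the targets but absent from the past volatilities.
toy_missing = toy_joined[toy_joined["_merge"] == "left_only"]
print(toy_missing[["stock_id", "time_id"]])
```

Here stock 1 has no past volatility, so it shows up in `toy_missing` with a `NaN` in the `volatility` column.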

In [None]:
joined = train.merge(past_volatility, on = ["stock_id","time_id"], how = "left")
R2 = round(r2_score(y_true = joined['target'], y_pred = joined['volatility']),3)
print(f'The R2 score of the naive prediction for training set is {R2}')

In [None]:
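def rmspe(y_true, y_pred):
    # root mean square percentage error
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# stored under a different name so that the rmspe function is not overwritten
rmspe_score = rmspe(joined["target"], joined["volatility"])
print(f'The RMSPE score of the naive prediction for the training set is {rmspe_score}')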

Let's train a simple OLS model for each stock_id. Use *degree* to specify the degree of the polynomial features fed to the linear model.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# for training
def linear_training(X,y,degree):
    # instantiating polynomial features
    polyfeat = PolynomialFeatures(degree = degree)
    linreg = LinearRegression()
    # preprocessing the training data
    x = np.array(X).reshape(-1,1)
    # creating the polynomial features
    X_ = polyfeat.fit_transform(x)
    # training the model
    weights = 1/np.square(y)
    return linreg.fit(X_, np.array(y).reshape(-1,1), sample_weight = weights)


stock_id_train = train.stock_id.unique() # all stock_id for the train set
models = {} # dictionary for holding trained models for each stock_id
degree = 2
for i in stock_id_train:
    temp = joined[joined["stock_id"]==i]
    X = temp["volatility"]
    y = temp["target"]
    models[i] = linear_training(X,y,degree)
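
The weights `1/np.square(y)` turn the ordinary squared error into a squared percentage error, since `w * (y - pred)**2` equals `((y - pred) / y)**2` when `w = 1 / y**2`. That is the quantity RMSPE is built from. The sketch below illustrates this on synthetic data: `toy_x`, `toy_y` and the helper `pct_rmse` are made up for the illustration, and the scikit-learn `Pipeline` is an alternative packaging rather than the method used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def pct_rmse(y_true, y_pred):
    # Root mean square percentage error (hypothetical helper name).
    return float(np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))


# Synthetic data: the target is a noisy, strictly positive multiple of the input.
rng = np.random.default_rng(0)
toy_x = rng.uniform(0.001, 0.01, size=200).reshape(-1, 1)
toy_y = toy_x.ravel() * np.exp(0.3 * rng.standard_normal(200))

toy_degree = 2

# Plain least squares on polynomial features.
unweighted = make_pipeline(
    PolynomialFeatures(degree=toy_degree, include_bias=False),
    LinearRegression(),
).fit(toy_x, toy_y)

# Same model, but each sample is weighted by 1 / y**2.
# make_pipeline names the step "linearregression", hence the prefix.
weighted = make_pipeline(
    PolynomialFeatures(degree=toy_degree, include_bias=False),
    LinearRegression(),
).fit(toy_x, toy_y, linearregression__sample_weight=1.0 / np.square(toy_y))

rmspe_unweighted = pct_rmse(toy_y, unweighted.predict(toy_x))
rmspe_weighted = pct_rmse(toy_y, weighted.predict(toy_x))
print(rmspe_unweighted, rmspe_weighted)
```

On the data it was fitted on, the weighted model cannot have a higher RMSPE than the unweighted one, because it minimizes exactly that quantity over the same family of polynomials. A pipeline also keeps the polynomial expansion attached to the model, so it does not have to be rebuilt at prediction time.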
    

In [None]:
models

Let's make predictions on the test set and create a submission file.

In [None]:
# listing all test order books
order_book_test = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_test.parquet/*')

We start by calculating the past volatility of the test set.

In [None]:
%%time 
stock_id = []
time_id = []
relvol = []
for i in order_book_test:
    # finding the stock_id
    temp_stock = int(i.split("=")[1])
    # find the realized volatility for all time_id of temp_stock
    temp_relvol = rel_vol_time_id(i)
    stock_id += [temp_stock]*temp_relvol.shape[0]
    time_id += list(temp_relvol.index)
    relvol += list(temp_relvol)
    
past_test_volatility = pd.DataFrame({"stock_id": stock_id, "time_id": time_id, "volatility": relvol})

We will use *linear_inference* for prediction.

In [None]:
# for inference
def linear_inference(models, stock_id, past_volatility, degree):
    model = models[stock_id]
    polyfeat = PolynomialFeatures(degree = degree)
    return model.predict(polyfeat.fit_transform([[past_volatility]]))[0][0]
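
`linear_inference` scores one (*stock_id*, volatility) pair per call. If that turns out to be slow on a larger test set, one alternative (not the method used in this notebook) is to predict once per stock on all of its rows. A minimal sketch under made-up assumptions: `toy_models`, `toy_test` and `predict_per_stock` are hypothetical names, and the toy models are fitted on exact straight lines so that the result is easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

toy_degree = 2

# Hypothetical per-stock models: stock 0 learns y = x, stock 1 learns y = 2x.
toy_models = {}
for sid, slope in [(0, 1.0), (1, 2.0)]:
    x = np.linspace(0.001, 0.01, 20).reshape(-1, 1)
    feats = PolynomialFeatures(degree=toy_degree).fit_transform(x)
    toy_models[sid] = LinearRegression().fit(feats, slope * x)

# Hypothetical test-set volatilities.
toy_test = pd.DataFrame({
    "stock_id":   [0, 0, 1],
    "time_id":    [4, 32, 4],
    "volatility": [0.002, 0.004, 0.003],
})


def predict_per_stock(models, frame, degree):
    # One predict call per stock instead of one per row.
    poly = PolynomialFeatures(degree=degree)
    preds = pd.Series(index=frame.index, dtype=float)
    for sid, group in frame.groupby("stock_id"):
        feats = poly.fit_transform(group[["volatility"]].to_numpy())
        preds.loc[group.index] = models[sid].predict(feats).ravel()
    return preds


toy_test["target"] = predict_per_stock(toy_models, toy_test, toy_degree)
# row_id has the form "<stock_id>-<time_id>".
toy_test["row_id"] = toy_test["stock_id"].astype(str) + "-" + toy_test["time_id"].astype(str)
print(toy_test[["row_id", "target"]])
```

Because the integer columns are never mixed with floats in a row-wise `apply`, the identifiers need no casting before being turned into strings.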
    

In [None]:
# creating the submission frame with its two columns
submission = pd.DataFrame({"row_id" : [], "target" : []})
# row_id has the form "<stock_id>-<time_id>"
submission["row_id"] = past_test_volatility.apply(lambda x: str(int(x.stock_id)) + '-' + str(int(x.time_id)), axis=1)
# prediction for test data; stock_id is cast back to int because
# apply(axis=1) turns each mixed int/float row into floats
submission["target"] = past_test_volatility.apply(
    lambda x: linear_inference(models, int(x.stock_id), x.volatility, degree),
    axis=1)

In [None]:
submission.to_csv('submission.csv',index = False)

In [None]:
submission