# Understanding Optiver Realized Volatility Prediction 

**Welcome, fellow Kaggler. I know you might be bored of looking at everyone's notebooks trying to explain what's going on, and I know I am late to this party, but I am one of those who take a lot more time to understand anything. So if you still have some doubts and want to understand this competition and its data, I have tried my best to explain. Optiver was kind enough to provide a basic tutorial notebook, which was really good for building understanding, but I felt the language was not very easy. So I trimmed the parts I did not need, simplified the rest, and explained some of the terms and features in simple words. This is my first public notebook, so please be supportive, but I am ready to take questions (to answer them and learn from them).**

## Note: 
**Please read the commented lines also to understand the code line by line, I have tried to be as thorough as possible.**

**Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import os
import glob
from IPython.display import Image
import plotly.express as px
from sklearn.metrics import r2_score

## **Important Terms**

**Try to understand, but as Lord Ng says**

In [None]:
Image(filename="../input/optiver2/andrewng.jpeg")

## 1. Bid/Ask spread

Because different stocks trade at different price levels, we take the ratio of the best offer price to the best bid price to calculate the bid-ask spread. The best prices correspond to Level 1 data (the lowest ask price and the highest bid price in the order book); scroll down and look for the order book image to understand more.

$$BidAskSpread = BestOffer / BestBid -1$$
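
To make the formula concrete, here is a minimal sketch on made-up quotes. The numbers are hypothetical (not competition data); only the column names `bid_price1` and `ask_price1` come from the order book described later in this notebook.

```python
import pandas as pd

# Hypothetical top-of-book quotes (made-up numbers, not competition data)
quotes = pd.DataFrame({
    "bid_price1": [0.9990, 1.0000],
    "ask_price1": [1.0010, 1.0005],
})

# BidAskSpread = BestOffer / BestBid - 1
quotes["bid_ask_spread"] = quotes["ask_price1"] / quotes["bid_price1"] - 1
print(quotes)
```

The second row has the tighter spread, so by this measure it is the more liquid of the two snapshots.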

## 2. Weighted averaged price

The order book is also one of the primary sources for stock valuation. A fair book-based valuation must take two factors into account: the level and the size (volume) of orders. In this competition we use the weighted averaged price, or WAP, to calculate the instantaneous stock valuation (price) and to calculate realized volatility as our target.

The formula of WAP can be written as below, which takes the top level price and volume information into account:

$$ WAP = \frac{BidPrice_{1}*AskSize_{1} + AskPrice_{1}*BidSize_{1}}{BidSize_{1} + AskSize_{1}} $$

As you can see, if two books have their bid and ask offers at the same price levels, the one with more sell offers in place will generate a lower stock valuation: more intended sellers in the book implies more supply on the market, which results in a lower stock valuation.
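
A small sketch of that effect, using made-up prices and sizes. The helper name `wap` and all the numbers are assumptions for illustration only; the formula is the one given above.

```python
def wap(bid_price, bid_size, ask_price, ask_size):
    """Level 1 weighted averaged price: each price is weighted by the opposite side's size."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Two hypothetical books with identical prices but different ask sizes
balanced = wap(bid_price=99.0, bid_size=100, ask_price=101.0, ask_size=100)
more_sellers = wap(bid_price=99.0, bid_size=100, ask_price=101.0, ask_size=300)

print(balanced)      # 100.0
print(more_sellers)  # 99.5
```

With equal sizes the WAP sits at the midpoint; tripling the ask size pulls it down towards the bid.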

**Point to note:** if a trade occurs, its entry is not stored in the order book (it will be available in the trade book).

## 3. Log returns

**How can we compare the price of a stock between yesterday and today?**

**stock return** - For example, the return for stock A is given by $\frac{\$102 - \$100 }{\$100} = 2\%$, where \$102 is the price at $t$ and \$100 is the price at $t-1$.

**log returns** - These are preferred whenever some mathematical modelling is required. Calling $S_t$ the price of the stock $S$ at time $t$, we can define the log return between $t_1$ and $t_2$ as:
$$
r_{t_1, t_2} = \log \left( \frac{S_{t_2}}{S_{t_1}} \right)
$$
Usually, we look at log returns over fixed time intervals, so with 10-minute log return we mean $r_t = r_{t - 10 min, t}$.

Log returns present several advantages, for example:
- they are additive across time $r_{t_1, t_2} + r_{t_2, t_3} = r_{t_1, t_3}$
- regular returns cannot go below -100%, while log returns are not bounded
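
The additivity property can be checked numerically. This is a minimal sketch with three made-up prices; the variable names are only for illustration.

```python
import numpy as np

# Hypothetical prices at times t1, t2, t3
prices = np.array([100.0, 102.0, 99.0])

# Log returns over each step and over the whole interval
r_12 = np.log(prices[1] / prices[0])
r_23 = np.log(prices[2] / prices[1])
r_13 = np.log(prices[2] / prices[0])
print(np.isclose(r_12 + r_23, r_13))  # True: log returns add up across time

# Regular (simple) returns do not add up the same way
s_12 = prices[1] / prices[0] - 1
s_23 = prices[2] / prices[1] - 1
s_13 = prices[2] / prices[0] - 1
print(np.isclose(s_12 + s_23, s_13))  # False
```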

## 4. Realized volatility

We will compute the log returns over all consecutive book updates and we define the **realized volatility, $\sigma$,** as the square root of the sum of squared log returns.

$$
\sigma = \sqrt{\sum_{t}r_{t-1, t}^2}
$$

Where we use **WAP** as price of the stock to compute log returns. Scroll to **Example section** to learn more about WAP.

**Note**: The volatility here is not annualized (which is how realized volatility is usually quoted), and the log returns are assumed to have zero mean. So do not go calculating a standard deviation; the formula above is what you need here.
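
As a quick sketch of the formula, here it is applied to four made-up WAP values (hypothetical numbers, not competition data). It follows the same steps as the functions defined later in this notebook: log, difference, square, sum, square root.

```python
import numpy as np
import pandas as pd

# Hypothetical WAP values at four consecutive book updates
wap = pd.Series([1.0000, 1.0002, 0.9999, 1.0001])

# Log returns between consecutive updates; the first one is undefined and dropped
log_returns = np.log(wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns (no mean subtracted)
realized_vol = np.sqrt(np.sum(log_returns ** 2))
print(realized_vol)
```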

# OKAY LET'S START WITH WHAT IS GIVEN

**To read Trade and Order book for stock_id = 0**

In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
trade_example =  pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0')
stock_id = '0'
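# .copy() makes the filtered frames independent, so adding a column does not trigger SettingWithCopyWarning
book_example = book_example[book_example['time_id']==5].copy()
book_example.loc[:,'stock_id'] = stock_id
trade_example = trade_example[trade_example['time_id']==5].copy()
trade_example.loc[:,'stock_id'] = stock_id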

# Order Book Data

This is a snapshot of a moment in a running market, where each second thousands or maybe millions of prices and volumes are quoted by buyers and sellers. At a given moment it shows the buy/sell opportunities currently available. If the book is densely filled, with bids and asks sitting close together, the order book is called liquid; if the **Bid/Ask spread** is large, it is less liquid, meaning not many buyers/sellers are available. So to make a trade, a buyer has to go to a much higher price, and a seller to a much lower one (as the gap is large).

In [None]:
book_example.head()

**What are these features?** 

**1.   [bid_price1, ask_price1, bid_price2, ask_price2, bid_size1, ask_size1, bid_size2, ask_size2]**

In [None]:
Image(filename="../input/optiver/OrderBook3 - Copy.png")

**The rows immediately adjacent to the boundary between bidders (bid) and sellers (ask) are level 1. They give the best offer price and the best bid price, which are used to calculate the bid/ask spread, as mentioned in the tutorial notebook. The next rows out are level 2. This is what level 1 and level 2 depth mean here. As you can see, there can be many such levels if the order book is liquid and has a lot of buyers/sellers, but here only the two nearest levels are given.**

**2.  [time_id,  seconds_in_bucket]**

* **time_id** : It identifies a 10-minute bucket (the order book snapshots taken during those 10 minutes, which are further divided into seconds given by seconds_in_bucket). Each bucket is labelled with an integer such as 0, 1, 2 and so on. Per the competition data description, time IDs are not necessarily sequential, but they are consistent across all stocks, so treat them as labels rather than as a time ordering.

* **seconds_in_bucket(sib)** : refers to the number of seconds elapsed since the start of the 10-minute bucket of a time_id. It starts from 0, so the snapshot of the **i**th second carries the value **i-1**.

In [None]:
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
train.head()

Taking the first row of data, it implies that the realized vol of the **target bucket** for time_id 5, stock_id 0 is 0.004136.

# **Trade data**

This data contains information about trades that actually occurred against the order book data above. So you might think: OK, so trade data is basically a subset of the order book? Yes and no.

It can be seen as a subset because the trades occurred within the same time_id (10-minute block). But the point to note is that if a trade occurs at a particular moment, it is not saved in the order book (which is a book of the available buyers/sellers and what they are offering in the market, such as the number of shares and the price). So you will never see trade data in the order book; you can only see the volumes decreasing/increasing as someone buys/sells the stocks or options.

In [None]:
trade_example.head()

**Let's discuss what these features are ->**

* **price** : The aggregated mean of the prices of all the trades that occurred during the last second, e.g. trades from seconds 20-21 are shown at the 21st second.

* **size** : The aggregated sum of the volumes traded during the last second.

* **order_count** : The number of transactions (trades) that occurred during the last second.
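
One way to get a feel for these columns is to summarise them per time_id. The sketch below is only an illustration on made-up rows: the frame `trades`, the numbers, and the derived names `notional`, `per_bucket` and `vwap` are all assumptions, and this notebook does not use trade features in its prediction.

```python
import pandas as pd

# Hypothetical trade rows for two time buckets (made-up numbers)
trades = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11],
    "seconds_in_bucket": [21, 46, 50, 3, 40],
    "price": [1.002, 1.003, 1.001, 0.999, 1.000],
    "size": [300, 100, 200, 50, 150],
    "order_count": [3, 1, 2, 1, 4],
})

# Price times size, so that a size-weighted average price can be computed per bucket
trades["notional"] = trades["price"] * trades["size"]

per_bucket = trades.groupby("time_id").agg(
    total_size=("size", "sum"),
    total_orders=("order_count", "sum"),
    notional=("notional", "sum"),
)
per_bucket["vwap"] = per_bucket["notional"] / per_bucket["total_size"]
print(per_bucket)
```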


# Functions used in this notebook

In [None]:
#Function returns log_return value of the given feature
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff() 

#Calculate Realized Volatility (outputs a aggregated value) for given feature
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

#This function uses above function(as agg function) to calculate Volatility per time_id for a given stock_id
def realized_volatility_per_time_id(file_path, prediction_column_name):
    df_book_data = pd.read_parquet(file_path)
    
    # As I said you can play with WAP (be little sensible while doing so)
    #wap1
    df_book_data['wap1'] = calculate_wap1(df_book_data)
    # transform keeps the original row index, so the result aligns with df_book_data on any pandas version
    df_book_data['log_return1'] = df_book_data.groupby(['time_id'])['wap1'].transform(log_return)
    df_book_data = df_book_data[~df_book_data['log_return1'].isnull()]
    #wap2
    df_book_data['wap2'] = calculate_wap2(df_book_data)
    df_book_data['log_return2'] = df_book_data.groupby(['time_id'])['wap2'].transform(log_return)
    df_book_data = df_book_data[~df_book_data['log_return2'].isnull()]
    # final log_return_price from log_return1 & log_return2
    df_book_data['final_log_return'] = 0.6*df_book_data['log_return1'] + 0.4*df_book_data['log_return2']
    # Using final_returns to calculate predicted volatility for each time_id by agg. it using "realized_volatility" function
    df_realized_vol_per_stock =  pd.DataFrame(df_book_data.groupby(['time_id'])['final_log_return'].agg(realized_volatility)).reset_index()
    df_realized_vol_per_stock = df_realized_vol_per_stock.rename(columns = {'final_log_return':prediction_column_name})
    #For a given stock_id
    stock_id = file_path.split('=')[1]
    # Stock_id_time_id -> submission format (AS THEY SAY IN MANDALORIAN , THIS IS THE WAY)
    df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    return df_realized_vol_per_stock[['row_id',prediction_column_name]]

#Loop through stock_id's to calculate  Volatility per stock
def past_realized_volatility_per_stock(list_file,prediction_column_name):
    # Build one frame per stock file, then concatenate once (cheaper than concatenating inside the loop)
    frames = [realized_volatility_per_time_id(file, prediction_column_name) for file in list_file]
    # Return an empty frame when no files are given, as the original loop did
    return pd.concat(frames) if frames else pd.DataFrame()

#RMSPE
def rmspe(y_true, y_pred):
    return  (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

#WAP1
def calculate_wap1(df):
    a1 = df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']
    b1 = df['bid_size1'] + df['ask_size1']
    a2 = df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']
    b2 = df['bid_size2'] + df['ask_size2']
    
    x = (a1/b1 + a2/b2)/ 2
    
    return x

#WAP2
def calculate_wap2(df):
        
    a1 = df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']
    a2 = df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']
    b = df['bid_size1'] + df['ask_size1'] + df['bid_size2']+ df['ask_size2']
    
    x = (a1 + a2)/ b
    return x

# Example Section 

**Realized volatility calculation and more about WAP**

In this competition, our target is to predict short-term realized volatility.
As realized volatility is a statistical measure of price changes on a given stock, to calculate the price change we first need a stock price at each second (as the data is given per second). We will use the weighted averaged price, or **WAP**, of the given order book data. You can play with the level 1 and level 2 features, as the aim of WAP is to reflect the effect of randomness, aggression, or noise coming from the prices and volumes quoted in the market at a given point in time.

In [None]:
#Given formula for WAP
book_example['wap'] = (book_example['bid_price1'] * book_example['ask_size1'] +
                                book_example['ask_price1'] * book_example['bid_size1']) / (
                                       book_example['bid_size1']+ book_example['ask_size1'])

**The WAP of the stock is plotted below**

In [None]:
fig = px.line(book_example, x="seconds_in_bucket", y="wap", title='WAP of stock_id_0, time_id_5')
fig.show()

To compute the log return, we can simply take **the logarithm of the ratio** between two consecutive **WAP** values (rows). The first row will have an empty return as the previous book update is unknown, therefore the empty return data point will be dropped.

**Note** : We already have function for log_return in **Functions** section.
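
When several time_ids sit in one dataframe, the log return should be taken within each time_id, otherwise the first row of a bucket is compared with the last row of the previous one. Below is a minimal sketch on made-up data; the frame `book` and its numbers are hypothetical, and `log_return` is redefined only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical WAP values for two time buckets (made-up numbers)
book = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11],
    "wap": [1.000, 1.001, 1.002, 2.000, 2.002],
})

def log_return(prices):
    # log(p_t) - log(p_{t-1}); the first element has no predecessor and becomes NaN
    return np.log(prices).diff()

# Compute returns separately inside each time_id, then drop the empty first rows
book["log_return"] = book.groupby("time_id")["wap"].transform(log_return)
book = book.dropna(subset=["log_return"])
print(book)
```

Three rows remain: two for time_id 5 and one for time_id 11. Without the grouping, the jump from 1.002 to 2.000 between buckets would show up as a huge fake return.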

In [None]:
book_example.loc[:,'log_return'] = log_return(book_example['wap'])
book_example = book_example[~book_example['log_return'].isnull()]

**Let's plot the tick-to-tick return of this instrument over this time bucket**

In [None]:
fig = px.line(book_example, x="seconds_in_bucket", y="log_return", title='Log return of stock_id_0, time_id_5')
fig.show()

The realized vol of stock 0 in this feature bucket, will be:

In [None]:
#  Calculate Realized Volatility (outputs a aggregated value) for given feature
realized_vol = realized_volatility(book_example['log_return'])
print(f'Realized volatility for stock_id 0 on time_id 5 is {realized_vol}')

# Back to the problem now and let's do a "Naive prediction": using past realized volatility as target

A commonly known fact about volatility is that it tends to be autocorrelated (previous trends can dictate future trends to some extent). We can use this property to implement a naive model that just "predicts" realized volatility by using whatever the realized volatility was in the initial 10 minutes.

Let's calculate the past realized volatility across the training set to see how predictive a single naive signal can be.

In [None]:
#Loading Train_Order_book parquet folder => and storing path for each stock's order book
list_order_book_file_train = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/*')

We can get the past realized volatility as the prediction for each individual stock using **past_realized_volatility_per_stock** (which loops through each stock). It internally calls **realized_volatility_per_time_id**, which computes the volatility for each time_id within that stock_id.

In [None]:
#Looping through each stock using "past_realized_volatility_per_stock" func
df_past_realized_train = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
                                                           prediction_column_name='pred')

**Let's join the output dataframe with train.csv to see the performance of the naive prediction on training set.**

In [None]:
train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str) # remember that Mandalorian's line , yes CTRL+F
train = train[['row_id','target']]
df_joined = train.merge(df_past_realized_train[['row_id','pred']], on = ['row_id'], how = 'left')

**We will evaluate the naive prediction result by two metrics: RMSPE and R squared.**

In [None]:
R2 = round(r2_score(y_true = df_joined['target'], y_pred = df_joined['pred']),3)
RMSPE = round(rmspe(y_true = df_joined['target'], y_pred = df_joined['pred']),3)
print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}')

**It is a reasonable benchmark to start with. Remember we have not used trade data yet.**
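
To see what RMSPE rewards and punishes, here is a small sketch on made-up targets. The arrays are hypothetical, and `rmspe` is repeated from the Functions section only so the snippet runs on its own.

```python
import numpy as np

def rmspe(y_true, y_pred):
    # Root mean squared percentage error: errors are scaled by the true value
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

y_true = np.array([0.004, 0.002])

# Same absolute miss of 0.001, once on the larger target and once on the smaller one
miss_on_large = rmspe(y_true, np.array([0.005, 0.002]))
miss_on_small = rmspe(y_true, np.array([0.004, 0.003]))

print(miss_on_large, miss_on_small)
```

The same absolute error costs twice as much on the target that is half the size, so low-volatility rows matter a lot under this metric.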

# Submission

**I am using some of the code from Optiver's introductory notebook, including the cell below for submission.
The order book of the test set contains only one time_id (4), but the submission is required for 3 time ids: 4, 32, 34. So it's possible that you try to submit a prediction with that one row only and it errors at submission. So...yeah, USE THIS.**

In [None]:
list_order_book_file_test = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_test.parquet/*')
df_naive_pred_test = past_realized_volatility_per_stock(list_file=list_order_book_file_test,
                                                           prediction_column_name='target')
df_naive_pred_test.to_csv('submission.csv',index = False)

In [None]:
df_naive_pred_test

## If I have got anything wrong, please tell me in the comments, it will help me and everyone reading this notebook. 

# AND PLEASE UPVOTE IF IT WAS HELPFUL