In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
train.head()

In [None]:
stock_id = '0'
book = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
trade = pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0')
# Keep a single time bucket; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
book = book[book['time_id'] == 5].copy()
book['stock_id'] = stock_id
trade = trade[trade['time_id'] == 5].copy()
trade['stock_id'] = stock_id

In [None]:
book.head()

In [None]:
trade.head()

**Weighted averaged price**

The order book is one of the primary sources for stock valuation. A fair book-based valuation must take two factors into account: the price level and the size of the orders. This competition uses the weighted averaged price, or WAP, to calculate the instantaneous stock valuation, and the realized volatility computed from the WAP is the target.

The WAP formula, which takes the top-level price and volume information into account, can be written as:

$$WAP = \frac{BidPrice_1 \cdot AskSize_1 + AskPrice_1 \cdot BidSize_1}{BidSize_1 + AskSize_1}$$
 
As you can see, if two books have their bid and ask orders at the same price levels, the one with more size on the ask side generates a lower stock valuation: there are more intended sellers in the book, and more sellers imply more supply on the market, resulting in a lower stock valuation.
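
As a small illustration of the point above, the sketch below evaluates the WAP formula for two made-up books that share the same bid and ask prices but differ in ask size. The helper name `wap` and all prices and sizes are hypothetical values chosen for the example, not competition data.

```python
def wap(bid_price, bid_size, ask_price, ask_size):
    """Top-of-book weighted averaged price, following the formula above."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Two hypothetical books with identical prices (bid 99, ask 101).
balanced = wap(bid_price=99.0, bid_size=100, ask_price=101.0, ask_size=100)
ask_heavy = wap(bid_price=99.0, bid_size=100, ask_price=101.0, ask_size=300)

print(balanced)   # equal sizes: the WAP sits at the mid price
print(ask_heavy)  # more ask size: the WAP moves toward the bid
```

With equal sizes the WAP is the mid price; adding size on the ask side pulls the valuation down toward the bid.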

Note that in most cases, during continuous trading hours, an order book should not contain a bid order priced higher than the offer, or ask, order. In other words, the bid and ask should almost never cross.
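
A quick sanity check for crossed books could look like the sketch below. It runs on a hypothetical three-row frame (`toy_book`) that reuses the `bid_price1` and `ask_price1` column names from the book data; the prices themselves are invented for the example.

```python
import pandas as pd

# Hypothetical top-of-book rows; the second row is deliberately crossed.
toy_book = pd.DataFrame({
    "bid_price1": [0.9990, 1.0002, 0.9995],
    "ask_price1": [1.0010, 1.0001, 1.0005],
})

# A row is crossed when the best bid is priced above the best ask.
crossed = toy_book["bid_price1"] > toy_book["ask_price1"]
n_crossed = int(crossed.sum())
print(f"{n_crossed} crossed row(s) out of {len(toy_book)}")
```

The same boolean mask could be applied to the real `book` frame to count or inspect any crossed snapshots.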

In this competition the target is constructed from the WAP. The WAP of the order book snapshot is 147.5317797.

In [None]:
book['wap'] = (
    book['bid_price1'] * book['ask_size1'] + book['ask_price1'] * book['bid_size1']
) / (book['bid_size1'] + book['ask_size1'])

In [None]:
fig = px.line(book, x="seconds_in_bucket", y="wap", title='WAP of stock_id_0, time_id_5')
fig.show()

**Log returns**

How can we compare the price of a stock between yesterday and today?

The easiest method would be to just take the difference. This is definitely the most intuitive way; however, price differences are not always comparable across stocks. For example, let's assume that we have invested \$1000 in both stock A and stock B, and that stock A moves from \$100 to \$102 while stock B moves from \$10 to \$11. We had a total of 10 shares of A ($\$1000 / \$100 = 10$), which led to a profit of $10 \cdot (\$102 - \$100) = \$20$, and a total of 100 shares of B, which yielded \$100. So the price increase was larger for stock A (\$2 versus \$1), although the move was proportionally much larger for stock B.

We can solve the above problem by dividing the move by the starting price of the stock, effectively computing the percentage change in price, also known as the stock return. In our example, the return for stock A was $\frac{\$102 - \$100}{\$100} = 2\%$, while for stock B it was $\frac{\$11 - \$10}{\$10} = 10\%$. The stock return coincides with the percentage change in our invested capital.
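
The arithmetic of the stock A and stock B example can be reproduced in a few lines. The variable names below (`capital`, `moves`, `results`) are just illustrative choices for this sketch.

```python
capital = 1000.0  # amount invested in each stock
moves = {"A": (100.0, 102.0), "B": (10.0, 11.0)}  # (start price, end price)

results = {}
for name, (start, end) in moves.items():
    shares = capital / start                   # number of shares bought
    profit = shares * (end - start)            # profit in dollars
    simple_return = (end - start) / start      # percentage change in price
    results[name] = (profit, simple_return)
    print(f"stock {name}: profit = ${profit:.0f}, return = {simple_return:.0%}")
```

Dividing each profit by the invested capital gives the same number as the price-based return, which is the sense in which the return coincides with the percentage change in invested capital.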

Returns are widely used in finance; however, log returns are preferred whenever some mathematical modelling is required. Calling $S_t$ the price of the stock $S$ at time $t$, we can define the log return between $t_1$ and $t_2$ as:

$$r_{t_1, t_2} = \log\left(\frac{S_{t_2}}{S_{t_1}}\right)$$

Usually, we look at log returns over fixed time intervals, so by the 10-minute log return we mean $r_t = r_{t - 10\,\text{min},\, t}$.

Log returns present several advantages, for example:

- they are additive across time: $r_{t_1,t_2} + r_{t_2,t_3} = r_{t_1,t_3}$
- regular returns cannot go below -100%, while log returns are not bounded
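
Both properties can be checked numerically. The sketch below uses three hypothetical prices (`s1`, `s2`, `s3`) to show that log returns add up across time while simple returns do not, and that a log return can fall below -1 (that is, below -100%).

```python
import numpy as np

s1, s2, s3 = 100.0, 102.0, 99.0  # hypothetical prices at t1 < t2 < t3

# Log returns over the two sub-intervals and over the whole interval.
r12 = np.log(s2 / s1)
r23 = np.log(s3 / s2)
r13 = np.log(s3 / s1)
print(np.isclose(r12 + r23, r13))  # additivity of log returns

# Simple returns over the same intervals do not add up.
simple_sum = (s2 / s1 - 1) + (s3 / s2 - 1)
simple_total = s3 / s1 - 1
print(simple_sum, simple_total)

# A 99.9% price drop: the simple return stays above -1, the log return does not.
crash_simple = 0.001 - 1
crash_log = np.log(0.001)
print(crash_simple, crash_log)
```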

In [None]:
def log_return(list_stock_prices):
    # Expects a pandas Series of prices; the first element of the result is NaN
    return np.log(list_stock_prices).diff()

In [None]:
book.loc[:,'log_return'] = log_return(book['wap'])
book = book[~book['log_return'].isnull()]

**Realized volatility**

When we trade options, a valuable input to our models is the standard deviation of the stock log returns. The standard deviation differs for log returns computed over longer or shorter intervals; for this reason it is usually normalized to a 1-year period, and the annualized standard deviation is called volatility.

In this competition, you will be given 10 minutes of book data and we ask you to predict what the volatility will be in the following 10 minutes. Volatility will be measured as follows:

We will compute the log returns over all consecutive book updates and define the realized volatility, $\sigma$, as the square root of the sum of squared log returns:

$$\sigma = \sqrt{\sum_{t} r_{t-1,t}^2}$$

where we use the WAP as the price of the stock to compute log returns.

We want to keep definitions as simple and clear as possible, so that Kagglers without financial knowledge will not be penalized. So we are not annualizing the volatility and we are assuming that log returns have 0 mean.
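
To make the definition concrete, here is a minimal sketch on a hypothetical five-point WAP series (`toy_wap`; the values are invented for the example). It also illustrates the zero-mean assumption: with no mean subtracted, the realized volatility equals the root mean square of the log returns scaled by the square root of the number of returns.

```python
import numpy as np
import pandas as pd

# Hypothetical WAP values at five consecutive book updates.
toy_wap = pd.Series([1.0000, 1.0010, 1.0005, 1.0020, 1.0015])

# Log returns between consecutive updates; the first value is NaN and is dropped.
toy_log_returns = np.log(toy_wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns.
toy_realized_vol = np.sqrt(np.sum(toy_log_returns ** 2))

# Under the zero-mean assumption this is sqrt(n) times the RMS log return.
n = len(toy_log_returns)
rms = np.sqrt(np.mean(toy_log_returns ** 2))
print(toy_realized_vol, np.sqrt(n) * rms)
```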

In [None]:
fig = px.line(book, x="seconds_in_bucket", y="log_return", title='Log return of stock_id_0, time_id_5')
fig.show()

In [None]:
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
realized_vol = realized_volatility(book['log_return'])
print(f'Realized volatility for stock_id 0 on time_id 5 is {realized_vol}')