# EDA On the Orderbook

This competition is really exciting to me because I am currently writing a research paper on Ethereum orderbook modelling (although that is on the forwards spread rather than vol), so I thought I would share some of the techniques I have picked up with the amazing Kaggle community!

I am eventually going to build models based on this EDA, but feel free to beat me to it, because I can tell this competition will definitely be the hardest thing I have attempted in a long time.

<div class="alert alert-info">
  <strong>This notebook is indeed a work in progress so please check back later ;)</strong>
</div>

# Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (10, 8)

# Select Single Stock For EDA 

I eventually hope to automate this for every stock and generate a PDF report automatically!

In [None]:
book = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
trade = pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0')

In [None]:
book = book.set_index('time_id')
trade = trade.set_index('time_id')

In [None]:
book.head()

In [None]:
trade.head()

In [None]:
book

# Calculate Orderbook Statistics

In [None]:
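def log_return(list_stock_prices):
    # Log return between consecutive prices; the first element is NaN.
    return np.log(list_stock_prices).diff()

def realized_volatility(series_log_return):
    # Square root of the sum of squared log returns.
    return np.sqrt(np.sum(series_log_return**2))

# More to come...
def calc_stats(df):
    # Work on a copy so the caller's frame is not modified in place.
    df = df.copy()

    # Size imbalance (ask size minus bid size) at each book level.
    df['size_spread_l1'] = df['ask_size1'] - df['bid_size1']
    df['size_spread_l2'] = df['ask_size2'] - df['bid_size2']

    # Quoted price spread at each book level.
    df['price_spread_l1'] = df['ask_price1'] - df['bid_price1']
    df['price_spread_l2'] = df['ask_price2'] - df['bid_price2']

    # Weighted average price: each side's price weighted by the opposite side's size.
    df['wap'] = (df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']) / (df['bid_size1'] + df['ask_size1'])

    # Compute returns within each time_id so they never span two buckets.
    df['log_return'] = df.groupby(level='time_id')['wap'].transform(log_return)
    df = df[~df['log_return'].isnull()].copy()

    # Realized volatility is defined per time_id; broadcast it back to every row.
    df['realized_vol'] = df.groupby(level='time_id')['log_return'].transform(realized_volatility)

    return df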

In [None]:
book = calc_stats(book)
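
To make the WAP and realized volatility calculations concrete, here is a minimal sketch on a toy order book. The numbers and the name `toy` are made up for illustration; only the column names come from the competition data. It assumes, as the competition does, that realized volatility is computed separately for each `time_id`.

```python
import numpy as np
import pandas as pd

# Toy order book: two time_id buckets with three snapshots each (made-up numbers).
toy = pd.DataFrame({
    'time_id':    [5, 5, 5, 11, 11, 11],
    'bid_price1': [1.000, 1.001, 1.002, 0.999, 0.998, 1.000],
    'ask_price1': [1.002, 1.003, 1.004, 1.001, 1.000, 1.002],
    'bid_size1':  [100, 50, 200, 10, 300, 100],
    'ask_size1':  [100, 150, 50, 90, 100, 100],
}).set_index('time_id')

# WAP weights each side's price by the size on the opposite side.
toy['wap'] = (
    toy['bid_price1'] * toy['ask_size1'] + toy['ask_price1'] * toy['bid_size1']
) / (toy['bid_size1'] + toy['ask_size1'])

# Log returns are taken inside each bucket, so the first row of a bucket is NaN.
toy['log_return'] = toy.groupby(level='time_id')['wap'].transform(
    lambda s: np.log(s).diff()
)

# One realized volatility value per time_id.
realized_vol = (
    toy.dropna(subset=['log_return'])
       .groupby(level='time_id')['log_return']
       .agg(lambda r: np.sqrt(np.sum(r ** 2)))
)
print(realized_vol)
```

Grouping by `time_id` before differencing matters because consecutive rows from different buckets are not consecutive in time, so a return computed across the boundary would be meaningless.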

In [None]:
book

In [None]:
book.describe()

# Size Spread Plots

From the plot below we can see that the size spread (ask size minus bid size) widens significantly during certain periods. This is most likely due to market volatility, which causes a sudden bout of illiquidity as market participants all try to dump at once. We can also separate most markets into two regimes, which is something I have done on my website [here](https://www.munum-butt.tech/blog-posts/gold-volatility-modelling-using-hidden-markov-models-hmms/).

In [None]:
plt.plot(book['size_spread_l1'])
plt.title('Layer 1 Size Spread on Stock 0')
plt.xlabel('Time ID')
plt.ylabel('Spread')
plt.show()

This is most definitely not normally distributed! Something heavy-tailed like a Cauchy or, since the spread takes both signs, a two-sided exponential (Laplace) may work far better. Do not let the log scale fool you.

In [None]:
plt.hist(book['size_spread_l1'], bins='auto')
plt.title('Layer 1 Size Spread on Stock 0 Distribution')
plt.yscale('log')
plt.xlabel('Spread')
plt.ylabel('Count')
plt.show()

In [None]:
plt.hist(book['size_spread_l1'], bins='auto')
plt.title('Layer 1 Size Spread on Stock 0 Distribution')
plt.xlabel('Spread')
plt.ylabel('Count')
plt.show()

In [None]:
book['size_spread_l1'].describe()
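
To put a number on the "not normally distributed" claim rather than judging it by eye, one option is to look at excess kurtosis and at how often observations land far out in the tails. The sketch below is only an illustration: `excess_kurtosis` and `tail_fraction` are hypothetical helpers, and the two samples are synthetic stand-ins, not the stock data.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: about 0 for a normal distribution, larger for heavy tails."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

def tail_fraction(x, k=4.0):
    """Share of observations more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(np.abs(z) > k)

# Synthetic stand-ins: a normal sample and a heavier-tailed Laplace sample.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
laplace_sample = rng.laplace(size=100_000)

for name, sample in [('normal', normal_sample), ('laplace', laplace_sample)]:
    print(name, round(excess_kurtosis(sample), 2), tail_fraction(sample))
```

On the real data the same helpers could be called as `excess_kurtosis(book['size_spread_l1'])`. A value well above zero, together with a tail fraction far larger than the normal sample's, would support the heavy-tailed reading of the histograms.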

You can see that the second layer is more volatile, as you would expect for a slightly worse quote. These are likely larger players.

In [None]:
plt.plot(book['size_spread_l2'])
plt.title('Layer 2 Size Spread on Stock 0')
plt.xlabel('Time ID')
plt.ylabel('Spread')
plt.show()

In [None]:
plt.hist(book['size_spread_l2'], bins='auto')
plt.title('Layer 2 Size Spread on Stock 0 Distribution')
plt.yscale('log')
plt.xlabel('Spread')
plt.ylabel('Count')
plt.show()

In [None]:
plt.hist(book['size_spread_l2'], bins='auto')
plt.title('Layer 2 Size Spread on Stock 0 Distribution')
plt.xlabel('Spread')
plt.ylabel('Count')
plt.show()

In [None]:
book['size_spread_l2'].describe()

# Price Spread Plots

In [None]:
book.head()

From the plot below we can see that the ask is consistently higher than the bid, which makes perfect sense given market dynamics (sellers always want more than buyers are willing to pay!).

In [None]:
plt.plot(book['bid_price1'], c='b', label='Bid Price')
plt.plot(book['ask_price1'], c='r', label='Ask Price', alpha=0.7)
plt.title('Best Bid/Ask Prices')
plt.xlabel('Time ID')
plt.ylabel('Price')
plt.legend()
plt.show()

In [None]:
plt.plot(book['bid_price1'].iloc[:100000], c='b', label='Bid Price')
plt.plot(book['ask_price1'].iloc[:100000], c='r', label='Ask Price', alpha=0.7)
plt.title('Best Bid/Ask Prices (First 100,000 Rows)')
plt.xlabel('Time ID')
plt.ylabel('Price')
plt.legend()
plt.show()

It is subtle, but you can see that the level 2 ask is regularly a touch higher than the best ask, again implying that these are worse quotes.

In [None]:
plt.plot(book['bid_price2'], c='b', label='Bid Price')
plt.plot(book['ask_price2'], c='r', label='Ask Price', alpha=0.7)
plt.title('L2 Bid/Ask Prices')
plt.xlabel('Time ID')
plt.ylabel('Price')
plt.legend()
plt.show()

In [None]:
plt.plot(book['bid_price2'].iloc[:100000], c='b', label='Bid Price')
plt.plot(book['ask_price2'].iloc[:100000], c='r', label='Ask Price', alpha=0.7)
plt.title('L2 Bid/Ask Prices (First 100,000 Rows)')
plt.xlabel('Time ID')
plt.ylabel('Price')
plt.legend()
plt.show()
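
The ordering claims above (ask above bid, level 2 quotes worse than level 1) are hard to judge from overlapping lines, so here is a small sketch that checks them row by row. `book_ordering_report` is a hypothetical helper and the `toy` frame holds made-up prices; only the column names are taken from the competition data.

```python
import pandas as pd

def book_ordering_report(df):
    """Fraction of rows that satisfy each expected price ordering."""
    checks = {
        'ask1 > bid1': df['ask_price1'] > df['bid_price1'],
        'ask2 > ask1': df['ask_price2'] > df['ask_price1'],
        'bid2 < bid1': df['bid_price2'] < df['bid_price1'],
    }
    return pd.Series({name: mask.mean() for name, mask in checks.items()})

# Made-up snapshots of a two-level book.
toy = pd.DataFrame({
    'bid_price2': [0.998, 0.999, 1.000],
    'bid_price1': [0.999, 1.000, 1.001],
    'ask_price1': [1.001, 1.002, 1.003],
    'ask_price2': [1.002, 1.003, 1.004],
})

print(book_ordering_report(toy))
```

Calling `book_ordering_report(book)` on the real frame would show whether each ordering holds on every row (a value of 1.0) or only most of the time.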

Again, it is very clear that neither bid level follows a normal distribution!

**Proceeds to throw out 90% of markets research which has that BS assumption**

In [None]:
plt.hist(book['bid_price1'], bins='auto', label='Best Bids')
plt.hist(book['bid_price2'], bins='auto', label='L2 Bids', alpha=0.7)
plt.title('Bids on Stock 0 Distribution')
plt.xlabel('Bid Price')
plt.ylabel('Count')
plt.legend()
plt.show()

In [None]:
plt.hist(book['bid_price1'], bins='auto', label='Best Bids')
plt.hist(book['bid_price2'], bins='auto', label='L2 Bids', alpha=0.7)
plt.title('Bids on Stock 0 Distribution')
plt.xlabel('Bid Price')
plt.ylabel('Count')
plt.yscale('log')
plt.legend()
plt.show()

In [None]:
plt.hist(book['ask_price1'], bins='auto', label='Best Ask')
plt.hist(book['ask_price2'], bins='auto', label='L2 Ask', alpha=0.7)
plt.title('Asks on Stock 0 Distribution')
plt.xlabel('Ask Price')
plt.ylabel('Count')
plt.legend()
plt.show()

In [None]:
plt.hist(book['ask_price1'], bins='auto', label='Best Ask')
plt.hist(book['ask_price2'], bins='auto', label='L2 Ask', alpha=0.7)
plt.title('Asks on Stock 0 Distribution')
plt.xlabel('Ask Price')
plt.ylabel('Count')
plt.yscale('log')
plt.legend()
plt.show()

# Returns Analysis

Notice how the peaks come in clusters? The reason is explained [here](https://www.kaggle.com/c/jane-street-market-prediction/discussion/227793#1248042).

In [None]:
plt.plot(book['log_return'], label='Log Returns')
plt.title('Log Returns on Stock 0')
plt.xlabel('Time ID')
plt.ylabel('Returns')
plt.legend()
plt.show()
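
Clusters of large moves like these are commonly called volatility clustering, and a common way to quantify it is the autocorrelation of squared returns. The sketch below is a rough illustration under stated assumptions: `autocorr` is a hypothetical helper, and both return series are simulated stand-ins (one with alternating calm and turbulent blocks, one with constant volatility), not the stock data.

```python
import numpy as np

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(42)
n = 20_000

# Simulated stand-in: volatility alternates between calm and turbulent blocks of 500 steps.
block_vol = np.where(np.arange(n) // 500 % 2 == 0, 0.001, 0.004)
clustered = rng.normal(size=n) * block_vol

# Comparison series with constant volatility and no clustering.
iid = rng.normal(size=n) * block_vol.mean()

print('clustered:', autocorr(clustered ** 2))
print('iid:      ', autocorr(iid ** 2))
```

The clustered series should show a clearly positive autocorrelation in its squared returns, while the constant-volatility series should sit near zero. On the real data the check would be best applied within a single `time_id`, since rows from different buckets are not adjacent in time.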

# Realised Volatility Analysis

Volatility is an immensely rich subject... where most of the research is utter rubbish.

For those who really want to blow their minds (and have a major advantage in this competition) have a look at 'Rough Volatility'...

<div class="alert alert-info">
  <strong>This notebook is indeed a work in progress so please check back later ;)</strong>
</div>