# **Optiver Realized Volatility Prediction**

**I am a bit late to the party, and I suppose there are already good EDA notebooks in the Code section of the competition. Nonetheless, I think EDA is an almost mandatory exercise before tackling the main objective. For me it is one of the best ways to get hands-on with an unknown dataset, so let's go!**

In [None]:
import gc
import time
import numpy as np
import pandas as pd
from tqdm import tqdm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
rs = RandomState(MT19937(SeedSequence(7879)))

# Data Discovery

## Loading Data

In [None]:
base_path = "/kaggle/input/optiver-realized-volatility-prediction/"

df_train = pd.read_csv(base_path+"train.csv")
df_test = pd.read_csv(base_path+"test.csv")
df_submission = pd.read_csv(base_path+"sample_submission.csv")

In [None]:
%%time
min_id_select = 20
max_id_select = 45

df_book = pd.read_parquet(base_path+'book_train.parquet', engine='pyarrow', filters=[[('stock_id', '>=', min_id_select), ('stock_id', '<', max_id_select)]])
#df_book = pd.read_parquet(base_path+'book_train.parquet')
df_trade = pd.read_parquet(base_path+'trade_train.parquet')
df_book_test = pd.read_parquet(base_path+'book_test.parquet')
df_trade_test = pd.read_parquet(base_path+'trade_test.parquet')

In [None]:
size_trade_train = df_trade.memory_usage().sum() / 1024**2
size_book_train = df_book.memory_usage().sum() / 1024**2
print("Memory usage for book_train.parquet: %.2f MB" % size_book_train)
print("Memory usage for trade_train.parquet: %.2f MB" % size_trade_train)

`
 Memory usage for book_train.parquet once loaded: 5901.70 MB
 Memory usage for trade_train.parquet once loaded: 549.07 MB`

**Ok, this smells bad. It reminds me of another Kaggle competition where a huge dataset was a real pain. Everything will be slow and tedious: unexpected notebook restarts, plotting data, feature engineering, ...**

**Some of what follows generalizes from a subset of the book data. It is probably possible to work around the memory issue, but I am not sure it is worth the effort.**
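**One possible workaround, shown here only as a minimal sketch on a made-up toy frame, is to store numeric columns in smaller dtypes. The helper name `downcast_numeric` is hypothetical, and casting prices to float32 loses some precision, so whether this is acceptable for the real book data is an assumption to check.**

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    # Integers: pick the smallest signed integer type that holds every value.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Floats: cast to float32 (assumption: the lost precision is acceptable).
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = out[col].astype(np.float32)
    return out


# Toy frame standing in for a slice of the book data (values are made up).
toy_book = pd.DataFrame({
    "time_id": np.array([5, 5, 11, 11], dtype="int64"),
    "seconds_in_bucket": np.array([0, 1, 0, 599], dtype="int64"),
    "bid_price1": np.array([1.0001, 1.0002, 0.9998, 0.9999], dtype="float64"),
})
toy_small = downcast_numeric(toy_book)

print(toy_small.dtypes)
print(toy_book.memory_usage(deep=True).sum(), "->", toy_small.memory_usage(deep=True).sum())
```

**On the real data one would call the same helper on `df_book` and `df_trade` right after loading them.**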

In [None]:
df_train.info()

In [None]:
df_book.info()

In [None]:
df_trade.info()

In [None]:
df_test.info()

In [None]:
df_trade.stock_id = df_trade.stock_id.astype("int8")
df_book.stock_id = df_book.stock_id.astype("int8")

**Casting `stock_id` back to int8 instead of categorical. For the other columns, the types look consistent.**

# Exploratory Data Analysis

In [None]:
print(df_trade.stock_id.unique().shape)
print(df_book.stock_id.unique().shape)
print(df_train.stock_id.unique().shape)
print(df_test.stock_id.unique().shape)

**There are 112 different stocks in total in the dataset.**

## train.csv

In [None]:
df_train.head(10)

**This file contains the realized volatility (called `target`) for each `(stock_id, time_id)` pair. These pairs form the unique identifier called `row_id` used in the prediction submission. They also make it possible to look up the matching trades and book updates in the parquet files.**
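**To make the link between the pair and `row_id` concrete, here is a small sketch on made-up rows. The `"<stock_id>-<time_id>"` format is an assumption to verify against `sample_submission.csv`.**

```python
import pandas as pd

# Toy rows standing in for train.csv (the values are made up).
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 1],
    "time_id": [5, 11, 5],
    "target": [0.004, 0.001, 0.003],
})

# Assumed format: "<stock_id>-<time_id>".
toy_train["row_id"] = (
    toy_train["stock_id"].astype(str) + "-" + toy_train["time_id"].astype(str)
)

# The pair must be unique, otherwise row_id could not identify a prediction.
assert not toy_train.duplicated(subset=["stock_id", "time_id"]).any()
print(toy_train)
```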

## test.csv

In [None]:
df_test.head(10)

**There are only three rows here, for the sake of automation and to give an idea of what the hidden data will look like. This file tells us which `row_id` values to predict.**

## sample_submission.csv 

In [None]:
df_submission.info()

In [None]:
df_submission.head(20)

**Prediction for the rows from `test.csv`.**

## trade_[train|test].parquet

In [None]:
df_trade.head(10)

In [None]:
df_trade.describe()

* **Note that the size of each individual order is not available: all orders in a row are aggregated under `size`.**
* **The remaining fields are fairly self-explanatory.**
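**Since only the aggregate is available, the closest proxy I can think of is the average size per order within a row. This is a sketch on made-up values, and the column name `avg_size_per_order` is hypothetical.**

```python
import pandas as pd

# Toy rows standing in for the trade data (the values are made up).
toy_trade = pd.DataFrame({
    "stock_id": [20, 20, 21],
    "size": [300, 50, 120],
    "order_count": [3, 1, 4],
})

# Average number of shares per order in each row.
# The size of each individual order remains unknown.
toy_trade["avg_size_per_order"] = toy_trade["size"] / toy_trade["order_count"]
print(toy_trade)
```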

In [None]:
grouped_order_count = df_trade.groupby(by="stock_id").agg({"order_count": 'sum'}).reset_index()
fig, ax = plt.subplots(figsize=(30, 10))
sns.barplot(x="stock_id", y="order_count", data=grouped_order_count, ax=ax)
sorted_order_count = grouped_order_count.sort_values(by="order_count")
print(sorted_order_count.head(10))
print(sorted_order_count.tail(10))
plt.show()
plt.clf()

In [None]:
# Same plot as above, but for the total traded size per stock.
grouped_size = df_trade.groupby(by="stock_id").agg({"size": 'sum'}).reset_index()
fig, ax = plt.subplots(figsize=(30, 10))
sns.barplot(x="stock_id", y="size", data=grouped_size, ax=ax)
sorted_size = grouped_size.sort_values(by="size")
print(sorted_size.head(10))
print(sorted_size.tail(10))
plt.show()
plt.clf()

## book_[train|test].parquet

In [None]:
df_book.head(10)

In [None]:
df_book_test.head(10)

**Not much to say, except that `book_test.parquet` contains only 3 rows, which match the first row of `test.csv`.**

In [None]:
df_book.describe()

* **Max `seconds_in_bucket` is 5.99e+02 (599), which matches the 10-minute timeframe for the forecast.**
* **The mean and max of `bid_price1` are greater than those of `bid_price2`, which is consistent with the two levels of the order book.**
* **All column descriptions are available in the ["Data"](https://www.kaggle.com/c/optiver-realized-volatility-prediction/data) tab of the competition, so I will not go further into this.**
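**Comparing means and maxima is only indirect evidence. A row-wise check is stricter, and could look like the sketch below on made-up values. It assumes that level 1 is the most competitive level on each side, which should be confirmed in the "Data" tab.**

```python
import pandas as pd

# Toy rows standing in for the book data (the values are made up).
toy_book = pd.DataFrame({
    "bid_price1": [1.0000, 0.9998, 1.0003],
    "bid_price2": [0.9999, 0.9997, 1.0001],
    "ask_price1": [1.0002, 1.0001, 1.0005],
    "ask_price2": [1.0003, 1.0002, 1.0006],
})

# Best bid should never be below the second-best bid.
bids_ordered = (toy_book["bid_price1"] >= toy_book["bid_price2"]).all()
# Best ask should never be above the second-best ask.
asks_ordered = (toy_book["ask_price1"] <= toy_book["ask_price2"]).all()
# The best ask should sit above the best bid (positive spread).
spread_positive = (toy_book["ask_price1"] > toy_book["bid_price1"]).all()

print(bids_ordered, asks_ordered, spread_positive)
```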

In [None]:
# %%time
# fig, axs = plt.subplots(2, 2, figsize=(15, 15))
# sns.histplot(df_book.ask_price1, kde=True, bins=100, ax=axs[0][0])
# sns.histplot(df_book.ask_price2, kde=True, bins=100, ax=axs[0][1])
# sns.histplot(df_book.bid_price1, kde=True, bins=100, ax=axs[1][0])
# sns.histplot(df_book.bid_price2, kde=True, bins=100, ax=axs[1][1])
# plt.show()
# plt.clf()

**Looking at `bid_price[1|2]` and `ask_price[1|2]`, one can deduce that prices have been normalized.**

In [None]:
max_time = df_book.time_id.max()
print(max_time)
book_sample = df_book[df_book.time_id == max_time]
print(book_sample.shape)
book_sample.head()

In [None]:
grouped_stocks = book_sample.groupby(by="stock_id").agg(["mean", "median"]).reset_index()
grouped_stocks

In [None]:
record_number = max_id_select - min_id_select
x = 4
y = int(np.ceil(record_number / x))
fig, axs = plt.subplots(y, x, sharex=True, sharey=True, figsize=(40, 40))

for index in range(record_number):
    sns.histplot(book_sample.loc[book_sample.stock_id == (index+min_id_select)].seconds_in_bucket, kde=True, bins=60, ax=axs[int(np.floor(index/x))][index%x])

plt.tight_layout()
plt.show()
plt.clf()

In [None]:
rand_time = rs.choice(df_book.time_id.unique())
print(rand_time)
book_sample = df_book[df_book.time_id == rand_time]

record_number = max_id_select - min_id_select
x = 4
y = int(np.ceil(record_number / x))
fig, axs = plt.subplots(y, x, sharex=True, sharey=True, figsize=(40, 40))

for index in range(record_number):
    sns.histplot(book_sample.loc[book_sample.stock_id == (index+min_id_select)].seconds_in_bucket, kde=True, bins=60, ax=axs[int(np.floor(index/x))][index%x])

plt.tight_layout()
plt.show()
plt.clf()

* **`stock_id = 24` and `stock_id = 25` do not exist, hence the empty plots.**
* **Some stocks seem to be traded steadily over time, while for others the activity is sparser.**
* **Picking random `time_id` values seems to converge on the conclusion that each stock has roughly the same trading pattern over time (i.e. the ones that are very active are always very active, and the ones that are not stay that way).**
* **High `order_count` and high `size` are correlated with the activity seen in `seconds_in_bucket` (see `stock_id` equal to 29, 43, 69, 124).**
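**The last observation comes from eyeballing the plots. A rough way to quantify it, sketched here on made-up toy frames, is to count book updates per stock and rank-correlate that count with the trade totals. Using the number of book rows as the "activity" measure is my assumption.**

```python
import pandas as pd

# Toy frames standing in for the book and trade data (the values are made up).
# One row per book update.
toy_book = pd.DataFrame({"stock_id": [20] * 5 + [21] * 2 + [22] * 8})
toy_trade = pd.DataFrame({
    "stock_id": [20, 20, 21, 22, 22, 22],
    "size": [100, 50, 20, 300, 200, 100],
    "order_count": [2, 1, 1, 5, 4, 3],
})

# Activity per stock: number of book updates.
book_updates = toy_book.groupby("stock_id").size().rename("book_updates")
# Trade totals per stock.
trade_totals = toy_trade.groupby("stock_id")[["size", "order_count"]].sum()

activity = trade_totals.join(book_updates)
# Pearson correlation on ranks, i.e. a Spearman rank correlation.
rank_corr = activity.rank().corr()
print(rank_corr)
```

**With the real frames, `toy_book` and `toy_trade` would be replaced by `df_book` and `df_trade`, restricted to the stocks present in both.**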

In [None]:
rand_stock = rs.choice(df_book.stock_id.unique())
print(rand_stock)
book_sample = df_book[df_book.stock_id == rand_stock][-50:]

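fig, ax = plt.subplots(figsize=(20,20))

# Plot both levels of each side of the book for the last 50 updates.
ax.plot(book_sample.ask_price1, color="firebrick", label="ask_price1")
ax.plot(book_sample.ask_price2, color="goldenrod", label="ask_price2")
ax.plot(book_sample.bid_price1, color="forestgreen", label="bid_price1")
ax.plot(book_sample.bid_price2, color="deepskyblue", label="bid_price2")
ax.legend()

plt.show()
plt.clf()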

# Other Comments

In [None]:
df_book.time_id.value_counts().sort_index()

In [None]:
np.sort(df_trade.stock_id.unique())

**Some of the identifiers are non-sequential (i.e. `time_id`, `stock_id`). For now, I do not know whether this is a problem, but it may cause trouble when applying time-series machine learning algorithms (e.g. LSTM).**

* **Should one consider that the price did not move during the gaps?**
* **Should one assume there were no trades during the gap periods?**
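**If the answer to both questions is yes, which is an assumption and not something the data confirms, then gaps inside a bucket could be filled as in the sketch below: book prices are carried forward and missing trade seconds get a size of zero. The toy values are made up and the grid is shortened to 10 seconds for readability.**

```python
import pandas as pd

# Toy frames for a single (stock_id, time_id) bucket (the values are made up).
toy_book = pd.DataFrame({
    "seconds_in_bucket": [0, 1, 5, 8],
    "bid_price1": [1.0000, 1.0001, 0.9999, 1.0002],
    "ask_price1": [1.0003, 1.0004, 1.0002, 1.0004],
})
toy_trade = pd.DataFrame({
    "seconds_in_bucket": [1, 5],
    "size": [100, 50],
})

# Dense grid of seconds: 0..9 here, it would be 0..599 for a real bucket.
full_grid = pd.RangeIndex(0, 10, name="seconds_in_bucket")

# Assumption 1: the book does not move between updates, so carry it forward.
dense_book = toy_book.set_index("seconds_in_bucket").reindex(full_grid).ffill()

# Assumption 2: a missing second means no trade, so the traded size is zero.
dense_trade = toy_trade.set_index("seconds_in_bucket").reindex(full_grid, fill_value=0)

print(dense_book)
print(dense_trade)
```

**Note that forward filling leaves NaN values at the start of a bucket whose first update is not at second 0.**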