In [None]:
from pathlib import Path

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# [EDA] Visualising trading data - Optiver

This notebook contains visualisations of the trade and book datasets provided for the Optiver Realized Volatility Prediction challenge.

We visualise individual trading sessions for randomly selected stocks, to start to get a feeling for the data and how trading sessions typically behave.

This is an active notebook and will continually be added to as we progress through the competition.


In [None]:
def load_data(root_path, stock_ids=None):
    """Load every parquet file under root_path into a single DataFrame.

    If stock_ids is provided, only folders for those stock ids are loaded."""
    data = []
    for folder in root_path.iterdir():
        # Folders take the format stock_id=X
        stock_id = int(folder.name.split("=")[1])
        # Skip unwanted stocks before touching their files
        if stock_ids is not None and stock_id not in stock_ids:
            continue
        # Each stock folder is expected to contain exactly one parquet file
        parquet_files = list(folder.iterdir())
        assert len(parquet_files) == 1
        df = pd.read_parquet(parquet_files[0])
        # Tag the rows with the stock id taken from the folder name
        df["stock_id"] = stock_id
        data.append(df)
    return pd.concat(data)

In [None]:
# Define the root paths for book and trade training data
root_path_trade = Path("/kaggle/input/optiver-realized-volatility-prediction/trade_train.parquet")
root_path_book = Path("/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet")

## 1. Trade data

Let's investigate the trade dataset.

As a reminder the following are the definitions of the columns:

* `stock_id` - ID code for the stock. Not all stock IDs exist in every time bucket - this is because this dataset shows trades that have taken place in a 10 minute period - less liquid stocks may not be traded in a given 10 min period.
* `time_id` - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
* `seconds_in_bucket` - Number of seconds from the start of the bucket. Trade and book data are taken from the same time window, but trade data is generally sparser, so in the trade data this field does not necessarily start from 0.
* `price` - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
* `size` - The total number of shares traded.
* `order_count` - The number of unique trade orders taking place.

The data is split into buckets as identified by the `time_id` column. Each time bucket represents 10 minutes of trading and the `seconds_in_bucket` column indicates how far through the time bucket each trading event takes place.
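
As a quick check of how sparse a bucket is, the sketch below summarises the first and last second with a trade for each `stock_id` / `time_id` pair. It uses a small made-up frame (`toy_trades`, not the competition data) with the same column names, so the numbers are illustrative only.

```python
import pandas as pd

# Toy trade data (made-up values) for two stocks in the same 10-minute bucket
toy_trades = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1],
    "time_id": [5, 5, 5, 5, 5],
    "seconds_in_bucket": [0, 17, 340, 42, 598],
    "price": [1.0000, 1.0003, 0.9998, 1.0010, 1.0007],
})

# First and last second with a trade, and number of trade rows, per bucket
bucket_coverage = (
    toy_trades.groupby(["stock_id", "time_id"])["seconds_in_bucket"]
    .agg(first_second="min", last_second="max", n_rows="count")
    .reset_index()
)
print(bucket_coverage)
```

In this toy example stock 1 has no trade until second 42, which is the situation the `seconds_in_bucket` note describes. The same `groupby` could be applied to `trade_df` once it is loaded.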

In [None]:
# First load the full trade dataset into memory
trade_df = load_data(root_path_trade)

In [None]:
print(trade_df.shape)
trade_df.dtypes

In [None]:
trade_df.describe()

### 1.1 Visualising trading sessions

Let's visualise some of the trading data for a few different stocks and sessions. We will randomly pick `stock_id` and `time_id` values so we can run the cell multiple times to start to get a sense of the dataset.

We will use relplots from seaborn, where the size of each bubble indicates the number of shares traded and its colour indicates the number of orders in that second.

In [None]:
from random import sample

# Get some randomly sampled time_ids and stock_ids
time_id_sample = sample(list(trade_df["time_id"].unique()), 3)
stock_id_sample = sample(list(trade_df["stock_id"].unique()), 5)

mask = (trade_df["time_id"].isin(time_id_sample)) & (trade_df["stock_id"].isin(stock_id_sample))
trade_df_sample = trade_df[mask]

# Create relplot on the subset of data
sns.relplot(x="seconds_in_bucket", y="price", hue="order_count", size="size",
            col="time_id", row="stock_id", sizes=(40,400), data=trade_df_sample)

Rerunning the cell above a few times starts to paint a picture of typical trading conditions.

**Some observations**:
* Usually trading conditions are relatively benign and trading is within a ~0.1% range
* Sometimes prices trend up/down but usually just trade in a fairly tight range
* Very occasionally there are sizeable moves (~2%)
* Benign trading conditions can have few or many orders taking place - no obvious relationship from eyeballing the data
* Different stocks trade differently - some have many more trades in a 10 min window than others, order counts and sizes also vary.
* Usually there are lots of trades in the 10 min period - but for some stocks this can be low (10s)

**Some questions this brings up that can be explored in further EDA**:
* Do large price changes occur when order counts / sizes are low?
* Are sudden price changes triggered by large orders?
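
One possible starting point for these questions is to reduce each bucket to a price range and its total order count and size, then compare them. The sketch below does this on a made-up frame (`toy_trades`); the column names follow the trade dataset, but the values and the `price_range` summary are assumptions chosen for illustration, not results from the real data.

```python
import pandas as pd

# Toy trade data (made-up values): one stock, three 10-minute buckets
toy_trades = pd.DataFrame({
    "stock_id": [0] * 8,
    "time_id": [5, 5, 5, 11, 11, 11, 16, 16],
    "price": [1.000, 1.001, 1.000, 1.000, 1.020, 1.010, 0.999, 1.000],
    "size": [100, 200, 100, 10, 20, 10, 300, 250],
    "order_count": [3, 4, 2, 1, 1, 1, 5, 6],
})

# One row per (stock_id, time_id): price extremes plus total activity
per_bucket = toy_trades.groupby(["stock_id", "time_id"]).agg(
    price_min=("price", "min"),
    price_max=("price", "max"),
    total_orders=("order_count", "sum"),
    total_size=("size", "sum"),
).reset_index()

# Price range within the bucket, largest moves first
per_bucket["price_range"] = per_bucket["price_max"] - per_bucket["price_min"]
per_bucket = per_bucket.sort_values("price_range", ascending=False).reset_index(drop=True)
print(per_bucket[["stock_id", "time_id", "price_range", "total_orders", "total_size"]])
```

On the real `trade_df`, a scatter plot of `price_range` against `total_orders` would be a natural next step.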

### 1.2 Variability in trading conditions across the different stocks 

We will now group the trade data by `stock_id` to explore variability between stocks a bit further.

In [None]:
# Group on stock_id and get the mean and standard deviation of price, order_count and size
grouped_trade = trade_df.groupby(["stock_id"]).agg({"price": ["mean", "std", "count"],  # Count here gives the number of trades
                                                    "order_count": ["mean", "std", "sum"],
                                                    "size": ["mean", "std", "sum"]}).reset_index()

In [None]:
grouped_trade.hist(figsize=(15, 15));

In [None]:
grouped_trade.describe()

**Observations**:

* The total number of trades ranges over two orders of magnitude between stocks
* Some stocks have much greater variability in price than others (~5x bottom to top)
* Order counts average around 3-4 for most stocks, with little variation between stocks.
* Some stocks have much greater variability in the order count than others
* A few outlier stocks in terms of order size (maybe low price stocks so size higher for given \$ value order)
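
Aggregating with a dict of lists, as above, produces two-level column names such as `("price", "mean")`, which are awkward to sort or plot by. A minimal sketch of flattening them, using a made-up frame (`toy_trades`) rather than the competition data:

```python
import pandas as pd

# Toy trade data (made-up values) for two stocks
toy_trades = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1],
    "price": [1.000, 1.001, 0.999, 1.002, 1.004],
    "size": [100, 50, 150, 10, 30],
})

grouped = toy_trades.groupby("stock_id").agg({
    "price": ["mean", "std", "count"],  # count gives the number of trade rows
    "size": ["mean", "sum"],
})

# Join the two column levels into single names like "price_mean"
grouped.columns = ["_".join(col) for col in grouped.columns]
grouped = grouped.reset_index()

# With flat names, ranking stocks by number of trades is a one-liner
most_traded = grouped.sort_values("price_count", ascending=False)
print(most_traded)
```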

### 1.3 Questions for further EDA I'll answer in updates to this notebook

* Which stocks are most traded, which stocks least traded?
* Which stocks have the largest price fluctuations?
* Are there periods where all stocks are volatile?
* Are there periods where individual stocks are particularly volatile?

## 2. Order book data

Let's look at the order book data now. As a reminder, the following columns are defined:

* `stock_id` - ID code for the stock. Not all stock IDs exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
* `time_id` - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
* `seconds_in_bucket` - Number of seconds from the start of the bucket, always starting from 0.
* `bid_price[1/2]` - Normalized prices of the most/second most competitive buy level.
* `ask_price[1/2]` - Normalized prices of the most/second most competitive sell level.
* `bid_size[1/2]` - The number of shares on the most/second most competitive buy level.
* `ask_size[1/2]` - The number of shares on the most/second most competitive sell level.

In [None]:
# All the book data won't fit in Kaggle notebook memory, so we pick a random sample of stock ids
all_stock_ids = [int(path.name.split("=")[1]) for path in root_path_book.iterdir()]
stock_id_sample = sample(all_stock_ids, 5)
print(f"Loading data for stock(s): {stock_id_sample}")
book_df = load_data(root_path_book, stock_ids=stock_id_sample)

We'll compute the weighted average price (WAP) and the log return. Since the log return uses `diff`, we apply it within each `stock_id` / `time_id` pair so that a return is never computed across two different buckets or stocks.
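
As an aside, the per-pair `diff` can also be expressed as a single grouped operation instead of an explicit loop. The sketch below uses a made-up frame (`toy_book`) with the book column names; the `log_wap` helper column is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Toy book data (made-up values): two stocks, two buckets for stock 0
toy_book = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1],
    "time_id": [5, 5, 11, 5, 5],
    "seconds_in_bucket": [0, 1, 0, 0, 3],
    "bid_price1": [0.9999, 1.0000, 1.0010, 0.9980, 0.9985],
    "ask_price1": [1.0001, 1.0002, 1.0012, 1.0020, 1.0025],
    "bid_size1": [100, 150, 80, 10, 40],
    "ask_size1": [100, 100, 120, 40, 10],
})

# WAP from the most competitive bid / ask level
toy_book["wap"] = (
    toy_book["bid_price1"] * toy_book["ask_size1"]
    + toy_book["ask_price1"] * toy_book["bid_size1"]
) / (toy_book["bid_size1"] + toy_book["ask_size1"])

# diff() runs inside each (stock_id, time_id) group, so the first row
# of every group is NaN rather than a return against another bucket
toy_book["log_wap"] = np.log(toy_book["wap"])
toy_book["log_return"] = toy_book.groupby(["stock_id", "time_id"])["log_wap"].diff()
print(toy_book[["stock_id", "time_id", "seconds_in_bucket", "wap", "log_return"]])
```

This assumes rows are already ordered by `seconds_in_bucket` within each group, as they are in the toy frame.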

In [None]:
# WAP based on the most competitive bid / ask prices
book_df['wap'] = (book_df['bid_price1'] * book_df['ask_size1'] + book_df['ask_price1'] * book_df['bid_size1']) / \
                 (book_df['bid_size1']+ book_df['ask_size1'])

In [None]:
# Compute the log return for each unique stock_id and time_id pairing
# NOTE: this takes some time to run
updated_data = []
stock_time_pairs = book_df.groupby(["time_id", "stock_id"]).size().reset_index().sort_values(["stock_id", "time_id"])
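# Use enumerate for progress: after sorting, the DataFrame index is no longer a running count
for n_done, (_, row) in enumerate(stock_time_pairs.iterrows()):
    if n_done % 2000 == 0:
        print(f"Completed {n_done} of {stock_time_pairs.shape[0]} rows.")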
    mask = (book_df["stock_id"] == row["stock_id"]) & (book_df["time_id"] == row["time_id"])
    subset = book_df[mask].copy(deep=True)
    subset['log_return'] = np.log(subset['wap']).diff()
    updated_data.append(subset)
    
book_df = pd.concat(updated_data)

In [None]:
print(book_df.shape)
book_df.head()

In [None]:
# Get some randomly sampled time_ids and stock_ids for plots
time_id_sample = sample(list(book_df["time_id"].unique()), 3)
stock_id_sample = sample(list(book_df["stock_id"].unique()), 5)

### 2.1 Visualising WAP and bid/ask prices

We'll plot some line plots of bid, ask and WAP for another random sample of `stock_id` and `time_id` values.

In [None]:
mask = (book_df["time_id"].isin(time_id_sample)) & (book_df["stock_id"].isin(stock_id_sample))
book_df_sample = book_df.loc[mask, ["stock_id", "time_id", "bid_price1", "ask_price1", "wap", "seconds_in_bucket", "log_return"]]
book_df_sample_melted = pd.melt(book_df_sample, id_vars=["stock_id", "time_id", "seconds_in_bucket"], 
                         value_vars=["bid_price1", "ask_price1", "wap"])

# Create relplot on the subset of data
sns.relplot(x="seconds_in_bucket", y="value", hue="variable", 
            col="time_id", row="stock_id",data=book_df_sample_melted, kind="line")

**Observations**:

* Some stocks have wide bid-ask spreads relative to others, which tends to lead to larger log returns and therefore higher realised volatility
* `time_id` seems to affect general trading conditions for all stocks - some time_ids have more liquid trading conditions and some less liquid.
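
To put numbers on the first observation, one option is to compare the mean relative spread with the realised volatility per `stock_id` / `time_id` pair, taking realised volatility as the square root of the sum of squared log returns. The sketch below runs on a made-up frame (`toy_book`); the `rel_spread` and `summary` names are assumptions for illustration, and the toy values are built so that one stock has a much wider spread than the other.

```python
import numpy as np
import pandas as pd

# Toy book data (made-up values): stock 0 has a tight spread, stock 1 a wide one
toy_book = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1],
    "time_id": [5] * 6,
    "bid_price1": [0.9999, 1.0000, 0.9999, 0.9980, 0.9985, 0.9975],
    "ask_price1": [1.0001, 1.0002, 1.0001, 1.0020, 1.0025, 1.0015],
    "bid_size1": [100, 150, 100, 10, 40, 10],
    "ask_size1": [100, 100, 150, 40, 10, 30],
})

bid, ask = toy_book["bid_price1"], toy_book["ask_price1"]
toy_book["wap"] = (bid * toy_book["ask_size1"] + ask * toy_book["bid_size1"]) / (
    toy_book["bid_size1"] + toy_book["ask_size1"]
)

# Spread relative to the mid price
toy_book["rel_spread"] = (ask - bid) / ((ask + bid) / 2)

# Log returns computed within each (stock_id, time_id) group
toy_book["log_wap"] = np.log(toy_book["wap"])
toy_book["log_return"] = toy_book.groupby(["stock_id", "time_id"])["log_wap"].diff()

# Realised volatility: sqrt of the sum of squared log returns in the bucket
summary = toy_book.groupby(["stock_id", "time_id"]).agg(
    mean_rel_spread=("rel_spread", "mean"),
    realized_vol=("log_return", lambda r: np.sqrt((r.dropna() ** 2).sum())),
).reset_index()
print(summary)
```

Applied to `book_df`, a scatter of `mean_rel_spread` against `realized_vol` would show whether the relationship seen by eye holds across the sampled stocks.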

### 2.2 Visualising log returns

Using the same sample of `stock_id` and `time_id` values, we can plot log returns too.

In [None]:
book_df_sample_melted = pd.melt(book_df_sample, id_vars=["stock_id", "time_id", "seconds_in_bucket"], 
                         value_vars=["log_return"])

# Create relplot on the subset of data
sns.relplot(x="seconds_in_bucket", y="value", hue="variable", 
            col="time_id", row="stock_id",data=book_df_sample_melted, kind="line")