In [None]:
%%html
<style>
h1 { color: #7c795d; text-align:center; font-family: 'Trocchi', serif; font-size: 45px; font-weight: normal; line-height: 48px; margin: 0; }

h2 { color: #7c795d; text-align:center; font-family: 'Trocchi', serif; font-size: 20px; font-weight: normal; line-height: 48px; margin: 0; }

h3 { color: #7c795d; text-align:center; font-family: 'Trocchi', serif; font-size: 16px; font-weight: normal; line-height: 48px; margin: 0; }

h4 { color: #7c795d; text-align:center; font-family: 'Trocchi', serif; font-size: 14px; font-weight: normal; line-height: 48px; margin: 0; }
      
</style>
<hr>
<h1>Optiver Market Volatility Prediction</h1>
<h2>By Thomas Meli</h2>
<hr>

![Market Pic](https://www.thomasmeli.tech/wp-content/uploads/2021/06/Monitor-Stock-Business-Trading-Exchange-Finance-1863880.jpg)

In [None]:
%%html
<style>
blockquote {
font-family: Helvetica, Arial, serif;
font-size: 14px;
font-style: italic;
width: 90%;
margin: 0.25em 0 2em;
padding: 0.25em 40px;
line-height: 1.45;
position: relative;
color: #383838;
}

blockquote:before {
font-family: Georgia, serif;
display: block;
content: "\201C";
font-size: 100px;
position: absolute;
left: -10px;
top: -35px;
color: #ccc;
}

blockquote cite {
color: #870237;
font-size: 13px;
display: block;
margin-top: 5px;
}
 
blockquote cite:before {
content: "\2014 \2009";
}

</style>

<blockquote>
<p><br></p>
<p><b>Realized volatility</b> is the assessment of variation in returns for an investment product by analyzing its historical returns within a defined time period. Assessment of degree of uncertainty and/or potential financial loss/gain from investing in a firm may be measured using variability/ volatility in stock prices of the entity. </p>
<p>The realized volatility or actual volatility in the market is caused by two components- a continuous volatility component and a jump component, which influence the stock prices. Continuous volatility in a stock market is affected by the intra-day trading volumes. For example, a single high volume trade transaction can introduce a significant variation in the price of an instrument. </p>
<cite>https://www.wallstreetmojo.com/realized-volatility/</cite>
</blockquote>

In [None]:
%%html

<hr>
<h2>Introduction</h2>
<hr>

**Introduction**

**Purpose of competition: Given 10 minutes of book and trade data, predict the realized volatility of the next 10 minutes.  The target is computed from the log returns of the weighted average price (WAP) over those 10 minutes.**

As explained in the introduction notebook, Optiver is, among other things, a **market maker**: an intermediary that continuously quotes prices at which it is willing to buy (bids) and sell (asks), so that buyers and sellers can trade when they want to.

**They are interested in predicting volatility accurately** since increased volatility allows more flexibility in trading and allows them to make more appealing offers between sellers and buyers.  For consumers, volatility mostly means uncertainty and risk; for traders, it also means opportunity.  Since volatility refers to the amount of dispersion (spread) of an asset's returns, higher volatility makes gains as well as losses larger.  Traders prefer it because it makes more profit possible.

One notion of **volatility** is **the standard deviation of an asset's returns (the percent change in its price).**  At the most basic level, you can compute it with **.pct_change().std()** on a dataframe of prices sampled at some timestep (daily, weekly (5 trading days), monthly (21 trading days), yearly (252 trading days), etc.).  **The date range is important for the notion of realized vs. implied volatility:** realized volatility is measured from past returns over a chosen window, while implied volatility is backed out of option prices and looks forward.  However, in this project you are advised to use a standardized value and use log returns.
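
To make the difference between the two calculations concrete, here is a minimal sketch on a made-up price series (the prices and variable names are illustrative assumptions, not competition data). The last line follows the realized-volatility definition used in the competition tutorial: the square root of the sum of squared log returns.

```python
import numpy as np
import pandas as pd

# Hypothetical price series, for illustration only.
prices = pd.Series([100.0, 101.0, 100.5, 102.0, 101.0])

# Simple-return volatility: standard deviation of the percent changes.
simple_vol = prices.pct_change().std()

# Log returns: log(P_t / P_{t-1}); the first value is NaN and is dropped.
log_returns = np.log(prices).diff().dropna()

# Realized volatility: square root of the sum of squared log returns.
realized_vol = np.sqrt((log_returns ** 2).sum())

print(f"simple-return std: {simple_vol:.5f}")
print(f"realized volatility: {realized_vol:.5f}")
```

A convenient property of log returns is that they add up across time steps, which is one reason they are preferred over simple returns here.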

We are also given **the book data, which represents the intentions of buyers and sellers.** The denser this book is, the more variance there may be in bids and asks relative to the trade that (possibly) happens.  The book can therefore also be used to gauge volatility through the difference between the best ask price and the best bid price, known as **the bid-ask spread**.  The bid and ask prices are also combined into a weighted average price (WAP), weighted by the size of the orders at each level.
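
As a rough sketch of these two quantities, the snippet below computes a level-1 WAP and the bid-ask spread on a tiny hand-made book. The column names follow the competition's book data; the values are invented, and the WAP formula is the one from the competition tutorial, where each price is weighted by the size on the opposite side.

```python
import pandas as pd

# Hypothetical book snapshots; the values are made up for illustration.
book = pd.DataFrame({
    "bid_price1": [0.999, 1.000],
    "ask_price1": [1.001, 1.002],
    "bid_size1": [100, 50],
    "ask_size1": [200, 50],
})

# WAP: each price is weighted by the size resting on the opposite side.
book["wap1"] = (
    book["bid_price1"] * book["ask_size1"]
    + book["ask_price1"] * book["bid_size1"]
) / (book["bid_size1"] + book["ask_size1"])

# Bid-ask spread: best ask minus best bid.
book["spread"] = book["ask_price1"] - book["bid_price1"]

print(book[["wap1", "spread"]])
```

When the bid and ask sizes are equal, the WAP is simply the midpoint of the two prices; otherwise it leans toward the side with less size.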

In common terms, volatility measures the dispersion of an asset's returns: the more variance there is, the more uncertainty there is and the riskier the asset is.  In quantitative analysis, to make these metrics comparable across assets and across different time scales, they are standardized over time (annualized) and returns are converted into log returns.
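
A minimal sketch of annualization, assuming daily prices and the common convention of 252 trading days per year (the price series is invented). It relies on the usual simplifying assumption that daily returns are uncorrelated, so variance grows linearly with time and volatility with its square root.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices, for illustration only.
daily_prices = pd.Series([100.0, 101.0, 99.5, 100.5, 102.0, 101.5])

# Daily log returns and their standard deviation (daily volatility).
daily_log_returns = np.log(daily_prices).diff().dropna()
daily_vol = daily_log_returns.std()

# Scale by the square root of the number of periods in a year.
TRADING_DAYS_PER_YEAR = 252
annualized_vol = daily_vol * np.sqrt(TRADING_DAYS_PER_YEAR)

print(f"daily: {daily_vol:.4f}, annualized: {annualized_vol:.4f}")
```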

**Load code and imports**

In [None]:
import numpy as np # linear algebra

import matplotlib.pyplot as plt

import seaborn as sns
sns.set(rc={'figure.figsize':(10,8)})

import pandas as pd 
pd.set_option("display.precision", 3)  # Display precision (the short key "precision" is ambiguous in newer pandas)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

import tensorflow as tf
import sklearn as sk
from IPython.display import display, HTML, IFrame

import plotly.express as px

import os

In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
trade_example =  pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0')

In [None]:
# Load the pre-engineered features and drop the ('seconds_in_bucket', 'min') book column.
full_train_w_features = pd.read_csv("../input/optiver-full-train-ml-ready/full_train.csv").drop("('seconds_in_bucket', 'min')_book", axis=1)
# Keep stock_id 0 only; .copy() avoids a SettingWithCopyWarning on the edits below.
full_train_w_features = full_train_w_features[full_train_w_features.stock_id == 0].copy()
full_train_w_features = full_train_w_features.replace([np.inf, -np.inf], 0)  # Replace infinities from pct_change with zero.
full_train_w_features = full_train_w_features.dropna()

print("The dataframe we will be investigating is a dataframe with statistically engineered features and the target already merged.\n")
print(f"Shape of engineered features where stock_id = 0 is {full_train_w_features.shape} \n\n")
full_train_w_features.head(2)

---
## TL;DR - Insight Summary
* Standard deviations of price variables are the features most positively correlated with the target for stock_id = 0.
* Minimums of price variables are the features most negatively correlated with the target for stock_id = 0.
* All time IDs in book and trade match and none are missing.
* The total number of "Seconds in Bucket" values differs between trade and book by an order of magnitude.
* There are no missing cells in either dataframe.

---
### Insight 1: Standard Deviations of Price Variables Are Most Positively Correlated with the Target for stock_id = 0
### Insight 2: Minimums of Price Variables Are Most Negatively Correlated with the Target for stock_id = 0
---

In [None]:
corr = full_train_w_features.corr()
target_corr = pd.DataFrame(corr["target"]).rename(columns = {"target":"pearson"})

spearman_corr = full_train_w_features.corr(method = "spearman")
spearman_target_corr = pd.DataFrame(spearman_corr["target"]).rename(columns = {"target":"spearman"})

kendall_corr = full_train_w_features.corr(method = "kendall")
kendall_target_corr = pd.DataFrame(kendall_corr["target"]).rename(columns = {"target":"kendall"})

In [None]:
merged_corr = pd.concat([target_corr, spearman_target_corr, kendall_target_corr], axis = 1).drop("stock_id").sort_values("pearson", ascending=False)

**3 Correlation Values with target - some notes on the different wap values**

* wap1 = derived from bid1, ask1, etc.
* wap2 = derived from bid2, ask2, etc.
* wap = averaged from wap1 and wap2

All values were calculated with a groupby aggregation.  See comments with Yirun.
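
The feature names used in this notebook, such as `('wap2', 'min')`, look like flattened column tuples from a pandas groupby aggregation. The sketch below shows one way such features could be produced; the toy data, the choice of aggregations, and the flattening step are assumptions for illustration, not the exact pipeline behind `full_train.csv`.

```python
import pandas as pd

# Hypothetical per-row WAP values for two time buckets.
book = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11, 11],
    "wap1": [1.000, 1.002, 1.001, 0.998, 0.999, 1.001],
    "wap2": [1.001, 1.003, 1.000, 0.997, 1.000, 1.002],
})

# wap is the average of wap1 and wap2, as described in the notes above.
book["wap"] = (book["wap1"] + book["wap2"]) / 2

# One row per time_id, with several statistics per column.
features = book.groupby("time_id")[["wap1", "wap2", "wap"]].agg(["min", "max", "std"])

# Flatten the (column, statistic) tuples into strings like "('wap2', 'min')".
features.columns = [str(col) for col in features.columns]
features = features.reset_index()

print(features.columns.tolist())
```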

In [None]:
merged_corr.shape

In [None]:
(
    merged_corr.dropna()
    .style.background_gradient(cmap="coolwarm", axis=0, vmin=-1, vmax=1)
    .set_properties(**{"font-size": "14px"})
    .set_caption("3 Correlation Coefficients of Statistical Features")
    .set_properties(padding="20px", border="2px solid white")
)

## Seaborn Regression Plots of Highly Correlated Features Against the Target

In [None]:
sns.set_theme()
sns.set_style("whitegrid")

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2)


sns.regplot(data=full_train_w_features, x="('bid_price2', 'min')", y="target", ax = ax1)
ax1.set_title("Minimum Bid Price2")

sns.regplot(x="('wap2', 'min')", y="target", data=full_train_w_features, ax=ax2)
ax2.set_title("Minimum Weighted Average Price Computed from '2' Columns")

sns.regplot(x="('wap_log_return', 'Median_abs_deviation')", y="target", data = full_train_w_features, ax=ax3)
ax3.set_title("Median Absolute Deviation of WAP Log Return")

sns.regplot(x="('ask_price2', 'max')", y="target", data = full_train_w_features, ax=ax4)
ax4.set_title("Max Asking Price 2")

plt.tight_layout(pad=3)
plt.show()

## Are Standardized Features More Highly Correlated than the most correlated above? - No ##

In [None]:
std_features = pd.DataFrame(
    StandardScaler().fit_transform(full_train_w_features.fillna(0)),
    columns = full_train_w_features.columns
)

In [None]:
std_corr = pd.DataFrame(std_features.corr()['target']) \
            .sort_values('target', ascending=False)

In [None]:
std_corr.head()

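Standardizing cannot change these results: Pearson correlation is invariant to shifting and positive rescaling of a variable, so the standardized features have exactly the same correlations with the target as the raw ones.  (Much of the data is also already standardized, or the std. dev. is an explicit feature.)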
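
A quick numerical check of the "No" above, as a minimal sketch on synthetic data. The column names and the z-scoring by hand are illustrative assumptions; `StandardScaler` applies the same shift and rescale to each column.

```python
import numpy as np
import pandas as pd

# Synthetic feature and target with a loose linear relationship.
rng = np.random.default_rng(0)
raw = pd.DataFrame({"feature": rng.normal(5.0, 2.0, 200)})
raw["target"] = 0.3 * raw["feature"] + rng.normal(0.0, 1.0, 200)

# Z-score every column: subtract the mean, divide by the population std.
scaled = (raw - raw.mean()) / raw.std(ddof=0)

raw_corr = raw["feature"].corr(raw["target"])
scaled_corr = scaled["feature"].corr(scaled["target"])

print(f"raw: {raw_corr:.6f}, standardized: {scaled_corr:.6f}")
```

The two printed values agree up to floating-point error, which is why the standardized ranking matches the raw one.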

## Are Any Interactions Between These Variables Higher in Correlation with Target Than The Original Variables? (presentation in process - see output for polynomial correlation csv for now) ##

In [None]:
stock_ids = full_train_w_features['stock_id']
time_ids = full_train_w_features['time_id']

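# Build degree-2 polynomial (squared and pairwise interaction) features from everything except the ID columns.
poly_input = full_train_w_features.drop(['stock_id', 'time_id'], axis=1)
polyfeaturizer = PolynomialFeatures(degree=2).fit(poly_input)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() keeps the real column names.
train_polyfeats = pd.DataFrame(polyfeaturizer.transform(poly_input),
                               columns = polyfeaturizer.get_feature_names_out())

# Look the target up by name rather than assuming it is the first column ("x0").
poly_target = train_polyfeats['target']
train_polyfeats = train_polyfeats.drop("1", axis=1)  # Drop the constant bias column.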
train_polyfeats.shape

In [None]:
train_polyfeats.head()

In [None]:
poly_corr = train_polyfeats.corr().sort_values(by="target", ascending=False)
poly_corr.to_csv("polynomial_features_corr.csv", index = True)

In [None]:
#poly_corr.head(20)

In [None]:
#poly_corr.tail(20)

## What are the features that are highly correlated with the target, but least correlated with each other? (in process)
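
One possible way to approach this question is a greedy filter: walk through the features from most to least correlated with the target and keep a feature only if it is not too correlated with anything already kept. The sketch below is an assumption about how this could be done, not a finished analysis; the function name `select_features`, both thresholds, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def select_features(df, target="target", min_target_corr=0.3, max_pair_corr=0.8):
    """Greedily keep features tied to the target but not to each other."""
    corr = df.corr().abs()
    # Candidates sorted by absolute correlation with the target, strongest first.
    candidates = corr[target].drop(target).sort_values(ascending=False)
    selected = []
    for name, target_corr in candidates.items():
        if target_corr < min_target_corr:
            break  # Everything after this is even less correlated.
        # Keep the feature only if it is not redundant with those already kept.
        if all(corr.loc[name, kept] <= max_pair_corr for kept in selected):
            selected.append(name)
    return selected

# Synthetic demo: "a_copy" is nearly identical to "a"; "noise" is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),
    "b": b,
    "noise": rng.normal(size=500),
    "target": a + b,
})

selected = select_features(demo)
print(selected)
```

On this toy data the filter keeps `b` plus only one of the two near-duplicate columns, and drops the noise column.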

---
## Automated Pandas Profiles Attached Below with Basic Data Exploration
---

## Links to Pandas Profiles.

I intend this notebook to be a resource that hosts pandas profiles for:
* Now: just the basic dataset.
* As I go: Features generated such as various technical indicators.

**Externally Processed Pandas Profile Exploratory Reports**

Minimal reports are attached to this notebook.  All reports and figures will be attached to this dataset:

* **Optiver Visual Data Reports:** https://www.kaggle.com/tpmeli/optivervisualreports

<hr>

## Exploring the Data (Visual guide to the data coming soon)

We are primarily given Book Data and Trade Data.  Tip: you may want to keep the files in parquet format, or convert them to csv offline, because converting to csv inside the notebook tends to crash the kernel.

**Book Data** - is information about the bid (buying) and ask (selling) prices and sizes over time.  This can be used for feature generation.

**Trade Data** - is information about the trade price of the stock over time as well as the size of the trade and the number of orders.
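
The summary claims about matching time IDs and missing cells can be checked with a few set operations. This is a minimal sketch on toy frames with made-up values; in this notebook, `book_example` and `trade_example` would take the place of the hypothetical `book` and `trade` below.

```python
import pandas as pd

# Hypothetical stand-ins for the book and trade data of one stock.
book = pd.DataFrame({"time_id": [5, 5, 11, 16], "seconds_in_bucket": [0, 1, 0, 3]})
trade = pd.DataFrame({"time_id": [5, 11, 16], "seconds_in_bucket": [21, 46, 50]})

# Compare the sets of time IDs in both directions.
book_ids = set(book["time_id"].unique())
trade_ids = set(trade["time_id"].unique())
only_in_book = book_ids - trade_ids
only_in_trade = trade_ids - book_ids
ids_match = not only_in_book and not only_in_trade

# Count missing cells across both dataframes.
missing_cells = book.isna().sum().sum() + trade.isna().sum().sum()

print(f"time IDs match: {ids_match}, missing cells: {missing_cells}")
```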

### Important Point: Time ID is NOT sequential

![https://www.thomasmeli.tech/wp-content/uploads/2021/07/time_id_important.jpg](https://www.thomasmeli.tech/wp-content/uploads/2021/07/time_id_important.jpg)

As I shared [here](https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/250706), the information we have about the next 10 minutes is really contained in target, which has the volatility of the next 10 minutes.

There are still a lot of cool ideas for getting some information out of this.

* If prices are assumed to be approximately normally distributed, the volatility may actually allow us to infer information about the 'missing' next dataset.
* It might be useful to label each row in each bucket and have a sense if interesting things are happening towards the beginning or end of the bucket.
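
The second idea, labeling rows by where they fall in the bucket, could be sketched as follows. The three equal phases, the label names, and the toy rows are assumptions for illustration; the only fact taken from the data is that `seconds_in_bucket` runs from 0 to 599 within a 10-minute bucket.

```python
import pandas as pd

# Hypothetical book rows; values are made up for illustration.
book = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11],
    "seconds_in_bucket": [0, 250, 590, 120, 450],
})

# Split each bucket into three phases: [0, 200), [200, 400), [400, 600).
book["bucket_phase"] = pd.cut(
    book["seconds_in_bucket"],
    bins=[0, 200, 400, 600],
    labels=["early", "middle", "late"],
    right=False,
)

# How many rows fall in each phase of each bucket.
phase_counts = book.groupby(["time_id", "bucket_phase"], observed=True).size()
print(phase_counts)
```

Statistics such as the standard deviation of log returns could then be aggregated per phase to see whether activity concentrates at the start or end of a bucket.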

I'll get working on these ideas and publish the results in a new notebook.

Let me know if you want me to explore any other ideas.

## External Resources and further information

**Read more about:**
* Why volatility is so important to investors - https://www.investopedia.com/articles/financial-theory/08/volatility.asp
* Bid ask spread: https://www.investopedia.com/trading/basics-of-the-bid-ask-spread/
* Parquet files: https://miuv.blog/2018/08/21/handling-large-amounts-of-data-with-parquet-part-1/

#### Thanks for reading through my (ongoing) notebook.  I hope it was useful for you :) 

<hr>