All the data wrangling and volatility calculation code was taken from another notebook of mine, [Overly simplified OLS prediction](https://www.kaggle.com/shahmahdihasan/overly-simplified-ols-prediction). The goal of this notebook is to cluster the *stock_id*s based on their *realized volatility*.

### Importing all the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score
import glob
from collections import Counter
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

In [None]:
order_book_training = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/*')

# custom aggregate function
def wap2vol(df):
    # wap2vol stands for WAP to Realized Volatility
    temp = np.log(df).diff() # calculating tick-to-tick log returns
    # returning realized volatility
    return np.sqrt(np.sum(temp**2)) 



# function for calculating realized volatility per time id for a given stock
def rel_vol_time_id(path):
    # book: book is an order book
    book = pd.read_parquet(path) # order book for a stock id loaded
    # calculating WAP
    p1 = book["bid_price1"]
    p2 = book["ask_price1"]
    s1 = book["bid_size1"]
    s2 = book["ask_size1"]
    
    book["WAP"] = (p1*s2 + p2*s1) / (s1 + s2)
    # calculating realized volatility for each time_id
    transbook = book.groupby("time_id")["WAP"].agg(wap2vol)
    return transbook



All the necessary functions are in place; now let's calculate the realized volatility for each *(stock_id, time_id)* pair.
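Before running this over every file, here is a quick sanity check of the two formulas used above. It is only a sketch: the three-row order book and the `toy_*` names are invented for illustration and do not come from the competition data. WAP weights each side's price by the size on the opposite side, and realized volatility is the square root of the sum of squared log returns.

```python
import numpy as np
import pandas as pd

# Toy order book for a single time_id; all numbers are invented.
toy_book = pd.DataFrame({
    "bid_price1": [1.00, 1.01, 1.02],
    "ask_price1": [1.02, 1.03, 1.04],
    "bid_size1": [100, 100, 100],
    "ask_size1": [100, 100, 100],
})

# WAP: each price is weighted by the size on the opposite side of the book.
toy_wap = (
    toy_book["bid_price1"] * toy_book["ask_size1"]
    + toy_book["ask_price1"] * toy_book["bid_size1"]
) / (toy_book["bid_size1"] + toy_book["ask_size1"])

# Tick-to-tick log returns; the first entry is NaN and is dropped.
toy_log_returns = np.log(toy_wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns.
toy_realized_vol = float(np.sqrt((toy_log_returns ** 2).sum()))
print(toy_wap.tolist(), toy_realized_vol)
```

With equal sizes on both sides, the WAP reduces to the mid-price, which makes the result easy to verify by hand.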

In [None]:
%%time 
stock_id = []
time_id = []
relvol = []
for i in order_book_training:
    # finding the stock_id
    temp_stock = int(i.split("=")[1])
    # find the realized volatility for all time_id of temp_stock
    temp_relvol = rel_vol_time_id(i)
    stock_id += [temp_stock]*temp_relvol.shape[0]
    time_id += list(temp_relvol.index)
    relvol += list(temp_relvol)

past_volatility = pd.DataFrame({"stock_id": stock_id, "time_id": time_id, "volatility": relvol})

In [None]:
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
joined = train.merge(past_volatility, on = ["stock_id","time_id"], how = "left")
stockID = 0
sns.scatterplot(data = joined[joined["stock_id"]==stockID], x = "target", y = "volatility")
plt.show()

In [None]:
# finding the number of time_id each stock_id has 
count_stock_id = Counter(stock_id)

In [None]:
count_stock_id

We can see that not every *stock_id* has the same number of *time_id* values. For now, I am proceeding only with the *stock_id*s that have exactly 3830 *time_id* values.
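As a rough sketch of this filtering step, the same selection can be expressed with pandas alone, followed by a pivot into a stock-by-time matrix. The toy frame, the `toy_*` names, and the use of `counts.max()` in place of the hard-coded 3830 are assumptions made for illustration. The pivot is one alternative to reshaping a flat array, since it aligns values on *time_id* explicitly and does not depend on row order.

```python
import pandas as pd

# Toy stand-in for past_volatility: stock 2 is missing one time_id.
toy_vol = pd.DataFrame({
    "stock_id":   [0, 0, 0, 1, 1, 1, 2, 2],
    "time_id":    [5, 11, 16, 5, 11, 16, 5, 16],
    "volatility": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
})

# Number of distinct time_id values observed for each stock_id.
counts = toy_vol.groupby("stock_id")["time_id"].nunique()

# Keep only stocks with the full set of time_ids (3 here, 3830 in the notebook).
full_count = counts.max()
toy_eligible = counts[counts == full_count].index

# One row per stock, one column per time_id.
toy_matrix = (
    toy_vol[toy_vol["stock_id"].isin(toy_eligible)]
    .pivot(index="stock_id", columns="time_id", values="volatility")
)
print(toy_matrix)
```

Here stock 2 is dropped because it has only two of the three *time_id* values, and the remaining stocks form a 2 x 3 matrix.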

In [None]:
eligible_stock_id = []
for i in count_stock_id:
    if count_stock_id[i] == 3830:
        eligible_stock_id.append(i)

In [None]:
past_volatility = past_volatility.loc[past_volatility["stock_id"].isin(eligible_stock_id),:]

In [None]:
vecx = np.array(past_volatility["volatility"])

In [None]:
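# rows are grouped by stock_id, so reshape to (n_stocks, 3830), then transpose to (3830, n_stocks)
X = np.reshape(vecx, (-1, 3830)).T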

Principal Component Analysis (PCA) is scale-sensitive, so I preprocess the data with *StandardScaler* from sklearn. I also use a 3-component PCA for ease of visualization, to see whether any stock classes actually exist.
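To illustrate why scaling matters, here is a minimal sketch on synthetic data (not the notebook's volatility matrix; `toy_features` and the chosen scales are invented). Two independent features are generated, one on a scale 100 times larger than the other, and the explained variance ratios are compared with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features; the second has a much larger scale.
toy_features = np.column_stack([
    rng.normal(scale=1.0, size=500),
    rng.normal(scale=100.0, size=500),
])

# Without scaling, the first component is dominated by the large-scale feature.
raw_ratio = PCA(n_components=2).fit(toy_features).explained_variance_ratio_

# After standardization, both features contribute roughly equally.
scaled = StandardScaler().fit_transform(toy_features)
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print(raw_ratio, scaled_ratio)
```

On the raw data, nearly all of the variance lands on the first component simply because of the units. After scaling, the split is close to even, as expected for independent features.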

In [None]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X.T)
pca = PCA(n_components=3)
PC = pca.fit_transform(X)
PC.shape

In [None]:
sns.set_style("whitegrid", {'axes.grid' : False})
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d') # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib

x = PC[:,0]
y = PC[:,1]
z = PC[:,2]


ax.scatter(x, y, z, c=x, marker='o')
ax.set_xlabel('PC_0')
ax.set_ylabel('PC_1')
ax.set_zlabel('PC_2')

plt.show()


In [None]:
print(pca.explained_variance_ratio_)

In [None]:
sns.set_theme(style="white")
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(np.corrcoef(X), cmap=cmap)
plt.show()

Conclusion: in terms of volatility, the stocks appear to be almost uncorrelated.
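A heatmap gives only a visual impression, so a single summary number can make "almost uncorrelated" more concrete. The sketch below uses a hypothetical helper, `mean_abs_offdiag_corr`, which is not part of the original notebook, and demonstrates it on synthetic data only. In the notebook it could be called as `mean_abs_offdiag_corr(X)` on the standardized matrix.

```python
import numpy as np

def mean_abs_offdiag_corr(matrix):
    """Mean absolute pairwise correlation between the rows of `matrix`."""
    corr = np.corrcoef(matrix)
    # Mask out the diagonal, which is always 1 and would inflate the mean.
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[off_diag]).mean())

rng = np.random.default_rng(0)

# Synthetic example: 5 independent rows of 1000 observations each.
independent_rows = rng.normal(size=(5, 1000))

# Synthetic example: 5 rows that share a common factor plus small noise.
common = rng.normal(size=1000)
correlated_rows = common + 0.1 * rng.normal(size=(5, 1000))

print(mean_abs_offdiag_corr(independent_rows))
print(mean_abs_offdiag_corr(correlated_rows))
```

A value near 0 supports the "uncorrelated" reading, while a value near 1 would suggest that the rows largely move together.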