# Are the time windows selected at random throughout the day, or around specific events?

In my opinion, one of the biggest questions we have here is: do the input realized vol and the target have the same distribution? Or does something specific happen between the two periods?

One way to check is to pool the training and target realized volatilities and see whether you can guess which one is which. You end up with a binary classification problem, where you try to guess whether a volatility comes from the training data or the target data. This is commonly used to check for time consistency between training and testing sets.

By fitting a small model and checking common binary classification metrics, we can tell whether the two data sets differ. I use an LGBM, which is fast, powerful and rather explainable.
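To make the idea concrete, here is a minimal sketch on synthetic data (not the competition data). It swaps in a scikit-learn logistic regression for the LGBM used later, and `adversarial_auc` is a hypothetical helper name. Two samples drawn from the same distribution should give a cross-validated AUC close to 0.5, while a shifted sample should be easy to tell apart.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def adversarial_auc(sample_a, sample_b, seed=0):
    """Cross-validated AUC of a classifier guessing which sample a row comes from.

    Hypothetical helper for illustration only.
    """
    X = np.vstack([sample_a, sample_b])
    # Label 0 for the first sample, 1 for the second
    y = np.concatenate([np.zeros(len(sample_a)), np.ones(len(sample_b))])
    clf = LogisticRegression(max_iter=1000)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 5))
same = rng.normal(size=(2000, 5))               # same distribution as base
shifted = rng.normal(loc=0.5, size=(2000, 5))   # mean shifted on every feature

auc_same = adversarial_auc(base, same)
auc_shifted = adversarial_auc(base, shifted)
print(f'same distribution: {auc_same:.3f}, shifted: {auc_shifted:.3f}')
```

The further the AUC is from 0.5, the easier it is to separate the two samples, which is the signal we look for between the input and target realized vol.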

This notebook builds on the amazingly fast notebook: https://www.kaggle.com/slawekbiel/naive-but-fast-submission

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import r2_score
import glob

def ffill(data_df):
    data_df=data_df.set_index(['time_id', 'seconds_in_bucket'])
    data_df = data_df.reindex(pd.MultiIndex.from_product([data_df.index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket']), method='ffill')
    return data_df.reset_index()

In [None]:
# A function to calculate realized volatility for all time intervals in a single book file
def realized_volatility_single_stock(file_path, prediction_column_name):
    
    df_book_data = pd.read_parquet(file_path)

    df_book_data = ffill(df_book_data)
    
    
    stock_id = file_path.split('=')[1]
    time_ids, bpr, bsz, apr, asz = (df_book_data[col].values for col in ['time_id', 'bid_price1','bid_size1','ask_price1','ask_size1' ])
    wap = (bpr * asz +apr * bsz) / (asz + bsz)
    log_wap = np.log(wap)
    ids, index = np.unique(time_ids, return_index=True)

    splits = np.split(log_wap, index[1:])
    ret=[]
    for time_id, x in zip(ids.tolist(), splits):
        log_ret = np.diff(x)
        volatility = np.sqrt((log_ret ** 2).sum())
        ret.append((f'{stock_id}-{time_id}', volatility.item()))
    return pd.DataFrame(ret, columns=['row_id', prediction_column_name])
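As a toy illustration of what the function above computes inside a single time bucket, the sketch below uses three made-up book snapshots (the prices and sizes are invented, not competition data). It derives the weighted average price (WAP), the log returns, and the realized volatility as the square root of the sum of squared log returns.

```python
import numpy as np

# Three made-up snapshots of the top of the book (illustrative values only)
bid_price = np.array([1.00, 1.00, 1.01])
ask_price = np.array([1.02, 1.02, 1.03])
bid_size = np.array([100, 100, 100])
ask_size = np.array([100, 300, 100])

# WAP: each price is weighted by the size on the opposite side of the book
wap = (bid_price * ask_size + ask_price * bid_size) / (ask_size + bid_size)

# Log returns between consecutive snapshots
log_ret = np.diff(np.log(wap))

# Realized volatility: square root of the sum of squared log returns
realized_vol = np.sqrt((log_ret ** 2).sum())
print(wap, realized_vol)
```

Note how the second snapshot's larger ask size pulls the WAP toward the bid price, even though neither quoted price moved.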

In [None]:
def realized_volatility_all(files_list, prediction_column_name):
    return pd.concat( [realized_volatility_single_stock(file, prediction_column_name) for file in files_list])

## Run on the train set to sanity check

In [None]:
list_order_book_file_train = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/*')

In [None]:
%%time
df_past_realized_train = realized_volatility_all(list_order_book_file_train, 'pred')

Less than a minute!

# Let's build the two data sets

In [None]:
df_past_realized_train['stock_id'] = df_past_realized_train['row_id'].str.partition('-')[0].astype('int')
df_past_realized_train['time_id'] = df_past_realized_train['row_id'].str.partition('-')[2].astype('int')

In [None]:
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)

In [None]:
df_past_realized_train = df_past_realized_train[['pred','stock_id','time_id']]
df_train = train[['target','stock_id','time_id']]

# One row per time_id, one column per stock_id; 'sum' avoids the pandas warning raised by aggfunc=np.sum
table_train = pd.pivot_table(df_past_realized_train, values='pred', index=['time_id'], columns=['stock_id'], aggfunc='sum')
table_test = pd.pivot_table(df_train, values='target', index=['time_id'], columns=['stock_id'], aggfunc='sum')

# Add the targets

In [None]:
table_train['target'] = 0
table_test['target'] = 1

In [None]:
whole_df = pd.concat([table_train,table_test])

X = whole_df.drop(['target'], axis=1)
y = whole_df['target']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Fit an LGBM model to guess which data set each row comes from

In [None]:
import lightgbm

train_data = lightgbm.Dataset(X_train, label=y_train)
test_data = lightgbm.Dataset( X_test, label=y_test)


parameters = {
    # 'application' is only an alias of 'objective', so it is set once
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': True,
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 1
}

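# Early stopping is passed as a callback: the early_stopping_rounds keyword
# of lightgbm.train was removed in LightGBM 4.0
model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=[test_data],
                       num_boost_round=50000,
                       callbacks=[lightgbm.early_stopping(stopping_rounds=200)])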

An AUC of 0.54 is not exactly random, but not far from it. Remember that in finance you don't really need much more than 0.5 precision to make money, so a 0.54 AUC could be significant.
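One possible way to judge whether an AUC this close to 0.5 is more than noise is a label-permutation test. The sketch below runs on synthetic scores, not on this notebook's model output. `rank_auc` and `permutation_p_value` are hypothetical helper names, and the rank-based AUC assumes there are no tied scores.

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC from the Mann-Whitney rank-sum formula (assumes no tied scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def permutation_p_value(scores, labels, n_perm=300, seed=0):
    """Share of label shufflings whose AUC is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = rank_auc(scores, labels)
    null = np.array([rank_auc(scores, rng.permutation(labels))
                     for _ in range(n_perm)])
    # The +1 terms count the observed labelling itself as one permutation
    p_value = ((null >= observed).sum() + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic scores: the positive class is shifted only slightly upward
rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 3000)
scores = rng.normal(size=6000) + 0.15 * labels

observed_auc, p_value = permutation_p_value(scores, labels)
print(f'AUC = {observed_auc:.3f}, permutation p-value = {p_value:.4f}')
```

With enough rows, even a small edge over 0.5 can sit well outside what shuffled labels produce, which is why a modest AUC should not be dismissed without such a check.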

In [None]:
lightgbm.plot_importance(model,max_num_features = 10)

The plot clearly shows some outliers: a few stocks carry most of the importance, and they are what makes it possible to tell the training realized vol from the target realized vol.

# We can look at the difference in distribution between the training and target realized vol for a given stock

In [None]:
from matplotlib import pyplot

stock_number = 83

h1 = whole_df[stock_number][whole_df['target']==0]
h2 = whole_df[stock_number][whole_df['target']==1]

bins = np.linspace(0, 0.08, 25)

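# Labels are required, otherwise pyplot.legend has nothing to display
pyplot.hist(h1, bins=bins, alpha=0.5, label='input realized vol')
pyplot.hist(h2, bins=bins, alpha=0.5, label='target realized vol')
pyplot.legend(loc='upper right')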
pyplot.show()
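To put a number on what the histogram shows, one option (not used in the original analysis) is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The sketch below is NumPy-only, uses synthetic samples, and `ks_statistic` is a hypothetical helper name.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.abs(cdf_a - cdf_b).max()

# Synthetic volatility-like samples (illustrative only)
rng = np.random.default_rng(0)
vol_input = rng.lognormal(mean=-5.0, sigma=0.5, size=3000)
vol_target = rng.lognormal(mean=-4.8, sigma=0.5, size=3000)

d = ks_statistic(vol_input, vol_target)
print(f'KS statistic: {d:.3f}')
```

In this notebook the same helper could be called on `h1.dropna()` and `h2.dropna()`. A value near 0 means the two distributions overlap almost perfectly, and a value near 1 means they barely overlap.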