## Introduction
This notebook creates the submission file for the competition [Optiver Realized Volatility Prediction](https://www.kaggle.com/c/optiver-realized-volatility-prediction).

### Structure of the notebook
1. **Load the data** - The training and test sets are loaded.
2. **Preprocess the data** - The test set is preprocessed using functions defined in this notebook.
3. **Encode and normalize features** - Numeric features are standardized and the `stock_id` and `time_id` columns are label-encoded.
4. **Combine the models and run inference** - The notebook uses nine pre-trained models to predict the target.
5. **Get average of the results** - Finally, a weighted average of the predictions is calculated (the weight of each model is the inverse of its public score) and the result is saved to the submission file.

### Models used
The first three models were created by me; the rest are based on notebooks by other Kagglers. Each model was trained in a separate notebook so that its public score could be checked individually. The set includes tree-based models, simple models such as linear regression, and a deep neural model.

**My models**:
1. `tab_net_norm` - [TabNet (with normalization)](https://www.kaggle.com/kingakocol/master-tabnet-with-normalization)
2. `xgb_norm` - [XGB (with normalization)](https://www.kaggle.com/kingakocol/master-xgb-with-normalization)
3. `lgbm_norm` - [LGBM (with normalization)](https://www.kaggle.com/kingakocol/master-lgbm-with-normalization)

**Kaggle community models**:

4. `catboost_norm` - [CatBoost (with normalization)](https://www.kaggle.com/kingakocol/master-catboost-with-normalization)
5. `linreg_norm` - [LinearRegression (with normalization)](https://www.kaggle.com/kingakocol/master-linearregression-with-normalization)
6. `bayridge_norm` - [Bayesian Ridge (with normalization)](https://www.kaggle.com/kingakocol/master-bayesian-ridge-with-normalization)
7. `gradboost_norm` - [GradientBoostingRegressor (with normalization)](https://www.kaggle.com/kingakocol/gradientboostingregressor-with-normalization)
8. `lightGBM_norm` - [LightGBM (with normalization)](https://www.kaggle.com/kingakocol/master-lightgbm-with-normalization)
9. `dnm_norm` - [Deep Neural Model (with normalization)](https://www.kaggle.com/kingakocol/master-deep-neural-model-with-normalization)

## Loading and preprocessing the data

### Import of libraries

First, I import the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import joblib
import xgboost as xgb
from catboost import Pool, CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adamax
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.layers import LeakyReLU, Reshape
from tensorflow.keras.layers import Dropout, Concatenate
from tensorflow.keras.layers import Embedding, Dense, Flatten
from tensorflow.keras.layers import Input, BatchNormalization
from lightgbm import LGBMRegressor
from joblib import Parallel, delayed
from tqdm import tqdm

### Functions used in preprocessing

This notebook (https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data) shared by Optiver introduces the financial concepts used in the competition. It explains how to calculate the instantaneous stock valuation (weighted average price, WAP), how to compare prices of a stock over time (log returns), and how the target (realized volatility) is defined.
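
As a minimal illustration of these three quantities, the sketch below computes WAP, log returns and realized volatility for a four-row order book snapshot. The prices and sizes are made up for the example; the formulas mirror `calc_wap1`, `log_return` and `realized_volatility` from this notebook.

```python
import numpy as np
import pandas as pd

# Toy top-of-book data for a single time_id (values are made up).
book = pd.DataFrame({
    'bid_price1': [0.999, 1.000, 1.001, 1.000],
    'ask_price1': [1.001, 1.002, 1.003, 1.002],
    'bid_size1':  [100, 200, 100, 300],
    'ask_size1':  [300, 200, 100, 100],
})

# WAP: each price is weighted by the size on the opposite side of the book.
wap = (book['bid_price1'] * book['ask_size1']
       + book['ask_price1'] * book['bid_size1']) / (book['bid_size1'] + book['ask_size1'])

# Log return between consecutive snapshots; the first value is NaN and is dropped.
log_ret = np.log(wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns.
realized_vol = np.sqrt(np.sum(log_ret ** 2))

print(wap.round(4).tolist())
print(realized_vol)
```

Weighting the bid price by the ask size (and vice versa) pulls the WAP towards the side of the book with less resting volume.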

This section defines functions that compute features from the book and trade data of the training and test sets, such as WAP (weighted average price), log returns and spreads, as well as functions that aggregate these features and load the data from files.
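
The aggregation step can be sketched on toy data as follows. This is a simplified variant under stated assumptions, not the exact code used below: the data frame, the single `price_spread1` feature and the helper `aggregate` are made up for the example. It shows the same pattern of aggregating once over the whole bucket and once over the last 300 seconds, then joining the two on `time_id`.

```python
import pandas as pd

# Toy per-second data for two time buckets (values are made up).
df = pd.DataFrame({
    'time_id':           [5, 5, 5, 5, 11, 11, 11],
    'seconds_in_bucket': [0, 150, 320, 590, 10, 400, 500],
    'price_spread1':     [0.002, 0.004, 0.003, 0.005, 0.001, 0.002, 0.003],
})

agg_spec = {'price_spread1': ['sum', 'mean']}

def aggregate(frame, suffix=''):
    """Aggregate per time_id and flatten the two-level column index."""
    out = frame.groupby('time_id').agg(agg_spec)
    out.columns = ['_'.join(col) + suffix for col in out.columns]
    return out.reset_index()

# Features over the whole 600-second bucket.
full = aggregate(df)

# The same features restricted to the second half of the bucket.
late = aggregate(df.query('seconds_in_bucket >= 300'), suffix='_300')

# A left join keeps every time_id even if it has no rows in the late window.
features = full.merge(late, on='time_id', how='left')
print(features)
```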

I used code from this notebook: https://www.kaggle.com/tensorchoko/optiver-tabnet-beginner.

In [None]:
def calc_wap1(df):
    wap = (df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1'])/(df['bid_size1'] + df['ask_size1'])
    return wap

def calc_wap2(df):
    wap = (df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2'])/(df['bid_size2'] + df['ask_size2'])
    return wap

def calc_wap3(df):
    wap = (df['bid_price1'] * df['bid_size1'] + df['ask_price1'] * df['ask_size1'])/(df['bid_size1'] + df['ask_size1'])
    return wap

def calc_wap4(df):
    wap = (df['bid_price2'] * df['bid_size2'] + df['ask_price2'] * df['ask_size2'])/(df['bid_size2'] + df['ask_size2'])
    return wap

def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff() 

def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

def count_unique(series):
    return len(np.unique(series))

def preprocessor_book(file_path):
    df = pd.read_parquet(file_path)

    df['wap1'] = calc_wap1(df)
    df['log_return1'] = df.groupby('time_id')['wap1'].apply(log_return)
    
    df['wap2'] = calc_wap2(df)
    df['log_return2'] = df.groupby('time_id')['wap2'].apply(log_return)
    
    df['wap3'] = calc_wap3(df)
    df['log_return3'] = df.groupby('time_id')['wap3'].apply(log_return)
    
    df['wap4'] = calc_wap4(df)
    df['log_return4'] = df.groupby('time_id')['wap4'].apply(log_return)
    
    df['wap_avg'] = (df['wap1'] + df['wap2']) / 2
    df['log_return_avg'] = df.groupby('time_id')['wap_avg'].apply(log_return)
    
    df['wap_balance'] = abs(df['wap1'] - df['wap2'])
    
    df['price_spread1'] = (df['ask_price1'] - df['bid_price1']) / ((df['ask_price1'] + df['bid_price1'])/2)
    df['price_spread2'] = (df['ask_price2'] - df['bid_price2']) / ((df['ask_price2'] + df['bid_price2'])/2)
    
    df['bid_spread'] = df['bid_price1'] - df['bid_price2']
    df['ask_spread'] = df['ask_price1'] - df['ask_price2']
    df['bid_ask_spread'] = abs(df['bid_spread'] - df['ask_spread'])
    
    df['total_volume'] = (df['ask_size1'] + df['ask_size2']) + (df['bid_size1'] + df['bid_size2'])
    df['volume_imbalance'] = abs((df['ask_size1'] + df['ask_size2']) - (df['bid_size1'] + df['bid_size2']))
    df['invent_liquidity'] = (df['ask_size1'] + df['ask_size2']) / (df['bid_size1'] + df['bid_size2'])

    # Aggregation functions applied to each feature
    create_feature_dict = {
        'log_return1':[realized_volatility, np.sum, np.mean, np.std],
        'log_return2':[realized_volatility, np.sum, np.mean, np.std],
        'log_return3':[realized_volatility, np.sum, np.mean, np.std],
        'log_return4':[realized_volatility, np.sum, np.mean, np.std],
        'log_return_avg':[realized_volatility, np.sum, np.mean, np.std],
        'wap_balance':[np.sum, np.mean, np.std],
        'wap_avg':[np.sum, np.mean, np.std],
        'price_spread1':[np.sum, np.mean, np.std],
        'price_spread2':[np.sum, np.mean, np.std],
        'bid_spread':[np.sum, np.mean, np.std],
        'ask_spread':[np.sum, np.mean, np.std],
        'bid_ask_spread':[np.sum, np.mean, np.std],
        'volume_imbalance':[np.sum, np.mean, np.std],
        'total_volume':[np.sum, np.mean, np.std],
        'invent_liquidity':[np.sum, np.mean, np.std],
        'wap1':[np.sum, np.mean, np.std],
        'wap2':[np.sum, np.mean, np.std],
        'wap3':[np.sum, np.mean, np.std],
        'wap4':[np.sum, np.mean, np.std],
            }

    # groupby / all seconds (the whole time bucket)
    df_feature = pd.DataFrame(df.groupby(['time_id']).agg(create_feature_dict)).reset_index()
    
    df_feature.columns = ['_'.join(col) for col in df_feature.columns] #time_id is changed to time_id_
     
    # groupby / last XX seconds (here the last 300 seconds, i.e. the second half of the time bucket)
    last_seconds = [300]
    
    for second in last_seconds:
        second = 600 - second 
    
        df_feature_sec = pd.DataFrame(df.query(f'seconds_in_bucket >= {second}').groupby(['time_id']).agg(create_feature_dict)).reset_index()

        df_feature_sec.columns = ['_'.join(col) for col in df_feature_sec.columns] #time_id is changed to time_id_
     
        df_feature_sec = df_feature_sec.add_suffix('_' + str(second))

        df_feature = pd.merge(df_feature,df_feature_sec,how='left',left_on='time_id_',right_on=f'time_id__{second}')
        df_feature = df_feature.drop([f'time_id__{second}'],axis=1)
    
    #create row_id
    stock_id = file_path.split('=')[1]
    df_feature.insert(loc=0, column='row_id', value=df_feature['time_id_'].apply(lambda x:f'{stock_id}-{x}'))
    df_feature = df_feature.drop(['time_id_'],axis=1)

    return df_feature

def preprocessor_trade(file_path):
    df = pd.read_parquet(file_path)
    df['log_return'] = df.groupby('time_id')['price'].apply(log_return)
    
    aggregate_dictionary = {
        'log_return':[realized_volatility],
        'seconds_in_bucket':[count_unique],
        'size':[np.sum],
        'order_count':[np.mean],
    }
    
    # groupby / all seconds (the whole time bucket)
    df_feature = df.groupby('time_id').agg(aggregate_dictionary)
    
    df_feature = df_feature.reset_index()
    df_feature.columns = ['_'.join(col) for col in df_feature.columns]
    
    # groupby / last XX seconds (here the last 300 seconds, i.e. the second half of the time bucket)
    last_seconds = [300]
    
    for second in last_seconds:
        second = 600 - second
    
        df_feature_sec = df.query(f'seconds_in_bucket >= {second}').groupby('time_id').agg(aggregate_dictionary)
        df_feature_sec = df_feature_sec.reset_index()
        
        df_feature_sec.columns = ['_'.join(col) for col in df_feature_sec.columns]
        df_feature_sec = df_feature_sec.add_suffix('_' + str(second))
        
        df_feature = pd.merge(df_feature,df_feature_sec,how='left',left_on='time_id_',right_on=f'time_id__{second}')
        df_feature = df_feature.drop([f'time_id__{second}'],axis=1)
    
    df_feature = df_feature.add_prefix('trade_')
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['trade_time_id_'].apply(lambda x:f'{stock_id}-{x}')
    df_feature = df_feature.drop(['trade_time_id_'],axis=1)
    
    return df_feature

def preprocessor(list_stock_ids, is_train = True):
    df = pd.DataFrame()
    
    def for_joblib(stock_id):
        if is_train:
            file_path_book = '../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=' + str(stock_id)
            file_path_trade = '../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=' + str(stock_id)
        else:
            file_path_book = '../input/optiver-realized-volatility-prediction/book_test.parquet/stock_id=' + str(stock_id)
            file_path_trade = '../input/optiver-realized-volatility-prediction/trade_test.parquet/stock_id=' + str(stock_id)
            
        df_tmp = pd.merge(preprocessor_book(file_path_book),preprocessor_trade(file_path_trade),on='row_id',how='left')
     
        return pd.concat([df,df_tmp])
    
    df = Parallel(n_jobs=1, verbose=1)(delayed(for_joblib)(stock_id) for stock_id in list_stock_ids)

    df =  pd.concat(df,ignore_index = True)
    return df

### Training set

The training set is loaded from a parquet file. To save time, it was preprocessed in a separate notebook and the result was saved to that file.

The notebook: https://www.kaggle.com/kingakocol/dataset-optiver.

In [None]:
df_train = pd.read_parquet('../input/dataset-optiver/df_train.parquet')
df_train_catboost = df_train.copy()

### Test set

The test set is preprocessed with the functions defined above, and the resulting features are merged with `test.csv` on `row_id`.
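
A minimal sketch of the join on `row_id`, using two made-up frames rather than the competition files. The point is that a left merge keeps every test row, and rows without computed features receive `NaN`, which is why missing values are filled in the next section.

```python
import pandas as pd

# Hypothetical test index: one row per (stock_id, time_id) pair.
test = pd.DataFrame({
    'stock_id': [0, 0, 1],
    'time_id':  [4, 32, 4],
})
test['row_id'] = test['stock_id'].astype(str) + '-' + test['time_id'].astype(str)

# Hypothetical feature table: features exist only for some row_ids.
feats = pd.DataFrame({
    'row_id': ['0-4'],
    'log_return1_realized_volatility': [0.0003],
})

# how='left' keeps every row of `test`; rows without features get NaN.
merged = test.merge(feats, on='row_id', how='left')
print(merged)
```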

In [None]:
test = pd.read_csv('../input/optiver-realized-volatility-prediction/test.csv')

test_ids = test.stock_id.unique()
df_test = preprocessor(list_stock_ids= test_ids, is_train = False)
df_test = test.merge(df_test, on = ['row_id'], how = 'left')

## Normalization and label encoding

In this section, the test set is standardized (all columns except `row_id`, `target`, `time_id` and `stock_id`) with a scaler fitted on the training set, and the `time_id` and `stock_id` columns are label-encoded.
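
The arithmetic behind both steps can be sketched with plain NumPy. This is an illustration on made-up arrays, not the scikit-learn calls used below: standardization subtracts the training mean and divides by the training standard deviation, and label encoding maps each distinct id to its position among the sorted unique ids.

```python
import numpy as np

# Made-up feature matrices: rows are samples, columns are features.
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])
test = np.array([[2.0, 30.0],
                 [4.0, 30.0]])

# Statistics come from the training set only ...
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ... and are then applied unchanged to the test set.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Label encoding: each id is replaced by its index in the sorted unique ids.
stock_ids = np.array([31, 0, 31, 7])
classes, codes = np.unique(stock_ids, return_inverse=True)
print(classes, codes)
```

Reusing the training statistics means the test features are expressed on the same scale the models saw during training.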

In [None]:
for col in df_test.columns.to_list()[3:]:
    df_test[col] = df_test[col].fillna(df_test[col].mean())
    df_test[col] = df_test[col].fillna(0)

df_test = df_test.drop(['row_id'], axis = 1)

df_train[['stock_id', 'time_id']] = df_train['row_id'].str.split('-', expand=True)
col = df_train.pop('stock_id')
df_train.insert(1, 'stock_id', col)
col = df_train.pop('time_id')
df_train.insert(2, 'time_id', col)

for col in df_train.columns.to_list()[4:]:
    df_train[col] = df_train[col].fillna(df_train[col].mean())

In [None]:
scales = df_train.drop(['row_id', 'target', 'time_id','stock_id'], axis = 1).columns.to_list()

scaler = StandardScaler()
scaler.fit(df_train[scales])

test = df_test.drop(['time_id','stock_id'], axis = 1)
test = scaler.transform(test)
test = pd.DataFrame(test, columns=scales)

test_x = test.copy()

test['time_id'] = df_test['time_id']
test.set_index(test.columns[-1], inplace=True)
test.reset_index(inplace=True)

test['stock_id'] = df_test['stock_id']
test.set_index(test.columns[-1], inplace=True)
test.reset_index(inplace=True)

le = LabelEncoder()
le.fit(df_test['stock_id'])
test['stock_id'] = le.transform(test['stock_id'])

le = LabelEncoder()
le.fit(df_test['time_id'])
test['time_id'] = le.transform(test['time_id'])

test_dnm = test.copy()

## Load the trained models
### `tab_net_norm`

In this notebook: https://www.kaggle.com/kingakocol/tabnet-gpu-with-normalization, I trained the TabNet model using Optuna for optimization. The notebook also shows several plots produced after training, which make it possible to analyze which parameters were selected.

In [None]:
model_path = '../input/tabnet-gpu-with-normalization/best_model'
model = tf.keras.models.load_model(model_path)

### `tab_net_norm` inference 

In [None]:
col = test.columns.to_list()

preds_tab_net_norm = model.predict(test[col])
preds_tab_net_norm = preds_tab_net_norm.flatten()

### `xgb_norm`

In this notebook: https://www.kaggle.com/kingakocol/xgboost-with-normalization-model, I trained the XGBoost model. It is a simple model.

In [None]:
model_path = '../input/xgboost-with-normalization-model/xg'
model = joblib.load(model_path)

### `xgb_norm` inference

In [None]:
col = test_x.columns.to_list()
preds_xgb_norm = model.predict(test_x[col])

### `lgbm_norm`

In this notebook: https://www.kaggle.com/kingakocol/lgbm-with-normalization-model, I trained the LightGBM model.

In [None]:
model_path = '../input/lgbm-with-normalization-model/lgbm'
model = joblib.load(model_path)

### `lgbm_norm` inference

In [None]:
col = test_x.columns.to_list()
preds_lgbm_norm = model.predict(test_x[col])

### `catboost_norm`

The next model is based on this notebook: https://www.kaggle.com/sweetjane/catboost. I trained it in a separate notebook and saved it: https://www.kaggle.com/kingakocol/catboost-with-normalization-model.

In [None]:
df_train_catboost[['stock_id', 'time_id']] = df_train_catboost['row_id'].str.split('-', expand=True)
col = df_train_catboost.pop('stock_id')
df_train_catboost.insert(1, 'stock_id', col)
col = df_train_catboost.pop('time_id')
df_train_catboost.insert(2, 'time_id', col)

In [None]:
for col in df_train_catboost.columns.to_list()[4:]:
    df_train_catboost[col] = df_train_catboost[col].fillna(df_train_catboost[col].mean())

In [None]:
scales = df_train_catboost.drop(['row_id', 'target', 'time_id','stock_id'], axis = 1).columns.to_list()

scaler = StandardScaler()
scaler.fit(df_train_catboost[scales])

target = df_train_catboost['target']
train = df_train_catboost.drop(['row_id', 'target', 'time_id','stock_id'], axis = 1)
train = scaler.transform(train)
train = pd.DataFrame(train, columns=scales)

train_dnm = train.copy()

### `catboost_norm` inference

In [None]:
model_path = '../input/catboost-with-normalization-model/catboost'

In [None]:
preds_xgb_catboost = np.zeros(test.shape[0])

kfold = KFold(n_splits = 5, random_state = 66, shuffle = True)

for fold, (trn_ind, val_ind) in enumerate(kfold.split(df_train)):
    print(f'Inference fold {fold + 1}')
    
    test_pool = Pool(test) 
    model = joblib.load(model_path)
    preds_xgb_catboost += model.predict(test_pool) / 5

The next five models are based on this notebook: https://www.kaggle.com/dlaststark/orvp-pulp-fiction/notebook#Base-models.

Links:
1. https://www.kaggle.com/kingakocol/linearregression-with-normalization-model
2. https://www.kaggle.com/kingakocol/bayesian-ridge-with-normalization-model
3. https://www.kaggle.com/kingakocol/gradientboostingregressor-model
4. https://www.kaggle.com/kingakocol/lightgbm-with-normalization-model
5. https://www.kaggle.com/kingakocol/deep-neural-model-with-normalization-model

### `linreg_norm`,`bayridge_norm`,`gradboost_norm`,`lightGBM_norm`

In [None]:
FOLD = 10
SEEDS = [2018, 2020]
COUNTER = 0

y_pred_final_lr = 0
y_pred_final_ridge = 0
y_pred_final_gbr = 0
y_pred_final_lgb = 0

In [None]:
Ytrain_strat = pd.qcut(target, q=10, labels=range(0,10))

### `linreg_norm`,`bayridge_norm`,`gradboost_norm`,`lightGBM_norm` inference

In [None]:
model_path_lreg = '../input/linearregression-with-normalization-model/lreg'
model_path_ridge = '../input/bayesian-ridge-with-normalization-model/ridge'
model_path_gbr = '../input/gradientboostingregressor-model/GBR'
model_path_lgb = '../input/lightgbm-with-normalization-model/lightGBM'

In [None]:
for sidx, seed in enumerate(SEEDS):

    kfold = StratifiedKFold(n_splits=FOLD, shuffle=True, random_state=seed)

    for idx, (xtrain, val) in enumerate(kfold.split(train, Ytrain_strat)):
        COUNTER += 1

        model_lreg = joblib.load(model_path_lreg)
        y_pred_final_lr += model_lreg.predict(test_x)
        
        model_ridge = joblib.load(model_path_ridge)
        y_pred_final_ridge += model_ridge.predict(test_x)
        
        model_gbr = joblib.load(model_path_gbr)
        y_pred_final_gbr += model_gbr.predict(test_x)
        
        model_lgb = joblib.load(model_path_lgb)
        y_pred_final_lgb += model_lgb.predict(test_x, num_iteration=model_lgb.best_iteration_)

In [None]:
y_pred_final_lr = y_pred_final_lr / float(COUNTER)
y_pred_final_lr = np.array([y_pred_final_lr]).T

y_pred_final_ridge = y_pred_final_ridge / float(COUNTER)
y_pred_final_ridge = np.array([y_pred_final_ridge]).T

y_pred_final_gbr = y_pred_final_gbr / float(COUNTER)
y_pred_final_gbr = np.array([y_pred_final_gbr]).T

y_pred_final_lgb = y_pred_final_lgb / float(COUNTER)
y_pred_final_lgb = np.array([y_pred_final_lgb]).T

y_pred_final_lr = y_pred_final_lr.flatten()
y_pred_final_ridge = y_pred_final_ridge.flatten()
y_pred_final_gbr = y_pred_final_gbr.flatten()
y_pred_final_lgb = y_pred_final_lgb.flatten()

### `dnm_norm`

In [None]:
train_dnm['time_id'] = df_train['time_id']
train_dnm.set_index(train_dnm.columns[-1], inplace=True)
train_dnm.reset_index(inplace=True)

train_dnm['stock_id'] = df_train['stock_id']
train_dnm.set_index(train_dnm.columns[-1], inplace=True)
train_dnm.reset_index(inplace=True)

le = LabelEncoder()
le.fit(df_train['stock_id'])
train_dnm['stock_id'] = le.transform(train_dnm['stock_id'])

le.fit(df_train['time_id'])
train_dnm['time_id'] = le.transform(train_dnm['time_id'])

In [None]:
cat_cols = ['stock_id','time_id']

train_dnm[cat_cols] = train_dnm[cat_cols].astype(int)
test[cat_cols] = test[cat_cols].astype(int)
cat_cols_indices = [train_dnm.columns.get_loc(col) for col in cat_cols]

num_cols = [col for col in train_dnm.columns if col not in cat_cols]

In [None]:
for col in tqdm(num_cols):
    transformer = QuantileTransformer(n_quantiles=5000, 
                                      random_state=2020, 
                                      output_distribution="normal")
    
    vec_len = len(train_dnm[col].values)
    vec_len_test = len(test[col].values)

    raw_vec = train_dnm[col].values.reshape(vec_len, 1)
    test_vec = test_dnm[col].values.reshape(vec_len_test, 1)
    transformer.fit(raw_vec)
    
    train_dnm[col] = transformer.transform(raw_vec).reshape(1, vec_len)[0]
    test[col] = transformer.transform(test_vec).reshape(1, vec_len_test)[0]

print(f"train: {train_dnm.shape} \ntest: {test.shape}")

In [None]:
model_path = '../input/deep-neural-model-with-normalization-model/dnm'
model = tf.keras.models.load_model(model_path)

In [None]:
VERBOSE = 0
BATCH_SIZE = 2048

y_pred_final_dnn = 0
counter = 0

### `dnm_norm` inference

In [None]:
for sidx, seed in enumerate(SEEDS):
    seed_score = 0
    
    kfold = StratifiedKFold(n_splits=FOLD, shuffle=True, random_state=seed)

    for idx, (xtrain, val) in enumerate(kfold.split(train, Ytrain_strat)):
        counter += 1

        y_pred_final_dnn += model.predict([[test[col] for col in cat_cols], test[num_cols]], batch_size=BATCH_SIZE)

In [None]:
y_pred_final_dnn = y_pred_final_dnn / float(counter)
y_pred_final_dnn = y_pred_final_dnn.flatten()

## Inference

In this section, the weighted average of the predictions is calculated. Because each model was trained in a separate notebook, its public score is known. The weight of each model is the inverse of its public score, so models with a lower (better) score contribute more. The result is saved to the submission file.
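
The weighting scheme can be sketched on two hypothetical models. The names, scores and predictions below are made up for the example; only the inverse-score weighting follows the cell below.

```python
import numpy as np

# Hypothetical public scores (lower is better) and per-model predictions.
scores = {'model_a': 0.25, 'model_b': 0.50}
preds = {
    'model_a': np.array([0.002, 0.004]),
    'model_b': np.array([0.005, 0.001]),
}

# Weight each model by the inverse of its score, so better models count more.
weights = {name: 1.0 / score for name, score in scores.items()}
weight_sum = sum(weights.values())

# Looking predictions and weights up by the same key keeps them paired.
blend = sum(preds[name] * weights[name] for name in preds) / weight_sum
print(blend)
```

Keeping predictions and weights in dictionaries with shared keys is one way to avoid pairing a prediction with the wrong weight when the number of models grows.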

In [None]:
w_xgb_norm = 1 / 0.27199
w_tab_net_norm = 1 / 0.26633
w_lgbm_norm = 1 / 0.29614
w_catboost_norm = 1 / 0.29050
w_lr_norm = 1 / 0.23332
w_ridge_norm = 1 / 0.23339
w_gbr_norm = 1 / 0.22499
w_lgb_norm = 1 / 0.22344
w_dnm_norm = 1 / 0.30256

weight_sum = (w_xgb_norm + w_tab_net_norm + w_lgbm_norm + w_catboost_norm + w_lr_norm + w_ridge_norm
             + w_gbr_norm + w_lgb_norm + w_dnm_norm)

preds = (preds_xgb_norm * w_xgb_norm + preds_tab_net_norm * w_tab_net_norm + preds_lgbm_norm * w_lgbm_norm
        + preds_xgb_catboost * w_catboost_norm + y_pred_final_lr * w_lr_norm + y_pred_final_ridge * w_ridge_norm
        + y_pred_final_gbr * w_gbr_norm + y_pred_final_lgb * w_lgb_norm + y_pred_final_dnn * w_dnm_norm) / weight_sum

In [None]:
sub = pd.read_csv('../input/optiver-realized-volatility-prediction/sample_submission.csv')
sub.target = preds

sub.to_csv('submission.csv',index=False)