Reproducing existing work is a good way to quickly build an initial baseline and gain momentum on a new problem. It is also a scientific way to verify a proposed approach while answering many of the "what" and "why" questions we have along the way. (Of course, we should raise our concerns when something does not seem to make sense.)

So don't be shy or ashamed about redoing someone else's work. What separates a "reproduction" from a "copy & paste" is the questioning and answering along the way, which helps us understand the problem and build a better solution from our own perspective.

Thanks to the work [1] [2] that I referred to and learned from to get here.

(The follow-up notebook on model tuning, https://www.kaggle.com/austinzhao/model-tuning-lgbm-baseline, was added on 3rd Sep 2021.)

**Version Notes**
- V2.0 - linked the follow-up notebook [[Model Tuning] LGBM Baseline](https://www.kaggle.com/austinzhao/model-tuning-lgbm-baseline), improved a bit & shared ideas | 3rd Sep 2021
- V1.0 - 1st published & submitted version | 29th Aug 2021

**Work Principles**
- A clean & comfortable format to help our brains digest
- Occam's Razor - Non sunt multiplicanda entia sine necessitate
- Refactoring 

**General Notes**
- This notebook focuses on building the baseline. Further improvements, such as model tuning and feature improvement, will be added in follow-up work.

# Overview
- Libs Import & Dataset Setup
- EDA
- Model
    - Utility Funcs
    - Feature Engineering
    - Training & Evaluating
    - Main Func
- Summary
- Reference

# Libs Import & Dataset Setup

In [None]:
# Import order: data manipulating -> machine/deep learning -> utilities/helpers/improvement -> configuration
import pandas as pd
import numpy as np
import scipy as sc

from sklearn.model_selection import KFold
import lightgbm as lgb

from joblib import Parallel, delayed

import warnings
warnings.filterwarnings('ignore')
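# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 300)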

# Define data_directory and data_read func
data_dir = '../input/optiver-realized-volatility-prediction/'
def read_train_test():
    train = pd.read_csv(data_dir + 'train.csv')
    test = pd.read_csv(data_dir + 'test.csv')
    # Create a key to merge with book and trade data
    train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
    test['row_id'] = test['stock_id'].astype(str) + '-' + test['time_id'].astype(str)
    print(f'Our training set has {train.shape[0]} rows')
    return train, test

# EDA

Luckily, the dataset comes in a relatively clean format (there are not many None or NaN values to clean first). I guess the credit goes to the online trading platform the data comes from, where the data is already formatted and specified.

So we will take a brief look at the book and trade data for train and test, as well as the submission format, just to get a general idea of which columns we have.

In [None]:
# Define the problem before trying to solve it (this can be iterative as we collect more and more information)
sample = pd.read_csv("../input/optiver-realized-volatility-prediction/sample_submission.csv")
sample

In [None]:
test = pd.read_csv("../input/optiver-realized-volatility-prediction/test.csv")
test

In [None]:
train = pd.read_csv("../input/optiver-realized-volatility-prediction/train.csv")
train

In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
book_example

In [None]:
trade_example = pd.read_parquet("../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0")
trade_example

# Utility Funcs 

**WAP**  
Motivation - take both the price and the size of orders into consideration

In [None]:
# Calculate 1st WAP
def calc_wap1(df):
    wap = (df['ask_price1'] * df['bid_size1'] + df['bid_price1'] * df['ask_size1']) / (df['bid_size1'] + df['ask_size1'])
    return wap
# Calculate 2nd WAP
def calc_wap2(df):
    wap = (df['ask_price2'] * df['bid_size2'] + df['bid_price2'] * df['ask_size2']) / (df['bid_size2'] + df['ask_size2'])
    return wap
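
To make the weighting concrete, here is a minimal sketch on a single made-up order book row. The numbers and the helper name `wap_level1` are hypothetical and chosen only for illustration; the formula is the same as in `calc_wap1` above.

```python
import pandas as pd

# Hypothetical one-row order book snapshot (values made up for illustration)
book = pd.DataFrame({
    'bid_price1': [0.99], 'ask_price1': [1.01],
    'bid_size1': [300], 'ask_size1': [100],
})

def wap_level1(df):
    # Each price is weighted by the size on the opposite side of the book
    return (df['ask_price1'] * df['bid_size1'] + df['bid_price1'] * df['ask_size1']) / (df['bid_size1'] + df['ask_size1'])

mid_price = (book['bid_price1'] + book['ask_price1']) / 2
wap = wap_level1(book)
print(float(mid_price.iloc[0]), float(wap.iloc[0]))
```

With more size on the bid side than on the ask side, the WAP lands above the plain mid price, closer to the ask.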

**Log Return**   
Motivation - quantify stock price changes   
Why log - log returns are additive (so they can be summed across time windows), and unlike simple returns they are not bounded below at -100%, as in ref[1]

In [None]:
# Calculate Log Return
def log_return(series):
    return np.log(series).diff() # log(x / y) = log(x) - log(y), ref[2]
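
The additive property can be checked on a small made-up price series. This is only a sketch; the prices are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical price series (values made up for illustration)
prices = pd.Series([100.0, 102.0, 99.0, 101.0])

# Step-by-step log returns; the first element is NaN because it has no predecessor
log_ret = np.log(prices).diff()

# Summing the step-by-step log returns gives the log return over the whole window
total_log_ret = log_ret.sum()  # pandas skips the leading NaN
direct_log_ret = np.log(prices.iloc[-1] / prices.iloc[0])

# Simple (percentage) returns do not add up the same way
simple_ret = prices.pct_change()
overall_simple_ret = prices.iloc[-1] / prices.iloc[0] - 1

print(total_log_ret, direct_log_ret)
print(simple_ret.sum(), overall_simple_ret)
```

The first pair of numbers matches, while the second pair does not.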

**Realized Volatility**   
Motivation - need a scalar to evaluate volatility   
Why square - so that changes in both directions (positive/negative) count   
Why square root - to get back to the original unit (a typical math operation: squaring meters gives square meters, so we take the root to convert back)

In [None]:
# Realized Volatility
def realized_volatility(series):
    return np.sqrt(np.sum(series**2))
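
A small sketch of why the squares matter, using made-up log returns: the up and down moves cancel in a plain sum, but not in the realized volatility.

```python
import numpy as np

# Hypothetical log returns (values made up for illustration)
log_returns = np.array([0.01, -0.01, 0.02, -0.02])

# A plain sum hides the movement because positive and negative moves cancel
plain_sum = log_returns.sum()

# Squaring removes the sign; the square root brings the result back to the unit of returns
rv = np.sqrt(np.sum(log_returns ** 2))

print(plain_sum, rv)
```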

In [None]:
# Count Unique Elements of Series
def count_unique(series):
    return len(np.unique(series))

# Feature Engineering

Following ref[2], features for the later training are generated in 3 general steps:
1. Compute the basic quantities of the problem domain, as in ref[1] -- 1st-level features
2. Aggregate the 1st-level features with STAT operations (sum, mean, std) over different time windows -- 2nd-level features
3. Cherry-pick the Realized Volatility features over different time windows and apply STAT operations (mean, std, max, min) -- 3rd-level features

One takeaway here: for primarily numerical data, we can try these STAT operations to form 2nd-level features and see whether they help. I also believe there is room for improvement by treating the data as a time series.

**Utility Func - Calculate Features by Specified feature_dict**

In [None]:
# Calculate features as specified by feature_dict
def calc_features(df, feature_dict):
    # Calculate STATs (sum, mean, std) for different time-window (seconds in bucket)
    def calc_certain_window(window, add_suffix=False):
        # Filter by time-window, group by time_id, then apply feature_dict
        df_feature = df[df['seconds_in_bucket'] >= window].groupby(['time_id']).agg(feature_dict).reset_index()
        # Rename features/columns by joining suffix
        df_feature.columns = ['_'.join(col) for col in df_feature.columns]
        # Add a suffix for different time-window
        if add_suffix:
            df_feature = df_feature.add_suffix('_' + str(window))
        return df_feature
    
    windows = [0, 150, 300, 450]
    df_feature = pd.DataFrame()
    
    for window in windows:
        if window == 0:
            df_feature = calc_certain_window(window=window, add_suffix=False)
        else:
            df_feature_tmp = calc_certain_window(window=window, add_suffix=True)
            df_feature = df_feature.merge(df_feature_tmp, how='left', left_on='time_id_', right_on='time_id__'+str(window))
        
    df_feature.drop(['time_id__450', 'time_id__300', 'time_id__150'], axis = 1, inplace = True)
        
    return df_feature
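
The column names used in the merge and drop steps (`time_id_`, `time_id__150`, ...) come from joining the two-level column index and then adding the window suffix. Here is a minimal sketch on a toy frame; the column `x`, the helper `window_agg`, and all values are hypothetical.

```python
import pandas as pd

# Toy data for a single time_id (values made up for illustration)
toy = pd.DataFrame({
    'time_id': [5, 5, 5, 5],
    'seconds_in_bucket': [0, 100, 200, 400],
    'x': [1.0, 2.0, 3.0, 4.0],
})

def window_agg(df, window):
    # Keep only rows at or after `window` seconds, then aggregate per time_id
    out = df[df['seconds_in_bucket'] >= window].groupby('time_id').agg({'x': ['sum', 'mean']}).reset_index()
    # ('time_id', '') becomes 'time_id_' and ('x', 'sum') becomes 'x_sum'
    out.columns = ['_'.join(col) for col in out.columns]
    return out

full = window_agg(toy, 0)
late = window_agg(toy, 150).add_suffix('_150')
print(list(full.columns))
print(list(late.columns))

# Merge on the two differently named time_id columns, then drop the duplicate key
merged = full.merge(late, how='left', left_on='time_id_', right_on='time_id__150')
merged = merged.drop(columns=['time_id__150'])
print(merged)
```

The window of 150 keeps only the last part of the bucket (seconds 150 and later), so its sum covers fewer rows than the full-bucket sum.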

**Preprocess(Generate Features) Func for Book & Trade Data**

In [None]:
# Preprocess book-data (applied for each stock_id)
def preprocess_book(file_path):
    df = pd.read_parquet(file_path)
    
    # Calculate Wap
    df['wap1'] = calc_wap1(df)
    df['wap2'] = calc_wap2(df)
    # Calculate Log-Return
    # transform keeps the original row index, so the result aligns with df on assignment
    df['log_return1'] = df.groupby(['time_id'])['wap1'].transform(log_return)
    df['log_return2'] = df.groupby(['time_id'])['wap2'].transform(log_return)
    # Calculate Wap-Balance
    df['wap_balance'] = abs(df['wap1'] - df['wap2'])
    # Calculate Various-Spread
    df['price_spread'] = (df['ask_price1'] - df['bid_price1']) / ((df['ask_price1'] + df['bid_price1']) / 2)
    df['bid_spread'] = df['bid_price1'] - df['bid_price2']
    df['ask_spread'] = df['ask_price1'] - df['ask_price2']
    df['total_volume'] = (df['ask_size1'] + df['ask_size2']) + (df['bid_size1'] + df['bid_size2'])
    df['volume_imbalance'] = abs((df['ask_size1'] + df['ask_size2']) - (df['bid_size1'] + df['bid_size2']))
    
    # Feature (Generating) Dict for aggregated operations
    feature_dict = {
        'wap1': [np.sum, np.mean, np.std],
        'wap2': [np.sum, np.mean, np.std],
        'log_return1': [np.sum, realized_volatility, np.mean, np.std],
        'log_return2': [np.sum, realized_volatility, np.mean, np.std],
        'wap_balance': [np.sum, np.mean, np.std],
        'price_spread':[np.sum, np.mean, np.std],
        'bid_spread':[np.sum, np.mean, np.std],
        'ask_spread':[np.sum, np.mean, np.std],
        'total_volume':[np.sum, np.mean, np.std],
        'volume_imbalance':[np.sum, np.mean, np.std]
    }
    
    df_feature = calc_features(df, feature_dict=feature_dict)
    
    # Generate row_id (for later merge)
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['time_id_'].apply(lambda x: f'{stock_id}-{x}')
    # Drop the left time_id_ (after using for generating row_id)
    df_feature.drop(['time_id_'], axis = 1, inplace = True)
    
    return df_feature

In [None]:
# Preprocess trade-Data (applied for each stock_id)
def preprocess_trade(file_path):
    df = pd.read_parquet(file_path)
    
    # Calculate Log-Return
    # transform keeps the original row index, so the result aligns with df on assignment
    df['log_return'] = df.groupby('time_id')['price'].transform(log_return)
    
    # Feature (Generating) Dict for aggregated operations
    feature_dict = {
        'log_return':[realized_volatility],
        'seconds_in_bucket':[count_unique],
        'size':[np.sum],
        'order_count':[np.mean],
    }
    
    df_feature = calc_features(df, feature_dict=feature_dict)
    
    df_feature = df_feature.add_prefix('trade_')
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['trade_time_id_'].apply(lambda x:f'{stock_id}-{x}')
    df_feature.drop(['trade_time_id_'], axis = 1, inplace = True)
    
    return df_feature

**The Preprocess Entry Func - Call Parallelized Preprocessing for Train/Test Data**

A valuable point from ref[2] is the parallelized processing: each stock_id can be processed independently (no overlap between them). It is worth reading [Embarrassingly parallel for loops](https://joblib.readthedocs.io/en/latest/parallel.html) to master this "booster". 

In [None]:
# Preprocess/Feature-Engineering in parallel (applied for each stock_id)
def preprocess(list_stock_ids, is_train=True):
    
    def preprocess_for_stock_id(stock_id):
        # Generate file_path for train-dataset
        if is_train:
            file_path_book = data_dir + "book_train.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_train.parquet/stock_id=" + str(stock_id)
        # ... for test-dataset
        else:
            file_path_book = data_dir + "book_test.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_test.parquet/stock_id=" + str(stock_id)
    
        # Preprocess book- and trade- data, then merge both
        df_tmp = pd.merge(preprocess_book(file_path_book), preprocess_trade(file_path_trade), on='row_id', how='left')

        return df_tmp
    
    # Parallelize Preprocessing for Every stock_id
    df = Parallel(n_jobs=-1, verbose=1)(delayed(preprocess_for_stock_id)(stock_id) for stock_id in list_stock_ids)
    
    # Concatenate All Dataframes from Parallelized Preprocessing
    df = pd.concat(df, ignore_index=True)
    
    return df
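
The `Parallel(...)(delayed(f)(arg) for arg in args)` pattern can be tried in isolation. Below is a minimal sketch with a stand-in worker; `process_one` and its output columns are hypothetical and only mimic the shape of the per-stock preprocessing.

```python
from joblib import Parallel, delayed
import pandas as pd

def process_one(stock_id):
    # Stand-in for the per-stock preprocessing: returns a small DataFrame
    return pd.DataFrame({'row_id': [f'{stock_id}-0'], 'value': [stock_id ** 2]})

# delayed(...) wraps each call so that Parallel can dispatch it to a worker;
# the results come back as a list in the same order as the inputs
parts = Parallel(n_jobs=2)(delayed(process_one)(stock_id) for stock_id in [0, 1, 2])

# Concatenate the per-stock frames into one DataFrame
result = pd.concat(parts, ignore_index=True)
print(result)
```

Because each stock_id is handled on its own, no state is shared between workers and the results only need to be concatenated at the end.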

**Further Feature Generating - Utilize Realized Volatility under Different Time Windows**

In [None]:
# Calculate STATs (mean, std, max, min) of realized volatility, grouped by stock_id and by time_id
def get_time_stock(df_feature):
    # Enumerate realized volatility features/columns
    vol_cols = ['log_return1_realized_volatility', 'log_return2_realized_volatility', 'log_return1_realized_volatility_450', 'log_return2_realized_volatility_450', 
                'log_return1_realized_volatility_300', 'log_return2_realized_volatility_300', 'log_return1_realized_volatility_150', 'log_return2_realized_volatility_150', 
                'trade_log_return_realized_volatility', 'trade_log_return_realized_volatility_450', 'trade_log_return_realized_volatility_300', 'trade_log_return_realized_volatility_150']

    # Group by the stock id
    df_stock_id = df_feature.groupby(['stock_id'])[vol_cols].agg(['mean', 'std', 'max', 'min']).reset_index()
    # Rename columns joining suffix
    df_stock_id.columns = ['_'.join(col) for col in df_stock_id.columns]
    df_stock_id = df_stock_id.add_suffix('_' + 'stock')

    # Group by the time id
    df_time_id = df_feature.groupby(['time_id'])[vol_cols].agg(['mean', 'std', 'max', 'min']).reset_index()
    # Rename columns joining suffix
    df_time_id.columns = ['_'.join(col) for col in df_time_id.columns]
    df_time_id = df_time_id.add_suffix('_' + 'time')
    
    # Merge with original dataframe
    df_feature = df_feature.merge(df_stock_id, how='left', left_on=['stock_id'], right_on=['stock_id__stock'])
    df_feature = df_feature.merge(df_time_id, how='left', left_on=['time_id'], right_on=['time_id__time'])
    df_feature.drop(['stock_id__stock', 'time_id__time'], axis=1, inplace=True)
    
    return df_feature

# Training & Evaluating

As this notebook focuses on building the baseline, we will be "a bit lazy" and use basic hyperparams rather than tuning the model for better performance. 

There is a lot of room left to discuss, such as grid search, a local testing/debug environment, and perhaps cloud computing for better speed (training takes around 30 mins on Kaggle and about 15 mins on my 2019 MacBook). So, see you in another notebook soon! 

In [None]:
# Calculate the root mean squared percentage error
def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Early stop with root mean squared percentage error
def feval_rmspe(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'RMSPE', rmspe(y_true, y_pred), False
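
The training code uses sample weights of `1 / y**2` together with the `rmse` objective. A quick numeric sketch of why that matches RMSPE: a squared error weighted by `1 / y_true**2` is exactly the squared percentage error. The values are made up, and LightGBM may normalize the weighted loss differently internally, so this only shows the per-sample equivalence.

```python
import numpy as np

# Hypothetical targets and predictions (values made up for illustration)
y_true = np.array([0.002, 0.004, 0.010])
y_pred = np.array([0.003, 0.003, 0.009])

# Squared RMSPE: mean of squared percentage errors
rmspe_squared = np.mean(((y_true - y_pred) / y_true) ** 2)

# Mean squared error where each sample is weighted by 1 / y_true^2
weights = 1 / np.square(y_true)
weighted_mse = np.mean(weights * np.square(y_true - y_pred))

print(rmspe_squared, weighted_mse)
```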

In [None]:
def train_and_evaluate(train, test):
    # Hyperparameters (basic here, could be tuned with GridSearch later)
    params = {
      'objective': 'rmse',  
      'boosting_type': 'gbdt',
      'num_leaves': 100,
      'n_jobs': -1,
      'learning_rate': 0.1,
      'feature_fraction': 0.8,
      'bagging_fraction': 0.8,
      'verbose': -1
    }
    
    # Split features and target
    x = train.drop(['row_id', 'target', 'time_id'], axis = 1)
    y = train['target']
    x_test = test.drop(['row_id', 'time_id'], axis = 1)
    
    # Transform stock id to a numeric value
    x['stock_id'] = x['stock_id'].astype(int)
    x_test['stock_id'] = x_test['stock_id'].astype(int)
    
    # Create out of folds array
    oof_predictions = np.zeros(x.shape[0])
    # Create test array to store predictions
    test_predictions = np.zeros(x_test.shape[0])
    # Create a KFold object
    kfold = KFold(n_splits = 5, random_state = 66, shuffle = True)
    # Iterate through each fold
    for fold, (trn_ind, val_ind) in enumerate(kfold.split(x)):
        print(f'Training fold {fold + 1}')
        x_train, x_val = x.iloc[trn_ind], x.iloc[val_ind]
        y_train, y_val = y.iloc[trn_ind], y.iloc[val_ind]
        # Root mean squared percentage error weights
        train_weights = 1 / np.square(y_train)
        val_weights = 1 / np.square(y_val)
        train_dataset = lgb.Dataset(x_train, y_train, weight = train_weights, categorical_feature = ['stock_id'])
        val_dataset = lgb.Dataset(x_val, y_val, weight = val_weights, categorical_feature = ['stock_id'])
        model = lgb.train(params = params, 
                          train_set = train_dataset, 
                          valid_sets = [train_dataset, val_dataset], 
                          num_boost_round = 10000, 
                          early_stopping_rounds = 50, 
                          verbose_eval = 50,
                          feval = feval_rmspe)
        # Add predictions to the out of folds array
        oof_predictions[val_ind] = model.predict(x_val)
        # Predict the test set
        # Average the test predictions over the folds
        test_predictions += model.predict(x_test) / kfold.n_splits
        
    rmspe_score = rmspe(y, oof_predictions)
    print(f'Our out of folds RMSPE is {rmspe_score}')
    # Return test predictions
    return test_predictions
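
The out-of-fold bookkeeping can be seen without LightGBM. In this sketch a least-squares line fit (`np.polyfit`) is swapped in as a stand-in model, and the data is made up; only the fold loop mirrors the function above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 1
x_test = np.array([[10.0], [11.0]])

kfold = KFold(n_splits=5, shuffle=True, random_state=66)
oof = np.zeros(len(x))            # one out-of-fold prediction per training row
test_pred = np.zeros(len(x_test))  # test predictions averaged over the folds

for trn_ind, val_ind in kfold.split(x):
    # Stand-in model: a least-squares line fit, not LightGBM
    slope, intercept = np.polyfit(x[trn_ind, 0], y[trn_ind], deg=1)
    # Each row is predicted only by the model that did not see it during fitting
    oof[val_ind] = slope * x[val_ind, 0] + intercept
    test_pred += (slope * x_test[:, 0] + intercept) / kfold.n_splits

print(oof)
print(test_pred)
```

Every training row receives exactly one out-of-fold prediction, so the score computed on `oof` is an estimate on data the models did not train on.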

# Main Func

A good point I learned from ref[2] is the "single entry" for all steps, which gives a clear view of each step. Feel free to dig into any of them and make improvements! 

In [None]:
# Read train and test
train, test = read_train_test()

# Get unique stock_id (as prediction by stock_id)
train_stock_ids = train['stock_id'].unique()

# Generate features
train_feature = preprocess(train_stock_ids, is_train=True)
# Merge with initial train data
train = train.merge(train_feature, on=['row_id'], how='left')

# Same for test data
test_stock_ids = test['stock_id'].unique()
test_feature = preprocess(test_stock_ids, is_train=False)
test = test.merge(test_feature, on=['row_id'], how='left')

# Further generate features with realized-volatility
train = get_time_stock(train)
test = get_time_stock(test)

# Train and evaluate
test_predictions = train_and_evaluate(train, test)

# Save test predictions
test['target'] = test_predictions
test[['row_id', 'target']].to_csv('submission.csv', index=False)

# Summary

This is the first live competition I have attended since learning quite a bit about machine learning... Looking back, I would say we perhaps don't need to learn that much up front while overlooking how important it is to apply what we learn to solving problems (applying it is how you master it deeper and faster in the end!). 

So, in the spirit of Occam's Razor, learn just enough of the basics to start working on a warm-up problem. We will then surely have quite a few questions and things that make us think "ugh... guess I need to learn a bit more about this". Then we go back to "online school" and learn them by priority.

Learn around "enough" -> Work on a problem (a bit beyond comfortable) -> Resolve questions and learning points -> Next practice...

Hmm, it seems we've got our iteration/loop/sprint (call it what you like ;) for becoming a Machine Learning Engineer?!

# Reference

[[1] Introduction to financial concepts and data](https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data)   
[[2] Optiver Realized Volatility LGBM Baseline](https://www.kaggle.com/ragnar123/optiver-realized-volatility-lgbm-baseline)