__Edits (Final features names cell): Updated the order of the volatility column names. The previous order mislabelled columns in the final dataframe, which was detrimental to further data analysis.__

__Edits (Per-time-id processor code cell): A very stupid error on my part. The published version contained mistakes that I later found, tested and fixed, but I forgot to publish the corrected version. I am sorry for any issues caused.__

__The error: Because of the way I previously computed the split indices in the line `for split_pos in np.nonzero(np.diff(dataset[:,0]))[0]`, one of the splits was overlooked. Further, the last split was not being handled, so additional lines were added to handle it.__

*Thank you [Tobias Tesch](https://www.kaggle.com/tobiit) for reminding me to push the latest version in the comments below.*

# Numba for Data Pre-processing and Feature Extraction

This notebook outlines a method of using Numba for faster data pre-processing and feature extraction compared to a baseline pandas implementation. The notebook runs quite fast; check the last cell for the `timeit` profile runs. My baseline pandas implementation took over 18 minutes to perform a similar pre-processing, while this notebook performs a slightly modified version of it in just *27.7 s ± 878 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)*. **I hence attain a roughly 36x speedup by using custom-built Numba kernels and pool processing.**

In [None]:
!pip install line_profiler
%load_ext line_profiler

In [None]:
import numpy as np
import pandas as pd
from glob import glob
from numba import njit
from multiprocessing import Pool

## Loading Data

This pre-processing pipeline ignores the trade data completely and is optimized to work on book data alone. Features can be constructed based off of the order book data provided.

In [None]:
train_targets = pd.read_csv("../input/optiver-realized-volatility-prediction/train.csv")
train_targets['row_id'] = train_targets['stock_id'].astype(str) + '-' + train_targets['time_id'].astype(str)
train_targets = train_targets[['row_id','target']].set_index("row_id")
train_files = glob("../input/optiver-realized-volatility-prediction/book_train.parquet/*")

## Numba-based feature-engineering and data processing kernels

### Data Structure

The columns of the input book data, in the order they appear in the provided dataset, are listed below. Accessing columns by numerical index instead of by name is the trade-off for faster computation of features.

In [None]:
column_names = [
    "time_id",           # 0
    "seconds_in_bucket", # 1
    "bid_price1",        # 2
    "ask_price1",        # 3
    "bid_price2",        # 4
    "ask_price2",        # 5
    "bid_size1",         # 6
    "ask_size1",         # 7
    "bid_size2",         # 8
    "ask_size2"          # 9
]
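
Since the kernels address columns by position, a small lookup table can make the indices easier to check by eye. This is a minimal sketch, assuming the column order listed above; `COL` is a hypothetical helper that is not part of the original pipeline, and the row values are made up.

```python
import numpy as np

column_names = [
    "time_id", "seconds_in_bucket",
    "bid_price1", "ask_price1", "bid_price2", "ask_price2",
    "bid_size1", "ask_size1", "bid_size2", "ask_size2",
]

# Hypothetical helper: map each column name to its positional index.
COL = {name: idx for idx, name in enumerate(column_names)}

# One toy order-book row (values are made up for illustration).
row = np.array([5, 0, 0.99, 1.01, 0.98, 1.02, 100, 200, 50, 60], dtype=np.float32)

best_bid = row[COL["bid_price1"]]  # same as row[2]
best_ask = row[COL["ask_price1"]]  # same as row[3]
```

A lookup like this is mainly useful outside the compiled kernels, for example when checking which index a feature formula refers to.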

### The prefill kernel

This kernel fills in the rows for which `seconds_in_bucket` has no entry in the original dataset. According to the FAQ, these are seconds during which no trading takes place, and their data should be exactly the same as that of the last traded second. The kernel inserts those missing seconds and forward fills their data.

In [None]:
@njit
def fill_array(book_data, filled_data):
    # Write one row per second (0..599) into filled_data, forward filling gaps.
    n_rows = len(book_data)
    filled_data[0] = book_data[0]
    last_read_idx = 0
    for row_idx in range(1, 600):
        # Advance only if a next row exists and it was recorded at this second;
        # the bounds check avoids reading past the end of book_data.
        if last_read_idx + 1 < n_rows and int(book_data[last_read_idx + 1][1]) == row_idx:
            last_read_idx += 1
        filled_data[row_idx] = book_data[last_read_idx]
        filled_data[row_idx][1] = row_idx
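
To see what the forward fill does, here is a plain-NumPy sketch of the same idea on a toy bucket of 5 seconds instead of 600. The function name `forward_fill_seconds`, the three-column layout and the values are assumptions for illustration only; the sketch also guards against reading past the last recorded row.

```python
import numpy as np

def forward_fill_seconds(book_data, n_seconds):
    """Plain-NumPy illustration of the forward fill (no Numba, toy sizes)."""
    filled = np.zeros((n_seconds, book_data.shape[1]), dtype=book_data.dtype)
    filled[0] = book_data[0]
    last_read_idx = 0
    for second in range(1, n_seconds):
        nxt = last_read_idx + 1
        # Advance only when the next recorded row belongs to this second.
        if nxt < len(book_data) and int(book_data[nxt, 1]) == second:
            last_read_idx = nxt
        filled[second] = book_data[last_read_idx]
        filled[second, 1] = second
    return filled

# Toy columns: [time_id, seconds_in_bucket, price]; seconds 1, 2 and 4 are missing.
sparse = np.array([[5, 0, 1.00],
                   [5, 3, 1.02]], dtype=np.float32)
dense = forward_fill_seconds(sparse, 5)
```

In `dense`, seconds 1 and 2 repeat the price recorded at second 0, and second 4 repeats the price recorded at second 3.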

### Feature Computing kernel

This is the kernel where the different features are computed. It returns a list of feature values, with the `time_id` as its last element. Some generic features are calculated here. Any new feature should have its computation defined here, based directly or indirectly on the given dataset's columns.

In [None]:
@njit
def calculate_features(filled_data):
    filled_data = filled_data.transpose()
    
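    # Total bid + ask size and ask-minus-bid size imbalance at levels 1 and 2.
    trade_vols1 = filled_data[6] + filled_data[7]
    trade_vols2 = filled_data[8] + filled_data[9]
    trade_diffs1 = filled_data[7] - filled_data[6]
    trade_diffs2 = filled_data[9] - filled_data[8]
    
    # Bid/ask price ratio minus 1 (computed here but not returned as a feature).
    spreads1 = (filled_data[2] / filled_data[3]) - 1
    spreads2 = (filled_data[4] / filled_data[5]) - 1
    
    # Weighted average price: each price is weighted by the opposite side's size.
    waps1 = (filled_data[2] * filled_data[7] + filled_data[3] * filled_data[6]) / (filled_data[6] + filled_data[7])
    waps2 = (filled_data[4] * filled_data[9] + filled_data[5] * filled_data[8]) / (filled_data[8] + filled_data[9])
    
    # Log returns between consecutive seconds (one value fewer than the WAPs).
    logs1 = np.diff(np.log(waps1))
    logs2 = np.diff(np.log(waps2))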
    
    return [
        waps1.mean(), 
        waps2.mean(),
        waps1[300:].mean(),
        waps2[300:].mean(),
        waps1.std(),
        waps2.std(),
        waps1[300:].std(),
        waps2[300:].std(),
        logs1.mean(),
        logs2.mean(),
        logs1[300:].mean(),
        logs2[300:].mean(),
        logs1.std(), # Essentially volatility1
        logs2.std(), # Essentially volatility2
        trade_vols1.mean(),
        trade_vols2.mean(),
        trade_vols1[300:].mean(),
        trade_vols2[300:].mean(),
        trade_diffs1.mean(),
        trade_diffs2.mean(),
        trade_diffs1[300:].mean(),
        trade_diffs2[300:].mean(),
        int(filled_data[0][0])
    ]
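
The weighted average price (WAP) and log-return computations can be checked by hand on a few rows. This is a small sketch with made-up top-of-book values; the variable names are illustrative and do not appear in the kernel.

```python
import numpy as np

# Toy top-of-book values for three consecutive seconds (made up).
bid_price = np.array([0.99, 1.00, 1.00])
ask_price = np.array([1.01, 1.02, 1.01])
bid_size = np.array([100.0, 100.0, 300.0])
ask_size = np.array([300.0, 100.0, 100.0])

# WAP: each price is weighted by the size on the opposite side of the book.
wap = (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Log returns between consecutive seconds; np.diff yields one value fewer.
log_returns = np.diff(np.log(wap))
```

For the first second, the WAP is (0.99 * 300 + 1.01 * 100) / 400 = 0.995: the large ask size pulls the price towards the bid. Because `np.diff` drops one element, a 600-second bucket gives 599 log returns, so a slice such as `[300:]` covers 299 of them.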

### Per-`time_id` processor

During experimentation, I found that compiling the code that handles the processing of a single `time_id` of a `stock_id` was slightly faster than running it directly in Python. This is probably because this piece of code runs ~3800 times per stock over ~120 stocks, so compiling it once helps tremendously.

In [None]:
@njit
def process_groups(dataset, stock_id):
    ret_lis = []
    last_split_pos = 0
    filled_data = np.zeros((600, 10), dtype=np.float32)
    # np.diff is non-zero at index i when row i + 1 starts a new time_id,
    # so each group ends (exclusively) at i + 1.
    for split_pos in np.nonzero(np.diff(dataset[:,0]))[0] + 1:
        data_split = dataset[last_split_pos:split_pos]
        fill_array(data_split, filled_data)
        features = calculate_features(filled_data)
        ret_lis.append(features + [stock_id])
        last_split_pos = split_pos
    # Handle the last time_id, which has no boundary after it.
    data_split = dataset[last_split_pos:]
    fill_array(data_split, filled_data)
    features = calculate_features(filled_data)
    ret_lis.append(features + [stock_id])
    return ret_lis
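
How `np.diff` and `np.nonzero` locate the boundaries between `time_id` groups is easy to get wrong by one row, so here is a small sketch on a toy `time_id` column. The names `last_rows`, `starts` and `ends` are illustrative, not part of the pipeline.

```python
import numpy as np

time_ids = np.array([5, 5, 5, 11, 11, 16])

# np.diff is non-zero at index i when time_ids[i + 1] != time_ids[i],
# so i is the LAST row of a group and i + 1 is the first row of the next one.
last_rows = np.nonzero(np.diff(time_ids))[0]

# Half-open [start, end) slices, including the final group after the last boundary.
starts = np.concatenate((np.array([0]), last_rows + 1))
ends = np.concatenate((last_rows + 1, np.array([len(time_ids)])))

groups = [time_ids[s:e].tolist() for s, e in zip(starts, ends)]
```

Here `last_rows` is `[2, 4]` and the slices recover the groups `[5, 5, 5]`, `[11, 11]` and `[16]`. The final group needs separate handling because no boundary follows it.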

## Processing data using the defined kernels

### Final features names

This list contains the names of the columns that would be present in the final feature set. **Any new features added in the above kernels must be added here too in the exact order.**

In [None]:
feature_columns = [
    "wap1", "wap2", "wap1l", "wap2l", "wap1_std", "wap2_std", "wap1l_std", "wap2l_std", "log1", "log2", "log1l", "log2l", "vol1", "vol2",
    "volume1", "volume2", "volume1l", "volume2l", "diff1", "diff2", "diff1l", "diff2l", "time_id", "stock_id"
]
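
A cheap guard can catch the case where a feature is added to the kernel but not to this list. This is a sketch only: `check_feature_rows` is a hypothetical helper, and it detects a mismatch in the number of values, not a wrong order.

```python
# Names in the order used by the final dataframe (copied from the cell above).
feature_columns = [
    "wap1", "wap2", "wap1l", "wap2l", "wap1_std", "wap2_std", "wap1l_std", "wap2l_std",
    "log1", "log2", "log1l", "log2l", "vol1", "vol2",
    "volume1", "volume2", "volume1l", "volume2l", "diff1", "diff2", "diff1l", "diff2l",
    "time_id", "stock_id",
]

def check_feature_rows(rows, columns):
    """Hypothetical guard: raise if any row's width differs from the column list."""
    for position, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {position} has {len(row)} values, expected {len(columns)}"
            )

# Toy row: 22 features, then time_id and stock_id (all values made up).
toy_rows = [[0.0] * 22 + [5, 0]]
check_feature_rows(toy_rows, feature_columns)  # passes silently
```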

### Stock Processing function

This function essentially reads data from the parquet file of a single stock and calls the kernels in order to extract features.

In [None]:
def process_single_stock(file_path):
    book = pd.read_parquet(file_path, engine="pyarrow").to_numpy(dtype=np.float32)
    group_features = process_groups(book, int(file_path.split('=')[1]))
    return group_features
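
The `split('=')` call relies on each stock's data living in a directory whose name ends in `stock_id=<number>`. The sketch below isolates that step; `stock_id_from_path` is a hypothetical helper and the example path is an assumption about the directory layout, inferred from the glob pattern and the `split('=')` call.

```python
def stock_id_from_path(file_path):
    """Extract the integer after '=' from a 'stock_id=<number>' directory path."""
    return int(file_path.rstrip("/").split("=")[1])

# Assumed layout: one partition directory per stock under book_train.parquet.
example_path = "../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=17"
stock_id = stock_id_from_path(example_path)
```

This parsing would break if any other part of the path contained an `=` character, which is worth keeping in mind when moving the data elsewhere.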

### Data Processor Main Handler

Here we use Python's basic multiprocessing to process multiple `stock_id`s at the same time. This adds a slight speedup to the overall processing pipeline.

**Note: The returned dataframe has all of its columns, including `time_id` and `stock_id`, in floating-point format (pandas stores the returned Python floats as `float64`). Any type changes have to be handled manually.**

In [None]:
def preprocess_data():
    worker_pool = Pool(processes=None)
    full_feature_list_matrix = worker_pool.map(process_single_stock, train_files)
    worker_pool.close()
    worker_pool.join()
    return_feature_list = []
    for feature_list in full_feature_list_matrix:
        return_feature_list += feature_list
        
    return pd.DataFrame(return_feature_list, columns=feature_columns)
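
Since the identifiers come back as floats, they need to be cast before they can be matched against the `row_id` key built for `train_targets`. This is a minimal sketch on a toy frame; the `features` and `targets` frames and their values are made up and only stand in for the real outputs.

```python
import pandas as pd

# Toy stand-in for the frame returned by preprocess_data() (values made up).
features = pd.DataFrame({
    "vol1": [0.0012, 0.0034],
    "time_id": [5.0, 11.0],
    "stock_id": [0.0, 0.0],
})

# Restore integer identifiers, then build the same key used for train_targets.
features = features.astype({"time_id": "int64", "stock_id": "int64"})
features["row_id"] = features["stock_id"].astype(str) + "-" + features["time_id"].astype(str)
features = features.set_index("row_id")

# Toy stand-in for train_targets, indexed by row_id as in the loading cell.
targets = pd.DataFrame(
    {"target": [0.004, 0.001]},
    index=pd.Index(["0-5", "0-11"], name="row_id"),
)
merged = features.join(targets)
```

Casting before building the key matters: without it the key would read `0.0-5.0` and match nothing in the targets.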

### Time test

The very first run might be slightly slow because the kernels have to compile, but subsequent runs should be faster. Even without that benefit, the overall time is still lower than when using just pandas.

In [None]:
%timeit preprocess_data()