# Optiver - volatility prediction

Challenge homepage: https://www.kaggle.com/c/optiver-realized-volatility-prediction/overview

Intro notebook: https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data

### Datasets
The goal of the challenge is to predict realized volatility for given stocks.

The data covers 112 stocks. For each stock there are 3830 time intervals (time_id), each 20 minutes long.
There are 3 datasets provided:
1. train.csv - provides the target variable used for training. It is the actual value of realized volatility for each stock_id - time_id pair.
2. orderbook - contains the top 2 levels of order prices and sizes
3. tradebook - contains data for all the trades that occurred (price, size, order count)

The orderbook and tradebook datasets contain per-second data, and they only cover the first 10 minutes of each time interval.
The goal is to use the data from the first 10 minutes of the time interval to predict the volatility of the full 20 minutes.
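
To make the target concrete, here is a minimal sketch of how realized volatility can be computed from a price series: take log returns between consecutive observations, then the square root of the sum of their squares. The price values are made up for illustration; the two helpers mirror the `log_return` and `realized_volatility` functions defined later in this notebook.

```python
import numpy as np
import pandas as pd

def log_return(prices: pd.Series) -> pd.Series:
    """Log return between consecutive observations; the first one is set to 0."""
    return np.log(prices).diff().fillna(0)

def realized_volatility(log_returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt(np.sum(log_returns ** 2)))

# illustrative prices, not taken from the competition data
prices = pd.Series([100.0, 101.0, 100.5, 100.5, 102.0])
rv = realized_volatility(log_return(prices))

# a flat price series has zero realized volatility
rv_flat = realized_volatility(log_return(pd.Series([100.0, 100.0, 100.0])))
print(rv, rv_flat)
```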

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        #print(os.path.join(dirname, filename))
        pass

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import packages

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib.style as style
style.use("fivethirtyeight")

import plotly.express as px
import seaborn as sns
sns.set_style("whitegrid")

import os
import warnings
warnings.filterwarnings("ignore")

# stats
from scipy import stats

# preprocessing packages
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Modeling packages
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold, RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer

# XGB
import xgboost as xg

# LGBM
from lightgbm import LGBMRegressor

# Keras
from keras.wrappers.scikit_learn import KerasRegressor
from keras.models import Sequential
from keras.layers import Dense
import keras.backend as kb

## Settings

In [None]:
# set true to run feature generation, if false it will read prebuilt dataset
run_feature_generation = False

# set path to the stored dataset you want to load
train_dataset_path = "../input/optiver-feature-set-4/train_df.csv"
test_dataset_path = "../input/optiver-feature-set-4/test_df.csv"

# set to True to perform model tuning
run_tuning = False

# choose final model
# options: "Keras" "lgbm"

final_model = "lgbm"

## Data preparation and feature engineering

### Feature engineering plan

The following features were created:
1. **rv_full** -> The realized volatility calculated from the orderbook using both first- and second-level prices and sizes.
2. **spread1** -> Mean spread between the first-level bid and ask prices.
3. **size_balance1** -> Mean difference between the first-level bid and ask sizes.
4. **rv_full_bin_30_slope** -> The orderbook was sliced into 30-second bins (20 bins per 10-minute window). Realized volatility was calculated for each bin, and a line was fitted through these values to get the slope. The idea is that if the volatility is falling during the first 10 minutes, it will also keep falling in the near future (the second part of the interval).
5. **order_count** -> Sum of all order counts.
6. **size** -> Sum of the sizes of the orders.
7. **lr_full_slope** -> Slope of a line fitted over the squared log returns. The idea is the same as with rv_full_bin_30_slope.
8. **size_slope** -> Slope of a line fitted over trade order sizes.
9. **order_count_slope** -> Slope of a line fitted over trade order counts.
10. **time_id_rank** -> Since a time_id always represents the same point in time across stock_ids, the idea is to rank time_ids by their average volatility. More volatile time_ids get a higher rank.
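
The slope features can be illustrated with a small, self-contained sketch. It assumes per-second log returns for a single 10-minute window, groups them into 30-second bins, computes the realized volatility per bin, and fits a straight line with `np.polyfit`. The function name `binned_rv_slope` and the synthetic inputs are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def binned_rv_slope(log_returns: pd.Series, seconds: pd.Series, bin_width: int = 30) -> float:
    """Slope of a line fitted through the per-bin realized volatilities."""
    bins = (seconds // bin_width).to_numpy()          # bin index for every second
    rv_per_bin = np.sqrt((log_returns ** 2).groupby(bins).sum())
    x = np.arange(len(rv_per_bin))                    # 0, 1, ..., number of bins - 1
    return float(np.polyfit(x, rv_per_bin.to_numpy(), 1)[0])

seconds = pd.Series(np.arange(600))

# synthetic log returns whose magnitude shrinks (or grows) over the window
falling = pd.Series(1e-4 * np.linspace(2.0, 1.0, 600))
rising = pd.Series(1e-4 * np.linspace(1.0, 2.0, 600))
flat = pd.Series(np.full(600, 1e-4))

slope_falling = binned_rv_slope(falling, seconds)
slope_rising = binned_rv_slope(rising, seconds)
slope_flat = binned_rv_slope(flat, seconds)
print(slope_falling, slope_rising, slope_flat)
```

A negative slope means volatility was decaying over the observed window, a positive slope that it was building up.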

In [None]:
train = pd.read_csv("../input/optiver-realized-volatility-prediction/train.csv")
test = pd.read_csv("../input/optiver-realized-volatility-prediction/test.csv")

In [None]:
# functions to access orderbook and tradebook by stock_id

def ffill(data_df):
    """
    Forward fill the missing "seconds_in_bucket".
    """
    data_df = data_df.set_index(['time_id', 'seconds_in_bucket'])
    data_df = data_df.reindex(pd.MultiIndex.from_product([data_df.index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket']), method='ffill')
    return data_df.reset_index()

def ffill_fast(data_df):
    """
    Forward fill the missing "seconds_in_bucket".
    This method is about 7x faster than ffill
    """

    time_ids = pd.DataFrame(pd.Series(data_df.time_id.unique()), columns=["time_id"])
    seconds_in_bucket = pd.DataFrame(pd.Series(np.arange(0,600)), columns=["seconds_in_bucket"])

    # create key to join on
    time_ids['key'] = 1
    seconds_in_bucket['key'] = 1

    # merge dataframes and drop "key"
    prod = pd.merge(time_ids, seconds_in_bucket, on ='key').drop(columns="key")

    # merge with orderbook
    data_df = prod.merge(data_df, on=["time_id","seconds_in_bucket"], how="left")
    data_df.ffill(inplace=True)
    
    return data_df

def get_orderbook_df(dataset, stock_id):
    book = pd.read_parquet(f"../input/optiver-realized-volatility-prediction/book_{dataset}.parquet/stock_id={stock_id}")
    book["stock_id"] = stock_id
    return book

def get_tradebook_df(dataset, stock_id):
    trade = pd.read_parquet(f"../input/optiver-realized-volatility-prediction/trade_{dataset}.parquet/stock_id={stock_id}")
    trade["stock_id"] = stock_id
    return trade

# functions for WAP and volatility calculations

def calculate_wap1(df):
    """
    Calculate WAP from price1
    """
    return (df["bid_price1"] * df["ask_size1"] + df["ask_price1"] * df["bid_size1"]) / (df["bid_size1"] + df["ask_size1"])

def calculate_wap2(df):
    """
    Calculate WAP from price2
    """
    return (df["bid_price2"] * df["ask_size2"] + df["ask_price2"] * df["bid_size2"]) / (df["bid_size2"] + df["ask_size2"])

def calculate_wap_full(df):
    """
    Calculate WAP from price1 and price2
    """
    wf = (
        df["bid_price1"] * df["ask_size1"] + df["ask_price1"] * df["bid_size1"] + 
        df["bid_price2"] * df["ask_size2"] + df["ask_price2"] * df["bid_size2"]
    ) / (df["bid_size1"] + df["ask_size1"] + df["bid_size2"] + df["ask_size2"])
    return wf

def log_return(wap):
    return np.log(wap).diff().fillna(0)

def realized_volatility(log_return):
    return np.sqrt(np.sum(log_return**2))

def rmspe(y_true, y_pred):
    return (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

def generate_feature_set_for_stock(stock_id, dataset):
    """
    stock_id - id of the stock you want to generate features for
    dataset - "train" or "test"
    """
    
    df = pd.read_csv(f"../input/optiver-realized-volatility-prediction/{dataset}.csv")
    
    df_sub = df[df["stock_id"] == stock_id].copy(deep=True)

    orderbook = get_orderbook_df(dataset, stock_id)
    tradebook = get_tradebook_df(dataset, stock_id)

    # takes about 7 seconds - reduced to about 1 second
    orderbook = ffill_fast(orderbook)

    # calculate realized volatility (only full) - 4-5 seconds - reduced to 1.5 seconds
    orderbook["wap_full"] = calculate_wap_full(orderbook)
    orderbook["wap_full_log"] = np.log(orderbook["wap_full"])
    orderbook['lr_full'] = orderbook.groupby(['time_id'])['wap_full_log'].diff().fillna(0)
    orderbook["lr_full_sq"] = orderbook["lr_full"]**2
    rv_full = orderbook.groupby(["time_id"])["lr_full_sq"].sum().pow(1/2).to_frame()
    rv_full["stock_id"] = stock_id
    df_sub = df_sub.merge(rv_full, on=["time_id","stock_id"], how="left")

    # calculate mean spread1 - 50ms
    orderbook["spread1"] = orderbook["ask_price1"] - orderbook["bid_price1"]
    mean_spread1 = orderbook.groupby(["time_id"])["spread1"].mean().to_frame()
    mean_spread1["stock_id"] = stock_id
    df_sub = df_sub.merge(mean_spread1, on=["time_id","stock_id"], how="left")

    # calculate mean size_balance1 -50ms
    orderbook["size_balance1"] = orderbook["bid_size1"] - orderbook["ask_size1"]
    mean_size_balance1 = orderbook.groupby(["time_id"])["size_balance1"].mean().to_frame()
    mean_size_balance1["stock_id"] = stock_id
    df_sub = df_sub.merge(mean_size_balance1, on=["time_id","stock_id"], how="left")

    # calculate seconds_in_bucket bins - 60ms
    orderbook["seconds_in_bucket_bin_30"] = (orderbook["seconds_in_bucket"] / 30).astype(int)

    # calculate realized volatility for each bin - 30 seconds - reduced to 0.5 seconds!
    rv_full_bin_30 = orderbook.groupby(["time_id", 'seconds_in_bucket_bin_30'])["lr_full_sq"].sum().pow(1/2).to_frame()
    rv_full_bin_30.reset_index(inplace=True)
    rv_full_bin_30["bin_index"] = rv_full_bin_30.groupby(["time_id"])["lr_full_sq"].cumcount()

    # calculate slope of the change in volatility - 2 seconds - reduced to <1 second
    rv_full_bin_30_slope = rv_full_bin_30.groupby(["time_id"]).apply(lambda x: np.polyfit(x.bin_index, x.lr_full_sq, 1)[0]).reset_index(name='rv_full_bin_30_slope')
    rv_full_bin_30_slope["stock_id"] = stock_id
    df_sub = df_sub.merge(rv_full_bin_30_slope, on=["time_id","stock_id"], how="left")

    # order count and order size from tradebook
    # reset_index() turns the grouped Series into a DataFrame with time_id and stock_id columns for the merge
    order_count = tradebook.groupby(["time_id","stock_id"])["order_count"].sum().reset_index()
    df_sub = df_sub.merge(order_count, on=["time_id","stock_id"], how="left")

    order_size = tradebook.groupby(["time_id","stock_id"])["size"].sum().reset_index()
    df_sub = df_sub.merge(order_size, on=["time_id","stock_id"], how="left")
    
    # calculate slope in change of log return
    lr_full_slope = orderbook.groupby(["time_id"]).apply(lambda x: np.polyfit(x.seconds_in_bucket, x.lr_full_sq, 1)[0]).reset_index(name='lr_full_slope')
    lr_full_slope["stock_id"] = stock_id
    df_sub = df_sub.merge(lr_full_slope, on=["time_id","stock_id"], how="left")
    
    # calculate slope in change of order size
    size_slope = tradebook.groupby(["time_id"]).apply(lambda x: np.polyfit(x.seconds_in_bucket+1, x["size"], 1)[0]).reset_index(name='size_slope')
    size_slope["stock_id"] = stock_id
    df_sub = df_sub.merge(size_slope, on=["time_id","stock_id"], how="left")
    
    # calculate slope in change of order counts
    order_count_slope = tradebook.groupby(["time_id"]).apply(lambda x: np.polyfit(x.seconds_in_bucket+1, x["order_count"], 1)[0]).reset_index(name='order_count_slope')
    order_count_slope["stock_id"] = stock_id
    df_sub = df_sub.merge(order_count_slope, on=["time_id","stock_id"], how="left")
    
    return df_sub
    
def generate_feature_set_1(dataset):
    """
    Calculates realized volatility from both price levels
    """
    
    if dataset == "train":
        stock_ids = list(train["stock_id"].unique())
    elif dataset == "test":
        stock_ids = list(test["stock_id"].unique())
    else:
        print("Please provide dataset parameter")
        return
    
    # build one frame per stock and concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    frames = [generate_feature_set_for_stock(stock_id, dataset) for stock_id in stock_ids]
    final_df = pd.concat(frames)

    final_df.rename({"lr_full_sq": "rv_full"}, axis=1, inplace=True)
    final_df = final_df.reset_index(drop=True)
    
    return final_df
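
As a quick sanity check of the weighted average price (WAP) formula used above, the sketch below applies the same first-level formula as `calculate_wap1` to a two-row toy order book. The numbers are invented for illustration.

```python
import pandas as pd

def calculate_wap1(df: pd.DataFrame) -> pd.Series:
    """WAP from the first price level: each price is weighted by the opposite side's size."""
    return (
        df["bid_price1"] * df["ask_size1"] + df["ask_price1"] * df["bid_size1"]
    ) / (df["bid_size1"] + df["ask_size1"])

# toy order book: row 0 has more size on the ask side, row 1 is balanced
book = pd.DataFrame({
    "bid_price1": [99.0, 99.0],
    "ask_price1": [101.0, 101.0],
    "bid_size1": [10, 20],
    "ask_size1": [30, 20],
})

wap = calculate_wap1(book)
print(wap.tolist())
```

In row 0 the ask side is three times larger than the bid side, so the WAP sits closer to the bid price; with balanced sizes in row 1 it equals the mid price.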

In [None]:
# data transformation functions
def log_transform(df, list_of_features):
    """
    Perform log transform on a list of features.
    Useful for e.g. exponentially distributed variables.
    """
    df1 = df.copy(True)
    df1[list_of_features] = np.log(df1[list_of_features])
        
    return df1


def scale(df, list_of_features):
    """
    Scale variables using StandardScaler
    """
    new_df = df.copy(True)
        
    new_df = new_df.fillna(0)
    new_df = new_df.reset_index(drop=True)
    
    scaler = StandardScaler()
    
    new_df[list_of_features] = scaler.fit_transform(new_df[list_of_features])
    
    return new_df


def oh_encode(df, list_of_features):
    """
    Perform one-hot encoding on categorical variables.
    """
    new_df = df.copy(True)
    
    unenc_features = [item for item in df.columns if item not in list_of_features]
    
    enc = OneHotEncoder()
    
    df_enc = enc.fit_transform(new_df[list_of_features]).toarray()
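    # get_feature_names was removed in scikit-learn 1.2; get_feature_names_out needs scikit-learn >= 1.0
    column_names = enc.get_feature_names_out(list_of_features)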
    df_enc = pd.DataFrame(df_enc, columns=column_names)
    
    df_unenc = new_df[unenc_features]

    new_df = df_enc.join(df_unenc)
    
    return new_df

def to_categorical(df, list_of_features):
    """
    Change type of variables to "category".
    """
    new_df = df.copy(True)
    
    for feature in list_of_features:
        new_df[feature] = new_df[feature].astype('category').cat.as_ordered()
    
    return new_df

In [None]:
%%time

# Generate (or read) train and test dataset

if run_feature_generation:
    # Generate features and add them to train and test dataframes
    train_df = generate_feature_set_1("train")
    test_df = generate_feature_set_1("test")

    # store datasets for future use
    train_df.to_csv("train_df.csv",index=False)
    test_df.to_csv("test_df.csv",index=False)

else:
    train_df = pd.read_csv(train_dataset_path)
    test_df = pd.read_csv(test_dataset_path) # change back before deploy
    
train_df_copy = train_df.copy(True)
test_df_copy = test_df.copy(True)

In [None]:
## calculate time_id_rank

def calculate_time_id_rank(df):
    """
    Calculate "time id rank" - rank time ids by volatility.
    """
    mean_rv_full_by_time = df[["time_id","rv_full"]].groupby(["time_id"]).mean()
    mean_rv_full_by_time = mean_rv_full_by_time.reset_index()

    # sort ascending so that more volatile time_ids get a higher rank
    mean_rv_full_by_time.sort_values(by="rv_full", inplace=True, ignore_index=True)
    mean_rv_full_by_time["time_id_rank"] = mean_rv_full_by_time.index

    df = df.merge(mean_rv_full_by_time[["time_id","time_id_rank"]], on="time_id", how="left")
    
    return df

train_df = calculate_time_id_rank(train_df)
test_df = calculate_time_id_rank(test_df)
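
For illustration, the same ranking idea can be sketched on a toy frame. This variant uses `Series.rank` and `Series.map` instead of sorting and merging; it is an alternative formulation under the same assumptions (0-based rank, ascending by mean `rv_full`), not the function used above. The helper name `add_time_id_rank` and the values are hypothetical.

```python
import pandas as pd

def add_time_id_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Attach a 0-based rank of each time_id by its mean rv_full across stocks."""
    mean_rv = df.groupby("time_id")["rv_full"].mean()
    # method="first" breaks ties by order of appearance, giving unique ranks
    rank = mean_rv.rank(method="first").astype(int) - 1
    out = df.copy()
    out["time_id_rank"] = out["time_id"].map(rank)
    return out

# toy data: three time_ids observed for two stocks each
toy = pd.DataFrame({
    "stock_id": [0, 1, 0, 1, 0, 1],
    "time_id": [5, 5, 11, 11, 16, 16],
    "rv_full": [0.004, 0.006, 0.001, 0.002, 0.010, 0.020],
})

ranked = add_time_id_rank(toy)
rank_by_time = ranked.groupby("time_id")["time_id_rank"].first().to_dict()
print(rank_by_time)
```

Here time_id 11 has the lowest mean volatility and gets rank 0, while time_id 16 has the highest and gets rank 2.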

In [None]:
train_df.head()

### Plotting the data

In [None]:
# let's check variable distributions
fig = plt.figure(figsize=[18,18])

for index, variable in enumerate(list(train_df.columns[2:])):
    ax = fig.add_subplot(4, 3, index+1)
    # histplot replaces the deprecated distplot
    sns.histplot(train_df[variable], kde=True, stat="density", ax=ax)

plt.show()

In [None]:
sns.pairplot(train_df.drop(["time_id","stock_id","time_id_rank"], axis=1).sample(1000))

In [None]:
fig, ax = plt.subplots(figsize=(14,10))
sns.heatmap(train_df.drop(["time_id","stock_id"], axis=1).corr(), annot=True, linewidths=.5, ax=ax)

### Log transformation, Scaling and Encoding

In [None]:
# Perform feature scaling, encoding, etc.

list_of_features_to_log = ["rv_full","order_count","size","spread1","lr_full_slope","size_slope","order_count_slope"]
list_of_features_to_scale = ["rv_full","order_count","size","rv_full_bin_30_slope","spread1","size_balance1","lr_full_slope","size_slope","order_count_slope"]
list_of_features_to_encode = ["stock_id"]
list_of_features_to_categorical = ["stock_id","time_id","time_id_rank"]

# Log (not used)
train_prep_log_df = log_transform(train_df, list_of_features_to_log)
test_prep_log_df = log_transform(test_df, list_of_features_to_log)

# Scale
train_prep_df = scale(train_df, list_of_features_to_scale)
test_prep_df = scale(test_df, list_of_features_to_scale)

# OH encode
train_prep_enc_df = scale(oh_encode(train_df, list_of_features_to_encode), list_of_features_to_scale)
test_prep_enc_df = scale(oh_encode(test_df, list_of_features_to_encode), list_of_features_to_scale)

# To categorical (Keras)
train_prep_keras_df = to_categorical(scale(train_df, list_of_features_to_scale), list_of_features_to_categorical)
test_prep_keras_df = to_categorical(scale(test_df, list_of_features_to_scale), list_of_features_to_categorical)

train_prep_df.head()
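
Note that the `scale` helper fits a new `StandardScaler` on whatever frame it receives, so the train and test sets are standardized with different statistics. A minimal sketch of one alternative is shown below: compute the mean and standard deviation on the training data only and reuse them for the test data. The helper names `fit_standardizer` and `apply_standardizer` and the toy values are hypothetical; the sketch uses plain pandas and the population standard deviation (`ddof=0`) to mimic what `StandardScaler` does.

```python
import pandas as pd

def fit_standardizer(train_df: pd.DataFrame, features: list):
    """Compute per-feature mean and population std on the training data."""
    mean = train_df[features].mean()
    # guard against division by zero for constant features
    std = train_df[features].std(ddof=0).replace(0, 1.0)
    return mean, std

def apply_standardizer(df: pd.DataFrame, features: list, mean: pd.Series, std: pd.Series) -> pd.DataFrame:
    """Standardize the given features using previously fitted statistics."""
    out = df.copy()
    out[features] = (out[features] - mean) / std
    return out

features = ["spread1", "size"]
toy_train = pd.DataFrame({"spread1": [1.0, 2.0, 3.0], "size": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"spread1": [2.0, 4.0], "size": [20.0, 40.0]})

mean, std = fit_standardizer(toy_train, features)
toy_train_scaled = apply_standardizer(toy_train, features, mean, std)
toy_test_scaled = apply_standardizer(toy_test, features, mean, std)
print(toy_test_scaled)
```

With this approach a test value equal to the training mean maps to 0, regardless of how the test set itself is distributed.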

## Modeling

In [None]:
# Split datasets into X and y part

X_train = train_prep_df.drop(["time_id", "target"], axis=1)
y_train = train_prep_df["target"]
X_test = test_prep_df.drop(["time_id","row_id"], axis=1)

X_train_enc = train_prep_enc_df.drop(["time_id", "target"], axis=1)
y_train_enc = train_prep_enc_df["target"]
X_test_enc = test_prep_enc_df.drop(["time_id","row_id"], axis=1)

X_train_keras = train_prep_keras_df.drop(["time_id", "target"], axis=1)
y_train_keras = train_prep_keras_df["target"]
X_test_keras = test_prep_keras_df.drop(["time_id","row_id"], axis=1)

# set up scorer
scorer = make_scorer(rmspe, greater_is_better = False)
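
A short worked example may help with reading the metric. RMSPE (root mean squared percentage error) divides each error by the true value, so a miss on a low-volatility row counts as much as a proportionally equal miss on a high-volatility row. The sketch below repeats the `rmspe` formula from above on made-up numbers.

```python
import numpy as np

def rmspe(y_true, y_pred) -> float:
    """Root mean squared percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

# illustrative values: relative errors are -25%, 0% and +20%
y_true = [0.004, 0.002, 0.010]
y_pred = [0.005, 0.002, 0.008]

score = rmspe(y_true, y_pred)
perfect = rmspe(y_true, y_true)
print(score, perfect)
```

Because the scorer is created with `greater_is_better=False`, scikit-learn reports the negated value, which is why cross-validation and grid-search scores based on it are negative.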

In [None]:
## Modeling functions

def test_model(model, X_train, n_jobs=1, trans=None):
    """
    Test model using cross-validation
    """
    pipeline = Pipeline(steps=[('t', trans), ('m', model)])
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    
    scores = cross_val_score(
        pipeline, 
        X_train, 
        y_train, 
        cv=cv, 
        scoring=scorer, 
        verbose=3, 
        n_jobs=n_jobs, 
        error_score='raise'
    )
    
    # the scorer negates RMSPE (greater_is_better=False), so flip the sign back for reporting
    print('RMSPE: %.3f (%.3f)' % (-np.mean(scores), np.std(scores)))
    return model


# hyper parameter optimization functions

#simple performance reporting function
def reg_performance(model):
    print('Best Score: ' + str(model.best_score_))
    print('Best Parameters: ' + str(model.best_params_))
    
#tuning function using GridSearch
def tune_model(model, X_train, param_grid, trans=None, n_jobs=1):
    """
    Tune model hyperparameters using GridSearchCV
    """
    pipeline = Pipeline(steps=[('t', trans), ('model', model)])

    grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=n_jobs, scoring=scorer)
    best_grid_search = grid_search.fit(X_train,y_train)
    reg_performance(best_grid_search)

### Tuning

In [None]:
# tuning
# LGBM

lgbm = LGBMRegressor()

param_grid_lgbm = {'model__learning_rate': [0.13, 0.15, 0.18, 0.20],
                    'model__n_estimators': [70],
                    'model__boosting_type': ['gbdt'],
                    'model__objective': ['regression_l1'],
                    'model__reg_alpha': [3],
                    'model__reg_lambda': [3]
                    }

if run_tuning:
    tune_model(lgbm, X_train, param_grid_lgbm, n_jobs=3)

Best Score: -0.2876800501071248
Best Parameters: {'model__boosting_type': 'gbdt', 'model__learning_rate': 0.13, 'model__n_estimators': 70, 'model__objective': 'regression_l1', 'model__reg_alpha': 3, 'model__reg_lambda': 3}

In [None]:
# tuning
# XGB

xgb = xg.XGBRegressor(random_state = 0)

param_grid_xgb = {'model__learning_rate': [0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.18, 0.20],
                  'model__n_estimators': [50,70,90],
                  'model__booster': ["gbtree", "gblinear"],
                  'model__reg_alpha': [1,2,3,5,10],
                  'model__reg_lambda': [1,2,3,5,10]
                 }

if run_tuning:
    tune_model(xgb, X_train_enc, param_grid_xgb, n_jobs=3)

### Final build and submission

In [None]:
# Keras
if final_model == "Keras":
    def rmspe_keras(y_true, y_pred):
        return (kb.sqrt(kb.mean(kb.square((y_true - y_pred) / y_true))))

    def build_regressor():
        regressor = Sequential()
        regressor.add(Dense(units = len(X_train_keras.columns)**2, kernel_initializer = 'uniform', activation = 'relu', input_dim = len(X_train_keras.columns)))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns)*2, kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = len(X_train_keras.columns), kernel_initializer = 'uniform', activation = 'relu'))
        regressor.add(Dense(units = 1, kernel_initializer = 'uniform'))
        regressor.compile(optimizer = 'adam', loss = rmspe_keras, metrics = [rmspe_keras])
        return regressor

    keras_regressor = KerasRegressor(build_fn = build_regressor, batch_size = 2**5, epochs = 100)

    keras_regressor.fit(X_train_keras,y_train_keras)
    predictions = pd.DataFrame(keras_regressor.predict(X_test_keras), columns=["target"])
    submission = test_prep_keras_df.join(predictions)[["row_id","target"]]
    submission.to_csv("submission.csv", index=False)

In [None]:
# LGBM
if final_model == "lgbm":
    lgbm1 = LGBMRegressor(n_estimators=500)

    lgbm1.fit(X_train,y_train)
    predictions = pd.DataFrame(lgbm1.predict(X_test), columns=["target"])
    submission = test_prep_df.join(predictions)[["row_id","target"]]
    submission.to_csv("submission.csv", index=False)