This notebook explores different stock and volatility predictors. I gathered features engineered by the community, then model them and explain which ones help.

This notebook combines the preprocessing and feature engineering from the following sources. Please check them out!
https://www.kaggle.com/ragnar123/optiver-realized-volatility-lgbm-baseline  
https://www.kaggle.com/konradb/we-need-to-go-deeper-and-validate  
https://www.kaggle.com/tommy1028/lightgbm-starter-with-feature-engineering-idea


Feedback is greatly appreciated! Thanks!!!!

In [None]:
import os
import glob
from joblib import Parallel, delayed

import pandas as pd
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
import lightgbm as lgb
from lightgbm import LGBMRegressor
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 300)

import optuna


In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
book_example.head()



# Feature Exploration

# WAP

The Weighted Average Price (WAP) is explained in the [tutorial notebook](http://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data) and measures the stock price as a function of bid/ask prices and volumes.

The first two WAP calculations are from the first two levels. The third averages the first two.
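
As a minimal sketch of the calculation on a single made-up book snapshot: the prices and sizes below are illustrative and not taken from the competition data, and `wap` is a hypothetical helper that mirrors the per-level formula used later in this notebook.

```python
import pandas as pd

# Toy two-level order book snapshot (values are made up for illustration)
book = pd.DataFrame({
    'bid_price1': [0.99], 'ask_price1': [1.01],
    'bid_size1': [300],   'ask_size1': [100],
    'bid_price2': [0.98], 'ask_price2': [1.02],
    'bid_size2': [100],   'ask_size2': [100],
})

def wap(df, level):
    # Each price is weighted by the size on the opposite side of the book
    bid_p, ask_p = df[f'bid_price{level}'], df[f'ask_price{level}']
    bid_s, ask_s = df[f'bid_size{level}'], df[f'ask_size{level}']
    return (bid_p * ask_s + ask_p * bid_s) / (bid_s + ask_s)

wap1 = wap(book, 1)        # first level
wap2 = wap(book, 2)        # second level
wap3 = (wap1 + wap2) / 2   # average of the two levels
print(wap1.iloc[0], wap2.iloc[0], wap3.iloc[0])
```

In this toy snapshot there is more size on the bid at level 1, so the level-1 WAP (about 1.005) sits closer to the ask than the simple midpoint of 1.00.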

# Log Return

The log return is the logarithm of the ratio of consecutive WAP values within each time_id, i.e. the difference between consecutive log prices. Returns that are larger in absolute value contribute more to realized volatility.
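
A small sketch of how log returns feed into realized volatility, assuming a made-up WAP series for a single time_id (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy WAP series for a single time_id (made-up values)
wap = pd.Series([1.000, 1.001, 0.999, 1.002])

# log(p_t / p_{t-1}) == log(p_t) - log(p_{t-1}); the first value has no predecessor
log_ret = np.log(wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns
realized_vol = np.sqrt((log_ret ** 2).sum())
print(log_ret.round(6).tolist(), realized_vol)
```

Because the returns are squared before summing, the sign of each move does not matter, only its size.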

# Spread

Spread features measure the gap between competing prices in the book. Larger gaps indicate less liquidity and may lead to bigger price movements when the stock trades, as participants may settle for "worse" prices on either side of the gap. On the other hand, large gaps could also indicate that few participants are interested in trading the stock.
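
For illustration, the two level-1 spread features used in this notebook can be sketched on a made-up pair of snapshots, one tight and one wide (the prices are assumptions, not competition data):

```python
import pandas as pd

# Two toy snapshots: a tight book and a wide book (made-up prices)
book = pd.DataFrame({
    'bid_price1': [0.999, 0.990],
    'ask_price1': [1.001, 1.010],
})

# Spread relative to the midpoint, so it is comparable across price levels
mid = (book['ask_price1'] + book['bid_price1']) / 2
book['price_spread'] = (book['ask_price1'] - book['bid_price1']) / mid

# Ratio form of the same gap
book['ask_div_bid_price'] = book['ask_price1'] / book['bid_price1']
print(book)
```

The second snapshot has a relative spread roughly ten times larger than the first, which is the kind of difference these features are meant to capture.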

# Volume 

Volume features show intent. Large volume may act as resistance that prevents the WAP from easily moving through a price point. On the other hand, an imbalance could drive the price in one direction as supply outruns demand or vice versa. The log return function is applied to measure changes in volume. Sudden changes in the difference or in the ratio between bid and ask volume are explored as features.
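
A rough sketch of the volume features on three made-up consecutive snapshots (the sizes are illustrative assumptions); the column names follow the ones used in the preprocessing code:

```python
import numpy as np
import pandas as pd

# Three consecutive toy snapshots (made-up sizes)
book = pd.DataFrame({
    'bid_size1': [100, 100, 400],
    'ask_size1': [100, 200, 100],
    'bid_size2': [50, 50, 50],
    'ask_size2': [50, 50, 50],
})

ask_total = book['ask_size1'] + book['ask_size2']
bid_total = book['bid_size1'] + book['bid_size2']

book['total_volume'] = ask_total + bid_total
# Absolute difference: measures how lopsided the book is, not in which direction
book['volume_imbalance'] = (ask_total - bid_total).abs()
book['ask_div_bid_size'] = book['ask_size1'] / book['bid_size1']
# Log change of the ratio between consecutive snapshots
book['log_return_ask_div_bid_size'] = np.log(book['ask_div_bid_size']).diff()
print(book)
```

Note that the absolute value discards the direction of the imbalance; the ask/bid size ratio and its log change keep that information.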

# Aggregate Features

Aggregate features summarize all of the above. Sum, mean, standard deviation, min, and max are used to track how our features behave within a single time window or stock. Sum tracks total activity. Mean averages across all events in the window. Std measures dispersion. Min and max track extreme values.
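
The windowed aggregation can be sketched as follows, assuming a made-up frame with one feature and two time_ids; `window_stats` is a hypothetical helper that follows the same idea as the `get_stats_window` function defined later (keep snapshots at or after a given second, then summarize per time_id):

```python
import pandas as pd

# Toy per-snapshot feature for two time_ids (made-up values)
df = pd.DataFrame({
    'time_id': [5, 5, 5, 5, 11, 11],
    'seconds_in_bucket': [0, 200, 350, 500, 100, 480],
    'price_spread': [0.002, 0.004, 0.003, 0.001, 0.010, 0.006],
})

def window_stats(frame, start_second):
    # Keep only snapshots at or after start_second, then summarize per time_id
    late = frame[frame['seconds_in_bucket'] >= start_second]
    stats = late.groupby('time_id')['price_spread'].agg(['sum', 'mean', 'std', 'min', 'max'])
    # Suffix the columns so different windows can be joined side by side
    return stats.add_suffix(f'_{start_second}')

full = window_stats(df, 0)
last_300 = window_stats(df, 300)
print(full.join(last_300))
```

A window that contains a single snapshot yields a NaN standard deviation, so late windows on sparse data can introduce missing values.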

# Summarized Results

From the analysis, the realized volatility of log returns dominates feature importance. Put another way, volatility over the past 10 minutes is an excellent predictor of volatility over the next 10 minutes.

Looking at non-volatility features, stock_id ranks highly. It is confirmed that the same stock_ids will be present in the test set, so the feature is very important to the model. Some volume and spread features are also represented, but their utility falls off compared to volatility features.
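
One way to sanity-check this idea is a persistence baseline: predict the next window's volatility with the past window's realized volatility and score it with RMSPE. The sketch below uses made-up numbers purely to show the mechanics; it is not a result from this notebook.

```python
import numpy as np

def rmspe(y_true, y_pred):
    # Root mean squared percentage error
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Made-up values: realized volatility of the observed window vs. the next window
past_vol = np.array([0.0040, 0.0020, 0.0030])
target = np.array([0.0050, 0.0020, 0.0024])

# Persistence baseline: use the past volatility directly as the prediction
baseline_score = rmspe(target, past_vol)
print(baseline_score)
```

On the real data, `past_vol` would correspond to a column such as `log_return1_realized_volatility` and `target` to the competition target; a model is only useful if it improves on this kind of baseline.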

In [None]:
create_features = True
train_mode = False


# data directory
data_dir = '../input/optiver-realized-volatility-prediction/'

# Function to calculate first WAP
# def calc_wap1(df):
#     wap = (df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']) / (df['bid_size1'] + df['ask_size1'])
#     return wap

# # Function to calculate second WAP
# def calc_wap2(df):
#     wap = (df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']) / (df['bid_size2'] + df['ask_size2'])
#     return wap


def calc_wap1(df):
    a1 = df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']
    b1 = df['bid_size1'] + df['ask_size1']
    
    return a1/b1

def calc_wap2(df):
    a1 = df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']
    b1 = df['bid_size2'] + df['ask_size2']
    
    return a1/b1

def calc_wap3(df):
    a1 = df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']
    b1 = df['bid_size1'] + df['ask_size1']
    a2 = df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']
    b2 = df['bid_size2'] + df['ask_size2']
    
    x = (a1/b1 + a2/b2)/ 2
    
    return x


def calc_wap4(df):
        
    a1 = df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1']
    a2 = df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2']
    b = df['bid_size1'] + df['ask_size1'] + df['bid_size2']+ df['ask_size2']
    
    x = (a1 + a2)/ b
    
    return x

# Function to calculate the log of the return
# Remember that logb(x / y) = logb(x) - logb(y)
def log_return(series):
    return np.log(series).diff()

# Calculate the realized volatility
def realized_volatility(series):
    return np.sqrt(np.sum(series**2))

# Function to count unique elements of a series
def count_unique(series):
    return len(np.unique(series))

# Function to read our base train and test set
def read_train_test():
    train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
    test = pd.read_csv('../input/optiver-realized-volatility-prediction/test.csv')
    # Create a key to merge with book and trade data
    train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
    test['row_id'] = test['stock_id'].astype(str) + '-' + test['time_id'].astype(str)
    print(f'Our training set has {train.shape[0]} rows')
    return train, test

# Function to preprocess book data (for each stock id)
def book_preprocessor(file_path):
    df = pd.read_parquet(file_path)
    # sort_values returns a new frame, so assign it back for the sort to take effect
    df = df.sort_values(by=['time_id', 'seconds_in_bucket'])
    # Calculate Wap
    df['wap1'] = calc_wap1(df)
    df['wap2'] = calc_wap2(df)
    df['wap3'] = calc_wap3(df)
    #df['wap4'] = calc_wap4(df)
    
    
    # Calculate log returns
    
    # transform keeps the original index, so the result aligns row by row with df
    df['log_return1'] = df.groupby(['time_id'])['wap1'].transform(log_return)
    df['log_return2'] = df.groupby(['time_id'])['wap2'].transform(log_return)
    df['log_return3'] = df.groupby(['time_id'])['wap3'].transform(log_return)
    #df['log_return4'] = df.groupby(['time_id'])['wap4'].apply(log_return)
 
    # Calculate spread
    df['price_spread'] = (df['ask_price1'] - df['bid_price1']) / ((df['ask_price1'] + df['bid_price1']) / 2)
    df['ask_div_bid_price'] = df['ask_price1'] / df['bid_price1']
    
    #df['bid_spread'] = df['bid_price1'] - df['bid_price2']
    #df['ask_spread'] = df['ask_price1'] - df['ask_price2']
    
    #Calculate volume
    df['total_volume'] = (df['ask_size1'] + df['ask_size2']) + (df['bid_size1'] + df['bid_size2'])
    df['volume_imbalance'] = abs((df['ask_size1'] + df['ask_size2']) - (df['bid_size1'] + df['bid_size2']))
    
    df['ask_div_bid_size'] = df['ask_size1'] / df['bid_size1']
    
    #Calculate log volume changes
    #df['log_return_bid_size1'] = df.groupby(['time_id'])['bid_size1'].apply(log_return)
    #df['log_return_ask_size1'] = df.groupby(['time_id'])['ask_size1'].apply(log_return)
    df['log_return_ask_div_bid_size'] = df.groupby(['time_id'])['ask_div_bid_size'].transform(log_return)
    df['log_return_total_volume'] = df.groupby(['time_id'])['total_volume'].transform(log_return)
    
    # Dict for aggregations
    create_feature_dict = {
        
        'wap1': [np.sum, np.mean, np.std],
        'wap2': [np.sum, np.mean, np.std],
        'wap3': [np.sum, np.mean, np.std],
        #'wap4': [np.sum, np.mean, np.std, np.max],
        'log_return1': [np.sum, realized_volatility, np.mean, np.std, np.max, np.min],
        'log_return2': [np.sum, realized_volatility, np.mean, np.std, np.max, np.min],
        'log_return3': [np.sum, realized_volatility, np.mean, np.std, np.max, np.min],
        #'log_return4': [np.sum, realized_volatility, np.mean, np.std, np.max],
        'price_spread':[np.sum, np.mean, np.std, np.max, np.min],
        #'bid_spread':[np.sum, np.mean, np.std, np.max],
        #'ask_spread':[np.sum, np.mean, np.std, np.max],
        'total_volume':[np.sum, np.mean, np.std, np.max, np.min],
        'volume_imbalance':[np.sum, np.mean, np.std, np.max, np.min],
        'ask_div_bid_price': [np.sum, np.mean, np.std, np.max, np.min],
        'ask_div_bid_size': [np.sum, np.mean, np.std, np.max, np.min],
        #'log_return_bid_size1': [np.sum, np.mean, np.std, np.max],
        #'log_return_ask_size1': [np.sum, np.mean, np.std, np.max],
        'log_return_total_volume': [np.sum, np.mean, np.std, np.max],
        'log_return_ask_div_bid_size': [np.sum, np.mean, np.std, np.max]
        
    }
    
    # Function to get group stats for different windows (seconds in bucket)
    def get_stats_window(seconds_in_bucket, add_suffix = False):
        # Group by the window
        df_feature = df[df['seconds_in_bucket'] >= seconds_in_bucket].groupby(['time_id']).agg(create_feature_dict).reset_index()
        # Rename columns joining suffix
        df_feature.columns = ['_'.join(col) for col in df_feature.columns]
        # Add a suffix to differentiate windows
        if add_suffix:
            df_feature = df_feature.add_suffix('_' + str(seconds_in_bucket))
        return df_feature
    
    # Get the stats for different windows
    df_feature = get_stats_window(seconds_in_bucket = 0, add_suffix = False)
    df_feature_450 = get_stats_window(seconds_in_bucket = 450, add_suffix = True)
    df_feature_300 = get_stats_window(seconds_in_bucket = 300, add_suffix = True)
    df_feature_150 = get_stats_window(seconds_in_bucket = 150, add_suffix = True)
    
    # Merge all
    df_feature = df_feature.merge(df_feature_450, how = 'left', left_on = 'time_id_', right_on = 'time_id__450')
    df_feature = df_feature.merge(df_feature_300, how = 'left', left_on = 'time_id_', right_on = 'time_id__300')
    df_feature = df_feature.merge(df_feature_150, how = 'left', left_on = 'time_id_', right_on = 'time_id__150')
    # Drop unnecessary time_ids
    df_feature.drop(['time_id__450', 'time_id__300', 'time_id__150'], axis = 1, inplace = True)
    
    # Create row_id so we can merge
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['time_id_'].apply(lambda x: f'{stock_id}-{x}')
    df_feature.drop(['time_id_'], axis = 1, inplace = True)
    return df_feature

# Function to preprocess trade data (for each stock id)
def trade_preprocessor(file_path):
    df = pd.read_parquet(file_path)
    # transform keeps the original index, so the result aligns row by row with df
    df['log_return'] = df.groupby('time_id')['price'].transform(log_return)
    
    # Dict for aggregations
    create_feature_dict = {
        'log_return':[realized_volatility],
        'seconds_in_bucket':[count_unique],
        'size':[np.sum],
        'order_count':[np.mean],
    }
    
    # Function to get group stats for different windows (seconds in bucket)
    def get_stats_window(seconds_in_bucket, add_suffix = False):
        # Group by the window
        df_feature = df[df['seconds_in_bucket'] >= seconds_in_bucket].groupby(['time_id']).agg(create_feature_dict).reset_index()
        # Rename columns joining suffix
        df_feature.columns = ['_'.join(col) for col in df_feature.columns]
        # Add a suffix to differentiate windows
        if add_suffix:
            df_feature = df_feature.add_suffix('_' + str(seconds_in_bucket))
        return df_feature
    
    # Get the stats for different windows
    df_feature = get_stats_window(seconds_in_bucket = 0, add_suffix = False)
    df_feature_450 = get_stats_window(seconds_in_bucket = 450, add_suffix = True)
    df_feature_300 = get_stats_window(seconds_in_bucket = 300, add_suffix = True)
    df_feature_150 = get_stats_window(seconds_in_bucket = 150, add_suffix = True)

    # Merge all
    df_feature = df_feature.merge(df_feature_450, how = 'left', left_on = 'time_id_', right_on = 'time_id__450')
    df_feature = df_feature.merge(df_feature_300, how = 'left', left_on = 'time_id_', right_on = 'time_id__300')
    df_feature = df_feature.merge(df_feature_150, how = 'left', left_on = 'time_id_', right_on = 'time_id__150')
    # Drop unnecessary time_ids
    df_feature.drop(['time_id__450', 'time_id__300', 'time_id__150'], axis = 1, inplace = True)
    
    df_feature = df_feature.add_prefix('trade_')
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['trade_time_id_'].apply(lambda x:f'{stock_id}-{x}')
    df_feature.drop(['trade_time_id_'], axis = 1, inplace = True)
    return df_feature

# Function to get group stats for the stock_id and time_id
def get_time_stock(df):
    # Get realized volatility columns
    vol_cols = ['log_return1_realized_volatility', 'log_return2_realized_volatility', 'log_return1_realized_volatility_450', 'log_return2_realized_volatility_450', 
                'log_return1_realized_volatility_300', 'log_return2_realized_volatility_300', 'log_return1_realized_volatility_150', 'log_return2_realized_volatility_150', 
                'trade_log_return_realized_volatility', 'trade_log_return_realized_volatility_450', 'trade_log_return_realized_volatility_300', 'trade_log_return_realized_volatility_150']

    # Group by the stock id
    df_stock_id = df.groupby(['stock_id'])[vol_cols].agg(['mean', 'std', 'max', 'min', ]).reset_index()
    # Rename columns joining suffix
    df_stock_id.columns = ['_'.join(col) for col in df_stock_id.columns]
    df_stock_id = df_stock_id.add_suffix('_' + 'stock')

    # Group by the time id
    df_time_id = df.groupby(['time_id'])[vol_cols].agg(['mean', 'std', 'max', 'min', ]).reset_index()
    # Rename columns joining suffix
    df_time_id.columns = ['_'.join(col) for col in df_time_id.columns]
    df_time_id = df_time_id.add_suffix('_' + 'time')
    
    # Merge with original dataframe
    df = df.merge(df_stock_id, how = 'left', left_on = ['stock_id'], right_on = ['stock_id__stock'])
    df = df.merge(df_time_id, how = 'left', left_on = ['time_id'], right_on = ['time_id__time'])
    df.drop(['stock_id__stock', 'time_id__time'], axis = 1, inplace = True)
    return df
    
# Function to run the preprocessing functions in parallel (for each stock id)
def preprocessor(list_stock_ids, is_train = True):
    
    # Parallel for loop
    def for_joblib(stock_id):
        # Train
        if is_train:
            file_path_book = data_dir + "book_train.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_train.parquet/stock_id=" + str(stock_id)
        # Test
        else:
            file_path_book = data_dir + "book_test.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_test.parquet/stock_id=" + str(stock_id)
    
        # Preprocess book and trade data and merge them
        df_tmp = pd.merge(book_preprocessor(file_path_book), trade_preprocessor(file_path_trade), on = 'row_id', how = 'left')
        
        # Return the merged dataframe
        return df_tmp
    
    # Use the Parallel API to run the loop over all stock ids
    df = Parallel(n_jobs = -1, verbose = 1)(delayed(for_joblib)(stock_id) for stock_id in list_stock_ids)
    # Concatenate all the dataframes that return from Parallel
    df = pd.concat(df, ignore_index = True)
    return df

# Function to calculate the root mean squared percentage error
def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Function to early stop with root mean squared percentage error
def feval_rmspe(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'RMSPE', rmspe(y_true, y_pred), False

# def train_and_evaluate(train, test):
#     # Hyperparammeters (just basic)
#     params = {
#       'objective': 'rmse',  
#       'boosting_type': 'gbdt',
#       'num_leaves': 100,
#       'n_jobs': -1,
#       'learning_rate': 0.1,
#       'feature_fraction': 0.8,
#       'bagging_fraction': 0.8,
#       'verbose': -1
#     }
    
#     # Split features and target
#     x = train.drop(['row_id', 'target', 'time_id'], axis = 1)
#     y = train['target']
#     x_test = test.drop(['row_id', 'time_id'], axis = 1)
#     # Transform stock id to a numeric value
#     x['stock_id'] = x['stock_id'].astype(int)
#     x_test['stock_id'] = x_test['stock_id'].astype(int)
    
#     # Create out of folds array
#     oof_predictions = np.zeros(x.shape[0])
#     # Create test array to store predictions
#     test_predictions = np.zeros(x_test.shape[0])
#     # Create a KFold object
#     kfold = KFold(n_splits = 5, random_state = 66, shuffle = True)
#     # Iterate through each fold
#     for fold, (trn_ind, val_ind) in enumerate(kfold.split(x)):
#         print(f'Training fold {fold + 1}')
#         x_train, x_val = x.iloc[trn_ind], x.iloc[val_ind]
#         y_train, y_val = y.iloc[trn_ind], y.iloc[val_ind]
#         # Root mean squared percentage error weights
#         train_weights = 1 / np.square(y_train)
#         val_weights = 1 / np.square(y_val)
#         train_dataset = lgb.Dataset(x_train, y_train, weight = train_weights, categorical_feature = ['stock_id'])
#         val_dataset = lgb.Dataset(x_val, y_val, weight = val_weights, categorical_feature = ['stock_id'])
#         model = lgb.train(params = params, 
#                           train_set = train_dataset, 
#                           valid_sets = [train_dataset, val_dataset], 
#                           num_boost_round = 10000, 
#                           early_stopping_rounds = 50, 
#                           verbose_eval = 50,
#                           feval = feval_rmspe)
#         # Add predictions to the out of folds array
#         oof_predictions[val_ind] = model.predict(x_val)
#         # Predict the test set
#         test_predictions += model.predict(x_test) / 5
        
#     rmspe_score = rmspe(y, oof_predictions)
#     print(f'Our out of folds RMSPE is {rmspe_score}')
#     # Return test predictions
#     return test_predictions


In [None]:
#N_TRIALS = 100
TIME = 3600*6.5
N_SPLITS = 5
RANDOM_STATE = 99
kfold = KFold(N_SPLITS, random_state=RANDOM_STATE, shuffle=True)

FIXED_PARAMS = {'n_estimators': 10000,
                'learning_rate': 0.1,
                'metric': 'rmse',
                'verbosity': -1,
                'n_jobs': -1,
                #'max_bin': 127,
                'seed': RANDOM_STATE}

def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

    
def objective(trial, cv=kfold):
    
    params = {
        
        'num_leaves': trial.suggest_int('num_leaves', 2, 1024),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1),
        'subsample': trial.suggest_float('subsample', 0.4, 1),
        'cat_smooth': trial.suggest_float('cat_smooth', 10, 100.0),  
        'cat_l2': trial.suggest_int('cat_l2', 1, 20),
       
    }
    
    params.update(FIXED_PARAMS)

    pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "rmse", valid_name='valid_1')
    rmspe_list = []
    
    # Use `fold` as the loop variable so the global `kfold` splitter is not shadowed
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        X_val = X.iloc[val_idx]
        y_val = y.iloc[val_idx]

        d_train = lgb.Dataset(X_train, label=y_train)
        d_valid = lgb.Dataset(X_val, label=y_val)

        model = lgb.train(params,
                      train_set=d_train,
                      valid_sets=[d_train, d_valid],
                      verbose_eval=0,
                      early_stopping_rounds=100,
                      callbacks=[pruning_callback])

        preds = model.predict(X_val)
        score = rmspe(y_val, preds)
        
        rmspe_list.append(score)
        
    
    return np.mean(rmspe_list)


In [None]:
if create_features:
    # Read train and test
    train, test = read_train_test()

    # Get unique stock ids 
    train_stock_ids = train['stock_id'].unique()
    # Preprocess them using Parallel and our single stock id functions
    train_ = preprocessor(train_stock_ids, is_train = True)
    train = train.merge(train_, on = ['row_id'], how = 'left')

    # Get unique stock ids 
    test_stock_ids = test['stock_id'].unique()
    # Preprocess them using Parallel and our single stock id functions
    test_ = preprocessor(test_stock_ids, is_train = False)
    test = test.merge(test_, on = ['row_id'], how = 'left')

    # Get group stats of time_id and stock_id
    train = get_time_stock(train)
    test = get_time_stock(test)

    train.to_pickle('train_features_df.pickle')
    test.to_pickle('test_features_df.pickle')
else:
#     train = pd.read_pickle("../input/features/train_features_df.pickle")
#     test = pd.read_pickle("../input/features/test_features_df.pickle")
    train = pd.read_pickle("../input/optiver-feats/train_features_df.pickle")
    test = pd.read_pickle("../input/optiver-feats/test_features_df.pickle")

In [None]:
train.shape

In [None]:
X_display = train.drop(['row_id', 'time_id', 'target'], axis = 1)
X = X_display
y = train['target']


In [None]:
X_display.head(10)

In [None]:
if train_mode:
    study = optuna.create_study(direction='minimize', pruner=optuna.pruners.MedianPruner(n_warmup_steps=25))
    study.optimize(objective, timeout=TIME)
    
    print('Number of finished trials:', len(study.trials))
    print('Best trial:', study.best_trial.params)

Trial 24 finished with value: 0.24486020313983187 and parameters: {'num_leaves': 883, 'max_depth': 25, 'min_child_samples': 51, 'reg_alpha': 0.044271390027111224, 'reg_lambda': 4.3452654181772346, 'colsample_bytree': 0.7423339290785697, 'subsample': 0.9574494840064829, 'cat_smooth': 50.81669106814311, 'cat_l2': 14}. Best is trial 24 with value: 0.24486020313983187.

In [None]:
if train_mode:
    # An expression inside an `if` block is not auto-displayed, so show the figure explicitly
    optuna.visualization.plot_optimization_history(study).show()

In [None]:
if train_mode:
    # An expression inside an `if` block is not auto-displayed, so show the figure explicitly
    optuna.visualization.plot_param_importances(study).show()

In [None]:
if train_mode:
    # Print explicitly; a bare expression inside an `if` block produces no output
    print(study.best_params)

In [None]:
if train_mode:
    best_lgbmparams = dict(study.best_params)
else:
    best_lgbmparams = {'num_leaves': 181, 'max_depth': 20, 'min_child_samples': 33, 'reg_alpha': 0.004098886302163573, 'reg_lambda': 4.865281409093663, 'colsample_bytree': 0.43240377400000285, 'subsample': 0.9902166770372244, 'cat_smooth': 54.672676882001234, 'cat_l2': 19}

# Apply the fixed settings in both modes so the final model matches the tuning setup
best_lgbmparams.update(FIXED_PARAMS)
#     best_lgbmparams = {'learning_rate': 0.11352081667311227,
#                          'max_depth': 206,
#                          'lambda_l1': 3.802017952502632e-06,
#                          'lambda_l2': 6.33667047986424,
#                          'num_leaves': 69,
#                          'n_estimators': 540,
#                          'feature_fraction': 0.6675754770916755,
#                          'bagging_fraction': 0.694980646638328,
#                          'bagging_freq': 2,
#                          'min_child_samples': 8}
#     best_lgbmparams.update(FIXED_PARAMS)

In [None]:
test = test.drop(['time_id'], axis = 1)

X_test = test.drop(['row_id'], axis = 1)

In [None]:
def calc_model_importance(model, feature_names=None, importance_type='gain'):
    importance_df = pd.DataFrame(model.feature_importance(importance_type=importance_type),
                                 index=feature_names,
                                 columns=['importance']).sort_values('importance')
    return importance_df


In [None]:
kfold = KFold(N_SPLITS, random_state=RANDOM_STATE, shuffle=True)
# Create out of folds array
oof_predictions = np.zeros(X.shape[0])
# Create test array to store predictions
test_predictions = np.zeros(X_test.shape[0])

# Importance lists for plotting
gain_importance_list = []
split_importance_list = []

for fold, (trn_ind, val_ind) in enumerate(kfold.split(X)):
        print(f'Training fold {fold + 1}')
        x_train, x_val = X.iloc[trn_ind], X.iloc[val_ind]
        y_train, y_val = y.iloc[trn_ind], y.iloc[val_ind]
        # Root mean squared percentage error weights
        train_weights = 1 / np.square(y_train)
        val_weights = 1 / np.square(y_val)
        train_dataset = lgb.Dataset(x_train, y_train, weight = train_weights, categorical_feature = ['stock_id'])
        val_dataset = lgb.Dataset(x_val, y_val, weight = val_weights, categorical_feature = ['stock_id'])
        model = lgb.train(params = best_lgbmparams, 
                          train_set = train_dataset, 
                          valid_sets = [train_dataset, val_dataset], 
                          num_boost_round = 10000, 
                          early_stopping_rounds = 100, 
                          verbose_eval = 50,
                          feval = feval_rmspe)
        # Add predictions to the out of folds array
        oof_predictions[val_ind] = model.predict(x_val)
        # Predict the test set
        test_predictions += model.predict(X_test) / N_SPLITS
        
        #Feature importance
        feature_names = x_train.columns.values.tolist()
        gain_importance_df = calc_model_importance(
            model, feature_names=feature_names, importance_type='gain')
        gain_importance_list.append(gain_importance_df)

        split_importance_df = calc_model_importance(
            model, feature_names=feature_names, importance_type='split')
        split_importance_list.append(split_importance_df)

rmspe_score = rmspe(y, oof_predictions)
print(f'Our out of folds RMSPE is {rmspe_score}')

In [None]:
non_vol = [x for x in list(X.columns.values) if 'volatility' not in x]

def calc_mean_importance(importance_df_list):
    # Each fold's frame is sorted by its own importance, so the row order differs
    # between folds. Align on the feature-name index before averaging.
    mean_importance = pd.concat(
        [df['importance'] for df in importance_df_list], axis=1).mean(axis=1)
    mean_df = mean_importance.to_frame('importance')
    
    return mean_df

def plot_importance(importance_df, title='', save_filepath=None, figsize=(8, 12), exclude_vol=False):
    fig, ax = plt.subplots(figsize=figsize)
    if exclude_vol:
        importance_df = importance_df.loc[importance_df.index.isin(non_vol)]
    importance_df = importance_df.sort_values(by='importance',ascending=False).head(15)
    importance_df.plot.barh(ax=ax)
    if title:
        plt.title(title)
    plt.tight_layout()
    if save_filepath is None:
        plt.show()
    else:
        plt.savefig(save_filepath)
    plt.close()
    

mean_gain_df = calc_mean_importance(gain_importance_list)
plot_importance(mean_gain_df, title='Model feature importance by gain')


In [None]:
plot_importance(mean_gain_df, title='Non-volatility feature importance', exclude_vol=True)

In [None]:
test['target'] = test_predictions
test[['row_id', 'target']].to_csv('submission.csv',index = False)
