# Overview

## Feature Engineering
### Order and Trade features
For each stock, features were generated by grouping orders/trades by `time_id` and applying an aggregation function to a given column.  The same aggregations were computed over the full 600-second window, over the data from 300 seconds onwards, and over separate 100-second time buckets.
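
As a rough sketch of this grouping, assuming a toy trades table with the `time_id`, `seconds_in_bucket` and `price` columns (the values, the 300-second bucket width and the helper name `bucket_features` are made up for illustration), each bucket is filtered first and then aggregated per `time_id`:

```python
import pandas as pd

# Toy trades for two time_ids; the values are illustrative only.
trades = pd.DataFrame({
    "time_id": [5, 5, 5, 5, 11, 11],
    "seconds_in_bucket": [10, 90, 150, 420, 30, 130],
    "price": [1.00, 1.02, 1.01, 0.99, 2.00, 2.10],
})


def bucket_features(df, bucket_time=300, horizon=600):
    """Aggregate `price` per time_id within each [start, end) bucket (hypothetical helper)."""
    parts = []
    for start in range(0, horizon, bucket_time):
        end = start + bucket_time
        in_bucket = df[(df["seconds_in_bucket"] >= start) & (df["seconds_in_bucket"] < end)]
        agg = in_bucket.groupby("time_id")["price"].agg(["mean", "count"])
        # Column names follow a <column>|<aggregation>|<bucket> pattern.
        agg.columns = [f"price|{name}|bucket_{start}_{end}" for name in agg.columns]
        parts.append(agg)
    # A time_id with no rows in a bucket ends up with NaN in that bucket's columns.
    return pd.concat(parts, axis=1)


features = bucket_features(trades)
print(features)
```

Here `time_id` 11 has no trades after 300 seconds, so its second-bucket features are missing; missing values of this kind are what the later null handling has to deal with.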

### Market Aggregates
Although I found that including market aggregates improved (lowered) the cross-validation score, the public leaderboard score was significantly worse.  
I suspect that either some leakage went unaccounted for in the cross validation, or the test set differs in some technical way that breaks the approach I tried.


## Model

### Structure
The model used is a Neural network (NN) that combines a couple of different ideas:

* Encoded features are used with the original features in the final output NN
* `stock_id` values are transformed with an Embedding layer - this could group similar stocks together and also reduces the dimensional space in comparison to one-hot-encoding
* Features are computed for separate time buckets (e.g. data between 0 and 100 seconds, between 100 and 200 seconds, etc.).  A 1D convolutional layer is applied along the time-bucket dimension of these features.
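
The convolution in the last bullet can be pictured with plain NumPy. This is only a sketch: it assumes "valid" padding, stride 1, no bias or activation, and that each metric row is convolved independently with shared kernels. The array names and sizes are illustrative and are not taken from the trained model.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

n_samples, n_metrics, n_buckets = 4, 31, 6  # illustrative sizes
kernel_size, num_kernels = 3, 4             # illustrative hyperparameters

features = rng.normal(size=(n_samples, n_metrics, n_buckets))
kernels = rng.normal(size=(kernel_size, num_kernels))

# Windows of `kernel_size` consecutive buckets:
# shape (samples, metrics, n_buckets - kernel_size + 1, kernel_size)
windows = sliding_window_view(features, kernel_size, axis=-1)

# Every window is projected onto each kernel, so the same kernels slide
# along the bucket axis of every metric.
convolved = windows @ kernels

print(convolved.shape)
```

Each metric therefore yields `n_buckets - kernel_size + 1` positions per kernel, which are then flattened before being concatenated with the stock embedding.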

High-level structure
```
Input Features -> Encoded Features -> Decoded Features -> output_0
       \            /
       Concat Features -> (Layers) -> output_final
```
The auto-encoder model attempts to minimise the error of `Decoded Features`, `output_0` and `output_final` with respect to the input features, the target, and the target, respectively.
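
A minimal sketch of that combined objective, assuming each term is a plain mean squared error and the three terms are summed with equal weights (sample weights are left out here, and the function names and toy values are hypothetical):

```python
import numpy as np


def mse(pred, true):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))


def combined_loss(decoded, output_0, output_final, features, target,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the reconstruction error and the two prediction errors."""
    return (weights[0] * mse(decoded, features)
            + weights[1] * mse(output_0, target)
            + weights[2] * mse(output_final, target))


# Toy values for illustration only.
features = np.array([[0.0, 1.0], [2.0, 3.0]])
target = np.array([1.0, 2.0])

total = combined_loss(
    decoded=features + 0.1,              # reconstruction of the inputs
    output_0=np.array([1.1, 1.8]),       # prediction from the decoded features
    output_final=np.array([1.0, 2.1]),   # prediction from the concatenated features
    features=features,
    target=target,
)
print(total)
```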


### Training Model

* A sample weighting of `1/y^2` was used such that root mean square percentage error is minimised (as specified in https://www.kaggle.com/c/optiver-realized-volatility-prediction/overview/evaluation)
* Dynamic Learning Rate - When validation errors are no longer improving the learning rate is reduced to help converge towards better values
* Early stopping - When validation errors are no longer improving (within a given number of epochs) the training is stopped early
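
The link between the `1/y^2` sample weighting and RMSPE can be checked numerically. The sketch below assumes the weighted squared errors are averaged over samples; the toy values and variable names are illustrative.

```python
import numpy as np

# Toy targets and predictions; values are illustrative only.
y_true = np.array([0.002, 0.004, 0.010])
y_pred = np.array([0.003, 0.004, 0.008])


def rmspe(y_true, y_pred):
    """Root mean square percentage error."""
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))


# Mean squared error with each sample weighted by 1 / y^2.
sample_weight = 1.0 / y_true ** 2
weighted_mse = np.mean(sample_weight * (y_true - y_pred) ** 2)

print(rmspe(y_true, y_pred), np.sqrt(weighted_mse))
```

Both numbers agree because `(1/y^2) * (y - p)^2` is exactly `((y - p) / y)^2`, so minimising the weighted MSE minimises the squared percentage error.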


## Cross Validation
KFold cross validation was used with 5 folds.  
Data was sorted by `time_id` and the folds were not shuffled. This was chosen so that the `time_id`s in each validation set are almost entirely distinct from those in the corresponding training set.
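
To see why unshuffled folds on sorted data keep the `time_id`s nearly distinct, the sketch below splits a toy dataset into contiguous blocks with `np.array_split`, which is assumed here to mirror the contiguous splitting of an unshuffled `KFold`. The dataset size is made up for illustration.

```python
import numpy as np

# Toy data: 11 time_ids with 3 rows (stocks) each, already sorted by time_id.
time_ids = np.repeat(np.arange(11), 3)
all_rows = np.arange(len(time_ids))

shared_counts = []
for fold, val_idx in enumerate(np.array_split(all_rows, 5)):
    train_idx = np.setdiff1d(all_rows, val_idx)
    # time_ids that appear on both sides of the split
    shared = np.intersect1d(time_ids[train_idx], time_ids[val_idx])
    shared_counts.append(len(shared))
    print(f"fold {fold}: shared time_ids = {shared.tolist()}")
```

Only the `time_id`s that straddle a fold boundary are shared, so each validation fold overlaps with its training data in at most two `time_id`s.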


## Hyperparameter Tuning
Keras Tuner with a custom `run_trial` method was used for tuning hyperparameters.  The custom `run_trial` method utilises KFold cross validation as described above and the score of a trial is the mean of the out-of-fold (OOF) scores.


## Output Predictions

* The median of the predictions from the cross validation models is used.  The median was preferred here as there is an asymmetric relationship between prediction and score (e.g. predicting `actual_value * k` and `actual_value / k` does not give the same score).
* Null values were replaced by an optimal constant value, as described in my other notebook: https://www.kaggle.com/cldavies/single-value-baseline
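
As a small illustration of the median step, assume five fold models each predict two rows and one model badly over-predicts the first row (the numbers are invented for this sketch):

```python
import numpy as np

# Rows: one per fold model; columns: one per test sample (toy values).
fold_predictions = np.array([
    [0.0031, 0.0050],
    [0.0029, 0.0052],
    [0.0030, 0.0049],
    [0.0030, 0.0051],
    [0.0090, 0.0050],  # this fold over-predicts the first sample
])

median_prediction = np.median(fold_predictions, axis=0)
mean_prediction = fold_predictions.mean(axis=0)

print("median:", median_prediction)
print("mean:  ", mean_prediction)
```

The median stays close to the consensus of the other four models, whereas the mean is pulled upwards by the single outlier.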


## References

* Model - https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp/comments#Jane-Street:-Supervised-Autoencoder-MLP
* Metrics - https://www.kaggle.com/tommy1028/lightgbm-starter-with-feature-engineering-idea


------

### Imports

In [None]:
import os
from multiprocessing import Pool

import tqdm
import glob

import numpy as np
import pandas as pd
import cufflinks as cf
import kerastuner as kt

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import QuantileTransformer

cf.go_offline()

In [None]:
%load_ext memory_profiler

### Constants


In [None]:
SEED = 0
OUTPUT_DIR = '.'

kaggle_input_path = '../input/optiver-realized-volatility-prediction'
if os.path.exists(kaggle_input_path):
    DATA_DIR = kaggle_input_path
else:
    # running locally
    DATA_DIR = 'data'
    
DATA_DIR

## Data files

In [None]:
orders_files = sorted(glob.glob(f'{DATA_DIR}/book_train.parquet/*') + glob.glob(f'{DATA_DIR}/book_test.parquet/*'))
trades_files = sorted(glob.glob(f'{DATA_DIR}/trade_train.parquet/*') + glob.glob(f'{DATA_DIR}/trade_test.parquet/*'))

files_df = pd.concat([
    pd.Series(orders_files, name='orders_file'),
    pd.Series(trades_files, name='trades_file')
], axis=1)
files_df['stock_id'] = (files_df['orders_file'].str.split('=').str[-1]).astype(int)
files_df['stock_id2'] = (files_df['trades_file'].str.split('=').str[-1]).astype(int)
files_df['is_train'] = files_df['orders_file'].str.contains('train')
assert files_df.eval('stock_id == stock_id2').all(), 'StockId of files do not match'

## Augment columns

In [None]:
def augment_trades(trades_df):
    """Supplement dataframe with columns computed by values from other columns
    
    Parameters
    ----------
    trades_df: pandas.DataFrame
        DataFrame with columns: ['time_id', 'seconds_in_bucket', 'price', 'size', 'order_count']
    """
    trades_df['notional'] = trades_df.eval('price * size')
    # `transform` keeps the original row index, so the result aligns with `trades_df`
    trades_df['log_return'] = trades_df.groupby('time_id')['price'].transform(log_return)
    return trades_df


def log_return(df):
    """Log return of prices
    
    Parameters
    ----------
    df: pandas.Series
    """
    return df.apply(np.log).diff()


def augment_book(book_df):
    """Supplement dataframe with columns computed by values from other columns
    
    Parameters
    ----------
    book_df: pandas.DataFrame
        DataFrame with columns: 
        ['time_id', 'seconds_in_bucket', 
         'bid_price1', 'ask_price1', 
         'bid_price2', 'ask_price2', 
         'bid_size1', 'ask_size1', 
         'bid_size2', 'ask_size2']
    """
    book_df['midpoint'] = book_df.eval('(bid_price1 + ask_price1)/2')
    book_df['weighted_midpoint_1'] = book_df.eval('(bid_price1*ask_size1 + ask_price1*bid_size1)/(ask_size1 + bid_size1)')
    book_df['weighted_midpoint_2'] = book_df.eval('(bid_price2*ask_size2 + ask_price2*bid_size2)/(ask_size2 + bid_size2)')
    # `transform` keeps the original row index, so the results align with `book_df`
    book_df['log_return_1'] = book_df.groupby('time_id')['weighted_midpoint_1'].transform(log_return)
    book_df['log_return_2'] = book_df.groupby('time_id')['weighted_midpoint_2'].transform(log_return)
    book_df['volume_imbalance_1'] = book_df.eval('(ask_size1 - bid_size1)/(ask_size1 + bid_size1)')
    book_df['volume_imbalance_2'] = book_df.eval('(ask_size1 + ask_size2 - bid_size1 - bid_size2)/(ask_size1 + bid_size1 + ask_size2 + bid_size2)')
    book_df['abs_volume_imbalance_2'] = book_df.eval('(ask_size1 + ask_size2 - bid_size1 - bid_size2)').abs()
    book_df['liquidity_near_bbo'] = book_df.eval('bid_size1 + bid_size2 + ask_size1 + ask_size2')
    book_df['abs_mid_diff'] = book_df.eval('abs(weighted_midpoint_2 - weighted_midpoint_1)')
    book_df['relative_spread'] = book_df.eval('(ask_price1 - bid_price1)/midpoint')
    book_df['bid_spread'] = book_df.eval('bid_price1 - bid_price2')
    book_df['ask_spread'] = book_df.eval('ask_price1 - ask_price2')
    return book_df

## Order and Trade Metrics

In [None]:
BUCKET_TIME = 100


def count_unique(x):
    """Number of unique values in numpy array / pandas series"""
    return len(np.unique(x))


def realised_vol(df):
    """Realised volatility where the input is the log return of a price series"""
    return np.sqrt(df.apply(np.square).sum())


def order_metrics(book_df):
    """Order book related features
    
    Returns
    -------
    pandas.DataFrame
        DataFrame with `time_id` as index and each column is a computed feature.
    """
    feature_dict = {
        'log_return_1': [realised_vol],
        'log_return_2': [realised_vol],
        'abs_mid_diff': [np.mean],
        'relative_spread': [np.mean],
        'bid_spread': [np.mean],
        'ask_spread': [np.mean],
        'abs_volume_imbalance_2': [np.mean],
        'liquidity_near_bbo': [np.mean],
        'weighted_midpoint_1': [np.mean],
        
        'time_id': ['count'], 
        'bid_size1': 'mean', 
        'ask_size1': 'mean', 
        'bid_size2': 'mean', 
        'ask_size2': 'mean', 
        'volume_imbalance_1': 'mean', 
        'volume_imbalance_2': 'mean',
    }
    
    return _metrics(book_df, feature_dict)


def trade_metrics(trade_df):
    """Trade related features
    
    Returns
    -------
    pandas.DataFrame
        DataFrame with `time_id` as index and each column is a computed feature.
    """
    feature_dict = {
        'log_return': [realised_vol],
        'seconds_in_bucket': [count_unique, 'count'],
        'order_count': ['sum', 'mean'], 
        'size': ['sum', 'mean'], 
        'notional': ['sum', 'mean'], 
        'price': ['mean', 'first', 'last', 'min', 'max']
    }
    
    results_df = _metrics(trade_df, feature_dict)
    
    suffixes = ['0', '300', 'bucket_0_100', 'bucket_100_200', 'bucket_200_300', 'bucket_300_400', 'bucket_400_500', 'bucket_500_600']
    for suffix in suffixes:
        results_df[f'vwap|mean|{suffix}'] = results_df[f'notional|mean|{suffix}'] / results_df[f'size|mean|{suffix}']
    
    return results_df


def _metrics(df, feature_dict):
    """Compute features on a dataframe
    
    Parameters
    ----------
    df: pandas.DataFrame
        DataFrame with either order book data or trades data
    feature_dict: dict
        Dictionary of {<column name>: <list of aggregation methods to apply>}
    """
    results = []

    for earliest_time in [0, 300]:
        result = df.query('seconds_in_bucket >= @earliest_time').groupby('time_id').agg(feature_dict)
        result.columns = [f'{"|".join(col)}|{earliest_time}' for col in result.columns]
        results.append(result)       
        
    
    n_buckets = int(600/BUCKET_TIME)
    
    for bucket_index in range(n_buckets):
        start = BUCKET_TIME * bucket_index
        end = BUCKET_TIME * (bucket_index + 1)
        result = df.query('@start <= seconds_in_bucket < @end').groupby('time_id').agg(feature_dict)
        result.columns = [f'{"|".join(col)}|bucket_{start}_{end}' for col in result.columns]
        results.append(result)
    
    results_df = pd.concat(results, axis=1)
    return results_df


## Compute Metrics

In [None]:
def compute_metrics(row):
    trade_metrics_df = _compute_trade_metrics(row['trades_file'])
    order_metrics_df = _compute_order_metrics(row['orders_file'])
    metrics_df = pd.concat([trade_metrics_df, order_metrics_df], axis=1)
    metrics_df = metrics_df.assign(stock_id=row['stock_id'], is_train=row['is_train']).reset_index().set_index(['stock_id', 'time_id']).reset_index()
    return metrics_df


def _compute_trade_metrics(filename):
    trades_df = pd.read_parquet(filename)   
    trades_df = augment_trades(trades_df)
    metrics_df = trade_metrics(trades_df)
    return _prefix_columns(metrics_df, 'trade')


def _compute_order_metrics(filename):
    book_df = pd.read_parquet(filename)   
    book_df = augment_book(book_df)
    metrics_df = order_metrics(book_df)
    return _prefix_columns(metrics_df, 'order')


def _prefix_columns(df, prefix):
    df.columns = [f'{prefix}|{col}' for col in df.columns]
    return df


def merge_training_data(results_df):
    train_target_df = pd.read_csv(f'{DATA_DIR}/train.csv')
    combined_df = pd.merge(results_df, train_target_df, on=['stock_id', 'time_id'], how='left').set_index(['stock_id', 'time_id', 'target', 'is_train']).reset_index()
    return combined_df

## Generate Metrics

In [None]:
%%time
%%memit

with Pool() as pool:
    input_rows = files_df.to_dict('records')
    results = list(tqdm.tqdm(pool.imap(compute_metrics, input_rows), total=len(input_rows)))
    results_df = pd.concat(results)
    combined_df = merge_training_data(results_df)

    # Save to file
    (
        combined_df
        .reset_index(drop=True)
        .astype('float32')
        .astype({'stock_id': int, 'time_id': int, 'is_train': bool})
        .to_feather(f'{OUTPUT_DIR}/metrics3.feather')
    )

    del results
    del results_df
    del combined_df

# Modelling

## Load Dataset

In [None]:
%%memit

try:
    del df
except NameError:
    print('df not known')
    pass

COLUMNS_TO_DROP = [
    #'stock_id', 
    'time_id', 
    'target', 
    'is_train'
]

df = pd.read_feather(f'{OUTPUT_DIR}/metrics3.feather')
df_train = df.query('is_train').copy().sort_values(by='time_id').reset_index(drop=True)
df_test = df.query('not is_train').copy().reset_index(drop=True)

## Transform / Normalise Features

In [None]:
%%memit

def split_feature_target(df):
    features_df = df.drop(columns=COLUMNS_TO_DROP)
    target = df['target']
    return features_df, target


stock_id_mean_target_map = df_train.groupby(['stock_id'])['target'].mean()
stock_id_mean_target_map

stock_time_test_df = df_test[['stock_id', 'time_id']].copy()  # used in output file

X_train, y_train = split_feature_target(df_train)
X_test, _ = split_feature_target(df_test)

colNames = [col for col in list(df.columns)
            if col not in {"stock_id", "time_id", "target", "row_id", "is_train"}]


X_train['num_nan'] = X_train[colNames].isna().sum(axis=1)
X_test['num_nan'] = X_test[colNames].isna().sum(axis=1)

colNames += ['num_nan']

qt_train = []
for col in tqdm.tqdm(colNames):
    qt = QuantileTransformer(random_state=21,n_quantiles=2000, output_distribution='normal')
    X_train[col] = qt.fit_transform(X_train[[col]]).astype('float32')
    X_test[col] = qt.transform(X_test[[col]]).astype('float32')  
    qt_train.append(qt)
    
X_train = X_train.fillna(0).astype('float32').astype({'stock_id': int})
X_test = X_test.fillna(0).astype('float32').astype({'stock_id': int})

print('transformed data')

# clean up
del df
del df_train
del df_test
print('cleaned up df')

## Selected Features and Reshaping

In [None]:
#suffixes = [f'bucket_{i*100}_{(i+1)*100}' for i in range(6)]
bucket_suffixes = [
    'bucket_0_100',
    'bucket_100_200',
    'bucket_200_300',
    'bucket_300_400',
    'bucket_400_500',
    'bucket_500_600']


#base_metrics = [row[:-13] for row in df.filter(like='|bucket_0_100').columns.tolist()]
base_metrics = [
    'trade|log_return|realised_vol',
    'trade|seconds_in_bucket|count_unique',
    'trade|seconds_in_bucket|count',
    'trade|order_count|sum',
    'trade|order_count|mean',
    'trade|size|sum',
    'trade|size|mean',
    'trade|notional|sum',
    'trade|notional|mean',
    'trade|price|mean',
    'trade|price|first',
    'trade|price|last',
    'trade|price|min',
    'trade|price|max',
    'trade|vwap|mean',
    'order|log_return_1|realised_vol',
    'order|log_return_2|realised_vol',
    'order|abs_mid_diff|mean',
    'order|relative_spread|mean',
    'order|bid_spread|mean',
    'order|ask_spread|mean',
    'order|abs_volume_imbalance_2|mean',
    'order|liquidity_near_bbo|mean',
    'order|weighted_midpoint_1|mean',
    'order|time_id|count',
    'order|bid_size1|mean',
    'order|ask_size1|mean',
    'order|bid_size2|mean',
    'order|ask_size2|mean',
    'order|volume_imbalance_1|mean',
    'order|volume_imbalance_2|mean',
]


def cubify(df):
    """Transform the data into a 3 dimensional tensor (<num samples>, <base_metrics>, <suffixes>) instead of a 2d representation
    
    Parameters
    ----------
    df: pandas.DataFrame
    
    Returns
    -------
    numpy.ndarray
        Returns an array of shape (<num samples>, <base_metrics>, <suffixes>)
    """
    cube = []

    for base_metric in base_metrics:
        metrics = [f'{base_metric}|{suffix}' for suffix in bucket_suffixes]
        cube.append(df[metrics].values)

    arr = np.swapaxes(np.array(cube), 0, 1)
    return arr

In [None]:
%%memit
pass

## Scoring

In [None]:
def get_prediction(model, X):
    """Get prediction of a model given an input X"""
    X_cat, X = X
    if isinstance(X, pd.DataFrame):
        X = X.values
    y_predict = np.mean(
        model([X_cat.values, X], training=False)[2].numpy(), 
    axis=1)

    results = pd.Series(y_predict, index=X_cat.index)
    results.name = 'prediction'
    return results
    

def get_result(model, X, y):
    """Get the RMSPE of a model for a given input `X` and target `y`"""
    results = get_prediction(model, X)
    return score(results, y)


def score(results, y):
    """RMSPE of the `prediction` series `results` against the `target` series `y`"""
    results_df = pd.concat([results, y], axis=1)
    results_df['sq_pc_error'] = results_df.eval('(target - prediction) / target').apply(np.square)
    return np.sqrt(results_df['sq_pc_error'].sum() / results_df['sq_pc_error'].shape[0])

## Auto-encoder Model

(plot of model below)

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, GaussianNoise
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping


MAX_ENCODING_VAL = 127  # max stock id + 1


def create_model(hidden_units, 
                 dropout_rates,
                 learning_rate, 
                 num_columns, 
                 num_labels, 
                 stock_embedding_size,
                 kernel_size,
                 num_kernels):
    
    stock_id_input = tf.keras.Input(shape=(1,), name='stock_id')
    features_input = tf.keras.layers.Input(shape=(len(base_metrics), len(bucket_suffixes), 1,))
    
    # Convolve along the time-bucket axis; the input shape is taken from `features_input`
    cnn_layer = tf.keras.layers.Conv1D(filters=num_kernels,
                                       kernel_size=kernel_size,
                                       activation='swish')(features_input)
    convolved_features = tf.keras.layers.Flatten()(cnn_layer)


    stock_embedded = tf.keras.layers.Embedding(MAX_ENCODING_VAL, 
                                               stock_embedding_size, 
                                               input_length=1, 
                                               name='stock_embedding')(stock_id_input)

    stock_flattened = tf.keras.layers.Flatten()(stock_embedded)
    concat_input = tf.keras.layers.Concatenate()([stock_flattened, convolved_features])
    norm_input = tf.keras.layers.BatchNormalization()(concat_input)

    encoder = tf.keras.layers.GaussianNoise(dropout_rates[0])(norm_input)
    encoder = tf.keras.layers.Dense(hidden_units[0])(encoder)
    encoder = tf.keras.layers.BatchNormalization()(encoder)
    encoder = tf.keras.layers.Activation('swish')(encoder)

    decoder = tf.keras.layers.Dropout(dropout_rates[1])(encoder)
    decoder = tf.keras.layers.Dense(num_columns, name = 'decoder')(decoder)
    
    x_ae = tf.keras.layers.Dense(hidden_units[1])(decoder)
    x_ae = tf.keras.layers.BatchNormalization()(x_ae)
    x_ae = tf.keras.layers.Activation('swish')(x_ae)
    x_ae = tf.keras.layers.Dropout(dropout_rates[2])(x_ae)
    
    out_ae = tf.keras.layers.Dense(num_labels, activation = 'linear', name = 'ae_action')(x_ae)
    
    x = tf.keras.layers.Concatenate()([norm_input, encoder])
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout_rates[3])(x)
    
    for i in range(2, len(hidden_units)):
        x = tf.keras.layers.Dense(hidden_units[i])(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('swish')(x)
        x = tf.keras.layers.Dropout(dropout_rates[i + 2])(x)
    
    out = tf.keras.layers.Dense(num_labels, activation = 'linear', name = 'action')(x)

    
    model_encoder = tf.keras.models.Model(inputs=[stock_id_input, features_input], outputs=encoder)
    
    model = tf.keras.models.Model(inputs=[stock_id_input, features_input], outputs=[decoder, out_ae, out])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.MeanSquaredError(reduction="auto", name="mean_squared_error"),
        loss_weights=[1, 1, 1],
    )

    return model, model_encoder




#############
# Callbacks #
#############

class NEpochLogger(tf.keras.callbacks.Callback):
    """
    A logger that prints the metrics of every `display`-th epoch.
    https://github.com/keras-team/keras/issues/2850#issuecomment-371353851
    """
    def __init__(self, display):
        super().__init__()
        self.display = display

    def on_epoch_end(self, epoch, logs=None):
        # `logs` is shared with the other callbacks, so it is only read here, never modified
        if epoch % self.display == 0:
            print(f'Epoch {epoch}  \t|\t', {k: round(float(v), 6) for k, v in (logs or {}).items()})


def get_callbacks():
    callback_early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='val_action_loss', 
        patience=50,
        verbose=1,
        mode='min', 
    )

    callback_plateau = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_action_loss', 
        factor=0.5, 
        patience=10, 
        verbose=1,
        mode='min',
        min_lr=1e-7,
        cooldown=5,
    )
    callback_epoch_logger = NEpochLogger(50)
    
    return [callback_early_stopping, callback_plateau, callback_epoch_logger]

## Summary and Plot Model

In [None]:
def get_default_parameters():
    return {
        'dropout_rates': [0.022391, # noise
                          0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
        'hidden_units': [
            20,  # number of encoded features
            50,  # layer for output of encoded features
            64, 10  # layers for output model
        ],
        'learning_rate': 1e-3,
        'num_columns': X_train.shape[1] - 1,
        'num_labels': 1,
        'stock_embedding_size': 50,
        'kernel_size': 3,
        'num_kernels': 4,
    }


model, encoder = create_model(**get_default_parameters())
model.summary()

tf.keras.utils.plot_model(
    model,
    show_shapes=False,
    show_dtype=False,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)

In [None]:
import gc
gc.collect()

## Hyperparameter tuning

(The line that runs this is currently commented out)

In [None]:
class CVTuner(kt.engine.tuner.Tuner):       
    
    def run_trial(self, trial, X, y, batch_size=32, epochs=1, callbacks=None):
        """Run trial with cross validation
        
        Note: the `X` and `y` arguments are not used; the folds are drawn from the
        module-level `X_train` and `y_train`, and `X` / `y` are reassigned inside the loop.
        
        The fold data is transformed before being passed to the model in the following way:
        'stock_id' column is split out from the main bulk of the features
        The other features are reshaped into a tensor such that 
          - horizontally adjacent cells are the same feature but for a different time bucket
          - vertically adjacent cells are different features
        """
       
        tf.random.set_seed(SEED)
        np.random.seed(SEED)
        
        val_losses = []
        oof_scores = []
        
        kf = KFold(
            n_splits=5,
            shuffle=False
        )
        
        for fold, (train_index, validate_index) in enumerate(kf.split(X_train, y_train)):
            X = X_train.loc[train_index]
            y = y_train.loc[train_index]
            X_val = X_train.loc[validate_index]
            y_val = y_train.loc[validate_index]

            X_cat = X.pop('stock_id')
            X_cat_val = X_val.pop('stock_id')
            
            X_cube = cubify(X)
            X_cube_val = cubify(X_val)
            
            model = self.hypermodel.build(trial.hyperparameters)
                            
            sample_weight = 1/y**2
            hist = model.fit([X_cat, X_cube], [X, y, y], 
                             epochs=epochs,
                             sample_weight=sample_weight,
                             batch_size=batch_size,
                             validation_data=([X_cat_val, X_cube_val], [X_val, y_val, y_val], 1/np.square(y_val)),
                             callbacks=callbacks,
                             shuffle=True,
                             verbose=0)

            val_losses.append([hist.history[k][-1] for k in hist.history])
            oof_score = get_result(model, [X_cat_val, X_cube_val], y_val)
            oof_scores.append(oof_score)
            print(f'Fold {fold}, oof_score: {round(oof_score, 4)}')
            
        val_losses = np.asarray(val_losses)
        metrics = {k:np.mean(val_losses[:,i]) for i,k in enumerate(hist.history.keys())}
        metrics['oof_score'] = np.mean(oof_scores)
        
        self.oracle.update_trial(trial.trial_id, metrics)
        self.save_model(trial.trial_id, model)


def get_create_model(hp):
    # overwrite parameters for hyperparameter tuning
    parameters = get_default_parameters()
    #     parameters['stock_embedding_size'] = hp.Int('stock_embedding_size', 2, 130)
    #     parameters['kernel_size'] = hp.Int('kernel_size', 2, 6)
    #     parameters['num_kernels'] = hp.Int('num_kernels', 2, 20)
    parameters['hidden_units'][2] = hp.Int('num_hidden_units_out_layer_1', 16, 200)
    return create_model(**parameters)[0]


def get_tuner(project_name, max_number_of_trials):
    return CVTuner(
        hypermodel=get_create_model,
        oracle=kt.oracles.BayesianOptimization(
            objective= kt.Objective('oof_score', direction='min'),
            num_initial_points=7,
            max_trials=max_number_of_trials),
        project_name=project_name,
    )

def hyper_param_search(X, y, project_name, max_number_of_trials):
    tuner = get_tuner(project_name, max_number_of_trials)
    tuner.search((X,), (y,), batch_size=2000, epochs=1000, callbacks=get_callbacks())

    hp = tuner.get_best_hyperparameters(1)[0]
    display(tuner.project_name)
    display(hp.values)
    return tuner


# Run hyperparameter tuning
# tuner = hyper_param_search(X_train, y_train, 'hyperparameter_cnn_num_hidden_units_out_layer_1', max_number_of_trials=30)


# Plot Results
# tuner = get_tuner(project_name='hyperparameter_cnn_num_hidden_units_out_layer_1', max_number_of_trials=0)
# hp_results_df = pd.DataFrame([{'score': row.score, **row.hyperparameters.values} for row in tuner.oracle.get_best_trials(-1)])
# hp_results_df.set_index('num_hidden_units_out_layer_1').sort_index().iplot()

## Cross Validation (5-fold KFold)

In [None]:
%%time
import numpy as np
import tensorflow as tf
import random as python_random

np.random.seed(SEED)
python_random.seed(SEED)
tf.random.set_seed(SEED)


kf = KFold(
    n_splits=5,
    shuffle=False
)
models = []

for fold, (train_index, validate_index) in enumerate(kf.split(X_train, y_train)):
    X = X_train.loc[train_index]
    y = y_train.loc[train_index]
    X_val = X_train.loc[validate_index]
    y_val = y_train.loc[validate_index]
    
    X_cat = X.pop('stock_id')
    X_cat_val = X_val.pop('stock_id')   

    sample_weight = 1/y**2

    X_cube = cubify(X)
    X_cube_val = cubify(X_val)

    model, encoder = create_model(**get_default_parameters())
    model.fit([X_cat, X_cube], [X, y, y], 
          epochs=1000,
          sample_weight=sample_weight,
          batch_size=2000,
          validation_data=([X_cat_val, X_cube_val], [X_val, y_val, y_val], 1/np.square(y_val)),
          callbacks=get_callbacks(),
          shuffle=True,
          verbose=0)

    fold_score = get_result(model, [X_cat, X_cube], y)
    oof_score = get_result(model, [X_cat_val, X_cube_val], y_val)

    print('\n')
    print(f'fold {fold}, out of fold score: {oof_score}')
    print('\n\n')
    models.append({'model': model, 'fold_score': fold_score, 'oof_score': oof_score})

## OOF Score

In [None]:
print(np.mean([row['oof_score'] for row in models]).round(4))
pd.DataFrame(models)

## Create Submission File

In [None]:
predictions = []

X_final = X_test.copy(deep=True)
X_final_cat = X_final.pop('stock_id')
X_final_cube = cubify(X_final)

for model_map in models:
    model = model_map['model']
    predictions.append(get_prediction(model, [X_final_cat, X_final_cube]))
    
prediction = pd.concat(predictions, axis=1).median(axis=1)
prediction.name = 'prediction'

In [None]:
prediction_df = pd.concat([stock_time_test_df, prediction], axis=1).astype({'prediction': 'float64'})
prediction_df

In [None]:
prediction_df.dtypes

In [None]:
submission_df = pd.read_csv(f'{DATA_DIR}/test.csv')
submission_df = submission_df.merge(prediction_df, how='left', on=['stock_id', 'time_id'])

# Fill null predictions with the mean target of the stock, then with the overall mean target
submission_df['prediction'] = (
    submission_df['prediction']
    .fillna(submission_df['stock_id'].map(stock_id_mean_target_map))
    .fillna(y_train.mean())
)
submission_df = submission_df[['row_id', 'prediction']].rename(columns={'prediction': 'target'})
submission_df.to_csv('submission.csv', index=False)

In [None]:
submission_df