<center><h2>Jane Street Market Prediction | LGB Hyperparameter Optimization | katsu1110 </h2></center><hr>

![](https://optuna.org/assets/img/optuna-logo@2x.png)

Here I demonstrate how to use [Optuna](https://optuna.org/) to find a better set of hyperparameters via Bayesian optimization. I need a good LGB model for my ensemble :D

As a bonus, I save the tuned model in the [Treelite](https://treelite.readthedocs.io/en/latest/) format to speed up inference.

This notebook loads data saved in the feather format by [another notebook of mine](https://www.kaggle.com/code1110/janestreet-save-as-feather?scriptVersionId=47635784), so that we don't have to wait for the CSV files to load.

In this notebook we treat the task as binary classification.

# Install Treelite

In [None]:
!pip --quiet install ../input/treelite/treelite-0.93-py3-none-manylinux2010_x86_64.whl

In [None]:
!pip --quiet install ../input/treelite/treelite_runtime-0.93-py3-none-manylinux2010_x86_64.whl

In [None]:
import numpy as np
import pandas as pd

import os, sys
import gc
import math
import random
import pathlib
from tqdm import tqdm
from typing import List, NoReturn, Union, Tuple, Optional, Text, Generic, Callable, Dict
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn import metrics
import operator
import xgboost as xgb
import lightgbm as lgb
import optuna
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in favor of tqdm.notebook

# treelite
import treelite
import treelite_runtime 

# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from matplotlib_venn import venn2
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('fivethirtyeight')
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

# Config
A few configuration settings.

In [None]:
SEED = 20201225 # Merry Christmas!
# INPUT_DIR = '../input/jane-street-market-prediction/'
INPUT_DIR = '../input/janestreet-save-as-feather/'
TRADING_THRESHOLD = 0.50 # 0 ~ 1: The smaller, the more aggressive
DATE_BEGIN = 0 # 0 ~ 499: set 0 for model training using the complete data 
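
To make the `TRADING_THRESHOLD` comment concrete, here is a minimal sketch, assuming the model outputs probabilities in [0, 1]. The `probs` values and the `to_action` helper are made up for illustration and are not part of the notebook's pipeline. A lower threshold turns more predictions into trades.

```python
import numpy as np

# Hypothetical predicted probabilities, for illustration only.
probs = np.array([0.35, 0.48, 0.52, 0.61, 0.90])

def to_action(probs, threshold):
    """Trade (1) when the predicted probability exceeds the threshold, else pass (0)."""
    return (probs > threshold).astype(int)

conservative = to_action(probs, 0.60)  # higher threshold: fewer trades
default = to_action(probs, 0.50)       # the value used in this notebook
aggressive = to_action(probs, 0.40)    # lower threshold: more trades

print(conservative.sum(), default.sum(), aggressive.sum())
```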

# Load data
I have already saved the training data in the feather format in [another notebook of mine](https://www.kaggle.com/code1110/janestreet-save-as-feather?scriptVersionId=47635784). Loading CSV takes time, but loading feather is really light :)

In [None]:
os.listdir(INPUT_DIR)

In [None]:
%%time

# load data blitz fast!
def load_data(input_dir=INPUT_DIR):
    train = pd.read_feather(pathlib.Path(input_dir + 'train.feather'))
    features = pd.read_feather(pathlib.Path(input_dir + 'features.feather'))
    example_test = pd.read_feather(pathlib.Path(input_dir + 'example_test.feather'))
    ss = pd.read_feather(pathlib.Path(input_dir + 'example_sample_submission.feather'))
    return train, features, example_test, ss

train, features, example_test, ss = load_data(INPUT_DIR)

In [None]:
# delete irrelevant files to save memory
del features, example_test, ss
gc.collect()

# Model fitting

In [None]:
# drop rows with weight = 0 to save memory
original_size = train.shape[0]
train = train.query('weight > 0').reset_index(drop=True)

# use data later than DATE_BEGIN
train = train.query(f'date >= {DATE_BEGIN}')

print('Train size reduced from {:,} to {:,}.'.format(original_size, train.shape[0]))

In [None]:
# target
train['action'] = train['resp'] * train['weight']
train['action'] = 1 * (train['action'] > 0)

In [None]:
# features to use
feats = [f for f in train.columns.values.tolist() if f.startswith('feature')]
print('There are {:,} features.'.format(len(feats)))

# Hyperparameter optimization
I hold out the last dates as validation data (a time-series split) and use them to tune the hyperparameters of my LGB model with Bayesian optimization.
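
As a minimal sketch of this kind of hold-out, assuming only an array of dates and a cut-off value (the `split_by_date` helper and the toy dates are hypothetical): every row dated before the pivot goes to training and every row on or after it goes to validation, so the model is never validated on days earlier than those it was trained on.

```python
import numpy as np

def split_by_date(dates, pivot):
    """Return (train_idx, val_idx): rows before `pivot` vs. rows on or after it."""
    dates = np.asarray(dates)
    train_idx = np.flatnonzero(dates < pivot)
    val_idx = np.flatnonzero(dates >= pivot)
    return train_idx, val_idx

# Toy example: six rows spread over dates 0..3, validating on the last two dates.
toy_dates = [0, 0, 1, 2, 3, 3]
train_idx, val_idx = split_by_date(toy_dates, pivot=2)
print(train_idx, val_idx)
```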

In [None]:
train['date'].unique()

In [None]:
# time-series-like split: train on dates before the pivot, validate on the rest
pivot = 460
x_train = train.query(f'date < {pivot}')[feats]
y_train = train.query(f'date < {pivot}')['action']
x_val = train.query(f'date >= {pivot}')[feats]
y_val = train.query(f'date >= {pivot}')['action']

In [None]:
# from https://www.kaggle.com/gogo827jz/jane-street-super-fast-utility-score-function/notebook
from numba import njit

@njit(fastmath = True)
def utility_score_numba(date, weight, resp, action):
    Pi = np.bincount(date, weight * resp * action)
    t = np.sum(Pi) / np.sqrt(np.sum(Pi ** 2)) * np.sqrt(250 / len(Pi))
    u = min(max(t, 0), 6) * np.sum(Pi)
    return u
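
For reference, this is the competition's utility score. With $p_i$ the sum of `weight * resp * action` over date $i$ and $|i|$ the number of dates, $t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \sqrt{\frac{250}{|i|}}$ and $u = \min(\max(t, 0), 6) \sum_i p_i$. The sketch below is a plain NumPy version with my own naming and made-up toy numbers, meant only as a sanity check. One difference is deliberate: `np.bincount` creates a bin for every integer from 0 up to the largest date, so when the dates do not start at 0 the length of its output includes empty leading bins. The sketch re-indexes the dates first so that only dates that actually occur are counted.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Plain NumPy utility score; counts only the dates that actually occur."""
    date = np.asarray(date)
    # Map dates to 0..n_dates-1 so bincount has no empty leading bins.
    _, date_idx = np.unique(date, return_inverse=True)
    returns = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    p = np.bincount(date_idx, weights=returns)
    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:  # no trades (or all daily returns are zero)
        return 0.0
    t = np.sum(p) / denom * np.sqrt(250 / len(p))
    return float(min(max(t, 0), 6) * np.sum(p))

# Toy data (made up): two dates, three trading opportunities.
date = [0, 0, 1]
weight = [1.0, 2.0, 1.0]
resp = [0.1, -0.05, 0.2]

u_selective = utility_score(date, weight, resp, action=[1, 0, 1])  # skip the losing trade
u_all = utility_score(date, weight, resp, action=[1, 1, 1])        # take every trade
print(u_selective, u_all)
```

In this toy case skipping the losing trade gives the higher score, which is the behavior the classifier is meant to learn.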

In [None]:
# Theoretical best score for this validation period
date = train.query(f'date >= {pivot}')['date'].values
weight = train.query(f'date >= {pivot}')['weight'].values
resp = train.query(f'date >= {pivot}')['resp'].values
action = 1 * (train.query(f'date >= {pivot}')['action'].values > TRADING_THRESHOLD)
score = utility_score_numba(date, weight, resp, action)
print(f"Utility Score = {score}")

In [None]:
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_val, y_val)

def objective(trial):    
    params = {
            'num_leaves': trial.suggest_int('num_leaves', 32, 1024),
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'binary_logloss',
            'max_depth': trial.suggest_int('max_depth', 4, 16),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 12),
            # suggest_float replaces the deprecated suggest_uniform / suggest_loguniform
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 8),
            'min_child_samples': trial.suggest_int('min_child_samples', 4, 80),
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-6, 1.0, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-6, 1.0, log=True),
            }

    model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], 
                      early_stopping_rounds=10, verbose_eval=1000)
    val_pred = model.predict(x_val)
    
    # score
    date = train.query(f'date >= {pivot}')['date'].values
    weight = train.query(f'date >= {pivot}')['weight'].values
    resp = train.query(f'date >= {pivot}')['resp'].values
    action = 1 * (val_pred > TRADING_THRESHOLD)
    score = utility_score_numba(date, weight, resp, action)
    print(f"Utility Score = {score}")
    return score

In [None]:
%%time

# Bayesian optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=40)
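
To show conceptually what the study does, here is a toy stand-in. It uses plain random search instead of Optuna's default TPE sampler, and a made-up objective instead of the LightGBM fit, so it is not how Optuna works internally. It only illustrates the loop (sample parameters, score them, keep the best) and how a log-uniform draw differs from a uniform one. All function names here are hypothetical.

```python
import math
import random

def sample_params(rng):
    """Draw one candidate, mirroring two of the search-space shapes used above."""
    return {
        # uniform on [0.4, 1.0]
        'feature_fraction': rng.uniform(0.4, 1.0),
        # log-uniform on [1e-6, 1.0]: uniform in log space, then exponentiate
        'lambda_l1': math.exp(rng.uniform(math.log(1e-6), math.log(1.0))),
    }

def toy_objective(params):
    """Made-up score that peaks at feature_fraction=0.7 and lambda_l1=1e-3."""
    return (-(params['feature_fraction'] - 0.7) ** 2
            - (math.log10(params['lambda_l1']) + 3) ** 2 / 100)

rng = random.Random(0)
best_value, best_params = -math.inf, None
for _ in range(40):  # same trial budget as n_trials=40
    params = sample_params(rng)
    value = toy_objective(params)
    if value > best_value:  # corresponds to direction='maximize'
        best_value, best_params = value, params

print(best_value, best_params)
```

A sampler such as TPE differs in that it uses the scores of earlier trials to decide where to sample next, rather than drawing each candidate independently.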

# Best set of hyperparameters

In [None]:
print('Number of finished trials: {}'.format(len(study.trials)))

print('Best trial:')
trial = study.best_trial

print('  Value: {}'.format(trial.value))

print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

In [None]:
# plot history
from optuna.visualization import plot_optimization_history
plot_optimization_history(study)

# Re-training LGB with the best params

In [None]:
%%time

print('Starting training...')
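lgb_train = lgb.Dataset(train[feats], train['action'])
# trial.params holds only the tuned values; add back the fixed settings from the
# objective, otherwise LightGBM falls back to its default (regression) objective
best_params = {'boosting_type': 'gbdt', 'objective': 'binary',
               'metric': 'binary_logloss', **trial.params}
model = lgb.train(best_params,
                lgb_train,
                num_boost_round=480,
                valid_sets=lgb_train,  # eval training data
                feature_name=feats,
                categorical_feature=[])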

print('Saving model...')
# save model to file
model.save_model('my_model.txt')

# Feature importance
Let's see feature importance given by the model.

In [None]:
lgb.plot_importance(model, importance_type="gain", figsize=(7, 40))

# Treelite
I believe Treelite is a must in this competition, to avoid submission errors caused by long inference times.

In [None]:
# load LGB with Treelite
model = treelite.Model.load('my_model.txt', model_format='lightgbm')

In [None]:
# generate shared library
toolchain = 'gcc'
model.export_lib(toolchain=toolchain, libpath='./mymodel.so',
                 params={'parallel_comp': 32}, verbose=True)

In [None]:
# predictor from treelite
predictor = treelite_runtime.Predictor('./mymodel.so', verbose=True)

# Submit
Let's use Treelite for faster inference.

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set
    
for (test_df, pred_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        # inference with treelite
        batch = treelite_runtime.Batch.from_npy2d(test_df[feats].values)
        pred_df.action = (predictor.predict(batch) > TRADING_THRESHOLD).astype('int')
    else:
        pred_df.action = 0
    env.predict(pred_df)

All done!