In [None]:
# !pip list --format=freeze > requirements.txt

# Optiver Realized Volatility Prediction

# Problem statement

Optiver, a proprietary trading firm and broker/dealer for various financial instruments, launched a competition on Kaggle.  
The goal is to predict the volatility of stocks.

Accurately predicting volatility is essential for options trading, because the price of an option is directly tied to the volatility of the underlying product (here, the stock).  
Options often act as leverage on the underlying share.  
In our case, high volatility in the stock will most likely produce an even larger variation in the associated option.

## Terminology

**Stock**: a share of a company  
Example: Apple
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/appl_stock.PNG?raw=true" width="800px">  
<br>

**Option**  
A derivative that establishes a contract between a buyer and a seller.  
The buyer of the option obtains the right, but not the obligation, to buy or sell an underlying asset at a price fixed in advance, either during a given period or on a set date.  
<br>

**Order book**  
An electronic list of buy and sell orders for a specific security or financial instrument, organized by price level.  
Intended buy orders are shown on the left-hand side of the book as "bid", while intended sell orders are shown on the right-hand side as "ask".  
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/OrderBook3.png?raw=true" width="200px">  
<br>

**Trade book** (record of executed trades)  
An order book represents trading intent on the market, but a trade only happens when a buyer and a seller meet at the same price.  
The trade book records every transaction that actually took place.

**bid/ask spread**  
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/spread.PNG?raw=true" width="350px">
<br><br>
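
To make the order book and spread definitions concrete, here is a minimal, dependency-free sketch. The prices and sizes are invented for illustration and do not come from the competition data.

```python
# Hypothetical two-level order book: price -> total size resting at that price
bids = {9.99: 100, 9.98: 250}
asks = {10.01: 300, 10.02: 150}

best_bid = max(bids)  # highest price a buyer is willing to pay
best_ask = min(asks)  # lowest price a seller is willing to accept

spread = best_ask - best_bid
mid_price = (best_ask + best_bid) / 2
# Spread relative to the mid price, the same normalization as the
# `price_spread` feature computed later in this notebook
relative_spread = spread / mid_price

print(best_bid, best_ask, spread, relative_spread)
```

A tight spread generally indicates a liquid stock, which is why spread-based features are natural candidates for a volatility model.
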
**WAP** (Weighted average price)  
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/wap.PNG?raw=true" width="450px">  
<br><br>

Example:
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/wap_bidask.PNG?raw=true" width="450px">  
<br>
In this competition we only have access to the first two levels of the order book.
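
As a rough illustration of the WAP formula pictured above, the sketch below computes the level-1 WAP for a single quote in plain Python. The function name `weighted_average_price` and the quote values are assumptions made for this example; the notebook's own `add_wap` helper applies the same formula to dataframe columns.

```python
def weighted_average_price(bid_price, bid_size, ask_price, ask_size):
    """Level-1 WAP: each price is weighted by the size on the opposite side."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)


# Hypothetical top-of-book quote: 100 shares bid at 9.99, 300 offered at 10.01
wap = weighted_average_price(bid_price=9.99, bid_size=100,
                             ask_price=10.01, ask_size=300)
print(wap)
```

Because the ask side is larger here, the bid price receives more weight and the WAP falls below the mid price of 10.00, reflecting selling pressure.
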

**Log return**  
A way to compare the price of a stock between two points in time.  
Writing S_t for the price of stock S at time t, we can define the log return between t1 and t2 as
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/logrtn.PNG?raw=true" width="200px">  
<br>

**Volatility**  
Computing the log returns over all consecutive book updates lets us define the realized volatility.  
It is the square root of the sum of squared log returns.
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/volatility.PNG?raw=true" width="200px">  
<br>
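
The two definitions above can be sketched in a few lines of plain Python. The price series is hypothetical, and the list-based helpers below are only an illustration; the notebook's own `log_return` and `realized_volatility` functions do the same computation with NumPy and pandas.

```python
import math


def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


def realized_volatility(returns):
    """Square root of the sum of squared log returns."""
    return math.sqrt(sum(r ** 2 for r in returns))


# Hypothetical WAP values sampled within one time bucket
waps = [1.0000, 1.0010, 0.9995, 1.0005]
rets = log_returns(waps)
vol = realized_volatility(rets)
print(rets, vol)
```

Note that a series of n prices yields n - 1 log returns, which is why the first row of each time bucket has no log return in the dataframes built later.
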

# Dataset description

The dataset consists of financial data, specifically order books and books of executed trades.  
These two "books" are two separate sets of files.

Each book is organized by stock, where a stock stands for a financial instrument.  
For each stock we have several time_id values.  
Each one refers to a 20-minute window of real market data. The windows are not chronologically consecutive.  
Some of these time_id values are public and make up the train set; the others are hidden and make up the test set.

In each 20-minute window we have access to the first 10 minutes of data and must predict the volatility of the following 10 minutes.  
The volatility of those last 10 minutes is provided for the train set and is our target.

<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/DataBucketing.webp?raw=true" width="300px">  
<br>
<br>
Example and explanation:
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/Data_chart.PNG?raw=true" width="800px"><br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/Data_explainations.PNG?raw=true" width="600px">

# Machine learning process

This is a supervised regression problem.

We first clean the data where needed and explore it.  
We then perform feature engineering on the book and trade datasets.

## Evaluation

Our predictions on the test set are scored with an imposed metric: the RMSPE (root mean square percentage error).  
It is a normalized root mean square error, since it is expressed as a percentage of the true value.

<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/RMSPE.PNG?raw=true" width="300px"><br>
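
As a small worked example of the metric, here is a dependency-free sketch of the RMSPE. The target and prediction values are invented; the notebook's own `rmspe` function computes the same quantity with NumPy.

```python
import math


def rmspe(y_true, y_pred):
    """Root mean square percentage error; y_true must not contain zeros."""
    squared_pct_errors = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# Hypothetical realized volatilities and predictions
y_true = [0.004, 0.002, 0.010]
y_pred = [0.005, 0.002, 0.008]
score = rmspe(y_true, y_pred)
print(score)
```

Because every error is divided by the true value, a miss of 0.001 on a target of 0.004 costs more than a miss of 0.002 on a target of 0.010, so low-volatility rows weigh heavily in the score.
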

Our data must be aggregated into a single dataframe with one row per stock and time_id before training and applying a model.  
This is the structure our prediction must follow:
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/sub_form.PNG?raw=true" width="250px"><br>

Here the row_id column is built as {stock}-{time_id}, and target is our predicted volatility.
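
The submission layout can be sketched as follows. The helper names `make_row_id` and `split_row_id` and the prediction values are hypothetical and only illustrate the `{stock}-{time_id}` convention.

```python
def make_row_id(stock_id, time_id):
    """Build the competition identifier '{stock}-{time_id}'."""
    return f"{stock_id}-{time_id}"


def split_row_id(row_id):
    """Recover (stock_id, time_id) from a row_id string."""
    stock_id, time_id = row_id.split("-")
    return int(stock_id), int(time_id)


# Hypothetical predictions keyed by (stock_id, time_id)
predictions = {(0, 4): 0.0021, (0, 32): 0.0013}
submission = [
    {"row_id": make_row_id(stock, time), "target": value}
    for (stock, time), value in predictions.items()
]
print(submission)
```

The same string key is what the notebook later uses to merge the aggregated book and trade features with the targets.
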

## References

These kernels were particularly helpful for my participation:

[https://www.kaggle.com/alexioslyon/lgbm-baseline](https://www.kaggle.com/alexioslyon/lgbm-baseline)  
[https://www.kaggle.com/munumbutt/feature-engineering-tuned-xgboost-lgbm](https://www.kaggle.com/munumbutt/feature-engineering-tuned-xgboost-lgbm)


# Libraries


In [None]:
%matplotlib inline
# generic libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
from joblib import Parallel, delayed
import pickle
import time
import plotly.graph_objects as go

# machine learning
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.pipeline import make_pipeline
import optuna
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# path and files treatment
import glob
import os

# Variables


In [None]:
# env could be 'local' or 'kaggle'
env = 'kaggle'

if env == 'local':
    data_folder = './data'
elif env == 'kaggle':
    data_folder = '../input/optiver-realized-volatility-prediction'
else:
    raise ValueError("env must be 'local' or 'kaggle'")

output = './output/'
save_path = './img/'
# exist_ok=True keeps this cell re-runnable once the folders exist
os.makedirs(output, exist_ok=True)
os.makedirs(save_path, exist_ok=True)

bk_train_fol = '/book_train.parquet/'
td_train_fol = '/trade_train.parquet/'
bk_test_fol = '/book_test.parquet/'
td_test_fol = '/trade_test.parquet/'

model_final = 'finalized_model.sav'

RANDOM_SEED = 42

# Remove non efficient (and slow) cells to be faster
fast = False

# if None take all the dataset
number_of_stocks = 5

# full option names: the short forms are ambiguous on recent pandas versions
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)

# Exploration


## Functions


In [None]:
def load_df(df_folder, nb_stock_to_load=0, data_folder=data_folder):
    '''load a parquet 
    
    arguments
    ---------------
    data_folder (str)
    df_folder (str)
    nb_stock_to_load (int)
        number of subfolders to load
    '''
    stock_list = os.listdir(data_folder + df_folder)

    if nb_stock_to_load == 0:
        nb_stock_to_load = len(stock_list)
    nb_stock_to_load = min(nb_stock_to_load, len(stock_list))
    
    if nb_stock_to_load == 1:
        df = pd.read_parquet(data_folder + df_folder + '/stock_id=0')
        df['stock_id'] = 0
    else:
        ## deprecated
        # subset_paths = []
        # for stock in stock_list[:nb_stock_to_load]:
        #     subset_path = glob.glob(data_folder + df_folder + stock + '/*')
        #     subset_paths.append(subset_path[0])

        subset_paths = [glob.glob(data_folder + df_folder + stock + '/*')[0] for stock in stock_list[:nb_stock_to_load]]
        ## doesn't work
        # subset_paths = glob.glob(data_folder + df_folder + '/*')[:nb_stock_to_load]
        
        df = pd.read_parquet(subset_paths)
        df['stock_id'] = df['stock_id'].astype(int)
    return df

In [None]:
###############################
# Functions to add features
###############################

def add_wap(df, number=1, column_prefix='wap', standard=True):
    '''adding one wap

    number (int): the position of the price to take it could be 1 or 2
    standard (bool): use standard method to calculate wap or use a custom method
    '''
    if standard:
        df[column_prefix + str(number)] = (
            df['bid_price'+ str(number)] * df['ask_size'+ str(number)] + df['ask_price'+ str(number)] * df['bid_size'+ str(number)]) / (
                df['ask_size'+ str(number)]+ df['bid_size'+ str(number)])
    else:
        df[column_prefix + str(number) + '_ns'] = (
            df['bid_price'+ str(number)] * df['bid_size'+ str(number)] + df['ask_price'+ str(number)] * df['ask_size'+ str(number)]) / (
                df['ask_size'+ str(number)]+ df['bid_size'+ str(number)])

def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()

def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

def add_waps(df):
        '''add many waps'''
        add_wap(df, 1, column_prefix='wap')
        add_wap(df, 2, column_prefix='wap')
        add_wap(df, 1, column_prefix='wap', standard=False)
        add_wap(df, 2, column_prefix='wap', standard=False)
        df['wap_p'] = ((
                df['wap1'] * (df['ask_size1'] + df['bid_size1']) +
                df['wap2'] * (df['ask_size2'] + df['bid_size2'])) /
                (df['ask_size1'] + df['bid_size1'] + df['ask_size2'] + df['bid_size2']))
        df['wap_balance'] = abs(df['wap1'] - df['wap2'])

def add_log_return(df, price_col, log_col_name, group='time_id'):
        # transform keeps the original row index, so the assignment aligns on every pandas version
        df[log_col_name] = df.groupby([group])[price_col].transform(log_return)

def add_spreads(df):
        # # tests with ponderates features
        # df['bid_spread_p'] = (df['bid_price1'] * df['bid_size1'] - df['bid_price2'] * df['bid_size1'])/(df['bid_size1'] + df['bid_size2'])
        # df['ask_spread_p'] = (df['ask_price1'] * df['ask_size1'] - df['ask_price2'] * df['ask_size1'])/(df['ask_size1'] + df['ask_size2'])
        # df["bid_ask_spread_p"] = abs(df['bid_spread_p'] - df['ask_spread_p'])
        df['bid_spread'] = df['bid_price1'] - df['bid_price2']
        df['ask_spread'] = df['ask_price1'] - df['ask_price2']
        df['price_spread'] = (df['ask_price1'] - df['bid_price1']) / ((df['ask_price1'] + df['bid_price1'])/2)
        df["bid_ask_spread1"] = (df['ask_price1'] - df['bid_price1'])/df['bid_price1']
        df["bid_ask_spread2"] = (df['ask_price2'] - df['bid_price2'])/df['bid_price2']
        df["bid_ask_spread_p"] = ((df['ask_price1'] + df['ask_price2']) - (df['bid_price1'] + df['bid_price2']))/(df['bid_price1'] + df['bid_price2'])

def add_volumes(df):
        df['total_volume'] = (df['ask_size1'] + df['ask_size2']) + (df['bid_size1'] + df['bid_size2'])
        df['volume_imbalance'] = abs((df['ask_size1'] + df['ask_size2']) - (df['bid_size1'] + df['bid_size2']))

def add_EMA(df, wap_col, nb_period):
        df[wap_col + '_' + str(nb_period) + 'sec_EWM'] = df[wap_col].ewm(span=nb_period, adjust=False).mean()

###############################
# Evaluation
###############################
def rmspe(y_true, y_pred):
    return  (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

## Book train


In [None]:
book_train = load_df(bk_train_fol, nb_stock_to_load=1)
book_train.head()

In [None]:
book_train.info()

### Sample


In [None]:
# Sample
book_train_sample = book_train[(book_train['stock_id'] == 0) & (book_train['time_id'] < 35)].copy()
add_wap(book_train_sample)
fig = px.line(book_train_sample, x="seconds_in_bucket", y="wap1", title='WAP of stock_id_0, time_id <35', color='time_id')
if env == 'local':
    fig.write_image(save_path + 'wap_sample.png')
fig.show()

In [None]:
book_train_sample['log_return'] = book_train_sample.groupby(['time_id'])['wap1'].transform(log_return)
book_train_sample = book_train_sample[~book_train_sample['log_return'].isnull()] # drop the NaN on the first row of each time_id (~ inverts the mask)

In [None]:
fig = px.line(book_train_sample, x="seconds_in_bucket", y="log_return", title='Log return of stock_id_0, time_id <35', color='time_id')
if env == 'local':
    fig.write_image(save_path + 'logreturn_sample.png')
fig.show()

In [None]:
# Realized volatility on our sample
realized_vol = book_train_sample.groupby(['time_id'])['log_return'].agg(realized_volatility)
print('Realized volatility for stock_id 0 :')
for i in realized_vol.index:
    print(f'- time_id {i} is {round(realized_vol.loc[i], 7)}')

## Trade train


In [None]:
# Check that the book and trade data cover the same stocks
# (os.listdir returns entries in arbitrary order, hence the sort)
sorted(os.listdir(data_folder + td_train_fol)) == sorted(os.listdir(data_folder + bk_train_fol))

In [None]:
trade_train = load_df(td_train_fol, nb_stock_to_load=2)

In [None]:
trade_train.head()

### Sample


In [None]:
trade_train_sample = trade_train[(trade_train.stock_id == 0) & (trade_train.time_id < 35)]

fig = px.line(trade_train_sample, x="seconds_in_bucket", y="price", title='Price of stock_id_0, time_id <35', color='time_id')
if env == 'local':
    fig.write_image(save_path + 'trade_prices_sample.png')
fig.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=trade_train_sample, x="price", y="size", hue="order_count")
if env == 'local' or env == 'kaggle':
    plt.savefig(save_path + 'trade_sample.png')
plt.show()

## Book/Trade test

These files are provided only to show the shape and the first values of the hidden 10-minute windows.


In [None]:
book_test = load_df(bk_test_fol)
book_test.head()

In [None]:
trade_test = load_df(td_test_fol)
trade_test.head()

## Targets / realized volatility


In [None]:
# this dataset is just a sample, it will be replaced by the real one at each submission.
vol_test = pd.read_csv(data_folder +'/test.csv')
vol_test

In [None]:
vol_train = pd.read_csv(data_folder +'/train.csv')
vol_train.head()

In [None]:
vol_train.shape

In [None]:
vol_stock0 = vol_train[vol_train['stock_id'] == 0]

sns.set_theme(style="ticks")
fig = plt.figure(figsize=(16, 6))
# fig.suptitle('Images after equalization preprocessing', fontsize=16)
# fig.tight_layout()

plt.subplot(1, 2, 1)
plt.title("Train realized volatility")
plt.hist(vol_stock0['target'], bins=50)

plt.subplot(1, 2, 2)
plt.title("Train realized volatility - log")
plt.hist(np.log(vol_stock0['target']), bins=50)

if env == 'local' or env == 'kaggle':
    plt.savefig(save_path + 'realized_volatility.png')

plt.show()

# Preprocessing & baseline

---


## Functions


In [None]:
###############################
# Lists of dataset paths
###############################

# Create a list of stocks paths books from the dataset
if number_of_stocks is None:
    list_order_book_file_train = glob.glob(data_folder + bk_train_fol + '*')
    list_order_trade_file_train = glob.glob(data_folder + td_train_fol + '*')
    stock_id_max = max([int(path.split('=')[1]) for path in list_order_trade_file_train]) # files on kaggle are random sorted
else:
    stock_id_max = number_of_stocks-1 # stocks start at 0
    # take only stocks <= stock_id_max
    list_order_book_file_train = [path for path in glob.glob(data_folder + bk_train_fol + '*') if int(path.split('=')[1]) <= stock_id_max]
    list_order_trade_file_train = [path for path in glob.glob(data_folder + td_train_fol + '*') if int(path.split('=')[1]) <= stock_id_max]


## Naive RMSPE

A well-known fact about volatility is that it tends to be autocorrelated. We can use this property to build a naive model that "predicts" the realized volatility of the next 10 minutes simply by reusing the realized volatility of the first 10 minutes.

Let's compute the realized volatility of the first half of each window on the train set.
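
To illustrate why autocorrelation makes such a naive forecast reasonable, the sketch below measures the lag-1 autocorrelation of a made-up volatility series. The values and the helper name `lag1_autocorrelation` are assumptions for this example only, not results from the competition data.

```python
def lag1_autocorrelation(values):
    """Pearson correlation between the series and itself shifted by one step."""
    x, y = values[:-1], values[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical realized volatilities over consecutive 10-minute periods:
# calm and turbulent periods tend to cluster
vols = [0.0020, 0.0022, 0.0021, 0.0050, 0.0055, 0.0052, 0.0023, 0.0020]
rho = lag1_autocorrelation(vols)

# Naive model: the forecast for the next period is the value just observed
naive_forecasts = vols[:-1]
print(rho)
```

A clearly positive `rho` means the last observed volatility carries information about the next one, which is the whole premise of the naive baseline.
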


In [None]:
# select all stocks books
list_order_book_file_train = glob.glob(data_folder + bk_train_fol + '/*')
list_order_book_file_train[:2] # sample

In [None]:
# specific for naive model
def realized_volatility_per_time_id(file_path, prediction_column_name):
    '''load datas of one stock_id then calculate WAP, log_return
    set a new DF and put inside realized_volatility per time_id
    add a column with competition form : {stock_id}-{time_id} called row_id

    file_path : path of subfolders with stock_id
        example : ./data/book_train/stock_id=0
    prediction_column_name : name of the realized_volatility column
    
    return row_id, prediction_name columns'''
    df_book_data = pd.read_parquet(file_path)
    add_wap(df_book_data)

    df_book_data['log_return'] = df_book_data.groupby(['time_id'])['wap1'].transform(log_return)
    df_book_data = df_book_data[~df_book_data['log_return'].isnull()] # drop the NaN on the first row of each time_id (~ inverts the mask)

    df_realized_vol_per_stock =  pd.DataFrame(df_book_data.groupby(['time_id'])['log_return'].agg(realized_volatility)).reset_index()
    df_realized_vol_per_stock = df_realized_vol_per_stock.rename(columns = {'log_return':prediction_column_name})
    
    stock_id = file_path.split('=')[1]
    df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    
    return df_realized_vol_per_stock[['row_id',prediction_column_name]]

In [None]:
def past_realized_volatility_per_stock(list_file,prediction_column_name):
    df_past_realized = pd.DataFrame()
    for file in list_file:
        df_past_realized = pd.concat([df_past_realized,
                                     realized_volatility_per_time_id(file,prediction_column_name)])
    return df_past_realized

# test on all 126 stocks 
# long ! 230 sec
if not(fast):
    df_past_realized_train = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
                                                            prediction_column_name='pred')
    df_past_realized_train.head()

In [None]:
if not(fast):
    df_naive = vol_train.copy()
    # Let's join the output dataframe with train.csv to see the performance of the naive prediction on training set.
    # naive prediction = predict same volatility in the next 10min window (auto realisation)
    df_naive['row_id'] = df_naive['stock_id'].astype(str) + '-' + df_naive['time_id'].astype(str)
    df_naive = df_naive[['row_id','target']]
    df_naive = df_naive.merge(df_past_realized_train[['row_id','pred']], on = ['row_id'], how = 'left')
    df_naive.head()

In [None]:
if not(fast):
    R2 = round(r2_score(y_true = df_naive['target'], y_pred = df_naive['pred']),3)
    RMSPE = round(rmspe(y_true = df_naive['target'], y_pred = df_naive['pred']),3)
    print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}')

## Order book train


**Process flow**  
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/p8_process_orders.png?raw=true" width="900px"><br>


In [None]:
# list of waps for applying log return, EMA and EMA log return
# this list is also used in the creation of aggregation dic
waps = ['wap1', 'wap2', 'wap1_ns', 'wap2_ns', 'wap_p']

def book_feature_eng_per_stock(file_path, waps=waps):
    ''' Load the data of one stock_id, then add features.
    NaN values created by these features are filled with the column median.
    
    file_path : path of the subfolder of one stock_id
        example : ./data/book_train/stock_id=0

    return the df '''
    book_train = pd.read_parquet(file_path)

    add_waps(book_train)
    
    for wap in waps:
        add_log_return(book_train, price_col=wap, log_col_name=wap + '_log_return')
        for period in  [20, 100]:
            add_EMA(book_train, wap, period)
            EMA_col_name = wap + '_' + str(period) + 'sec_EWM'
            add_log_return(book_train, price_col=EMA_col_name, log_col_name=EMA_col_name + '_log_return')
            book_train['diff_' + EMA_col_name] = abs(book_train[wap] - book_train[EMA_col_name])
    
    add_spreads(book_train)
    add_volumes(book_train)

    # book_train = book_train[~(book_train['wap1_log_return'].isnull() | book_train['log_return2'].isnull() | book_train['log_return_p'].isnull())] # at the end ?
    book_train = book_train.fillna(book_train.median())
    
    return book_train

In [None]:
# sample with stock 0
df_sample = book_feature_eng_per_stock(list_order_book_file_train[0])
df_sample.head()

In [None]:
# list of spreads to apply the aggregate functions
spreads = ['bid_ask_spread1', 'bid_ask_spread2', 'bid_ask_spread_p', 'bid_spread', 'ask_spread', 'price_spread']

#########################################
# Creation of order book aggregation dic
#########################################
bk_feature_dic = {}
for wap in waps:
    bk_feature_dic[wap + '_log_return'] = [realized_volatility]
    # bk_feature_dic[wap] = [np.std, pd.Series.mad]
    for period in [20, 100]:
        EMA_col_name = wap + '_' + str(period) + 'sec_EWM'
        # bk_feature_dic[EMA_col_name + '_log_return'] = [realized_volatility]
        bk_feature_dic['diff_' + EMA_col_name] = [np.sum, np.std]
for spread in spreads:
    bk_feature_dic[spread] = [np.sum, np.std]

bk_feature_dic['total_volume'] = [np.sum, np.mean]
bk_feature_dic['volume_imbalance'] = [np.std]
bk_feature_dic['wap_balance'] = [np.sum, np.mean]

bk_feature_dic

In [None]:
def book_agg_form_parallele(file):
    ''' Create a new df that aggregate data by time_id and apply the feature dic
    add :
    - a stock_id columns
    - a competition form column : {stock_id}-{time_id} called row_id
    return the new df
    '''
    stock_id = file.split('=')[1]
    df_agg_stock = book_feature_eng_per_stock(file)
    df_agg_stock = pd.DataFrame(df_agg_stock.groupby(['time_id']).agg(bk_feature_dic).reset_index())

    df_agg_stock.columns = ['_'.join(col).rstrip('_') for col in df_agg_stock.columns.values]
    df_agg_stock['row_id'] = df_agg_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    # df_agg_stock['stock_id'] = stock_id
    df_agg_stock.drop('time_id', axis=1, inplace=True)

    return df_agg_stock

In [None]:
def agg_df_and_concatenate_parallel(paths_list, func):
    ''' Create a concatenated df from the per-stock df preprocessed by func'''

    df_agg = Parallel(n_jobs=-1)(
        delayed(func)(file) 
        for file in paths_list)
    
    df_agg = pd.concat(df_agg, ignore_index = True)

    return df_agg

In [None]:
%%time
df_order_agg = agg_df_and_concatenate_parallel(list_order_book_file_train, book_agg_form_parallele)
df_order_agg.head()

---

## Trades book train


**Process flow**  
<br>
<img src="https://github.com/abugeia/P8_kaggle_competition/blob/master/img/p8_process_trades.png?raw=true" width="900px"><br>


In [None]:
###############################
# Functions to add features
###############################
def add_amount(df):
    df['amount'] = df['price'] * df['size']
def add_power(df):
    df['power'] = (df['price'] - df['price'].shift(1))/df['price']*df['size']

# Function to count unique elements of a series
def count_unique(series):
    return len(np.unique(series))

In [None]:
def trade_feature_eng_per_stock(file_path):
    ''' Load datas of one stock_id then adding features.
    Removing Nan rows of theses features
    
    file_path : path of subfolders with stock_id
        example : ./data/trade_train/stock_id=0

    return the df '''
    df = pd.read_parquet(file_path)
        
    add_log_return(df, price_col='price', log_col_name='td_log_return')

    add_amount(df)
    for period in  [20, 100]:
        add_EMA(df, 'amount', period)
        EMA_col_name = 'amount_' + str(period) + 'sec_EWM'
        add_log_return(df, price_col=EMA_col_name, log_col_name=EMA_col_name + '_log_return')
        
    df['diff_td'] = df.seconds_in_bucket.diff() # same Nan as log_return
    df['amount_p_order'] = df.amount / df.order_count
    add_power(df)

    df = df[~df['td_log_return'].isnull()]
    return df

In [None]:
# sample with stock 0
df_sample = trade_feature_eng_per_stock(list_order_trade_file_train[0])
df_sample.head()

In [None]:
#########################################
# Creation of trade book aggregation dic
#########################################
def mad(series):
    '''mean absolute deviation (pd.Series.mad was removed in pandas 2.0)'''
    return (series - series.mean()).abs().mean()

td_feature_dic = {}
for period in  [20, 100]:
    EMA_col_name = 'amount_' + str(period) + 'sec_EWM_log_return'
    td_feature_dic[EMA_col_name] = [realized_volatility]
td_feature_dic['td_log_return'] = [realized_volatility]
# td_feature_dic['seconds_in_bucket'] = [count_unique] # removed after feature importance analysis
td_feature_dic['diff_td'] = [np.std]
td_feature_dic['amount_p_order'] = [np.mean, np.sum]
td_feature_dic['price'] = [np.mean]
td_feature_dic['amount'] = [np.std, mad]
# td_feature_dic['size'] = [np.mean, np.sum] # removed after feature importance analysis
td_feature_dic['order_count'] = [np.mean, np.sum]

td_feature_dic

In [None]:
def trade_agg_form_parallele(file):
    ''' Create a new df that aggregate data by time_id
    add :
    - a stock_id columns
    - a competition form column : {stock_id}-{time_id} called row_id
    return the new df
    '''
    stock_id = file.split('=')[1]
    df_agg_stock = trade_feature_eng_per_stock(file)
    df_agg_stock = pd.DataFrame(df_agg_stock.groupby(['time_id']).agg(td_feature_dic)).reset_index()
    
    df_agg_stock.columns = ['_'.join(col).rstrip('_') for col in df_agg_stock.columns.values]
    df_agg_stock['row_id'] = df_agg_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    # df_agg_stock['stock_id'] = stock_id
    df_agg_stock.drop('time_id', axis=1, inplace=True)

    return df_agg_stock


In [None]:
%%time
df_trade_agg = agg_df_and_concatenate_parallel(list_order_trade_file_train, trade_agg_form_parallele)
df_trade_agg.head()

---

## Final DF train


In [None]:
def process_final_df(df_order_agg, df_trade_agg, df_target):
    '''select the targets of the chosen stocks
    merge target df to order and trades df
    return new df'''

    df = df_target[df_target.stock_id <= stock_id_max].copy()
    #  adding the same index in our books df to merge
    df['row_id'] = df['stock_id'].astype(str) + '-' + df['time_id'].astype(str)

    df = df.merge(df_order_agg, on = ['row_id'], how = 'left')
    df = df.merge(df_trade_agg, on = ['row_id'], how = 'left')

    return df    

In [None]:
df_train = process_final_df(df_order_agg, df_trade_agg, vol_train)
df_train.head()

In [None]:
###########################################
# Ploting realized volatility per stock
###########################################

# Build one sub-sample per stock
groupes = []
for s in df_train['stock_id'].unique():
    groupes.append(df_train[df_train['stock_id'] == s]['wap1_log_return_realized_volatility'])
 
# Matplotlib object-oriented API for the plot
fig, ax = plt.subplots(figsize=(30,8))

# Plot styling
medianprops = {'color':"black"}
meanprops = {'marker':'o', 'markeredgecolor':'black',
            'markerfacecolor':'firebrick'}

ax.boxplot(groupes,
           labels=df_train['stock_id'].unique(),
           showfliers=False,
           medianprops=medianprops, 
           vert=True,
           patch_artist=True,
           showmeans=True,
           meanprops=meanprops)

ax.set(title='Distribution of wap1_log_return_realized_volatility per stock',
      xlabel="Stock Id",
      ylabel='wap1_log_return_realized_volatility')

plt.show()

In [None]:
###############################
# Saving preprocessed train ds
###############################
# df_train.to_pickle(output + 'dataset_train.bz2', compression='bz2')

## Test dataset


In [None]:
# list of test books paths
list_order_book_file_test = glob.glob(data_folder + bk_test_fol + '*')
list_order_trade_file_test = glob.glob(data_folder + td_test_fol + '*')

# preprocess test dataset
df_order_test_agg = agg_df_and_concatenate_parallel(list_order_book_file_test, book_agg_form_parallele)
df_trade_test_agg = agg_df_and_concatenate_parallel(list_order_trade_file_test, trade_agg_form_parallele)

# Merging df
df_test = process_final_df(df_order_test_agg, df_trade_test_agg, vol_test)
df_test.head()

# Machine learning

---


### Functions and variables


In [None]:
# For optuna studies
n_trials = 10

kfolds = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

scorer_rmspe = make_scorer(rmspe,
    # greater_is_better=False
    )

In [None]:
## New idea for the dic structure
# dic_eval = dict.fromkeys(['names', 'models', 'rmspe_scores', 'r2_scores'])
# dic_eval

In [None]:
dic_eval = {}
def evaluate(name, model, dic, X_test, y_test):
    y_pred = model.predict(X_test)
    R2 = round(r2_score(y_test, y_pred), 6)
    RMSPE = round(rmspe(y_test, y_pred), 6)
    dic[name] = [model, RMSPE, R2]
    print(f'Performance of the {name} prediction: R2 score: {R2}, RMSPE: {RMSPE}')

def evaluateCV(name, model, dic, X_train, y_train, save=True):
    start_time = time.time()
    RMSPE =  round(cross_val_score(
        model, X_train, y_train, cv=kfolds, scoring=scorer_rmspe
    ).mean(), 6)
    # model.fit(X_train, y_train)
    if save:
        dic[name] = [model, RMSPE]
    print(f'RMSPE of the {name} prediction: {RMSPE} in {round(time.time() - start_time, 3)} sec.')
    if not(save):
        return RMSPE

### Dataset


In [None]:
###############################
# Loading preprocessed train ds
###############################
# df_train = pd.read_pickle(output + 'dataset_train.bz2')

In [None]:
df_train.head()

In [None]:
df_train.isnull().sum()

In [None]:
# numeric_only=True skips the string column row_id, which has no median
df_train.fillna(df_train.median(numeric_only=True), inplace=True)
df_train.isnull().sum().sum()

### Scalers


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer


qtn = QuantileTransformer(output_distribution='normal', random_state=42)
qtu = QuantileTransformer(output_distribution='uniform', random_state=42)
std = StandardScaler()
minmax = MinMaxScaler()

scalers = [qtn, qtu, std, minmax]

In [None]:
def scaler_selection(scalers):
    '''return the scaler giving the lowest cross-validated RMSPE with a default XGBoost'''
    # start from infinity so that a scaler is always selected, even if every RMSPE is above 1
    rmspe_min = np.inf
    selected_scaler = None

    # the data does not depend on the scaler: build it once
    X = df_train.drop(['row_id', 'target'], axis = 1)
    y = df_train['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.15, random_state=42, shuffle=True)

    # GPU histogram method on kaggle, CPU histogram method locally
    tree_method = 'gpu_hist' if env == 'kaggle' else 'hist'

    for scaler in scalers:
        model_xgb = make_pipeline(scaler,
            XGBRegressor(tree_method=tree_method, random_state=42, n_jobs= - 1))

        rmspe_model = evaluateCV('XGBOOST_'+ str(scaler), model_xgb, dic_eval, X_train, y_train, save=False)

        if rmspe_model < rmspe_min:
            rmspe_min = rmspe_model
            selected_scaler = scaler

    return selected_scaler, rmspe_min

selected_scaler, rmspe_scaler = scaler_selection(scalers)
print(f'the selected scaler is {selected_scaler}')

### Train values


In [None]:
X_train = df_train.drop(['row_id', 'target'], axis = 1)
# X_val = X.values
y_train = df_train['target']
# y_val = y.values

X_train.shape, y_train.shape

### Test values


In [None]:
X_test = df_test.drop(['row_id'], axis = 1)
df_pred = df_test[['row_id']]

In [None]:
## We no longer split here because the models are evaluated with cross-validation
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, shuffle=False)
# X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Ridge

---


In [None]:
def tune(objective, n_trials=n_trials):
    start_time = time.time()
    study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
    study.optimize(objective, n_trials=n_trials, gc_after_trial=True)

    params = study.best_params
    print("--- %s seconds ---" % (time.time() - start_time))
    return params

In [None]:
def ridge_objectiveCV(trial):

    _alpha = trial.suggest_float("alpha", 1e-8, 20, log=True)

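    # `normalize=True` below only works with scikit-learn < 1.2 (the argument was removed in 1.2);
    # on newer versions, scale the features in a pipeline instead, as in the commented lines.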
    # model_ridge = make_pipeline(selected_scaler,
    #     Ridge(alpha=_alpha, random_state=RANDOM_SEED))    
    model_ridge = Ridge(alpha=_alpha, random_state=RANDOM_SEED, normalize=True)

    score = cross_val_score(
        model_ridge, X_train, y_train, cv=kfolds, scoring=scorer_rmspe
    ).mean()
    return score
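
The objectives in this section are scored with RMSPE, the competition metric. The `rmspe` helper and `scorer_rmspe` are defined earlier in the notebook and are not reproduced here; the sketch below only assumes the usual definition, sqrt(mean(((y - y_hat) / y)^2)), and uses a hypothetical function name and made-up values.

```python
import numpy as np

def rmspe_sketch(y_true, y_pred):
    """Root mean squared percentage error: sqrt(mean(((y - y_hat) / y)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

# Made-up volatility values, for illustration only
y_true = np.array([0.004, 0.002, 0.010])
y_pred = np.array([0.005, 0.002, 0.008])

score = rmspe_sketch(y_true, y_pred)
print(score)
```

Because every error is divided by the true value, a miss of 0.001 on a target of 0.004 weighs more than a miss of 0.002 on a target of 0.010, which matters for targets as small as realized volatility.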

In [None]:
if not(fast):
    ridge_params = tune(ridge_objectiveCV)

In [None]:
if not(fast):
    ridge_opt = Ridge(**ridge_params, random_state=RANDOM_SEED, normalize=True)
    ## V17
    # ridge_opt.fit(X, y)
    # evaluate('Ridge', ridge_opt, dic_eval, X_test, y_test)

    evaluateCV('Ridge', ridge_opt, dic_eval, X_train, y_train)

## Lasso

---


In [None]:
def lasso_objective(trial):
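    # suggest_loguniform is deprecated in Optuna 3; suggest_float(..., log=True) is the equivalent
    _alpha = trial.suggest_float("alpha", 0.0001, 10, log=True)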
    lasso = Lasso(alpha=_alpha, random_state=RANDOM_SEED)

    score = cross_val_score(
        lasso, X_train, y_train, cv=kfolds, scoring=scorer_rmspe
    ).mean()
    return score

In [None]:
if not(fast):
    lasso_params = tune(lasso_objective, n_trials=5)  # slow, and the results are poor
    lasso_opt = Lasso(**lasso_params, random_state=RANDOM_SEED)

In [None]:
if not(fast):
    # lasso_opt.fit(X_train, y_train)
    # evaluate('Lasso', lasso_opt, dic_eval, X_test, y_test)

    evaluateCV('Lasso', lasso_opt, dic_eval, X_train, y_train)

## RandomForest


In [None]:
# def randomforest_objective(trial):
#     _n_estimators = trial.suggest_int("n_estimators", 50, 200)
#     _max_depth = trial.suggest_int("max_depth", 5, 20)
#     _min_samp_split = trial.suggest_int("min_samples_split", 2, 10)
#     _min_samples_leaf = trial.suggest_int("min_samples_leaf", 2, 10)

#     rf = RandomForestRegressor(
#         max_depth=_max_depth,
#         min_samples_split=_min_samp_split,
#         min_samples_leaf=_min_samples_leaf,
#         n_estimators=_n_estimators,
#         n_jobs=-1,
#         random_state=RANDOM_SEED,
#     )

#     rf.fit(X_train, y_train)

#     preds = rf.predict(X_test)
#     return rmspe(y_test, preds)

def randomforest_objectiveCV(trial):
    _n_estimators = trial.suggest_int("n_estimators", 50, 200)
    _max_depth = trial.suggest_int("max_depth", 5, 20)
    _min_samp_split = trial.suggest_int("min_samples_split", 2, 10)
    _min_samples_leaf = trial.suggest_int("min_samples_leaf", 2, 10)

    rf = RandomForestRegressor(
        max_depth=_max_depth,
        min_samples_split=_min_samp_split,
        min_samples_leaf=_min_samples_leaf,
        n_estimators=_n_estimators,
        n_jobs=-1,
        random_state=RANDOM_SEED,
    )
    score = cross_val_score(
        rf, X_train, y_train, cv=kfolds, scoring=scorer_rmspe
    ).mean()
    return score

In [None]:
if not(fast):
    randomforest_params = tune(randomforest_objectiveCV, n_trials=5)  # slow, with mediocre results
    rf_opt = RandomForestRegressor(n_jobs=-1, random_state=RANDOM_SEED, **randomforest_params)

In [None]:
if not(fast):
    ## V17
    # rf_opt.fit(X_train, y_train)
    # evaluate('RandomForrest', rf_opt, dic_eval, X_test, y_test)

    evaluateCV('RandomForrest', rf_opt, dic_eval, X_train, y_train)

## Basic XGB model

---


In [None]:
## Old function
# def plot_feature_importance(df_train, model):
#     feature_importances_df = pd.DataFrame({
#         'feature': df_train.columns,
#         'importance_score': model.feature_importances_
#     })
#     fig = plt.figure(figsize=(20, 5))
#     ax = sns.barplot(x = "feature", y = "importance_score", data = feature_importances_df)
#     ax.set(xlabel="Features", ylabel = "Importance Score")
#     # plt.xticks(ha='left', rotation=45)
#     fig.autofmt_xdate(bottom=0.2, rotation=30, ha='right')
#     plt.show()
#     # return feature_importances_df

def plot_feature_importance(df_train, model, name=None):

    feature_imp = pd.DataFrame(sorted(zip(model.feature_importances_,df_train.columns)), columns=['Value','Feature'])

    plt.figure(figsize=(20, 10))
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
    plt.title('Features importance')
    plt.tight_layout()
    if (env == 'local' or env == 'kaggle') and name is not None:
        plt.savefig(save_path + name +'_feat_imp.png')
    plt.show()

In [None]:
if env == 'kaggle':
    xgb = XGBRegressor(tree_method='gpu_hist', random_state=42, n_jobs= - 1)
else:
    xgb = XGBRegressor(tree_method='hist', random_state=42, n_jobs= - 1)

In [None]:
# %%time
# xgb.fit(X_train, y_train)
# evaluate('XGBOOST', xgb, dic_eval, X_test, y_test)

evaluateCV('XGBOOST', xgb, dic_eval, X_train, y_train)

In [None]:
if not(fast):
    xgb.fit(X_train, y_train)
    plot_feature_importance(X_train, xgb, 'xgb')

## Basic LGBMRegressor model


In [None]:
if env == 'kaggle':
    lgbm = LGBMRegressor(device='gpu', random_state=42)
else:
    lgbm = LGBMRegressor(device='cpu', random_state=42)


In [None]:
# %%time
# lgbm.fit(X_train, y_train)
# evaluate('LIGHTGBM', lgbm, dic_eval, X_test, y_test)

evaluateCV('LIGHTGBM', lgbm, dic_eval, X_train, y_train)

In [None]:
if not(fast):
    lgbm.fit(X_train, y_train)
    plot_feature_importance(X_train, lgbm, 'lgbm')

## Removing useless features


In [None]:
if not(fast):
    features_imp_lgbm_xgb = [x/sum(lgbm.feature_importances_) + y/sum(xgb.feature_importances_) for x, y in zip(lgbm.feature_importances_, xgb.feature_importances_)] 
    feature_imp = pd.DataFrame(sorted(zip(features_imp_lgbm_xgb,X_train.columns)), columns=['Value','Feature'])

    plt.figure(figsize=(20, 10))
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
    plt.title('Combined LightGBM + XGBoost feature importance (each normalized to sum to 1)')
    plt.tight_layout()
    plt.show()
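
The combination above divides each importance vector by its own sum before adding the two. A minimal sketch of why that matters, assuming the library defaults (split counts for LightGBM, normalized gain shares for XGBoost) and using made-up numbers and hypothetical feature names:

```python
import numpy as np

# Illustrative numbers only, not results from this notebook
lgbm_imp = np.array([120, 60, 15, 5])          # e.g. split counts
xgb_imp = np.array([0.50, 0.10, 0.30, 0.10])   # e.g. gain shares
features = ["f0", "f1", "f2", "f3"]            # hypothetical feature names

# Put both vectors on the same scale (each sums to 1), then add them
combined = lgbm_imp / lgbm_imp.sum() + xgb_imp / xgb_imp.sum()

# Rank features from most to least important
order = np.argsort(combined)[::-1]
for i in order:
    print(f"{features[i]}: {combined[i]:.3f}")
```

Without the normalization, the raw split counts would dominate the sum and the XGBoost scores would barely influence the ranking.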


## CatBoost


In [None]:
cbr = CatBoostRegressor(iterations=500, random_seed=42)
# Fit model
# cbr.fit(X_train, y_train)
# evaluate('catboost', cbr, dic_eval, X_test, y_test)

evaluateCV('catboost', cbr, dic_eval, X_train, y_train)

## Optuna Tuned XGBoost

Optuna will let us find our best hyperparameters.  
We then only need to train our model with these parameters to evaluate it.
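
The search-then-refit loop can be sketched without Optuna. In the sketch below, a plain random search stands in for Optuna's sampler, `toy_objective` is a hypothetical validation error, and `sample_log_uniform` mimics what a log-scaled suggestion does; none of these names come from the notebook.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def toy_objective(alpha):
    """Hypothetical validation error with its minimum at alpha = 0.1."""
    return (math.log10(alpha) + 1.0) ** 2

rng = random.Random(42)
best_alpha, best_score = None, float("inf")

# Each iteration plays the role of one trial: sample, score, keep the best
for _ in range(200):
    alpha = sample_log_uniform(rng, 1e-3, 1.0)
    score = toy_objective(alpha)
    if score < best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha ~ {best_alpha:.4f}, objective = {best_score:.6f}")
```

Sampling on a log scale gives each order of magnitude of a regularization strength the same chance of being tried, which is the reason the regularization parameters in this notebook are searched that way.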


In [None]:
# def objective_xgb(trial):

#     param = {'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
#             'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
#             'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
#             'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
#             'learning_rate': trial.suggest_categorical('learning_rate', [0.012,0.014,0.016,0.018, 0.02, 0.025]),
#             'n_estimators': trial.suggest_int('n_estimators', 500, 3000),
#             'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20]),
#             'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)}

#     if env == 'kaggle':
#         param['tree_method'] = 'gpu_hist'
#     else:
#         param['tree_method'] = 'hist'
    
#     # model = make_pipeline(selected_scaler, XGBRegressor(**param, random_state=42))
#     model = XGBRegressor(**param, random_state=42)
    
#     # pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "validation-rmse")
#     model.fit(X_train , y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=100, verbose=False)
    
#     preds = model.predict(X_test)
    
#     return rmspe(y_test, preds)


def objective_xgbCV(trial):

    param = {
            # 'lambda': trial.suggest_loguniform('lambda', 1e-3, 1),
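            # suggest_loguniform is deprecated in Optuna 3; suggest_float(..., log=True) is the equivalent
            'alpha': trial.suggest_float('alpha', 1e-3, 1, log=True),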
            'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9]),
            'subsample': trial.suggest_categorical('subsample', [0.5,0.6,0.7,0.8,1.0]),
            'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02]),
            'n_estimators': trial.suggest_int('n_estimators', 500, 3000),
            # 'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15]),
            # 'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
            }

    if env == 'kaggle':
        param['tree_method'] = 'gpu_hist'
    else:
        param['tree_method'] = 'hist'
    
    # model = make_pipeline(selected_scaler, XGBRegressor(**param, random_state=42))
    model = XGBRegressor(**param, random_state=42)

    score = cross_val_score(
        model, X_train, y_train, cv=kfolds, scoring=scorer_rmspe
    ).mean()
    return score

In [None]:
%%time
study_xgb = optuna.create_study(direction='minimize', pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
study_xgb.optimize(objective_xgbCV, n_trials=n_trials, gc_after_trial=True)

In [None]:
print('Number of finished trials:', len(study_xgb.trials))
print('Best trial:', study_xgb.best_trial.params)

In [None]:
if not(fast):
    # Inside an `if` block the returned figure is not auto-displayed, so show it explicitly
    optuna.visualization.plot_optimization_history(study_xgb).show()

In [None]:
if not(fast):
    # Inside an `if` block the returned figure is not auto-displayed, so show it explicitly
    optuna.visualization.plot_param_importances(study_xgb).show()

In [None]:
best_xgbparams = study_xgb.best_params
best_xgbparams

In [None]:
# best_xgbparams = {'lambda': 0.050695864818244944,
#  'alpha': 0.23319827340456734,
#  'colsample_bytree': 0.5,
#  'subsample': 0.8,
#  'learning_rate': 0.02,
#  'n_estimators': 1590,
#  'max_depth': 9,
#  'min_child_weight': 218}

if env == 'kaggle':  # lowercase, to match the `env` checks used everywhere else
    xgb_opt = XGBRegressor(**best_xgbparams, tree_method='gpu_hist')
else:
    xgb_opt = XGBRegressor(**best_xgbparams, tree_method='hist', n_jobs= - 1)

In [None]:
# %%time
# xgb_opt.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=100, verbose=False)

# evaluate('XGB_opt', xgb_opt, dic_eval, X_test, y_test)

evaluateCV('XGB_opt', xgb_opt, dic_eval, X_train, y_train)

## Optuna Tuned LGBM


In [None]:
# def objective_lgbm(trial):
#         param = {"device": "gpu",
#                 "metric": "rmse",
#                 "verbosity": -1,
#                 'learning_rate':trial.suggest_loguniform('learning_rate', 0.005, 0.5),
#                 "max_depth": trial.suggest_int("max_depth", 2, 500),
#                 "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
#                 "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
#                 "num_leaves": trial.suggest_int("num_leaves", 2, 256),
#                 "n_estimators": trial.suggest_int("n_estimators", 100, 4000),
#         #         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100000, 700000),
#                 "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
#                 "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
#                 "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
#                 "min_child_samples": trial.suggest_int("min_child_samples", 5, 100)}

#         if env == 'kaggle':
#                 param["device"] = "gpu"
#         else:
#                 param["device"] = "cpu"

#         pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "rmse")
#         model = LGBMRegressor(**param)

#         model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False, callbacks=[pruning_callback], early_stopping_rounds=100)

#         preds = model.predict(X_test)
#         return rmspe(y_test, preds)

def objective_lgbmCV(trial):
        param = {
                "metric": "rmse",
                "verbosity": -1,
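                # suggest_loguniform is deprecated in Optuna 3; suggest_float(..., log=True) is the equivalent
                'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),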
                "max_depth": trial.suggest_int("max_depth", 2, 500),
                # "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
                # "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
                "num_leaves": trial.suggest_int("num_leaves", 2, 256),
                "n_estimators": trial.suggest_int("n_estimators", 100, 4000),
        #         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100000, 700000),
                # "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
                "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
                "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
                "min_child_samples": trial.suggest_int("min_child_samples", 5, 100)}

        if env == 'kaggle':
                param["device"] = "gpu"
        else:
                param["device"] = "cpu"

        # pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "rmse")
        model = LGBMRegressor(**param, random_state=42)

        score = cross_val_score(
        # model, X_train, y_train, cv=kfolds, scoring=scorer_rmspe, fit_params={'callbacks': [pruning_callback]}
        # ).mean()
        model, X_train, y_train, cv=kfolds, scoring=scorer_rmspe).mean()
        return score

In [None]:
%%time
study_lgbm = optuna.create_study(direction='minimize', pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
study_lgbm.optimize(objective_lgbmCV, n_trials=n_trials, gc_after_trial=True)  # n_jobs=-1 made the computation slower

In [None]:
print('Number of finished trials:', len(study_lgbm.trials))
print('Best trial:', study_lgbm.best_trial.params)

In [None]:
if not(fast):
    # Inside an `if` block the returned figure is not auto-displayed, so show it explicitly
    optuna.visualization.plot_optimization_history(study_lgbm).show()

In [None]:
if not(fast):
    # Inside an `if` block the returned figure is not auto-displayed, so show it explicitly
    optuna.visualization.plot_param_importances(study_lgbm).show()


In [None]:
best_lgbmparams = study_lgbm.best_params
best_lgbmparams

In [None]:
# best_lgbmparams = {'learning_rate': 0.012206112226610026,
#     'max_depth': 176,
#     'lambda_l1': 0.0911256640760148,
#     'lambda_l2': 7.619751773104654e-07,
#     'num_leaves': 87,
#     'n_estimators': 2713,
#     'feature_fraction': 0.6744552501464487,
#     'bagging_fraction': 0.7249343934370382,
#     'bagging_freq': 7,
#     'min_child_samples': 53}

if env == 'kaggle':  # lowercase, to match the `env` checks used everywhere else
    lgbm_opt = LGBMRegressor(**best_lgbmparams, device='gpu')
else:
    lgbm_opt = LGBMRegressor(**best_lgbmparams, device='cpu')


In [None]:
# %%time
# lgbm_opt.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False, early_stopping_rounds=100)

# evaluate('LIGHTGBM_opt', lgbm_opt, dic_eval, X_test, y_test)

evaluateCV('LIGHTGBM_opt', lgbm_opt, dic_eval, X_train, y_train)


## Stacking Regressor

Stack of estimators with a final regressor.

Stacked generalization feeds the outputs of the individual estimators into a final regressor, which computes the final prediction. This lets the final estimator draw on the strengths of each individual estimator by using their outputs as its inputs.
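
The mechanism can be sketched in plain NumPy. This is a simplified illustration, not the scikit-learn implementation: the toy data, the two base learners (`fit_linear`, `fit_knn`) and the least-squares final regressor are all assumptions chosen to keep the example short. With `final_estimator=None`, scikit-learn's `StackingRegressor` uses `RidgeCV` as the final estimator instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=n)

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def fit_knn(X, y, k=5):
    """Mean of the k nearest training targets; returns a predict function."""
    def predict(Z):
        d = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

base_fitters = [fit_linear, fit_knn]

# 1) Out-of-fold predictions of each base model become the meta-features
folds = np.array_split(rng.permutation(n), 5)
oof = np.zeros((n, len(base_fitters)))
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), val_idx)
    for j, fit in enumerate(base_fitters):
        oof[val_idx, j] = fit(X[train_idx], y[train_idx])(X[val_idx])

# 2) The final regressor is trained on those meta-features
final_predict = fit_linear(oof, y)

# 3) For new data: base models refit on all rows, then the final regressor
base_models = [fit(X, y) for fit in base_fitters]

def stack_predict(Z):
    meta = np.column_stack([m(Z) for m in base_models])
    return final_predict(meta)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

X_new = rng.normal(size=(200, 2))
y_new = 2.0 * X_new[:, 0] + np.sin(3.0 * X_new[:, 1])
for name, m in zip(["linear", "knn"], base_models):
    print(name, rmse(y_new, m(X_new)))
print("stack", rmse(y_new, stack_predict(X_new)))
```

Training the final regressor on out-of-fold predictions, rather than on predictions for rows the base models have already seen, keeps it from simply trusting whichever base model overfits the most.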


In [None]:
if not(fast):
    if env == 'kaggle':
        tree_method='gpu_hist'
        device='gpu'
        n_jobs=None
    else:
        tree_method='hist'
        device='cpu'
        n_jobs=-1

    xgb = XGBRegressor(tree_method=tree_method, random_state = RANDOM_SEED)
    lgbm = LGBMRegressor(device=device, random_state=RANDOM_SEED)

    estimators = [('lgbm_opt', lgbm_opt),
                ('xgb_opt', xgb_opt),
                ('lgbm', lgbm),
                ('xgb', xgb)]

    stack_reg = StackingRegressor(estimators=estimators, final_estimator=None, verbose=1, n_jobs=n_jobs)

In [None]:
if not(fast):
    evaluateCV('Stack_reg', stack_reg, dic_eval, X_train, y_train)

## Scaler on best model

In [None]:
def model_selection(dic):
    """Return the (model, name) pair with the lowest RMSPE in the evaluation dict."""
    # Each value holds the model at index 0 and its RMSPE at index 1
    name = min(dic, key=lambda key: dic[key][1])
    return dic[name][0], name

In [None]:
dic_eval['XGBOOST'][1]

In [None]:
if rmspe_scaler < dic_eval['XGBOOST'][1]: #if scaler's perf is better than no scaler
    model_to_scale, model_name = model_selection(dic_eval)
    model = make_pipeline(selected_scaler, model_to_scale)
    evaluateCV(model_name + '_scaled', model, dic_eval, X_train, y_train)

## Score visualization


In [None]:
models = [k for k in dic_eval.keys()]
rmspe_scores = [val[1] for val in dic_eval.values()]

rmspe_scores, models = (list(t) for t in zip(*sorted(zip(rmspe_scores, models))))
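
The one-liner above sorts the two lists together so that each score stays attached to its model. A small sketch with hypothetical scores shows the effect:

```python
# Hypothetical scores, for illustration only
rmspe_scores = [0.31, 0.24, 0.27]
models = ["Ridge", "LIGHTGBM", "XGBOOST"]

# zip pairs each score with its model, sorted orders the pairs by score,
# and zip(*...) splits the sorted pairs back into two sequences
rmspe_scores, models = (list(t) for t in zip(*sorted(zip(rmspe_scores, models))))

print(rmspe_scores)
print(models)
```

The best (lowest) score ends up first, so the bar plot lists the models from best to worst.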

In [None]:
plt.figure(figsize=(16,6))

sns.barplot(x=rmspe_scores, y=models)
plt.title('Model comparison')
plt.tight_layout()

if (env == 'local' or env == 'kaggle'):
    plt.savefig(save_path + 'models_comparaison.png')
plt.show()


In [None]:
model_final, model_name = model_selection(dic_eval)

In [None]:
model_final.fit(X_train, y_train)

In [None]:
###############################
# Save the best model
###############################

filename = 'model_' + model_name + '.sav'
# Use a context manager so the file is closed even if pickling fails
with open(output + filename, 'wb') as f:
    pickle.dump(model_final, f)
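
Saving and reloading can be checked with a quick round trip. In the sketch below a dictionary stands in for the fitted model and a temporary directory stands in for `output`; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable object behaves the same way
model_stub = {"name": "demo", "coef": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.sav")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    # Read it back and compare with the original
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)
```

A pickled estimator should be reloaded with the same library versions that produced it, and only from files you trust, since unpickling can execute arbitrary code.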

# Submission


In [None]:
###############################
# Load a model previously saved
###############################

# model_final = pickle.load(open(output + filename, 'rb'))

In [None]:
###############################
# adding prediction to df & export
###############################

df_pred = df_pred.assign(target = model_final.predict(X_test))
df_pred.to_csv('submission.csv', index=False)

In [None]:
pd.read_csv('submission.csv')