# About this notebook

In this notebook I set up a stacking pipeline with ***XGBoost*** and ***LightGBM*** as base estimators and either a default Logistic Regression or a default Decision Tree as the final estimator (the optimization chooses whichever final estimator stacks better).

**Both LightGBM and XGBoost use the GPU.**


I use **Optuna** to tune the base classifiers and the **PurgedGroupTimeSeriesSplitStacking** class to generate train, validation and test indices, so that the ensemble itself is also evaluated on held-out data.

The goal is to show how to stack two base models while using the PurgedGroupTimeSeriesSplitStacking class to avoid data leakage.


### Props to:
    
- [PurgedGroupTimeSeriesSplit](https://www.kaggle.com/marketneutral/purged-time-series-cv-xgboost-optuna): the cross-validation strategy and the memory usage reduction were taken from there;
- [PurgedGroupTimeSeriesSplitStacking](https://www.kaggle.com/tomwarrens/purgedgrouptimeseriessplit-stacking-ensemble-mode): my extension of the PurgedGroupTimeSeriesSplit class that allows stacking.


##### Pipeline: 

- Imports
- Data Loading
- PurgedGroupTimeSeriesSplitStacking class definition
- Optuna Parameters optimization
- Refit with Best Params
- Submission

### Edit: 

I trained LGBMClassifier on Colab just to have a performance benchmark. Also, in the objective function, the proper way of stacking would be to refit the base estimators (LightGBM and XGBoost) on the validation data as well (or on the validation data only) before predicting on the test fold. Otherwise the time gap between the training and test data becomes too large and the base estimators lose performance.

Please leave feedback in the comments section and upvote the kernel if you find it useful.

### Imports

In [None]:
import numpy as np
import datatable as dt
import pandas as pd
import random
import re
random.seed(28)
import tqdm
import os
import gc
import logging
import optuna

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [20, 12]  # width, height

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

input_path = '/kaggle/input/'
root_path = os.path.join(input_path, 'jane-street-market-prediction')

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype.name

        if col_type not in ['object', 'category', 'datetime64[ns, UTC]']:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

### Data Loading

In [None]:
%%time

train = (dt.fread(os.path.join(root_path, "train.csv")).to_pandas()
        .query('weight > 0').pipe(reduce_mem_usage)
        .reset_index(drop = True))

train['action'] = (train.resp > 0).astype(int)

resp_cols = [i for i in train.columns if 'resp' in i]

features_names = [i for i in train.columns if 'feature_' in i]
features_index = list(map(lambda x: int(re.sub("feature_", "", x)), features_names))
features_tuples = sorted(list(zip(features_names, features_index)), key = lambda x: x[1])
just_features = [i[0] for i in features_tuples]

# PurgedGroupTimeSeriesSplitStacking Class

Let's see how the data is split into train, validation and test folds with the PurgedGroupTimeSeriesSplitStacking class.
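
The class definition itself is not reproduced in this section, so here is a minimal sketch of the idea only. It assumes an expanding training window, fixed-size validation and test blocks, and a gap of purged groups between consecutive blocks. The function `purged_group_splits` and its parameters are hypothetical names chosen for illustration; this is not the PurgedGroupTimeSeriesSplitStacking implementation.

```python
import numpy as np


def purged_group_splits(groups, n_splits=2, group_gap=1, val_groups=2, test_groups=2):
    """Yield (train_idx, val_idx, test_idx) arrays of row positions.

    Hypothetical simplification for illustration; this is not the
    PurgedGroupTimeSeriesSplitStacking implementation.
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)  # sorted group labels, e.g. trading dates
    n_unique = len(unique)
    needed = n_splits * test_groups + val_groups + 2 * group_gap + 1
    if n_unique < needed:
        raise ValueError(f"need at least {needed} groups, got {n_unique}")

    for k in range(n_splits):
        # Work backwards from the end of the test block of fold k.
        test_end = n_unique - (n_splits - 1 - k) * test_groups
        test_start = test_end - test_groups
        val_end = test_start - group_gap    # purge groups between val and test
        val_start = val_end - val_groups
        train_end = val_start - group_gap   # purge groups between train and val
        yield tuple(
            np.flatnonzero(np.isin(groups, unique[start:stop]))
            for start, stop in ((0, train_end), (val_start, val_end), (test_start, test_end))
        )


# Toy data: 12 "dates" with 3 rows each.
dates = np.repeat(np.arange(12), 3)
splits = list(purged_group_splits(dates, n_splits=2))
for train_idx, val_idx, test_idx in splits:
    print(np.unique(dates[train_idx]), np.unique(dates[val_idx]), np.unique(dates[test_idx]))
```

In each fold every training date precedes every validation date, which in turn precedes every test date, and the purged dates in between belong to no block. The base models would be fitted on the train block, the final estimator on the validation block, and the whole ensemble scored on the test block.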

# Optuna Optimization

See the [Optuna tutorial](https://optuna.readthedocs.io/en/stable/tutorial/) for reference.

Here I use `cv_dummy` only to demonstrate how the notebook works. Please edit it as you like.

In [None]:
N_TRIALS=20

# Best Params

You can now refit the stacking classifier with the best hyperparameters found through Optuna.
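
The refit code is not included in this section. As a hedged sketch of the flow described in the Edit note at the top (fit the base models on the train block, fit the final estimator on their validation predictions, then refit the base models on train plus validation before scoring on the test block), the example below uses synthetic data and scikit-learn stand-ins. `GradientBoostingClassifier` and `RandomForestClassifier` replace XGBoost and LightGBM here, and the `best_params` dictionary is hypothetical rather than the output of the actual study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assumption: synthetic, time-ordered data instead of the competition data.
rng = np.random.default_rng(28)
X = rng.normal(size=(900, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=900) > 0).astype(int)
train, val, test = slice(0, 500), slice(500, 700), slice(700, 900)
train_val = slice(0, 700)

# Hypothetical best hyperparameters, e.g. as returned by study.best_params.
best_params = {
    "gb": {"n_estimators": 50, "max_depth": 2},
    "rf": {"n_estimators": 50, "max_depth": 4},
    "final": {"C": 1.0},
}


def make_base_models():
    # Stand-ins for XGBClassifier and LGBMClassifier.
    return [
        GradientBoostingClassifier(random_state=28, **best_params["gb"]),
        RandomForestClassifier(random_state=28, **best_params["rf"]),
    ]


def base_probas(models, X_part):
    # One column of positive-class probabilities per base model.
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in models])


# 1. Fit the base models on the train block.
base = [m.fit(X[train], y[train]) for m in make_base_models()]
# 2. Fit the final estimator on out-of-sample validation predictions.
final = LogisticRegression(**best_params["final"]).fit(base_probas(base, X[val]), y[val])
# 3. Refit the base models on train + validation to shrink the gap to the test block.
base = [m.fit(X[train_val], y[train_val]) for m in make_base_models()]
# 4. Score the stacked ensemble on the untouched test block.
test_pred = final.predict_proba(base_probas(base, X[test]))[:, 1]
test_auc = roc_auc_score(y[test], test_pred)
print(f"test AUC: {test_auc:.3f}")
```

The final estimator only ever sees base-model predictions on rows the base models were not trained on, which is the property that keeps the stack free of leakage.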

# LightGBM-only submission after optimizing parameters in Colab

In [None]:
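import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

params = {
    'objective': 'binary',
    'metrics': ['auc'],
}

nfolds = 3

# Filter once and reset the index so that the positional indices returned by
# kfold.split() line up with the row labels used by .loc below.
recent = train.query('date > 150').reset_index(drop=True)

kfold = StratifiedKFold(n_splits=nfolds)
lgb_models = list()
for k, (train_idx, valid_idx) in enumerate(kfold.split(recent[just_features],
                                                       recent['action'])):

    lgb_train = lgb.Dataset(recent.loc[train_idx, just_features],
                            recent.loc[train_idx, 'action'])
    lgb_valid = lgb.Dataset(recent.loc[valid_idx, just_features],
                            recent.loc[valid_idx, 'action'])

    # Note: verbose_eval and early_stopping_rounds are accepted by lgb.train in
    # LightGBM < 4.0; newer releases expect the lgb.log_evaluation and
    # lgb.early_stopping callbacks instead.
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = [lgb_train, lgb_valid],
        num_boost_round = 10000,
        verbose_eval = 50,
        early_stopping_rounds = 10,
    )

    lgb_models.append(model)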

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() 
rcount=0
for (test_df, sample_prediction_df) in iter_test:
    
    #test_df.fillna(train_mean,inplace=True)
    
    prediction = 0
    for model in lgb_models:
        prediction += model.predict(test_df[just_features])[0]
    
    prediction /= len(lgb_models)
    prediction = prediction > 0.5
    sample_prediction_df.action = prediction.astype(int)
    env.predict(sample_prediction_df)
    rcount+=1
    if rcount % 1000 == 0:
        print('Processed: {} rows\n'.format(rcount))
        
print(f'Finished processing {rcount} rows.')