# M5-Accuracy Challenge: Ensemble Learning

## Team: MLiPioneers

### Starter Code Credits: [M5 - Three shades of Dark: Darker magic](https://www.kaggle.com/kyakovlev/m5-three-shades-of-dark-darker-magic)

### Additional Input Data:
1. [Simple FE](https://www.kaggle.com/kyakovlev/m5-simple-fe): Simple feature extraction. Builds a grid dataframe in three parts: product release dates (p1); product prices with multiple aggregations, momentum, and normalization (p2); and calendar dates (p3).

2. [Lags Features](https://www.kaggle.com/kyakovlev/m5-lags-features): Lag feature extraction using the pandas function `shift()` and rolling-window lags, applied to the Simple FE output dataframes.

3. [Custom Features](https://www.kaggle.com/kyakovlev/m5-custom-features): Covers feature-creation approaches, sequential feature validation, dimension reduction, feature validation by permutation importance, mean encodings, and parallelized feature engineering. This notebook uses the mean encodings as a feature set.

4. [AUX models](https://www.kaggle.com/kyakovlev/m5-aux-models): Includes pretrained models and preprocessed test sets.
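
To make the lag features from item 2 concrete, here is a minimal sketch of the `shift()` and rolling pattern on a toy table. The data and the column names `sales_lag_2` and `rolling_mean_2_3` are made up for illustration; the real kernels use much longer shifts and windows.

```python
import pandas as pd

# Toy sales table: two items, six days each (values are made up).
toy = pd.DataFrame({
    "id": ["A"] * 6 + ["B"] * 6,
    "d": list(range(1, 7)) * 2,
    "sales": [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60],
})

# Plain lag: the value observed 2 days earlier, computed within each item.
toy["sales_lag_2"] = toy.groupby("id")["sales"].transform(lambda x: x.shift(2))

# Rolling lag: shift first, then average over a 3-day window.
toy["rolling_mean_2_3"] = toy.groupby("id")["sales"].transform(
    lambda x: x.shift(2).rolling(3).mean()
)

print(toy)
```

Grouping by `id` before shifting keeps one item's history from leaking into another's, which is why the first rows of each item come out as NaN.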

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import libraries
import os, sys, gc, time, warnings, pickle, psutil, random

# custom imports
from multiprocessing import Pool        # Multiprocess Runs

warnings.filterwarnings('ignore')

In [None]:
# import model frameworks
# import lightgbm as lgb
import xgboost as xgb

## Light Gradient Boosting

### Parameter Explanation: 

#### 'boosting_type': 'gbdt'
We have the 'goss' option for faster training, but it normally leads to underfitting. There is also a good 'dart' mode, but it takes forever to train and model performance depends a lot on random factors.
https://www.kaggle.com/c/home-credit-default-risk/discussion/60921

#### 'objective': 'tweedie'
Tweedie Gradient Boosting for Extremely Unbalanced Zero-inflated Data (https://arxiv.org/pdf/1811.10192.pdf) and many more articles about Tweedie. Strangely (for me), Tweedie is close in results to my own ugly loss. My advice here: make your OWN LOSS function (see https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/140564 and https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/143070). I think many of you are already using it since the Poisson kernel appeared (Kagglers are very good at "params" testing and tuning). Try to figure out why Tweedie works. It will probably show you new feature options or data transformations (target transformation?).
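
Whatever loss you write yourself, a gradient-boosting custom objective has to hand back the first and second derivatives of the loss with respect to the raw score. As a minimal sketch (this is not the loss from the linked discussions), here is the Tweedie loss with a log link in plain NumPy, checked against a finite-difference gradient. The function names `tweedie_loss` and `tweedie_grad_hess` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def tweedie_loss(score, y, p=1.1):
    """Tweedie negative log-likelihood (up to terms independent of the
    prediction), where the predicted mean is exp(score). Valid for 1 < p < 2."""
    return -y * np.exp((1 - p) * score) / (1 - p) + np.exp((2 - p) * score) / (2 - p)

def tweedie_grad_hess(score, y, p=1.1):
    """First and second derivatives of tweedie_loss with respect to score."""
    grad = -y * np.exp((1 - p) * score) + np.exp((2 - p) * score)
    hess = -y * (1 - p) * np.exp((1 - p) * score) + (2 - p) * np.exp((2 - p) * score)
    return grad, hess

score = np.array([-1.0, 0.0, 0.5])   # raw model outputs (toy values)
y = np.array([0.0, 1.0, 3.0])        # observed sales (toy values)

grad, hess = tweedie_grad_hess(score, y)

# Sanity check: compare the analytic gradient with central differences.
eps = 1e-5
numeric_grad = (tweedie_loss(score + eps, y) - tweedie_loss(score - eps, y)) / (2 * eps)
print(np.allclose(grad, numeric_grad, atol=1e-6))
```

The same finite-difference check is a cheap way to catch sign errors in any hand-written loss before spending hours on training.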

#### 'tweedie_variance_power': 1.1
Default = 1.5. Set this closer to 2 to shift towards a Gamma distribution, or closer to 1 to shift towards a Poisson distribution. My CV shows 1.1 is optimal, but you can make your own choice.
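
One way to read this parameter: for a Tweedie distribution the variance grows as a power of the mean, Var(Y) = phi * mu^p, so p = 1 gives Poisson-like variance and p = 2 gives Gamma-like variance. The sketch below just evaluates that relationship; the helper name `tweedie_variance` and the dispersion `phi = 1` are assumptions for illustration.

```python
import numpy as np

def tweedie_variance(mu, p, phi=1.0):
    """Tweedie variance function: Var(Y) = phi * mu**p."""
    return phi * np.asarray(mu, dtype=float) ** p

mu = 4.0  # toy mean demand
powers = [1.0, 1.1, 1.5, 2.0]
variances = [float(tweedie_variance(mu, p)) for p in powers]

for p, v in zip(powers, variances):
    print(f"p={p}: variance={v:.3f}")
```

With a mean above 1, a power of 1.1 assumes noise only slightly heavier than Poisson, which is one plausible reading of why it suits sparse count-like sales.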

#### 'metric': 'rmse'
It doesn't mean anything to us, as the competition metric is different and we don't use early stopping here, so RMSE serves only as a general overview of model performance. Also, we use a "fake" validation set (it is part of the training set), so even the general RMSE score doesn't mean much: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133834
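
To see why the validation set is "fake", here is a small sketch of the masks using this notebook's `END_TRAIN` and `P_HORIZON` values. The `days` series is a made-up stand-in for the `d` column of the grid.

```python
import pandas as pd

END_TRAIN = 1913   # last training day (same value as in this notebook)
P_HORIZON = 28     # prediction horizon (same value as in this notebook)

# Hypothetical day index covering some history plus the forecast period.
days = pd.Series(range(1800, END_TRAIN + P_HORIZON + 1), name="d")

train_mask = days <= END_TRAIN
valid_mask = train_mask & (days > END_TRAIN - P_HORIZON)

print("train days:", int(train_mask.sum()))
print("valid days:", int(valid_mask.sum()))
print("valid days outside train:", int((valid_mask & ~train_mask).sum()))
```

Every validation day is also a training day, so the reported RMSE is an in-sample figure and cannot be used to pick a stopping point.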

#### 'subsample': 0.5
Serves to fight overfitting: it randomly selects part of the data without resampling. Chosen by CV (my CV can be wrong!). The next kernel will be about CV.

#### 'subsample_freq': 1
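Frequency for bagging: 1 means the data is re-sampled at every iteration. Note that LightGBM's own default is 0 (bagging disabled), so `subsample` only takes effect when this is set to a positive value. The value 1 seems OK.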
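
As a rough illustration of how `subsample` and `subsample_freq` interact, the sketch below redraws a bag of rows without replacement every `subsample_freq` trees. This only mimics the idea; it is not LightGBM's internal sampler, and the row count and loop are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000          # toy dataset size
subsample = 0.5        # fraction of rows used per bag
subsample_freq = 1     # redraw the bag every this many trees
n_trees = 3

bag = np.arange(n_rows)
bags = []
for tree in range(n_trees):
    # Redraw the bag on schedule; sampling is without replacement.
    if tree % subsample_freq == 0:
        bag = rng.choice(n_rows, size=int(n_rows * subsample), replace=False)
    bags.append(bag)
    print(tree, len(bag), len(np.unique(bag)))
```

Each tree sees a different half of the rows, so no single tree can memorize the full training set.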

#### 'learning_rate': 0.03
Chosen by CV. Smaller: longer training, with a risk of stopping in a "local minimum". Bigger: faster training, with a chance of not finding the "global minimum".
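
A back-of-the-envelope view of the training-time side of this trade-off: under the idealized assumption that every tree fits the current residual exactly, each round removes a fraction `learning_rate` of what is left, so the residual after n rounds is (1 - learning_rate)^n. The helper name `rounds_to_tolerance` and the tolerance are invented for this sketch; real trees are imperfect, so actual round counts differ.

```python
import math

def rounds_to_tolerance(eta, tol=0.01):
    """Rounds needed until the remaining residual fraction drops to `tol`,
    assuming each tree fits the current residual exactly (idealized)."""
    residual, rounds = 1.0, 0
    while residual > tol:
        residual *= (1 - eta)   # shrinkage keeps (1 - eta) of the residual
        rounds += 1
    return rounds

slow = rounds_to_tolerance(0.03)
fast = rounds_to_tolerance(0.3)
print("eta=0.03 rounds:", slow)
print("eta=0.3 rounds:", fast)
```

Cutting the learning rate by a factor of ten costs roughly ten times as many rounds, which is the practical reason small rates pair with large `n_estimators`.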

#### 'num_leaves': 2**11-1

#### 'min_data_in_leaf': 2**12-1

Forces the model to use more features. We need this to reduce the impact of "recursive" error. It also leads to overfitting, which is why we use a small `max_bin`.

#### 'max_bin': 100

#### l1, l2 regularizations
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c: Good tiny explanation. l2 can work with a bigger num_leaves, but my CV doesn't show a boost.
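
For intuition on what the two penalties do inside a tree, here is a sketch of a regularized leaf value: l1 soft-thresholds the summed gradient and l2 inflates the denominator. As far as I know both XGBoost and LightGBM use this form, but treat the exact formula, the function name `leaf_weight`, and the numbers as assumptions of this sketch.

```python
def leaf_weight(grad_sum, hess_sum, l1=0.0, l2=0.0):
    """Leaf value with L1 soft-thresholding and L2 shrinkage."""
    # Soft-threshold the gradient sum towards zero by l1.
    if grad_sum > l1:
        g = grad_sum - l1
    elif grad_sum < -l1:
        g = grad_sum + l1
    else:
        g = 0.0
    return -g / (hess_sum + l2)

print(leaf_weight(-10.0, 5.0))            # no regularization
print(leaf_weight(-10.0, 5.0, l2=5.0))    # l2 shrinks the value
print(leaf_weight(-10.0, 5.0, l1=2.0))    # l1 subtracts from the gradient
print(leaf_weight(-1.0, 5.0, l1=2.0))     # small gradients are zeroed out
```

l2 scales every leaf down smoothly, while l1 can switch weak leaves off entirely.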
                    
#### 'n_estimators': 1400
CV shows that there should be different values for each state/store. The current value was chosen for general purposes. As we don't use any early stopping, be careful not to overfit the Public LB.

#### 'feature_fraction': 0.5
LightGBM will randomly select a subset of features on each iteration (tree). We have maaaany features; many of them are "duplicates" and many are just "noise".
Good values here are 0.5-0.7 (by CV).
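
The column sampling can be pictured as below: each tree gets its own random half of the features. This is only an illustration of the idea, not LightGBM's implementation, and the feature names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(10)]   # hypothetical feature names
feature_fraction = 0.5
n_keep = max(1, int(len(feature_names) * feature_fraction))

subsets = []
for tree in range(3):
    # Pick n_keep distinct feature indices for this tree.
    chosen = sorted(rng.choice(len(feature_names), size=n_keep, replace=False))
    subsets.append([feature_names[i] for i in chosen])
    print(tree, subsets[-1])
```

When two features are near-duplicates, sampling gives each of them a chance to be used, instead of one always winning the split.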

#### 'boost_from_average': False
There is some "problem" with coding boost_from_average for a custom loss. 'True' makes training faster, BUT use it carefully: https://github.com/microsoft/LightGBM/issues/1514. Not our case, but the cons are good to know.

In [None]:
# Model parameters
# lgb_params = {
#                     'boosting_type': 'gbdt',
#                     'objective': 'tweedie',
#                     'tweedie_variance_power': 1.1,
#                     'metric': 'rmse',
#                     'subsample': 0.5,
#                     'subsample_freq': 1,
#                     'learning_rate': 0.03,
#                     'num_leaves': 2**11-1,
#                     'min_data_in_leaf': 2**12-1,
#                     'feature_fraction': 0.5,
#                     'max_bin': 100,
#                     'n_estimators': 1400,
#                     'boost_from_average': False,
#                     'verbose': -1,
#                 }

xgb_params = {
#     'objective':'reg:tweedie',
#     'tweedie_variance_power': 1.1,
    'objective': 'reg:squarederror',
    'eval_metric':'rmse',
    'subsample': 0.5,
    'max_depth':7, 
    'max_bin': 100,
    'eta':0.03}

#### Helper Functions

In [None]:
# set random seed (to make all processes deterministic)
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    
# Multiprocess Runs
def df_parallelize_run(func, t_split):
    num_cores = np.min([N_CORES,len(t_split)])
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, t_split), axis=1)
    pool.close()
    pool.join()
    return df

# Helper to load data by store ID
def get_data_by_store(store):
    
    # Read and concat basic features
    df = pd.concat([pd.read_pickle(BASE),
                    pd.read_pickle(PRICE).iloc[:,2:],
                    pd.read_pickle(CALENDAR).iloc[:,2:]],
                    axis=1)
    
    # Leave only relevant store
    df = df[df['store_id']==store]

    # As our Features Grids are aligned, we can use index to keep only necessary rows.
    df2 = pd.read_pickle(MEAN_ENC)[mean_features]
    df2 = df2[df2.index.isin(df.index)]
    
    df3 = pd.read_pickle(LAGS).iloc[:,3:]
    df3 = df3[df3.index.isin(df.index)]
    
    df = pd.concat([df, df2], axis=1)
    del df2 # to not reach memory limit 
    
    df = pd.concat([df, df3], axis=1)
    del df3 # to not reach memory limit 
    
    # Create features list
    features = [col for col in list(df) if col not in remove_features]
    df = df[['id','d',TARGET]+features]
    
    # Skipping first n rows
    df = df[df['d']>=START_TRAIN].reset_index(drop=True)
    
    return df, features

# Helper to make dynamic rolling lags
def make_lag(LAG_DAY):
    lag_df = base_test[['id','d',TARGET]].copy()  # copy so the new column is not written to a slice of base_test
    col_name = 'sales_lag_'+str(LAG_DAY)
    lag_df[col_name] = lag_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(LAG_DAY)).astype(np.float16)
    return lag_df[[col_name]]

def make_lag_roll(LAG_DAY):
    shift_day = LAG_DAY[0]
    roll_wind = LAG_DAY[1]
    lag_df = base_test[['id','d',TARGET]].copy()  # copy so the new column is not written to a slice of base_test
    col_name = 'rolling_mean_tmp_'+str(shift_day)+'_'+str(roll_wind)
    lag_df[col_name] = lag_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(shift_day).rolling(roll_wind).mean())
    return lag_df[[col_name]]

#### Global Parameters

In [None]:
# Global
VER = 1                          # Our model version
SEED = 42                        # We want all things
seed_everything(SEED)            # to be as deterministic
# lgb_params['seed'] = SEED
xgb_params['seed'] = SEED        # as possible
N_CORES = psutil.cpu_count()     # Available CPU cores

# LIMITS and const
TARGET      = 'sales'            # Our target
START_TRAIN = 0                  # We can skip some rows (Nans/faster training)
END_TRAIN   = 1913               # End day of our train set
P_HORIZON   = 28                 # Prediction horizon
USE_AUX     = True               # Use pretrained models or not

# FEATURES to remove. These features lead to overfit
# or values not present in test set
remove_features = ['id','state_id','store_id',
                   'date','wm_yr_wk','d',TARGET]
mean_features   = ['enc_cat_id_mean','enc_cat_id_std',
                   'enc_dept_id_mean','enc_dept_id_std',
                   'enc_item_id_mean','enc_item_id_std'] 

#PATHS for Features
ORIGINAL = '../input/m5-forecasting-accuracy/'
BASE     = '../input/m5-simple-fe/grid_part_1.pkl'
PRICE    = '../input/m5-simple-fe/grid_part_2.pkl'
CALENDAR = '../input/m5-simple-fe/grid_part_3.pkl'
LAGS     = '../input/m5-lags-features/lags_df_28.pkl'
MEAN_ENC = '../input/m5-custom-features/mean_encoding_df.pkl'

# AUX(pretrained) Models paths
# AUX_MODELS = '../input/m5-aux-models/'
AUX_MODELS = '../input/m5-xgb-aux-models/results/'

#STORES ids
STORES_IDS = pd.read_csv(ORIGINAL+'sales_train_validation.csv')['store_id']
STORES_IDS = list(STORES_IDS.unique())

#CATEGORY columns
cat_cols = ['item_id', 'dept_id', 'cat_id', 'event_name_1', 'event_type_1', 'event_name_2',
       'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI']

#SPLITS for lags creation
SHIFT_DAY  = 28
N_LAGS     = 15
LAGS_SPLIT = [col for col in range(SHIFT_DAY,SHIFT_DAY+N_LAGS)]
ROLS_SPLIT = []
for i in [1,7,14]:
    for j in [7,14,30,60]:
        ROLS_SPLIT.append([i,j])

### Aux Models
If you don't want to wait hours and hours for a result, you can train each store in a separate kernel and then just join the results.

If we want to use pretrained models, we can skip training (in our case we do dummy training to show that memory use is fine and that you can safely use all of this kernel's code).
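
The train-separately-then-join idea can be sketched as follows. A plain dict stands in for a trained estimator, and the toy data, the temporary directory, and the `model_<store>.bin` file names are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy data for two stores (values are made up).
toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "TX_1", "TX_1"],
    "sales": [2.0, 4.0, 10.0, 30.0],
})
model_dir = tempfile.mkdtemp()

# Step 1 (could run in separate kernels): fit and save one "model" per store.
for store_id, part in toy.groupby("store_id"):
    model = {"store_id": store_id, "mean_sales": float(part["sales"].mean())}
    with open(os.path.join(model_dir, f"model_{store_id}.bin"), "wb") as f:
        pickle.dump(model, f)

# Step 2: load every saved model and join the per-store results.
parts = []
for store_id in sorted(toy["store_id"].unique()):
    with open(os.path.join(model_dir, f"model_{store_id}.bin"), "rb") as f:
        model = pickle.load(f)
    parts.append(pd.DataFrame({"store_id": [store_id], "pred": [model["mean_sales"]]}))

joined = pd.concat(parts, ignore_index=True)
print(joined)
```

Because each store's model only needs its own slice of the data, the stores are independent jobs and the join is a simple concatenation.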

In [None]:
# if USE_AUX:
#     lgb_params['n_estimators'] = 2
    
# Here are some training logs for comparison
#Train CA_1
#[100]	valid_0's rmse: 2.02289
#[200]	valid_0's rmse: 2.0017
#[300]	valid_0's rmse: 1.99239
#[400]	valid_0's rmse: 1.98471
#[500]	valid_0's rmse: 1.97923
#[600]	valid_0's rmse: 1.97284
#[700]	valid_0's rmse: 1.96763
#[800]	valid_0's rmse: 1.9624
#[900]	valid_0's rmse: 1.95673
#[1000]	valid_0's rmse: 1.95201
#[1100]	valid_0's rmse: 1.9476
#[1200]	valid_0's rmse: 1.9434
#[1300]	valid_0's rmse: 1.9392
#[1400]	valid_0's rmse: 1.93446

#Train CA_2
#[100]	valid_0's rmse: 1.88949
#[200]	valid_0's rmse: 1.84767
#[300]	valid_0's rmse: 1.83653
#[400]	valid_0's rmse: 1.82909
#[500]	valid_0's rmse: 1.82265
#[600]	valid_0's rmse: 1.81725
#[700]	valid_0's rmse: 1.81252
#[800]	valid_0's rmse: 1.80736
#[900]	valid_0's rmse: 1.80242
#[1000]	valid_0's rmse: 1.79821
#[1100]	valid_0's rmse: 1.794
#[1200]	valid_0's rmse: 1.78973
#[1300]	valid_0's rmse: 1.78552
#[1400]	valid_0's rmse: 1.78158

## Train Model

In [None]:
# # Train Models
# for store_id in ['WI_3']:
#     print('Train', store_id)
    
#     # Get grid for current store
#     grid_df, features_columns = get_data_by_store(store_id)
    
#     print('load train and valid sets')
#     # Masks for 
#     # Train (All data less than 1913)
#     # "Validation" (Last 28 days - not real validation set)
#     # Test (All data greater than 1913 day, with some gap for recursive features)
#     train_mask = grid_df['d']<=END_TRAIN
#     valid_mask = train_mask&(grid_df['d']>(END_TRAIN-P_HORIZON))
#     preds_mask = grid_df['d']>(END_TRAIN-100)
    
#     # Apply masks and save lgb dataset as bin
#     # to reduce memory spikes during dtype convertations
# #     train_data = lgb.Dataset(grid_df[train_mask][features_columns], 
# #                        label=grid_df[train_mask][TARGET])
#     temp_df = grid_df[train_mask][features_columns].copy()
#     for c in cat_cols:
#         temp_df[c] = temp_df[c].cat.codes
#     train_data = xgb.DMatrix(temp_df, 
#                        label=grid_df[train_mask][TARGET])
#     del temp_df
#     train_data.save_binary('train_data.bin')
#     # load train data
# #     train_data = lgb.Dataset('train_data.bin')
#     train_data = xgb.DMatrix('train_data.bin')
    
#     # prepare validation data
# #     valid_data = lgb.Dataset(grid_df[valid_mask][features_columns], 
# #                        label=grid_df[valid_mask][TARGET])
#     temp_df = grid_df[valid_mask][features_columns].copy()
#     for c in cat_cols:
#         temp_df[c] = temp_df[c].cat.codes
#     valid_data = xgb.DMatrix(temp_df, 
#                        label=grid_df[valid_mask][TARGET])
#     del temp_df
#     valid_data.save_binary('valid_data.bin')
#     # load valid data
#     valid_data = xgb.DMatrix('valid_data.bin')

#     # Saving part of the dataset for later predictions
#     # Removing features that we need to calculate recursively 
#     grid_df = grid_df[preds_mask].reset_index(drop=True)
#     keep_cols = [col for col in list(grid_df) if '_tmp_' not in col]
#     grid_df = grid_df[keep_cols]
#     grid_df.to_pickle('test_'+store_id+'.pkl')
#     del grid_df
    
#     # Launch seeder again to make training as deterministic as possible
#     seed_everything(SEED)
    
#     print('train estimator')
#     # train estimator
# #     estimator = lgb.train(lgb_params,
# #                           train_data,
# #                           valid_sets = [valid_data],
# #                           verbose_eval = 100,
# #                           )
#     estimator = xgb.train(xgb_params, 
#                           train_data, 
#                           evals=[(valid_data,'eval')],
#                           num_boost_round=100,
#                           early_stopping_rounds=10)
    
#     # Save model - it's not real '.bin' but a pickle file
# #     model_name = 'lgb_model_'+store_id+'_v'+str(VER)+'.bin'
#     model_name = 'xgb_model_'+store_id+'_v'+str(VER)+'.bin'
#     pickle.dump(estimator, open(model_name, 'wb'))

#     # Remove temporary files and objects 
#     # to free some hdd space and ram memory
#     !rm train_data.bin
# #     del train_data, valid_data, estimator
#     del train_data, estimator
#     gc.collect()
    
#     # "Keep" models features for predictions
#     MODEL_FEATURES = features_columns

## Model Predictions

In [None]:
grid_df, features_columns = get_data_by_store('CA_1')
del grid_df
MODEL_FEATURES = features_columns

In [None]:
# Recombine Test set after training
def get_base_test():
    base_test = pd.DataFrame()

    for store_id in STORES_IDS:
        temp_df = pd.read_pickle(AUX_MODELS + 'test_'+store_id+'.pkl')
        temp_df['store_id'] = store_id
        base_test = pd.concat([base_test, temp_df]).reset_index(drop=True)
    
    return base_test

In [None]:
# Create Dummy DataFrame to store predictions
all_preds = pd.DataFrame()

# Join back the Test dataset with a small part of the training data to make recursive features
base_test = get_base_test()

# Timer to measure predictions time 
main_time = time.time()

# Loop over each prediction day. As rolling lags are the most time-consuming part, we calculate them once for the whole day
for PREDICT_DAY in range(1,29):    
    print('Predict | Day:', PREDICT_DAY)
    start_time = time.time()

    # Make temporary grid to calculate rolling lags
    grid_df = base_test.copy()
    grid_df = pd.concat([grid_df, df_parallelize_run(make_lag_roll, ROLS_SPLIT)], axis=1)
        
    for store_id in STORES_IDS:
        # Read all our models and make predictions for each day/store pairs
#         model_path = 'lgb_model_'+store_id+'_v'+str(VER)+'.bin' 
        model_path = 'xgb_model_'+store_id+'_v'+str(VER)+'.bin' 
        if USE_AUX:
            model_path = AUX_MODELS + model_path
        # load estimator
        with open(model_path, 'rb') as f:
            estimator = pickle.load(f)
        
        # apply masks per days and stores
        day_mask = base_test['d']==(END_TRAIN+PREDICT_DAY)
        store_mask = base_test['store_id']==store_id
        mask = (day_mask)&(store_mask)
#         base_test[TARGET][mask] = estimator.predict(grid_df[mask][MODEL_FEATURES])
        temp_df = grid_df[mask][MODEL_FEATURES].copy()
        for c in cat_cols:
            temp_df[c] = temp_df[c].cat.codes
        test_data = xgb.DMatrix(temp_df)
        del temp_df
        test_data.save_binary('test_data.bin')
        test_data = xgb.DMatrix('test_data.bin')
        # Use .loc: chained assignment may write to a temporary copy instead of base_test
        base_test.loc[mask, TARGET] = estimator.predict(test_data)
    
    # Make good column naming and add to all_preds DataFrame
    temp_df = base_test[day_mask][['id',TARGET]]
    temp_df.columns = ['id','F'+str(PREDICT_DAY)]
    if 'id' in list(all_preds):
        all_preds = all_preds.merge(temp_df, on=['id'], how='left')
    else:
        all_preds = temp_df.copy()
        
    print('#'*10, ' %0.2f min round |' % ((time.time() - start_time) / 60),
                  ' %0.2f min total |' % ((time.time() - main_time) / 60),
                  ' %0.2f day sales |' % (temp_df['F'+str(PREDICT_DAY)].sum()))
    del temp_df

# clean up dataframe
all_preds = all_preds.reset_index(drop=True)
all_preds

## Create Submission File (Export predictions)

In [None]:
# Reading competition sample submission and merging our predictions.
# As we have predictions only for "_validation" data, we need to do fillna() for "_evaluation" items
submission = pd.read_csv(ORIGINAL+'sample_submission.csv')[['id']]
submission = submission.merge(all_preds, on=['id'], how='left').fillna(0)
submission.to_csv('submission_v'+str(VER)+'.csv', index=False)