In [1]:
import sys
import os.path
import json
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 
from itertools import product
import gc
import re
from catboost import CatBoostRegressor, Pool
from utils import load_feature_set, clipped_rmse, HoldOut
import pickle

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    if not os.path.isfile('SETTINGS.json'):
        # Fall back to a hard-coded data directory in Drive if SETTINGS.json is not present
        config = {'DATA_DIR': '/content/gdrive/My Drive/kaggle-c1'}
        with open('SETTINGS.json', 'w') as outfile:
            json.dump(config, outfile)

with open('SETTINGS.json') as config_file:
    config = json.load(config_file)

DATA_DIR = config['DATA_DIR']

print('Using DATA_DIR ', DATA_DIR)

DATA_FOLDER = DATA_DIR

Using DATA_DIR  c:\repos\c1-final-test\datadir


# Feature selection

This notebook runs the sequential forward selection wrapper method on each of the five 
pre-generated feature sets (basic, text, within, allfeat, basicv2). Features are selected 
to maximise the performance of a CatBoost regressor with fixed parameters in a validation 
experiment, measured by the clipped RMSE metric. 
The code writes the full feature selection paths to a file on disk. Each path contains the 
selected feature subset for every subset size from 1 up to 25.
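
To make the idea of a selection path concrete, here is a minimal sketch of greedy sequential forward selection in plain Python. It is not the mlxtend implementation used in this notebook. The names `forward_select` and `toy_score` are hypothetical, and the toy scoring function merely stands in for "train a model on this subset and return its validation score".

```python
def forward_select(n_features, k, score_fn):
    """Greedy forward selection (illustrative sketch only).

    score_fn(subset) takes a sorted tuple of feature indices and returns
    a score where higher is better. Returns {size: (subset, score)}.
    """
    selected = []
    remaining = list(range(n_features))
    path = {}
    while len(selected) < k and remaining:
        best_feat, best_subset, best_score = None, None, None
        for feat in remaining:
            candidate = tuple(sorted(selected + [feat]))
            score = score_fn(candidate)
            # Keep the first candidate that strictly improves on the best so far
            if best_score is None or score > best_score:
                best_feat, best_subset, best_score = feat, candidate, score
        selected.append(best_feat)
        remaining.remove(best_feat)
        path[len(selected)] = (best_subset, best_score)
    return path


# Toy score (assumption): features 2 and 0 are useful, every feature costs a little
USEFUL = {2: 3.0, 0: 1.0}

def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.1 * len(subset)


path = forward_select(n_features=4, k=3, score_fn=toy_score)
for size, (subset, score) in path.items():
    print(size, subset, round(score, 3))
```

Each step adds the single feature that improves the score most given the features already chosen, so the subset of size n always contains the subset of size n-1.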



## Choosing the validation setup

Choosing a good validation strategy is difficult. There may not be much that can be done, because the validation data necessarily has a different distribution from the actual test data. This is due to the temporal nature of the prediction problem: the distributions drift slowly over the course of time. It would therefore be good to have the validation period temporally close to the test period. On the other hand, data analysis shows strong seasonal (yearly) effects. For example, if we chose the last training month (Oct 2015) as the validation set, predicting October sales from the previous months would be a very different problem from predicting December sales, as sales figures seem to peak strongly in December and may have other special characteristics.

Here we have chosen another alternative: predicting the sales of the last December in the training data (2014) from the preceding data. In return for a better match with the yearly cycle, we sacrifice a good deal of temporal closeness between the validation and test periods, and we also end up with less validation data.
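
A single hold-out split of this kind can be handed to scikit-learn-style tools as an object exposing `split` and `get_n_splits`. The class below is a hypothetical stand-in written for illustration. The actual `HoldOut` imported from `utils` is not shown in this notebook and may differ.

```python
import numpy as np


class HoldOutSketch:
    """Single train/validation split with a cross-validator-like interface.

    Illustrative only; not the `HoldOut` class from `utils`.
    """

    def __init__(self, train_indices, test_indices):
        self.train_indices = np.asarray(train_indices)
        self.test_indices = np.asarray(test_indices)

    def split(self, X=None, y=None, groups=None):
        # Yield exactly one (train, test) pair of row indices
        yield self.train_indices, self.test_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1


# Toy sizes (assumption): 5 training rows followed by 3 validation rows
n_train, n_val = 5, 3
cv = HoldOutSketch(np.arange(n_train), np.arange(n_val) + n_train)
splits = list(cv.split())
print(splits[0][0], splits[0][1])
```

The validation indices are offset by the number of training rows because the two blocks are concatenated into one frame before selection.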

We are fully aware of the shortcomings of this validation scheme. Without a good cross-validation scheme in place, the 
selection algorithm (and similarly the hyperparameter search in notebook 5) is likely to overfit severely to the data.
Rigorous cross-validation would be difficult to set up and would necessarily be a compromise between bad options, 
for the same reasons that validation as a whole is challenging: the amount of data used must be balanced against temporal proximity, and the yearly cycles are strong.

Therefore, we shamelessly also advocate validating on the public leaderboard in this case, provided there are only a small number of choices to evaluate. This relies on information leaking from the test set via the leaderboard, but in a competition anything goes as long as the rules are not violated.



## Feature selection code

In [2]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Start from existing results if they exist; otherwise begin with an empty dict
search_path_file = os.path.join(DATA_FOLDER, 'search_path_sfs.pickle')
if os.path.isfile(search_path_file):
    with open(search_path_file, 'rb') as fp:
        search_path_sfs = pickle.load(fp)
else:
    search_path_sfs = {}

target_featcount = 25

for id in ['allfeat','basic','basicv2','text','within']:
    if id in search_path_sfs:
        print('skipping already searched feature set {}'.format(id))
        continue
        
    print('selecting features for feature set {}'.format(id))
    X_train, y_train, X_trainval, y_trainval, X_val, y_val, X_test, submissionidx2testidx = load_feature_set(id, data_folder=DATA_FOLDER)

    X_paramsearch =  pd.concat([X_train, X_val],ignore_index=True)
    y_paramsearch = pd.concat([y_train, y_val],ignore_index=True)
    train_indices = np.arange(X_train.shape[0])
    val_indices = np.arange(X_val.shape[0]) + X_train.shape[0]

    # The 'item_price' features were generated incorrectly, so drop those columns if present
    
    to_drop_cols=[col for col in X_paramsearch.columns.values if re.search('item_price',col) ]
    
    X_paramsearch=X_paramsearch.drop(to_drop_cols,axis=1)
    ncol=X_paramsearch.shape[1]
    
    regparams={'learning_rate': 0.5, 'iterations': 30, 'depth': 12, 'l2_leaf_reg': 0.3, 'task_type': 'GPU', 'metric_period':30}
    model = CatBoostRegressor(**regparams)
    cv = HoldOut(train_indices=train_indices, test_indices=val_indices)

    sfs1 = SFS(model, 
           k_features=min(target_featcount,ncol-1), 
           forward=True, 
           floating=False, 
           scoring='neg_mean_squared_error',
           verbose=3,
           cv=cv)

    sfs1 = sfs1.fit(X_paramsearch, y_paramsearch)
    print('Feature selection path for feature subset {}:'.format(id))
    print(sfs1.subsets_)

    search_path_sfs[id]=sfs1.subsets_

    with open(os.path.join(DATA_FOLDER,'search_path_sfs.pickle'), 'wb') as fp:    
        pickle.dump (search_path_sfs, fp)

    del model
    del X_train
    del y_train
    del X_trainval
    del y_trainval 
    del X_val
    del y_val
    del X_test
    del X_paramsearch
    del y_paramsearch
    gc.collect()

skipping already searched feature set allfeat
skipping already searched feature set basic
skipping already searched feature set basicv2
skipping already searched feature set text
skipping already searched feature set within
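
The selector above is scored with scikit-learn's built-in `neg_mean_squared_error`. For reference, a clipped RMSE of the kind named in the introduction could be sketched as below. The function name `clipped_rmse_sketch` and the clipping bounds of 0 and 20 are assumptions made for illustration. The `clipped_rmse` helper imported from `utils` is not shown here and may be defined differently.

```python
import numpy as np


def clipped_rmse_sketch(y_true, y_pred, lower=0.0, upper=20.0):
    """RMSE after clipping both targets and predictions to [lower, upper].

    The bounds are assumed values, not taken from this notebook.
    """
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Out-of-range values are pulled back to the bounds before the error is computed
perfect_after_clip = clipped_rmse_sketch([0, 5, 30], [-1, 5, 25])
plain_case = clipped_rmse_sketch([0, 0], [3, 4])
print(perfect_after_clip, plain_case)
```

If one wanted to select on this metric directly, such a function could be wrapped with `sklearn.metrics.make_scorer(clipped_rmse_sketch, greater_is_better=False)` and passed as the `scoring` argument.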


As a check, print out the feature selection paths that were found.

In [3]:
with open(os.path.join(DATA_FOLDER,'search_path_sfs.pickle'), 'rb') as fp:
    search_path_sfs = pickle.load(fp)

search_path_sfs

{'allfeat': {1: {'feature_idx': (38,),
   'cv_scores': array([-2.30684221]),
   'avg_score': -2.306842212802926,
   'feature_names': ('target_lag_2',)},
  2: {'feature_idx': (38, 39),
   'cv_scores': array([-2.19293825]),
   'avg_score': -2.1929382460204048,
   'feature_names': ('target_lag_2', 'item_category_id')},
  3: {'feature_idx': (38, 39, 46),
   'cv_scores': array([-2.11309329]),
   'avg_score': -2.113093291795761,
   'feature_names': ('target_lag_2',
    'item_category_id',
    'target_category_tfidf_unigram_256_lag_5')},
  4: {'feature_idx': (38, 39, 46, 47),
   'cv_scores': array([-2.02533126]),
   'avg_score': -2.025331256778903,
   'feature_names': ('target_lag_2',
    'item_category_id',
    'target_category_tfidf_unigram_256_lag_5',
    'target_shop_lag_12')},
  5: {'feature_idx': (37, 38, 39, 46, 47),
   'cv_scores': array([-1.98108481]),
   'avg_score': -1.981084814344693,
   'feature_names': ('target_item_lag_4',
    'target_lag_2',
    'item_category_id',
    'target

The results reveal interesting things about feature importance, and some of them match prior expectations well. For instance, whenever it is available, the feature with the most predictive power seems to be the sales value in the last preceding training month, i.e. 'target_lag_2'. 

It is nice to see that both the features based on textual categories and the shop-wise features get selected when all the features are available. Generating them was not wasted work.
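
The order in which features enter a path can be read off the stored dictionary. The sketch below uses a hand-copied excerpt of the first three `allfeat` entries printed above (with `cv_scores` omitted). The helper names `selection_order` and `best_size` are hypothetical.

```python
# Excerpt copied from the printed 'allfeat' path above (sizes 1 to 3 only)
allfeat_excerpt = {
    1: {'feature_idx': (38,),
        'avg_score': -2.306842212802926,
        'feature_names': ('target_lag_2',)},
    2: {'feature_idx': (38, 39),
        'avg_score': -2.1929382460204048,
        'feature_names': ('target_lag_2', 'item_category_id')},
    3: {'feature_idx': (38, 39, 46),
        'avg_score': -2.113093291795761,
        'feature_names': ('target_lag_2', 'item_category_id',
                          'target_category_tfidf_unigram_256_lag_5')},
}


def selection_order(path):
    """Return feature names in the order they first appear along the path."""
    order = []
    for size in sorted(path):
        for name in path[size]['feature_names']:
            if name not in order:
                order.append(name)
    return order


def best_size(path):
    """Return the subset size with the highest avg_score (scores are negated errors)."""
    return max(path, key=lambda size: path[size]['avg_score'])


order = selection_order(allfeat_excerpt)
best = best_size(allfeat_excerpt)
print(order)
print(best)
```

Because the stored scores are negated errors, the largest `avg_score` corresponds to the smallest validation error.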
